Created 09-21-2018 08:17 PM
I have a strong feeling Hortonworks is purposely not answering this one, since all my other questions got answered, so I'm hoping one brave person will step forward and tell me the truth. After all, that's what this forum is for.
My question that is still not answered: is Sqoop incremental load into a Hive ORC table supported, and has anyone actually done it?
I am sure many people will benefit from the answer.
Created 09-21-2018 09:54 PM
1) NiFi is definitely an option.
If CDC is important for you, be aware that only the MySQL CDC processor is supported. Unfortunately, CDC processors for other databases are not available due to licensing issues with vendors like Oracle. So, if you use NiFi, you need to write your queries smartly to catch the changes and limit the impact on the source databases.
2) Another good option is Attunity, but that comes at a higher cost.
3) I have seen others using Spark for your use case.
4) I am doing some guesswork here, because I don't have the details from your 4 earlier questions. As I recall, incremental data import is supported via a saved sqoop job rather than directly via sqoop import (see the rough sketch at the end of this reply). I see that you did run it as a job, but could there be a typo in your syntax? I see a space between the "dash-dash" and "import" 🙂 Joking, but you may want to check; I have seen strange messages come out of Sqoop.
Could you point to the URLs of the previous questions, or clarify/add more info to the current question?
5) Regarding "Also I use --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' for non-incremental loads and I was told that soon this would be working, still not?": I need a little more context, or maybe this was already another question you submitted and I can check.
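For reference, the saved-job form I have in mind looks roughly like this. This is only a sketch; the connection string, job, table, and column names are placeholders, not taken from your setup:

  # create a saved incremental job; the bare -- before import is part of the saved-job syntax,
  # and the Sqoop metastore remembers --last-value between runs
  sqoop job --create orders_incr_job -- import \
    --connect jdbc:mysql://dbhost/sales \
    --username etl_user --password-file /user/etl/.dbpass \
    --table ORDERS \
    --incremental append \
    --check-column ID \
    --last-value 0 \
    --target-dir /staging/orders_incr \
    --as-textfile

  # run it; each execution picks up only rows with ID above the stored last value
  sqoop job --exec orders_incr_job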
Created 09-21-2018 09:09 PM
I doubt this is a conspiracy 🙂. I found one instance of those 4 previous attempts: https://community.hortonworks.com/questions/131541/is-sqoop-incremental-load-possible-for-hive-orc-t...
It would have been good if you could have referenced the URLs of the previous 4 attempts so we could get some historical context. It is not clear which version of Hive you use (1.2.1 or 2.1.0), or whether you created the target Hive table as transactional. Anyway, long story short, the following is the practice that Hortonworks recommends on HDP 2.6.0, assuming that is your HDP version since your question was tagged with it: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_data-access/content/incrementally-updati...
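The gist of that document, roughly, is a merge-by-latest-record pattern along these lines. This is only a sketch with placeholder table and column names; the document itself has the full DDL plus the purge and compaction steps:

  # combine base and incremental data, keep only the latest version of each row,
  # then rebuild the reporting table from that view
  hive -e "
  CREATE VIEW reconcile_view AS
  SELECT t1.*
  FROM (SELECT * FROM base_table
        UNION ALL
        SELECT * FROM incremental_table) t1
  JOIN (SELECT id, MAX(modified_date) AS max_modified
        FROM (SELECT * FROM base_table
              UNION ALL
              SELECT * FROM incremental_table) t2
        GROUP BY id) s
    ON t1.id = s.id AND t1.modified_date = s.max_modified;

  INSERT OVERWRITE TABLE reporting_table SELECT * FROM reconcile_view;
  "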
Created 09-21-2018 09:25 PM
It must be my lucky day 🙂
Stanca, but this will introduce a lot of delay in the data. We wanted near-real-time data; is it not possible using Sqoop?
What about NiFi?
Also, I use --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' for non-incremental loads, and I was told that soon this would be working for incremental loads too. Still not?
Created 09-22-2018 04:36 AM
I read the Data Access document you linked, and I think it is for cases where you don't have a date check column, since it does many operations like merging tables, purging, compacting, deleting, etc.
Why would I do all this when I can just:
1- import the whole base table as ORC
2- bring in the incrementals as text into an externally mapped table
3- insert into the base ORC table, selecting everything from the incremental table
4- delete all the files in the external table's folder
I tested this method (a rough sketch of the commands is below) and it's working fine.
Is there any flaw in this method that I am not seeing?
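In terms of commands, the approach looks roughly like this. This is just a sketch with placeholder connection details, table, and column names, not my exact commands:

  # 1- one-time full import of the base table straight into an ORC table (SNAPPY compressed)
  sqoop import \
    --connect jdbc:mysql://dbhost/sales --username etl_user -P \
    --table ORDERS \
    --hcatalog-database default --hcatalog-table orders_base \
    --create-hcatalog-table \
    --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")'

  # 2- incremental pull as plain text into the folder behind the external staging table
  sqoop import \
    --connect jdbc:mysql://dbhost/sales --username etl_user -P \
    --table ORDERS \
    --incremental append --check-column ID --last-value 1000000 \
    --target-dir /staging/orders_incr --as-textfile

  # 3- append the staged rows into the ORC base table
  #    (orders_incr_ext is an external text table created over /staging/orders_incr)
  hive -e "INSERT INTO TABLE orders_base SELECT * FROM orders_incr_ext;"

  # 4- clear the staging folder for the next run
  hdfs dfs -rm -skipTrash '/staging/orders_incr/*'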
Created 09-23-2018 02:49 AM
I just noticed there is a small note at the top saying:
"
Note | |
---|---|
This procedure requires change data capture from the operational database that has a primary key and modified date field where you pulled the records from since the last update. |
We don't have CDC on our database, so we can't do incremental imports? It should still be possible by looking at the date field, since that's constantly increasing?
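Something along these lines is what I have in mind; just a sketch, and the names are placeholders:

  # incremental pull keyed on the modified-date column, with no separate CDC tooling;
  # --merge-key tells Sqoop how to fold re-imported (updated) rows into the existing files
  sqoop import \
    --connect jdbc:mysql://dbhost/sales --username etl_user -P \
    --table ORDERS \
    --incremental lastmodified \
    --check-column MODIFIED_DATE \
    --last-value '2018-09-22 00:00:00' \
    --merge-key ID \
    --target-dir /staging/orders_delta --as-textfile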
Created 09-22-2018 01:18 AM
Here is one of the posts showing that Hive incremental import into ORC is possible using Sqoop, so why is it not working for me? I was using the correct syntax, with no space between the dash-dash and import.
Kindly check my other post; I am trying to follow the link you posted earlier but I am getting errors:
https://community.hortonworks.com/questions/223262/malformed-orc-file-format.html
Created on 09-22-2018 01:26 AM - edited 08-17-2019 10:21 PM
Please see the syntax in the attached screenshot. It's not complaining about the dash-dash, but it does not like the append mode with HCatalog.