
Sqoop import & Hive ORC

Solved

Explorer

All,

I have a question about Sqoop. I am importing around 2 TB of data for one table and then need to write an ORC table with that data. What is the best way to achieve this?

1) Sqoop all the data into dir1 as text and write HQL to load it into an ORC table. (This script fails with a vertex issue.)

2) Sqoop the data in chunks, then process and append each chunk into the Hive table. (Has anyone done this?)

3) Use a Sqoop Hive import to write all the data directly to a Hive ORC table.

Which is the best way?
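
For reference, option 3 is usually done through Sqoop's HCatalog integration, which can write ORC directly instead of landing text files first. A rough sketch only; the JDBC URL, credentials, table, and split column below are placeholders for whatever your actual source is:

# Hypothetical connection details -- adjust for your source RDBMS.
# --create-hcatalog-table lets Sqoop create the ORC-backed Hive table on the first run.
sqoop import \
  --connect "jdbc:oracle:thin:@//dbhost:1521/ORCL" \
  --username myuser -P \
  --table SOURCE_TABLE \
  --hcatalog-database default \
  --hcatalog-table source_table_orc \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  --split-by ID \
  --num-mappers 16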

1 ACCEPTED SOLUTION


Re: Sqoop import & Hive ORC

If the table has a primary key through which you can identify unique records, use that key to pull the data in chunks and load them into Hive. Sqoop works well with bulk imports, but when the data is very large it is not recommended to import it in one shot; it also depends on your source RDBMS. I have run into the same issue: I was able to import a 20 TB table from Teradata into Hive without any problem, but once the table grew to 30 TB I could not import it in a single stretch. In such cases I import in multiple chunks, use the primary key as the split-by column, and increase the number of mappers. The same approach should hold good for your scenario.
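
As a minimal sketch of the chunked approach (the Teradata connection string, table, key column, and the ORDER_ID range here are placeholders; run one Sqoop job per key range so that later runs append into the same Hive ORC table):

# One Sqoop run per key range. Add --create-hcatalog-table on the first
# chunk only, so the ORC table exists before the remaining chunks append.
sqoop import \
  --connect "jdbc:teradata://tdhost/DATABASE=sales" \
  --username myuser -P \
  --table ORDERS \
  --where "ORDER_ID >= 1 AND ORDER_ID < 500000000" \
  --hcatalog-database default \
  --hcatalog-table orders_orc \
  --hcatalog-storage-stanza "stored as orcfile" \
  --split-by ORDER_ID \
  --num-mappers 32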


