Created 04-24-2017 10:00 PM
All,
I have a question about Sqoop. I am sqooping around 2 TB of data for one table and then need to write an ORC table with that. What's the best way to achieve this?
1) Sqoop all the data into dir1 as text and write HQL to load it into an ORC table (my script fails with a vertex issue)
2) Sqoop the data in chunks, then process and append each chunk into the Hive table (have you done this?)
3) Use a Sqoop Hive import to write all the data directly to the Hive ORC table
Which is the best way?
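For reference, this is roughly the command I have in mind for option 3. From what I've read, plain --hive-import doesn't write ORC directly, so I'd go through the HCatalog options. The connection string, table, column, and mapper count below are just placeholders for my setup:

sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username myuser -P \
  --table MY_BIG_TABLE \
  --split-by ID \
  --num-mappers 16 \
  --hcatalog-database default \
  --hcatalog-table my_big_table_orc \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile"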
Created 04-24-2017 10:38 PM
If the table has primary keys through which you can identify unique records, then make use of those keys to get chunks of data and load them into Hive. Sqoop always works well with bulk imports, but when the data is very large it's not recommended to import it in one shot. It also depends on your source RDBMS. I have encountered the same issue: I was able to import a 20 TB table from Teradata into Hive perfectly fine, but when the table size grew to 30 TB I was unable to import it in a single stretch. In such cases I go with multiple chunks, and/or import the table using the primary key as the split-by column and increase the number of mappers. That should hold good for your scenario as well.
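As a rough sketch, one chunked import driven by the primary key could look like this. The connection string, table, key column, key range, and mapper count are all placeholders you'd adjust for your own source:

# stage one key range of the table as text
sqoop import \
  --connect jdbc:teradata://tdhost/DATABASE=mydb \
  --username myuser -P \
  --table MY_BIG_TABLE \
  --where "ID >= 0 AND ID < 100000000" \
  --split-by ID \
  --num-mappers 32 \
  --target-dir /staging/my_big_table/chunk_000 \
  --as-textfile

Repeat with the next key range and a new target directory, then append each staged chunk into the ORC table with an INSERT INTO ... SELECT over an external table on the staging directory. If you use the HCatalog options instead of staging as text, each chunked run can append straight into the ORC table.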