Support Questions
Find answers, ask questions, and share your expertise

Sqoop directly to Hive advantages?
Is there any advantage in importing data from DB2 directly into Hive using Sqoop, compared to importing to HDFS first and then loading into a Hive table? Which is the recommended or preferred approach?
Master Guru
Sqoop's import to Hive is an extension of its import to HDFS (i.e., the Hive part is done after the regular HDFS import work), so if your formats are already acceptable and need no further transformation, you can do it directly as part of the Sqoop step.
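A minimal sketch of such a direct-to-Hive import; the connection string, credentials, and table names here are hypothetical placeholders:

```shell
# --hive-import appends the Hive load step after the normal HDFS import
sqoop import \
  --connect jdbc:db2://db2host:50000/SAMPLE \
  --username dbuser -P \
  --table ORDERS \
  --hive-import \
  --hive-table staging.orders \
  --num-mappers 4
```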

Sqoop also supports (via its HCatalog options) inserting into partitioned tables via dynamic partitioning, if you require that.
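A sketch of the HCatalog variant, again with hypothetical names; the target table must already exist in the metastore, and the partition column is filled dynamically from the source data:

```shell
# HCatalog-based import into an existing partitioned Hive table
sqoop import \
  --connect jdbc:db2://db2host:50000/SAMPLE \
  --username dbuser -P \
  --table ORDERS \
  --hcatalog-database staging \
  --hcatalog-table orders_partitioned
```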


Thanks for the response.

I want to store the data in Parquet format. Sqoop supports Parquet, but Hive does not recognize the Parquet files it produces. So in both options I have to sqoop the data in as text and then move it into another Hive table in Parquet format.


Option 1 - sqoop to HDFS in text format (the HDFS location is a Hive external table), then insert into a Hive table with partitioning and Parquet format.
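Option 1 might look like the following sketch (directory, table, and partition-column names are hypothetical, and the external table is assumed to already point at the target directory):

```shell
# 1) Import as plain text into the external table's HDFS location
sqoop import \
  --connect jdbc:db2://db2host:50000/SAMPLE \
  --username dbuser -P \
  --table ORDERS \
  --target-dir /data/staging/orders \
  --as-textfile

# 2) Convert into the partitioned Parquet table from Hive
hive -e "
  SET hive.exec.dynamic.partition.mode=nonstrict;
  INSERT INTO TABLE orders_parquet PARTITION (order_date)
  SELECT * FROM orders_ext;
"
```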


Option 2 - sqoop to a Hive internal table in text format, then insert into a Hive table with partitioning and Parquet format.
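Option 2 differs only in the first step, which goes through Sqoop's Hive load instead of a bare HDFS directory (names hypothetical); the text-to-Parquet INSERT afterwards is the same kind of statement as in option 1:

```shell
# Import straight into a Hive-managed staging table (text format),
# then run the same INSERT ... SELECT conversion into the Parquet table
sqoop import \
  --connect jdbc:db2://db2host:50000/SAMPLE \
  --username dbuser -P \
  --table ORDERS \
  --hive-import \
  --hive-table staging.orders_text
```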


Based on your response, since the Hive part is done after the import to HDFS, I think I can avoid that extra work (time) with option 1. Agree?


Master Guru
Could you elaborate on 'does not recognize it'? If it's just an options problem, but the files appear OK when you try loading them manually into Hive, then you can import to HDFS as Parquet and run a simple LOAD DATA INPATH statement to move them into your Hive table as-is.
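That approach might be sketched as follows, with hypothetical paths and table names; LOAD DATA INPATH moves the files into the table's location without rewriting them:

```shell
# Import as Parquet files into a staging directory
sqoop import \
  --connect jdbc:db2://db2host:50000/SAMPLE \
  --username dbuser -P \
  --table ORDERS \
  --target-dir /data/staging/orders_parquet \
  --as-parquetfile

# Move the files as-is into an (unpartitioned) Parquet-backed Hive table
hive -e "LOAD DATA INPATH '/data/staging/orders_parquet' INTO TABLE orders_parquet_stage;"
```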

If there's a deeper inconsistency that requires a transformation from text to get it right, then both options seem fine.