Sqoop import to HDFS and partitioning possibilities
Labels: Apache Hive, Apache Sqoop
Created 10-12-2016 06:57 PM
Upgraded to HDP-2.5.0.0 using Ambari 2.4.0.1
There are several SQL Server and Oracle database schemas that need to be imported to HDFS/Hive.
The current approach works fine:
- Sqoop import from the RDBMS to HDFS in Avro format
- Creation of a Hive external table atop the Avro files
- Copying the data from the Hive external table into a managed ORC table via 'CREATE TABLE ... AS SELECT * FROM ...'
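For reference, a rough sketch of what Steps 1 and 2 look like on my side (connection details, paths and table names below are placeholders, and I'm assuming the Sqoop-generated .avsc schema file has been copied to HDFS):

sqoop import \
  --connect 'jdbc:sqlserver://<host>:1433;databaseName=<db>' \
  --username <user> \
  --password-file /user/etl/.password \
  --table DIMORIGINAL \
  --as-avrodatafile \
  --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --target-dir /data/staging/dimoriginal \
  -m 4

CREATE EXTERNAL TABLE dimoriginal_avro_compressed
STORED AS AVRO
LOCATION '/data/staging/dimoriginal'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/dimoriginal.avsc');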
Many tables in the SQL Server and Oracle schemas are partitioned. As per the Sqoop documentation, at least for Oracle, it seems that the data on HDFS can be 'partitioned' based on the source table partitions; similar options don't seem to exist for SQL Server.
I have the following questions:
- Can Sqoop figure out the column(s) on which the source table is partitioned?
- Irrespective of the source DB, can the files resulting from Step 1 above be 'partitioned' (stored in different directories) on HDFS?
- Assuming that partitioning won't help in Step 1, would it make sense in Step 2? If yes, will the ORC table in Step 3 inherit the partitions?
- Assuming that partitioning is possible only in Step 3:
  - A repetition of question 1: can the table's partitioning column be determined automatically, and used automatically as well?
  - The Sqoop create-hive-table tool doesn't help with partitioning; also, that approach means hitting the source DB again, even if only for the metadata.
Created 10-12-2016 08:28 PM
I don't believe Sqoop import is able to automatically determine the partitions. You would typically pass in --hive-partition-key and --hive-partition-value arguments. Obviously that approach doesn't scale well when you have hundreds of partitions.
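To illustrate, you would need one invocation like the following per partition value (table, columns and the date value are made up; the partition column is excluded from --columns since it is supplied as the static partition value), which is exactly why this doesn't scale:

sqoop import \
  --connect 'jdbc:oracle:thin:@//<host>:1521/<service>' \
  --username <user> \
  --password-file /user/etl/.password \
  --table SALES \
  --columns 'ID,AMOUNT' \
  --where "SALE_DATE = DATE '2016-10-01'" \
  --hive-import \
  --hive-table sales \
  --hive-partition-key sale_date \
  --hive-partition-value '2016-10-01' \
  -m 4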
This approach may be useful: HCC Article, though it does mean hitting the DB multiple times.
Generally I would bulk pull the tables out of the database and store them as external text files on HDFS. Then I would load the data into ORC partitioned tables using dynamic partitions, pulling from the external tables. Of course this means I have to manage the schema myself. Your approach with Avro files is a common one.
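For what it's worth, the dynamic-partition load looks roughly like this (all table and column names are made up):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE sales_orc (id BIGINT, amount DECIMAL(10,2))
PARTITIONED BY (sale_date STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');

-- the partition column has to come last in the SELECT list
INSERT OVERWRITE TABLE sales_orc PARTITION (sale_date)
SELECT id, amount, sale_date FROM sales_ext;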
Created 10-13-2016 07:45 AM
I think your approach of 'load the data into ORC partitioned tables using dynamic partitions, pulling from the external tables' is good enough to achieve partitioning. I'm curious to try out whether a Hive managed, dynamically partitioned ORC table can be created directly from the external Avro-based table, so that one can specify only the partitioning key and not the whole set of columns:
Current Step 3:
CREATE TABLE dimoriginal_orc
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES ('orc.compress'='ZLIB')
AS SELECT * FROM dimoriginal_avro_compressed;
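If it turns out that CTAS can't take a PARTITIONED BY clause on this Hive version (which is my suspicion), the fallback would be an explicit CREATE plus a dynamic-partition insert, roughly like the following (the column list and partition column are made up):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE dimoriginal_orc (id BIGINT, name STRING)
PARTITIONED BY (part_date STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');

INSERT OVERWRITE TABLE dimoriginal_orc PARTITION (part_date)
SELECT id, name, part_date FROM dimoriginal_avro_compressed;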
Created 10-13-2016 07:05 PM
You may find Gobblin helpful: Gobblin. It comes from LinkedIn. Here is the Hive Avro-to-ORC converter: http://gobblin.readthedocs.io/en/latest/adaptors/Hive-Avro-To-ORC-Converter/