Sqoop import to HDFS and partitioning possibilities

Super Collaborator

Upgraded to HDP-2.5.0.0 using Ambari 2.4.0.1.

There are several SQL Server and Oracle database schemas that need to be imported to HDFS/Hive.

The current approach is working fine:

  1. Sqoop import from the RDBMS to HDFS in Avro format
  2. Creation of a Hive external table atop the Avro files
  3. Copying the data from the Hive external table into a managed ORC table via 'CREATE TABLE ... AS SELECT * FROM ...' (a sketch follows this list)
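
Concretely, with made-up connection details, paths, and table names, and assuming the .avsc schema file generated by Sqoop has been copied to HDFS, the three steps look roughly like this:

  # Step 1: Sqoop import from the RDBMS into HDFS as compressed Avro data files
  sqoop import \
    --connect 'jdbc:sqlserver://dbhost:1433;databaseName=sales' \
    --username etl_user \
    --password-file /user/etl/.dbpass \
    --table DimOriginal \
    --as-avrodatafile \
    --compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
    --target-dir /data/raw/dimoriginal \
    --num-mappers 4

  # Steps 2 and 3: external Avro table atop the files, then a managed ORC copy
  hive -e "
  CREATE EXTERNAL TABLE dimoriginal_avro_compressed
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION '/data/raw/dimoriginal'
  TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/DimOriginal.avsc');

  CREATE TABLE dimoriginal_orc
  STORED AS ORC
  TBLPROPERTIES ('orc.compress'='ZLIB')
  AS SELECT * FROM dimoriginal_avro_compressed;
  "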

Many tables in the SQL Server and Oracle schemas are partitioned. According to the Sqoop documentation, at least for Oracle, the data on HDFS can be 'partitioned' based on the source table's partitions; similar options don't seem to exist for SQL Server.
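
For Oracle I mean the options exposed by the Data Connector for Oracle and Hadoop (the --direct path). If I read the documentation correctly, something along these lines restricts the import to named source partitions and splits the mapper work by partition, though it does not by itself lay the files out in per-partition HDFS directories (connection details and partition names below are placeholders):

  # Import only the listed partitions of an Oracle table and split the
  # work by partition (placeholder connection details and partition names)
  sqoop import \
    -Doraoop.import.partitions='SALES_2016Q1,SALES_2016Q2' \
    -Doraoop.chunk.method=PARTITION \
    --direct \
    --connect jdbc:oracle:thin:@//orahost:1521/ORCL \
    --username etl_user \
    --password-file /user/etl/.orapass \
    --table SALES.FACT_SALES \
    --target-dir /data/raw/fact_sales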

I have the following questions:

  1. Can Sqoop figure out the column(s) on which the source table is partitioned?
  2. Irrespective of the source database, can the files produced in step 1 above be 'partitioned' (stored in different directories) on HDFS?
  3. Assuming that partitioning won't help in step 1, would it make sense in step 2? If yes, will the ORC table in step 3 inherit the partitions?
  4. Assuming that partitioning is possible only in step 3:
    1. A repetition of question 1: can the table's partitioning column be determined automatically and also used automatically?
    2. The Sqoop create-hive-table tool doesn't help with the partitioning; also, this approach means hitting the source database again, even if only for the metadata.
1 ACCEPTED SOLUTION

Super Guru

@Kaliyug Antagonist

I don't believe Sqoop import is able to automatically determine the partitions. You would typically pass in --hive-partition-key and --hive-partition-value arguments. Obviously that approach doesn't scale well when you have hundreds of partitions.
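
For a single, known partition value it looks roughly like this (connection details, table, and column names are purely illustrative), and you would have to repeat the import for every partition value, which is where it stops scaling:

  # One Sqoop run per partition value (illustrative names); the partition
  # column itself usually has to be excluded from the imported columns
  sqoop import \
    --connect 'jdbc:sqlserver://dbhost:1433;databaseName=sales' \
    --username etl_user \
    --password-file /user/etl/.dbpass \
    --table DimOriginal \
    --where "load_date = '2016-09-01'" \
    --hive-import \
    --hive-table default.dimoriginal \
    --hive-partition-key load_date \
    --hive-partition-value '2016-09-01' \
    --num-mappers 4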

This approach may be useful: HCC Article; however, it does mean hitting the DB multiple times.

Generally I would bulk-pull the tables out of the database and store them as external text files on HDFS. Then I would load the data into partitioned ORC tables using dynamic partitions, pulling from the external tables. Of course this required me to manage the schema myself. Your approach with Avro files is a common one.
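
What I mean by that, as a rough sketch with made-up table and column names ('sales_ext' standing in for the external table over the files Sqoop produced):

  # Create a partitioned ORC table and fill it with dynamic partitions
  # from the external staging table (illustrative names)
  hive -e "
  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;

  CREATE TABLE sales_orc (
    order_id BIGINT,
    amount   DECIMAL(10,2)
  )
  PARTITIONED BY (load_date STRING)
  STORED AS ORC
  TBLPROPERTIES ('orc.compress'='ZLIB');

  -- the partition column goes last in the SELECT list
  INSERT OVERWRITE TABLE sales_orc PARTITION (load_date)
  SELECT order_id, amount, load_date
  FROM sales_ext;
  "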




Super Collaborator

I think your approach of 'load the data into partitioned ORC tables using dynamic partitions, pulling from the external tables' is good enough to achieve partitioning. I'm curious to try out whether a Hive managed, dynamically partitioned ORC table can be created directly from the external Avro-based table, so that one can specify only the partitioning key and not the whole set of columns:

Current step 3:

create table dimoriginal_orc
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES ('orc.compress'='ZLIB')
AS select * from dimoriginal_avro_compressed;
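
As far as I can tell, CTAS in Hive cannot target a partitioned table, so the partitioned variant would have to be a separate CREATE plus a dynamic-partition INSERT ... SELECT, and the full column list would still be needed in the DDL. Roughly, with placeholder columns (only the table names come from the statement above):

  hive -e "
  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;

  -- placeholder columns; 'part_col' stands for the partitioning column
  CREATE TABLE dimoriginal_orc (
    col1 STRING,
    col2 INT
  )
  PARTITIONED BY (part_col STRING)
  STORED AS ORC
  TBLPROPERTIES ('orc.compress'='ZLIB');

  INSERT OVERWRITE TABLE dimoriginal_orc PARTITION (part_col)
  SELECT col1, col2, part_col
  FROM dimoriginal_avro_compressed;
  "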

Super Guru

@Kaliyug Antagonist

You may find Gobblin helpful: Gobblin. It comes from LinkedIn. Here is the Hive Avro to ORC converter: http://gobblin.readthedocs.io/en/latest/adaptors/Hive-Avro-To-ORC-Converter/