Created 10-12-2016 06:57 PM
Upgraded to HDP-2.5.0.0 using Ambari 2.4.0.1
There are several SQL Server and Oracle database schemas that need to be imported into HDFS/Hive.
The current approach is working fine:
Many tables in the SQL Server and Oracle schemas are partitioned. As per the Sqoop documentation, at least for Oracle, it seems that the data on HDFS can be 'partitioned' based on the source table partitions; similar options don't seem to exist for SQL Server.
I have the following questions:
Created 10-12-2016 08:28 PM
I don't believe Sqoop import is able to automatically determine the partitions. You would typically pass in --hive-partition-key and --hive-partition-value arguments. Obviously that approach doesn't scale well when you have hundreds of partitions.
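A minimal sketch of what that looks like for a single partition (the connection string, credentials, table, and column names here are hypothetical; the partition column is typically excluded from the imported columns so it isn't duplicated in the Hive table):

# Import one source partition into one Hive partition.
sqoop import \
  --connect 'jdbc:sqlserver://dbhost:1433;databaseName=SalesDB' \
  --username etl_user -P \
  --table DimCustomer \
  --columns 'id,name' \
  --where "load_date = '2016-10-12'" \
  --hive-import \
  --hive-table dimcustomer \
  --hive-partition-key load_date \
  --hive-partition-value '2016-10-12'

You would then repeat this per partition, which is exactly why it doesn't scale to hundreds of them.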
This approach may be useful: HCC Article. However, it does mean hitting the DB multiple times.
Generally I would bulk-pull the tables out of the database and store them as external text files on HDFS. Then I would load the data into ORC partitioned tables using dynamic partitions, pulling from the external tables. Of course, this requires me to manage the schema myself. Your approach with Avro files is a common one.
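To illustrate that flow, a rough HiveQL sketch (table names, columns, delimiter, and HDFS path are all hypothetical):

-- External table over the raw text files pulled out of the database.
CREATE EXTERNAL TABLE customer_ext (
  id INT,
  name STRING,
  load_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/staging/customer';

-- Managed, partitioned ORC target table.
CREATE TABLE customer_orc (
  id INT,
  name STRING
)
PARTITIONED BY (load_date STRING)
STORED AS ORC;

-- Enable dynamic partitioning, then let Hive route rows into
-- partitions based on the last column of the SELECT.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE customer_orc PARTITION (load_date)
SELECT id, name, load_date FROM customer_ext;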
Created 10-13-2016 07:45 AM
I think your approach of 'load the data into orc partitioned tables using dynamic partitions pulling from the external tables' is good enough to achieve partitioning. I'm curious whether a Hive-managed, dynamically PARTITIONED ORC table can be created directly from the external Avro-based table, so that one can specify only the partitioning key and not the whole set of columns:
Current Step-3:
CREATE TABLE dimoriginal_orc
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES ('orc.compress'='ZLIB')
AS SELECT * FROM dimoriginal_avro_compressed;
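For reference, as far as I know Hive's CTAS cannot declare PARTITIONED BY (it fails with a SemanticException), so the partitioned variant likely needs the full column list up front plus a separate dynamic-partition insert rather than a one-statement create. A sketch, with hypothetical columns col1/col2 and partition key part_col:

-- CTAS can't create a partitioned table, so create it explicitly first.
CREATE TABLE dimoriginal_orc_part (
  col1 INT,
  col2 STRING
)
PARTITIONED BY (part_col STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');

-- Then fill it from the Avro-based external table with dynamic partitions.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE dimoriginal_orc_part PARTITION (part_col)
SELECT col1, col2, part_col
FROM dimoriginal_avro_compressed;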
Created 10-13-2016 07:05 PM
You may find Gobblin helpful; it comes from LinkedIn. Here is its Hive Avro-to-ORC converter: http://gobblin.readthedocs.io/en/latest/adaptors/Hive-Avro-To-ORC-Converter/