<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Sqoop import to HDFS and partitioning possibilities in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Sqoop-import-to-HDFS-and-partitioning-possibilities/m-p/127118#M43363</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/5134/kaliyugantagonist.html" nodeid="5134"&gt;@Kaliyug Antagonist&lt;/A&gt; &lt;/P&gt;&lt;P&gt;You may find Gobblin helpful: &lt;A target="_blank" href="http://gobblin.readthedocs.io/en/latest/Getting-Started/#download"&gt;Gobblin&lt;/A&gt; It comes from LinkedIn.  Here is the Hive Avro to Orc Converter: &lt;A target="_blank" href="http://gobblin.readthedocs.io/en/latest/adaptors/Hive-Avro-To-ORC-Converter/"&gt;http://gobblin.readthedocs.io/en/latest/adaptors/Hive-Avro-To-ORC-Converter/&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 14 Oct 2016 02:05:55 GMT</pubDate>
    <dc:creator>myoung</dc:creator>
    <dc:date>2016-10-14T02:05:55Z</dc:date>
    <item>
      <title>Sqoop import to HDFS and partitioning possibilities</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Sqoop-import-to-HDFS-and-partitioning-possibilities/m-p/127115#M43360</link>
      <description>&lt;P&gt;Upgraded to HDP-2.5.0.0 using Ambari  2.4.0.1&lt;/P&gt;&lt;P&gt;There are several SQL Server and Oracle database schema that need to imported to HDFS/Hive.&lt;/P&gt;&lt;P&gt;The current approach is working fine :&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Sqoop import from RDBMS to HDFS in &lt;STRONG&gt;avro format&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;Creation of a Hive &lt;STRONG&gt;external &lt;/STRONG&gt;table atop the avro files&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Copying the data&lt;/STRONG&gt; from the Hive external table &lt;STRONG&gt;in &lt;/STRONG&gt;&lt;STRONG&gt;a managed, ORC table&lt;/STRONG&gt; as 'CREATE TABLE ... AS SELECT * FROM ...'&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Many tables in the SQL Server and Oracle schema are partitioned. As per the &lt;A target="_blank" href="https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_match_hadoop_files_to_oracle_table_partitions"&gt;Sqoop documentation&lt;/A&gt;, at least for Oracle, it seems that that the &lt;STRONG&gt;data on HDFS can be 'partitioned'&lt;/STRONG&gt; based on the source table partitions, similar options don't seem to exist for SQL Server.&lt;/P&gt;&lt;P&gt;I have the following questions :&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;Can Sqoop figure out the column(s) on which the source table is partitioned ?&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;Irrespective of the source db, can the files resulting in Step 1. above be 'partitioned' (stored in different directories) on the HDFS?&lt;/LI&gt;&lt;LI&gt;Assuming that partitioning won't help in step 1., would it make sense in step 2. ? If yes, will the the ORC table in Step 3. inherit the partitions ?&lt;/LI&gt;&lt;LI&gt;Assuming that partitioning is possible only in step 3. :&lt;OL&gt;&lt;LI&gt;A repetition of question 1. - can the table's partitioning column be determined auto. and used auto. as well&lt;/LI&gt;&lt;LI&gt;The &lt;A target="_blank" href="https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_literal_sqoop_create_hive_table_literal"&gt;Sqoop create hive table&lt;/A&gt; doesn't help with the partitioning, also, this approach means again hitting the source db, even though for just for the metadata&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Thu, 13 Oct 2016 01:57:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Sqoop-import-to-HDFS-and-partitioning-possibilities/m-p/127115#M43360</guid>
      <dc:creator>kaliyugantagoni</dc:creator>
      <dc:date>2016-10-13T01:57:57Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop import to HDFS and partitioning possibilities</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Sqoop-import-to-HDFS-and-partitioning-possibilities/m-p/127116#M43361</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/5134/kaliyugantagonist.html" nodeid="5134"&gt;@Kaliyug Antagonist&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I don't believe Sqoop import is able to automatically determine the partitions.  You would typically pass in --hive-partition-key and --hive-partition-value arguments.  Obviously that approach doesn't scale well when you have hundreds of partitions.&lt;/P&gt;&lt;P&gt;This approach may be useful: &lt;A target="_blank" href="https://community.hortonworks.com/articles/17469/creating-hive-partitioned-tables-using-sqoop.html"&gt;HCC Article&lt;/A&gt; however it does mean hitting the DB multiple times.&lt;/P&gt;&lt;P&gt;Generally I would bulk pull the tables out of the database and store them as external text files on HDFS.  Then I would load the data into orc partitioned tables using dynamic partitions pulling from the external tables.  Of course this required me to manage the schema myself.  Your approach with avro files is a common one.&lt;/P&gt;</description>
      <pubDate>Thu, 13 Oct 2016 03:28:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Sqoop-import-to-HDFS-and-partitioning-possibilities/m-p/127116#M43361</guid>
      <dc:creator>myoung</dc:creator>
      <dc:date>2016-10-13T03:28:13Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop import to HDFS and partitioning possibilities</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Sqoop-import-to-HDFS-and-partitioning-possibilities/m-p/127117#M43362</link>
      <description>&lt;P&gt;I think your approach of 'load the data into orc partitioned tables using dynamic partitions pulling from the external tables' is good enough to achieve partitioning. I'm curious to try out if a Hive managed, ORC dynamically PARTITIONED table can be directly created from the external Avro based table so that one can specify only the partitioning key AND not the whole set of columns :&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Current &lt;/STRONG&gt;Step-3.&lt;/P&gt;&lt;PRE&gt;create table dimoriginal_orc ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' TBLPROPERTIES ('orc.compress'='ZLIB') AS select * from dimoriginal_avro_compressed;&lt;/PRE&gt;</description>
      <pubDate>Thu, 13 Oct 2016 14:45:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Sqoop-import-to-HDFS-and-partitioning-possibilities/m-p/127117#M43362</guid>
      <dc:creator>kaliyugantagoni</dc:creator>
      <dc:date>2016-10-13T14:45:23Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop import to HDFS and partitioning possibilities</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Sqoop-import-to-HDFS-and-partitioning-possibilities/m-p/127118#M43363</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/5134/kaliyugantagonist.html" nodeid="5134"&gt;@Kaliyug Antagonist&lt;/A&gt; &lt;/P&gt;&lt;P&gt;You may find Gobblin helpful: &lt;A target="_blank" href="http://gobblin.readthedocs.io/en/latest/Getting-Started/#download"&gt;Gobblin&lt;/A&gt; It comes from LinkedIn.  Here is the Hive Avro to Orc Converter: &lt;A target="_blank" href="http://gobblin.readthedocs.io/en/latest/adaptors/Hive-Avro-To-ORC-Converter/"&gt;http://gobblin.readthedocs.io/en/latest/adaptors/Hive-Avro-To-ORC-Converter/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Oct 2016 02:05:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Sqoop-import-to-HDFS-and-partitioning-possibilities/m-p/127118#M43363</guid>
      <dc:creator>myoung</dc:creator>
      <dc:date>2016-10-14T02:05:55Z</dc:date>
    </item>
  </channel>
</rss>

