<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Best practices to work Sqoop, HDFS and Hive in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Best-practices-to-work-Sqoop-HDFS-and-Hive/m-p/133619#M96286</link>
    <description>&lt;P&gt;Here's a good article for that &lt;A href="https://community.hortonworks.com/articles/85165/scheduled-incremental-ingestion-of-ms-sql-data-to.html" target="_blank"&gt;https://community.hortonworks.com/articles/85165/scheduled-incremental-ingestion-of-ms-sql-data-to.html&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 02 Mar 2017 20:58:50 GMT</pubDate>
    <dc:creator>aervits</dc:creator>
    <dc:date>2017-03-02T20:58:50Z</dc:date>
    <item>
      <title>Best practices to work Sqoop, HDFS and Hive</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Best-practices-to-work-Sqoop-HDFS-and-Hive/m-p/133618#M96285</link>
      <description>&lt;P&gt;I have to use &lt;CODE&gt;sqoop&lt;/CODE&gt; to import all tables from a MySQL database into &lt;CODE&gt;hdfs&lt;/CODE&gt; and into &lt;CODE&gt;external tables&lt;/CODE&gt; in &lt;CODE&gt;hive&lt;/CODE&gt; (no filters, with the same structure).&lt;/P&gt;&lt;P&gt;The import should bring in:&lt;/P&gt;&lt;BLOCKQUOTE&gt;
&lt;UL&gt;
&lt;LI&gt;New data for existing tables&lt;/LI&gt;&lt;LI&gt;Updated data for existing tables (using only the id column)&lt;/LI&gt;&lt;LI&gt;New tables created in MySQL (and create the corresponding external table in Hive)&lt;/LI&gt;&lt;/UL&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Then I want to create a &lt;CODE&gt;sqoop job&lt;/CODE&gt; to do it all automatically.&lt;/P&gt;&lt;P&gt;(I have a &lt;CODE&gt;mysql&lt;/CODE&gt; database with approximately 60 tables, and with each new client going into production a new table is created, so I need &lt;CODE&gt;sqoop&lt;/CODE&gt; to work as automatically as possible.)&lt;/P&gt;&lt;P&gt;The first command I executed to import all the tables was:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;sqoop import-all-tables \
  --connect jdbc:mysql://IP/db_name \
  --username user \
  --password pass \
  --warehouse-dir /user/hdfs/db_name \
  -m 1&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Here, &lt;A href="https://issues.apache.org/jira/browse/SQOOP-816"&gt;Sqoop and support for external Hive tables&lt;/A&gt; says that support was added for creating external tables in &lt;CODE&gt;hive&lt;/CODE&gt;, but I could not find documentation or examples for the commands mentioned there.&lt;/P&gt;&lt;BLOCKQUOTE&gt;
&lt;P&gt;What is the best practice for a &lt;CODE&gt;sqoop&lt;/CODE&gt; setup that picks up all the updates from a &lt;CODE&gt;mysql&lt;/CODE&gt; database and passes them on to &lt;CODE&gt;hdfs&lt;/CODE&gt; and &lt;CODE&gt;hive&lt;/CODE&gt;?&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Any ideas would be welcome.&lt;/P&gt;&lt;P&gt;Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Thu, 02 Mar 2017 17:11:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Best-practices-to-work-Sqoop-HDFS-and-Hive/m-p/133618#M96285</guid>
      <dc:creator>sola_carol</dc:creator>
      <dc:date>2017-03-02T17:11:28Z</dc:date>
    </item>
    <item>
      <title>Re: Best practices to work Sqoop, HDFS and Hive</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Best-practices-to-work-Sqoop-HDFS-and-Hive/m-p/133619#M96286</link>
      <description>&lt;P&gt;Here's a good article for that &lt;A href="https://community.hortonworks.com/articles/85165/scheduled-incremental-ingestion-of-ms-sql-data-to.html" target="_blank"&gt;https://community.hortonworks.com/articles/85165/scheduled-incremental-ingestion-of-ms-sql-data-to.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 02 Mar 2017 20:58:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Best-practices-to-work-Sqoop-HDFS-and-Hive/m-p/133619#M96286</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2017-03-02T20:58:50Z</dc:date>
    </item>
    <item>
      <title>Re: Best practices to work Sqoop, HDFS and Hive</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Best-practices-to-work-Sqoop-HDFS-and-Hive/m-p/133620#M96287</link>
      <description>&lt;P&gt;The patch to create Hive external tables from Sqoop is still unresolved:&lt;/P&gt;&lt;P&gt;&lt;A href="https://issues.apache.org/jira/browse/SQOOP-816" target="_blank"&gt;https://issues.apache.org/jira/browse/SQOOP-816&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Unfortunately, you will not be able to pull updates from the source tables using only the id column. An id column alone only lets Sqoop append newly inserted rows (incremental mode &lt;CODE&gt;append&lt;/CODE&gt;); to pick up updates to existing rows, Sqoop needs a last-modified timestamp column (incremental mode &lt;CODE&gt;lastmodified&lt;/CODE&gt;). So the best practice is really on the database side: keep columns like 'modified' and 'modified by' in your tables.&lt;/P&gt;&lt;P&gt;&lt;A href="https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports" target="_blank"&gt;https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 03 Mar 2017 01:27:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Best-practices-to-work-Sqoop-HDFS-and-Hive/m-p/133620#M96287</guid>
      <dc:creator>umair_khan</dc:creator>
      <dc:date>2017-03-03T01:27:11Z</dc:date>
    </item>
  </channel>
</rss>