Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Best practices to work Sqoop, HDFS and Hive

avatar
New Contributor

I have to use sqoop to import all tables from a mysql database to hdfs and to external tablesin hive (no filters, with the same structure)

In import I want to bring:

  • New data for existing tables
  • Updated data for existing tables (using only the id column)
  • New tables created in mysql (y to create external table in hive)

Then create a sqoop job to do it all automatically.

(I have a mysql database with approximately 60 tables, and with each new client going into production, a new table is created. So I need sqoop to work as automatically as possible)

The first command executed to import all the tables was:

sqoop import-all-tables

--connect jdbc:mysql://IP/db_name

--username user

--password pass

--warehouse-dir /user/hdfs/db_name

-m 1

Here Scoop and support for external Hive tables says that support was added for the creation of external tables in hive, but I did not find documentation or examples on the mentioned commands

What are the best practices to work with in sqoop where it looks at all the updates from a mysql database and passes to hdfs and hive?

Any ideas would be good.

Thanks in advance.

1 ACCEPTED SOLUTION

avatar
Expert Contributor

The patch to create Hive external tables from Sqoop is still unresolved:

https://issues.apache.org/jira/browse/SQOOP-816

Unfortunately you will not be able to pull updates from source tables using only id column. You will need a timestamped (last modified) column for Sqoop to know which rows were updated. So the best practice is rather at your database side where it is always best to keep columns like 'modified', 'modified by' in your tables.

https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports

View solution in original post

2 REPLIES 2

avatar
Master Mentor

avatar
Expert Contributor

The patch to create Hive external tables from Sqoop is still unresolved:

https://issues.apache.org/jira/browse/SQOOP-816

Unfortunately you will not be able to pull updates from source tables using only id column. You will need a timestamped (last modified) column for Sqoop to know which rows were updated. So the best practice is rather at your database side where it is always best to keep columns like 'modified', 'modified by' in your tables.

https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports