Support Questions

Find answers, ask questions, and share your expertise
Announcements
Check out our newest addition to the community, the Cloudera Data Analytics (CDA) group hub.

Best practices to work Sqoop, HDFS and Hive

New Contributor

I have to use sqoop to import all tables from a mysql database to hdfs and to external tablesin hive (no filters, with the same structure)

In import I want to bring:

  • New data for existing tables
  • Updated data for existing tables (using only the id column)
  • New tables created in mysql (y to create external table in hive)

Then create a sqoop job to do it all automatically.

(I have a mysql database with approximately 60 tables, and with each new client going into production, a new table is created. So I need sqoop to work as automatically as possible)

The first command executed to import all the tables was:

sqoop import-all-tables

--connect jdbc:mysql://IP/db_name

--username user

--password pass

--warehouse-dir /user/hdfs/db_name

-m 1

Here Scoop and support for external Hive tables says that support was added for the creation of external tables in hive, but I did not find documentation or examples on the mentioned commands

What are the best practices to work with in sqoop where it looks at all the updates from a mysql database and passes to hdfs and hive?

Any ideas would be good.

Thanks in advance.

1 ACCEPTED SOLUTION

Expert Contributor

The patch to create Hive external tables from Sqoop is still unresolved:

https://issues.apache.org/jira/browse/SQOOP-816

Unfortunately you will not be able to pull updates from source tables using only id column. You will need a timestamped (last modified) column for Sqoop to know which rows were updated. So the best practice is rather at your database side where it is always best to keep columns like 'modified', 'modified by' in your tables.

https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports

View solution in original post

2 REPLIES 2

Mentor

Expert Contributor

The patch to create Hive external tables from Sqoop is still unresolved:

https://issues.apache.org/jira/browse/SQOOP-816

Unfortunately you will not be able to pull updates from source tables using only id column. You will need a timestamped (last modified) column for Sqoop to know which rows were updated. So the best practice is rather at your database side where it is always best to keep columns like 'modified', 'modified by' in your tables.

https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports

Take a Tour of the Community
Don't have an account?
Your experience may be limited. Sign in to explore more.