Created 03-02-2017 09:11 AM
I have to use sqoop to import all tables from a mysql database to hdfs and to external tablesin hive (no filters, with the same structure)
In import I want to bring:
- New data for existing tables
- Updated data for existing tables (using only the id column)
- New tables created in mysql (y to create external table in hive)
Then create a sqoop job to do it all automatically.
(I have a mysql database with approximately 60 tables, and with each new client going into production, a new table is created. So I need sqoop to work as automatically as possible)
The first command executed to import all the tables was:
sqoop import-all-tables
--connect jdbc:mysql://IP/db_name
--username user
--password pass
--warehouse-dir /user/hdfs/db_name
-m 1
Here Scoop and support for external Hive tables says that support was added for the creation of external tables in hive, but I did not find documentation or examples on the mentioned commands
What are the best practices to work with in
sqoopwhere it looks at all the updates from amysqldatabase and passes tohdfsandhive?
Any ideas would be good.
Thanks in advance.
Created 03-02-2017 05:27 PM
The patch to create Hive external tables from Sqoop is still unresolved:
https://issues.apache.org/jira/browse/SQOOP-816
Unfortunately you will not be able to pull updates from source tables using only id column. You will need a timestamped (last modified) column for Sqoop to know which rows were updated. So the best practice is rather at your database side where it is always best to keep columns like 'modified', 'modified by' in your tables.
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
Created 03-02-2017 12:58 PM
Here's a good article for that https://community.hortonworks.com/articles/85165/scheduled-incremental-ingestion-of-ms-sql-data-to.h...
Created 03-02-2017 05:27 PM
The patch to create Hive external tables from Sqoop is still unresolved:
https://issues.apache.org/jira/browse/SQOOP-816
Unfortunately you will not be able to pull updates from source tables using only id column. You will need a timestamped (last modified) column for Sqoop to know which rows were updated. So the best practice is rather at your database side where it is always best to keep columns like 'modified', 'modified by' in your tables.
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports