Created 03-02-2017 09:11 AM
I have to use sqoop
to import all tables from a mysql database to hdfs
and to external tables
in hive
(no filters, with the same structure)
In import I want to bring:
- New data for existing tables
- Updated data for existing tables (using only the id column)
- New tables created in mysql (y to create external table in hive)
Then create a sqoop job
to do it all automatically.
(I have a mysql
database with approximately 60 tables, and with each new client going into production, a new table is created. So I need sqoop
to work as automatically as possible)
The first command executed to import all the tables was:
sqoop import-all-tables
--connect jdbc:mysql://IP/db_name
--username user
--password pass
--warehouse-dir /user/hdfs/db_name
-m 1
Here Scoop and support for external Hive tables says that support was added for the creation of external tables in hive
, but I did not find documentation or examples on the mentioned commands
What are the best practices to work with in
sqoop
where it looks at all the updates from amysql
database and passes tohdfs
andhive
?
Any ideas would be good.
Thanks in advance.
Created 03-02-2017 05:27 PM
The patch to create Hive external tables from Sqoop is still unresolved:
https://issues.apache.org/jira/browse/SQOOP-816
Unfortunately you will not be able to pull updates from source tables using only id column. You will need a timestamped (last modified) column for Sqoop to know which rows were updated. So the best practice is rather at your database side where it is always best to keep columns like 'modified', 'modified by' in your tables.
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
Created 03-02-2017 12:58 PM
Here's a good article for that https://community.hortonworks.com/articles/85165/scheduled-incremental-ingestion-of-ms-sql-data-to.h...
Created 03-02-2017 05:27 PM
The patch to create Hive external tables from Sqoop is still unresolved:
https://issues.apache.org/jira/browse/SQOOP-816
Unfortunately you will not be able to pull updates from source tables using only id column. You will need a timestamped (last modified) column for Sqoop to know which rows were updated. So the best practice is rather at your database side where it is always best to keep columns like 'modified', 'modified by' in your tables.
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports