Created 05-07-2016 06:44 PM
what would be the most optimal way to import tables in oozie (let's say 8-10 tables ) from a database that has 100's of them. Now, I can not use exclude option with import all and specify 90 tables which I need to exclude.
Also, I believe that the table imports should be parallel jobs in oozie? Since these are independent of each other, why not? In that case,
1. how exactly do I specify tables imports in parallel? (from same and different databases).
2. How do I have incremental imports every day i.e I import only the newly added data everyday at a particular time?(I have a co-ordinator.xml but not sure about passing the date / recordnum. for incremental imports).
3. How do I specify different frequencies ? i.e some of the imports happen daily and rest may be once in 3 days?
4. The APPEND option can limit data movement only to new or updated records. How exactly? how is it different from incremental imports? Which one is performance-wise better and faster)?
Also, Thank you so much for maintaining this wonderfully informed community :).
Created on 05-07-2016 07:34 PM - edited 08-19-2019 01:47 AM
Hi @simran kaur,
#!/bin/bash # t='Table1 Table2 Table2' for i in $t; do echo "Debut import table $i" ; sqoop import --connect jdbc:******* (the rest of parameters) echo "Fin import table $i" ; done
The import will be sequential and this is not a bad thing. Sqoop import can be hard on the Database if you have lot of data and high degree of paralelisme.
Note that row 3 has been modified (Ohio -> Oh). However, it has not been imported in the second import
Hope this helps
Created on 05-07-2016 07:34 PM - edited 08-19-2019 01:47 AM
Hi @simran kaur,
#!/bin/bash # t='Table1 Table2 Table2' for i in $t; do echo "Debut import table $i" ; sqoop import --connect jdbc:******* (the rest of parameters) echo "Fin import table $i" ; done
The import will be sequential and this is not a bad thing. Sqoop import can be hard on the Database if you have lot of data and high degree of paralelisme.
Note that row 3 has been modified (Ohio -> Oh). However, it has not been imported in the second import
Hope this helps
Created 05-07-2016 07:34 PM
1) You essentially have two options. Use Sqoop import-all-tables with exclude as you mention. However in that case you have a single sqoop action in oozie and no parallelity in oozie. However sqoop might provide that. You have some limitations though ( only straight imports all columns , ... )
Alternatively you make an oozie flow that uses a fork and then one single table sqoop action per table. In that case you have fine grained control over how much you want to run in parallel. ( You could for example load 4 at a time by doing
Start -> Fork -> 4 Sqoop Actions -> Join -> Fork -> 4 Sqoop Actions -> Join -> End
2) If you want incremental load I don't think the Sqoop import-all-tables is possible. So one Sqoop action per table it is. Essentially you can either use Sqoop incremental import functionality ( using a property file ) or use WHERE conditions and give through the date parameter from the coordinator. You can use coord:dateformat to transform your execution date.
3) Run One coord for each table OR have a Decision action in the oozie workflow that skips some sqoop actions
Like
Start -> Sqoop1 where date = mydate -> Decision if mydate % 3 = 0 then Sqoop2 else end.
4) incremental imports load the new data into a folder in HDFS. If you run it the folder needs to be deleted. If you use append it doesn't delete the old data in HDFS. Now you may ask why would I ever not want append and the reason is that you mostly do something with the data after like importing the new data to a hive partitioned table. If you would use append he would load the same data over and over.