Support Questions

sim6 · ‎05-07-2016

what would be the most optimal way to import tables in oozie (let's say 8-10 tables ) from a database that has 100's of them. Now, I can not use exclude option with import all and specify 90 tables which I need to exclude.

Also, I believe that the table imports should be parallel jobs in oozie? Since these are independent of each other, why not? In that case,

1. how exactly do I specify tables imports in parallel? (from same and different databases).

2. How do I have incremental imports every day i.e I import only the newly added data everyday at a particular time?(I have a co-ordinator.xml but not sure about passing the date / recordnum. for incremental imports).

3. How do I specify different frequencies ? i.e some of the imports happen daily and rest may be once in 3 days?

4. The APPEND option can limit data movement only to new or updated records. How exactly? how is it different from incremental imports? Which one is performance-wise better and faster)?

Also, Thank you so much for maintaining this wonderfully informed community :).

ahadjidj · ‎05-07-2016

Hi @simran kaur,

To import several tables you can write a script and call it with Oozie. In the script you put your table names in a variable and iterate over it

#!/bin/bash
#
t='Table1 Table2 Table2'
for i in $t; 
do 
 echo "Debut import table $i" ;
 sqoop import --connect jdbc:******* (the rest of parameters)
 echo "Fin import table $i" ;
done

The import will be sequential and this is not a bad thing. Sqoop import can be hard on the Database if you have lot of data and high degree of paralelisme.

To do incremental import you need to have a column in your table that helps you. Mainly two options
- Your table contains a timestamp column that contains the timestamps of last modification for each row. You can use last modified mode (--incremental lastmodified). Rows having a timestamp value greater than --last-value are imported. To specify which column contains the timestamps use --check-column
- The ID in your table are always incremental. Here you can use append mode (--incremental append). At each import Sqoop will save the max ID retrieved from your table. In the next import Sqoop will get only rows with ID greater than the last saved max value (--last-value). To specify which column contains the incrementing ID us --check-column. Note: in this mode we suppose that each new row is an append and you don't modify existing rows. If a row is modified after it is imported, you wont have the new values at the next import
To have several import frequency you can use several Oozie workflows with several parameters
Append mode: see point 2. Append import only new lines and not the modified one. This also requires having an incremental ID

Note that row 3 has been modified (Ohio -> Oh). However, it has not been imported in the second import

Hope this helps

View solution in original post

ahadjidj · ‎05-07-2016

Hi @simran kaur,

To import several tables you can write a script and call it with Oozie. In the script you put your table names in a variable and iterate over it

#!/bin/bash
#
t='Table1 Table2 Table2'
for i in $t; 
do 
 echo "Debut import table $i" ;
 sqoop import --connect jdbc:******* (the rest of parameters)
 echo "Fin import table $i" ;
done

The import will be sequential and this is not a bad thing. Sqoop import can be hard on the Database if you have lot of data and high degree of paralelisme.

To do incremental import you need to have a column in your table that helps you. Mainly two options
- Your table contains a timestamp column that contains the timestamps of last modification for each row. You can use last modified mode (--incremental lastmodified). Rows having a timestamp value greater than --last-value are imported. To specify which column contains the timestamps use --check-column
- The ID in your table are always incremental. Here you can use append mode (--incremental append). At each import Sqoop will save the max ID retrieved from your table. In the next import Sqoop will get only rows with ID greater than the last saved max value (--last-value). To specify which column contains the incrementing ID us --check-column. Note: in this mode we suppose that each new row is an append and you don't modify existing rows. If a row is modified after it is imported, you wont have the new values at the next import
To have several import frequency you can use several Oozie workflows with several parameters
Append mode: see point 2. Append import only new lines and not the modified one. This also requires having an incremental ID

Note that row 3 has been modified (Ohio -> Oh). However, it has not been imported in the second import

Hope this helps

bleonhardi · ‎05-07-2016

1) You essentially have two options. Use Sqoop import-all-tables with exclude as you mention. However in that case you have a single sqoop action in oozie and no parallelity in oozie. However sqoop might provide that. You have some limitations though ( only straight imports all columns , ... )

Alternatively you make an oozie flow that uses a fork and then one single table sqoop action per table. In that case you have fine grained control over how much you want to run in parallel. ( You could for example load 4 at a time by doing

Start -> Fork -> 4 Sqoop Actions -> Join -> Fork -> 4 Sqoop Actions -> Join -> End

2) If you want incremental load I don't think the Sqoop import-all-tables is possible. So one Sqoop action per table it is. Essentially you can either use Sqoop incremental import functionality ( using a property file ) or use WHERE conditions and give through the date parameter from the coordinator. You can use coord:dateformat to transform your execution date.

3) Run One coord for each table OR have a Decision action in the oozie workflow that skips some sqoop actions

Like

Start -> Sqoop1 where date = mydate -> Decision if mydate % 3 = 0 then Sqoop2 else end.

4) incremental imports load the new data into a folder in HDFS. If you run it the folder needs to be deleted. If you use append it doesn't delete the old data in HDFS. Now you may ask why would I ever not want append and the reason is that you mostly do something with the data after like importing the new data to a hive partitioned table. If you would use append he would load the same data over and over.

Cloudera Community

Support Questions

optimized oozie workflow to import multiple tables

Tips and best practices for optimizing Hive perfor...

Oozie Python workflow example walkthrough

Optimizing Hive queries for ORC formatted tables

Apache Ambari Workflow Manager View for Apache Ooz...

How to re-run failed action from oozie workflow

Shell action in oozie workflow via Hue

Oozie - Simulating looping with sub workflows

Price Optimization with PyGurobi in Cloudera Machi...

Apache Ambari Workflow Manager View for Apache Ooz...

How to optimize IMPALA/KUDU queries