Created 06-20-2017 09:04 AM
Hi, I need to sqoop about 700 tables from 2 Oracle instances and I am using a custom query to extract them.
To speed up the process a bit more, I set
--fetch-size 2000000
on Sqoop.
I have a file with one table per line, plus some arguments and the query. I built a shell script that uses GNU Parallel to run more than one offload at the same time. It works correctly, but I don't understand why I have to tune the heap size of the processes; otherwise it fails with an OOM error.
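For context, the wrapper looks roughly like this (a minimal sketch: tables.txt, the -j 4 concurrency, and import_one_table.sh are placeholders for my actual input file, parallelism, and per-table import script):

# tables.txt has one table per line, followed by its arguments and the query
# GNU Parallel runs up to 4 imports at a time; {} is replaced by the whole input line
parallel -j 4 ./import_one_table.sh {} :::: tables.txt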
I understand that Sqoop uses the HDFS client to write data to HDFS, and since I force Sqoop to fetch 2 million records at a time, I need to give the process enough heap to hold them all.
So I tune the HDFS client via
HADOOP_CLIENT_OPTS="-Xmx6144m $HADOOP_CLIENT_OPTS"
inside the script, and the heap size of the map tasks via
-Dmapreduce.map.memory.mb=8192 and -Dmapreduce.map.java.opts=-Xmx6553m
in the sqoop import command.
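Put together, a single import ends up looking roughly like this (a sketch: the JDBC URL, credentials, table, split column, and target directory are placeholders; only the memory settings and --fetch-size are the values I actually use):

# heap for the local Sqoop/HDFS client process
export HADOOP_CLIENT_OPTS="-Xmx6144m $HADOOP_CLIENT_OPTS"

sqoop import \
  -Dmapreduce.map.memory.mb=8192 \
  -Dmapreduce.map.java.opts=-Xmx6553m \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username MYUSER \
  --password-file /user/me/.oracle_pw \
  --query "SELECT * FROM MYSCHEMA.MY_TABLE WHERE \$CONDITIONS" \
  --split-by ID \
  --num-mappers 4 \
  --fetch-size 2000000 \
  --target-dir /data/staging/MY_TABLE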
My point is: why do some tables complete while others don't? Why can't it just slow down to keep pace instead of failing?
I don't like this approach because as soon as a table grows large enough, Sqoop will fail again. I can't go to production with something that I already know will break in the future.