[Sqoop1] OOM on offloading from Oracle

Hi, I need to Sqoop about 700 tables from two Oracle instances, and I am using a custom query to extract each of them.

To speed the process up a bit more, I set

--fetch-size 2000000

on the Sqoop command line.

 

I have a file with one table per line, plus some arguments and the custom query for it. I built a shell script that uses GNU Parallel to run several offloads at the same time. It works correctly, but I don't understand why I have to tune the heap size of the processes; otherwise they fail with an OOM error.
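
For reference, the driver script is shaped roughly like this (a minimal sketch with placeholder names: tables.txt, run_import, the JDBC URL and the target directory are illustrative, not my real setup):

#!/bin/bash
# tables.txt (placeholder name): one tab-separated line per table:
#   <table_name>  <split_column>  <custom query containing WHERE $CONDITIONS>

run_import() {
  local table="$1" split_col="$2" query="$3"
  sqoop import \
    --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
    --username myuser --password-file /user/me/.ora.pwd \
    --query "$query" \
    --split-by "$split_col" \
    --fetch-size 2000000 \
    --target-dir "/data/offload/$table"
}
export -f run_import

# Run a few offloads at a time; {1}..{3} are the fields of each line in tables.txt.
parallel -j 4 --colsep '\t' run_import {1} {2} {3} :::: tables.txt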

 

I understand that Sqoop uses the HDFS client to write data to HDFS, and since I force Sqoop to fetch 2 million records at a time, I need to give the process enough heap to hold them all (at roughly 1 KB per row, for example, that is already on the order of 2 GB for a single fetch buffer).

So I tune the HDFS client via

HADOOP_CLIENT_OPTS="-Xmx6144m $HADOOP_CLIENT_OPTS"

inside the script, and the heap size of the map tasks via

-Dmapreduce.map.memory.mb=8192 and -Dmapreduce.map.java.opts=-Xmx6553m

in the sqoop import command.
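
Put together, the tuning lands in two places (again just a sketch; the connection details and paths are placeholders). Note that the generic -D options have to come right after "sqoop import", before the tool-specific arguments:

# Heap of the local Sqoop / HDFS client JVM, set inside the shell script:
export HADOOP_CLIENT_OPTS="-Xmx6144m $HADOOP_CLIENT_OPTS"

# Map-task container size plus the JVM heap inside it
# (-Xmx6553m is roughly 80% of the 8192 MB container):
sqoop import \
  -Dmapreduce.map.memory.mb=8192 \
  -Dmapreduce.map.java.opts=-Xmx6553m \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username myuser --password-file /user/me/.ora.pwd \
  --query "SELECT * FROM some_table WHERE \$CONDITIONS" \
  --split-by id \
  --fetch-size 2000000 \
  --target-dir /data/offload/some_table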

 

My point is: why do some tables complete while others don't? Why can't it just slow down to keep pace?

I don't like this approach because, as soon as a table grows large enough, Sqoop will fail again. I can't go to production with something that I already know will break in the future.

 
