
Optimising the time required to extract data from Oracle



New Contributor



I'm looking to extract a large amount of data from a few Oracle tables and transfer it to an HDFS file system. There appear to be two possible ways of achieving this:


  1. Use Sqoop to extract the data and copy it across the network directly to HDFS
  2. Extract the data to a local file system. Once this has been completed, copy (ftp?) the data to the Hadoop system.
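For reference, option (1) is a single Sqoop invocation. A minimal sketch is below; the connection string, credentials, table name, and target directory are all placeholders, not values from this thread:

```shell
# Hypothetical values throughout: host, service name, user, table,
# and HDFS path must be replaced with your own.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username SCOTT -P \
  --table SALES \
  --target-dir /data/sales \
  --num-mappers 4   # parallel map tasks, each opening its own JDBC connection
```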


Clearly the second option is more work, but that isn't the issue. It's been suggested that, because Sqoop is copying data across the network, the locks on the Oracle table might remain for longer than would otherwise be required. I'll be extracting large amounts of data and copying it to a remote location (so there will be significant network latency).


Does anybody know if this is correct?




Re: Optimising the time required to extract data from Oracle

Master Guru
> Does anybody know if this is correct?

Sqoop's style of importing data from an RDBMS typically involves making JDBC connections to it. There are some specialised connectors for Oracle and other specific DBs that may change some factors in how the data is fetched and how parallelism or locking is controlled, but in the end it's all over the network, powered by JDBC exchanges over the established connection.
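As an illustration of those specialised connectors: recent Sqoop versions ship an Oracle-specific direct connector that is enabled with the `--direct` flag. A hedged sketch, again with placeholder connection details:

```shell
# --direct selects the Oracle-optimised connector where available;
# all other values are hypothetical placeholders.
sqoop import \
  --direct \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username SCOTT -P \
  --table SALES \
  --target-dir /data/sales_direct
```

The data still travels over JDBC connections across the network; the connector mainly changes how rows are split among mappers and how the reads are issued on the Oracle side.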

If your goal is a one-time, very wide transfer, it may be faster to do it via approach (2), especially given the latency issue. For subsequent incremental transfers, you can use Sqoop.
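For those subsequent incremental transfers, Sqoop's `--incremental` mode pulls only rows newer than a remembered watermark. A minimal sketch, assuming a monotonically increasing `SALE_ID` column (the column name and values are illustrative):

```shell
# Imports only rows where SALE_ID > 100000; Sqoop reports the new
# high-water mark at the end of the run for use in the next import.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username SCOTT -P \
  --table SALES \
  --target-dir /data/sales \
  --incremental append \
  --check-column SALE_ID \
  --last-value 100000
```

Wrapping this in a saved Sqoop job (`sqoop job --create ...`) lets Sqoop track `--last-value` automatically between runs.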

Check out Sqoop's Oracle-specific docs at