
Optimising the time required to extract data from Oracle

Hi,

 

I'm looking to extract a large amount of data from a few Oracle tables and transfer it to HDFS. There appear to be two possible ways of achieving this:

  1. Use Sqoop to extract the data and copy it across the network directly to HDFS (roughly sketched below, after this list).
  2. Extract the data to a local file system. Once that has completed, copy (FTP?) the data to the Hadoop system.
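
For concreteness, option 1 is roughly the command below (the connection string, credentials, table name and paths are just placeholders), and option 2 would dump to a local file first and then push that file into HDFS:

    # Option 1: Sqoop pulls straight from Oracle into HDFS over JDBC
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username SCOTT \
      --password-file /user/scott/.oracle_password \
      --table SALES \
      --target-dir /data/sales \
      --num-mappers 4

    # Option 2: export to a local file first (sqlplus SPOOL, a custom
    # dumper, etc.), then copy the finished file into HDFS
    hdfs dfs -put /local/export/sales.csv /data/sales/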

 

Clearly the second option is more work, but that isn't the issue. It's been suggested that, because Sqoop is copying data across the network, the locks on the Oracle tables might be held for longer than would otherwise be required. I'll be extracting large amounts of data and copying it to a remote location (so there will be significant network latency).

 

Does anybody know if this is correct?

 

 


Re: Optimising the time required to extract data from Oracle

> Does anybody know if this is correct?

Sqoop typically imports data from an RDBMS by making JDBC connections to it. There are specialised connectors for Oracle and a few other databases that can change how the data is fetched and how parallelism and locking are controlled, but in the end it all travels over the network, driven by JDBC exchanges over the established connection.
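
As a rough illustration (connection details, table and column names below are placeholders), a plain JDBC import is tuned mainly through how the table is split across mappers and how many rows each JDBC fetch pulls back; adding --direct on top of this switches to the Oracle-specific connector covered in the docs linked below:

    # Plain JDBC import: each mapper opens its own Oracle session and pulls
    # one slice of the table, partitioned on the --split-by column.
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username SCOTT \
      --password-file /user/scott/.oracle_password \
      --table SALES \
      --target-dir /data/sales \
      --split-by SALE_ID \
      --num-mappers 8 \
      --fetch-size 10000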

If this is a one-off transfer of a very large data set, it may well be faster to go with approach (2), especially given the latency issue. For subsequent incremental transfers, you can use Sqoop.
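
For the incremental runs, something along these lines should work (again, the table, column and last value are placeholders); you can also wrap it in a saved sqoop job so the last imported value is remembered between runs:

    # Incremental append: only pull rows whose SALE_ID is greater than
    # the value imported on the previous run.
    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username SCOTT \
      --password-file /user/scott/.oracle_password \
      --table SALES \
      --target-dir /data/sales \
      --incremental append \
      --check-column SALE_ID \
      --last-value 1000000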

Check out Sqoop's Oracle-specific docs at http://archive.cloudera.com/cdh5/cdh/5/sqoop/SqoopUserGuide.html#_data_connector_for_oracle_and_hado...