Created 02-20-2016 09:56 PM
Are there any benchmarks for Sqoop data transfers from an Oracle RDBMS to a Hadoop cluster?
Both the Hadoop cluster and the Oracle servers are located in the same datacenter, connected by a 10G network through 10G TOR switches. What sort of data transfer rate can I realistically expect if I run the transfer at a time when the Oracle servers are not being used by any other applications? I am currently getting around ~200 Mbps, but I am not sure whether that is the maximum I can expect.
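For reference, a quick sanity check on what the link could carry at best; a rough sketch using decimal units and ignoring TCP/JDBC/serialization overhead:

```shell
# Rough ceiling for a 10G link vs. the observed rate (integer math, decimal units).
link_mbps=10000                      # 10 Gbit/s
ceiling_mb_s=$((link_mbps / 8))      # bits -> bytes
observed_mbps=200
observed_mb_s=$((observed_mbps / 8))
echo "10G ceiling: ${ceiling_mb_s} MB/s, observed: ${observed_mb_s} MB/s"
```

So the observed 200 Mbps (~25 MB/s) is roughly 2% of the raw line rate, which suggests the bottleneck is not the network itself.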
Created 02-20-2016 10:00 PM
I don't think there are any benchmarks like that.
You can follow this: http://www.slideshare.net/alxslva/effective-sqoop-best-practices-pitfalls-and-lessons-40370936
Also, make sure that you have statistics generated on the Oracle tables.
Another link
"--direct" and the number of mappers play a big role.
Your setup looks really good, since the source and target are in the same DC and you have a 10G network.
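To make the advice above concrete, here is a minimal sketch of an import using "--direct" and an explicit mapper count. The JDBC URL, credentials, table, split column, and paths are placeholders, not details from this thread:

```shell
# Hypothetical example: all connection details and object names are placeholders.
sqoop import \
  --connect jdbc:oracle:thin:@//oradb.example.com:1521/ORCL \
  --username SCOTT -P \
  --table SALES.ORDERS \
  --direct \
  --num-mappers 16 \
  --split-by ORDER_ID \
  --target-dir /data/orders
```

Raising --num-mappers only helps if --split-by points at an evenly distributed column, otherwise a few mappers end up doing most of the work.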
Created 02-20-2016 11:05 PM
Thanks Neeraj. This was useful, though I still don't have a benchmark. In the Quest example, they moved a 50 GB table in 1000 seconds, an effective rate of 50 MB/s (~400 Mbps).
I also found some info here
http://grokbase.com/t/sqoop/user/146jhv8577/sqoop-to-oracle-transfer-rates
and here
In the last case, it looks like a 310 GB table took only 100 seconds (with around 25 mappers) in the best case, for a transfer rate of ~3.1 GB/s. That makes much more sense.
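Working both quoted examples through the same arithmetic (numbers copied from above; decimal GB, integer math):

```shell
# Quest example: 50 GB in 1000 s; grokbase example: 310 GB in 100 s.
quest_mb_s=$((50 * 1000 / 1000))    # -> 50 MB/s
grok_mb_s=$((310 * 1000 / 100))     # -> 3100 MB/s, i.e. ~3.1 GB/s
echo "Quest: ${quest_mb_s} MB/s, grokbase: ${grok_mb_s} MB/s"
```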
I will try to find out more details about my Oracle server configuration to see what else I can do to improve my performance.
Created 02-21-2016 12:14 AM
@Shishir Saxena OK, I am going to share these numbers based on my experience; there are no official numbers.
5-node cluster with 96 GB RAM and dual 8-core CPUs, over a 10G network from a different datacenter:
4 billion rows with 30 mappers = 40 minutes
86 million rows ≈ 12 minutes
My best suggestion is to run a dummy test; based on that you can estimate the timings.
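The dummy-test suggestion can be turned into a simple linear extrapolation. The sample numbers below are hypothetical stand-ins (modeled on the figures above); the sketch assumes throughput stays constant between the sample and the full run, which it will not if the mapper count changes:

```shell
# Hypothetical sample numbers; replace with your own dummy-test measurement.
sample_rows=86000000
sample_secs=720                          # 12 minutes
rows_per_sec=$((sample_rows / sample_secs))
target_rows=4000000000
est_min=$((target_rows / rows_per_sec / 60))
echo "Estimated full run: ~${est_min} min at ${rows_per_sec} rows/s"
```

Note the 4-billion-row run above finished far faster than a single-rate extrapolation would suggest, presumably because it used more mappers; that is exactly why the dummy test should be run at the mapper count you intend to use.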
Created 02-21-2016 01:13 AM
Thank you, Neeraj. I am running benchmarks on our cluster; I just wanted to understand what upper limit I can target. Thank you again for the quick response and all the help.
Created 02-21-2016 07:08 AM
Hi @Shishir Saxena, the Oracle connector for Hadoop, the so-called OraOop, is included in Sqoop 1.4.5 and 1.4.6 (shipped with HDP 2.3.x). The Sqoop user guide has a very detailed explanation here. It's enabled when "--direct" is used. Regarding benchmarks, it's best to build your own, for example by running Sqoop with and without OraOop, with different numbers of mappers, various table sizes, etc.
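The suggested benchmark matrix could be scripted along these lines. This is only a sketch: the JDBC URL, credentials, table, and target paths are placeholders, and each run should ideally be repeated to average out cluster noise:

```shell
# Sketch of a benchmark matrix over mapper count x direct mode; all names are placeholders.
for m in 4 8 16 32; do
  for direct in yes no; do
    opts=""
    [ "$direct" = yes ] && opts="--direct"
    start=$(date +%s)
    sqoop import \
      --connect jdbc:oracle:thin:@//oradb.example.com:1521/ORCL \
      --username SCOTT -P \
      --table SALES.ORDERS $opts \
      --num-mappers "$m" \
      --target-dir "/bench/orders_m${m}_direct_${direct}"
    echo "mappers=$m direct=$direct elapsed=$(( $(date +%s) - start ))s"
  done
done
```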
Created 02-22-2016 02:39 AM
Thanks. It looks like that is my only choice.