Support Questions
Find answers, ask questions, and share your expertise
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Procession 1 billion record tables with joins

Procession 1 billion record tables with joins

New Contributor

I need to join two tables one of which has 1 billion records and needs to joined with smaller table having few thousand records, there are multiple functions sum , group by and aggregate functions to be perform The current system is using Oracle and Java8 , We need to migrate on Hadoop

We have tried sqoop for importing data and is sucessfull but need to know on performance and faster aspects even,

We currently have 6 datanodes available , each datanode 22 core CPU , 320GB of ram , 200TB of hard disk, currently we are using Cloudera distrubution v.5.4.3

1) what kind of DB(HBase/Hive metastore/any other) I need to use

2) What kind of HDFS file format will be suitable for this, 3) Which processing SQL tool should I use Hive/Impala or other.

4) Can anyone suggest me even on infrastructure side what will be the best combination for processing the tables,

THE SLA for this is 30-45 minutes maximum.


Vishal Shimpi


Re: Procession 1 billion record tables with joins

Aggregating/Joining a billion row table?

( I will talk about HDP on Cloudera you would most likely use Impala but this is not the forum for detailled information about that )

1) You need Hive ( on HDP on Cloudera you most likely will use Impala )

2) ORC tables ( zip compression well sorted and partitioned )

3) Hive

4) Infrastructure? you said you have 6 datanodes with a ton of RAM that sounds pretty . Just make sure that you

- Run analyze on your tables for statistics.

- Have all Hive performance features of the newest version enabled ( vectorization, CBO, ... )

- Have set the task memory settings correctly so you fully utilize the cluster during queries ( run a top command on a datanode during query if its less than 80% you need to change map size, bytes per reducer etc. )

If you do all this well you shouldn't have an issue aggregating a 100GB table with some small joins. Depends obviously on the query. But I have run aggregation queries on 7 smaller datanodes with 300GB of data without the small join in 4-5 minutes.

Re: Procession 1 billion record tables with joins

New Contributor

I am using Impala , so sqooping as a AvroTable, The scrip is as follows

hadoop fs -rm -r /user/mcodev/* hdfsdir= /hivestage/mco/dev/data/VISHAL/avro/data_provider positiondate="31-DEC-2014" hadoop fs -test -d $hdfsdir if [ $? == 0 ]; then echo "Cleaning up $hdfsdir" hadoop fs -rm -r $hdfsdir fi export SQOOP_HOME=/opt/cloudera/parcels/CDH/lib/sqoop export PATH=$PATH:$SQOOP_HOME/bin export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive export HCAT_HOME=/opt/cloudera/parcels/CDH/lib/hive-hcatalog sqoop import --verbose --connect jdbc:oracle:thin:@ \ --query "SELECT TO_CHAR(TIMEKEY, 'DD-MON-YYYY') AS TIMEKEY, AGGREGATION_KEY, CONTRACT_REFERENCE, BALANCE, CPTY_CODE, SYS_PARTITION_KEY, CNTRY_CDE, CPTY_BASEL_ENTITY_CLASS, TO_CHAR(MATURITY_DATE, 'DD-MON-YYYY HH24:MI:SS') AS MATURITY_DATE, ACCRUED_INTEREST, TO_CHAR(NEXT_INTEREST_DATE, 'DD-MON-YYYY HH24:MI:SS') AS NEXT_INTEREST_DATE, N_COUNTER FROM MV_MCO_ACCOUNT_DEALS WHERE TIMEKEY = '$positiondate' AND \$CONDITIONS" \ --username mcodbowner --password McouatDbOwner_2015 \ --split-by AGGREGATION_KEY \ --target-dir '$hdfsdir' \ --as-avrodatafile \ --compression-codec snappy

The problem here is I cant find any output in the target directory but can find many files genrated in /user/$UserName/


How can I import the sqoop data into my avro table.

I have followed steps as from the website

But no success, It will be really helpfull If you can guide me in sqooping oracle tables into avro table in HDFS

Re: Procession 1 billion record tables with joins

@Nikhil B

Ignore the comments below for a second, look how you defined the hdfsdir folder, there is a space after the = sign, its possible that he puts that as an empty string and then it would be no wonder that everything ends up in the home folder.

"hdfsdir= /hivesta..."

If the above doesn't fix your problem :

Perhaps you should make a new question here since the topics are totally different.

From a first look it all looks decent but there is a lot that can go wrong in this relatively complicated script you use.

So it runs through and the output of sqoop doesn't show any errors? Can you add the sqoop output log perhaps? It should specify the HDFS folder used etc. I would hardcode the target-dir for testing purposes for example.

Don't have an account?
Coming from Hortonworks? Activate your account here