Hi
I have a 7-node cluster on RHCP7.1. I have a Sqoop job that imports data from an Oracle table into Hive. It is a single table, and I am importing one partition that has 40 billion rows in it. The job has 10 mappers and is split by an ID column that is an integer. There is an index on that ID column in Oracle.
Now the problem is that the mappers whose ranges hold more than 2 billion rows complete successfully, but the row count in Oracle for those ranges is much higher than what Sqoop reports. The table below will give you an idea:
Mapper | Sqoop row count | Oracle row count |
m0 | 1709027700 | 1709027700 |
m1 | 340656511 | 340656511 |
m2 | 2147483000 | 3431813617 |
m3 | 2147483000 | 4649556868 |
m4 | 2147483000 | 4567876345 |
m5 | 2147483000 | 8156384917 |
m6 | 2147483000 | 7844967352 |
m7 | 2147483000 | 4153074965 |
m8 | 2147483000 | 2650539503 |
m9 | 1454645905 | 1454645905 |
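One thing I noticed while checking the numbers: every mismatched mapper reports exactly 2147483000 rows, which is just under the signed 32-bit integer maximum (2^31 - 1 = 2147483647). Below is a rough sketch of that check using the counts from the table. The comparison against the 32-bit limit is only my own guess at a pattern, not something I have confirmed in Sqoop or the Oracle driver.

```python
# Per-mapper row counts copied from the table above: (Sqoop, Oracle).
counts = {
    "m0": (1709027700, 1709027700),
    "m1": (340656511, 340656511),
    "m2": (2147483000, 3431813617),
    "m3": (2147483000, 4649556868),
    "m4": (2147483000, 4567876345),
    "m5": (2147483000, 8156384917),
    "m6": (2147483000, 7844967352),
    "m7": (2147483000, 4153074965),
    "m8": (2147483000, 2650539503),
    "m9": (1454645905, 1454645905),
}

# Largest value a signed 32-bit integer can hold (assumed to be relevant here).
INT32_MAX = 2**31 - 1

# Mappers where the Sqoop count differs from the Oracle count.
capped = [m for m, (sqoop_rows, oracle_rows) in counts.items()
          if sqoop_rows != oracle_rows]

for mapper, (sqoop_rows, oracle_rows) in counts.items():
    missing = oracle_rows - sqoop_rows
    over_limit = oracle_rows > INT32_MAX
    print(f"{mapper}: sqoop={sqoop_rows} oracle={oracle_rows} "
          f"missing={missing} oracle_exceeds_int32={over_limit}")

print("mismatched mappers:", capped)
print("gap between reported cap and INT32_MAX:", INT32_MAX - 2147483000)
```

In this data, the mismatched mappers are exactly the ones whose Oracle count exceeds the 32-bit maximum, and the other three match Oracle exactly.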
What could it be? I read somewhere that this could be a cluster configuration issue. Someone else suggested this could be a limitation of the driver. Any pointers, anyone?