I have a 7-node cluster on RHCP7.1. I have a Sqoop job that imports data from an Oracle table into Hive. It is a single table, and I am importing one partition that has 40 billion rows in it. The job uses 10 mappers, and the split-by column is an ID column of type Integer. There is an index on that ID column in Oracle.
Now the problem: the mappers that pull more than 2 billion rows complete successfully, but the row counts they report are much higher than they should be. The table below will give you an idea.
What could it be? I read somewhere that this could be a cluster configuration issue; someone else suggested it could be a limitation of the JDBC driver. Any pointers, anyone?
As a workaround, we are now running smaller batches, which complete successfully. I would, however, like to get to the bottom of this, as we have much larger tables still to be ingested.
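For reference, the smaller-batch workaround can be sketched as a Sqoop invocation that bounds each run with a `--where` clause so no single batch (and therefore no mapper) crosses the 2-billion-row mark. The connection string, table name, and ID range below are hypothetical placeholders, not the actual job:

```shell
# Sketch only: dbhost/ORCL, etl_user, SALES_FACT, and the ID bounds
# are made-up placeholders. The idea is to cap each batch so its row
# count stays safely below 2^31 - 1.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username etl_user -P \
  --table SALES_FACT \
  --split-by ID \
  --num-mappers 10 \
  --where "ID >= 0 AND ID < 2000000000" \
  --hive-import \
  --hive-table sales_fact
```

Subsequent batches would advance the `ID` range until the partition is fully ingested.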
The `ResultSet.getRow` method (whose support is optional in JDBC) returns an `int`, which, being signed, has a maximum value of 2^31 − 1.
Since the `getRow` method, if supported, returns an `int`, it would NOT be possible for it to return a value larger than `Integer.MAX_VALUE`.
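A minimal demonstration of why an `int`-based row position breaks past 2 billion rows: incrementing an `int` counter at `Integer.MAX_VALUE` wraps to a negative number, while a `long` counter keeps going. (This is a standalone sketch of the overflow behavior, not code from Sqoop or the Oracle driver.)

```java
public class RowCountLimit {
    public static void main(String[] args) {
        // An int row position tops out at 2^31 - 1.
        System.out.println(Integer.MAX_VALUE);      // 2147483647

        // One more row wraps the counter around to a negative value.
        int position = Integer.MAX_VALUE;
        position++;
        System.out.println(position);               // -2147483648

        // A long counter has room for row counts in the billions.
        long longPosition = (long) Integer.MAX_VALUE + 1;
        System.out.println(longPosition);           // 2147483648
    }
}
```

This is why any row accounting funneled through an `int` (such as `ResultSet.getRow`) cannot be trusted once a single mapper processes more than 2,147,483,647 rows.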
May be related to: