Created 08-15-2016 02:10 PM
HDP-2.4.2.0-258 installed using Ambari 2.2.2.0
The source tables are in a SQL Server schema. Below is a table with 1,205,028,380 rows and a composite PK (DateDimensionId, DriverDimensionId, VehicleDimensionId):
Column                      Type          Nullable
DateDimensionId             bigint        No
DriverDimensionId           int           No
VehicleDimensionId          int           No
Odometer                    bigint        Yes
TotalFuel                   bigint        Yes
TotalFuelIdle               bigint        Yes
TotalRuntime                bigint        Yes
TotalRuntimeIdle            bigint        Yes
TotalDistanceWithTrailer    bigint        Yes
TotalFuelPTO                bigint        Yes
TotalRuntimePTO             bigint        Yes
TotalTimeOverspeeding       bigint        Yes
TotalTimeOverreving         bigint        Yes
TotalNoOfHarshBrakes        bigint        Yes
TotalNoOfBrakeApplications  bigint        Yes
TotalNoOfHarshAcceleration  bigint        Yes
MinTimeMessage              datetime2(7)  Yes
MaxTimeMessage              datetime2(7)  Yes
TimeOutOfGreenBandDriving   bigint        Yes
Coasting                    bigint        Yes
...
I used the following command; note that the format is ORC. Also, can '--num-mappers' cause any duplication?
sqoop import \
  --num-mappers 8 \
  --hcatalog-home /usr/hdp/current/hive-webhcat \
  --hcatalog-database FMS_FleetManagementDatawarehouse_VehicleData \
  --hcatalog-table DateVehicleDriverAggregate \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  --connect 'jdbc:sqlserver://<IP>;database=FleetManagementDatawarehouse' \
  --username --password \
  --table DateVehicleDriverAggregate \
  -- --schema VehicleData
The Sqoop import job took a long time (5.6 h) with the default 4 mappers, but the concern is that it imported 1,218,843,487 records, more than the source! Is the composite key causing an issue, or is it something else?
There were no errors in the job, but I can provide specific logs if required.
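For reference, a minimal sketch of how the two counts could be cross-checked (<user> and <password> are placeholders, since the credentials were omitted from the command above):

# Row count on the SQL Server side (sqoop eval runs an arbitrary query against the source):
sqoop eval \
  --connect 'jdbc:sqlserver://<IP>;database=FleetManagementDatawarehouse' \
  --username <user> --password <password> \
  --query "SELECT COUNT(*) FROM VehicleData.DateVehicleDriverAggregate"

# Row count on the Hive/HCatalog side:
hive -e "SELECT COUNT(*) FROM FMS_FleetManagementDatawarehouse_VehicleData.DateVehicleDriverAggregate;"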
Created 08-15-2016 02:55 PM
Is this an incremental or a one-time import? If it's incremental, is it possible that the timestamp on some records is being updated in the source, which you are not considering in your count?
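(For context, a Sqoop incremental import is driven by a check column; a minimal sketch, where LastModified is a hypothetical timestamp column not shown in the table above:)

sqoop import \
  --connect 'jdbc:sqlserver://<IP>;database=FleetManagementDatawarehouse' \
  --username <user> --password <password> \
  --table DateVehicleDriverAggregate \
  --incremental lastmodified \
  --check-column LastModified \
  --last-value '2016-08-15 00:00:00' \
  -- --schema VehicleData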
Created 08-15-2016 03:02 PM
This is the first, one-time import.
Created 08-15-2016 08:31 PM
I believe your target table is in text format. If that is the case, having more records than the original table means some of your fields contain the newline character "\n". To avoid that, you should use ORC or RCFile as your target table format.
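One way to test for embedded newlines in the source (a sketch; SomeTextColumn is a hypothetical text column, since only numeric columns are listed above, and CHAR(10) is the T-SQL newline character):

sqoop eval \
  --connect 'jdbc:sqlserver://<IP>;database=FleetManagementDatawarehouse' \
  --username <user> --password <password> \
  --query "SELECT COUNT(*) FROM VehicleData.DateVehicleDriverAggregate WHERE SomeTextColumn LIKE '%' + CHAR(10) + '%'"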
Created 08-16-2016 09:17 AM
Edited the original question to include the sqoop import command (in ORC format) that I used; can you check?
Created 08-17-2016 12:04 AM
Then I think it is an ORC format issue. Did you check whether --hive-delims-replacement has an impact on the import?
Created 08-15-2016 08:40 PM
Add the four lines below to your sqoop syntax and give it a try:
--null-string '\\N' \
--null-non-string '\\N' \
--hive-delims-replacement '\0D' \
--fields-terminated-by '\001' \
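Merged into the original command, that would look roughly like this (a sketch; <user> and <password> are placeholders, and whether the delimiter-related options take effect against an HCatalog/ORC target is an assumption worth verifying):

sqoop import \
  --num-mappers 8 \
  --hcatalog-home /usr/hdp/current/hive-webhcat \
  --hcatalog-database FMS_FleetManagementDatawarehouse_VehicleData \
  --hcatalog-table DateVehicleDriverAggregate \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  --null-string '\\N' \
  --null-non-string '\\N' \
  --hive-delims-replacement '\0D' \
  --fields-terminated-by '\001' \
  --connect 'jdbc:sqlserver://<IP>;database=FleetManagementDatawarehouse' \
  --username <user> --password <password> \
  --table DateVehicleDriverAggregate \
  -- --schema VehicleData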
Root cause of your issue:
It seems that users entered data into your source table's text columns containing embedded delimiter characters (tabs) or runs of spaces.
Created 08-18-2016 11:39 AM
I have either discovered something strange, or I lack an understanding of how Sqoop works: