
Sqoop imported more records than source

Super Collaborator

HDP-2.4.2.0-258 installed using Ambari 2.2.2.0

The source tables are in a SQL Server schema. Below is a table with 1205028380 rows and a composite PK (DateDimensionId, DriverDimensionId, VehicleDimensionId):

Column Name                   Data Type       Allow Nulls
DateDimensionId               bigint          Unchecked
DriverDimensionId             int             Unchecked
VehicleDimensionId            int             Unchecked
Odometer                      bigint          Checked
TotalFuel                     bigint          Checked
TotalFuelIdle                 bigint          Checked
TotalRuntime                  bigint          Checked
TotalRuntimeIdle              bigint          Checked
TotalDistanceWithTrailer      bigint          Checked
TotalFuelPTO                  bigint          Checked
TotalRuntimePTO               bigint          Checked
TotalTimeOverspeeding         bigint          Checked
TotalTimeOverreving           bigint          Checked
TotalNoOfHarshBrakes          bigint          Checked
TotalNoOfBrakeApplications    bigint          Checked
TotalNoOfHarshAcceleration    bigint          Checked
MinTimeMessage                datetime2(7)    Checked
MaxTimeMessage                datetime2(7)    Checked
TimeOutOfGreenBandDriving     bigint          Checked
Coasting                      bigint          Checked
.
.
.

I used the following command; note that the format is ORC. Also, can '--num-mappers' cause any duplication?

sqoop import --num-mappers 8 \
  --hcatalog-home /usr/hdp/current/hive-webhcat \
  --hcatalog-database FMS_FleetManagementDatawarehouse_VehicleData \
  --hcatalog-table DateVehicleDriverAggregate \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  --connect 'jdbc:sqlserver://<IP>;database=FleetManagementDatawarehouse' \
  --username --password \
  --table DateVehicleDriverAggregate \
  -- --schema VehicleData

The Sqoop import job took a long time (5.6 h) with the default 4 mappers, but the concern is that it imported 1218843487 records, which is more than the source! Is the composite key causing some issue, or is it something else?

There were no errors in the job, but I can provide specific logs if required.
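For reference, one way to cross-check the counts on both sides is a simple COUNT(*) on the source and on the imported table. This is only a sketch, assuming the database/table names from the question, that the HCatalog table is queryable from Hive, and with the credential placeholders left blank as in the command above:

# Row count on the SQL Server side (sqoop eval runs an arbitrary query against the source)
sqoop eval \
  --connect 'jdbc:sqlserver://<IP>;database=FleetManagementDatawarehouse' \
  --username --password \
  --query "SELECT COUNT(*) FROM VehicleData.DateVehicleDriverAggregate"

# Row count on the Hadoop side (the table created via --create-hcatalog-table)
hive -e "SELECT COUNT(*) FROM FMS_FleetManagementDatawarehouse_VehicleData.DateVehicleDriverAggregate;"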

1 ACCEPTED SOLUTION

Super Collaborator

I have either discovered something strange or I lack an understanding of how Sqoop works:

  • The Sqoop documentation says that in the case of a composite PK, the --split-by column should be specified during sqoop import; however, I proceeded without doing so, and Sqoop picked one int column belonging to the PK on its own.
  • Only for a few tables (all of them having at least 1.2 billion rows) did I face this mismatch issue.
  • I then used --split-by for those tables and also added --validate, after which the imported row count matched the source (see the sketch after this list).
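For completeness, the adjusted command looked roughly like the sketch below. It assumes DateDimensionId as the --split-by column (any PK column with a reasonably uniform distribution should do) and reuses the connection details from the question; also note that Sqoop's --validate has documented limitations for some import targets, so verify that it applies to your setup:

sqoop import --num-mappers 8 \
  --split-by DateDimensionId \
  --validate \
  --hcatalog-home /usr/hdp/current/hive-webhcat \
  --hcatalog-database FMS_FleetManagementDatawarehouse_VehicleData \
  --hcatalog-table DateVehicleDriverAggregate \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  --connect 'jdbc:sqlserver://<IP>;database=FleetManagementDatawarehouse' \
  --username --password \
  --table DateVehicleDriverAggregate \
  -- --schema VehicleData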


8 REPLIES

Super Guru

@Kaliyug Antagonist

Is this an incremental or a one-time import? If it's incremental, is it possible that the timestamp on some records is being updated in the source, which you are not accounting for in your count?
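For context, an incremental import would carry options like the ones in this sketch; it is purely hypothetical (the check column and last value are made up, and the original command in the question does not use incremental mode):

# Hypothetical incremental import using a timestamp column as the check column
sqoop import \
  --connect 'jdbc:sqlserver://<IP>;database=FleetManagementDatawarehouse' \
  --username --password \
  --table DateVehicleDriverAggregate \
  --incremental lastmodified \
  --check-column MaxTimeMessage \
  --last-value '2016-01-01 00:00:00' \
  --target-dir /data/DateVehicleDriverAggregate \
  -- --schema VehicleData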

Super Collaborator

This is the first and only (one-time) import.

Expert Contributor

I believe your target table is in text format. If that is the case, having more records than the original table means your table contains newline characters ("\n") in some of your fields. To avoid that, you should use ORC or RCFile as your target table format.
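To illustrate the scenario: with a plain-text target, an embedded newline splits one source row across two lines in HDFS, so a downstream count sees extra "records". For text-format targets, Sqoop's delimiter-handling options can work around it; this is a sketch, assuming a Hive text-table import with a hypothetical target table name:

# Hive text-format import that drops \n, \r and \01 from string fields so embedded
# newlines cannot split a row (--hive-delims-replacement '<string>' is the replace variant)
sqoop import \
  --connect 'jdbc:sqlserver://<IP>;database=FleetManagementDatawarehouse' \
  --username --password \
  --table DateVehicleDriverAggregate \
  --split-by DateDimensionId \
  --hive-import \
  --hive-table DateVehicleDriverAggregate_text \
  --hive-drop-import-delims \
  -- --schema VehicleData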

Super Collaborator

I have edited the original question to include the sqoop import command (ORC format) that I used; can you check?

Expert Contributor

Then I think it is an ORC format issue. Did you check whether --hive-delims-replacement has an impact on the import?

@Kaliyug Antagonist

Add the below four lines to your sqoop syntax and give it a try:

--null-string '\\N' \
--null-non-string '\\N' \
--hive-delims-replacement '\0D' \
--fields-terminated-by '\001' \

Root cause for your issue: it seems that in your source table's text columns, users entered data containing tabs or long runs of spaces that collide with the field delimiters. A sketch of the full command with these options folded in follows below.
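Folding those options into the original command would look roughly like this. It is only a sketch; whether the delimiter-related options take effect against an HCatalog/ORC target is worth verifying against the Sqoop documentation for your version:

sqoop import --num-mappers 8 \
  --hcatalog-home /usr/hdp/current/hive-webhcat \
  --hcatalog-database FMS_FleetManagementDatawarehouse_VehicleData \
  --hcatalog-table DateVehicleDriverAggregate \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  --null-string '\\N' \
  --null-non-string '\\N' \
  --hive-delims-replacement '\0D' \
  --fields-terminated-by '\001' \
  --connect 'jdbc:sqlserver://<IP>;database=FleetManagementDatawarehouse' \
  --username --password \
  --table DateVehicleDriverAggregate \
  -- --schema VehicleData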

Super Collaborator

I have edited the original question to include the sqoop import command (ORC format) that I used; can you check?
