Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Sqoop imported more records than source

Solved Go to solution

Sqoop imported more records than source

Expert Contributor

HDP-2.4.2.0-258 installed using Ambari 2.2.2.0

The source tables are in a SQL Server schema, below is a table with 1205028380 rows and a composite PK (DateDimensionId, DriverDimensionId, VehicleDimensionId) :

DateDimensionId			bigint	Unchecked
DriverDimensionId		int	Unchecked
VehicleDimensionId		int	Unchecked
Odometer			bigint	Checked
TotalFuel			bigint	Checked
TotalFuelIdle			bigint	Checked
TotalRuntime			bigint	Checked
TotalRuntimeIdle		bigint	Checked
TotalDistanceWithTrailer	bigint	Checked
TotalFuelPTO			bigint	Checked
TotalRuntimePTO			bigint	Checked
TotalTimeOverspeeding		bigint	Checked
TotalTimeOverreving		bigint	Checked
TotalNoOfHarshBrakes		bigint	Checked
TotalNoOfBrakeApplications	bigint	Checked
TotalNoOfHarshAcceleration	bigint	Checked
MinTimeMessage			datetime2(7)Checked
MaxTimeMessage			datetime2(7)Checked
TimeOutOfGreenBandDriving	bigint	Checked
Coasting			bigint	Checked
.
.
.

I used the following command, note that the format is ORC, also can '--num-mappers' cause any duplication ?

sqoop import --num-mappers 8 --hcatalog-home /usr/hdp/current/hive-webhcat --hcatalog-database FMS_FleetManagementDatawarehouse_VehicleData --hcatalog-table DateVehicleDriverAggregate --create-hcatalog-table --hcatalog-storage-stanza "stored as orcfile" --connect 'jdbc:sqlserver://<IP>;database=FleetManagementDatawarehouse' --username --password --table DateVehicleDriverAggregate -- --schema VehicleData

The Sqoop import job took a long time(5.6h) with the default 4 mappers but the concern is that it imported 1218843487 records, more than the source ! Is the composite key causing some issue or is it something else ?

There were no errors in the job but in case any specific logs are required, I can provide.

1 ACCEPTED SOLUTION

Accepted Solutions
Highlighted

Re: Sqoop imported more records than source

Expert Contributor

I have either discovered something strange or I lack the understanding of how Sqoop works :

  • Sqoop doc. says that in case of a composite PK, the --split-by column should be specified during sqoop import, however, I proceeded without doing so. Sqoop then picked up one int column belonging to the PK
  • Only in case of few tables(all of them having at least 1.2 billion rows) did I face this mismatch issue
  • I then used --split-by for those tables and also added --validate. Then I got the same no. of rows imported
8 REPLIES 8

Re: Sqoop imported more records than source

Super Guru

@Kaliyug Antagonist

Is this incremental or one time import? If it's incremental then is it possible that timestamp on some records is getting updated in source which you are not considering in your count?

Re: Sqoop imported more records than source

Expert Contributor

This is the first and one-time import.

Re: Sqoop imported more records than source

Expert Contributor

I believe your target table is text format. If that is the case, you have more records than your original table, means your table contains change line character "\n" in some of your fields. To avoid that, you should use ORC or RCfile as your target table format.

Re: Sqoop imported more records than source

Expert Contributor

Edited the original question to include the sqoop import command(in ORC format) that I have used, can you check ?

Re: Sqoop imported more records than source

Expert Contributor

Then, I think it is ORC format issue. Did you check with if the --hive-delims-replacement has impact on the importing?

Re: Sqoop imported more records than source

@Kaliyug Antagonist

Add below four lines to your sqoop syntax and give a try:

--null-string '\\N' \

--null-non-string '\\N' \

--hive-delims-replacement '\0D' \

--fields-terminated-by '\001' \

Root Cause for your Issue:

It seems, In your source table text column users entered the data with spaces/tab delimited or with many space bars.

Re: Sqoop imported more records than source

Expert Contributor

Edited the original question to include the sqoop import command(in ORC format) that I have used, can you check ?

Highlighted

Re: Sqoop imported more records than source

Expert Contributor

I have either discovered something strange or I lack the understanding of how Sqoop works :

  • Sqoop doc. says that in case of a composite PK, the --split-by column should be specified during sqoop import, however, I proceeded without doing so. Sqoop then picked up one int column belonging to the PK
  • Only in case of few tables(all of them having at least 1.2 billion rows) did I face this mismatch issue
  • I then used --split-by for those tables and also added --validate. Then I got the same no. of rows imported
Don't have an account?
Coming from Hortonworks? Activate your account here