<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Sqoop imported more records than source in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-imported-more-records-than-source/m-p/174730#M136993</link>
    <description>&lt;P&gt;I edited the original question to include the Sqoop import command &lt;STRONG&gt;(in ORC format)&lt;/STRONG&gt; that I used. Can you check?&lt;/P&gt;</description>
    <pubDate>Tue, 16 Aug 2016 16:17:31 GMT</pubDate>
    <dc:creator>kaliyugantagoni</dc:creator>
    <dc:date>2016-08-16T16:17:31Z</dc:date>
    <item>
      <title>Sqoop imported more records than source</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-imported-more-records-than-source/m-p/174724#M136987</link>
      <description>&lt;P&gt;HDP-2.4.2.0-258 installed using Ambari 2.2.2.0&lt;/P&gt;&lt;P&gt;The source tables are in a SQL Server schema. Below is a table with &lt;STRONG&gt;1205028380&lt;/STRONG&gt; rows and a composite PK (DateDimensionId, DriverDimensionId, VehicleDimensionId):&lt;/P&gt;&lt;PRE&gt;DateDimensionId			bigint	Unchecked
DriverDimensionId		int	Unchecked
VehicleDimensionId		int	Unchecked
Odometer			bigint	Checked
TotalFuel			bigint	Checked
TotalFuelIdle			bigint	Checked
TotalRuntime			bigint	Checked
TotalRuntimeIdle		bigint	Checked
TotalDistanceWithTrailer	bigint	Checked
TotalFuelPTO			bigint	Checked
TotalRuntimePTO			bigint	Checked
TotalTimeOverspeeding		bigint	Checked
TotalTimeOverreving		bigint	Checked
TotalNoOfHarshBrakes		bigint	Checked
TotalNoOfBrakeApplications	bigint	Checked
TotalNoOfHarshAcceleration	bigint	Checked
MinTimeMessage			datetime2(7)	Checked
MaxTimeMessage			datetime2(7)	Checked
TimeOutOfGreenBandDriving	bigint	Checked
Coasting			bigint	Checked
.
.
.&lt;/PRE&gt;&lt;P&gt;I used the following command. Note that &lt;STRONG&gt;the format is ORC&lt;/STRONG&gt;. &lt;STRONG&gt;Also, can '--num-mappers' cause any duplication?&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;sqoop import --num-mappers 8 --hcatalog-home /usr/hdp/current/hive-webhcat --hcatalog-database FMS_FleetManagementDatawarehouse_VehicleData --hcatalog-table DateVehicleDriverAggregate --create-hcatalog-table --hcatalog-storage-stanza "stored as orcfile" --connect 'jdbc:sqlserver://&amp;lt;IP&amp;gt;;database=FleetManagementDatawarehouse' --username --password --table DateVehicleDriverAggregate -- --schema VehicleData&lt;/PRE&gt;&lt;P&gt;The Sqoop import job took a long time (5.6 h) with the default 4 mappers, but the concern is that it imported 1218843487 records, more than the source! Is the composite key causing the issue, or is it something else?&lt;/P&gt;&lt;P&gt;There were no errors in the job, but I can provide specific logs if required.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Aug 2016 21:10:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-imported-more-records-than-source/m-p/174724#M136987</guid>
      <dc:creator>kaliyugantagoni</dc:creator>
      <dc:date>2016-08-15T21:10:07Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop imported more records than source</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-imported-more-records-than-source/m-p/174725#M136988</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/5134/kaliyugantagonist.html" nodeid="5134"&gt;@Kaliyug Antagonist&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Is this an incremental or a one-time import? If it's incremental, is it possible that the timestamp on some records is being updated in the source, and you are not accounting for those records in your count?&lt;/P&gt;</description>
      <pubDate>Mon, 15 Aug 2016 21:55:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-imported-more-records-than-source/m-p/174725#M136988</guid>
      <dc:creator>mqureshi</dc:creator>
      <dc:date>2016-08-15T21:55:10Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop imported more records than source</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-imported-more-records-than-source/m-p/174726#M136989</link>
      <description>&lt;P&gt;This is the first and one-time import.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Aug 2016 22:02:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-imported-more-records-than-source/m-p/174726#M136989</guid>
      <dc:creator>kaliyugantagoni</dc:creator>
      <dc:date>2016-08-15T22:02:48Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop imported more records than source</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-imported-more-records-than-source/m-p/174727#M136990</link>
      <description>&lt;P&gt;I believe your target table is in text format. If so, having more records than the source table means that some of your fields contain the newline character "\n". To avoid that, use ORC or RCFile as your target table format.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Aug 2016 03:31:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-imported-more-records-than-source/m-p/174727#M136990</guid>
      <dc:creator>ylu</dc:creator>
      <dc:date>2016-08-16T03:31:54Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop imported more records than source</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-imported-more-records-than-source/m-p/174728#M136991</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/5134/kaliyugantagonist.html" nodeid="5134"&gt;@Kaliyug Antagonist&lt;/A&gt;&lt;P&gt;Add the four lines below to your Sqoop command and give it a try:&lt;/P&gt;&lt;P&gt;--null-string '\\N' \&lt;/P&gt;&lt;P&gt;--null-non-string '\\N' \&lt;/P&gt;&lt;P&gt;--hive-delims-replacement '\0D' \&lt;/P&gt;&lt;P&gt;--fields-terminated-by '\001' \&lt;/P&gt;&lt;P&gt;Root cause of your issue:&lt;/P&gt;&lt;P&gt;It seems that in a text column of your source table, users entered data delimited by spaces or tabs, or containing many consecutive spaces.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Aug 2016 03:40:08 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-imported-more-records-than-source/m-p/174728#M136991</guid>
      <dc:creator>divakarreddy_a</dc:creator>
      <dc:date>2016-08-16T03:40:08Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop imported more records than source</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-imported-more-records-than-source/m-p/174730#M136993</link>
      <description>&lt;P&gt;I edited the original question to include the Sqoop import command &lt;STRONG&gt;(in ORC format)&lt;/STRONG&gt; that I used. Can you check?&lt;/P&gt;</description>
      <pubDate>Tue, 16 Aug 2016 16:17:31 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-imported-more-records-than-source/m-p/174730#M136993</guid>
      <dc:creator>kaliyugantagoni</dc:creator>
      <dc:date>2016-08-16T16:17:31Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop imported more records than source</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-imported-more-records-than-source/m-p/174731#M136994</link>
      <description>&lt;P&gt;Then I think it is an ORC format issue. Did you check whether --hive-delims-replacement has any impact on the import?&lt;/P&gt;</description>
      <pubDate>Wed, 17 Aug 2016 07:04:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-imported-more-records-than-source/m-p/174731#M136994</guid>
      <dc:creator>ylu</dc:creator>
      <dc:date>2016-08-17T07:04:13Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop imported more records than source</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-imported-more-records-than-source/m-p/174732#M136995</link>
      <description>&lt;P&gt;Either I have discovered something strange or I do not fully understand how Sqoop works:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The Sqoop documentation says that for a composite PK, the --split-by column should be specified during the import. However, I proceeded without doing so, and Sqoop picked one int column belonging to the PK.&lt;/LI&gt;&lt;LI&gt;I faced this mismatch only for a few tables, all of them having at least 1.2 billion rows.&lt;/LI&gt;&lt;LI&gt;I then used --split-by for those tables and also added --validate. The number of rows imported then matched the source.&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Thu, 18 Aug 2016 18:39:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-imported-more-records-than-source/m-p/174732#M136995</guid>
      <dc:creator>kaliyugantagoni</dc:creator>
      <dc:date>2016-08-18T18:39:40Z</dc:date>
    </item>
  </channel>
</rss>

