Getting duplicate data when importing data to HDFS using sqoop

Hi,

I am trying to import data from MySQL to HDFS using Sqoop. The command is as below:

sqoop import --connect jdbc:mysql://192.168.218.128/sqoopdb -username hadoop --table EMP_ADD --driver com.mysql.jdbc.Driver --m 1 --where "CITY='sec-bad'" --target-dir /Practice/SqoopToHDFSWhere

After checking the generated files in HDFS, I see that the data is duplicated.

[hdfs@sandbox root]$ hadoop fs -cat /Practice/SqoopToHDFSWhere/part-m-00000

1202,108I,aoc,sec-bad
1204,78B,old city,sec-bad
1205,720X,hitec,sec-bad
1202,108I,aoc,sec-bad
1204,78B,old city,sec-bad
1205,720X,hitec,sec-bad

Please help me with this.

PS: I am using HDP 2.4.

Regards,

Suresh Kumar

1 ACCEPTED SOLUTION

Re: Getting duplicate data when importing data to HDFS using sqoop

@Suresh Kumar D

Try this:

sqoop import --connect jdbc:mysql://192.168.218.128/sqoopdb --driver "com.teradata.jdbc.TeraDriver" --username hadoop --password Hadoop@1 --query "select * from emp_add where city='sec-bad' AND \$CONDITIONS" --target-dir /Practice/SqoopToHDFSWhere/ --m 1;

Re: Getting duplicate data when importing data to HDFS using sqoop

Hi Divakar,

When I tried the above command, I got a new error:

16/07/19 07:22:58 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.teradata.jdbc.TeraDriver
java.lang.RuntimeException: Could not load db driver class: com.teradata.jdbc.TeraDriver
    at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:856)
    at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:744)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:767)
    at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:270)
    at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:241).....

What I want to understand is why the data is written twice when I run the command only once. Do any settings or configurations need to be changed?

Please find attached the complete execution output (sqoopduplication.jpg).

As per the MapReduce output, it is retrieving 6 records. Please suggest.

Re: Getting duplicate data when importing data to HDFS using sqoop

Can you confirm that your MySQL query itself does not return duplicates, i.e. "select * from emp_add where city='sec-bad'"?
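
One way to run that check from the shell is sketched below. It is only an illustration: `report_duplicates` is a hypothetical helper name, and the sample rows are copied from the part-m-00000 listing in the question rather than read from a live cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one copy of every row that occurs
# more than once on standard input.
report_duplicates() {
  sort | uniq -d
}

# Sample rows copied from the part-m-00000 listing in the question.
sample_rows='1202,108I,aoc,sec-bad
1204,78B,old city,sec-bad
1205,720X,hitec,sec-bad
1202,108I,aoc,sec-bad
1204,78B,old city,sec-bad
1205,720X,hitec,sec-bad'

# Each of the three rows appears twice, so all three are reported.
printf '%s\n' "$sample_rows" | report_duplicates
```

Against real data, the same function could be fed from `hadoop fs -cat /Practice/SqoopToHDFSWhere/part-m-00000` and from the `mysql` client run with `--batch --skip-column-names` and the query above (this assumes the client is installed and the credentials are valid). If it prints nothing for the MySQL output but prints rows for the HDFS file, the table itself is clean and the duplication was introduced on the import side.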

Re: Getting duplicate data when importing data to HDFS using sqoop

There is a typo in the sqoop command: use the MySQL driver instead of the Teradata driver.

Here is the modified script:

sqoop import --connect jdbc:mysql://192.168.218.128/sqoopdb --driver com.mysql.jdbc.Driver --username hadoop --password Hadoop@1 --query "select * from emp_add where city='sec-bad' AND \$CONDITIONS" --target-dir /Practice/SqoopToHDFSWhere/ --m 1;
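
The mismatch in this thread was between a `jdbc:mysql://` connect string and a Teradata driver class. A small guard like the sketch below could catch that before the import is launched. `driver_for_url` is a hypothetical helper, not part of Sqoop, and it only knows the two driver classes mentioned in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a JDBC URL to the driver class named in this thread.
driver_for_url() {
  case "$1" in
    jdbc:mysql:*)    echo "com.mysql.jdbc.Driver" ;;
    jdbc:teradata:*) echo "com.teradata.jdbc.TeraDriver" ;;
    *) echo "no driver known for: $1" >&2; return 1 ;;
  esac
}

connect_url="jdbc:mysql://192.168.218.128/sqoopdb"
driver_class="$(driver_for_url "$connect_url")"

# Print the matching pair of options; pass them on to sqoop import.
echo "--connect $connect_url --driver $driver_class"
```

Deriving `--driver` from `--connect` keeps the two options consistent when a command is copied from one database setup to another.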
