Getting duplicate data when importing data to HDFS using sqoop

Hi,

I am trying to import data from MySQL to HDFS using Sqoop. The command is as follows:

sqoop import --connect jdbc:mysql://192.168.218.128/sqoopdb -username hadoop --table EMP_ADD --driver com.mysql.jdbc.Driver --m 1 --where "CITY='sec-bad'" --target-dir /Practice/SqoopToHDFSWhere

After checking the generated file in HDFS, I see that the data is duplicated:

[hdfs@sandbox root]$ hadoop fs -cat /Practice/SqoopToHDFSWhere/part-m-00000

1202,108I,aoc,sec-bad

1204,78B,old city,sec-bad

1205,720X,hitec,sec-bad

1202,108I,aoc,sec-bad

1204,78B,old city,sec-bad

1205,720X,hitec,sec-bad

Please help me with this.

PS: I am using HDP 2.4.

Regards,

Suresh Kumar

1 ACCEPTED SOLUTION

@Suresh Kumar D

Try this:

sqoop import --connect jdbc:mysql://192.168.218.128/sqoopdb --driver "com.teradata.jdbc.TeraDriver" --username hadoop --password Hadoop@1 --query "select * from emp_add where city='sec-bad' AND \$CONDITIONS" --target-dir /Practice/SqoopToHDFSWhere/ --m 1;

4 REPLIES

avatar
New Contributor

Hi Divakar,

When I tried the above, I got a new error:

16/07/19 07:22:58 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.teradata.jdbc.TeraDriver
java.lang.RuntimeException: Could not load db driver class: com.teradata.jdbc.TeraDriver
    at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:856)
    at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:744)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:767)
    at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:270)
    at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:241).....

I just want to know why the data is inserted twice when I run the command only once. Do any settings or configurations need to be changed?

Please find attached sqoopduplication.jpg with the complete execution output.

According to the MapReduce output, it is retrieving 6 records. Please suggest.

avatar
Contributor

Can you confirm that your MySQL query itself is not returning duplicates, i.e. "select * from emp_add where city='sec-bad'"?
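To see exactly which rows are repeated on the HDFS side, a small helper along these lines may be useful. This is only a sketch: the function name find_duplicate_rows is made up for illustration, and it assumes a POSIX shell with awk and sort available. It reads rows on standard input, ignores blank lines, and prints each row that occurs more than once, prefixed by its count.

```shell
#!/bin/sh
# find_duplicate_rows (hypothetical helper, not part of Sqoop or Hadoop):
# reads rows on stdin and prints "<count> <row>" for every row that
# appears more than once. Blank lines are skipped.
find_duplicate_rows() {
    awk '
        NF { seen[$0]++ }                 # count each non-blank row
        END {
            for (row in seen)
                if (seen[row] > 1)        # keep only repeated rows
                    print seen[row], row
        }
    ' | sort                              # awk iteration order is arbitrary, so sort for stable output
}
```

It could be fed from the imported file with `hadoop fs -cat /Practice/SqoopToHDFSWhere/part-m-00000 | find_duplicate_rows`. If this reports every row twice while the same select in the mysql client returns only 3 rows, the duplication is happening during or after the import rather than in the source table.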

avatar

There is a typo in the sqoop command: use the MySQL driver instead of the Teradata driver.

Here is the modified script:

sqoop import --connect jdbc:mysql://192.168.218.128/sqoopdb --driver com.mysql.jdbc.Driver --username hadoop --password Hadoop@1 --query "select * from emp_add where city='sec-bad' AND \$CONDITIONS" --target-dir /Practice/SqoopToHDFSWhere/ --m 1;