Getting duplicate data when importing data to HDFS using sqoop

Hi,

I am trying to import data from MySQL to HDFS using Sqoop. The command is as follows:

sqoop import --connect jdbc:mysql://192.168.218.128/sqoopdb -username hadoop --table EMP_ADD --driver com.mysql.jdbc.Driver --m 1 --where "CITY='sec-bad'" --target-dir /Practice/SqoopToHDFSWhere

When I check the generated file in HDFS, the data is duplicated:

[hdfs@sandbox root]$ hadoop fs -cat /Practice/SqoopToHDFSWhere/part-m-00000

1202,108I,aoc,sec-bad

1204,78B,old city,sec-bad

1205,720X,hitec,sec-bad

1202,108I,aoc,sec-bad

1204,78B,old city,sec-bad

1205,720X,hitec,sec-bad

Please help me with this.

PS: I am using HDP 2.4.

Regards,

Suresh Kumar

1 ACCEPTED SOLUTION

@Suresh Kumar D

Try this:

sqoop import --connect jdbc:mysql://192.168.218.128/sqoopdb --driver "com.teradata.jdbc.TeraDriver" --username hadoop --password Hadoop@1 --query "select * from emp_add where city='sec-bad' AND \$CONDITIONS" --target-dir /Practice/SqoopToHDFSWhere/ --m 1;

4 REPLIES

avatar
New Contributor

Hi Divakar,

When I tried the command above, I got a new error:

16/07/19 07:22:58 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.teradata.jdbc.TeraDriver
java.lang.RuntimeException: Could not load db driver class: com.teradata.jdbc.TeraDriver
    at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:856)
    at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:744)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:767)
    at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:270)
    at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:241).....

What I really want to know is why the data is inserted twice when I run the command only once. Do any settings or configurations need to be changed?

Please find attached sqoopduplication.jpg with the complete execution output.

According to the MapReduce output, it is retrieving 6 records. Please advise.

avatar
Contributor

Can you confirm that your MySQL query itself does not return duplicates, i.e. "select * from emp_add where city='sec-bad'"?
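A rough way to run that check, offered as a sketch rather than a confirmed diagnosis: count the rows that occur more than once in the imported file, then compare with the row count at the source. `count_duplicate_lines` is a hypothetical helper name, the sample rows are copied from the listing in the question, and the `sqoop eval` and `hadoop fs -cat` lines are left as comments because they assume the connection details from the original post and need a live cluster.

```shell
#!/bin/sh
# Hypothetical helper: reads rows on stdin and prints how many distinct
# rows occur more than once.
count_duplicate_lines() {
    # Drop empty lines, sort so identical rows are adjacent,
    # keep one copy of each repeated row, then count those copies.
    grep -v '^$' | sort | uniq -d | wc -l | tr -d ' '
}

# Sample data copied from the part-m-00000 listing in the question.
sample='1202,108I,aoc,sec-bad
1204,78B,old city,sec-bad
1205,720X,hitec,sec-bad
1202,108I,aoc,sec-bad
1204,78B,old city,sec-bad
1205,720X,hitec,sec-bad'

dup_count=$(printf '%s\n' "$sample" | count_duplicate_lines)
echo "rows that appear more than once: $dup_count"

# Against the real systems (assumes the host, database, and user from the question;
# -P prompts for the password instead of putting it on the command line):
#   sqoop eval --connect jdbc:mysql://192.168.218.128/sqoopdb --username hadoop -P \
#       --query "SELECT COUNT(*) FROM EMP_ADD WHERE CITY='sec-bad'"
#   hadoop fs -cat /Practice/SqoopToHDFSWhere/part-m-00000 | count_duplicate_lines
```

If the source count matches the number of lines in the HDFS file, the duplicates already exist in the MySQL table. If the source returns fewer rows than the file contains, the duplication is happening somewhere in the import.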

avatar

There is a typo in the sqoop command: use the MySQL driver instead of the Teradata driver.

Here is the modified script:

sqoop import --connect jdbc:mysql://192.168.218.128/sqoopdb --driver com.mysql.jdbc.Driver --username hadoop --password Hadoop@1 --query "select * from emp_add where city='sec-bad' AND \$CONDITIONS" --target-dir /Practice/SqoopToHDFSWhere/ --m 1;