Getting duplicate data when importing data to HDFS using sqoop

Hi,

I am trying to import data from MySQL to HDFS using Sqoop. The command is as follows:

sqoop import --connect jdbc:mysql://192.168.218.128/sqoopdb -username hadoop --table EMP_ADD --driver com.mysql.jdbc.Driver --m 1 --where "CITY='sec-bad'" --target-dir /Practice/SqoopToHDFSWhere

When I check the generated file in HDFS, the data is duplicated:

[hdfs@sandbox root]$ hadoop fs -cat /Practice/SqoopToHDFSWhere/part-m-00000

1202,108I,aoc,sec-bad

1204,78B,old city,sec-bad

1205,720X,hitec,sec-bad

1202,108I,aoc,sec-bad

1204,78B,old city,sec-bad

1205,720X,hitec,sec-bad

Please help me with this.

PS: I am using HDP 2.4.

Regards,

Suresh Kumar

1 ACCEPTED SOLUTION

@Suresh Kumar D

Try this:

sqoop import --connect jdbc:mysql://192.168.218.128/sqoopdb --driver "com.teradata.jdbc.TeraDriver" --username hadoop --password Hadoop@1 --query "select * from emp_add where city='sec-bad' AND \$CONDITIONS" --target-dir /Practice/SqoopToHDFSWhere/ --m 1;

4 REPLIES

Hi Divakar,

When I tried the above command, I got a new error:

16/07/19 07:22:58 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.teradata.jdbc.TeraDriver
java.lang.RuntimeException: Could not load db driver class: com.teradata.jdbc.TeraDriver
    at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:856)
    at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:744)
    at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:767)
    at org.apache.sqoop.manager.SqlManager.getColumnInfoForRawQuery(SqlManager.java:270)
    at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:241).....

What I want to know is why the data is written twice even though I run the command only once. Do any settings or configurations need to be changed?

Please find attached the complete execution output (sqoopduplication.jpg).

According to the MapReduce output, the job is retrieving 6 records. Please suggest.

Can you confirm that your MySQL query itself does not return duplicates, i.e. "select * from emp_add where city='sec-bad'"?
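One way to make that check concrete is to run the same duplicate test against both the source query and the imported file. The sketch below is only an illustration: `dup_rows` is a hypothetical helper name, not part of Sqoop or Hadoop, and the `mysql` invocation in the comments assumes the client is installed and reuses the host, user, and database from the original command.

```shell
#!/bin/sh
# dup_rows (hypothetical helper): read rows on stdin and print every row
# that occurs more than once, prefixed by its occurrence count.
# Blank lines are ignored.
dup_rows() {
    grep -v '^[[:space:]]*$' |
        sort |
        uniq -c |
        awk '$1 > 1 { sub(/^[[:space:]]*/, ""); print }'
}

# Possible usage (assumptions: mysql client available, credentials as in the thread):
#   hadoop fs -cat /Practice/SqoopToHDFSWhere/part-m-00000 | dup_rows
#   mysql -h 192.168.218.128 -u hadoop -p sqoopdb -B -N \
#       -e "select * from emp_add where city='sec-bad'" | dup_rows
```

If the second pipeline also reports rows with a count of 2, the duplicates already exist in the source table and Sqoop is copying them faithfully; if only the HDFS file shows them, the import itself is the place to look.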

There is a typo in the sqoop command: use the MySQL driver instead of the Teradata driver.

Here is the modified script:

sqoop import --connect jdbc:mysql://192.168.218.128/sqoopdb --driver com.mysql.jdbc.Driver --username hadoop --password Hadoop@1 --query "select * from emp_add where city='sec-bad' AND \$CONDITIONS" --target-dir /Practice/SqoopToHDFSWhere/ --m 1;