Support Questions

Find answers, ask questions, and share your expertise

Using --direct option in Sqoop import/export

avatar
New Contributor

Hi Cloudera team. 

I'm currently using Sqoop 1.4.6-cdh5.12.0. I've found strange behavior using sqoop import/export with --direct option. While trying to import table into hdfs using below sqoop import command, it works as expected:

sqoop import \
--connect jdbc:mysql://quickstart.cloudera/retail_db \
--username retail_dba \
--password cloudera \
--table categories \
--target-dir export_problem/categories \
--fields-terminated-by '\t' \
--direct

Result:

[cloudera@quickstart ~]$ hadoop fs -cat export_problem/categories/p* 
1	2	Football
2	2	Soccer
3	2	Baseball & Softball
4	2	Basketball
...

 

 

But when I try to export it back to mysql with the same --direct option it doesn't work as expected:

mysql> create table categories_export as select * from categories limit 0;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0
sqoop export \
--connect jdbc:mysql://quickstart.cloudera/retail_db \
--username retail_dba \
--password cloudera \
--export-dir export_problem/categories \
--table categories_export \
--input-fields-terminated-by "\t" \
--direct

Result:

 

mysql> select * from categories_export limit 5;
+-------------+------------------------+---------------+
| category_id | category_department_id | category_name |
+-------------+------------------------+---------------+
|           1 |                   NULL |               |
|           2 |                   NULL |               |
|           3 |                   NULL |               |
|           4 |                   NULL |               |
|           5 |                   NULL |               |
+-------------+------------------------+---------------+
5 rows in set (0.00 sec)

But without --direct option all works as expected:

sqoop export \
--connect jdbc:mysql://quickstart.cloudera/retail_db \
--username retail_dba \
--password cloudera \
--export-dir export_problem/categories \
--table categories_export \
--input-fields-terminated-by "\t" 

Result:

mysql> select * from categories_export limit 5;
+-------------+------------------------+---------------------+
| category_id | category_department_id | category_name       |
+-------------+------------------------+---------------------+
|           1 |                      2 | Football            |
|           2 |                      2 | Soccer              |
|           3 |                      2 | Baseball & Softball |
|           4 |                      2 | Basketball          |
|           5 |                      2 | Lacrosse            |
+-------------+------------------------+---------------------+
5 rows in set (0.00 sec)

Looks like in sqoop export --direct overwrites previous options but in sqoop import not.

Is it an expected behavior?

5 REPLIES 5

avatar
Super Guru
Hi,

Thanks for reporting the issue, I will try to re-produce this in my lab and will let you know my findings.

Cheers

avatar
Super Guru
Hi,

Sorry about the delay. I am able to re-produce the issue, and currently investigating, will update you again when I have findings.

avatar
Super Guru
Hi Vitalii,

I tested further, you need to use --fields-terminated-by, not --input-fields-terminated-by for the export. As using --fields-terminated-by fixed the issue for me.

I am not exactly sure the differences, but I am trying to figure it out.

avatar
New Contributor

Thanks for your reply, Eric

 

If I understand correctly: 

--fields-terminated-by is the option used during Sqoop Import (ie. they are the Output Formatting arguments) which describe how the data will be written to HDFS.

--input-fields-terminated-by is the option used during Sqoop Export (ie. they are Input Formatting arguments) which describe how the input data is present in HDFS before exporting to RDBMS.

 

And also as I mentioned before, without --direct option sqoop command has expected behavior


If I'm wrong  correct me, please

avatar
Super Guru

Hi Vitalii,

 

I think you are correct, based on below:

 

https://github.com/cloudera/sqoop/blob/cdh5-1.4.6_5.14.0/src/java/org/apache/sqoop/SqoopOptions.java...

 

public void setInputFieldsTerminatedBy(char c) {
    this.inputDelimiters.setFieldsTerminatedBy(c);
}

 

 

https://github.com/cloudera/sqoop/blob/cdh5-1.4.6_5.14.0/src/java/org/apache/sqoop/SqoopOptions.java...

 

public void setFieldsTerminatedBy(char c) {
    this.outputDelimiters.setFieldsTerminatedBy(c);
}

However, I can see that when --direct is used, 

 

 

1. it calls class DirectMySQLManager:

https://github.com/cloudera/sqoop/blob/cdh5-1.4.6_5.14.0/src/java/org/apache/sqoop/manager/DirectMyS...

 

 

  public void exportTable(com.cloudera.sqoop.manager.ExportJobContext context)
      throws IOException, ExportException {
    context.setConnManager(this);
    MySQLExportJob exportJob = new MySQLExportJob(context);
    exportJob.runExport();
  }

2. in MySQLExportJob class, it actually uses getOutputFieldDelim() function

 

https://github.com/cloudera/sqoop/blob/cdh5-1.4.6_5.14.0/src/java/org/apache/sqoop/mapreduce/MySQLEx...

 

 

    conf.setInt(MySQLUtils.OUTPUT_FIELD_DELIM_KEY,
        options.getOutputFieldDelim());

 

This explains that we need to use --fields-terminated-by rather than --input-fields-terminated-by. It looks like that it is considered as output for MySQL, as code uses MySQLUtils.OUTPUT_FIELD_DELIM_KEY.

 

I am not sure if it is expected or a bug. I will follow up with our engineering team. 

 

Look out for another update sometime next week.