Created on 03-21-2018 04:33 AM - edited 09-16-2022 06:00 AM
Hi Cloudera team.
I'm currently using Sqoop 1.4.6-cdh5.12.0. I've found strange behavior using sqoop import/export with --direct option. While trying to import table into hdfs using below sqoop import command, it works as expected:
sqoop import \ --connect jdbc:mysql://quickstart.cloudera/retail_db \ --username retail_dba \ --password cloudera \ --table categories \ --target-dir export_problem/categories \ --fields-terminated-by '\t' \ --direct
Result:
[cloudera@quickstart ~]$ hadoop fs -cat export_problem/categories/p* 1 2 Football 2 2 Soccer 3 2 Baseball & Softball 4 2 Basketball ...
But when I try to export it back to mysql with the same --direct option it doesn't work as expected:
mysql> create table categories_export as select * from categories limit 0; Query OK, 0 rows affected (0.02 sec) Records: 0 Duplicates: 0 Warnings: 0
sqoop export \ --connect jdbc:mysql://quickstart.cloudera/retail_db \ --username retail_dba \ --password cloudera \ --export-dir export_problem/categories \ --table categories_export \ --input-fields-terminated-by "\t" \ --direct
Result:
mysql> select * from categories_export limit 5; +-------------+------------------------+---------------+ | category_id | category_department_id | category_name | +-------------+------------------------+---------------+ | 1 | NULL | | | 2 | NULL | | | 3 | NULL | | | 4 | NULL | | | 5 | NULL | | +-------------+------------------------+---------------+ 5 rows in set (0.00 sec)
But without --direct option all works as expected:
sqoop export \ --connect jdbc:mysql://quickstart.cloudera/retail_db \ --username retail_dba \ --password cloudera \ --export-dir export_problem/categories \ --table categories_export \ --input-fields-terminated-by "\t"
Result:
mysql> select * from categories_export limit 5; +-------------+------------------------+---------------------+ | category_id | category_department_id | category_name | +-------------+------------------------+---------------------+ | 1 | 2 | Football | | 2 | 2 | Soccer | | 3 | 2 | Baseball & Softball | | 4 | 2 | Basketball | | 5 | 2 | Lacrosse | +-------------+------------------------+---------------------+ 5 rows in set (0.00 sec)
Looks like in sqoop export --direct overwrites previous options but in sqoop import not.
Is it an expected behavior?
Created 04-10-2018 04:21 AM
Created 05-02-2018 10:40 PM
Created 05-02-2018 11:53 PM
Created 05-03-2018 12:58 AM
Thanks for your reply, Eric
If I understand correctly:
--fields-terminated-by is the option used during Sqoop Import (ie. they are the Output Formatting arguments) which describe how the data will be written to HDFS.
--input-fields-terminated-by is the option used during Sqoop Export (ie. they are Input Formatting arguments) which describe how the input data is present in HDFS before exporting to RDBMS.
And also as I mentioned before, without --direct option sqoop command has expected behavior
If I'm wrong correct me, please
Created 05-03-2018 09:54 PM
Hi Vitalii,
I think you are correct, based on below:
public void setInputFieldsTerminatedBy(char c) { this.inputDelimiters.setFieldsTerminatedBy(c); }
public void setFieldsTerminatedBy(char c) { this.outputDelimiters.setFieldsTerminatedBy(c); }
However, I can see that when --direct is used,
1. it calls class DirectMySQLManager:
public void exportTable(com.cloudera.sqoop.manager.ExportJobContext context) throws IOException, ExportException { context.setConnManager(this); MySQLExportJob exportJob = new MySQLExportJob(context); exportJob.runExport(); }
2. in MySQLExportJob class, it actually uses getOutputFieldDelim() function
conf.setInt(MySQLUtils.OUTPUT_FIELD_DELIM_KEY, options.getOutputFieldDelim());
This explains that we need to use --fields-terminated-by rather than --input-fields-terminated-by. It looks like that it is considered as output for MySQL, as code uses MySQLUtils.OUTPUT_FIELD_DELIM_KEY.
I am not sure if it is expected or a bug. I will follow up with our engineering team.
Look out for another update sometime next week.