I am using sqoop 1.4.6-chd5.10.0 to transfer data between hive and MySQL. The sqoop-import job was succeed, but the sqoop-export job was failed with the error "Can't parse input data".
The main reason I think is that there are some special characters(i.e. ^@) in the HDFS plain file which was imported by sqoop-import as shown in below.
The sqoop import command I used is :
sqoop import --connect jdbc:mysql://******:3306/smart_voice_data --table sqoop_import_test --username ****** --password ****** --verbose --hive-database hive_smart_voice_data --hive-table hive_sqoop_import_test --lines-terminated-by "\n" --hive-drop-import-delims --split-by id --null-non-string NULL --null-string NULL --fields-terminated-by "|" --delete-target-dir --hive-overwrite
The sqoop export command I used is :
sqoop export --connect "jdbc:mysql://******:3306/smart_voice_data?useUnicode=true&characterEncoding=gbk" --username ****** --password ****** --table sqoop_hi_filog_goi_vidaa4_normalized --export-dir hdfs://cdh5node1:8020/user/hive/warehouse/hive_smart_voice_data.db/hive_sqoop_import_test/ --input-null-string "\\\\N" --input-null-non-string "\\\\N" --fields-terminated-by '|' --input-lines-terminated-by '\n'
1 、Is there any way to escape the special character ^@ (\000) when using sqoop to import data into HDFS?
2、 Is there any way to skip the special character ^@(\000) when using sqoop to export data to mysql , or is there any solution to make sqoop-export can parse the special character ^@ ?
3、Is there any way to skip the error and continue the remaining job when doing sqoop-export ?
Any help is really appreciated , thanks a lot !
You can do two things .
1 . You can perform import using certain column using --column skipping those coloumn with special character
--columns <col,col,col…> Columns to import from table
2. Would you consider runining a shell script to remove the special characters before runining sqoop export .
@csguna， Thanks for your reply.
1. the column with special character is needed in the production environment, and only some of rows in the column contain the special character. So the whole cloumn cannot be skipped.
2. I also thought about that, but the tables are imported into HDFS frequently, so every time when the tables get updated, I need to run the shell script to remove the special character.
And the 'special character' may not be able to be predicted, for example, this time it is ^@ , next time, it may become ^A or ^H or something else, as a result, it's hard to use the shell script to remove all special character
I would recommend you to use
--query 'select col1, string_function(col2), etc from sqoop_hi_filog_goi_vidaa4_normalized where $CONDITIONS'
1. String_function is not an actual function name. You can use the applicable function to remove unwanted characters
2. Your query will end with "where $CONDITIONS" by default and this is syntax
@saranvisa, Thanks for your reply, and it's really a good idea to use the 'string_function' to filter the special character.
but from my understanding, the applicable string_functions you mentioned are MySQL string functions, right? Is it possible that we write a customize function for specific case？
In fact I was referring to Hive funtions . In the mean time I understand your point that either source or target will be MySQL, but I believe the priority goes to the environment where the command get executed (need to double check)
Yes you can customize the functions (UDF).
But I would say ignore the complex part now, just start with basic funtions and mostly the string functions are common accorss the DBs.. you will get idea over the period.
@saranvisa, Yes, either source and target are MySQL.
I will try some basic functions , It's really kind of you !
But during my investigation, I found there is an extra argument called --ctrl-chars , but it seems it is only applicable for Netezza , do you have any ideas if this extra argument can also be applied to MySQL ?
thanks again for your help.
I never used --ctrl-chars option before, but it has the following limitations
1. As you mentioned, Currently supported by Netezza Connector
2. You should have sqoop version 1.4.6
3. It will work with direct mode and you have to be very careful when using direct as it may bring down source db depends upon your usage