Created 05-11-2017 02:16 PM
Hi team,
I created a Hive database called employees_exercise and the creation was successful. However, when I run a Sqoop job to connect to a MySQL instance and load the data into HDFS, I get the following error from the sqoop command.
sqoop import --connect jdbc:mysql://ip-10-0-0-247.eu-west-2.compute.internal/employees --username <<Removed>> --password <<removed>> --table employees --hive-import --hive-table employees_exercise.employees --hive-import --as-parquetfile -m 1
The error message is below:
17/05/11 10:10:26 INFO mapreduce.ImportJobBase: Beginning import of employees
17/05/11 10:10:27 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employees` AS t LIMIT 1
17/05/11 10:10:28 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.ValidationException: Dataset name employees_exercise.employees is not alphanumeric (plus '_')
org.kitesdk.data.ValidationException: Dataset name employees_exercise.employees is not alphanumeric (plus '_')
    at org.kitesdk.data.ValidationException.check(ValidationException.java:55)
    at org.kitesdk.data.spi.Compatibility.checkDatasetName(Compatibility.java:103)
    at org.kitesdk.data.spi.hive.HiveManagedMetadataProvider.create(HiveManagedMetadataProvider.java:43)
    at org.kitesdk.data.spi.hive.HiveManagedDatasetRepository.create(HiveManagedDatasetRepository.java:77)
    at org.kitesdk.data.Datasets.create(Datasets.java:239)
    at org.kitesdk.data.Datasets.create(Datasets.java:307)
    at org.apache.sqoop.mapreduce.ParquetJob.createDataset(ParquetJob.java:107)
    at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:89)
    at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:108)
    at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:260)
    at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)
    at org.apache.sqoop.manager.MySQLManager.importTable(MySQLManager.java:118)
    at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
    at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:148)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:184)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:226)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:235)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:244)
Could someone please assist?
Thanks in advance
Regards
Naveen
Created 05-11-2017 02:55 PM
My hypothesis is that this is an issue specific to the --as-parquetfile argument, as there appears to be a KiteSDK bug related to this behavior. Can you please try executing the command without it? The pattern would be to create a staging table first and then insert into a Parquet-backed table.
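For reference, based on the command in your post, the same import without the Parquet flag would look something like this (credentials removed as in your post):

sqoop import --connect jdbc:mysql://ip-10-0-0-247.eu-west-2.compute.internal/employees --username <<removed>> --password <<removed>> --table employees --hive-import --hive-table employees_exercise.employees -m 1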
Created 05-11-2017 03:34 PM
Thank you, @slachterman. That worked; I removed --as-parquetfile and that seems to have fixed the problem. When you say insert into a Parquet-backed table, do you mean I have to create a table in Hive ahead of loading into it? Any documentation that you can point to would be helpful.
Created 05-11-2017 04:38 PM
Yes, @Naveen Keshava, that is correct. You will create two tables, say, employees_staging and employees_parquet. employees_staging will be stored as text and employees_parquet will be stored as Parquet. You will sqoop into employees_staging and then run another SQL command to insert into employees_parquet from employees_staging, as sketched below.
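To make that concrete, here is a minimal sketch of the Hive statements. The column list is only an example, since the actual employees schema isn't shown in this thread; substitute your real columns.

-- Staging table that the Sqoop import will populate (stored as text)
CREATE TABLE employees_exercise.employees_staging (
  emp_no     INT,
  first_name STRING,
  last_name  STRING,
  hire_date  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Final table backed by Parquet files
CREATE TABLE employees_exercise.employees_parquet (
  emp_no     INT,
  first_name STRING,
  last_name  STRING,
  hire_date  STRING
)
STORED AS PARQUET;

-- Copy the staged rows into the Parquet-backed table
INSERT INTO TABLE employees_exercise.employees_parquet
SELECT * FROM employees_exercise.employees_staging;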
Created 06-04-2017 10:05 PM
Thank you, @slachterman. One final question: when you said "run another SQL command to insert into employees_parquet from employees_staging", were you referring to a simple "INSERT INTO" SQL command?
I tried it, and it looks like the insert into the target table worked. But how do I then make sure the file is in HDFS, in Parquet format?
Created 06-04-2017 11:13 PM
Yes.
You can use DESCRIBE EXTENDED to see the HDFS path associated with the Hive table, and you can use parquet-tools to interact with the Parquet files; see the example below.
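For example, assuming the employees_parquet table from earlier in the thread (the warehouse path below is just an illustration; use whatever location DESCRIBE EXTENDED reports for your table):

DESCRIBE EXTENDED employees_exercise.employees_parquet;

Then, from a shell:

hdfs dfs -ls /user/hive/warehouse/employees_exercise.db/employees_parquet
parquet-tools schema <one-of-the-part-files>.parquet
parquet-tools head <one-of-the-part-files>.parquet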
Created 05-07-2018 06:29 PM
I also had the same issue, but I was not ready to create a staging table; it felt like that was not the real solution.
Just do it like this:
Instead of using "--hive-table employees_exercise.employees", use "--hive-database employees_exercise --hive-table employees".
It worked well for me.
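Applied to the command from the original post, that would look something like this (credentials removed as before):

sqoop import --connect jdbc:mysql://ip-10-0-0-247.eu-west-2.compute.internal/employees --username <<removed>> --password <<removed>> --table employees --hive-import --hive-database employees_exercise --hive-table employees --as-parquetfile -m 1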
Created 06-28-2018 06:40 PM
I faced the same issue in a different case, an incremental import from Oracle to a Parquet file in HDFS. In my case, the issue occurred when I specified the --table parameter.
After some research, I concluded that this issue is caused by the dot when specifying "schema.table". In addition to what was mentioned above, you can use the parameter --query instead of --table in your sqoop import. E.g.:
Instead of:
sqoop import --connect jdbc:oracle:thin:@host:port/db --username user -P --table schema.table --split-by column --target-dir /some_hdfs_path/schema/table --incremental append --check-column ID --as-parquetfile --fields-terminated-by '|'
I used:
sqoop import --connect jdbc:oracle:thin:@host:port/db --username user -P --query "SELECT * FROM schema.table WHERE \$CONDITIONS" --split-by column --target-dir /some_hdfs_path/schema/table --incremental append --check-column column --as-parquetfile --fields-terminated-by '|'
The \$CONDITIONS token is required by Sqoop when you use --query; without it you'll get another error.