
Using underscore _ in a database name in HIVE


Hi team,

I created a Hive database called employees_exercise and the creation was successful. However, when I run a Sqoop job to connect to a MySQL instance and load into HDFS, I get the following error when running the Sqoop command.

sqoop import --connect jdbc:mysql://ip-10-0-0-247.eu-west-2.compute.internal/employees --username <<Removed>> --password <<removed>> --table employees --hive-import --hive-table employees_exercise.employees --hive-import --as-parquetfile -m 1

The error message is below:

17/05/11 10:10:26 INFO mapreduce.ImportJobBase: Beginning import of employees
17/05/11 10:10:27 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employees` AS t LIMIT 1
17/05/11 10:10:28 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.ValidationException: Dataset name employees_exercise.employees is not alphanumeric (plus '_')
org.kitesdk.data.ValidationException: Dataset name employees_exercise.employees is not alphanumeric (plus '_')
    at org.kitesdk.data.ValidationException.check(ValidationException.java:55)
    at org.kitesdk.data.spi.Compatibility.checkDatasetName(Compatibility.java:103)
    at org.kitesdk.data.spi.hive.HiveManagedMetadataProvider.create(HiveManagedMetadataProvider.java:43)
    at org.kitesdk.data.spi.hive.HiveManagedDatasetRepository.create(HiveManagedDatasetRepository.java:77)
    at org.kitesdk.data.Datasets.create(Datasets.java:239)
    at org.kitesdk.data.Datasets.create(Datasets.java:307)
    at org.apache.sqoop.mapreduce.ParquetJob.createDataset(ParquetJob.java:107)
    at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:89)
    at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:108)
    at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:260)
    at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)
    at org.apache.sqoop.manager.MySQLManager.importTable(MySQLManager.java:118)
    at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
    at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:148)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:184)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:226)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:235)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:244)

Could someone please assist?

Thanks in advance

Regards

Naveen

1 ACCEPTED SOLUTION


My hypothesis is that this is an issue specific to the --as-parquetfile argument, as there appears to be a Kite SDK bug related to this behavior. Can you please try executing the command without it? The pattern would be to create a staging table first and then insert into a Parquet-backed table.
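For example, dropping that flag from the command in the question would look like this (a sketch only; the host and credentials are the placeholders from the original post):

sqoop import --connect jdbc:mysql://ip-10-0-0-247.eu-west-2.compute.internal/employees --username <<Removed>> --password <<removed>> --table employees --hive-import --hive-table employees_exercise.employees -m 1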


Thank you, @slachterman. That worked; I removed --as-parquetfile and that seems to have fixed the problem. When you say insert into a parquet-backed table, do you mean I have to create a table in Hive ahead of loading into it? Any documentation that you can point to would be helpful.


Yes, @Naveen Keshava, that is correct. You will create two tables, say, employees_staging and employees_parquet. The employees_staging table will be stored as text and employees_parquet will be stored as Parquet. You will sqoop into employees_staging and then run another SQL command to insert into employees_parquet from employees_staging.
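A minimal sketch of that pattern, reusing the connection details from the question (names and details are illustrative, so adjust them to your environment before running):

First, sqoop into the text-backed staging table, without --as-parquetfile:

sqoop import --connect jdbc:mysql://ip-10-0-0-247.eu-west-2.compute.internal/employees --username <<Removed>> --password <<removed>> --table employees --hive-import --hive-table employees_exercise.employees_staging -m 1

Then create the Parquet-backed table and copy the rows over; a single CTAS statement does both:

hive -e "USE employees_exercise; CREATE TABLE employees_parquet STORED AS PARQUET AS SELECT * FROM employees_staging;"

If you prefer to create employees_parquet up front with an explicit schema, a plain INSERT INTO TABLE employees_parquet SELECT * FROM employees_staging; does the copy instead.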


Thank you, @slachterman. One final question: when you said "run another SQL command to insert into employees_parquet from employees_staging", were you referring to a simple "INSERT INTO" SQL command?

I tried it, and it looks like the insert into the target table worked. But how do I then confirm that the file is in HDFS, in Parquet format?


Yes.

You can use DESCRIBE EXTENDED to see the HDFS path associated with the Hive table, and you can use parquet-tools to interact with the Parquet files.
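For example (a sketch; the warehouse path and file name below are typical defaults rather than values from this thread, so use the location that DESCRIBE prints):

hive -e "DESCRIBE EXTENDED employees_exercise.employees_parquet;"

hdfs dfs -ls /user/hive/warehouse/employees_exercise.db/employees_parquet

parquet-tools schema /user/hive/warehouse/employees_exercise.db/employees_parquet/000000_0

parquet-tools head /user/hive/warehouse/employees_exercise.db/employees_parquet/000000_0

If your parquet-tools build cannot read HDFS paths directly, copy a data file to the local filesystem with hdfs dfs -get first and point parquet-tools at the local copy.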


I also had the same issue, but I was not ready to create a staging table; it felt like that was not the real solution.

Just do it like this:

Instead of using "--hive-table employees_exercise.employees", use "--hive-database employees_exercise --hive-table employees".

It worked well for me.
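For reference, the command from the original question rewritten with those flags would look like this (a sketch; the host and credentials are the placeholders from the question):

sqoop import --connect jdbc:mysql://ip-10-0-0-247.eu-west-2.compute.internal/employees --username <<Removed>> --password <<removed>> --table employees --hive-import --hive-database employees_exercise --hive-table employees --as-parquetfile -m 1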


I faced the same issue in a different case: an incremental import from Oracle to a Parquet file in HDFS. In my case, the issue occurred when I specified the --table parameter.

After some research, I concluded that this issue is caused by the dot when specifying "schema.table". In addition to what is mentioned above, you can use the --query parameter instead of --table in your Sqoop import. For example:

Instead of:

sqoop import --connect jdbc:oracle:thin:@host:port/db --username user -P --table schema.table --split-by column --target-dir /some_hdfs_path/schema/table --incremental append --check-column ID --as-parquetfile --fields-terminated-by '|'

I used:

sqoop import --connect jdbc:oracle:thin:@host:port/db --username user -P --query "SELECT * FROM schema.table WHERE \$CONDITIONS" --split-by column --target-dir /some_hdfs_path/schema/table --incremental append --check-column column --as-parquetfile --fields-terminated-by '|'

The \$CONDITIONS token is required by Sqoop when using --query; without it you'll get another error.