
Sqoop import --as-parquetfile with CDH5

New Contributor

I'm trying to import data directly from MySQL to Parquet, but it doesn't seem to work correctly...

I'm using CDH 5.3, which includes Sqoop 1.4.5.

Here is my command line:

sqoop import \
  --connect jdbc:mysql://xx.xx.xx.xx/database \
  --username username \
  --password mypass \
  --query 'SELECT page_id,user_id FROM pages_users WHERE $CONDITIONS' \
  --split-by page_id \
  --hive-import \
  --hive-table default.pages_users3 \
  --target-dir hive_pages_users \
  --as-parquetfile


Then I get this error:


Warning: /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/bin/../lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
    Please set $ACCUMULO_HOME to the root of your Accumulo installation.
    15/01/09 14:31:49 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.0
    15/01/09 14:31:49 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
    15/01/09 14:31:49 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
    15/01/09 14:31:49 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
    15/01/09 14:31:49 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
    15/01/09 14:31:49 INFO tool.CodeGenTool: Beginning code generation
    15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE  (1 = 0)
    15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE  (1 = 0)
    15/01/09 14:31:50 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE  (1 = 0)
    15/01/09 14:31:50 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce
    Note: /tmp/sqoop-root/compile/b90e7b492f5b66554f2cca3f88ef7a61/QueryResult.java uses or overrides a deprecated API.
    Note: Recompile with -Xlint:deprecation for details.
    15/01/09 14:31:51 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/b90e7b492f5b66554f2cca3f88ef7a61/QueryResult.jar
    15/01/09 14:31:51 INFO mapreduce.ImportJobBase: Beginning query import.
    15/01/09 14:31:51 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
    15/01/09 14:31:51 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE  (1 = 0)
    15/01/09 14:31:51 INFO manager.SqlManager: Executing SQL statement: SELECT page_id,user_id FROM pages_users WHERE  (1 = 0)
    15/01/09 14:31:51 WARN spi.Registration: Not loading URI patterns in org.kitesdk.data.spi.hive.Loader
    15/01/09 14:31:51 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI: hive?dataset=default.pages_users3
    org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI: hive?dataset=default.pages_users3
        at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:109)
        at org.kitesdk.data.Datasets.create(Datasets.java:189)
        at org.kitesdk.data.Datasets.create(Datasets.java:240)
        at org.apache.sqoop.mapreduce.ParquetJob.createDataset(ParquetJob.java:81)
        at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:70)
        at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:112)
        at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:262)
        at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:721)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:499)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:236)

 
I have no problem importing data in the default Hive file format, but Parquet is a problem... Do you have any idea why this occurs?

Thank you :)

5 REPLIES

Re: Sqoop import --as-parquetfile with CDH5

Master Guru

This is caused by missing classpath entries (for "hive" URIs) that Kite's Hive connector relies on.

You can work around the issue by manually exporting the environment variables below, for example:
export HIVE_HOME=/opt/cloudera/parcels/CDH/lib/hive
export HCAT_HOME=/opt/cloudera/parcels/CDH/lib/hive-hcatalog
sqoop import \
  --connect jdbc:mysql://xx.xx.xx.xx/database \
  --username username \
  --password mypass \
  --query 'SELECT page_id,user_id FROM pages_users WHERE $CONDITIONS' \
  --split-by page_id \
  --hive-import \
  --hive-table default.pages_users3 \
  --target-dir hive_pages_users \
  --as-parquetfile
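
As a quick sanity check before re-running the import (a minimal sketch; the paths simply mirror the parcel layout used above), you can confirm that both directories actually exist:

# Verify the directories the env-vars point at (paths assume the
# standard CDH parcel layout from the workaround above)
[ -d "$HIVE_HOME" ] && [ -d "$HCAT_HOME" ] \
  && echo "HIVE_HOME and HCAT_HOME look OK" \
  || echo "check your parcel paths"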

 

This will also be resolved in a future CDH release (i.e., the need to manually export the two env-vars above will go away).

 

Does this help?

 

Re: Sqoop import --as-parquetfile with CDH5

New Contributor

Hi, is this fixed? If yes, could you please let me know which version of CDH has the fix?

Or do we still need to manually export the two env-vars specified above?

Re: Sqoop import --as-parquetfile with CDH5

New Contributor

Any answer to this question? We are struggling with the same issue.

Re: Sqoop import --as-parquetfile with CDH5

New Contributor

Hi, this works for me with Parquet files on CDH5. I hope this helps!

 

sqoop import-all-tables \
  -m 1 \
  --connect jdbc:mysql://xx.xx.xx.xx/database \
  --username username \
  --password password \
  --compression-codec=snappy \
  --as-parquetfile \
  --warehouse-dir=/user/hive/warehouse/ \
  --hive-import
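
For reference, one way to confirm the import actually produced Parquet files (the table name below is just the example from this thread; adjust to your own schema):

# List the files Sqoop wrote for one of the imported tables
hdfs dfs -ls /user/hive/warehouse/pages_users/

# Ask Hive which storage format the table metadata records
hive -e "DESCRIBE FORMATTED default.pages_users" | grep -i InputFormat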


Re: Sqoop import --as-parquetfile with CDH5

New Contributor

From what I've read, this seems to be a bug in the Kite SDK. The definitive workaround is to have Sqoop write the Parquet files it generates to a temporary directory inside the directory associated with the table, and then run a hadoop fs -mv command that moves them from the temporary directory into the table's directory, and that's it. For example:

--target-dir "/user/hive/warehouse/bd_equipo.db/phone_number/" --as-parquetfile (this is the configuration that fails)

We change it to:

--target-dir "/user/hive/warehouse/bd_ConGuionBajo.db/mi_tabla/temp" --as-parquetfile

After the sqoop-import finishes, we can move the files into the table's directory:

hadoop fs -mv /user/hive/warehouse/bd_ConGuionBajo.db/mi_tabla/temp/*.parquet /user/hive/warehouse/bd_ConGuionBajo.db/mi_tabla/

Happy imports
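
Putting that workaround together end to end, a minimal sketch (the database/table names come from the post above, and every sqoop flag other than --target-dir and --as-parquetfile is an assumption added to make the example self-contained):

# Import into a temp dir under the table directory, then promote the files
TABLE_DIR=/user/hive/warehouse/bd_ConGuionBajo.db/mi_tabla

sqoop import \
  --connect jdbc:mysql://xx.xx.xx.xx/database \
  --username username -P \
  --table mi_tabla \
  --target-dir "$TABLE_DIR/temp" \
  --as-parquetfile

# Move the generated Parquet files into the table directory and clean up
hadoop fs -mv "$TABLE_DIR"/temp/*.parquet "$TABLE_DIR"/
hadoop fs -rm -r "$TABLE_DIR"/temp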