Support Questions


SQOOP Import to Snappy ORC

Master Guru

Can Sqoop transfer data to HDFS as Snappy-compressed ORC? And can it then go directly into a Hive table?

1 ACCEPTED SOLUTION

Guru

@Timothy Spann Here is an example of sqooping data in compressed ORC format (without needing an intermediate table). I see that this is not well documented.

sqoop import --connect jdbc:mysql://localhost/employees --username hive --password hive --table departments --hcatalog-database default --hcatalog-table my_table_orc --create-hcatalog-table --hcatalog-storage-stanza "stored as orcfile"

This example uses the default ORC compression. If you want Snappy, you can create the table in advance with the table property set to Snappy compression and then take out '--create-hcatalog-table'.
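For the Snappy variant, a minimal sketch of that approach (the column list for departments is assumed here; adjust it to your actual schema):

-- Hive: create the target ORC table ahead of time with Snappy compression
CREATE TABLE my_table_orc (dept_no INT, dept_name STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

# Sqoop: point at the pre-created table; no --create-hcatalog-table or storage stanza needed
sqoop import --connect jdbc:mysql://localhost/employees --username hive --password hive --table departments --hcatalog-database default --hcatalog-table my_table_orc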

More details are in this thread.

https://community.hortonworks.com/questions/28060/can-sqoop-be-used-to-directly-import-data-into-an....


Super Guru
@Timothy Spann

Currently Sqoop doesn't support direct import into an ORC Hive table; please refer to the Jira below.

https://issues.apache.org/jira/browse/SQOOP-2192

As a workaround, we need to import the data into a temporary text-format table through Sqoop and then copy the data from the temp table into the ORC-format table.
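A rough sketch of that two-step workaround, reusing the connection details from the accepted answer (the staging and target table names here are made up for illustration):

# Step 1: Sqoop the source table into a temporary text-format Hive table
sqoop import --connect jdbc:mysql://localhost/employees --username hive --password hive --table departments --hive-import --hive-table departments_txt_tmp

# Step 2: copy the rows into an ORC table (Snappy-compressed here) and drop the staging table
hive -e "CREATE TABLE departments_orc STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY') AS SELECT * FROM departments_txt_tmp; DROP TABLE departments_txt_tmp;"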

Guru

This is not true. Sqoop can directly import into a Snappy-compressed ORC table using HCatalog. Refer to my answer for how to do this.

I agree with @Ravi Mutyala, so I have unaccepted the answer so that the right answer can take its place.

With version 1.4.5 it is not possible to import data from an RDBMS into an ORC Hive table directly; you need to use the workaround mentioned by @Timothy Spann.

Master Guru

Thanks, I was afraid that was the case.

@Timothy Spann Here is a quick and dirty way to do it. I had some time, so I tried @Ravi Mutyala's suggestion and it works :).

hive> create table test_orc_sqoop (name varchar(20)) ;
OK
Time taken: 0.729 seconds
hive> select * from test_orc_sqoop;
OK
Time taken: 0.248 seconds
hive> exit;


[hdfs@test-mon-wmt ~]$ sqoop import --connect jdbc:mysql://mon-WMT-upgrade.cloud.hortonworks.com/test --username test --password hadoop --table test --hcatalog-database default --hcatalog-table test_orc_sqoop --hcatalog-storage-stanza "stored as orcfile" -m 1
Warning: /usr/hdp/2.4.2.0-258/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: /usr/hdp/2.4.2.0-258/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/05/26 05:56:03 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.2.0-258
16/05/26 05:56:03 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/05/26 05:56:04 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/05/26 05:56:04 INFO tool.CodeGenTool: Beginning code generation
16/05/26 05:56:05 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `test` AS t LIMIT 1
16/05/26 05:56:05 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `test` AS t LIMIT 1
16/05/26 05:56:05 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.2.0-258/hadoop-mapreduce
Note: /tmp/sqoop-hdfs/compile/64f04ad998cebf113bf8ec1efdbf6b95/test.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/05/26 05:56:10 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hdfs/compile/64f04ad998cebf113bf8ec1efdbf6b95/test.jar
16/05/26 05:56:10 WARN manager.MySQLManager: It looks like you are importing from mysql.
16/05/26 05:56:10 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
16/05/26 05:56:10 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
16/05/26 05:56:10 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
16/05/26 05:56:10 INFO mapreduce.ImportJobBase: Beginning import of test
16/05/26 05:56:11 INFO hcat.SqoopHCatUtilities: Configuring HCatalog for import job
16/05/26 05:56:11 INFO hcat.SqoopHCatUtilities: Configuring HCatalog specific details for job
16/05/26 05:56:11 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `test` AS t LIMIT 1
16/05/26 05:56:11 INFO hcat.SqoopHCatUtilities: Database column names projected : [name]
16/05/26 05:56:11 INFO hcat.SqoopHCatUtilities: Database column name - info map :
name : [Type : 12,Precision : 20,Scale : 0]
16/05/26 05:56:12 INFO hive.metastore: Trying to connect to metastore with URI thrift://rm-wmt-upgrade.cloud.hortonworks.com:9083
16/05/26 05:56:12 INFO hive.metastore: Connected to metastore.
16/05/26 05:56:14 INFO hcat.SqoopHCatUtilities: HCatalog full table schema fields = [name]
16/05/26 05:56:18 INFO hcat.SqoopHCatUtilities: HCatalog table partitioning key fields = []
16/05/26 05:56:18 INFO hcat.SqoopHCatUtilities: HCatalog projected schema fields = [name]
16/05/26 05:56:18 INFO hcat.SqoopHCatUtilities: HCatalog job : Hive Home = /usr/hdp/current/hive-client
16/05/26 05:56:18 INFO hcat.SqoopHCatUtilities: HCatalog job:  HCatalog Home = /usr/hdp/2.4.2.0-258//sqoop/../hive-hcatalog
16/05/26 05:56:18 INFO hcat.SqoopHCatUtilities: Adding jar files under /usr/hdp/2.4.2.0-258//sqoop/../hive-hcatalog/share/hcatalog to distributed cache
..............

...........
16/05/26 05:56:18 INFO hcat.SqoopHCatUtilities: Adding jar files under /usr/hdp/2.4.2.0-258//sqoop/../hive-hcatalog/share/hcatalog/storage-handlers to distributed cache (recursively)
16/05/26 05:56:18 WARN hcat.SqoopHCatUtilities: No files under /usr/hdp/2.4.2.0-258/sqoop/../hive-hcatalog/share/hcatalog/storage-handlers to add to distributed cache for hcatalog job
16/05/26 05:56:18 INFO hcat.SqoopHCatUtilities: Validating dynamic partition keys
16/05/26 05:56:18 WARN hcat.SqoopHCatUtilities: The HCatalog field name has type varchar(20).  Expected = varchar based on database column type : VARCHAR
16/05/26 05:56:18 WARN hcat.SqoopHCatUtilities: The Sqoop job can fail if types are not  assignment compatible
16/05/26 05:56:18 INFO mapreduce.DataDrivenImportJob: Configuring mapper for HCatalog import job
16/05/26 05:56:20 INFO impl.TimelineClientImpl: Timeline service address: http://rm-wmt-upgrade.cloud.hortonworks.com:8188/ws/v1/timeline/
16/05/26 05:56:21 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
16/05/26 05:57:07 INFO db.DBInputFormat: Using read commited transaction isolation
16/05/26 05:57:07 INFO mapreduce.JobSubmitter: number of splits:1
16/05/26 05:57:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1464036504999_0008
16/05/26 05:57:11 INFO impl.YarnClientImpl: Submitted application application_1464036504999_0008
16/05/26 05:57:11 INFO mapreduce.Job: The url to track the job: http://nn1-wmt-upgrade.cloud.hortonworks.com:8088/proxy/application_1464036504999_0008/
16/05/26 05:57:11 INFO mapreduce.Job: Running job: job_1464036504999_0008
16/05/26 05:57:56 INFO mapreduce.Job: Job job_1464036504999_0008 running in uber mode : false
16/05/26 05:57:56 INFO mapreduce.Job:  map 0% reduce 0%
16/05/26 05:58:25 INFO mapreduce.Job:  map 100% reduce 0%
16/05/26 05:58:27 INFO mapreduce.Job: Job job_1464036504999_0008 completed successfully
16/05/26 05:58:27 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=302189
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=22
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=17899
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=17899
Total vcore-seconds taken by all map tasks=17899
Total megabyte-seconds taken by all map tasks=12207118
Map-Reduce Framework
Map input records=3
Map output records=3
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=189
CPU time spent (ms)=2610
Physical memory (bytes) snapshot=172310528
Virtual memory (bytes) snapshot=2470862848
Total committed heap usage (bytes)=58130432
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
16/05/26 05:58:27 INFO mapreduce.ImportJobBase: Transferred 22 bytes in 129.1108 seconds (0.1704 bytes/sec)
16/05/26 05:58:27 INFO mapreduce.ImportJobBase: Retrieved 3 records.


[hdfs@test-mon-wmt ~]$ hive
WARNING: Use "yarn jar" to launch YARN applications.
Logging initialized using configuration in file:/etc/hive/2.4.2.0-258/0/hive-log4j.properties
hive> select * from test_orc_sqoop;
OK
abcd
abcdef
abcdefghi
Time taken: 4.31 seconds, Fetched: 3 row(s)
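If you want to confirm the table really landed as ORC, checking the table metadata should be enough; the orcfiledump path below is just the typical default warehouse location and is an assumption:

hive> DESCRIBE FORMATTED test_orc_sqoop;
-- the InputFormat line should show org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

# optionally dump an ORC data file directly (the file path and name will vary)
hive --orcfiledump /apps/hive/warehouse/test_orc_sqoop/part-m-00000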

New Contributor

If my Hive table is an external table located on HDFS, could this solution work? Thanks.
