Created 04-26-2016 02:37 PM
Hi Team,
How do I import Oracle tables into Hive in a compressed format via a Sqoop command?
Where should compression be configured, i.e., in MapReduce, HDFS, or Hive?
Should I configure compression for all three: MapReduce, HDFS, and Hive?
Which compression formats supported by Hive are best?
Created 04-26-2016 02:44 PM
Below is an example where I imported a table from MySQL into Hive in ORC format. You don't need to create the ORC-backed table in advance; the key is --hcatalog-storage-stanza.
sqoop import --connect "jdbc:mysql://sandbox.hortonworks.com/hive" --driver com.mysql.jdbc.Driver --username hive --password hive --table testtable --hcatalog-database default --hcatalog-table testtable --create-hcatalog-table --hcatalog-storage-stanza "stored as orcfile" -m 1
Created 04-27-2016 02:11 PM
Does Sqoop import support ORC files? Where is the compression parameter?
Created 04-27-2016 03:17 PM
The example above shows exactly that. --hcatalog-storage-stanza "stored as orcfile" will create a new Hive table through HCatalog, with the data stored as ORC. You can also create a Hive table in ORC format first and then use --hcatalog-table to write directly into it as ORC; see the sketch below.
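A minimal sketch of that second approach (the table name and columns are hypothetical; connection details reused from the example above). Because the table already exists with ORC storage, --create-hcatalog-table and the storage stanza are no longer needed:
hive> create table testtable (id int, name string) stored as orc;
sqoop import --connect "jdbc:mysql://sandbox.hortonworks.com/hive" \
--driver com.mysql.jdbc.Driver \
--username hive \
--password hive \
--table testtable \
--hcatalog-database default \
--hcatalog-table testtable \
-m 1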
Created 04-29-2016 06:10 AM
Does it compress the output?
Created 04-29-2016 01:17 PM
Yes. The output will be ORC file(s), and ORC compresses its data internally (ZLIB by default). You can check this with the ORC file dump utility, as sketched below.
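A quick verification sketch (the warehouse path and file name below are illustrative; substitute the actual location of one of your imported ORC files):
hive --orcfiledump /apps/hive/warehouse/testtable/part-m-00000
The dump prints the file metadata, including a "Compression:" line showing the codec in use.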
Created 05-06-2016 05:41 PM
Below is a full worked example:
Input:
mysql> select * from SERDES;
+----------+------+----------------------------------------------------+
| SERDE_ID | NAME | SLIB |
+----------+------+----------------------------------------------------+
| 56 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| 57 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde |
| 58 | NULL | NULL |
| 59 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde |
| 60 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde |
| 61 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde |
| 62 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
+----------+------+----------------------------------------------------+
7 rows in set (0.00 sec)
Command:
sqoop import --connect jdbc:mysql://test:3306/hive \
--username hive \
--password test \
--table SERDES \
--hcatalog-database test \
--hcatalog-table SERDES \
--create-hcatalog-table \
--hcatalog-storage-stanza "stored as orcfile" \
--outdir sqoop_import \
-m 1 \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--driver com.mysql.jdbc.Driver
Logs:
...
...
16/05/06 13:30:46 INFO hcat.SqoopHCatUtilities: HCatalog Create table statement:
create table `demand_db`.`serdes` (
`serde_id` bigint,
`name` varchar(128),
`slib` varchar(4000))
stored as orcfile
...
...
16/05/06 13:32:55 INFO mapreduce.Job: Job job_1462201699379_0089 running in uber mode : false
16/05/06 13:32:55 INFO mapreduce.Job: map 0% reduce 0%
16/05/06 13:33:07 INFO mapreduce.Job: map 100% reduce 0%
16/05/06 13:33:09 INFO mapreduce.Job: Job job_1462201699379_0089 completed successfully
16/05/06 13:33:09 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=297179
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=676
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=14484
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=7242
Total vcore-seconds taken by all map tasks=7242
Total megabyte-seconds taken by all map tasks=11123712
Map-Reduce Framework
Map input records=8
Map output records=8
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=92
CPU time spent (ms)=4620
Physical memory (bytes) snapshot=353759232
Virtual memory (bytes) snapshot=3276144640
Total committed heap usage (bytes)=175112192
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
16/05/06 13:33:09 INFO mapreduce.ImportJobBase: Transferred 676 bytes in 130.8366 seconds (5.1668 bytes/sec)
16/05/06 13:33:09 INFO mapreduce.ImportJobBase: Retrieved 8 records.
Output:
hive> select * from serdes;
OK
56 NULL org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
57 NULL org.apache.hadoop.hive.ql.io.orc.OrcSerde
58 NULL NULL
59 NULL org.apache.hadoop.hive.ql.io.orc.OrcSerde
60 NULL org.apache.hadoop.hive.ql.io.orc.OrcSerde
61 NULL org.apache.hadoop.hive.ql.io.orc.OrcSerde
62 NULL org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
63 NULL org.apache.hadoop.hive.ql.io.orc.OrcSerde
Time taken: 2.711 seconds, Fetched: 8 row(s)
hive>
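One caveat on the --compression-codec flag used above: as far as I can tell, for HCatalog imports into ORC the codec inside the ORC files is governed by the orc.compress table property (ZLIB by default), not by the MapReduce codec flag. So if you specifically want Snappy inside the ORC files, a storage stanza along these lines should do it (a sketch, reusing the command from the example above):
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")'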
Created 05-11-2016 04:00 AM
Did it work?
Created 07-24-2017 02:42 PM
For a comparison between compression formats, take a look at this link:
Created 08-18-2016 11:06 AM
A late answer but maybe it will help someone 🙂
I am just adding to what @Ravi Mutyala has mentioned:
sqoop import --null-string '\\N' \
--null-non-string '\\N' \
--hive-delims-replacement '\0D' \
--num-mappers 8 \
--validate \
--hcatalog-home /usr/hdp/current/hive-webhcat \
--hcatalog-database default \
--hcatalog-table Inactivity \
--create-hcatalog-table \
--hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="ZLIB")' \
--connect 'jdbc:sqlserver://<IP>;database=<db-name>' \
--username --password \
--table Inactivity \
-- --schema QlikView 2>&1 | tee -a log
Now if you describe the table :
0: jdbc:hive2://> describe formatted inactivity;
OK
16/08/18 11:23:25 [main]: WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
+-------------------------------+------------------------------------------------------------+------------+--+
|           col_name            |                         data_type                          |  comment   |
+-------------------------------+------------------------------------------------------------+------------+--+
| # col_name                    | data_type                                                  | comment    |
|                               | NULL                                                       | NULL       |
| period                        | int                                                        |            |
| vin                           | string                                                     |            |
| customerid                    | int                                                        |            |
| subscriberdealersisid         | string                                                     |            |
| subscriberdistributorsisid    | string                                                     |            |
| packagename                   | string                                                     |            |
| timemodify                    | string                                                     |            |
|                               | NULL                                                       | NULL       |
| # Detailed Table Information  | NULL                                                       | NULL       |
| Database:                     | default                                                    | NULL       |
| Owner:                        | hive                                                       | NULL       |
| CreateTime:                   | Thu Aug 18 11:20:28 CEST 2016                              | NULL       |
| LastAccessTime:               | UNKNOWN                                                    | NULL       |
| Protect Mode:                 | None                                                       | NULL       |
| Retention:                    | 0                                                          | NULL       |
| Location:                     | hdfs://l4283t.sss.com:8020/apps/hive/warehouse/inactivity  | NULL       |
| Table Type:                   | MANAGED_TABLE                                              | NULL       |
| Table Parameters:             | NULL                                                       | NULL       |
|                               | orc.compress                                               | ZLIB       |
|                               | transient_lastDdlTime                                      | 1471512028 |
|                               | NULL                                                       | NULL       |
| # Storage Information         | NULL                                                       | NULL       |
| SerDe Library:                | org.apache.hadoop.hive.ql.io.orc.OrcSerde                  | NULL       |
| InputFormat:                  | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat            | NULL       |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat           | NULL       |
| Compressed:                   | No                                                         | NULL       |
| Num Buckets:                  | -1                                                         | NULL       |
| Bucket Columns:               | []                                                         | NULL       |
| Sort Columns:                 | []                                                         | NULL       |
| Storage Desc Params:          | NULL                                                       | NULL       |
|                               | serialization.format                                       | 1          |
+-------------------------------+------------------------------------------------------------+------------+--+
33 rows selected (0.425 seconds)
To verify that the compression has really taken place, you can first import the table without any compression and execute
analyze table <table-name> compute statistics;
and note the 'totalSize'. Then repeat the import with compression and compare the 'totalSize' values; a concrete sketch follows below.
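Concretely, the comparison could look like this (a sketch; inactivity is the table from the example above):
0: jdbc:hive2://> analyze table inactivity compute statistics;
0: jdbc:hive2://> show tblproperties inactivity("totalSize");
Run those two statements once against the uncompressed import and once against the compressed one, and compare the reported totalSize values.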