
Sqoop Import to Hive with Compression

Expert Contributor

Hi Team,

How do I import Oracle tables into Hive in a compressed format via a Sqoop command?

Where should compression be configured, i.e. in MapReduce, HDFS, or Hive?

Should I configure compression for MapReduce, HDFS, and Hive?

Which compression formats are best supported by Hive?

1 ACCEPTED SOLUTION


@Nilesh

Below is a worked solution:

Input:

mysql> select * from SERDES;
+----------+------+----------------------------------------------------+
| SERDE_ID | NAME | SLIB                                               |
+----------+------+----------------------------------------------------+
|       56 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|       57 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde          |
|       58 | NULL | NULL                                               |
|       59 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde          |
|       60 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde          |
|       61 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde          |
|       62 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
+----------+------+----------------------------------------------------+
7 rows in set (0.00 sec)

Command:

sqoop import --connect jdbc:mysql://test:3306/hive \
  --username hive \
  --password test \
  --table SERDES \
  --hcatalog-database test \
  --hcatalog-table SERDES \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  --outdir sqoop_import \
  -m 1 \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --driver com.mysql.jdbc.Driver

Logs:

...
16/05/06 13:30:46 INFO hcat.SqoopHCatUtilities: HCatalog Create table statement:
create table `demand_db`.`serdes` (
        `serde_id` bigint,
        `name` varchar(128),
        `slib` varchar(4000))
stored as orcfile
...
16/05/06 13:32:55 INFO mapreduce.Job: Job job_1462201699379_0089 running in uber mode : false
16/05/06 13:32:55 INFO mapreduce.Job: map 0% reduce 0%
16/05/06 13:33:07 INFO mapreduce.Job: map 100% reduce 0%
16/05/06 13:33:09 INFO mapreduce.Job: Job job_1462201699379_0089 completed successfully
16/05/06 13:33:09 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=297179
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=87
        HDFS: Number of bytes written=676
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=14484
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=7242
        Total vcore-seconds taken by all map tasks=7242
        Total megabyte-seconds taken by all map tasks=11123712
    Map-Reduce Framework
        Map input records=8
        Map output records=8
        Input split bytes=87
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=92
        CPU time spent (ms)=4620
        Physical memory (bytes) snapshot=353759232
        Virtual memory (bytes) snapshot=3276144640
        Total committed heap usage (bytes)=175112192
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
16/05/06 13:33:09 INFO mapreduce.ImportJobBase: Transferred 676 bytes in 130.8366 seconds (5.1668 bytes/sec)
16/05/06 13:33:09 INFO mapreduce.ImportJobBase: Retrieved 8 records.

Output:

hive> select * from serdes;
OK
56      NULL    org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
57      NULL    org.apache.hadoop.hive.ql.io.orc.OrcSerde
58      NULL    NULL
59      NULL    org.apache.hadoop.hive.ql.io.orc.OrcSerde
60      NULL    org.apache.hadoop.hive.ql.io.orc.OrcSerde
61      NULL    org.apache.hadoop.hive.ql.io.orc.OrcSerde
62      NULL    org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
63      NULL    org.apache.hadoop.hive.ql.io.orc.OrcSerde
Time taken: 2.711 seconds, Fetched: 8 row(s)
hive>
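One note on where the compression is actually controlled here: with ORC storage, the compression inside the data files is governed by the table's orc.compress property (ZLIB by default). If you want the ORC files written with Snappy instead, a minimal variation of the command above is to declare it in the storage stanza; the tblproperties clause is standard Hive ORC DDL, and everything else is unchanged:

sqoop import --connect jdbc:mysql://test:3306/hive \
  --username hive \
  --password test \
  --table SERDES \
  --hcatalog-database test \
  --hcatalog-table SERDES \
  --create-hcatalog-table \
  --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
  -m 1 \
  --driver com.mysql.jdbc.Driver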


9 REPLIES

Guru

@Nilesh

Below is an example where I imported a table from MySQL into Hive in ORC format. You don't need to create the ORC-backed table in advance; the key is --hcatalog-storage-stanza.

sqoop import --connect "jdbc:mysql://sandbox.hortonworks.com/hive" \
  --driver com.mysql.jdbc.Driver \
  --username hive --password hive \
  --table testtable \
  --hcatalog-database default \
  --hcatalog-table testtable \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  -m 1
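If you want to confirm what the import created, a quick check from the Hive shell (table name taken from the command above) is:

hive> show create table testtable;
hive> describe formatted testtable;

Both should show org.apache.hadoop.hive.ql.io.orc.OrcSerde as the SerDe Library and the ORC input/output formats, similar to the describe output later in this thread.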

Expert Contributor

@Ravi Mutyala

Does Sqoop import support the ORC file format? Where is the compression parameter?

Guru

The example above shows that. --hcatalog-storage-stanza "stored as orcfile" will create a new Hive table through HCatalog, and the data is stored as ORC. You can also create a Hive table with ORC format in advance and then use --hcatalog-table to write directly into it as ORC.
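A rough sketch of that second approach, with a hypothetical table name and columns (connection details reused from the example above): create the ORC table in Hive first, then let Sqoop write into it through HCatalog, this time without --create-hcatalog-table:

hive> create table default.testtable_orc (id int, name string)
      stored as orc tblproperties ("orc.compress"="ZLIB");

sqoop import --connect "jdbc:mysql://sandbox.hortonworks.com/hive" \
  --driver com.mysql.jdbc.Driver \
  --username hive --password hive \
  --table testtable \
  --hcatalog-database default \
  --hcatalog-table testtable_orc \
  -m 1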

Expert Contributor
@Ravi Mutyala

Does it compress the output?

Guru

Yes. Output will be ORC file(s).
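If you want to see which codec the ORC files were actually written with, Hive ships an ORC file dump utility; the warehouse path and file name below are only examples, so list the table directory first to get the real ones:

hdfs dfs -ls /apps/hive/warehouse/testtable
hive --orcfiledump /apps/hive/warehouse/testtable/<file-name>

The dump prints the file metadata, including a Compression field (ZLIB unless configured otherwise).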



@Nilesh

Did it work?


For a comparison between compression formats, take a look at this link:

http://comphadoop.weebly.com/

Super Collaborator

A late answer but maybe it will help someone 🙂

I am just adding to what @Ravi Mutyala has mentioned:

sqoop import --null-string '\\N' --null-non-string '\\N' \
  --hive-delims-replacement '\0D' \
  --num-mappers 8 \
  --validate \
  --hcatalog-home /usr/hdp/current/hive-webhcat \
  --hcatalog-database default \
  --hcatalog-table Inactivity \
  --create-hcatalog-table \
  --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="ZLIB")' \
  --connect 'jdbc:sqlserver://<IP>;database=<db-name>' \
  --username --password \
  --table Inactivity \
  -- --schema QlikView 2>&1 | tee -a log

Now if you describe the table:

0: jdbc:hive2://> describe formatted inactivity;
OK
16/08/18 11:23:25 [main]: WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
+-------------------------------+----------------------------------------------------------------------+-----------------------+--+
|           col_name            |                              data_type                               |        comment        |
+-------------------------------+----------------------------------------------------------------------+-----------------------+--+
| # col_name                    | data_type                                                            | comment               |
|                               | NULL                                                                 | NULL                  |
| period                        | int                                                                  |                       |
| vin                           | string                                                               |                       |
| customerid                    | int                                                                  |                       |
| subscriberdealersisid         | string                                                               |                       |
| subscriberdistributorsisid    | string                                                               |                       |
| packagename                   | string                                                               |                       |
| timemodify                    | string                                                               |                       |
|                               | NULL                                                                 | NULL                  |
| # Detailed Table Information  | NULL                                                                 | NULL                  |
| Database:                     | default                                                              | NULL                  |
| Owner:                        | hive                                                                 | NULL                  |
| CreateTime:                   | Thu Aug 18 11:20:28 CEST 2016                                        | NULL                  |
| LastAccessTime:               | UNKNOWN                                                              | NULL                  |
| Protect Mode:                 | None                                                                 | NULL                  |
| Retention:                    | 0                                                                    | NULL                  |
| Location:                     | hdfs://l4283t.sss.com:8020/apps/hive/warehouse/inactivity  | NULL                  |
| Table Type:                   | MANAGED_TABLE                                                        | NULL                  |
| Table Parameters:             | NULL                                                                 | NULL                  |
|                               | orc.compress                                                         | ZLIB                  |
|                               | transient_lastDdlTime                                                | 1471512028            |
|                               | NULL                                                                 | NULL                  |
| # Storage Information         | NULL                                                                 | NULL                  |
| SerDe Library:                | org.apache.hadoop.hive.ql.io.orc.OrcSerde                            | NULL                  |
| InputFormat:                  | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat                      | NULL                  |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat                     | NULL                  |
| Compressed:                   | No                                                                   | NULL                  |
| Num Buckets:                  | -1                                                                   | NULL                  |
| Bucket Columns:               | []                                                                   | NULL                  |
| Sort Columns:                 | []                                                                   | NULL                  |
| Storage Desc Params:          | NULL                                                                 | NULL                  |
|                               | serialization.format                                                 | 1                     |
+-------------------------------+----------------------------------------------------------------------+-----------------------+--+
33 rows selected (0.425 seconds)

To verify if the compression has really taken place, you can first import the table without any compression and execute

analyze table <table-name> compute statistics;

and note the 'totalSize'. Then repeat the process with compression and compare the 'totalSize' values.
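For example, something along these lines (table name and location taken from the describe output above; totalSize appears under Table Parameters once statistics have been computed):

hive> analyze table inactivity compute statistics;
hive> describe formatted inactivity;

or simply compare the table directory size directly:

hdfs dfs -du -s -h /apps/hive/warehouse/inactivity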