<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Sqoop Import to Hive with Compression in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116836#M79630</link>
    <description>Cloudera Community Support Questions thread: how to import RDBMS tables into Hive in a compressed format with Sqoop.</description>
    <pubDate>Sat, 07 May 2016 00:41:10 GMT</pubDate>
    <dc:creator>rbiswas1</dc:creator>
    <dc:date>2016-05-07T00:41:10Z</dc:date>
    <item>
      <title>Sqoop Import to Hive with Compression</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116830#M79624</link>
      <description>&lt;P&gt;Hi Team,&lt;/P&gt;&lt;P&gt;How do I import Oracle tables into Hive in a compressed format via a Sqoop command?&lt;/P&gt;&lt;P&gt;Where should compression be configured, i.e., in MapReduce, HDFS, or Hive?&lt;/P&gt;&lt;P&gt;Should I configure compression for MapReduce, HDFS, and Hive?&lt;/P&gt;&lt;P&gt;Which are the best compression formats supported by Hive?&lt;/P&gt;</description>
      <pubDate>Tue, 26 Apr 2016 21:37:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116830#M79624</guid>
      <dc:creator>Nilesh</dc:creator>
      <dc:date>2016-04-26T21:37:01Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop Import to Hive with Compression</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116831#M79625</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1496/nileshpatil.html" nodeid="1496"&gt;@Nilesh&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Below is an example where I imported a table from MySQL to Hive in ORC format. You don't need to create the ORC-backed table in advance; the key is --hcatalog-storage-stanza.&lt;/P&gt;&lt;PRE&gt;sqoop import --connect "jdbc:mysql://sandbox.hortonworks.com/hive" --driver com.mysql.jdbc.Driver --username hive --password hive --table testtable --hcatalog-database default --hcatalog-table testtable --create-hcatalog-table --hcatalog-storage-stanza "stored as orcfile" -m 1&lt;/PRE&gt;</description>
      <pubDate>Tue, 26 Apr 2016 21:44:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116831#M79625</guid>
      <dc:creator>ravi1</dc:creator>
      <dc:date>2016-04-26T21:44:16Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop Import to Hive with Compression</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116832#M79626</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/216/ravi.html" nodeid="216"&gt;@Ravi Mutyala&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Does Sqoop import support ORC files? Where is the compression parameter?&lt;/P&gt;</description>
      <pubDate>Wed, 27 Apr 2016 21:11:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116832#M79626</guid>
      <dc:creator>Nilesh</dc:creator>
      <dc:date>2016-04-27T21:11:41Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop Import to Hive with Compression</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116833#M79627</link>
      <description>&lt;P&gt;The example above shows that. --hcatalog-storage-stanza "stored as orcfile" will create a new Hive table through HCatalog, and the data will be stored as ORC. You can also create a Hive table with ORC format in advance and then use --hcatalog-table to write directly as ORC.&lt;/P&gt;</description>
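The pre-created-table route mentioned above can be sketched as two commands. This is only an illustration: the table name, columns, and connection details are placeholders borrowed from the earlier example, not tested commands.

```shell
# Sketch of the alternative flow: create the ORC-backed Hive table first,
# then point Sqoop at it through HCatalog (no --create-hcatalog-table).
# Table name, columns, and connection details are illustrative placeholders.
hive -e "create table default.testtable (id int, name string) stored as orc;"

sqoop import \
  --connect "jdbc:mysql://sandbox.hortonworks.com/hive" \
  --driver com.mysql.jdbc.Driver \
  --username hive --password hive \
  --table testtable \
  --hcatalog-database default \
  --hcatalog-table testtable \
  -m 1
```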
      <pubDate>Wed, 27 Apr 2016 22:17:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116833#M79627</guid>
      <dc:creator>ravi1</dc:creator>
      <dc:date>2016-04-27T22:17:35Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop Import to Hive with Compression</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116834#M79628</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/216/ravi.html" nodeid="216"&gt;@Ravi Mutyala&lt;/A&gt;&lt;P&gt;Does it compress the output?&lt;/P&gt;</description>
      <pubDate>Fri, 29 Apr 2016 13:10:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116834#M79628</guid>
      <dc:creator>Nilesh</dc:creator>
      <dc:date>2016-04-29T13:10:06Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop Import to Hive with Compression</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116835#M79629</link>
      <description>&lt;P&gt;Yes. The output will be ORC file(s).&lt;/P&gt;</description>
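One hedged way to confirm this on a cluster is to inspect the files Sqoop wrote; the warehouse path and part-file name below are placeholders for illustration, assuming the default Hive warehouse location.

```shell
# Sketch: confirm that Sqoop's output files really are ORC.
# The warehouse path and part-file name are illustrative placeholders.
hdfs dfs -ls /apps/hive/warehouse/testtable
# orcfiledump prints the file's metadata, including the compression kind
# recorded in the ORC footer (ZLIB is ORC's default):
hive --orcfiledump /apps/hive/warehouse/testtable/part-m-00000
```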
      <pubDate>Fri, 29 Apr 2016 20:17:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116835#M79629</guid>
      <dc:creator>ravi1</dc:creator>
      <dc:date>2016-04-29T20:17:55Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop Import to Hive with Compression</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116836#M79630</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1496/nileshpatil.html" nodeid="1496"&gt;@Nilesh&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Below is the solution:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Input:&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;mysql&amp;gt; select * from SERDES;
+----------+------+----------------------------------------------------+
| SERDE_ID | NAME | SLIB                                               |
+----------+------+----------------------------------------------------+
|       56 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|       57 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde          |
|       58 | NULL | NULL                                               |
|       59 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde          |
|       60 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde          |
|       61 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde          |
|       62 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
+----------+------+----------------------------------------------------+
7 rows in set (0.00 sec)&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;Command:&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;sqoop import --connect jdbc:mysql://test:3306/hive \
  --username hive \
  --password test \
  --table SERDES \
  --hcatalog-database test \
  --hcatalog-table SERDES \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  --outdir sqoop_import \
  -m 1 \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --driver com.mysql.jdbc.Driver&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;Logs:&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;...
16/05/06 13:30:46 INFO hcat.SqoopHCatUtilities: HCatalog Create table statement:
create table `demand_db`.`serdes` (
  `serde_id` bigint,
  `name` varchar(128),
  `slib` varchar(4000))
stored as orcfile
...
16/05/06 13:32:55 INFO mapreduce.Job: Job job_1462201699379_0089 running in uber mode : false
16/05/06 13:32:55 INFO mapreduce.Job:  map 0% reduce 0%
16/05/06 13:33:07 INFO mapreduce.Job:  map 100% reduce 0%
16/05/06 13:33:09 INFO mapreduce.Job: Job job_1462201699379_0089 completed successfully
16/05/06 13:33:09 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=297179
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=87
        HDFS: Number of bytes written=676
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=14484
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=7242
        Total vcore-seconds taken by all map tasks=7242
        Total megabyte-seconds taken by all map tasks=11123712
    Map-Reduce Framework
        Map input records=8
        Map output records=8
        Input split bytes=87
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=92
        CPU time spent (ms)=4620
        Physical memory (bytes) snapshot=353759232
        Virtual memory (bytes) snapshot=3276144640
        Total committed heap usage (bytes)=175112192
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
16/05/06 13:33:09 INFO mapreduce.ImportJobBase: Transferred 676 bytes in 130.8366 seconds (5.1668 bytes/sec)
16/05/06 13:33:09 INFO mapreduce.ImportJobBase: Retrieved 8 records.&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;Output:&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;hive&amp;gt; select * from serdes;
OK
56    NULL    org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
57    NULL    org.apache.hadoop.hive.ql.io.orc.OrcSerde
58    NULL    NULL
59    NULL    org.apache.hadoop.hive.ql.io.orc.OrcSerde
60    NULL    org.apache.hadoop.hive.ql.io.orc.OrcSerde
61    NULL    org.apache.hadoop.hive.ql.io.orc.OrcSerde
62    NULL    org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
63    NULL    org.apache.hadoop.hive.ql.io.orc.OrcSerde
Time taken: 2.711 seconds, Fetched: 8 row(s)
hive&amp;gt;&lt;/PRE&gt;</description>
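If the goal is specifically Snappy-compressed ORC data, a variant of the command above can request the codec inside the ORC table properties via the storage stanza, the same way a later reply in this thread does with ZLIB. This is a sketch reusing the same placeholder connection details and table names, not a tested command.

```shell
# Sketch: ask ORC itself to compress with Snappy via table properties.
# Connection details and names are the same placeholders as above.
sqoop import --connect jdbc:mysql://test:3306/hive \
  --username hive --password test \
  --table SERDES \
  --hcatalog-database test \
  --hcatalog-table SERDES \
  --create-hcatalog-table \
  --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
  -m 1 \
  --driver com.mysql.jdbc.Driver
```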
      <pubDate>Sat, 07 May 2016 00:41:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116836#M79630</guid>
      <dc:creator>rbiswas1</dc:creator>
      <dc:date>2016-05-07T00:41:10Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop Import to Hive with Compression</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116837#M79631</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1496/nileshpatil.html" nodeid="1496"&gt;@Nilesh&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Did it work?&lt;/P&gt;</description>
      <pubDate>Wed, 11 May 2016 11:00:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116837#M79631</guid>
      <dc:creator>rbiswas1</dc:creator>
      <dc:date>2016-05-11T11:00:42Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop Import to Hive with Compression</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116838#M79632</link>
      <description>&lt;P&gt;A late answer, but maybe it will help someone &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I am just adding to what @&lt;A href="https://community.hortonworks.com/users/216/ravi.html"&gt;Ravi Mutyala&lt;/A&gt; has mentioned:&lt;/P&gt;&lt;PRE&gt;sqoop import --null-string '\\N' --null-non-string '\\N' --hive-delims-replacement '\0D' --num-mappers 8 --validate --hcatalog-home /usr/hdp/current/hive-webhcat --hcatalog-database default --hcatalog-table Inactivity --create-hcatalog-table --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="ZLIB")' --connect 'jdbc:sqlserver://&amp;lt;IP&amp;gt;;database=&amp;lt;db-name&amp;gt;' --username --password --table Inactivity -- --schema QlikView 2&amp;gt;&amp;amp;1 | tee -a log&lt;/PRE&gt;&lt;P&gt;Now if you describe the table:&lt;/P&gt;&lt;PRE&gt;0: jdbc:hive2://&amp;gt; describe formatted inactivity;
OK
16/08/18 11:23:25 [main]: WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
+-------------------------------+----------------------------------------------------------------------+-----------------------+--+
|           col_name            |                              data_type                               |        comment        |
+-------------------------------+----------------------------------------------------------------------+-----------------------+--+
| # col_name                    | data_type                                                            | comment               |
|                               | NULL                                                                 | NULL                  |
| period                        | int                                                                  |                       |
| vin                           | string                                                               |                       |
| customerid                    | int                                                                  |                       |
| subscriberdealersisid         | string                                                               |                       |
| subscriberdistributorsisid    | string                                                               |                       |
| packagename                   | string                                                               |                       |
| timemodify                    | string                                                               |                       |
|                               | NULL                                                                 | NULL                  |
| # Detailed Table Information  | NULL                                                                 | NULL                  |
| Database:                     | default                                                              | NULL                  |
| Owner:                        | hive                                                                 | NULL                  |
| CreateTime:                   | Thu Aug 18 11:20:28 CEST 2016                                        | NULL                  |
| LastAccessTime:               | UNKNOWN                                                              | NULL                  |
| Protect Mode:                 | None                                                                 | NULL                  |
| Retention:                    | 0                                                                    | NULL                  |
| Location:                     | hdfs://l4283t.sss.com:8020/apps/hive/warehouse/inactivity  | NULL                  |
| Table Type:                   | MANAGED_TABLE                                                        | NULL                  |
| Table Parameters:             | NULL                                                                 | NULL                  |
|                               | orc.compress                                                         | ZLIB                  |
|                               | transient_lastDdlTime                                                | 1471512028            |
|                               | NULL                                                                 | NULL                  |
| # Storage Information         | NULL                                                                 | NULL                  |
| SerDe Library:                | org.apache.hadoop.hive.ql.io.orc.OrcSerde                            | NULL                  |
| InputFormat:                  | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat                      | NULL                  |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat                     | NULL                  |
| Compressed:                   | No                                                                   | NULL                  |
| Num Buckets:                  | -1                                                                   | NULL                  |
| Bucket Columns:               | []                                                                   | NULL                  |
| Sort Columns:                 | []                                                                   | NULL                  |
| Storage Desc Params:          | NULL                                                                 | NULL                  |
|                               | serialization.format                                                 | 1                     |
+-------------------------------+----------------------------------------------------------------------+-----------------------+--+
33 rows selected (0.425 seconds)&lt;/PRE&gt;&lt;P&gt;To verify that the compression has really taken place, first import the table without any compression and execute&lt;/P&gt;&lt;PRE&gt;analyze table &amp;lt;table-name&amp;gt; compute statistics&lt;/PRE&gt;&lt;P&gt;and note the 'totalSize'. Then repeat the process with compression and compare the two 'totalSize' values.&lt;/P&gt;</description>
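The before/after comparison described above can be sketched as a short shell sequence; the table name is a placeholder and the commands assume a working Hive CLI on the cluster.

```shell
# Sketch: measure on-disk table size before and after re-importing with
# compression. 'inactivity' is a placeholder table name.
hive -e "analyze table inactivity compute statistics;"
# After the analyze, totalSize appears among the Table Parameters:
hive -e "describe formatted inactivity;" | grep -i totalSize
```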
      <pubDate>Thu, 18 Aug 2016 18:06:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116838#M79632</guid>
      <dc:creator>kaliyugantagoni</dc:creator>
      <dc:date>2016-08-18T18:06:20Z</dc:date>
    </item>
    <item>
      <title>Re: Sqoop Import to Hive with Compression</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116839#M79633</link>
      <description>&lt;P&gt;For a comparison between compression formats take a look at this link:&lt;/P&gt;&lt;P&gt;&lt;A href="http://comphadoop.weebly.com/" target="_blank"&gt;http://comphadoop.weebly.com/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 24 Jul 2017 21:42:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Sqoop-Import-to-Hive-with-Compression/m-p/116839#M79633</guid>
      <dc:creator>egarelnabi</dc:creator>
      <dc:date>2017-07-24T21:42:38Z</dc:date>
    </item>
  </channel>
</rss>

