
Sqoop Import to Hive with Compression

Solved

Contributor

Hi Team,

How do I import Oracle tables into Hive in a compressed format via a Sqoop command?

Where should compression be configured, i.e. in MapReduce, HDFS, or Hive?

Should I configure compression for MapReduce, HDFS, and Hive?

Which compression formats are best supported by Hive?

1 ACCEPTED SOLUTION


Re: Sqoop Import to Hive with Compression

@Nilesh

Here is a worked solution:

Input:

mysql> select * from SERDES;
+----------+------+----------------------------------------------------+
| SERDE_ID | NAME | SLIB                                               |
+----------+------+----------------------------------------------------+
|       56 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
|       57 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde          |
|       58 | NULL | NULL                                               |
|       59 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde          |
|       60 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde          |
|       61 | NULL | org.apache.hadoop.hive.ql.io.orc.OrcSerde          |
|       62 | NULL | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
+----------+------+----------------------------------------------------+
7 rows in set (0.00 sec)

Command:

sqoop import --connect jdbc:mysql://test:3306/hive \
  --username hive \
  --password test \
  --table SERDES \
  --hcatalog-database test \
  --hcatalog-table SERDES \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  --outdir sqoop_import \
  -m 1 \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --driver com.mysql.jdbc.Driver

Logs:

...
16/05/06 13:30:46 INFO hcat.SqoopHCatUtilities: HCatalog Create table statement:
create table `demand_db`.`serdes` (
  `serde_id` bigint,
  `name` varchar(128),
  `slib` varchar(4000))
stored as orcfile
...

16/05/06 13:32:55 INFO mapreduce.Job: Job job_1462201699379_0089 running in uber mode : false
16/05/06 13:32:55 INFO mapreduce.Job: map 0% reduce 0%
16/05/06 13:33:07 INFO mapreduce.Job: map 100% reduce 0%
16/05/06 13:33:09 INFO mapreduce.Job: Job job_1462201699379_0089 completed successfully
16/05/06 13:33:09 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=297179
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=87
        HDFS: Number of bytes written=676
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=14484
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=7242
        Total vcore-seconds taken by all map tasks=7242
        Total megabyte-seconds taken by all map tasks=11123712
    Map-Reduce Framework
        Map input records=8
        Map output records=8
        Input split bytes=87
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=92
        CPU time spent (ms)=4620
        Physical memory (bytes) snapshot=353759232
        Virtual memory (bytes) snapshot=3276144640
        Total committed heap usage (bytes)=175112192
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
16/05/06 13:33:09 INFO mapreduce.ImportJobBase: Transferred 676 bytes in 130.8366 seconds (5.1668 bytes/sec)
16/05/06 13:33:09 INFO mapreduce.ImportJobBase: Retrieved 8 records.

Output:

hive> select * from serdes;
OK
56    NULL    org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
57    NULL    org.apache.hadoop.hive.ql.io.orc.OrcSerde
58    NULL    NULL
59    NULL    org.apache.hadoop.hive.ql.io.orc.OrcSerde
60    NULL    org.apache.hadoop.hive.ql.io.orc.OrcSerde
61    NULL    org.apache.hadoop.hive.ql.io.orc.OrcSerde
62    NULL    org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
63    NULL    org.apache.hadoop.hive.ql.io.orc.OrcSerde
Time taken: 2.711 seconds, Fetched: 8 row(s)
hive>
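
With --hcatalog imports, the file format and its table properties come from the storage stanza, so you can also pin the ORC codec explicitly there instead of relying on --compression-codec. A minimal sketch of the same import with the codec named in the stanza (the Snappy choice here is an assumption; ZLIB is the ORC default):

# sketch: carry the ORC codec in the storage stanza itself
sqoop import --connect jdbc:mysql://test:3306/hive \
  --username hive \
  --password test \
  --table SERDES \
  --hcatalog-database test \
  --hcatalog-table SERDES \
  --create-hcatalog-table \
  --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")' \
  -m 1 \
  --driver com.mysql.jdbc.Driver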


9 REPLIES

Re: Sqoop Import to Hive with Compression

Guru

@Nilesh

Below is an example where I imported a table from MySQL into Hive in ORC format. You don't need to create the ORC-backed table in advance; the key is --hcatalog-storage-stanza.

sqoop import --connect "jdbc:mysql://sandbox.hortonworks.com/hive" --driver com.mysql.jdbc.Driver --username hive --password hive --table testtable --hcatalog-database default --hcatalog-table testtable  --create-hcatalog-table --hcatalog-storage-stanza "stored as orcfile"  -m 1

Re: Sqoop Import to Hive with Compression

Contributor

@Ravi Mutyala

Does Sqoop import support ORC files? Where is the compression parameter?


Re: Sqoop Import to Hive with Compression

Guru

The above example shows you that: --hcatalog-storage-stanza "stored as orcfile" will create a new Hive table through HCatalog, and the data is stored as ORC. You can also create a Hive table with ORC format yourself and then use --hcatalog-table to write directly to it as ORC, as in the sketch below.
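
A minimal sketch of that second route, reusing the column layout from the create-table statement in the logs above; the table name testtable_orc and the Snappy codec are assumptions:

-- pre-create the ORC-backed table with an explicit codec (name and codec are hypothetical)
CREATE TABLE default.testtable_orc (
  serde_id BIGINT,
  name     VARCHAR(128),
  slib     VARCHAR(4000))
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

Then point Sqoop at it, dropping --create-hcatalog-table since the table already exists:

sqoop import --connect "jdbc:mysql://sandbox.hortonworks.com/hive" \
  --driver com.mysql.jdbc.Driver --username hive --password hive \
  --table testtable \
  --hcatalog-database default --hcatalog-table testtable_orc -m 1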


Re: Sqoop Import to Hive with Compression

Contributor
@Ravi Mutyala

Does it compress the output?


Re: Sqoop Import to Hive with Compression

Guru

Yes. The output will be ORC file(s), and ORC data is compressed internally (ZLIB by default, unless orc.compress says otherwise).
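
One way to confirm it on disk is to dump the metadata of an imported ORC file and check the codec it reports; a sketch, with a hypothetical warehouse path and file name:

# dump ORC file metadata (path and file name are hypothetical; adjust to your warehouse)
hive --orcfiledump /apps/hive/warehouse/testtable/part-m-00000
# the dump header reports the codec, e.g. "Compression: ZLIB" or "Compression: SNAPPY"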



Re: Sqoop Import to Hive with Compression

@Nilesh

Did it work?

Re: Sqoop Import to Hive with Compression

For a comparison between compression formats, take a look at this link:

http://comphadoop.weebly.com/


Re: Sqoop Import to Hive with Compression

Expert Contributor

A late answer but maybe it will help someone :)

I am just adding to what @Ravi Mutyala has mentioned:

sqoop import \
  --null-string '\\N' --null-non-string '\\N' \
  --hive-delims-replacement '\0D' \
  --num-mappers 8 --validate \
  --hcatalog-home /usr/hdp/current/hive-webhcat \
  --hcatalog-database default --hcatalog-table Inactivity \
  --create-hcatalog-table \
  --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="ZLIB")' \
  --connect 'jdbc:sqlserver://<IP>;database=<db-name>' \
  --username --password --table Inactivity \
  -- --schema QlikView 2>&1 | tee -a log

Now if you describe the table:

0: jdbc:hive2://> describe formatted inactivity;
OK
16/08/18 11:23:25 [main]: WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
+-------------------------------+----------------------------------------------------------------------+-----------------------+--+
|           col_name            |                              data_type                               |        comment        |
+-------------------------------+----------------------------------------------------------------------+-----------------------+--+
| # col_name                    | data_type                                                            | comment               |
|                               | NULL                                                                 | NULL                  |
| period                        | int                                                                  |                       |
| vin                           | string                                                               |                       |
| customerid                    | int                                                                  |                       |
| subscriberdealersisid         | string                                                               |                       |
| subscriberdistributorsisid    | string                                                               |                       |
| packagename                   | string                                                               |                       |
| timemodify                    | string                                                               |                       |
|                               | NULL                                                                 | NULL                  |
| # Detailed Table Information  | NULL                                                                 | NULL                  |
| Database:                     | default                                                              | NULL                  |
| Owner:                        | hive                                                                 | NULL                  |
| CreateTime:                   | Thu Aug 18 11:20:28 CEST 2016                                        | NULL                  |
| LastAccessTime:               | UNKNOWN                                                              | NULL                  |
| Protect Mode:                 | None                                                                 | NULL                  |
| Retention:                    | 0                                                                    | NULL                  |
| Location:                     | hdfs://l4283t.sss.com:8020/apps/hive/warehouse/inactivity  | NULL                  |
| Table Type:                   | MANAGED_TABLE                                                        | NULL                  |
| Table Parameters:             | NULL                                                                 | NULL                  |
|                               | orc.compress                                                         | ZLIB                  |
|                               | transient_lastDdlTime                                                | 1471512028            |
|                               | NULL                                                                 | NULL                  |
| # Storage Information         | NULL                                                                 | NULL                  |
| SerDe Library:                | org.apache.hadoop.hive.ql.io.orc.OrcSerde                            | NULL                  |
| InputFormat:                  | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat                      | NULL                  |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat                     | NULL                  |
| Compressed:                   | No                                                                   | NULL                  |
| Num Buckets:                  | -1                                                                   | NULL                  |
| Bucket Columns:               | []                                                                   | NULL                  |
| Sort Columns:                 | []                                                                   | NULL                  |
| Storage Desc Params:          | NULL                                                                 | NULL                  |
|                               | serialization.format                                                 | 1                     |
+-------------------------------+----------------------------------------------------------------------+-----------------------+--+
33 rows selected (0.425 seconds)

To verify that the compression has really taken place, you can first import the table without any compression and execute

analyze table <table-name> compute statistics;

and note the 'totalSize' table parameter. Then repeat the import with compression and compare the 'totalSize'.
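
A sketch of that comparison in HiveQL, reusing the inactivity table from the example above:

-- run once against the uncompressed import, once against the ORC/ZLIB import
ANALYZE TABLE inactivity COMPUTE STATISTICS;
DESCRIBE FORMATTED inactivity;
-- compare the 'totalSize' entry under Table Parameters between the two runs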
