
Sqoop Import to Hive with Compression

Expert Contributor

Hi Team,

How can I import Oracle tables into Hive in a compressed format via a sqoop command?

Where should compression be configured, i.e. in MapReduce, HDFS, or Hive?

Should I configure compression for MapReduce, HDFS, and Hive separately?

Which of the compression formats supported by Hive works best?

9 REPLIES

Guru

@Nilesh

Below is an example where I imported a table from MySQL into Hive in ORC format. You don't need to create the ORC-backed table in advance; the key is --hcatalog-storage-stanza.

sqoop import --connect "jdbc:mysql://sandbox.hortonworks.com/hive" --driver com.mysql.jdbc.Driver --username hive --password hive --table testtable --hcatalog-database default --hcatalog-table testtable  --create-hcatalog-table --hcatalog-storage-stanza "stored as orcfile"  -m 1
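Since the original question was about Oracle rather than MySQL, the same import might look roughly like this. This is only a sketch: the host, port, SID, schema, credentials, and table name are placeholders, and the Oracle JDBC driver jar (ojdbc) must be available on the Sqoop classpath.

```shell
# Hypothetical Oracle variant of the MySQL example above.
# Host, port, SID, user, password, and table are placeholders.
sqoop import \
  --connect "jdbc:oracle:thin:@oraclehost:1521:ORCL" \
  --driver oracle.jdbc.OracleDriver \
  --username scott --password tiger \
  --table SCOTT.TESTTABLE \
  --hcatalog-database default --hcatalog-table testtable \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  -m 1
```

Note that Sqoop generally expects Oracle table and schema names in uppercase.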

Expert Contributor

@Ravi Mutyala

Does sqoop import support ORC files? Where is the compression parameter?

Guru

The example above shows that. --hcatalog-storage-stanza "stored as orcfile" will create a new Hive table through HCatalog, and the data is stored as ORC. You can also create a Hive table with the ORC format in advance and then use --hcatalog-table to write directly into it as ORC.
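The second approach mentioned here, pre-creating the ORC table and pointing --hcatalog-table at it, could be sketched like this. The table and column names are made up for illustration:

```shell
# 1) Pre-create an ORC-backed Hive table (columns are illustrative):
hive -e "create table default.testtable (id int, name string)
         stored as orc tblproperties ('orc.compress'='ZLIB');"

# 2) Import into it. No --create-hcatalog-table or storage stanza is
#    needed, since the existing table already defines ORC storage:
sqoop import \
  --connect "jdbc:mysql://sandbox.hortonworks.com/hive" \
  --driver com.mysql.jdbc.Driver \
  --username hive --password hive \
  --table testtable \
  --hcatalog-database default --hcatalog-table testtable \
  -m 1
```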

Expert Contributor
@Ravi Mutyala

Does it compress the output?

Guru

Yes. The output will be ORC file(s), and ORC compresses its data with ZLIB by default.
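If you want to confirm the codec inside the ORC files themselves, the ORC file dump utility shipped with Hive can be pointed at one of the imported files. The HDFS path below is a placeholder:

```shell
# Dump ORC metadata for one of the files Sqoop wrote; path is illustrative.
hive --orcfiledump /apps/hive/warehouse/testtable/part-m-00000
# The output includes a "Compression:" line showing the codec in use.
```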



@Nilesh

Did it work?


For a comparison of the compression formats, take a look at this link:

http://comphadoop.weebly.com/
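On the original "where is the compression parameter" question: when you are not using ORC (e.g. a plain-text import), Sqoop itself exposes compression flags that compress at the MapReduce output stage. A sketch, with the connection details and target directory as placeholders; the codec class is the standard Hadoop Snappy codec:

```shell
# Plain-text import, compressed with Snappy by the import job itself.
sqoop import \
  --connect "jdbc:mysql://sandbox.hortonworks.com/hive" \
  --username hive --password hive \
  --table testtable \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --target-dir /tmp/testtable_snappy \
  -m 1
```

With ORC, by contrast, compression is a property of the table's storage format, so nothing extra is needed on the Sqoop side.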

Super Collaborator

A late answer, but maybe it will help someone 🙂

I am just adding to what @Ravi Mutyala has mentioned:

sqoop import --null-string '\\N' --null-non-string '\\N' --hive-delims-replacement '\0D' --num-mappers 8 --validate --hcatalog-home /usr/hdp/current/hive-webhcat --hcatalog-database default --hcatalog-table Inactivity --create-hcatalog-table --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="ZLIB")' --connect 'jdbc:sqlserver://<IP>;database=<db-name>' --username --password --table Inactivity -- --schema QlikView 2>&1| tee -a log

Now if you describe the table :

0: jdbc:hive2://> describe formatted inactivity;
OK
16/08/18 11:23:25 [main]: WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
+-------------------------------+----------------------------------------------------------------------+-----------------------+--+
|           col_name            |                              data_type                               |        comment        |
+-------------------------------+----------------------------------------------------------------------+-----------------------+--+
| # col_name                    | data_type                                                            | comment               |
|                               | NULL                                                                 | NULL                  |
| period                        | int                                                                  |                       |
| vin                           | string                                                               |                       |
| customerid                    | int                                                                  |                       |
| subscriberdealersisid         | string                                                               |                       |
| subscriberdistributorsisid    | string                                                               |                       |
| packagename                   | string                                                               |                       |
| timemodify                    | string                                                               |                       |
|                               | NULL                                                                 | NULL                  |
| # Detailed Table Information  | NULL                                                                 | NULL                  |
| Database:                     | default                                                              | NULL                  |
| Owner:                        | hive                                                                 | NULL                  |
| CreateTime:                   | Thu Aug 18 11:20:28 CEST 2016                                        | NULL                  |
| LastAccessTime:               | UNKNOWN                                                              | NULL                  |
| Protect Mode:                 | None                                                                 | NULL                  |
| Retention:                    | 0                                                                    | NULL                  |
| Location:                     | hdfs://l4283t.sss.com:8020/apps/hive/warehouse/inactivity  | NULL                  |
| Table Type:                   | MANAGED_TABLE                                                        | NULL                  |
| Table Parameters:             | NULL                                                                 | NULL                  |
|                               | orc.compress                                                         | ZLIB                  |
|                               | transient_lastDdlTime                                                | 1471512028            |
|                               | NULL                                                                 | NULL                  |
| # Storage Information         | NULL                                                                 | NULL                  |
| SerDe Library:                | org.apache.hadoop.hive.ql.io.orc.OrcSerde                            | NULL                  |
| InputFormat:                  | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat                      | NULL                  |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat                     | NULL                  |
| Compressed:                   | No                                                                   | NULL                  |
| Num Buckets:                  | -1                                                                   | NULL                  |
| Bucket Columns:               | []                                                                   | NULL                  |
| Sort Columns:                 | []                                                                   | NULL                  |
| Storage Desc Params:          | NULL                                                                 | NULL                  |
|                               | serialization.format                                                 | 1                     |
+-------------------------------+----------------------------------------------------------------------+-----------------------+--+
33 rows selected (0.425 seconds)

To verify that the compression has really taken place, you can first import the table without any compression and execute

analyze table <table-name> compute statistics;

and note the 'totalSize'. Then repeat the process with compression and compare the 'totalSize' values.
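The size comparison described above could look like this; the table name matches the earlier example, and the output locations are illustrative:

```shell
# After each import, gather statistics and read totalSize back:
hive -e "analyze table inactivity compute statistics;"
hive -e "show tblproperties inactivity('totalSize');"

# Alternatively, compare the raw bytes on HDFS directly:
hdfs dfs -du -s -h /apps/hive/warehouse/inactivity
```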