Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Sqoop import to HCatalog/Hive : Compression dilemma

avatar
Super Collaborator

HDP-2.4.2.0-258 installed using Ambari 2.2.2.0

There are aplenty schema in SQL Server and Oracle DB that need to be imported to Hadoop, I have chosen the rdbms to HCatalog/Hive approach.

I am quite confused because of the following threads :

One downside to compressing tables imported into Hive is that many codecs cannot be split for processing by parallel map tasks. The lzop codec, however, does support splitting. When importing tables with this codec, Sqoop will automatically index the files for splitting and configuring a new Hive table with the correct InputFormat. This feature currently requires that all partitions of a table be compressed with the lzop codec.

Does that mean that gzip/zlib will cause performance/data integrity issues during Sqoop import AND subsequent processing?

The parameters are all placed in the TBLPROPERTIES (see Create Table). They are:

Key

Default

Notes

orc.bloom.filter.columns""comma separated list of column names for which bloom filter should be created
orc.bloom.filter.fpp0.05false positive probability for bloom filter (must >0.0 and <1.0)
orc.compressZLIBhigh level compression (one of NONE, ZLIB, SNAPPY)
  • I guess the default compression codec is gzip, I executed the following command(with both -z and --compress ways) :
sqoop import --null-string '\\N' --null-non-string '\\N' --hive-delims-replacement '\0D' --num-mappers 8 --validate --hcatalog-home /usr/hdp/current/hive-webhcat --hcatalog-database default --hcatalog-table Inactivity --create-hcatalog-table --hcatalog-storage-stanza "stored as orcfile" -z --connect 'jdbc:sqlserver://<IP>;database=VehicleDriverServicesFollowUp' --username  --password --table Inactivity -- --schema QlikView 2>&1| tee -a log

but the ORC table says compression : NO (am I missing/misinterpreting something or is some lib. missing, I didn't get any exception/error) :

hive>
    >
    > describe formatted inactivity;
OK
# col_name              data_type               comment
period                  int
vin                     string
customerid              int
subscriberdealersisid   string
subscriberdistributorsisid      string
packagename             string
timemodify              string
# Detailed Table Information
Database:               default
Owner:                  hive
CreateTime:             Tue Aug 16 17:34:36 CEST 2016
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://l4283t.sss.com:8020/apps/hive/warehouse/inactivity
Table Type:             MANAGED_TABLE
Table Parameters:
        transient_lastDdlTime   1471361676
# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.395 seconds, Fetched: 32 row(s)
hive>

How do I specify this Zlib during the import command ? Is it the case that I have to pre-create that tables in Hive to use Zlib ?

1 ACCEPTED SOLUTION

avatar
Expert Contributor

You can additional options to the --storage-stanza option. The storage stanza is just what gets appended to the create table statement and you can add syntactically valid options (like tblproperties)

View solution in original post

5 REPLIES 5

avatar
Master Guru

i would pre-create the table with ORC an ZLib

avatar
Super Collaborator

I agree that's one way but that also means that if there are 100s of tables, one has to either manually pre-create those or execute some sqoop script to do so which means total two scripts to import one table, is there another way ?

avatar
Expert Contributor

You can additional options to the --storage-stanza option. The storage stanza is just what gets appended to the create table statement and you can add syntactically valid options (like tblproperties)

avatar
Super Collaborator

@vranganathan

I already did that now only one confusion remains - is the compression taking place as expected ?

avatar
Expert Contributor

without compression

[numFiles=8, numRows=6547431, totalSize=66551787, rawDataSize=3154024078]

with zlib

[numFiles=8, numRows=6547431, totalSize=44046849, rawDataSize=3154024078]

As you can see, the totalSize is less with zlib.