Created 08-16-2016 03:52 PM
HDP-2.4.2.0-258 installed using Ambari 2.2.2.0
There are plenty of schemas in SQL Server and Oracle databases that need to be imported into Hadoop, and I have chosen the RDBMS-to-HCatalog/Hive approach.
I am quite confused because of the following threads:
One downside to compressing tables imported into Hive is that many codecs cannot be split for processing by parallel map tasks. The lzop codec, however, does support splitting. When importing tables with this codec, Sqoop will automatically index the files for splitting and configuring a new Hive table with the correct InputFormat. This feature currently requires that all partitions of a table be compressed with the lzop codec.
Does that mean that gzip/zlib will cause performance/data integrity issues during Sqoop import AND subsequent processing?
The parameters are all placed in the TBLPROPERTIES (see Create Table). They are:
| Key | Default | Notes |
|---|---|---|
| orc.bloom.filter.columns | "" | comma-separated list of column names for which a bloom filter should be created |
| orc.bloom.filter.fpp | 0.05 | false positive probability for bloom filter (must be >0.0 and <1.0) |
| orc.compress | ZLIB | high-level compression (one of NONE, ZLIB, SNAPPY) |
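Per the table above, the codec is set through TBLPROPERTIES on the table itself. A minimal sketch of what that looks like in Hive DDL (the table and column names here are illustrative, not from the thread):

```sql
-- Hypothetical example: orc.compress is the property from the table above
CREATE TABLE example_orc (
  id INT,
  payload STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");
```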
sqoop import \
  --null-string '\\N' --null-non-string '\\N' \
  --hive-delims-replacement '\0D' \
  --num-mappers 8 --validate \
  --hcatalog-home /usr/hdp/current/hive-webhcat \
  --hcatalog-database default --hcatalog-table Inactivity \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  -z \
  --connect 'jdbc:sqlserver://<IP>;database=VehicleDriverServicesFollowUp' \
  --username --password \
  --table Inactivity \
  -- --schema QlikView 2>&1 | tee -a log
but the ORC table shows Compressed: No (am I missing or misinterpreting something, or is some library missing? I didn't get any exception or error):
hive> describe formatted inactivity;
OK
# col_name                  data_type       comment
period                      int
vin                         string
customerid                  int
subscriberdealersisid       string
subscriberdistributorsisid  string
packagename                 string
timemodify                  string

# Detailed Table Information
Database:        default
Owner:           hive
CreateTime:      Tue Aug 16 17:34:36 CEST 2016
LastAccessTime:  UNKNOWN
Protect Mode:    None
Retention:       0
Location:        hdfs://l4283t.sss.com:8020/apps/hive/warehouse/inactivity
Table Type:      MANAGED_TABLE
Table Parameters:
    transient_lastDdlTime  1471361676

# Storage Information
SerDe Library:   org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:     org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:    org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:      No
Num Buckets:     -1
Bucket Columns:  []
Sort Columns:    []
Storage Desc Params:
    serialization.format  1
Time taken: 0.395 seconds, Fetched: 32 row(s)
hive>
How do I specify ZLIB during the import command? Or is it the case that I have to pre-create the tables in Hive to use ZLIB?
Created 08-16-2016 04:46 PM
I would pre-create the table with ORC and ZLIB.
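A minimal sketch of that pre-create approach (column names taken from the describe output in the question; types are a best guess): create the table first, then run the Sqoop import against it without --create-hcatalog-table.

```sql
CREATE TABLE inactivity (
  period INT,
  vin STRING,
  customerid INT,
  subscriberdealersisid STRING,
  subscriberdistributorsisid STRING,
  packagename STRING,
  timemodify STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");
```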
Created 08-16-2016 05:50 PM
I agree that's one way, but it also means that if there are hundreds of tables, one has to either pre-create them manually or run some script to do so, which means two scripts in total to import one table. Is there another way?
Created 09-02-2016 01:43 PM
You can add additional options to the --hcatalog-storage-stanza option. The storage stanza is just what gets appended to the generated CREATE TABLE statement, so you can add any syntactically valid options (like TBLPROPERTIES).
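For example, an untested sketch of the import command from the question with the compression property moved into the stanza (since the stanza is appended verbatim to the CREATE TABLE, it must be valid HiveQL; most flags from the original command are omitted here for brevity):

```
sqoop import \
  --hcatalog-database default --hcatalog-table Inactivity \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orc tblproperties ('orc.compress'='ZLIB')" \
  --connect 'jdbc:sqlserver://<IP>;database=VehicleDriverServicesFollowUp' \
  --username --password \
  --table Inactivity
```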
Created 09-05-2016 10:03 AM
I already did that. Now only one confusion remains: is the compression taking place as expected?
Created 09-06-2016 05:14 AM
without compression
[numFiles=8, numRows=6547431, totalSize=66551787, rawDataSize=3154024078]
with zlib
[numFiles=8, numRows=6547431, totalSize=44046849, rawDataSize=3154024078]
As you can see, totalSize is smaller with ZLIB while numRows and rawDataSize are identical, so the compression is taking effect.
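A quick sanity check on those numbers (both totalSize values are copied from the stats above): the ZLIB files come out at roughly two thirds of the uncompressed size.

```shell
# totalSize values from the table stats above
uncompressed=66551787
zlib=44046849
awk -v a="$zlib" -v b="$uncompressed" \
  'BEGIN { printf "ratio=%.2f saved=%.0f%%\n", a/b, (1 - a/b) * 100 }'
# prints: ratio=0.66 saved=34%
```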