Sqoop import to HCatalog/Hive: Compression dilemma
Labels: Apache Hive, Apache Sqoop
Created 08-16-2016 03:52 PM
HDP-2.4.2.0-258 installed using Ambari 2.2.2.0
There are plenty of schemas in SQL Server and Oracle databases that need to be imported into Hadoop, and I have chosen the RDBMS-to-HCatalog/Hive approach.
I am quite confused because of the following:
- As per the Sqoop 1.4.6 documentation:
> One downside to compressing tables imported into Hive is that many codecs cannot be split for processing by parallel map tasks. The lzop codec, however, does support splitting. When importing tables with this codec, Sqoop will automatically index the files for splitting and configuring a new Hive table with the correct InputFormat. This feature currently requires that all partitions of a table be compressed with the lzop codec.
Does that mean that gzip/zlib will cause performance/data integrity issues during Sqoop import AND subsequent processing?
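For reference, that caveat is about file-level codecs on text/SequenceFile imports, where Sqoop compresses the output files itself; the downside is loss of downstream map-side parallelism, not data corruption. When importing into an ORC-backed HCatalog table, compression happens inside the ORC writer and the files stay splittable either way. A minimal sketch of the kind of plain-HDFS import the quote is about, with the splittable lzop codec; the connection values are placeholders, and the LzopCodec class assumes the hadoop-lzo package is installed on the cluster:

```bash
# Sketch only: a plain-HDFS text import, where the file-level codec choice matters.
# <IP>, <user>, <password> and /data/inactivity are placeholders, and
# com.hadoop.compression.lzo.LzopCodec assumes the hadoop-lzo package is installed.
sqoop import \
  --connect 'jdbc:sqlserver://<IP>;database=VehicleDriverServicesFollowUp' \
  --username <user> --password <password> \
  --table Inactivity \
  --target-dir /data/inactivity \
  --compress \
  --compression-codec com.hadoop.compression.lzo.LzopCodec
```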
- The following from the Hive documentation confused me:
The parameters are all placed in the TBLPROPERTIES (see Create Table). They are:
| Key | Default | Notes |
|---|---|---|
| orc.bloom.filter.columns | "" | comma separated list of column names for which bloom filter should be created |
| orc.bloom.filter.fpp | 0.05 | false positive probability for bloom filter (must >0.0 and <1.0) |
| orc.compress | ZLIB | high level compression (one of NONE, ZLIB, SNAPPY) |
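To make the orc.compress row concrete, a minimal sketch of a DDL that sets it at create time; the table name orc_zlib_demo and its columns are made up for illustration:

```bash
# Sketch only: orc.compress is set via TBLPROPERTIES when the table is created.
# The table and columns here are hypothetical.
hive -e "
CREATE TABLE orc_zlib_demo (id INT, payload STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');
"
```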
- I guess the default compression codec is gzip. I executed the following command (with both the -z and --compress forms):
```bash
sqoop import --null-string '\\N' --null-non-string '\\N' \
  --hive-delims-replacement '\0D' --num-mappers 8 --validate \
  --hcatalog-home /usr/hdp/current/hive-webhcat --hcatalog-database default \
  --hcatalog-table Inactivity --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" -z \
  --connect 'jdbc:sqlserver://<IP>;database=VehicleDriverServicesFollowUp' \
  --username --password --table Inactivity -- --schema QlikView 2>&1 | tee -a log
```
but the ORC table says compression: No (am I missing or misinterpreting something, or is some library missing? I didn't get any exception or error):
```
hive> describe formatted inactivity;
OK
# col_name                  data_type   comment
period                      int
vin                         string
customerid                  int
subscriberdealersisid       string
subscriberdistributorsisid  string
packagename                 string
timemodify                  string

# Detailed Table Information
Database:                   default
Owner:                      hive
CreateTime:                 Tue Aug 16 17:34:36 CEST 2016
LastAccessTime:             UNKNOWN
Protect Mode:               None
Retention:                  0
Location:                   hdfs://l4283t.sss.com:8020/apps/hive/warehouse/inactivity
Table Type:                 MANAGED_TABLE
Table Parameters:
	transient_lastDdlTime	1471361676

# Storage Information
SerDe Library:              org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:                org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:               org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:                 No
Num Buckets:                -1
Bucket Columns:             []
Sort Columns:               []
Storage Desc Params:
	serialization.format	1
Time taken: 0.395 seconds, Fetched: 32 row(s)
```
- As per this existing thread, for Hive, ORC + Zlib should be used.
How do I specify Zlib during the import command? Do I have to pre-create the tables in Hive to use Zlib?
Created 08-16-2016 04:46 PM
I would pre-create the table with ORC and Zlib.
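A minimal sketch of that approach, reusing the columns from the describe output above and placeholder connection values; the import then drops --create-hcatalog-table and --hcatalog-storage-stanza:

```bash
# Sketch only: pre-create the ORC table with explicit Zlib compression,
# then let Sqoop load into the existing table.

# 1) Create the table up front.
hive -e "
CREATE TABLE default.inactivity (
  period INT,
  vin STRING,
  customerid INT,
  subscriberdealersisid STRING,
  subscriberdistributorsisid STRING,
  packagename STRING,
  timemodify STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');
"

# 2) Import into the existing table; the remaining options (--null-string,
#    --validate, etc.) stay as in the original command.
sqoop import \
  --connect 'jdbc:sqlserver://<IP>;database=VehicleDriverServicesFollowUp' \
  --username <user> --password <password> \
  --table Inactivity \
  --hcatalog-database default \
  --hcatalog-table inactivity \
  --num-mappers 8
```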
Created 08-16-2016 05:50 PM
I agree that's one way, but it also means that if there are hundreds of tables, one has to either pre-create them manually or run a script to do so, which means two scripts in total to import one table. Is there another way?
Created 09-02-2016 01:43 PM
You can add additional options to the --hcatalog-storage-stanza option. The storage stanza is just what gets appended to the CREATE TABLE statement, and you can add any syntactically valid options (like TBLPROPERTIES).
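Concretely, that would look something like this; connection values are placeholders, and the tblproperties clause rides along in the storage stanza:

```bash
# Sketch only: let Sqoop create the table, but append TBLPROPERTIES to the
# storage stanza so the new table carries explicit Zlib compression.
sqoop import \
  --connect 'jdbc:sqlserver://<IP>;database=VehicleDriverServicesFollowUp' \
  --username <user> --password <password> \
  --table Inactivity \
  --hcatalog-database default \
  --hcatalog-table inactivity \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile tblproperties ('orc.compress'='ZLIB')" \
  --num-mappers 8
```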
Created 09-05-2016 10:03 AM
I already did that; now only one confusion remains: is the compression taking place as expected?
Created 09-06-2016 05:14 AM
Without compression:
```
[numFiles=8, numRows=6547431, totalSize=66551787, rawDataSize=3154024078]
```
With Zlib:
```
[numFiles=8, numRows=6547431, totalSize=44046849, rawDataSize=3154024078]
```
As you can see, the totalSize is smaller with Zlib.
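For anyone wanting to confirm this directly, a short sketch; the file path is hypothetical, so point it at one of the table's actual files under the warehouse directory:

```bash
# Sketch only: two ways to check that the ORC data is Zlib-compressed.

# 1) Dump ORC metadata for one file and look for "Compression: ZLIB".
hive --orcfiledump /apps/hive/warehouse/inactivity/<some-orc-file>

# 2) Check the properties recorded in the metastore.
hive -e "SHOW TBLPROPERTIES inactivity;"
```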