Created 08-16-2016 03:52 PM
HDP-2.4.2.0-258 installed using Ambari 2.2.2.0
There are plenty of schemas in SQL Server and Oracle databases that need to be imported into Hadoop, and I have chosen the RDBMS-to-HCatalog/Hive approach.
I am quite confused because of the following threads:
One downside to compressing tables imported into Hive is that many codecs cannot be split for processing by parallel map tasks. The lzop codec, however, does support splitting. When importing tables with this codec, Sqoop will automatically index the files for splitting and configuring a new Hive table with the correct InputFormat. This feature currently requires that all partitions of a table be compressed with the lzop codec.
Does that mean that gzip/zlib will cause performance/data integrity issues during Sqoop import AND subsequent processing?
The parameters are all placed in the TBLPROPERTIES (see Create Table). They are:
| Key | Default | Notes |
|---|---|---|
| orc.bloom.filter.columns | "" | comma-separated list of column names for which a bloom filter should be created |
| orc.bloom.filter.fpp | 0.05 | false positive probability for bloom filter (must be >0.0 and <1.0) |
| orc.compress | ZLIB | high-level compression (one of NONE, ZLIB, SNAPPY) |
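Per the table above, the codec is set through TBLPROPERTIES on the table itself. A minimal sketch of what that looks like in Hive DDL (the table and column names here are illustrative, not from the thread):

```sql
-- Hypothetical example: orc.compress is the property from the table above
CREATE TABLE example_orc (
  id INT,
  payload STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");
```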
sqoop import \
  --null-string '\\N' --null-non-string '\\N' \
  --hive-delims-replacement '\0D' \
  --num-mappers 8 --validate \
  --hcatalog-home /usr/hdp/current/hive-webhcat \
  --hcatalog-database default --hcatalog-table Inactivity \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  -z \
  --connect 'jdbc:sqlserver://<IP>;database=VehicleDriverServicesFollowUp' \
  --username --password \
  --table Inactivity \
  -- --schema QlikView 2>&1 | tee -a log
but the ORC table shows Compressed: No (am I missing or misinterpreting something, or is some library missing? I didn't get any exception or error):
hive> describe formatted inactivity;
OK
# col_name                  data_type       comment
period                      int
vin                         string
customerid                  int
subscriberdealersisid       string
subscriberdistributorsisid  string
packagename                 string
timemodify                  string

# Detailed Table Information
Database:        default
Owner:           hive
CreateTime:      Tue Aug 16 17:34:36 CEST 2016
LastAccessTime:  UNKNOWN
Protect Mode:    None
Retention:       0
Location:        hdfs://l4283t.sss.com:8020/apps/hive/warehouse/inactivity
Table Type:      MANAGED_TABLE
Table Parameters:
    transient_lastDdlTime  1471361676

# Storage Information
SerDe Library:   org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:     org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:    org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:      No
Num Buckets:     -1
Bucket Columns:  []
Sort Columns:    []
Storage Desc Params:
    serialization.format  1
Time taken: 0.395 seconds, Fetched: 32 row(s)
hive>
How do I specify ZLIB during the import command? Or is it the case that I have to pre-create the tables in Hive to use ZLIB?
Created 08-16-2016 04:46 PM
I would pre-create the table with ORC and ZLIB.
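A minimal sketch of that pre-create approach (column names taken from the describe output in the question; types are a best guess): create the table first, then run the Sqoop import against it without --create-hcatalog-table.

```sql
CREATE TABLE inactivity (
  period INT,
  vin STRING,
  customerid INT,
  subscriberdealersisid STRING,
  subscriberdistributorsisid STRING,
  packagename STRING,
  timemodify STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "ZLIB");
```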
Created 08-16-2016 05:50 PM
I agree that's one way, but it also means that if there are hundreds of tables, one has to either pre-create them manually or run some script to do so, which means two scripts in total to import one table. Is there another way?
Created 09-02-2016 01:43 PM
You can add additional options to the --hcatalog-storage-stanza option. The storage stanza is just what gets appended to the generated CREATE TABLE statement, so you can add any syntactically valid options (like TBLPROPERTIES).
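For example, an untested sketch of the import command from the question with the compression property moved into the stanza (since the stanza is appended verbatim to the CREATE TABLE, it must be valid HiveQL; most flags from the original command are omitted here for brevity):

```
sqoop import \
  --hcatalog-database default --hcatalog-table Inactivity \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orc tblproperties ('orc.compress'='ZLIB')" \
  --connect 'jdbc:sqlserver://<IP>;database=VehicleDriverServicesFollowUp' \
  --username --password \
  --table Inactivity
```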
Created 09-05-2016 10:03 AM
I already did that. Now only one confusion remains: is the compression taking place as expected?
Created 09-06-2016 05:14 AM
without compression
[numFiles=8, numRows=6547431, totalSize=66551787, rawDataSize=3154024078]
with zlib
[numFiles=8, numRows=6547431, totalSize=44046849, rawDataSize=3154024078]
As you can see, totalSize is smaller with ZLIB while numRows and rawDataSize are identical, so the compression is taking effect.
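A quick sanity check on those numbers (both totalSize values are copied from the stats above): the ZLIB files come out at roughly two thirds of the uncompressed size.

```shell
# totalSize values from the table stats above
uncompressed=66551787
zlib=44046849
awk -v a="$zlib" -v b="$uncompressed" \
  'BEGIN { printf "ratio=%.2f saved=%.0f%%\n", a/b, (1 - a/b) * 100 }'
# prints: ratio=0.66 saved=34%
```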