Created 06-29-2017 08:29 PM
I imported the same table twice, once compressed and once uncompressed. Comparing the two, I have four questions, marked 1, 2, 3, 4 (please see below).
parameter used: --hcatalog-storage-stanza "stored as orcfile"
Location: hdfs://hdfs-ha/apps/hive/warehouse/pa_lane_txn_orc
Table Type: MANAGED_TABLE
Table Parameters:
    numFiles 4
    numRows 0 <<<< 1) number of rows shown as zero?
    rawDataSize 0 <<<< 2) rawDataSize shown as zero?
    totalSize 205994912 <<<< 3) totalSize is less than the compressed?
    transient_lastDdlTime 1498767240
Compressed: No
parameter used: --hcatalog-storage-stanza 'stored as orc tblproperties ("orc.compress"="SNAPPY")'
Location: hdfs://hdfs-ha/apps/hive/warehouse/pa_lane_txn_orc
Table Type: MANAGED_TABLE
Table Parameters:
    COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
    numFiles 4
    numRows 9999999
    orc.compress SNAPPY
    rawDataSize 32315364706
    totalSize 318486342
    transient_lastDdlTime 1498766230
Compressed: No <<<< 4) compressed flag is showing No even though it's SNAPPY compressed?
Created 06-29-2017 08:45 PM
I then imported the table again in ORC Snappy-compressed form, and this time it shows numRows and rawDataSize as zero as well?
Location: hdfs://hdfs-ha/apps/hive/warehouse/pa_lane_txn_orc
Table Type: MANAGED_TABLE
Table Parameters:
    numFiles 4
    numRows 0
    orc.compress SNAPPY
    rawDataSize 0
    totalSize 318486342
    transient_lastDdlTime 1498768617
Created 06-30-2017 08:22 PM
1 & 2) Try running "analyze table pa_lane_txn_orc compute statistics" to generate the row count and data size statistics.
https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ExistingTables%E2%80%93ANALYZE
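3) ORC files are compressed with zlib by default, so your first import ("stored as orcfile") was not actually uncompressed. zlib offers a higher level of compression than snappy, which is why its totalSize is smaller. If you don't want compression, you have to set orc.compress to "NONE".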
4) I believe this flag refers to the Hive compression feature, not to ORC's own compression. Text files can be gzipped or bzipped and still be read by Hive.
https://cwiki.apache.org/confluence/display/Hive/CompressedStorage
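To put numbers on point 3, here is a rough sketch that compares the sizes quoted in the question. The byte counts are copied from the table parameters above; the variable names and the `ratio` helper are only illustrative, and I am assuming rawDataSize is Hive's estimate of the uncompressed data size.

```python
# Sizes in bytes, copied from the table parameters quoted in the question.
default_orc_total = 205_994_912    # import with "stored as orcfile" (ORC default codec)
snappy_orc_total = 318_486_342     # import with orc.compress = SNAPPY
snappy_raw_size = 32_315_364_706   # rawDataSize reported for the SNAPPY table


def ratio(numerator, denominator):
    """Return how many times larger the numerator is than the denominator."""
    return numerator / denominator


# How much bigger the SNAPPY files are than the default-codec files.
snappy_vs_default = ratio(snappy_orc_total, default_orc_total)

# How much bigger the reported raw data size is than the SNAPPY files on disk
# (assumption: rawDataSize estimates the uncompressed size).
raw_vs_snappy = ratio(snappy_raw_size, snappy_orc_total)

print(f"SNAPPY totalSize / default totalSize: {snappy_vs_default:.2f}")
print(f"rawDataSize / SNAPPY totalSize: {raw_vs_snappy:.1f}")
```

The first ratio comes out above 1, which is consistent with the first import already being compressed with the default codec rather than being uncompressed.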