Created 02-21-2017 11:27 AM
Hello,
I want to create a compressed Avro-backed Hive table and load data into it.
The flow is as follows:
CREATE TABLE IF NOT EXISTS events (...) STORED AS AVRO LOCATION '...';
INSERT OVERWRITE TABLE events SELECT ... FROM other_table;
Then if I DESCRIBE FORMATTED the table, I see
Compressed: no
As far as I understand, to get compressed data I should just add the following before all statements:
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
But this does not change anything.
I also tried adding:
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
and even a TBLPROPERTIES clause:
TBLPROPERTIES("avro.output.codec"="snappy")
To no avail.
Could anybody point me to what I am missing?
I am on HDP 2.5.3, LLAP not enabled; the commands are run via beeline from a script file.
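For completeness, the whole sequence as it appears in the script (column list and paths elided as above):

SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;
CREATE TABLE IF NOT EXISTS events (...) STORED AS AVRO LOCATION '...';
INSERT OVERWRITE TABLE events SELECT ... FROM other_table;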
Created 02-21-2017 09:18 PM
Your steps are correct.
Please be advised that the Compressed field in your DESCRIBE FORMATTED output is not a reliable indicator of whether the table contains compressed data. It typically shows No because the compression settings apply only during the session that loads the data and are not stored persistently with the table metadata. The Compressed field may also refer to input or intermediate compression rather than output compression.
Look at the actual files as they are stored for the Hive table in question.
***
If this cleared up the issue, please vote and accept it as the best answer.
Created 02-21-2017 09:24 PM
To check HDFS, run something like this:
dfs -lsr hdfs://localhost:9000/user/hive/warehouse/events;
Replace the host with your own.
The file extension will tell you whether the data is compressed.
Created 02-22-2017 06:33 AM
Although there is no extension, looking at the metadata of the Avro file I can see that it is indeed compressed.
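For anyone checking the same thing: one way to read that metadata is with the avro-tools jar (the jar version and the data file name below are assumptions; adjust to whatever the dfs listing actually shows):

# pull one of the table's data files out of HDFS (file name is illustrative)
hdfs dfs -get hdfs://localhost:9000/user/hive/warehouse/events/000000_0 .
# getmeta prints the Avro header metadata; avro.codec reads snappy when the file is compressed
java -jar avro-tools-1.8.2.jar getmeta 000000_0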
This brings 2 questions to mind:
- If I load data in many sessions, some with compression and some without, I would end up with a set of files in the HDFS directory, some compressed and some not. Is that correct?
- Is there a way to set the compression parameters globally in Hive, so that I do not have to give them explicitly in each session?
Thanks,
Created 02-22-2017 02:22 PM
I can at least confirm that setting the following in hive-site:
"hive.exec.compress.output" : "true"
"hive.exec.compress.intermediate" : "true"
"avro.output.codec" : "snappy"
is enough to have compression applied globally.
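For reference, expressed directly in hive-site.xml those settings would look like this (a sketch; on HDP they are typically managed through Ambari):

<property>
  <name>hive.exec.compress.output</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
</property>
<property>
  <name>avro.output.codec</name>
  <value>snappy</value>
</property>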
Created 06-07-2017 02:37 AM
I applied the same settings mentioned above and created a table with "avro.compress=snappy" as TBLPROPERTIES, but the compression ratio is the same. I am not sure whether compression is applied to this table. Is there any way to validate whether it is compressed or not?