
Create compressed Avro Hive table

Expert Contributor

Hello,

I want to create a compressed Avro-backed Hive table and load data into it.

The flow is as follows:

CREATE TABLE IF NOT EXISTS events (...) STORED AS AVRO LOCATION '...';
INSERT OVERWRITE TABLE events SELECT ... FROM other_table;

Then if I DESCRIBE FORMATTED the table, I see

Compressed: no

As far as I understand it, to get compressed data I should just add the following before all other statements:

SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

But this does not change anything.

I also tried adding:

SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

and even a TBLPROPERTIES clause:

TBLPROPERTIES("avro.output.codec"="snappy")

To no avail.

Could anybody point out what I am missing?

I am on HDP 2.5.3, LLAP is not enabled, and the commands are run via Beeline from a script file.
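
For completeness, here is a minimal, self-contained version of the script I am running; the table name, column list, and location are placeholders for illustration:

-- Session-level compression settings (the ones I expect to matter):
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

-- Table and columns are made up for this example:
CREATE TABLE IF NOT EXISTS events (
  event_id STRING,
  event_ts BIGINT,
  payload  STRING
)
STORED AS AVRO
LOCATION '/apps/hive/warehouse/events';
-- (also tried with: TBLPROPERTIES("avro.output.codec"="snappy"))

-- other_table is assumed to already exist with matching columns:
INSERT OVERWRITE TABLE events
SELECT event_id, event_ts, payload
FROM other_table;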

1 ACCEPTED SOLUTION

Super Guru

@Guillaume Roger

Your steps are correct.

Please be advised that the Compressed field in your DESCRIBE FORMATTED output is not a reliable indicator of whether the table contains compressed data. It typically shows No, because the compression settings only apply during the session that loads the data and are not stored persistently with the table metadata. The field may also reflect input or intermediate compression rather than output compression.

Look at the actual files as they are stored for the Hive table in question.
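
For example, you can copy one of the table's files out of HDFS and read the codec straight from the Avro container header with avro-tools (the paths, file name, and jar version here are just for illustration):

# list the files backing the table (path is hypothetical):
hdfs dfs -ls /apps/hive/warehouse/events

# pull one file down and inspect its header metadata:
hdfs dfs -get /apps/hive/warehouse/events/000000_0 /tmp/events_part.avro
java -jar avro-tools-1.8.1.jar getmeta /tmp/events_part.avro

If the data is compressed, getmeta prints an avro.codec entry such as snappy; uncompressed files show null (or no codec entry at all).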

***

If this cleared things up, please vote and accept it as the best answer.


5 REPLIES


Super Guru

To check HDFS, run something like this from Beeline:

dfs -lsr hdfs://localhost:9000/user/hive/warehouse/events;

Replace the host and path with your own.

The file extension will tell you whether it is compressed.

Expert Contributor

Although there is no extension, looking at the metadata of the Avro file I can see that it is indeed compressed.
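
Concretely, the codec is recorded in the Avro container header rather than in the file name; a crude way to check it, with a hypothetical path and file name, is:

hdfs dfs -cat /user/hive/warehouse/events/000000_0 | head -c 4096 | strings | grep codec

The readable part of the header contains avro.codec followed by the codec name, e.g. snappy.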

This brings two questions to mind:

- If I load data in many sessions, some with compression enabled and some without, I will end up with a mix of files in the HDFS directory, some compressed and some not. Is that correct?

- Is there a way to set the compression parameters globally in Hive, so that I do not have to provide them explicitly in each session?

Thanks,

Expert Contributor

I can at least confirm that setting the following in hive-site is enough to enable compression globally:

"hive.exec.compress.output" : "true"
"hive.exec.compress.intermediate" : "true"
"avro.output.codec" : "snappy"
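
In hive-site.xml terms, that corresponds to entries along these lines (a sketch of the same three properties):

<property>
  <name>hive.exec.compress.output</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
</property>
<property>
  <name>avro.output.codec</name>
  <value>snappy</value>
</property>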

New Contributor

I applied the same settings you mentioned above and created a table with "avro.compress=snappy" in TBLPROPERTIES, but the compression ratio is the same. I am not sure whether compression is actually being applied to this table. Is there any way to validate whether or not it is compressed?