
Preserving compression when copying Hive table to S3


I'm trying to find the optimal way to copy our Hive data to S3.

I know that I can do something like this:

CREATE EXTERNAL TABLE my_s3_table (
  id BIGINT, time STRING, log STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3n://bucket/directory/';

And then something like this:

INSERT OVERWRITE TABLE my_s3_table
SELECT id, time, log FROM my_local_table;

But here's my question: what's the optimal approach for uploading the minimum number of bits?

Some of these tables grow by hundreds of GB per day, and that's really unwieldy.

Supposing my local table instance is a GZIP-compressed ORC-formatted table, can I just upload the compressed bits?

If I create my_s3_table as a GZIP-compressed ORC table, will this happen for free?
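
I'm picturing something like the DDL below, but I haven't tested it. As far as I can tell, ORC doesn't use gzip as such; ZLIB (the same DEFLATE algorithm) is the closest codec and is set through table properties. The s3a:// scheme, table name, and bucket path are just my placeholders:

-- Untested guess: ORC with ZLIB (DEFLATE) compression, written out to S3.
-- Bucket/path are placeholders; s3a:// assumes the S3A connector is configured.
CREATE EXTERNAL TABLE my_s3_orc_table (
  id BIGINT, time STRING, log STRING
)
STORED AS ORC
LOCATION 's3a://bucket/directory/'
TBLPROPERTIES ("orc.compress"="ZLIB");

INSERT OVERWRITE TABLE my_s3_orc_table
SELECT id, time, log FROM my_local_table;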

Any guidance here for an AWS rookie?

1 REPLY

It's best to just upload the compressed files using distcp; that's designed for large data uploads between filesystems.
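
For instance, once the ORC files have been copied across with distcp, you could declare an external table straight over them; ORC files carry their compression with them, so nothing gets re-encoded. The bucket and path below are placeholders, and s3a:// assumes the S3A connector:

-- Sketch: external table over ORC files already copied to S3 with distcp.
-- The files keep whatever compression they were written with on the cluster.
CREATE EXTERNAL TABLE my_s3_orc_table (
  id BIGINT, time STRING, log STRING
)
STORED AS ORC
LOCATION 's3a://bucket/directory/';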

Note that gzip is an inefficient format for processing: it isn't splittable, so whatever reads it has to scan all the way through the file. Things are a lot more efficient (especially with the speedups we've added to the S3A connector). Use bzip2, LZO or Snappy instead.
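
For example, sticking with the TEXTFILE table from the question, you could switch the output codec to bzip2 (which is splittable) with session settings along these lines; the property names assume a Hadoop 2.x / Hive-on-MapReduce setup, so treat it as a sketch rather than a recipe:

-- Sketch: write bzip2-compressed text files instead of gzip output.
-- Property names assume Hadoop 2.x; adjust for your stack.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;

INSERT OVERWRITE TABLE my_s3_table
SELECT id, time, log FROM my_local_table;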