Created on 08-11-2016 01:02 PM - edited 08-17-2019 10:52 AM
ORC provides many benefits, including support for compression. However, there are times when you need to store "raw" data in HDFS but still want to take advantage of compression. One such use case is Hive external tables. Using Pig, you can LOAD uncompressed data from an HDFS directory and then STORE it in a compressed format in another HDFS directory. This approach is particularly useful if you already use Pig for part of your ETL processing.
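To get a feel for how well BZip2 (the codec used later in this tutorial) handles delimited text, you can run a quick local check in Python before touching the cluster. The sample row below is a made-up illustration shaped like a stock-price CSV line, not real ticker data:

```python
import bz2

# Hypothetical CSV row; repetitive delimited text like this compresses very well.
row = b"NYSE,AEA,2010-02-08,4.42,4.42,4.21,4.24,205500,4.24\n"
data = row * 10_000

compressed = bz2.compress(data)
print(f"original: {len(data)} bytes, compressed: {len(compressed)} bytes")
print(f"ratio: {len(compressed) / len(data):.4f}")
```

Real-world data will not compress quite this dramatically, since the sample repeats one row, but delimited text in general is a good fit for BZip2.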
This tutorial has been tested with the following configuration:
Mac OS X 10.11.5
HDP 2.5 Tech Preview on Hortonworks Sandbox
The following should be completed before following this tutorial:
5. Create External Hive table on uncompressed data
We are going to create an external table in Hive to view the uncompressed data. We will do this using the Hive View in Ambari.
The file schema is straightforward. The first line of the CSV is the header line. Normally you would remove that line as part of your processing, but we'll leave it in to save time.
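Because we leave the header in, the header row will show up as a data row in query results. A minimal Python sketch of how the header would normally be stripped during processing (the column names here are illustrative):

```python
import csv
import io

# Stand-in CSV content with the header line left in, as in this tutorial.
raw = io.StringIO(
    "exchange,stock_symbol,date,open,high,low,close,volume,adj_close\n"
    "NYSE,AEA,2010-02-08,4.42,4.42,4.21,4.24,205500,4.24\n"
)
reader = csv.reader(raw)
header = next(reader)   # consume the header row
rows = list(reader)     # remaining lines are data
print(header[0], len(rows))  # → exchange 1
```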
Enter the following Hive DDL and click the Execute button:
-- Column list follows the header of the NYSE_daily_prices CSV files
CREATE EXTERNAL TABLE external_nyse_uncompressed (
  exchange STRING,
  stock_symbol STRING,
  `date` STRING,
  stock_price_open FLOAT,
  stock_price_high FLOAT,
  stock_price_low FLOAT,
  stock_price_close FLOAT,
  stock_volume INT,
  stock_price_adj_close FLOAT
)
COMMENT 'Historical NYSE stock ticker data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/admin/data/uncompressed';
6. Verify you can see the data in Hive
Click the New Worksheet button.
Enter the following query:
SELECT * FROM external_nyse_uncompressed LIMIT 100;
You should see something like the following:
7. Create Pig Script
We are going to use the Pig View to create a Pig script to compress our data on HDFS.
Enter Pig script
SET output.compression.enabled true;
SET output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;
STOCK = LOAD '/user/admin/data/uncompressed/NYSE_daily_prices_A.csv' USING PigStorage(',') AS (
  exchange:chararray,
  stock_symbol:chararray,
  date:chararray,
  stock_price_open:float,
  stock_price_high:float,
  stock_price_low:float,
  stock_price_close:float,
  stock_volume:int,
  stock_price_adj_close:float
);
STORE STOCK INTO '/user/admin/data/compressed' USING PigStorage(',');
Note: The final output directory cannot exist before the script runs, or Pig will throw an error. In this case, the "compressed" directory should not yet exist; Pig will create /user/admin/data/compressed when it stores the output data.
After you have entered the script, click the "Execute" button. When the Pig job finishes, you should see something like this:
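Pig writes the output as BZip2-compressed part files (for example, part-m-00000.bz2). Underneath the compression these are still plain CSV, so standard bzip2 tooling can read one if you pull it down locally with `hdfs dfs -get`. A minimal sketch using a locally created stand-in file rather than real cluster output:

```python
import bz2
import os
import tempfile

# Create a stand-in for a Pig part file; the row itself is made up.
line = "NYSE,AEA,2010-02-08,4.42,4.42,4.21,4.24,205500,4.24\n"
path = os.path.join(tempfile.mkdtemp(), "part-m-00000.bz2")
with bz2.open(path, "wt") as f:
    f.write(line)

# Reading it back yields the original delimited text.
with bz2.open(path, "rt") as f:
    print(f.read() == line)  # → True
```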
8. Create External Hive table on compressed data
We are going to create an external table in Hive to view the compressed data.
Change permissions on the compressed data directory