Member since 02-09-2016
08-12-2016
01:56 PM
Yes, pausing and suspending are the same thing.
08-12-2016
01:17 AM
2 Kudos
@Heath Yates Using Ambari, stop all the services. Then, in the sandbox via ssh, type "sudo init 0" or "sudo shutdown now"
08-11-2016
09:01 PM
Of course both of these options can be worked around by anyone with access and rights to the file system.
08-11-2016
08:51 PM
2 Kudos
You could symbolically link the hive CLI script/command to the beeline script. This ensures the "hive" command always executes "beeline". Here is another solution: https://community.hortonworks.com/questions/10760/how-to-disable-hive-shell-for-all-users.html
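As a sketch of the symlink approach, demonstrated here in a temp directory with a stand-in beeline script (the real paths, e.g. /usr/bin/hive and /usr/bin/beeline, vary by installation, and you would want to keep a copy of the original hive script):

```shell
# Stand-in demo: link "hive" to beeline; real paths vary by install.
dir=$(mktemp -d)
printf '#!/bin/sh\necho beeline\n' > "$dir/beeline"   # stand-in for the beeline script
chmod +x "$dir/beeline"
ln -s "$dir/beeline" "$dir/hive"                      # "hive" now invokes beeline
"$dir/hive"                                           # prints: beeline
```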
08-11-2016
02:06 PM
@subhash parise I just posted an article demonstrating a very simple Pig + Hive example showing HDFS compression. https://community.hortonworks.com/content/kbentry/50921/using-pig-to-convert-uncompressed-data-to-compress.html
08-11-2016
01:02 PM
3 Kudos
Overview
ORC provides many benefits, including built-in compression. However, there are times when you need to store "raw" data in HDFS but still want to take advantage of compression. One such use case is Hive external tables. Using Pig, you can LOAD the contents of an uncompressed HDFS directory and then STORE that data in compressed form into another HDFS directory. This approach is particularly useful if you already use Pig for part of your ETL processing.

Scope
This tutorial has been tested with the following configuration:
Mac OS X 10.11.5
VirtualBox 5.1.2
HDP 2.5 Tech Preview on Hortonworks Sandbox

Prerequisites
The following should be completed before following this tutorial:
VirtualBox for Mac OS X installed (VirtualBox Link)
Hortonworks HDP 2.5 Technical Preview Sandbox configured in VirtualBox (Sandbox Link)

Steps

1. Connect to the Sandbox
If you are using Vagrant to spin up your sandbox, then you can simply connect using:
$ vagrant ssh
If you are not using Vagrant but are interested, check out this article:
HCC Article
If you are using the standard Sandbox without Vagrant, then you can connect using:
$ ssh -p 2222 vagrant@127.0.0.1
Note: The Sandbox should already have port forwarding enabled so that local port 2222 is forwarded to sandbox port 22.

2. Download sample data
We need sample data to work with. For this tutorial, we will use historical NYSE stock ticker data for stocks starting with the letter A. You can download the sample data in your sandbox using:
$ cd /tmp
$ curl -O https://s3.amazonaws.com/hw-sandbox/tutorial1/infochimps_dataset_4778_download_16677-csv.zip
Note: This file is 124 MB and may take a few minutes to download.

3. Create data directories on HDFS
We are going to create a /user/admin/data/uncompressed directory on HDFS. This is where the uncompressed data will be stored.
Create the uncompressed data directory
$ sudo -u hdfs hdfs dfs -mkdir -p /user/admin/data/uncompressed
Change ownership to the admin user account
$ sudo -u hdfs hdfs dfs -chown -R admin:hadoop /user/admin
Change permissions on the uncompressed data directory
$ sudo -u hdfs hdfs dfs -chmod -R 775 /user/admin/data
Note: These permissions are needed to give Hive access to the directories. Alternatively, you could set up Ranger HDFS policies.
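As a local illustration of what mode 775 grants: owner and group get read/write/execute, others get read/execute only. Group write is what lets the Hive service user, as a member of the hadoop group, work in these directories. Demonstrated here on a throwaway local directory:

```shell
# Local illustration of mode 775 (rwxrwxr-x)
d=$(mktemp -d)
chmod 775 "$d"
stat -c '%a' "$d"   # prints 775 (GNU stat; on macOS use: stat -f '%Lp')
```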
4. Push the data to the uncompressed directory on HDFS
We are going to push the NYSE_daily_prices_A.csv file to the uncompressed data directory on HDFS and change the ownership for that file.
Extract the zip archive
$ cd /tmp
$ unzip infochimps_dataset_4778_download_16677-csv.zip
Push the stock file from the local directory to HDFS
$ sudo -u hdfs hdfs dfs -put /tmp/infochimps_dataset_4778_download_16677/NYSE/NYSE_daily_prices_A.csv /user/admin/data/uncompressed/
Change ownership of the file in HDFS
$ sudo -u hdfs hdfs dfs -chown -R admin:hadoop /user/admin/data/uncompressed
Verify the permission changes
$ sudo -u hdfs hdfs dfs -ls /user/admin/data/uncompressed
Found 1 items
-rw-r--r-- 1 admin hadoop 40990992 2016-08-11 01:16 /user/admin/data/uncompressed/NYSE_daily_prices_A.csv
5. Create External Hive table on uncompressed data
We are going to create an external table in Hive to view the uncompressed data. We will do this using the Hive View in Ambari.
The file schema is straightforward. The first line of the CSV is a header line. Normally you would remove that line as part of your processing, but we'll leave it in to save time.
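If you did want to drop the header line before loading, one quick way is tail -n +2, shown here on a tiny stand-in file (in the tutorial this would be NYSE_daily_prices_A.csv):

```shell
# Stand-in CSV with a header line
printf 'exchange,symbol,date\nNYSE,AA,2000-01-03\n' > /tmp/demo.csv
tail -n +2 /tmp/demo.csv   # emits everything from line 2 on, skipping the header
```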
Enter the following Hive DDL and click the Execute button
CREATE EXTERNAL TABLE external_nyse_uncompressed (
stock_exchange STRING,
symbol STRING,
sdate STRING,
open FLOAT,
high FLOAT,
low FLOAT,
close FLOAT,
volume INT,
adj_close FLOAT
)
COMMENT 'Historical NYSE stock ticker data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/admin/data/uncompressed';
6. Verify you can see the data in Hive
Click the New Worksheet button.
Enter the following query:
SELECT * FROM external_nyse_uncompressed LIMIT 100;
You should see something like the following:
7. Create Pig Script
We are going to use the Pig View to create a Pig script to compress our data on HDFS.
Enter Pig script
SET output.compression.enabled true;
SET output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;
STOCK = LOAD '/user/admin/data/uncompressed/NYSE_daily_prices_A.csv' USING PigStorage(',') AS (
exchange:chararray,
symbol:chararray,
date:chararray,
open:float,
high:float,
low:float,
close:float,
volume:int,
adj_close:float);
STORE STOCK INTO '/user/admin/data/compressed' USING PigStorage(',');
Note: The final output directory must not exist before the script runs, or Pig will throw an error. In this case, the "compressed" directory should not yet exist; Pig will create /user/admin/data/compressed when it stores the output data.
After you have entered the script, click the "Execute" button. When the Pig job finishes, you should see something like this:
8. Create External Hive table on compressed data
We are going to create an external table in Hive to view the compressed data.

Change permissions on the compressed data directory

$ sudo -u hdfs hdfs dfs -chmod -R 775 /user/admin/data/compressed
Enter the following Hive DDL and click the Execute button
CREATE EXTERNAL TABLE external_nyse_compressed (
stock_exchange STRING,
symbol STRING,
sdate STRING,
open FLOAT,
high FLOAT,
low FLOAT,
close FLOAT,
volume INT,
adj_close FLOAT
)
COMMENT 'Historical NYSE stock ticker data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/admin/data/compressed';

9. Verify you can see the data in Hive
Click the New Worksheet button.
Enter the following query:
SELECT * FROM external_nyse_compressed LIMIT 100;
You should see something like the following:
10. Compare file sizes
You can compare the file sizes of the two sets of data.
$ sudo -u hdfs hdfs dfs -du -s -h /user/admin/data/*
8.0 M /user/admin/data/compressed
39.1 M /user/admin/data/uncompressed
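The roughly 5x reduction is typical of what bzip2 achieves on repetitive CSV text. You can see the effect locally on a synthetic file (the exact sizes will differ from the tutorial's):

```shell
# Build a repetitive CSV-like file and compress it with bzip2 (the codec used above)
yes 'NYSE,AA,2000-01-03,4.1,4.2,4.0,4.1,100000,1.9' | head -n 10000 > /tmp/rows.csv
bzip2 -k -f /tmp/rows.csv    # -k keeps the original so the sizes can be compared
wc -c < /tmp/rows.csv        # uncompressed size in bytes
wc -c < /tmp/rows.csv.bz2    # compressed size, far smaller
```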
Review
We have successfully converted uncompressed data to compressed data using Pig, and created external Hive tables over both the uncompressed and the compressed copies.
At this point, you could delete the uncompressed files if they are no longer needed. This process can be run in reverse to create uncompressed data from compressed data. All you need to do is to set:
SET output.compression.enabled false;
Note: When compressing the data, Pig does not maintain the original filenames.
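For reference, the reverse direction is the same script with compression disabled. A sketch, reusing the directories from this tutorial (the /user/admin/data/plain output path is hypothetical and, as noted above, must not already exist):

```pig
SET output.compression.enabled false;
COMPRESSED = LOAD '/user/admin/data/compressed' USING PigStorage(',');
STORE COMPRESSED INTO '/user/admin/data/plain' USING PigStorage(',');
```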
08-10-2016
03:38 PM
@subhash parise The default codec is zlib. If you want to explicitly set it to zlib, use the following:
set output.compression.codec org.apache.hadoop.io.compress.DefaultCodec;
08-10-2016
02:11 PM
1 Kudo
@subhash parise As @Artem Ervits shared, you get compression when storing your data in ORC format. However, if you want to store "raw" data on HDFS and selectively compress it, you can use a simple Pig script: load the data from HDFS and then write it out again.
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;
inputFiles = LOAD '/input/directory/uncompressed' using PigStorage();
STORE inputFiles INTO '/output/directory/compressed/' USING PigStorage();
You can either leave the uncompressed data in place or remove it, depending on your workflow. This is an approach that I've used. You can use different codecs depending on your needs:
set output.compression.codec com.hadoop.compression.lzo.LzopCodec;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;
08-09-2016
06:08 PM
@Deepak k You may find Ambari Blueprints helpful. They are not the same thing as defining a stack, but they do offer a lot of control over which components are installed and where. https://cwiki.apache.org/confluence/display/AMBARI/Blueprints From the wiki: "Ambari Blueprints are a declarative definition of a cluster. With a Blueprint, you specify a Stack, the Component layout and the Configurations to materialize a Hadoop cluster instance (via a REST API) without having to use the Ambari Cluster Install Wizard."
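As a rough sketch of the blueprint JSON shape (the blueprint name, host group name, and component list here are illustrative only; consult the wiki page above for the exact fields and valid component names):

```json
{
  "Blueprints": { "stack_name": "HDP", "stack_version": "2.5" },
  "host_groups": [
    {
      "name": "master",
      "components": [ { "name": "NAMENODE" }, { "name": "RESOURCEMANAGER" } ],
      "cardinality": "1"
    }
  ]
}
```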
08-09-2016
04:10 PM
You can always remove the files in .Trash as you would any other directory/file. hdfs dfs -rm -r -skipTrash /user/hdfs/.Trash/*