Member since: 02-26-2016
Posts: 100
Kudos Received: 111
Solutions: 12
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2084 | 05-04-2017 12:38 PM |
 | 4050 | 03-21-2017 06:18 PM |
 | 13597 | 03-21-2017 01:14 AM |
 | 4412 | 02-14-2017 06:21 PM |
 | 7129 | 02-09-2017 03:49 AM |
01-27-2017
05:11 PM
HBase works for your use case:
1. You need to quickly write streaming data coming in at high velocity.
2. You need to perform random lookups against the dataset that you are writing to.
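As a minimal sketch of those two access patterns, assuming the happybase Python client and a hypothetical table named 'events' with a column family 'd' (none of these names come from your setup):

```python
# Sketch of the two access patterns above using the happybase client.
# The table name, column family, and row-key scheme are hypothetical.
import happybase

# Connect through the HBase Thrift server (hostname is a placeholder).
connection = happybase.Connection('hbase-thrift-host')
table = connection.table('events')

# 1. Fast writes: batched puts of incoming streaming records.
with table.batch(batch_size=1000) as batch:
    for i in range(10000):
        row_key = b'sensor42-%010d' % i  # row-key design drives write and read performance
        batch.put(row_key, {b'd:value': b'%d' % i, b'd:source': b'sensor42'})

# 2. Random lookups: fetch a single row directly by its key.
row = table.row(b'sensor42-0000000123')
print(row.get(b'd:value'))
```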
12-18-2016
01:57 PM
4 Kudos
A colleague recently asked me how to create a custom function for Hive using Python. You can create a function in pretty much any language and plug it into your Hive query using the Hive TRANSFORM clause. TRANSFORM lets you add your own mappers and/or reducers to process the data. The example in this article is working code that I wrote a few years ago, using an early version of Hive, to demonstrate how to add a custom function. In earlier versions of Hive we had to implement our own functions to hash sensitive data for PII compliance. Beginning with Hive 1.3, the SHA2 UDF was added to calculate a hash using SHA-224, SHA-256, SHA-384, or SHA-512. In my example below I create a custom UDF using Python to calculate the SHA-256 hash of a social security number. Keep in mind that when I did this there were no out-of-the-box Hive UDFs available; this example only demonstrates how to write your own custom functions for Hive using Python. First, we need to write some Python code that will read each record passed in from Hive and process the data. Save this to a file:
#!/usr/local/bin/python
import hashlib
import sys
## we are receiving each record passed in from Hive via standard input
## By default, columns will be transformed to STRING and delimited by TAB
## Also, by default, NULL values will be converted to literal string \N to differentiate from empty strings
for line in sys.stdin:
    line = line.strip()
    (customer_no, ssn, plan, join_date, status, balance, region) = line.split('\t')
    ## hash the social security number and emit all the fields to standard out
    ssn = hashlib.sha256(ssn.encode('utf-8')).hexdigest()
    print('\t'.join([customer_no, ssn, plan, join_date, status, balance, region]))
Now you can call the above Python code from your HiveQL:
ADD FILE /path-to-my-script/my_python_code.py;
CREATE VIEW customer_data_mart_view.v_customer_balance AS
SELECT
TRANSFORM (
customer_no
,ssn
,plan
,join_date
,status
,balance
,region)
USING '/path-to-my-script/my_python_code.py'
AS customer_no
,ssn
,plan
,join_date
,status
,balance
,region
FROM customer_data_mart.customer_details;
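If you want to sanity-check the script outside of Hive, you can feed it a tab-delimited record on standard input the same way Hive does. The sample field values below are made up for illustration:

```python
# Hypothetical local smoke test: pipe one tab-delimited record into the
# transform script via stdin, just as Hive would, and print the result.
import subprocess

record = '\t'.join(['1001', '123-45-6789', 'GOLD', '2016-01-15', 'ACTIVE', '250.00', 'WEST'])
result = subprocess.run(
    ['python', '/path-to-my-script/my_python_code.py'],
    input=record + '\n',
    capture_output=True,
    text=True,
)
print(result.stdout)  # same record, with the ssn column replaced by its SHA-256 hash
```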
12-17-2016
08:55 PM
Try using a configuration file that stores your AWS credentials. Follow the instructions here: https://hortonworks.github.io/hdp-aws/s3-security/#create-a-credential-file
12-07-2016
09:19 AM
4 Kudos
I recently worked with a customer to demo HDF/NiFi DataFlows for their use cases. One use case involved converting delimited, row-formatted text files into the Parquet and ORC columnar formats. This can be done quickly with HDF/NiFi.
Here is an easy-to-follow DataFlow that will convert row-formatted text files to Parquet and ORC. It may look straightforward; however, it requires some basic knowledge of Avro file formats and of the Kite API. This article explains the DataFlow in detail.
STEP 1: Create an Avro schema for the source data
My source data is tab-delimited and consists of 8 fields.
Notice in the DataFlow that before converting the data to Parquet and ORC, the data is first converted to Avro. This is done so we have schema information for the data. Prior to converting the data to Avro, we need to create an Avro schema definition file for the source data. Here is what my schema definition file looks like for the 8 fields; I stored it in a file named sample.avsc.
More information on Avro schemas can be found here: https://avro.apache.org/docs/1.7.7/spec.html
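The schema itself is plain JSON. As an illustration only, here is a small Python snippet that writes a schema of that shape to sample.avsc; the record and field names are generic placeholders, not the actual columns from my dataset:

```python
# Write an Avro schema for eight tab-delimited string fields to sample.avsc.
# The record name and field names are placeholders, not the real columns.
import json

schema = {
    "type": "record",
    "name": "sample_record",
    "namespace": "com.example",
    "fields": [{"name": "field_%d" % i, "type": "string"} for i in range(1, 9)],
}

with open("sample.avsc", "w") as f:
    json.dump(schema, f, indent=2)
```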
STEP 2: Use Kite API to create a Parquet dataset
The DataFlow uses the ‘StoreInKiteDataset’ processor. Before we can use this processor to convert the Avro data to Parquet we need to have a directory already created in HDFS to store the data as Parquet. This is done by calling the Kite API.
Kite is an API for Hadoop that lets you easily define how your data is stored:
- Works with file formats including CSV, JSON, Avro, and Parquet
- Stores data in Hive, HDFS, the local file system, HBase, or Amazon S3
- Compresses data with Snappy (default), Deflate, Bzip2, or Lzo
Kite will handle how the data is stored. For example, if I wanted to store incoming CSV data into a Parquet formatted Hive table, I could use the Kite API to create a schema for my CSV data and then call the Kite API to create the Hive table for me. Kite also works with partitioned data and will automatically partition records when writing.
In this example I am writing the data into HDFS.
Call Kite API to create a directory in HDFS to store the Parquet data. The file sample.avsc contains my schema definition:
./kite-dataset create dataset:hdfs://ip-172-31-2-101.xxxx.compute.internal:8020/tmp/sample_data/parquet --schema sample.avsc --format parquet
If you want to load directly into a Hive table, then you would call the Kite API using the following command:
./kite-dataset create sample_hive_table --schema sample.avsc --format parquet
To learn more about the Kite API and to download, follow this link: http://kitesdk.org/docs/current/
STEP 3: Get source data
The source files could exist in one or more places, including a remote server, the local file system, or HDFS. This example uses files stored in HDFS.
STEP 4: Convert data to Avro format
Configure the 'ConvertCSVToAvro' processor to specify the location for the schema definition file and specify properties of the source delimited file so NiFi knows how to read the source data.
STEP 5: Convert data to columnar Parquet format
Configure the 'StoreInKiteDataset' processor to set the URI for your Kite dataset.
My target dataset URI is: dataset:hdfs://ip-172-31-2-101.xxxx.compute.internal:8020/tmp/sample_data/parquet
This is the directory I created in STEP 2. I'm writing to an HDFS directory. I could also write directly to a Hive table.
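One quick way to confirm the conversion worked, outside of NiFi, is to read the Parquet directory back. A minimal PySpark sketch, assuming the same HDFS path as above and a Spark client on the cluster:

```python
# Read the Parquet dataset written by the StoreInKiteDataset processor to
# confirm the schema and sample a few rows. The path matches STEP 2.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("verify_parquet_dataset").getOrCreate()

df = spark.read.parquet("hdfs:///tmp/sample_data/parquet")
df.printSchema()
df.show(5, truncate=False)
```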
STEP 6: Convert data to columnar ORC format and store in HDFS
These two steps are straightforward. For the 'ConvertAvroToORC' processor you can specify the ORC stripe size as well as optionally compress the data.
'ConvertAvroToORC' processor settings:
Data before and after conversion
The text data is considerably larger than the Parquet and ORC files.
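You can see the size difference directly in HDFS. The Parquet path below is the Kite dataset from STEP 2; the text and ORC paths are placeholders for wherever your flow reads and writes those files:

```python
# Compare the on-disk size of the three copies of the data in HDFS.
# The text and ORC paths are placeholders; only the Parquet path comes
# from STEP 2 of this article.
import subprocess

for path in ['/tmp/sample_data/text',
             '/tmp/sample_data/parquet',
             '/tmp/sample_data/orc']:
    subprocess.run(['hdfs', 'dfs', '-du', '-s', '-h', path], check=True)
```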
12-05-2016
04:01 PM
2 Kudos
You can create views in Hive and apply Ranger permissions to them. This was a very common approach before we released HDP 2.5 with dynamic masking and row-level filtering. For example, prior to HDP 2.5, clients would create a view on top of a base table, say a CUSTOMER table, and in the view they would mask/hash certain PII fields such as Social Security Number and email address. Next, they would use Ranger to restrict access to the base CUSTOMER table and grant SELECT access on the view.
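A minimal sketch of that pattern is below. The table and column names are hypothetical, and the HiveQL is run through PySpark here purely for illustration; in practice you would run the same CREATE VIEW in beeline/Hive and then set the Ranger policies on the base table and the view:

```python
# Sketch of the pre-HDP-2.5 pattern: expose a view that hashes/masks PII
# columns, then use Ranger to restrict SELECT on the base CUSTOMER table
# and grant SELECT on the view. Names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
  CREATE VIEW IF NOT EXISTS customer_masked_v AS
  SELECT customer_no,
         sha2(ssn, 256)                          AS ssn_hash,
         regexp_replace(email, '^[^@]+', '****') AS email_masked,
         plan,
         status
  FROM customer
""")
```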
12-05-2016
03:13 AM
I assumed you were referring to HDFS disk usage. What happens when you run 'hdfs dfsadmin -report'? Are you seeing an uneven distribution of blocks? If so, rebalancing via Ambari will help distribute your HDFS data uniformly.
12-04-2016
09:40 PM
Try rebalancing the blocks across DataNodes in your cluster. HDFS data is not always placed uniformly across the DataNodes. Here is a link to the manual on how to rebalance: https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.1.1/bk_Ambari_Users_Guide/content/_how_to_rebalance_hdfs.html
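If you prefer the command line over Ambari, the equivalent is the HDFS balancer. A minimal sketch, run as the hdfs user; the threshold value is just an example:

```python
# Command-line equivalent of the Ambari "Rebalance HDFS" action.
# -threshold 10 allows DataNode utilization to deviate up to 10% from the
# cluster average; adjust to taste. Run as the hdfs user.
import subprocess

subprocess.run(['hdfs', 'balancer', '-threshold', '10'], check=True)
```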
12-01-2016
08:42 PM
Here is a great blog post on the Dominant Resource Calculator by Varun Vasudev that you can reference: http://hortonworks.com/blog/managing-cpu-resources-in-your-hadoop-yarn-clusters/
12-01-2016
03:21 AM
4 Kudos
Row and Columnar Storage For Hive
Customers often ask about columnar storage formats for Hive tables and when to use them. It depends on your use cases. If your data access patterns mostly involve selecting a few columns to perform aggregations, then using columnar storage will save disk space, reduce I/O when fetching data, and improve query execution time. In this article I tested the columnar ORC file format for Hive to quantify how it outperforms row-based text files when used for queries that aggregate and group on a few columns from a table with many columns.
ORC is a columnar storage format used in Hadoop for Hive tables. It is an efficient file format for storing data in which records contain many columns. An example is clickstream (web) data used to analyze website activity and performance, where queries typically retrieve a subset of columns for each row.
SUMMARY OF TEST RESULTS:
- Comparing ORC vs. Text files for storing 22 million clickstream page-view records in Hive tables
- Each row has 40 columns
- Data files are uncompressed (i.e., no codec such as Snappy, ZLIB, LZO, etc.)
- 52% reduction in disk space when using ORC (high columnar storage compression)
- 97% reduction in disk I/O when using ORC
- 21% improvement in HiveQL query execution time when using ORC
The HiveQL counts the number of page views by operating system and then ranks each operating system by number of page views in descending order; a sketch of that kind of query is shown below.
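A sketch of that aggregation, with hypothetical table and column names (the actual test ran the equivalent HiveQL directly in Hive):

```python
# Illustration of the aggregation tested: count page views per operating
# system and rank operating systems in descending order. The table name
# 'clickstream' and column 'operating_system' are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
  SELECT operating_system,
         COUNT(*) AS page_views
  FROM clickstream
  GROUP BY operating_system
  ORDER BY page_views DESC
""").show()
```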
Using the ORC file format may not always equate to significantly less memory and CPU usage for analytical queries than using row-based text files for high volumes of data. In fact, memory usage could be greater when using the ORC format. You can optimize memory usage by optionally compressing the data using a codec such as ZLIB or Snappy; however, CPU time will then increase because of compression and decompression.
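As an example of enabling compression, an ORC table can declare its codec through table properties; the table and column names here are hypothetical, and the same DDL can be run directly in Hive:

```python
# Illustration: declare an ORC table with Snappy compression via
# TBLPROPERTIES. Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
  CREATE TABLE IF NOT EXISTS clickstream_orc (
    page_url         STRING,
    operating_system STRING,
    view_time        TIMESTAMP
  )
  STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'SNAPPY')
""")
```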
12-01-2016
02:50 AM
I typically don't recommend using Hive on top of HBase. The performance is terrible once you get into high data volumes. You could create your HBase tables and then use Spark to access the data programmatically via the DataFrames API, and use Phoenix to create a view on top of HBase for SQL analytics. Phoenix is orders of magnitude faster than Hive for SQL on HBase. Try it out; it's easy to use. You use Phoenix to create a schema for the HBase table. Hortonworks also released a Spark-on-HBase connector that you can use; it's a DataFrame-based connector: http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
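As a hedged sketch of the programmatic side, here is how a Phoenix table or view over HBase can be read from Spark. The table name, ZooKeeper quorum, and the use of the phoenix-spark data source (its jar must be on the Spark classpath) are assumptions for illustration:

```python
# Sketch: read an HBase table through Phoenix from Spark using the
# phoenix-spark data source. The table name 'CUSTOMER', the ZooKeeper
# quorum, and the connector jar on the classpath are all assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("phoenix_read").getOrCreate()

df = (spark.read
      .format("org.apache.phoenix.spark")
      .option("table", "CUSTOMER")
      .option("zkUrl", "zk-host1,zk-host2,zk-host3:2181")
      .load())

df.printSchema()
df.show(10)
```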