Member since: 02-26-2016
Posts: 100
Kudos Received: 111
Solutions: 12
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2084 | 05-04-2017 12:38 PM |
 | 4050 | 03-21-2017 06:18 PM |
 | 13597 | 03-21-2017 01:14 AM |
 | 4412 | 02-14-2017 06:21 PM |
 | 7129 | 02-09-2017 03:49 AM |
01-27-2017
05:11 PM
HBase works for your use case:
1. You need to quickly write streaming data coming in at high velocity.
2. You need to perform random lookups against the dataset that you are writing to.
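As a minimal sketch of those two access patterns, assuming the happybase Python client and a hypothetical table named 'events' with a column family 'd' (none of these names come from your setup):

```python
# Sketch of the two access patterns above using the happybase client.
# The table name, column family, and row-key scheme are hypothetical.
import happybase

# Connect through the HBase Thrift server (hostname is a placeholder).
connection = happybase.Connection('hbase-thrift-host')
table = connection.table('events')

# 1. Fast writes: batched puts of incoming streaming records.
with table.batch(batch_size=1000) as batch:
    for i in range(10000):
        row_key = b'sensor42-%010d' % i  # row-key design drives write and read performance
        batch.put(row_key, {b'd:value': b'%d' % i, b'd:source': b'sensor42'})

# 2. Random lookups: fetch a single row directly by its key.
row = table.row(b'sensor42-0000000123')
print(row.get(b'd:value'))
```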
12-18-2016
01:57 PM
4 Kudos
A colleague recently asked me how to create a custom function for Hive using Python. You can create a function in pretty much any language and plug it into your Hive query using the Hive TRANSFORM clause. TRANSFORM lets you add your own mappers and/or reducers to process the data. The example in this article is working code that I wrote a few years ago, using an early version of Hive, to demonstrate how to add a custom function. In earlier versions of Hive we had to implement our own functions to hash sensitive data for PII compliance. Beginning with Hive 1.3, the SHA2 UDF was added to calculate a hash using SHA-224, SHA-256, SHA-384, or SHA-512. In my example below I create a custom UDF using Python to calculate the SHA-256 hash of a social security number. Keep in mind that when I did this there were no out-of-the-box Hive UDFs available; this example only demonstrates how to write your own custom functions for Hive using Python. First, we need to write some Python code that will read each record passed in from Hive and process the data. Save this to a file:
#!/usr/local/bin/python
import hashlib
import sys
## we are receiving each record passed in from Hive via standard input
## By default, columns will be transformed to STRING and delimited by TAB
## Also, by default, NULL values will be converted to literal string \N to differentiate from empty strings
for line in sys.stdin:
    line = line.strip()
    (customer_no, ssn, plan, join_date, status, balance, region) = line.split('\t')
    ## hash the social security number and emit all the fields to standard out
    ssn = hashlib.sha256(ssn.encode('utf-8')).hexdigest()
    print('\t'.join([customer_no, ssn, plan, join_date, status, balance, region]))
Now you can call the above Python code from your HiveQL:
ADD FILE /path-to-my-script/my_python_code.py;
CREATE VIEW customer_data_mart_view.v_customer_balance AS
SELECT
TRANSFORM (
customer_no
,ssn
,plan
,join_date
,status
,balance
,region)
USING '/path-to-my-script/my_python_code.py'
AS customer_no
,ssn
,plan
,join_date
,status
,balance
,region
FROM customer_data_mart.customer_details;
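If you want to sanity-check the script outside of Hive, you can feed it a tab-delimited record on standard input the same way Hive does. The sample field values below are made up for illustration:

```python
# Hypothetical local smoke test: pipe one tab-delimited record into the
# transform script via stdin, just as Hive would, and print the result.
import subprocess

record = '\t'.join(['1001', '123-45-6789', 'GOLD', '2016-01-15', 'ACTIVE', '250.00', 'WEST'])
result = subprocess.run(
    ['python', '/path-to-my-script/my_python_code.py'],
    input=record + '\n',
    capture_output=True,
    text=True,
)
print(result.stdout)  # same record, with the ssn column replaced by its SHA-256 hash
```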
12-17-2016
08:55 PM
Try using a configuration file that stores your AWS credentials. Follow the instructions here: https://hortonworks.github.io/hdp-aws/s3-security/#create-a-credential-file
12-07-2016
09:19 AM
4 Kudos
I recently worked with a customer to demo HDF/NiFi DataFlows for their use cases. One use case involved converting delimited, row-formatted text files into the Parquet and ORC columnar formats. This can be done quickly with HDF/NiFi.
Here is an easy-to-follow DataFlow that will convert row-formatted text files to Parquet and ORC. It may look straightforward; however, it requires some basic knowledge of Avro file formats and of the Kite API. This article explains the DataFlow in detail.
STEP 1: Create an Avro schema for the source data
My source data is tab-delimited and consists of 8 fields.
Notice in the DataFlow that before converting the data to Parquet and ORC, the data is first converted to Avro. This is done so we have schema information for the data. Prior to converting the data to Avro, we need to create an Avro schema definition file for the source data. Here is what my schema definition file looks like for the 8 fields; I stored it in a file named sample.avsc.
More information on Avro schemas can be found here: https://avro.apache.org/docs/1.7.7/spec.html
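The schema itself is plain JSON. As an illustration only, here is a small Python snippet that writes a schema of that shape to sample.avsc; the record and field names are generic placeholders, not the actual columns from my dataset:

```python
# Write an Avro schema for eight tab-delimited string fields to sample.avsc.
# The record name and field names are placeholders, not the real columns.
import json

schema = {
    "type": "record",
    "name": "sample_record",
    "namespace": "com.example",
    "fields": [{"name": "field_%d" % i, "type": "string"} for i in range(1, 9)],
}

with open("sample.avsc", "w") as f:
    json.dump(schema, f, indent=2)
```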
STEP 2: Use Kite API to create a Parquet dataset
The DataFlow uses the ‘StoreInKiteDataset’ processor. Before we can use this processor to convert the Avro data to Parquet we need to have a directory already created in HDFS to store the data as Parquet. This is done by calling the Kite API.
Kite is an API for Hadoop that lets you easily define how your data is stored:
- Works with file formats including CSV, JSON, Avro, and Parquet
- Stores data in Hive, HDFS, the local file system, HBase, or Amazon S3
- Compresses data with Snappy (default), Deflate, Bzip2, or Lzo
Kite will handle how the data is stored. For example, if I wanted to store incoming CSV data into a Parquet formatted Hive table, I could use the Kite API to create a schema for my CSV data and then call the Kite API to create the Hive table for me. Kite also works with partitioned data and will automatically partition records when writing.
In this example I am writing the data into HDFS.
Call Kite API to create a directory in HDFS to store the Parquet data. The file sample.avsc contains my schema definition:
./kite-dataset create dataset:hdfs://ip-172-31-2-101.xxxx.compute.internal:8020/tmp/sample_data/parquet --schema sample.avsc --format parquet
If you want to load directly into a Hive table, then you would call the Kite API using the following command:
./kite-dataset create sample_hive_table --schema sample.avsc --format parquet
To learn more about the Kite API and to download, follow this link: http://kitesdk.org/docs/current/
STEP 3: Get source data
The source files could exist in one or more places, including a remote server, the local file system, or HDFS. This example uses files stored in HDFS.
STEP 4: Convert data to Avro format
Configure the 'ConvertCSVToAvro' processor to specify the location for the schema definition file and specify properties of the source delimited file so NiFi knows how to read the source data.
STEP 5: Convert data to columnar Parquet format
Configure the 'StoreInKiteDataset' processor to set the URI for your Kite dataset.
My target dataset URI is: dataset:hdfs://ip-172-31-2-101.xxxx.compute.internal:8020/tmp/sample_data/parquet
This is the directory I created in STEP 2. I'm writing to an HDFS directory. I could also write directly to a Hive table.
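One quick way to confirm the conversion worked, outside of NiFi, is to read the Parquet directory back. A minimal PySpark sketch, assuming the same HDFS path as above and a Spark client on the cluster:

```python
# Read the Parquet dataset written by the StoreInKiteDataset processor to
# confirm the schema and sample a few rows. The path matches STEP 2.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("verify_parquet_dataset").getOrCreate()

df = spark.read.parquet("hdfs:///tmp/sample_data/parquet")
df.printSchema()
df.show(5, truncate=False)
```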
STEP 6: Convert data to columnar ORC format and store in HDFS
These two steps are straightforward. For the 'ConvertAvroToORC' processor you can specify the ORC stripe size as well as optionally compress the data.
'ConvertAvroToORC' processor settings:
Data before and after conversion
The text data is considerably larger than the Parquet and ORC files.
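You can see the size difference directly in HDFS. The Parquet path below is the Kite dataset from STEP 2; the text and ORC paths are placeholders for wherever your flow reads and writes those files:

```python
# Compare the on-disk size of the three copies of the data in HDFS.
# The text and ORC paths are placeholders; only the Parquet path comes
# from STEP 2 of this article.
import subprocess

for path in ['/tmp/sample_data/text',
             '/tmp/sample_data/parquet',
             '/tmp/sample_data/orc']:
    subprocess.run(['hdfs', 'dfs', '-du', '-s', '-h', path], check=True)
```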
12-05-2016
04:01 PM
2 Kudos
You can create views in Hive and apply Ranger permissions to them. This was a very common approach before we released HDP 2.5 with dynamic masking and row-level filtering. For example, prior to HDP 2.5, clients would create a view on top of a base table, say a CUSTOMER table, and in the view they would mask/hash certain PII fields such as Social Security Number and email address. Next, they would use Ranger to restrict access to the base CUSTOMER table and grant SELECT access on the view.
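A minimal sketch of that pattern is below. The table and column names are hypothetical, and the HiveQL is run through PySpark here purely for illustration; in practice you would run the same CREATE VIEW in beeline/Hive and then set the Ranger policies on the base table and the view:

```python
# Sketch of the pre-HDP-2.5 pattern: expose a view that hashes/masks PII
# columns, then use Ranger to restrict SELECT on the base CUSTOMER table
# and grant SELECT on the view. Names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
  CREATE VIEW IF NOT EXISTS customer_masked_v AS
  SELECT customer_no,
         sha2(ssn, 256)                          AS ssn_hash,
         regexp_replace(email, '^[^@]+', '****') AS email_masked,
         plan,
         status
  FROM customer
""")
```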
12-05-2016
03:13 AM
I assumed you were referring to HDFS disk usage. What happens when you run 'hdfs dfsadmin -report'? Are you seeing an uneven distribution of blocks? If so, rebalancing via Ambari will help distribute your HDFS data uniformly.
12-04-2016
09:40 PM
Try rebalancing the blocks across DataNodes in your cluster. HDFS data is not always placed uniformly across the DataNodes. Here is a link to the manual on how to rebalance: https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.1.1/bk_Ambari_Users_Guide/content/_how_to_rebalance_hdfs.html
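If you prefer the command line over Ambari, the equivalent is the HDFS balancer. A minimal sketch, run as the hdfs user; the threshold value is just an example:

```python
# Command-line equivalent of the Ambari "Rebalance HDFS" action.
# -threshold 10 allows DataNode utilization to deviate up to 10% from the
# cluster average; adjust to taste. Run as the hdfs user.
import subprocess

subprocess.run(['hdfs', 'balancer', '-threshold', '10'], check=True)
```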
12-01-2016
08:42 PM
Here is a great blog post on the Dominant Resource Calculator by Varun Vasudev that you can reference: http://hortonworks.com/blog/managing-cpu-resources-in-your-hadoop-yarn-clusters/
12-01-2016
03:21 AM
4 Kudos
Row and Columnar Storage For Hive
Customers often ask about columnar storage formats for Hive tables and when to use them. It depends on your use cases. If your data access patterns mostly involve selecting a few columns to perform aggregations, then using columnar storage will save disk space, reduce I/O when fetching data, and improve query execution time. In this article I tested the columnar ORC file format for Hive to quantify how it outperforms row-based text files when used for queries that aggregate and group on a few columns from a table with many columns.
ORC is a columnar storage format used in Hadoop for Hive tables. It is an efficient file format for storing data in which records contain many columns. An example is clickstream (web) data used to analyze website activity and performance, where queries typically retrieve a subset of columns for each row.
SUMMARY OF TEST RESULTS:
- Comparing ORC vs. Text files for storing 22 million clickstream page-view records in Hive tables
- Each row has 40 columns
- Data files are uncompressed (i.e., no codec such as Snappy, ZLIB, LZO, etc.)
- 52% reduction in disk space when using ORC (high columnar storage compression)
- 97% reduction in disk I/O when using ORC
- 21% improvement in HiveQL query execution time when using ORC
The HiveQL counts the number of page views by operating system and then ranks each operating system by number of page views in descending order; a sketch of that kind of query is shown below.
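A sketch of that aggregation, with hypothetical table and column names (the actual test ran the equivalent HiveQL directly in Hive):

```python
# Illustration of the aggregation tested: count page views per operating
# system and rank operating systems in descending order. The table name
# 'clickstream' and column 'operating_system' are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
  SELECT operating_system,
         COUNT(*) AS page_views
  FROM clickstream
  GROUP BY operating_system
  ORDER BY page_views DESC
""").show()
```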
Using the ORC file format may not always equate to significantly less memory and CPU usage for analytical queries than using row-based text files for high volumes of data. In fact, memory usage could be greater when using the ORC format. You can optimize memory usage by optionally compressing the data using a codec such as ZLIB or Snappy; however, CPU time will then increase because of compression and decompression.
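As an example of enabling compression, an ORC table can declare its codec through table properties; the table and column names here are hypothetical, and the same DDL can be run directly in Hive:

```python
# Illustration: declare an ORC table with Snappy compression via
# TBLPROPERTIES. Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
  CREATE TABLE IF NOT EXISTS clickstream_orc (
    page_url         STRING,
    operating_system STRING,
    view_time        TIMESTAMP
  )
  STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'SNAPPY')
""")
```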
12-01-2016
02:50 AM
I typically don't recommend using Hive on top of HBase. The performance is terrible once you get into high data volumes. You could create your HBase tables and then use Spark to access the data programmatically via the DataFrames API, and use Phoenix to create a view on top of HBase for SQL analytics. Phoenix is orders of magnitude faster than Hive for SQL on HBase. Try it out; it's easy to use. You use Phoenix to create a schema for the HBase table. Hortonworks also released a Spark-on-HBase connector that you can use; it's a DataFrame-based connector: http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
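As a hedged sketch of the programmatic side, here is how a Phoenix table or view over HBase can be read from Spark. The table name, ZooKeeper quorum, and the use of the phoenix-spark data source (its jar must be on the Spark classpath) are assumptions for illustration:

```python
# Sketch: read an HBase table through Phoenix from Spark using the
# phoenix-spark data source. The table name 'CUSTOMER', the ZooKeeper
# quorum, and the connector jar on the classpath are all assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("phoenix_read").getOrCreate()

df = (spark.read
      .format("org.apache.phoenix.spark")
      .option("table", "CUSTOMER")
      .option("zkUrl", "zk-host1,zk-host2,zk-host3:2181")
      .load())

df.printSchema()
df.show(10)
```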