Code Repositories
Repo Description

This project shows how to analyze an HBase Snapshot using Spark.

The main motivation for writing this code is to reduce the impact on the HBase Region Servers while analyzing HBase records. By creating a snapshot of the HBase table, we can run Spark jobs against the snapshot instead of the live table, eliminating the impact on the region servers and reducing the risk to operational systems.

At a high level, here is what the code does:

  1. Reads an HBase snapshot into a Spark RDD
  2. Parses the HBase KeyValue records into a Spark DataFrame
  3. Applies arbitrary data processing (timestamp and rowkey filtering)
  4. Saves the results to HDFS in HBase's native format (HFiles of KeyValue cells), using HFileOutputFormat.
    • The output format maintains the original rowkey, timestamp, column family, qualifier, and value structure.
  5. From here, you can bulk load the HFiles from HDFS into HBase.
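
The filtering and ordering logic behind steps 3 and 4 can be sketched without a cluster. The following is a minimal plain-Python model, not the repository's code and not an HBase or Spark API: `Cell`, `filter_cells`, and `hfile_order` are hypothetical names chosen for illustration, and the sample data is made up. It shows a rowkey-prefix and timestamp-range filter, followed by the ordering HFiles expect (row, column family, and qualifier ascending, then newest timestamp first), which any job writing through HFileOutputFormat has to respect.

```python
from typing import Iterable, List, NamedTuple


class Cell(NamedTuple):
    """Plain-Python stand-in for one HBase KeyValue (hypothetical type)."""
    rowkey: bytes
    family: bytes
    qualifier: bytes
    timestamp: int  # milliseconds since the epoch, as HBase stores it
    value: bytes


def filter_cells(cells: Iterable[Cell], rowkey_prefix: bytes,
                 min_ts: int, max_ts: int) -> List[Cell]:
    """Keep cells whose rowkey starts with the prefix and whose
    timestamp lies in the half-open range [min_ts, max_ts)."""
    return [c for c in cells
            if c.rowkey.startswith(rowkey_prefix)
            and min_ts <= c.timestamp < max_ts]


def hfile_order(cells: Iterable[Cell]) -> List[Cell]:
    """Sort by rowkey, family, qualifier ascending, then newest timestamp first."""
    return sorted(cells,
                  key=lambda c: (c.rowkey, c.family, c.qualifier, -c.timestamp))


# Made-up sample cells, deliberately out of order.
cells = [
    Cell(b"user2", b"cf", b"name", 1000, b"bob"),
    Cell(b"user1", b"cf", b"name", 1000, b"al"),
    Cell(b"user1", b"cf", b"name", 2000, b"alice"),
    Cell(b"item9", b"cf", b"price", 1500, b"42"),
    Cell(b"user1", b"cf", b"age", 5000, b"30"),
]

result = hfile_order(filter_cells(cells, b"user", 1000, 3000))
for c in result:
    print(c.rowkey.decode(), c.qualifier.decode(), c.timestamp, c.value.decode())
# prints:
# user1 name 2000 alice
# user1 name 1000 al
# user2 name 1000 bob
```

Each surviving cell keeps its original rowkey, timestamp, column family, qualifier, and value; only the selection and the order change. In a real Spark job the same predicate would presumably be expressed as a DataFrame filter, with the sort applied before the cells are written out.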
Repo Info
GitHub repo URL: https://github.com/zaratsian/SparkHBaseExample
GitHub account name: zaratsian
Repo name: SparkHBaseExample
Comments

Zaratsian, I followed your tutorial but am getting a "Wrong FS" error. Can you help me solve this issue?

I have posted the question at the link below:

https://community.hortonworks.com/questions/114572/getting-error-while-reading-hbase-snapshot-throug...

Version history
Revision #: 1 of 1
Last update: 09-30-2016 12:02 AM