Created 08-05-2016 04:20 AM
Does anyone have sample PySpark or Spark code to query an HBase snapshot?
I've created an HBase table, loaded data, and then taken a snapshot, which I moved into HDFS. Now I would like to query this table to analyze the data and, more specifically, filter based on timestamp.
Does anyone have any code or best practices/advice for doing this?
Thanks!
Created 08-05-2016 06:58 AM
From the Spark 1.6.x docs: "For other Hadoop InputFormats, you can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the 'new' MapReduce API (org.apache.hadoop.mapreduce)."
The InputFormat class you'd specify is (I believe) TableSnapshotInputFormat. I recommend reading a bit of that API doc, as it notes the need to use a CellScanner which gives you access to Cells, which gives you access to values and to their timestamps.
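Putting those two pieces together, a sketch in Scala might look like the following. This is untested; the snapshot name ("mySnapshot") and the restore directory ("/tmp/snapshot_restore") are placeholder assumptions you would replace with your own values.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

val conf = HBaseConfiguration.create()
// Job is only used here as a carrier: setInput writes the snapshot
// details into the underlying Configuration.
val job = Job.getInstance(conf)
TableSnapshotInputFormat.setInput(job, "mySnapshot", new Path("/tmp/snapshot_restore"))

// Read the snapshot directly from HDFS (no RegionServers involved).
val rdd = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[TableSnapshotInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Each Result exposes its Cells; each Cell carries a value and a timestamp.
rdd.foreach { case (rowKey, result) =>
  result.rawCells().foreach { cell =>
    val value = Bytes.toString(CellUtil.cloneValue(cell))
    println(s"${Bytes.toString(rowKey.get())} @ ${cell.getTimestamp}: $value")
  }
}
```

Note that the restore directory must be on the same filesystem as the snapshot and should be empty; the input format restores the snapshot's region metadata there before scanning.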
If you get an example working, an article would be excellent!
Created 08-05-2016 05:34 PM
@randy is right. TableSnapshotInputFormat is the InputFormat to use to read from an HBase snapshot. You can pass a Scan object, which you can configure with your Filters or timestamp predicates, to do the filtering for you.
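For the timestamp filtering mentioned above, one way (untested sketch) is to configure a Scan with a time range and serialize it into the job configuration, which TableSnapshotInputFormat reads under the same key as TableInputFormat. The min/max timestamps below are placeholder values.

```scala
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}

val scan = new Scan()
// Keep only cells whose timestamp falls in [minStamp, maxStamp).
val minStamp = 1470355200000L  // placeholder epoch millis
val maxStamp = 1470441600000L  // placeholder epoch millis
scan.setTimeRange(minStamp, maxStamp)

// Serialize the Scan and hand it to the input format via the conf,
// before calling sc.newAPIHadoopRDD with job.getConfiguration.
job.getConfiguration.set(
  TableInputFormat.SCAN,
  TableMapReduceUtil.convertScanToString(scan))
```

Doing the filtering in the Scan pushes the timestamp predicate down into the HFile read path, so Spark never sees the cells outside the range, which is usually much cheaper than filtering the RDD afterward.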
Created 08-18-2016 03:29 PM
Thanks for the responses, Randy and Enis - very helpful! I was able to get this working; the GitHub project is here: https://github.com/zaratsian/SparkHBaseExample