How to Query an HBase Snapshot (in HDFS) from Spark or PySpark?

Does anyone have sample PySpark or Spark code to query an HBase snapshot?

I've created an HBase table, loaded data, and taken a snapshot. I then moved the snapshot into HDFS. Now I would like to query this table to analyze the data and, more specifically, filter based on timestamp.

Does anyone have any code or best practices/advice for doing this?

Thanks!

1 ACCEPTED SOLUTION

From the Spark 1.6.x docs: "For other Hadoop InputFormats, you can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the “new” MapReduce API (org.apache.hadoop.mapreduce)."

The InputFormat class you'd specify is (I believe) TableSnapshotInputFormat. I recommend reading a bit of that API doc, as it notes the need to use a CellScanner, which gives you access to Cells and, through them, to values and their timestamps.
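A rough, untested sketch of how these pieces might fit together in Scala, assuming Spark 1.6 and HBase 1.x on the classpath. The snapshot name and both HDFS paths are hypothetical placeholders, and the Scan serialization uses the HBase 1.x ProtobufUtil and Base64 helpers, which moved in later releases. For brevity it reads cells with Result.rawCells() rather than a CellScanner.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration}
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableSnapshotInputFormat}
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.util.{Base64, Bytes}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object SnapshotReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-snapshot-read"))

    // Hypothetical values: substitute your own snapshot name and paths.
    val snapshotName = "my_snapshot"
    val hbaseRootDir = "hdfs:///apps/hbase/data"
    // Scratch directory the snapshot is restored into. It must be writable
    // by the current user and must not sit under hbase.rootdir.
    val restoreDir = "hdfs:///tmp/snapshot_restore"

    val hconf = HBaseConfiguration.create()
    hconf.set("hbase.rootdir", hbaseRootDir)

    // The input format reads its Scan from the configuration as a
    // Base64-encoded protobuf; an empty Scan reads the whole table.
    val scan = new Scan()
    hconf.set(TableInputFormat.SCAN,
      Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray))

    // Job copies the configuration, so set everything on hconf first.
    val job = Job.getInstance(hconf)
    TableSnapshotInputFormat.setInput(job, snapshotName, new Path(restoreDir))

    val rdd = sc.newAPIHadoopRDD(
      job.getConfiguration,
      classOf[TableSnapshotInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Flatten each Result into (row, family, qualifier, timestamp, value).
    val cells = rdd.flatMap { case (_, result) =>
      result.rawCells().map { cell =>
        (Bytes.toString(CellUtil.cloneRow(cell)),
         Bytes.toString(CellUtil.cloneFamily(cell)),
         Bytes.toString(CellUtil.cloneQualifier(cell)),
         cell.getTimestamp,
         Bytes.toString(CellUtil.cloneValue(cell)))
      }
    }

    cells.take(10).foreach(println)
    sc.stop()
  }
}
```

Converting to plain tuples inside the flatMap keeps the HBase Result objects, which are not Java-serializable, from being shipped back to the driver.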

If you get an example working, an article would be excellent!

3 REPLIES

@randy is right. TableSnapshotInputFormat is the InputFormat to use for reading from an HBase snapshot. You can pass it a Scan object configured with Filters or a timestamp range, and it will do the filtering for you.
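As an untested illustration of the timestamp case, here is a small Scala helper that restricts a Scan to a time range and stores it in the job configuration under the TableInputFormat.SCAN key. The object and method names are made up for this sketch, and the serialization calls assume HBase 1.x (ProtobufUtil and Base64 live elsewhere in later releases).

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.util.Base64

// Hypothetical helper, not part of HBase or Spark.
object TimeRangeScanSketch {

  // Serialize a Scan as a Base64-encoded protobuf message, which is the
  // form the HBase input formats read back from the configuration.
  def encodeScan(scan: Scan): String =
    Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray)

  // Keep only cells whose timestamp falls in [minTs, maxTs):
  // the lower bound is inclusive, the upper bound exclusive.
  def configureTimeRange(conf: Configuration, minTs: Long, maxTs: Long): Unit = {
    val scan = new Scan()
    scan.setTimeRange(minTs, maxTs)
    conf.set(TableInputFormat.SCAN, encodeScan(scan))
  }
}
```

Call configureTimeRange on the HBase configuration before creating the Job and calling TableSnapshotInputFormat.setInput, since Job takes a copy of the configuration. Filters can be attached to the same Scan with scan.setFilter before it is encoded.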

Thanks for the responses, Randy and Enis - very helpful! I was able to get this working and put the project on GitHub here: https://github.com/zaratsian/SparkHBaseExample