
How to Query HBase Snapshot (in HDFS) from Spark or PySpark?

Solved


Does anyone have sample PySpark or Spark code to query an HBase snapshot?

I've created an HBase table, loaded data, and taken a snapshot, which I then moved into HDFS. Now I would like to query this table to analyze the data and, more specifically, filter based on timestamp.

Does anyone have any code or best practices/advice for doing this?

Thanks!

1 ACCEPTED SOLUTION


Re: How to Query HBase Snapshot (in HDFS) from Spark or PySpark?

From the Spark 1.6.x docs: "For other Hadoop InputFormats, you can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the "new" MapReduce API (org.apache.hadoop.mapreduce)."

The InputFormat class you'd specify is (I believe) TableSnapshotInputFormat. I recommend reading a bit of that API doc, as it notes the need to use a CellScanner, which gives you access to Cells and thus to values and their timestamps.
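A minimal PySpark sketch of that wiring is below. The two TableSnapshotInputFormat conf keys and the two Python converter classes (shipped with the Spark examples) are assumptions drawn from the HBase and Spark sources, not something verified against a running cluster, and the snapshot name, ZooKeeper quorum, and restore path are placeholders:

```python
def snapshot_conf(snapshot_name, restore_dir, zk_quorum):
    """Hadoop configuration for TableSnapshotInputFormat.

    The two TableSnapshotInputFormat keys are assumptions based on the
    HBase source. The snapshot must already exist in the cluster's HBase
    root dir, and restore_dir is a scratch HDFS path the job can write to.
    """
    return {
        "hbase.zookeeper.quorum": zk_quorum,
        "hbase.TableSnapshotInputFormat.snapshot.name": snapshot_name,
        "hbase.TableSnapshotInputFormat.restore.dir": restore_dir,
    }

def read_snapshot(sc, snapshot_name, restore_dir, zk_quorum):
    """Return an RDD of (row key, cell JSON) pairs from an HBase snapshot.

    Requires the HBase client/server jars (and, for the converters, the
    Spark examples jar) on the driver and executor classpaths.
    """
    return sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter="org.apache.spark.examples.pythonconverters."
                     "ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters."
                       "HBaseResultToStringConverter",
        conf=snapshot_conf(snapshot_name, restore_dir, zk_quorum))

# Hypothetical usage inside a spark-submit job:
# rdd = read_snapshot(sc, "my_snapshot", "hdfs:///tmp/snap_restore", "zk1,zk2,zk3")
```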

If you get an example working, an article would be excellent!


3 REPLIES


Re: How to Query HBase Snapshot (in HDFS) from Spark or PySpark?


@randy is right. TableSnapshotInputFormat is the InputFormat to use to read from an HBase snapshot. You can pass a Scan object configured with your Filters or timestamp predicates to do the filtering for you.
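If serializing a Scan into the job conf proves awkward from PySpark, a simple fallback is to filter timestamps on the Spark side after reading. A minimal sketch, assuming each record arrives as a JSON string containing a "timestamp" field in epoch milliseconds (an assumption about the converter's output format):

```python
import json

def cell_in_range(cell_json, start_ms, end_ms):
    """True if a cell's timestamp falls in [start_ms, end_ms).

    Assumes each record is a JSON string with a "timestamp" field, e.g.
    as emitted by Spark's example HBaseResultToStringConverter; that
    field name is an assumption, so check the actual output first.
    """
    ts = int(json.loads(cell_json)["timestamp"])
    return start_ms <= ts < end_ms

# Hypothetical usage on an RDD of (row key, cell JSON) pairs:
# recent = rdd.filter(lambda kv: cell_in_range(kv[1], 1460000000000, 1470000000000))
```

Note this filters after the cells reach Spark, so a Scan with a server-side time range will be more efficient when you can wire it up.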

Re: How to Query HBase Snapshot (in HDFS) from Spark or PySpark?

Thanks for the responses, Randy and Enis. Very helpful! I was able to get this working and published the GitHub project here: https://github.com/zaratsian/SparkHBaseExample