<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How to Query Hbase Snapshot (in HDFS) from Spark or PySpark? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166140#M37021</link>
    <description>&lt;P&gt;Thanks for the response Randy and Enis - very helpful! I was able to get this working and placed the github project here: &lt;A href="https://github.com/zaratsian/SparkHBaseExample" target="_blank"&gt;https://github.com/zaratsian/SparkHBaseExample&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 18 Aug 2016 22:29:23 GMT</pubDate>
    <dc:creator>dzaratsian</dc:creator>
    <dc:date>2016-08-18T22:29:23Z</dc:date>
    <item>
      <title>How to Query Hbase Snapshot (in HDFS) from Spark or PySpark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166137#M37018</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Does anyone have sample PySpark or Spark code to query an Hbase Snapshot?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I've created an Hbase table, loaded data, and then took a snapshot. I then moved the snapshot into HDFS. Now I would like to query this table to analyze the data and (more specifically) filter based on timestamp. &lt;/P&gt;&lt;P&gt;Does anyone have any code or best practices/advice for doing this? &lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 05 Aug 2016 11:20:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166137#M37018</guid>
      <dc:creator>dzaratsian</dc:creator>
      <dc:date>2016-08-05T11:20:56Z</dc:date>
    </item>
    <item>
      <title>Re: How to Query Hbase Snapshot (in HDFS) from Spark or PySpark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166138#M37019</link>
      <description>&lt;P&gt;From the &lt;A href="https://spark.apache.org/docs/1.6.2/programming-guide.html"&gt;Spark 1.6.x docs&lt;/A&gt;: "For other Hadoop InputFormats, you can use the &lt;CODE&gt;SparkContext.hadoopRDD&lt;/CODE&gt; method, which takes an arbitrary &lt;CODE&gt;JobConf&lt;/CODE&gt; and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use &lt;CODE&gt;SparkContext.newAPIHadoopRDD&lt;/CODE&gt; for InputFormats based on the “new” MapReduce API (&lt;CODE&gt;org.apache.hadoop.mapreduce&lt;/CODE&gt;)."&lt;/P&gt;&lt;P&gt;The InputFormat class you'd specify is (I believe) &lt;A href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormat.html"&gt;TableSnapshotInputFormat&lt;/A&gt;. I recommend reading a bit of that API doc, as it notes the need to use a CellScanner, which gives you access to Cells and, through them, to values and their timestamps.&lt;/P&gt;&lt;P&gt;If you get an example working, an article would be excellent!&lt;/P&gt;</description>
      <pubDate>Fri, 05 Aug 2016 13:58:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166138#M37019</guid>
      <dc:creator>rgelhausen</dc:creator>
      <dc:date>2016-08-05T13:58:49Z</dc:date>
    </item>
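A minimal PySpark sketch of the approach described in the reply above: pass TableSnapshotInputFormat to SparkContext.newAPIHadoopRDD. The snapshot name and restore directory below are placeholders, and the configuration key names mirror those defined in the HBase source (TableSnapshotInputFormatImpl); verify them against your HBase version.

```python
# Sketch: reading an HBase snapshot via SparkContext.newAPIHadoopRDD with
# TableSnapshotInputFormat. Snapshot name and restore dir are placeholders.

def snapshot_conf(snapshot_name, restore_dir):
    """Build the Hadoop configuration entries TableSnapshotInputFormat reads.

    The key names follow TableSnapshotInputFormatImpl in the HBase source;
    confirm them for the HBase version on your cluster.
    """
    return {
        "hbase.TableSnapshotInputFormat.snapshot.name": snapshot_name,
        "hbase.TableSnapshotInputFormat.restore.dir": restore_dir,
    }

def read_snapshot(sc, snapshot_name, restore_dir):
    """Return an RDD of (ImmutableBytesWritable, Result) pairs from a snapshot."""
    return sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        conf=snapshot_conf(snapshot_name, restore_dir),
    )
```

In practice you would also need the HBase jars on the Spark classpath (e.g. via spark-submit with the jars option) and, from Python, a keyConverter/valueConverter to turn Result objects into something usable; the converters shipped with the Spark HBase examples are one option.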
    <item>
      <title>Re: How to Query Hbase Snapshot (in HDFS) from Spark or PySpark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166139#M37020</link>
      <description>&lt;P&gt;@randy is right. TableSnapshotInputFormat is the InputFormat to use to read from an HBase snapshot. You can pass a Scan object configured with your Filters or timestamp predicates to do the filtering for you. &lt;/P&gt;</description>
      <pubDate>Sat, 06 Aug 2016 00:34:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166139#M37020</guid>
      <dc:creator>Enis</dc:creator>
      <dc:date>2016-08-06T00:34:28Z</dc:date>
    </item>
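If pushing the timestamp predicate into a Scan is inconvenient from Python, the filtering can also happen on the Spark side after conversion. The (row_key, cells) layout below, with each cell as a (timestamp, value) tuple, is a hypothetical shape for illustration; the actual record shape depends entirely on the converter that emits the records, not on any HBase API.

```python
def filter_cells_by_time(rows, min_ts, max_ts):
    """Keep only cells whose timestamp lies in [min_ts, max_ts).

    rows: iterable of (row_key, cells) pairs, where each cell is a
    (timestamp, value) tuple -- a hypothetical layout that depends on
    the converter used to decode the snapshot records.
    """
    out = []
    for row_key, cells in rows:
        # Retain cells in the half-open window [min_ts, max_ts)
        kept = [(ts, v) for ts, v in cells if ts >= min_ts and max_ts > ts]
        if kept:
            out.append((row_key, kept))
    return out
```

On an RDD the same logic would run distributed, e.g. via rdd.mapValues to trim each row's cell list and then a filter dropping rows whose cell list came back empty.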
    <item>
      <title>Re: How to Query Hbase Snapshot (in HDFS) from Spark or PySpark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166140#M37021</link>
      <description>&lt;P&gt;Thanks for the response Randy and Enis - very helpful! I was able to get this working and placed the github project here: &lt;A href="https://github.com/zaratsian/SparkHBaseExample" target="_blank"&gt;https://github.com/zaratsian/SparkHBaseExample&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 18 Aug 2016 22:29:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166140#M37021</guid>
      <dc:creator>dzaratsian</dc:creator>
      <dc:date>2016-08-18T22:29:23Z</dc:date>
    </item>
  </channel>
</rss>

