<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How to Query Hbase Snapshot (in HDFS) from Spark or PySpark? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166140#M37021</link>
    <description>&lt;P&gt;Thanks for the response Randy and Enis - very helpful! I was able to get this working and placed the github project here: &lt;A href="https://github.com/zaratsian/SparkHBaseExample" target="_blank"&gt;https://github.com/zaratsian/SparkHBaseExample&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 18 Aug 2016 22:29:23 GMT</pubDate>
    <dc:creator>dzaratsian</dc:creator>
    <dc:date>2016-08-18T22:29:23Z</dc:date>
    <item>
      <title>How to Query Hbase Snapshot (in HDFS) from Spark or PySpark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166137#M37018</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Does anyone have sample PySpark or Spark code to query an Hbase Snapshot?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I've created an Hbase table, loaded data, and then took a snapshot. I then moved the snapshot into HDFS. Now I would like to query this table to analyze the data and (more specifically) filter based on timestamp. &lt;/P&gt;&lt;P&gt;Does anyone have any code or best practices/advice for doing this? &lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 05 Aug 2016 11:20:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166137#M37018</guid>
      <dc:creator>dzaratsian</dc:creator>
      <dc:date>2016-08-05T11:20:56Z</dc:date>
    </item>
    <item>
      <title>Re: How to Query Hbase Snapshot (in HDFS) from Spark or PySpark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166138#M37019</link>
      <description>&lt;P&gt;From the &lt;A href="https://spark.apache.org/docs/1.6.2/programming-guide.html"&gt;Spark 1.6.x docs&lt;/A&gt;: "For other Hadoop InputFormats, you can use the &lt;CODE&gt;SparkContext.hadoopRDD&lt;/CODE&gt; method, which takes an arbitrary &lt;CODE&gt;JobConf&lt;/CODE&gt; and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use &lt;CODE&gt;SparkContext.newAPIHadoopRDD&lt;/CODE&gt; for InputFormats based on the “new” MapReduce API (&lt;CODE&gt;org.apache.hadoop.mapreduce&lt;/CODE&gt;)."&lt;/P&gt;&lt;P&gt;The InputFormat class you'd specify is (I believe) &lt;A href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormat.html"&gt;TableSnapshotInputFormat&lt;/A&gt;. I recommend reading a bit of that API doc, as it notes the need to use a CellScanner, which gives you access to Cells and, through them, to values and their timestamps.&lt;/P&gt;&lt;P&gt;If you get an example working, an article would be excellent!&lt;/P&gt;</description>
      <pubDate>Fri, 05 Aug 2016 13:58:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166138#M37019</guid>
      <dc:creator>rgelhausen</dc:creator>
      <dc:date>2016-08-05T13:58:49Z</dc:date>
    </item>
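A minimal PySpark sketch of the approach described in the reply above: pass TableSnapshotInputFormat to SparkContext.newAPIHadoopRDD. The snapshot name and restore directory below are placeholders, and the configuration key names mirror those defined in the HBase source (TableSnapshotInputFormatImpl); verify them against your HBase version.

```python
# Sketch: reading an HBase snapshot via SparkContext.newAPIHadoopRDD with
# TableSnapshotInputFormat. Snapshot name and restore dir are placeholders.

def snapshot_conf(snapshot_name, restore_dir):
    """Build the Hadoop configuration entries TableSnapshotInputFormat reads.

    The key names follow TableSnapshotInputFormatImpl in the HBase source;
    confirm them for the HBase version on your cluster.
    """
    return {
        "hbase.TableSnapshotInputFormat.snapshot.name": snapshot_name,
        "hbase.TableSnapshotInputFormat.restore.dir": restore_dir,
    }

def read_snapshot(sc, snapshot_name, restore_dir):
    """Return an RDD of (ImmutableBytesWritable, Result) pairs from a snapshot."""
    return sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        conf=snapshot_conf(snapshot_name, restore_dir),
    )
```

In practice you would also need the HBase jars on the Spark classpath (e.g. via spark-submit with the jars option) and, from Python, a keyConverter/valueConverter to turn Result objects into something usable; the converters shipped with the Spark HBase examples are one option.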
    <item>
      <title>Re: How to Query Hbase Snapshot (in HDFS) from Spark or PySpark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166139#M37020</link>
      <description>&lt;P&gt;@randy is right. TableSnapshotInputFormat is the InputFormat to use to read from an HBase snapshot. You can pass a Scan object configured with your Filters or timestamp predicates to do the filtering for you. &lt;/P&gt;</description>
      <pubDate>Sat, 06 Aug 2016 00:34:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166139#M37020</guid>
      <dc:creator>Enis</dc:creator>
      <dc:date>2016-08-06T00:34:28Z</dc:date>
    </item>
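If pushing the timestamp predicate into a Scan is inconvenient from Python, the filtering can also happen on the Spark side after conversion. The (row_key, cells) layout below, with each cell as a (timestamp, value) tuple, is a hypothetical shape for illustration; the actual record shape depends entirely on the converter that emits the records, not on any HBase API.

```python
def filter_cells_by_time(rows, min_ts, max_ts):
    """Keep only cells whose timestamp lies in [min_ts, max_ts).

    rows: iterable of (row_key, cells) pairs, where each cell is a
    (timestamp, value) tuple -- a hypothetical layout that depends on
    the converter used to decode the snapshot records.
    """
    out = []
    for row_key, cells in rows:
        # Retain cells in the half-open window [min_ts, max_ts)
        kept = [(ts, v) for ts, v in cells if ts >= min_ts and max_ts > ts]
        if kept:
            out.append((row_key, kept))
    return out
```

On an RDD the same logic would run distributed, e.g. via rdd.mapValues to trim each row's cell list and then a filter dropping rows whose cell list came back empty.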
    <item>
      <title>Re: How to Query Hbase Snapshot (in HDFS) from Spark or PySpark?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166140#M37021</link>
      <description>&lt;P&gt;Thanks for the response Randy and Enis - very helpful! I was able to get this working and placed the github project here: &lt;A href="https://github.com/zaratsian/SparkHBaseExample" target="_blank"&gt;https://github.com/zaratsian/SparkHBaseExample&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 18 Aug 2016 22:29:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-Query-Hbase-Snapshot-in-HDFS-from-Spark-or-PySpark/m-p/166140#M37021</guid>
      <dc:creator>dzaratsian</dc:creator>
      <dc:date>2016-08-18T22:29:23Z</dc:date>
    </item>
  </channel>
</rss>

