What's the best practice to get data from HBase and form a DataFrame for Python/R?
Labels: Apache HBase, Apache Spark
Created 12-15-2015 08:00 PM
What's the best practice to get data from HBase and form a DataFrame for Python/R? If we want to use our Pandas/R libraries, how do we get data from HBase and build a DataFrame automatically?
Created 12-16-2015 05:46 PM
@Cui Lin I am not an R guy, but these should give you a good starting point, depending on whether you want to use RevR, R, or Python.
RHbase tutorials -->
https://github.com/RevolutionAnalytics/RHadoop/wik...
http://www.odbms.org/2015/06/intro-to-hbase-via-r-...
http://radar.oreilly.com/2014/08/scaling-up-data-f...
Pandas/HBase -->
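For the Pandas route, HappyBase over the HBase Thrift gateway is a common entry point. A minimal sketch, assuming the Thrift server is running (hbase thrift start) and using hypothetical host, table, and column names:

```python
# Minimal HappyBase -> Pandas sketch; host and table names are hypothetical.
import happybase
import pandas as pd

connection = happybase.Connection('hbase-thrift-host')  # hypothetical host
table = connection.table('sensor_readings')             # hypothetical table

records = []
for key, data in table.scan(limit=1000):  # cap the scan while experimenting
    row = {'rowkey': key.decode()}
    # HappyBase returns qualifiers as b'family:qualifier'; keep the qualifier.
    row.update({col.decode().split(':', 1)[1]: val.decode()
                for col, val in data.items()})
    records.append(row)

df = pd.DataFrame(records)
print(df.head())
```

Note that this pulls everything through a single Thrift connection, so it only suits data that fits comfortably in one client's memory.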
Created 01-08-2016 06:21 PM
It seems that the above can't satisfy all my needs. What's the best way to get data out of HBase and save it into files instead?
Created 01-08-2016 06:25 PM
Created 01-08-2016 06:30 PM
I need to first run a query to select records based on time, and then dump the data into files or a DataFrame. HappyBase can't support such queries, and its index has to be an integer. Could you point me to a MapReduce or Pig example?
Created 01-08-2016 06:34 PM
@Cui Lin I updated my response above with links to MapReduce examples. You will need to set up a scanner based on your criteria and then run MapReduce to write the data out to files. For Pig, here's an example of reading data from an HBase table; then you just call STORE data INTO 'location' USING the storage of your choice.
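For modest result sets, a plain Python scan-and-dump can stand in for the MapReduce job. A minimal sketch, assuming the row key starts with a timestamp (so row_start/row_stop give a server-side range scan) and using hypothetical host and table names; for large tables, the MapReduce/Pig route above is the right tool:

```python
# "Select by time, then dump to files" on a small scale via HappyBase.
# Assumes a time-prefixed row key; host and table names are hypothetical.
import csv
import happybase

connection = happybase.Connection('hbase-thrift-host')
table = connection.table('sensor_readings')

with open('export.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['rowkey', 'column', 'value'])
    # Range scan over one week of time-prefixed keys.
    for key, data in table.scan(row_start=b'20160101', row_stop=b'20160109'):
        for col, val in data.items():
            writer.writerow([key.decode(), col.decode(), val.decode()])
```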
Created 01-08-2016 06:35 PM
Is there any example of getting data from HBase using Spark on Hortonworks? MapR and Cloudera have packages like this; I'm not sure whether they would work on Hortonworks.
Created 01-08-2016 06:38 PM
There's work in progress on the Hortonworks side to make Spark and HBase work efficiently together; we're not publishing anything until we can support it.
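In the meantime, Spark's generic Hadoop InputFormat API works on any distribution, including HDP. A sketch along the lines of the hbase_inputformat.py example bundled with Spark 1.x; the converter classes ship in the spark-examples jar (which must be on the classpath), and the ZooKeeper host and table name here are hypothetical:

```python
# Read an HBase table as an RDD via TableInputFormat, following Spark's
# bundled hbase_inputformat.py example. "my_table" is hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="hbase-inputformat")

conf = {
    "hbase.zookeeper.quorum": "zk-host",      # hypothetical ZooKeeper host
    "hbase.mapreduce.inputtable": "my_table",
}

hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters."
                 "ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters."
                   "HBaseResultToStringConverter",
    conf=conf)

# Dump to files on HDFS, or collect a sample for a local DataFrame.
hbase_rdd.saveAsTextFile("/tmp/hbase_export")
```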
Created 01-08-2016 06:40 PM
Created 01-08-2016 07:04 PM
We have an experimental Spark HBase connector, https://github.com/zhzhan/shc, with the following features:
- First-class support for the DataFrame API
- JSON-based catalog with rich data type support
- Performant, scalable, enterprise-ready
- Partition Pruning
- Predicate Pushdown
- Scan optimizations
- Data Locality
- Composite Rowkey
- Leverage existing work in the HBase community
Please take a look at the README of the above project. Also see the example at https://github.com/zhzhan/shc/blob/master/src/main/scala/org/apache/spark/sql/execution/datasources/...
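As a rough illustration of driving the connector from Python, here is a minimal PySpark sketch using the generic DataFrame reader, following the catalog pattern in the shc README; the table and column names are hypothetical, and the shc jar must be on the classpath (e.g. via spark-submit --jars):

```python
# Minimal PySpark (Spark 1.x era) sketch of reading an HBase table through
# the shc connector. Table and column names are hypothetical.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="shc-example")
sqlContext = SQLContext(sc)

# Maps HBase table "table1" in namespace "default" to DataFrame columns.
# "cf":"rowkey" marks the row key; "cf1" names a real column family.
catalog = """{
  "table": {"namespace": "default", "name": "table1"},
  "rowkey": "key",
  "columns": {
    "col0": {"cf": "rowkey", "col": "key", "type": "string"},
    "col1": {"cf": "cf1", "col": "col1", "type": "string"}
  }
}"""

df = (sqlContext.read
      .options(catalog=catalog)
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load())

df.show()
```

From there, df.toPandas() hands the result to Pandas, though that collects everything to the driver, so filter the DataFrame first.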
