Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

What's the best practice to get data from hbase and form dataframe for Python/R?

What's the best practice for getting data from HBase into a DataFrame for Python/R? If we want to use our pandas/R libraries, how can we read data from HBase and build a DataFrame automatically?

1 ACCEPTED SOLUTION

We have an experimental Spark HBase connector (SHC): https://github.com/zhzhan/shc

It has the following features:

  • First class support for DataFrame API
  • JSON based catalog with rich data type support
  • Performant, scalable, enterprise-ready
  • Partition Pruning
  • Predicate Pushdown
  • Scan optimizations
  • Data Locality
  • Composite Rowkey
  • Leverage existing work in the HBase community

Please take a look at the README of the above project.

Also see the example at https://github.com/zhzhan/shc/blob/master/src/main/scala/org/apache/spark/sql/execution/datasources/...
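For the Python side of the question, here is a rough sketch of how the connector might be used from PySpark to end up with a pandas DataFrame. It assumes the SHC package and the HBase configuration are already available to the Spark session, and that the catalog layout and the data source name match what the SHC README describes. The table name, column families, column names, and the helpers `build_catalog` and `read_hbase_as_pandas` are hypothetical and only there for illustration.

```python
import json

# Data source name as given in the SHC README (assumption: check it against your SHC version).
SHC_FORMAT = "org.apache.spark.sql.execution.datasources.hbase"


def build_catalog(table, rowkey_col, columns, namespace="default"):
    """Build the JSON catalog that maps HBase cells to DataFrame columns.

    `columns` maps a DataFrame column name to a tuple of
    (column family, qualifier, type). This helper is hypothetical.
    """
    # The row key is exposed as an ordinary DataFrame column under the
    # pseudo column family "rowkey".
    cols = {rowkey_col: {"cf": "rowkey", "col": "key", "type": "string"}}
    for df_col, (cf, qualifier, dtype) in columns.items():
        cols[df_col] = {"cf": cf, "col": qualifier, "type": dtype}
    return json.dumps({
        "table": {"namespace": namespace, "name": table},
        "rowkey": "key",
        "columns": cols,
    })


def read_hbase_as_pandas(spark, catalog):
    """Load the HBase table through SHC and collect it into pandas.

    `spark` is an existing SparkSession. toPandas() pulls every row to
    the driver, so filter or select on the Spark DataFrame first when
    the table is large.
    """
    df = spark.read.options(catalog=catalog).format(SHC_FORMAT).load()
    return df.toPandas()


# Hypothetical table "table1" with one column family "cf1".
catalog = build_catalog(
    "table1",
    "id",
    {"name": ("cf1", "name", "string"), "score": ("cf1", "score", "double")},
)
```

With a running session, `read_hbase_as_pandas(spark, catalog)` would return a pandas DataFrame with the columns `id`, `name`, and `score`. For R, the same catalog string could in principle be passed to a Spark R interface, but that path is not covered here.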

11 REPLIES

@Artem Ervits, has there been any progress on Spark on HBase at Hortonworks? We are using the HDP platform, but I cannot find anything online that confirms progress beyond the above discussion from 2016.