What's the best practice to get data from HBase and form a DataFrame for Python/R?

What's the best practice for getting data out of HBase and into a DataFrame for Python or R? We want to keep using our pandas/R libraries, so how can we read data from HBase and build a DataFrame automatically?

1 ACCEPTED SOLUTION

We have an experimental Spark HBase connector, https://github.com/zhzhan/shc, with the following features:

  • First class support for DataFrame API
  • JSON based catalog with rich data type support
  • Performant, scalable, enterprise-ready
  • Partition Pruning
  • Predicate Pushdown
  • Scan optimizations
  • Data Locality
  • Composite Rowkey
  • Leverage existing work in the HBase community

Please take a look at the README of the above project.

Also see the example at https://github.com/zhzhan/shc/blob/master/src/main/scala/org/apache/spark/sql/execution/datasources/...
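
Since the question is about Python, here is a minimal PySpark sketch of how the connector's JSON catalog and DataFrame API could be used to end up with a pandas DataFrame. It is an illustration under assumptions, not the project's official example: the catalog layout, the "catalog" option key, and the data source name follow the SHC README as I recall it, so please verify them against the README. The helper names (build_catalog, load_hbase_as_pandas), the table "sensor_readings", and the column family "cf1" are hypothetical. The connector jar must also be on the Spark classpath.

```python
import json


def build_catalog(table, rowkey_field, columns, namespace="default"):
    """Build an SHC-style JSON catalog string (layout assumed from the SHC README).

    columns maps DataFrame field name -> (column family, qualifier, type).
    """
    # The row key is declared as a column in the special "rowkey" family.
    cols = {rowkey_field: {"cf": "rowkey", "col": rowkey_field, "type": "string"}}
    for field, (cf, qualifier, dtype) in columns.items():
        cols[field] = {"cf": cf, "col": qualifier, "type": dtype}
    return json.dumps({
        "table": {"namespace": namespace, "name": table},
        "rowkey": rowkey_field,
        "columns": cols,
    })


def load_hbase_as_pandas(spark, catalog):
    """Read the HBase table through the connector and collect it into pandas.

    Assumes `spark` is a SparkSession started with the SHC jar available.
    """
    df = (
        spark.read
        .options(catalog=catalog)
        .format("org.apache.spark.sql.execution.datasources.hbase")
        .load()
    )
    # toPandas() pulls every row to the driver, so filter or aggregate first
    # when the table is large.
    return df.toPandas()


# Hypothetical table and columns, for illustration only.
catalog = build_catalog(
    table="sensor_readings",
    rowkey_field="key",
    columns={
        "temperature": ("cf1", "temp", "double"),
        "site": ("cf1", "site", "string"),
    },
)
```

With a running session, load_hbase_as_pandas(spark, catalog) would return a pandas DataFrame with the columns key, temperature, and site. Applying filters on the Spark DataFrame before calling toPandas() gives the connector a chance to use the predicate pushdown and partition pruning listed above.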

11 REPLIES

@Artem Ervits, has there been any progress on Spark on HBase from Hortonworks? We are using the HDP platform, but I cannot find anything online that confirms progress beyond the above discussion from 2016.