The scenario is as follows.
Our data sits in the NoSQL database HBase (on HDP, currently 2.3.x, potentially soon 2.4). We are required to access this data with SQL-type queries via JDBC and ODBC.
Now my question: are there best practices for how to do this? What worked well, and what did not? We have tried Apache Phoenix, but it has not been ideal so far (e.g. it has no ODBC driver; our understanding is that one is only available via a 3rd party).
Any experiences with Apache Drill? Is it more suitable? https://drill.apache.org/team/
From the description it sounds promising.
Another alternative, especially as of HDP 2.4, seems to be Apache HAWQ (MPP).
I am unsure how it works in practice in combination with HBase. We certainly don't want to store the data twice (which might not be required when using PXF - http://hawq.docs.pivotal.io/docs-hawq/topics/PXFInstallationandAdministration.html#accessinghbasedat...). I am also unsure whether PXF on HBase files is still fast, or whether the data really needs to live in HAWQ directly, which we don't want since the store is HBase.
And finally - http://kylin.apache.org/.
That is the OLAP-cube-on-Hadoop approach, which keeps its metadata in Hive and the actual data in HBase and, per its description, allows sub-second SQL results via JDBC, ODBC, etc.
If there are best practices, lessons learned, experiences with one of those in combination with HBase, I would be interested in the details.
There is a customer who deploys Kylin alongside HBase in production.
Note that you need to check out the following branch of Kylin to work with HBase 1.x releases:
Apache Phoenix can be accessed through a JDBC connection (https://phoenix.apache.org/) or through a Query Server connection (https://phoenix.apache.org/server.html). To test the JDBC connection, try the shipped sqlline.py script. For the Query Server, use sqlline-thin.py instead.
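For anyone who wants to try the same two connection paths from application code rather than from sqlline, here is a minimal Java sketch. The class name, the helper methods, the host names, and the table name are all hypothetical. The ZooKeeper port 2181, the znode parent /hbase-unsecure, and the Query Server port 8765 are assumed typical values and may differ on your cluster. The matching Phoenix client jar (thick or thin) also has to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper class; not part of Phoenix itself.
class PhoenixJdbcSketch {

    // Thick driver: the client talks to HBase directly and locates it via ZooKeeper.
    // URL shape: jdbc:phoenix:<zookeeper quorum>:<port>:<znode parent>
    static String thickUrl(String zkQuorum, int zkPort, String znodeParent) {
        return "jdbc:phoenix:" + zkQuorum + ":" + zkPort + ":" + znodeParent;
    }

    // Thin driver: the client talks HTTP to the Phoenix Query Server.
    // URL shape: jdbc:phoenix:thin:url=http://<host>:<port>
    static String thinUrl(String queryServerHost, int queryServerPort) {
        return "jdbc:phoenix:thin:url=http://" + queryServerHost + ":" + queryServerPort;
    }

    // Runs a simple COUNT(*) against the given table over plain JDBC.
    // The table name is concatenated as-is, so only pass trusted values.
    static long countRows(String jdbcUrl, String table) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM " + table)) {
            rs.next(); // an aggregate without GROUP BY yields exactly one row
            return rs.getLong(1);
        }
    }
}
```

With a reachable cluster, a call such as `PhoenixJdbcSketch.countRows(PhoenixJdbcSketch.thinUrl("localhost", 8765), "MY_TABLE")` would go through the Query Server, and swapping in `thickUrl(...)` would go through ZooKeeper instead. The rest of the code stays the same because both paths are ordinary JDBC.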
Thank you all for the answers so far. It seems there are quite a few projects in this area. I assume best practices and benchmarks will emerge over time. Another option will probably be available soon as well. http://de.hortonworks.com/blog/future-apache-hadoop/ Hive 2.0 - "The Hive community is working towards a 2.0 release of Hive that includes significant new features and performance improvements. These include: * Adding LLAP, a daemon layer that enables sub-second response time."