I'm using Apache NiFi for change data capture (CDC) ingestion from tables running on Microsoft SQL Server and Progress OpenEdge. I'm doing this using triggers that record changed data in "shadow tables". Using a set of NiFi processors, I identify the operation type (INSERT/UPDATE/DELETE) and apply the changes to the Data Lake.
Right now, I'm persisting these tables twice inside the Data Lake (Phoenix/HBase and Hive) to evaluate performance, problems, etc.
Phoenix performs very well for real-time apps, like dashboards that need small amounts of data, but we are not doing so well with Hive (even using LLAP daemons) for real-time apps.
On the other hand, Hive performs very well for "heavy" analytics tasks, like summarizing many millions of rows, but we cannot say the same for Phoenix in these cases.
To make matters more complicated, processing CDC requires each transaction to be executed in chronological order to stay consistent with the source databases. Processing these queries one by one on Hive results in very poor performance, with queues building up and latency in updating records on Hive.
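One idea I'm considering to reduce the number of statements sent to Hive is to collapse each ordered batch of CDC events into the net change per primary key before applying it. Below is a minimal sketch of that idea in Python, outside NiFi. It assumes every shadow-table record carries a monotonically increasing sequence number and that INSERT/UPDATE events contain the full row image; the tuple layout and the name `compact_cdc_events` are hypothetical and used only for illustration.

```python
def compact_cdc_events(events):
    """Collapse CDC events to the net change per primary key.

    `events` is an iterable of (seq, op, key, row) tuples (hypothetical layout):
      seq - sequence number assigned by the shadow table (assumed)
      op  - "INSERT", "UPDATE" or "DELETE"
      key - primary key of the source row
      row - full row image for INSERT/UPDATE, None for DELETE (assumed)
    Returns a dict mapping key -> ("UPSERT", row) or ("DELETE", None).
    """
    net = {}
    # Replay events in chronological order; a later event for the same
    # key overwrites the earlier one, so only the final state survives.
    for seq, op, key, row in sorted(events, key=lambda e: e[0]):
        if op == "DELETE":
            net[key] = ("DELETE", None)
        else:
            # With full row images, INSERT and UPDATE both become an upsert.
            net[key] = ("UPSERT", row)
    return net


batch = [
    (1, "INSERT", 10, {"name": "a"}),
    (2, "UPDATE", 10, {"name": "b"}),
    (3, "INSERT", 11, {"name": "c"}),
    (4, "DELETE", 11, None),
]
net_changes = compact_cdc_events(batch)
```

Here four events reduce to two net changes (one upsert for key 10, one delete for key 11). This only stays consistent with the source if a batch is applied as a whole, and it would not work if UPDATE events carried only the changed columns.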
Considering this scenario, I would appreciate any help with the questions below:
1) What is the best option to have data replicated from Phoenix to Hive as close to real time as possible?
2) Considering HDP 2.6 and Phoenix 4.7, is there any way to have Phoenix tables loaded on Hive?
3) Any clue about improving PutHiveQL performance using the MergeContent processor (as suggested by @Matt Burgess here ), considering that the flow is coming from a ConvertJSONToSQL processor?
4) Any kind of suggestion or comment will be appreciated.
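To make question 3 concrete, this is roughly the batching I have in mind, sketched in Python rather than as a NiFi flow. It assumes the statements are plain strings with values already inlined and that the target accepts several `;`-delimited statements per call; `batch_statements` and the sample statements are hypothetical. As far as I know, ConvertJSONToSQL emits parameterized statements whose values travel as flow file attributes, so that part is not modeled here.

```python
def batch_statements(statements, batch_size=100):
    """Yield scripts holding up to `batch_size` statements each, in input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for stmt in statements:
        # Normalize each statement so it ends with exactly one delimiter later.
        batch.append(stmt.strip().rstrip(";"))
        if len(batch) == batch_size:
            yield ";\n".join(batch) + ";"
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield ";\n".join(batch) + ";"


# Hypothetical statements, already in chronological order.
scripts = list(batch_statements(
    ["UPDATE t SET a = 1 WHERE id = 10;",
     "DELETE FROM t WHERE id = 11",
     "UPDATE t SET a = 2 WHERE id = 12"],
    batch_size=2,
))
```

The order inside and across batches is preserved, so chronological consistency is kept as long as the batches themselves are executed sequentially.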
Thanks in advance,
Hey @Timothy Spann,
Thanks for your reply.
I took a look at Druid, but it loses precision on aggregations, which is not acceptable in this project.
I had seen the first link but it requires Phoenix 4.8+, and we have Phoenix 4.7 on HDP 2.6.
I'll take a look at the second link to check whether it takes advantage of MR power or is just a layer on top of HBase tables. Do you know about it?
If you look outside the Hortonworks distribution 😉, Cloudera is pushing Kudu, which is supposed to be a middle ground between Hive and Phoenix. There is also Splice Machine, an MVCC SQL engine on top of HBase, which is now open source.