We are planning to explore a use case where we will be using phoenix(with hbase) as a datawarehouse. Has anyone of you explored a similar usecase. Any thoughts or experience to share with? I have a couple of questions based on the requirements we have
1.Is there any functionality in phoenix that can do the role of stored procedures in a rdbms?
2. We need to archive data stored in phoenix after a certain duration. How can this be done in phoenix?
Any thoughts/ideas would be great.
I would generally recommend Hive LLAP for this kind of use case, although Phoenix could make sense depending on your users' data access patterns and requirements. For example, Phoenix is best for queries against a known physical data model (not ad hoc queries).
@slachterman If it were to be done using Hive LLAP using the HPL, is there any kind of IDE that could be used to facilitate development? Could Phoenix be used just as such an interface into the hplsql or would it be to cumbersome? Are you aware of any other Editors or Development environments that would work with hplsql?
1. The role of stored procedures in big data usually goes to application level. The logic has to be redesigned as a Java program for instance. See also coprocessors in HBase.
In particular, Phoenix is designed to execute quick and easy SQL queries, so even if you had no stored procedure but a complex SELECT statement for something in the RDBMS, Phoenix might not support some constructs in it. Consider using Hive with or instead of Phoenix for complex queries.
2. About archiving, first question would be, why would you need it? No disk space? Speed up table scans? There should be a big data way to solve the problem.
@cskrabak, the archiving part is for a reason. we dont use data which is more than 2 quarters old. But we cant discard that data. we may use it at a later point of time if required. so we should archive those data in a place from where we could retrieve in a reasonable period of time.
We went down the route of using Phoenix/HBase as a DW but abandoned it as our data size grew larger because we were going to have to dramatically increase the # of severs as the dataset grew to have enough region servers. The query times were horrible because we were almost always ignoring the row keys, other than dates so we ended up doing a lot of large table scans. HBase is not designed to perform well when you ignore row keys, and that's what treating Phoenix like a data warehouse ultimately caused. We had to set the timeout up to 1 hour, which Ambari didn't like since the default is 3 minute timeouts. Better to stick with Hive LLAP which is intended as a DW, but with LLAP can cache data to improve speed.