We are now using Hive in fairly standard ways, with one wrinkle: our data is binary (protobuf), so we have written a SerDe to handle it. I am wondering about the future roadmap for Hive within the Cloudera umbrella.
Impala is one route, but it does not support SerDe plugins, as far as I know. What is Cloudera's position on Shark and Stinger, which are explicitly designed as Hive improvements?
1. First, keep in mind that Impala and Hive have different use cases. Impala offers the low latency and high concurrency that analysts doing BI-style queries are going to expect. In contrast, Hive/MR is still more appropriate for batch-oriented processing.
2. Based on #1, it stands to reason that any and all improvements to Hive are good news insofar as they help users with those workloads. To that end, Cloudera employs Hive committers, actively contributes code to Hive (e.g., HiveServer2), and provides complementary infrastructure (e.g., the incubating Apache Sentry project for RBAC, which is built for both Hive and Impala and which we hope will be embraced by the entire ecosystem).
3. Shark (which is actually a Hive port, not an "improvement" to Hive) is another example of having the right tool for the right job. I think most would agree that Shark is generally used for complex analytics/iterative machine learning, not "mainstream" BI.
Thanks for your reply. I disagree somewhat with your reasoning. Many Hive users put up with batch processing and slow response times because they have no other choice, when what they really want is faster results. So Impala and Shark *are* seen by many Hive users as hoped-for improvements.
What is Cloudera's plan for Stinger, which comes from your competitor Hortonworks but is explicitly a project to improve Hive? Are you accepting Stinger code changes into future releases of Hive within CDH?