Recently I have become interested in BI on Hadoop.
We are already operating Hadoop (2.7), Hive (1.6), and Spark (2.x),
installed across 40 nodes.
But something is lacking: our current Hive is not very fast.
Because of many old-style tables (CSV, TSV, Avro, JSON, custom text, external tables), we cannot move to Hive 2.x (LLAP).
After I upgraded Hive from 0.x to 1.x, many of the recent Hive performance-boosting options (CBO, vectorization, etc.) seemed to cause problems with the old tables when turned on. So upgrading to Hive 2.x feels very challenging.
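As a rough illustration of working around this, the optimizer switches can be set per session rather than cluster-wide, so that only queries on the old-format tables run with them off. The sketch below just builds the statements to send ahead of a query. `hive.cbo.enable` and `hive.vectorized.execution.enabled` are standard Hive properties, but the format list, the helper names, and the table `legacy_events` are assumptions made for the example, not anything from our cluster.

```python
# Formats assumed to misbehave with the newer optimizers (illustrative list only).
LEGACY_FORMATS = {"csv", "tsv", "avro", "json", "text"}


def session_settings(table_format):
    """Return the SET statements to run before querying a table of this format."""
    safe = table_format.lower() not in LEGACY_FORMATS
    flag = "true" if safe else "false"
    return [
        f"SET hive.cbo.enable={flag};",
        f"SET hive.vectorized.execution.enabled={flag};",
    ]


def with_settings(table_format, query):
    """Prefix a query with the session settings chosen for its table format."""
    statements = session_settings(table_format) + [query.rstrip(";") + ";"]
    return "\n".join(statements)


# Hypothetical table name, used only to show the generated script.
print(with_settings("avro", "SELECT count(*) FROM legacy_events"))
```

Whether turning the options off per session is enough depends on which of them actually breaks a given table, so this is a starting point for testing rather than a fix.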
By BI on Hadoop, I mean:
- Online interactive analytics
- BI tool integration
So we need a new system, something like Spark Thrift Server, Impala (with Kudu), or Hive 2.x. Here are the pros and cons as I see them, in that order:
Spark Thrift Server
- pros: easy to set up, can use the existing Hive table data.
- cons: in-memory compute can cause OOM. Our Hive tables are not Parquet but a mix of ORC, Avro, and text.
Impala (with Kudu)
- pros: appears to have low latency.
- cons: needs to be set up from scratch. We are running HDP, not CDH.
Hive 2.x (LLAP)
- pros: can use the existing Hive tables.
- cons: legacy applications would have to be tested strictly. I am not sure it fits online analytics.
I know it depends, but if anyone has moved from Hive 1.x to something else with a large backend, please give some advice.
I know a lot of people using Hive 2 LLAP with AtScale, Tableau, and other BI tools, and it is performing great.
Spark 2 Thrift Server is also working great.
Any chance of making additional copies of the Avro and text tables?
Can you add more RAM to Hive?
Also for ORC tables, make sure you have bucketing and other optimizations in place.
We have seen a lot of people improve performance by optimizing their queries and reducing joins and ORDER BYs.
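On the idea of keeping additional copies of the Avro and text tables: one possible approach is to convert partition by partition into an ORC table, so the copy can trail a source that is still receiving data. The sketch below only generates the HiveQL. It assumes the source and target tables share the same partition column, and the table, column, and partition names are hypothetical.

```python
def pending_partitions(source_parts, converted_parts):
    """Partitions present in the source table but not yet copied to ORC."""
    return sorted(set(source_parts) - set(converted_parts))


def orc_copy_statement(src, dst, columns, part_col, part_value):
    """Build an INSERT OVERWRITE that copies one partition into the ORC table.

    OVERWRITE makes the statement safe to re-run for a partition that
    received more data after the first copy.
    """
    cols = ", ".join(columns)
    return (
        f"INSERT OVERWRITE TABLE {dst} PARTITION ({part_col}='{part_value}')\n"
        f"SELECT {cols} FROM {src} WHERE {part_col}='{part_value}';"
    )


# Hypothetical names: events_avro is the old table, events_orc the ORC copy.
todo = pending_partitions(["2017-01-01", "2017-01-02"], ["2017-01-01"])
for value in todo:
    print(orc_copy_statement("events_avro", "events_orc", ["id", "payload"], "dt", value))
```

The newest partition would need to be re-copied until ingestion into it has finished, which is why the statement overwrites rather than appends.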
Thanks for the response.
Making copies of the old-format tables is difficult because new data is still being ingested into them.
The ORC tables are relatively well defined, with partitioning, clustering, Bloom filters, stripe indexes, and so on.
BTW, which would you prefer or recommend between LLAP and Spark Thrift Server as a large-scale BI backend?
Check out comcasts
Also you can do Spark on Hive LLAP.
Hi, BI clients can connect to Spark Thrift Server, and Hive LLAP integration is also increasing.
- The major difference is in processing. Spark does in-memory processing and needs more memory than Hive LLAP, so it is more costly in terms of infrastructure.
- Hive is well integrated with the Ranger plugin for security and with Ambari.
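From the BI tool's side the two backends look much the same: Spark Thrift Server is built on HiveServer2, so both accept `jdbc:hive2://` connection URLs, and trying one against the other is mostly a matter of pointing the client at a different host and port. A minimal sketch of that, where the host names and ports are placeholders rather than recommended values:

```python
def hive2_jdbc_url(host, port, database="default"):
    """Build a HiveServer2-style JDBC URL, usable for HiveServer2 or Spark Thrift Server."""
    return f"jdbc:hive2://{host}:{port}/{database}"


# Placeholder endpoints; substitute the hosts and ports of your own cluster.
backends = {
    "spark_thrift_server": hive2_jdbc_url("sts.example.com", 10016),
    "hive_llap": hive2_jdbc_url("llap.example.com", 10500),
}

for name, url in backends.items():
    print(name, url)
```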