BI on Hadoop architecture question.

Explorer

Recently I have become interested in BI on Hadoop.

We are already operating Hadoop (2.7) / Hive (1.6) / Spark (2.x), installed across 40 nodes.

But something is lacking: our current Hive is not very fast. Because of many old-style tables (CSV, TSV, Avro, JSON, custom text, external tables), we cannot move to Hive 2.x (LLAP). When I upgraded Hive from 0.x to 1.x, many of the recent performance-boosting options (CBO, vectorized execution, etc.) caused problems with the old tables when turned on. So upgrading to Hive 2.x feels very challenging.
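For what it's worth, those optimizer options can usually be toggled per session rather than cluster-wide, which makes it easier to isolate which one breaks a given legacy table. A sketch using standard Hive 1.x property names (the table name is made up):

```sql
-- Disable the newer optimizations only for sessions that touch
-- old-style text/Avro tables; leave them on elsewhere.
SET hive.cbo.enable=false;                          -- cost-based optimizer
SET hive.vectorized.execution.enabled=false;        -- vectorized map side
SET hive.vectorized.execution.reduce.enabled=false; -- vectorized reduce side

-- Then run the legacy query as usual:
SELECT count(*) FROM legacy_text_table;             -- hypothetical table
```

This doesn't solve the upgrade problem, but it can narrow down which feature conflicts with which table format before committing to Hive 2.x.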


By "BI on Hadoop" I mean:

- Online interactive analytics
- BI tool integration

So we need some new system, something like Spark Thrift Server, Impala (with Kudu), or Hive 2.x. These are the pros and cons as I see them:

Spark Thrift Server
- pros: easy to set up; can use Hive table data.
- cons: in-memory compute can cause OOM. Our Hive table formats are not Parquet, but a mix of ORC, Avro, text, etc.
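As a sketch of the Spark Thrift Server option (hostnames, ports, and memory sizes below are placeholders, not recommendations):

```shell
# Start the Thrift server on YARN with explicit memory limits,
# since large result sets are a common cause of the OOM mentioned above.
$SPARK_HOME/sbin/start-thriftserver.sh \
  --master yarn \
  --executor-memory 8g \
  --conf spark.sql.thriftServer.incrementalCollect=true

# BI tools (or beeline) then connect over the normal HiveServer2
# JDBC protocol, so existing Hive tables are visible:
beeline -u jdbc:hive2://sts-host:10000/default
```

`spark.sql.thriftServer.incrementalCollect` streams result partitions back instead of collecting them all on the driver, which helps with driver-side OOM at the cost of some latency.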

Impala + Kudu
- pros: looks to have low latency.
- cons: needs to be set up from scratch. We currently operate HDP, not CDH.

Hive 2.x
- pros: can keep using our existing Hive tables.
- cons: legacy applications would need to be tested strictly. I am not sure it fits online analytics.

I know "it depends," but if anyone has moved from Hive 1.x to something else with a large backend, please share some advice.


Re: BI on Hadoop architecture question.

Super Guru

I know a lot of people using Hive 2 LLAP with AtScale, Tableau, and other BI tools, and it performs great.

Spark 2 thrift server is working great.

Any chance of making additional copies of the Avro and text files?

Can you add more RAM to Hive?

Also for ORC tables, make sure you have bucketing and other optimizations in place.
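As a sketch of the bucketing and ORC optimizations mentioned above (the table, columns, and bucket count are made up for illustration):

```sql
-- Hypothetical bucketed ORC table with bloom filters and row-group
-- indexes, so point lookups and joins on user_id can skip stripes.
CREATE TABLE events_orc (
  event_id BIGINT,
  user_id  BIGINT,
  payload  STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'             = 'ZLIB',
  'orc.bloom.filter.columns' = 'user_id',
  'orc.create.index'         = 'true'
);
```

Bucketing on the common join key also enables bucket map joins, which cuts shuffle cost for the BI-style queries discussed in this thread.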

We have also seen a lot of people improve performance by optimizing their queries and reducing joins and ORDER BYs.


Re: BI on Hadoop architecture question.

Explorer

Thanks for the response.

Making copies of the old-format tables is difficult because new data is still being ingested into them.

The ORC tables are relatively well tuned, with partitioning, clustering, bloom filters, and stripe indexes.

By the way, which do you prefer or recommend between LLAP and Spark Thrift Server as a large-scale BI backend?


Re: BI on Hadoop architecture question.

Explorer

Hi, BI clients are commonly connected through Spark Thrift Server; Hive LLAP integration is also increasing.

- The major difference is in processing: Spark computes in memory and needs more memory than Hive LLAP, so it is more costly infrastructure-wise.

- Hive is well integrated with the Ranger plugin for security, and with Ambari.

Regards,

Fahim
