Support Questions
Find answers, ask questions, and share your expertise

Why is there a data limitation with LLAP?

Solved

Super Guru

The LLAP design document states that the cache is stored off-heap. Since GC is not in the picture for caching, why is it generally recommended to keep the data set at around 1TB or less? If I have 10TB of RAM, what issues will I hit with LLAP when loading data > 1TB? I am speaking of using only LLAP daemons, not "regular containers".

My initial guess is that it is because LLAP stores metadata on the heap. Is this the bottleneck, i.e., at around 1TB the volume of metadata starts causing GC issues?

1 ACCEPTED SOLUTION


Re: Why is there a data limitation with LLAP?

Hi @Sunile. I suspect this is not a firm number but a bang-for-the-buck recommendation, i.e., you will get the most substantial relative performance improvements with data sets that can remain entirely in the cache.

In Carter and Nita's recent blog post, they test 10TB TPC-DS datasets - much larger than the aggregate cluster LLAP cache size. They wanted to see (direct quote) "if LLAP can truly tackle the big data challenge or if it’s limited to reporting roles on smaller datasets."

What they found was (another quote): "If we look at the full results, we see substantial performance benefits across the board, with an average speedup of around 7x. The biggest winners are smaller-scale ad-hoc or star-schema-join type queries, but even extremely complex queries realized almost 2x benefit versus Hive 1."
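For context on the off-heap cache vs. on-heap metadata split the question raises: the LLAP IO cache and the daemon JVM heap are sized independently in hive-site.xml. A minimal, hedged sketch follows - the property names are from Hive 2.x-era LLAP configuration, and the values are purely illustrative, not a recommendation:

```xml
<!-- Illustrative values only; tune to your cluster. -->
<property>
  <!-- Enable the LLAP IO layer (the off-heap data cache) -->
  <name>hive.llap.io.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Size of the off-heap IO cache per daemon; not subject to JVM GC -->
  <name>hive.llap.io.memory.size</name>
  <value>64g</value>
</property>
<property>
  <!-- JVM heap per daemon: query executors and cache *metadata* live here,
       which is where GC pressure can appear as cached data volume grows -->
  <name>hive.llap.daemon.memory.per.instance.mb</name>
  <value>32768</value>
</property>
```

The point of the split is that growing hive.llap.io.memory.size does not by itself grow the GC-managed heap, but the per-buffer metadata tracked on-heap does scale with the amount of cached data.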


