Created 10-07-2016 04:25 AM
The LLAP design document states that the cache is stored off-heap. Since GC is not in the picture for caching, why is it generally recommended to use data sets of around 1TB or less? If I have 10TB of RAM, what issues will I hit with LLAP when loading more than 1TB of data? I am speaking of using only LLAP daemons and not "regular containers".
My initial guess is that this is because LLAP stores metadata on-heap. Is this the bottleneck, i.e. at around 1TB the volume of metadata causes GC issues?
Created 10-07-2016 05:24 PM
Hi @Sunile. I suspect this is not a firm number but a bang-for-the-buck recommendation - i.e. you will get the most substantial relative performance improvements with dataset sizes that can remain in-cache.
In Carter and Nita's recent blog post, they go into testing 10TB TPC-DS datasets - much larger than the aggregate cluster LLAP cache size. They wanted to see (direct quote) "if LLAP can truly tackle the big data challenge or if it’s limited to reporting roles on smaller datasets."
What they found was that (another quote) "If we look at the full results, we see substantial performance benefits across the board, with an average speedup of around 7x. The biggest winners are smaller-scale ad-hoc or star-schema-join type queries, but even extremely complex queries realized almost 2x benefit versus Hive 1."
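To make the "can remain in-cache" point concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the blog post or from LLAP itself: the function names and the cluster numbers (10 daemons with 96 GB of cache each) are hypothetical, and the estimate ignores compression, column pruning, and cache eviction, so treat it only as an upper bound on how much of a working set could sit in the aggregate cache.

```python
def aggregate_cache_gb(num_daemons: int, cache_per_daemon_gb: float) -> float:
    """Total LLAP cache across the cluster, assuming one daemon per node
    and the same cache size on every daemon (hypothetical simplification)."""
    return num_daemons * cache_per_daemon_gb


def cached_fraction(working_set_gb: float,
                    num_daemons: int,
                    cache_per_daemon_gb: float) -> float:
    """Upper bound on the fraction of the working set that fits in cache.

    Returns 1.0 when the whole working set fits. Compression, column
    pruning, and eviction are deliberately ignored.
    """
    if working_set_gb <= 0:
        return 1.0
    total = aggregate_cache_gb(num_daemons, cache_per_daemon_gb)
    return min(1.0, total / working_set_gb)


if __name__ == "__main__":
    # Hypothetical cluster: 10 daemons, 96 GB of cache each.
    for working_set in (500, 1024, 10240):
        frac = cached_fraction(working_set, 10, 96)
        print(f"{working_set} GB working set: at most {frac:.1%} in cache")
```

With numbers like these, a working set of about 1TB is nearly all cacheable while a 10TB one is mostly not, which fits the reading that the 1TB figure is about where the relative gains are largest rather than a hard limit.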