Member since: 09-28-2015
Posts: 41
Kudos Received: 44
Solutions: 12
My Accepted Solutions
Views | Posted
---|---
856 | 04-12-2017 12:19 PM
1327 | 01-26-2017 04:38 AM
143 | 01-10-2017 10:39 PM
476 | 08-16-2016 07:12 PM
2813 | 07-20-2016 06:14 PM
01-29-2018
08:13 PM
1 Kudo
LLAP has better SQL support than the older versions of Hive (the CLI is Hive 1.x and LLAP is Hive 2.x).
09-08-2017
08:12 PM
> I'd like to limit the allowed port range for LLAP.

LLAP does not use the Slider port assignment scheme; it pulls ports from hive-site.xml (the interactive one). Here are the port numbers and configs you might want to whitelist:

hive.llap.daemon.yarn.shuffle.port (15551)
hive.llap.daemon.web.port (15002)
hive.llap.daemon.rpc.port (0)

The last RPC port is likely where you're seeing issues, because it is unassigned (0 means a random port) by default. Also, any Tez changes you make need to go into tez_hive2/conf/tez-site.xml as well.
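A minimal sketch of pinning these in the interactive hive-site.xml; the port values below are illustrative examples, not recommendations (any free ports in your whitelisted range work):

```xml
<!-- interactive hive-site.xml; values are example ports -->
<property>
  <name>hive.llap.daemon.yarn.shuffle.port</name>
  <value>15551</value>
</property>
<property>
  <name>hive.llap.daemon.web.port</name>
  <value>15002</value>
</property>
<property>
  <name>hive.llap.daemon.rpc.port</name>
  <!-- pin to a fixed port instead of the default 0 (random) -->
  <value>15001</value>
</property>
```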
04-12-2017
12:19 PM
3 Kudos
LLAP localizes all permanent functions when you restart it - temporary functions aren't allowed into LLAP, since they could conflict between users.
So expect "add jar" to be a problem, but "create function ..." with an HDFS location will handle something like the ESRI UDFs (I recommend creating an esri DB and naming all UDFs as esri.ST_Contains etc).
The LLAP decider module should throw an error if you try to use a temporary UDF (when llap.execution.mode is "only", the default).
LLAP can run some of those queries in mixed mode (e.g. all mappers run as Tez tasks, all reducers run in LLAP), but that's not the best way to use LLAP.
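As a sketch of the permanent-function route (the class name and jar path below are hypothetical - adjust them to your build of the ESRI UDF jars):

```sql
-- hypothetical jar path and class; the esri.* naming follows the suggestion above
CREATE DATABASE IF NOT EXISTS esri;
CREATE FUNCTION esri.ST_Contains
  AS 'com.esri.hadoop.hive.ST_Contains'
  USING JAR 'hdfs:///apps/udfs/spatial-sdk-hive.jar';
```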
01-26-2017
04:38 AM
1 Kudo
> Can anyone tell me how to use llap using cli or point what I am missing here?

You can't use LLAP from the Hive CLI; LLAP is aimed at BI workloads and is accessed via Beeline and HiveServer2.
The new URL will show up in Ambari under the Hive sidebar.
You can then copy that URL and use beeline -u '<url>' to connect to HiveServer2-Interactive, which will use LLAP. To confirm you are on the right host, run an explain of the query and look for the word "LLAP" marking parts of the plan.
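For example (host is a placeholder, and 10500 is the usual HiveServer2-Interactive binary port - check the URL Ambari shows for your cluster):

```shell
# placeholder host; take the real JDBC URL from Ambari's Hive sidebar
beeline -u 'jdbc:hive2://<hs2-interactive-host>:10500/default'
```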
01-18-2017
09:22 AM
No, you need to write code to do this. hive.exec.pre.hooks allows you to hook into every command and log it/deny it.
01-18-2017
08:46 AM
1 Kudo
Ranger Audit tracks all Hive operations (also allows "REVOKE DELETE" for users).
http://hortonworks.com/hadoop-tutorial/manage-security-policy-hive-hbase-knox-ranger/
01-10-2017
10:39 PM
1 Kudo
Nope - compaction retains old files until all readers are out.
The implementation has ACID isolation levels - you could start a query and then do "delete from table" & still get no errors on the query which is already in motion.
10-28-2016
11:30 PM
Even before LLAP, HS2 can handle 1000 concurrent queries - that was exactly the question addressed in the Yahoo Japan benchmark (not just concurrency, but throughput/utilization goals as well).
With LLAP, we realized that a BI user session which runs a 3s query every 15s wastes capacity, and that pattern is terribly common with BI tools, where someone is actively reading the screen and navigating their view.
So "open sessions" are now distinct from "queries", which for the 3s-every-15s case is a significant improvement in session/user scalability.
Far fewer nodes, and far easier to configure.
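A back-of-the-envelope sketch of that 3s-every-15s arithmetic (the "slot" framing is my own simplification for illustration, not how LLAP schedules internally):

```python
# A BI session running a 3-second query every 15 seconds is busy only
# a fifth of the time. Once "open sessions" are decoupled from
# "running queries", roughly five such sessions can share the capacity
# that one always-busy session would need.
QUERY_SECONDS = 3
CYCLE_SECONDS = 15  # one query starts every 15 seconds

busy_fraction = QUERY_SECONDS / CYCLE_SECONDS            # fraction of time working
sessions_per_busy_slot = CYCLE_SECONDS // QUERY_SECONDS  # idle sessions that can share

print(busy_fraction, sessions_per_busy_slot)
```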
10-27-2016
07:39 PM
3 Kudos
Yahoo Japan has published their scaling graph up to 1024 concurrent queries on HDP-2.4: http://www.slideshare.net/HadoopSummit/achieving-100k-queries-per-hour-on-hive-on-tez/11
It includes a history of the tuning options and fixes that went into their hot-fix branch. Just to be clear, 1000 concurrent BI users is a different problem from 1000 concurrent queries, and the former is easier to tackle in LLAP due to the cluster sharing models. The 1000 users get balanced internally, so the fact that they have left Tableau open and gone to lunch won't cost server-side resources.
10-13-2016
04:47 AM
Can you post a "desc formatted" for the table?
10-10-2016
11:40 PM
hiveContext.setConf("hive.metastore.uris", "thrift://<metastore>:9083");
The Thrift API should be safe from these problems.
10-10-2016
10:03 PM
> but this isn't an option for us, given our use of Spark's HiveContext/Spark on Hive. Can you explain that further?
10-06-2016
06:55 PM
3 Kudos
The cache, for sure (the math is: container * 0.8 == Xmx + cache).
I would recommend scaling down both Xmx and the cache equally by 0.8x, and picking 1 executor thread for your LLAP to avoid thrashing such a small cache.
LLAP really shines when you give it ~4 GB per core, plus enough cache to hold all the data needed by all executors in motion (not the total data size - if you have 10 threads, you need at least 10 ORC stripes' worth of cache).
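An illustration of that sizing rule; all numbers here are hypothetical, purely to show scaling both knobs down by 0.8x:

```python
# Sizing rule from above: container * 0.8 must cover Xmx + cache.
container_mb = 8192                 # hypothetical YARN container size
budget_mb = container_mb * 0.8      # what Xmx + cache must fit into

xmx_mb, cache_mb = 4096, 3072       # hypothetical current settings: 7168 MB, too big
assert xmx_mb + cache_mb > budget_mb

# Scale both down equally by 0.8x, as recommended.
scaled_xmx_mb = xmx_mb * 0.8
scaled_cache_mb = cache_mb * 0.8
assert scaled_xmx_mb + scaled_cache_mb <= budget_mb
print(scaled_xmx_mb, scaled_cache_mb)
```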
10-04-2016
05:49 PM
1 Kudo
HDP builds are always tagged here - https://github.com/hortonworks/hive2-release/tree/HDP-2.5.0.3-tag
09-28-2016
08:22 PM
@Peter Coates: There is no local download and upload (distcp does that, which is bad).
This makes more sense if you think of S3 as a sharded key-value store instead of a NAS. The filename is the key, so whenever the key changes, the data moves from one shard to another - the command will not return successfully until the KV store is done moving the data between those shards. That is a data operation, not a metadata operation, though it can be fairly fast in scenarios where the change of key does not result in a shard change.
In a FileSystem like HDFS, the block IDs of the data are independent of the name of the file - the name maps to an Inode and the Inode maps to the blocks. So the rename is entirely a metadata operation, thanks to the extra indirection of the Inode.
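A toy model of that contrast, with dicts standing in for the two storage designs (purely illustrative, not how either system is implemented):

```python
# Object-store style: the filename *is* the key, so a rename re-keys
# the value - the data itself has to move under the new key.
objects = {"a/old.txt": b"payload"}
objects["a/new.txt"] = objects.pop("a/old.txt")  # a data operation

# Inode style (HDFS-like): name -> inode, inode -> blocks. A rename
# only touches the name -> inode mapping; the blocks never move.
names = {"/old.txt": 1}
blocks = {1: [b"payload"]}
names["/new.txt"] = names.pop("/old.txt")        # a metadata operation
assert blocks[names["/new.txt"]] == [b"payload"]  # blocks untouched
```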
09-14-2016
06:15 PM
Is this on some sort of cloud provider?
There's a known issue when DNS is not set up so that forward and reverse resolution agree.
09-09-2016
05:03 PM
> Are there any properties which affect the creation and write performance of partitions?
Yes. Compare the values of set hive.optimize.sort.dynamic.partition;
09-06-2016
08:47 PM
1 Kudo
Is this running in a UTC timezone?
In non-UTC timezones, that particular overflow error can happen (+14 hours is Kiribati, which would overflow).
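A small sketch of why a +14 offset is the worst case for this class of bug: local midnight of 1970-01-01 in Kiribati falls *before* the UTC epoch, so the corresponding timestamp is negative and can trip code that assumes timestamps are non-negative (the exact overflow path in Hive may differ; this only illustrates the sign flip):

```python
from datetime import datetime, timedelta, timezone

kiribati = timezone(timedelta(hours=14))  # UTC+14
ts = datetime(1970, 1, 1, tzinfo=kiribati).timestamp()
print(ts)  # 14 hours before the UTC epoch, i.e. negative
```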
08-16-2016
07:12 PM
2 Kudos
The beeline client doesn't actually have a clean way of doing that unlike the in-place CLI UI.
The current method is to run "explain <query>" and look for the LLAP annotation next to the vectorization.
08-12-2016
07:21 PM
1 Kudo
Your question seems to be that count(distinct Id) != count(id) ?
08-04-2016
08:07 PM
Can you do a "dfs -ls" on the output for Spark job?
The total # of files might be very different between SparkSQL and Hive-Tez.
07-20-2016
06:14 PM
Looks like your datanodes are dying from too many open files - check the nofile setting for the "hdfs" user in /etc/security/limits.d/
If you want to bypass that particular problem by changing the query plan, try with
set hive.optimize.sort.dynamic.partition=true;
07-15-2016
08:15 PM
The inputformat name matters in this case - the NPE comes from Text.writeString(out, wrappedInputFormatName). The table's "desc formatted" output is more relevant than the query pattern.
07-12-2016
06:33 PM
1 Kudo
> A cartesian join.

+1, right on.

> still needs to do 2b computations but at least he doesn't need to shuffle 2b rows around.

Even without any modifications, the shuffle won't move 2b rows - it will move exactly 3 aggregates per a.name, because the map-side aggregation will fold that away into the sum().

> tez.grouping.max-size

Rather than playing with the split sizes, which are fragile, you can shuffle the 54,000-row set instead - the SORT BY can do that more predictably:

set hive.exec.reducers.bytes.per.reducer=4096;
select sum() ... (select x.name, <gis-func>() from (select name, lon, lat from a sort by a.name) x, b) y;

I tried it out with a sample query and it works more predictably this way:

hive (tpcds_bin_partitioned_orc_200)> select count(1) from (select d_date, t_time_id from (select d_date from date_dim sort by d_date) d, time_dim) x;
6,311,433,600
Time taken: 20.412 seconds, Fetched: 1 row(s)
06-07-2016
06:15 AM
2 Kudos
To add to @emaxwell:

> If I am only concerned with performance

The bigger win is being able to skip decompressing blocks entirely - if you have hive.optimize.index.filter=true, that will kick in.

> a few items in a where clause

That's where the ORC indexes matter - if you have orc.create.index=true and orc.bloom.filter.columns contains those columns specifically (using "*" is easy, but it slows down ETL when tables are wide and the measures are random).

Clustering and sorting on the most common column in the filter can give you 2-3 orders of magnitude of performance (sorting specifically, because the min/max are stored in the footer of a file - this happens naturally in most ETL for date/timestamp columns, but for something randomly distributed like a location id it is a big win).

See, for example, this ETL script:
https://github.com/t3rmin4t0r/all-airlines-data/blob/master/ddl/orc.sql#L78
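A sketch of those table settings together (the table, columns, and bucket count below are hypothetical):

```sql
-- hypothetical table; note the sort on the hot filter column
CREATE TABLE events (
  location_id BIGINT,
  event_time  TIMESTAMP,
  payload     STRING
)
CLUSTERED BY (location_id) SORTED BY (location_id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  'orc.create.index' = 'true',
  'orc.bloom.filter.columns' = 'location_id'
);
```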
06-04-2016
05:34 AM
ORC is considering adding a faster decompression codec in 2016 - zstd (ZStandard). The enum values for it have already been reserved, but we still need to work through the trade-offs involved in ZStd - more on that sometime later this year.
https://issues.apache.org/jira/browse/ORC-46
But bigger wins are in motion for ORC with LLAP: the in-memory format for LLAP isn't compressed at all, so it performs like ORC without the compression overheads, while letting the cold data on disk sit around in Zlib.
06-02-2016
08:22 AM
2 Kudos
floor(datediff(to_date(from_unixtime(unix_timestamp())), to_date(birthdate)) / 365.25)
That unix_timestamp() could turn off a few optimizations in the planner, which might not be related to this issue. Use CURRENT_DATE instead of the unix_timestamp() call for faster queries.
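A plain-Python equivalent of that age expression, floor(datediff(current_date, birthdate) / 365.25), with hypothetical dates chosen to make the result deterministic:

```python
from datetime import date
from math import floor

def age_in_years(birthdate: date, today: date) -> int:
    # datediff(today, birthdate) in Hive is a day count; dividing by
    # 365.25 absorbs leap days before flooring to whole years.
    return floor((today - birthdate).days / 365.25)

print(age_in_years(date(1980, 6, 2), date(2016, 6, 2)))
```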
03-22-2016
08:12 PM
1 Kudo
Is there more information on this?
There are 2 forms of logical predicate-pushdown and 2 forms of physical predicate-pushdown in Hive.
02-24-2016
02:08 AM
1 Kudo
Ignoring the actual backtrace (which is a bug), I have seen issues with uncompressed text tables in Tez related to Hive's use of Hadoop-1 APIs. Try re-running with
set mapreduce.input.fileinputformat.split.minsize=67108864;
or, alternatively, compress the files with gzip before loading, with something like this:
https://gist.github.com/t3rmin4t0r/49e391eab4fbdfdc8ce1
02-24-2016
02:02 AM
This looks an awful lot like an HDFS bug, entirely unrelated to Tez.
The IndexOutOfBounds is thrown from the HDFS local block readers.