Member since: 09-28-2015
Posts: 41
Kudos Received: 44
Solutions: 12
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3161 | 04-12-2017 12:19 PM |
| | 3298 | 01-26-2017 04:38 AM |
| | 822 | 01-10-2017 10:39 PM |
| | 2157 | 08-16-2016 07:12 PM |
| | 14561 | 07-20-2016 06:14 PM |
04-12-2017
12:19 PM
3 Kudos
LLAP localizes all permanent functions when you restart it. Temporary functions aren't allowed into LLAP, since they could conflict between users.
So expect "add jar" to be a problem, but "create function ..." with an HDFS location should handle something like the ESRI UDFs (I recommend creating an esri DB and naming all UDFs as esri.ST_Contains etc).
The LLAP decider module should throw an error if you try to use a temporary UDF (when hive.llap.execution.mode is "only", the default).
LLAP can run some of those queries in mixed mode (e.g. all mappers run as Tez tasks, all reducers run in LLAP), but it's not the best way to use LLAP.
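A minimal sketch of the permanent-function approach described above. The HDFS jar paths are hypothetical placeholders, and the class name assumes the Esri spatial-framework-for-hadoop package layout; check both against your own installation before using this.

```sql
-- Keep all ESRI UDFs in one database so they are referenced as esri.<name>.
CREATE DATABASE IF NOT EXISTS esri;

-- Permanent function backed by jars on HDFS (paths are hypothetical).
-- Unlike "add jar" plus a temporary function, this is registered in the
-- metastore, so LLAP can localize it when the daemons are restarted.
CREATE FUNCTION esri.ST_Contains
  AS 'com.esri.hadoop.hive.ST_Contains'
  USING JAR 'hdfs:///apps/hive/udfs/spatial-sdk-hive.jar',
        JAR 'hdfs:///apps/hive/udfs/esri-geometry-api.jar';
```

Since LLAP picks up permanent functions at restart, expect to restart the LLAP daemons after registering new ones.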
01-26-2017
04:38 AM
1 Kudo
> Can anyone tell me how to use llap using cli or point what I am missing here?

You can't use LLAP from the hive CLI, since LLAP is aimed at BI workloads. LLAP can be accessed via beeline & HiveServer2.
The HiveServer2-Interactive URL will show up in Ambari under the Hive sidebar.
You can then copy that URL and use beeline -u '<url>' to connect to HiveServer2-Interactive, which will use LLAP. To confirm you are on the right host, run an explain of the query and look for the word "LLAP" marking parts of the plan.
01-10-2017
10:39 PM
1 Kudo
Nope - compaction retains old files until all readers are out.
The implementation has ACID isolation levels - you could start a query and then do "delete from table" & still get no errors on the query which is already in motion.
10-06-2016
06:55 PM
3 Kudos
The cache, for sure (the math is the container * 0.8 == Xmx + Cache)
I would recommend scaling down both Xmx and Cache equally, by 0.8x, and picking 1 executor thread for your LLAP to avoid thrashing such a small cache.
LLAP really shines when you give it ~4 GB per core plus enough cache to hold all the data needed for all executors in motion (not the total data size, but if you have 10 threads, you need at least 10 ORC stripes' worth of cache).
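As a worked instance of the sizing rule above, assume a hypothetical 10 GB container currently configured with a 5 GB Xmx and a 5 GB cache (illustrative numbers, not taken from the original question). Scaling both by 0.8 makes them fit the budget:

```latex
\begin{aligned}
0.8 \times \text{container} &= 0.8 \times 10\ \text{GB} = 8\ \text{GB} \\
\text{Xmx} &= 0.8 \times 5\ \text{GB} = 4\ \text{GB} \\
\text{Cache} &= 0.8 \times 5\ \text{GB} = 4\ \text{GB} \\
\text{Xmx} + \text{Cache} &= 8\ \text{GB} = 0.8 \times \text{container}
\end{aligned}
```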
09-28-2016
08:22 PM
@Peter Coates: There is no local download and upload (distcp does that, which is bad).
This makes more sense if you think of S3 as a sharded key-value store (instead of a NAS). The filename is the key, so whenever the key changes, the data moves from one shard to another. The command will not return successfully until the KV store is done moving the data between those shards, which is a data operation and not a metadata operation. This can be pretty fast in scenarios where the change of key does not result in a shard change.
In a FileSystem like HDFS, the block IDs of the data are independent of the name of the file: the name maps to an inode and the inode maps to the blocks. So the rename happens entirely within metadata, thanks to the extra indirection of the inode.
08-16-2016
07:12 PM
2 Kudos
The beeline client doesn't actually have a clean way of doing that, unlike the in-place CLI UI.
The current method is to run "explain <query>" and look for the LLAP annotation next to the vectorization.
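A small sketch of that check, run from a beeline session connected to HiveServer2-Interactive. The table name is a hypothetical placeholder, and the exact wording of the plan output may differ between Hive versions.

```sql
-- sample_table is a hypothetical table name; substitute your own query.
EXPLAIN
SELECT count(*)
FROM sample_table;

-- In the printed plan, inspect each Map/Reducer vertex. Vertices that will
-- run inside LLAP are expected to carry an "llap" marker on the same line
-- as the vectorization note (for example "Execution mode: vectorized, llap").
```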
08-12-2016
07:21 PM
1 Kudo
Your question seems to be why count(distinct id) != count(id)?
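To see where such a gap comes from, a sketch like the following can help. The table name t is hypothetical, and this only assumes the column is called id.

```sql
-- count(id) counts every non-NULL id; count(DISTINCT id) counts each value once.
SELECT count(id)          AS non_null_ids,
       count(DISTINCT id) AS distinct_ids
FROM t;

-- Any id listed here occurs more than once and accounts for the difference.
SELECT id, count(*) AS occurrences
FROM t
GROUP BY id
HAVING count(*) > 1;
```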
08-04-2016
08:07 PM
Can you do a "dfs -ls" on the output of the Spark job?
The total # of files might be very different between SparkSQL and Hive-Tez.
07-20-2016
06:14 PM
Looks like your datanodes are dying from too many open files - check the nofile setting for the "hdfs" user in /etc/security/limits.d/
If you want to bypass that particular problem by changing the query plan, try with
set hive.optimize.sort.dynamic.partition=true;
07-12-2016
06:33 PM
1 Kudo
> A cartesian join.

+1, right on.

> still needs to do 2b computations but at least he doesn't need to shuffle 2b rows around.

Even without any modifications, the shuffle won't move 2b rows; it will move exactly 3 aggregates per a.name, because the map-side aggregation will fold that away into the sum().

> tez.grouping.max-size

Rather than playing with the split sizes, which are fragile, you can instead shuffle the 54,000-row set - the SORT BY can do that more predictably:
set hive.exec.reducers.bytes.per.reducer=4096;
select sum() ... (select x.name, <gis-func>() from (select name, lon, lat from a sort by a.name) x, b) y;
I tried it out with a sample query and it works more predictably this way.
hive (tpcds_bin_partitioned_orc_200)> select count(1) from (select d_date, t_time_id from (select d_date from date_dim sort by d_date) d, time_dim) x;
6,311,433,600
Time taken: 20.412 seconds, Fetched: 1 row(s)
hive (tpcds_bin_partitioned_orc_200)>