Member since: 09-28-2015
Posts: 41
Kudos Received: 44
Solutions: 12
My Accepted Solutions
Views | Posted
---|---
856 | 04-12-2017 12:19 PM
1327 | 01-26-2017 04:38 AM
143 | 01-10-2017 10:39 PM
476 | 08-16-2016 07:12 PM
2813 | 07-20-2016 06:14 PM
01-29-2018
08:13 PM
1 Kudo
LLAP has better SQL support than the older versions of Hive (the CLI is Hive 1.x and LLAP is Hive 2.x).
09-08-2017
08:12 PM
> I'd like to limit the allowed port range for LLAP.

LLAP does not use the Slider port assignment scheme; it pulls ports from hive-site.xml (the interactive one). Here are the port numbers and configs you might want to whitelist:

hive.llap.daemon.yarn.shuffle.port (15551)
hive.llap.daemon.web.port (15002)
hive.llap.daemon.rpc.port (0)

The last RPC port is likely where you're seeing issues, because it is unassigned (0 means a random port) by default. Also, any Tez changes you make need to go into tez_hive2/conf/tez-site.xml as well.
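A minimal sketch of pinning these in the interactive hive-site.xml; the port values below are illustrative examples, not recommendations (any free ports in your whitelisted range work):

```xml
<!-- interactive hive-site.xml; values are example ports -->
<property>
  <name>hive.llap.daemon.yarn.shuffle.port</name>
  <value>15551</value>
</property>
<property>
  <name>hive.llap.daemon.web.port</name>
  <value>15002</value>
</property>
<property>
  <name>hive.llap.daemon.rpc.port</name>
  <!-- pin to a fixed port instead of the default 0 (random) -->
  <value>15001</value>
</property>
```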
04-12-2017
12:19 PM
3 Kudos
LLAP localizes all permanent functions when you restart it - temporary functions aren't allowed into LLAP, since they could conflict between users.
So expect "add jar" to be a problem, but "create function ..." with an HDFS location will handle something like the ESRI UDFs (I recommend creating an esri DB and naming all UDFs as esri.ST_Contains etc).
The LLAP decider module should throw an error if you try to use a temporary UDF (when llap.execution.mode is "only", the default).
LLAP can run some of those queries in mixed mode (e.g. all mappers run as Tez tasks, all reducers run in LLAP), but that's not the best way to use LLAP.
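As a sketch of the permanent-function route (the class name and jar path below are hypothetical - adjust them to your build of the ESRI UDF jars):

```sql
-- hypothetical jar path and class; the esri.* naming follows the suggestion above
CREATE DATABASE IF NOT EXISTS esri;
CREATE FUNCTION esri.ST_Contains
  AS 'com.esri.hadoop.hive.ST_Contains'
  USING JAR 'hdfs:///apps/udfs/spatial-sdk-hive.jar';
```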
01-26-2017
04:38 AM
1 Kudo
> Can anyone tell me how to use llap using cli or point what I am missing here?

You can't use LLAP from the Hive CLI; LLAP is aimed at BI workloads and is accessed via Beeline and HiveServer2.
The new URL will show up in Ambari under the Hive sidebar.
You can then copy that URL and use beeline -u '<url>' to connect to HiveServer2-Interactive, which will use LLAP. To confirm you are on the right host, run an explain of the query and look for the word "LLAP" marking parts of the plan.
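For example (host is a placeholder, and 10500 is the usual HiveServer2-Interactive binary port - check the URL Ambari shows for your cluster):

```shell
# placeholder host; take the real JDBC URL from Ambari's Hive sidebar
beeline -u 'jdbc:hive2://<hs2-interactive-host>:10500/default'
```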
01-18-2017
09:22 AM
No, you need to write code to do this. hive.exec.pre.hooks allows you to hook into every command and log it/deny it.
01-18-2017
08:46 AM
1 Kudo
Ranger Audit tracks all Hive operations (also allows "REVOKE DELETE" for users).
http://hortonworks.com/hadoop-tutorial/manage-security-policy-hive-hbase-knox-ranger/
01-10-2017
10:39 PM
1 Kudo
Nope - compaction retains old files until all readers are out.
The implementation has ACID isolation levels - you could start a query and then do "delete from table" & still get no errors on the query which is already in motion.
10-28-2016
11:30 PM
Even before LLAP, HS2 can handle 1000 concurrent queries - that was exactly the question addressed in the Yahoo Japan benchmark (not just concurrency, but throughput/utilization goals as well).
With LLAP, we realized that a BI user session which runs a 3s query every 15s wastes capacity, and that pattern is terribly common with BI tools, where someone is actively reading the screen and navigating their view.
So "open sessions" are now distinct from "queries", which for the 3s-every-15s case is a significant improvement in session/user scalability.
Far fewer nodes, and far easier to configure.
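A back-of-the-envelope sketch of that 3s-every-15s arithmetic (the "slot" framing is my own simplification for illustration, not how LLAP schedules internally):

```python
# A BI session running a 3-second query every 15 seconds is busy only
# a fifth of the time. Once "open sessions" are decoupled from
# "running queries", roughly five such sessions can share the capacity
# that one always-busy session would need.
QUERY_SECONDS = 3
CYCLE_SECONDS = 15  # one query starts every 15 seconds

busy_fraction = QUERY_SECONDS / CYCLE_SECONDS            # fraction of time working
sessions_per_busy_slot = CYCLE_SECONDS // QUERY_SECONDS  # idle sessions that can share

print(busy_fraction, sessions_per_busy_slot)
```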
10-27-2016
07:39 PM
3 Kudos
Yahoo Japan has published their scaling graph up to 1024 concurrent queries on HDP-2.4: http://www.slideshare.net/HadoopSummit/achieving-100k-queries-per-hour-on-hive-on-tez/11
It includes a history of the tuning options and fixes that went into their hot-fix branch. Just to be clear, 1000 concurrent BI users is a different problem from 1000 concurrent queries, and the former is easier to tackle in LLAP due to the cluster sharing models. The 1000 users get balanced internally, so the fact that they have left Tableau open and gone to lunch won't cost server-side resources.
10-13-2016
04:47 AM
Can you post a "desc formatted" for the table?
10-10-2016
11:40 PM
hiveContext.setConf("hive.metastore.uris", "thrift://<metastore>:9083");
The Thrift API should be safe from these problems.
10-10-2016
10:03 PM
> but this isn't an option for us, given our use of Spark's HiveContext/Spark on Hive. Can you explain that further?
10-06-2016
06:55 PM
3 Kudos
The cache, for sure (the math is: container * 0.8 == Xmx + cache).
I would recommend scaling down both Xmx and the cache equally by 0.8x, and picking 1 executor thread for your LLAP to avoid thrashing such a small cache.
LLAP really shines when you give it ~4 GB per core, plus enough cache to hold all the data needed by all executors in motion (not the total data size - if you have 10 threads, you need at least 10 ORC stripes' worth of cache).
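An illustration of that sizing rule; all numbers here are hypothetical, purely to show scaling both knobs down by 0.8x:

```python
# Sizing rule from above: container * 0.8 must cover Xmx + cache.
container_mb = 8192                 # hypothetical YARN container size
budget_mb = container_mb * 0.8      # what Xmx + cache must fit into

xmx_mb, cache_mb = 4096, 3072       # hypothetical current settings: 7168 MB, too big
assert xmx_mb + cache_mb > budget_mb

# Scale both down equally by 0.8x, as recommended.
scaled_xmx_mb = xmx_mb * 0.8
scaled_cache_mb = cache_mb * 0.8
assert scaled_xmx_mb + scaled_cache_mb <= budget_mb
print(scaled_xmx_mb, scaled_cache_mb)
```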
10-04-2016
05:49 PM
1 Kudo
HDP builds are always tagged here - https://github.com/hortonworks/hive2-release/tree/HDP-2.5.0.3-tag
09-28-2016
08:22 PM
@Peter Coates: There is no local download and upload (distcp does that, which is bad).
This makes more sense if you think of S3 as a sharded key-value store instead of a NAS. The filename is the key, so whenever the key changes, the data moves from one shard to another - the command will not return successfully until the KV store is done moving the data between those shards. That is a data operation, not a metadata operation, though it can be fairly fast in scenarios where the change of key does not result in a shard change.
In a FileSystem like HDFS, the block IDs of the data are independent of the name of the file - the name maps to an Inode and the Inode maps to the blocks. So the rename is entirely a metadata operation, thanks to the extra indirection of the Inode.
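A toy model of that contrast, with dicts standing in for the two storage designs (purely illustrative, not how either system is implemented):

```python
# Object-store style: the filename *is* the key, so a rename re-keys
# the value - the data itself has to move under the new key.
objects = {"a/old.txt": b"payload"}
objects["a/new.txt"] = objects.pop("a/old.txt")  # a data operation

# Inode style (HDFS-like): name -> inode, inode -> blocks. A rename
# only touches the name -> inode mapping; the blocks never move.
names = {"/old.txt": 1}
blocks = {1: [b"payload"]}
names["/new.txt"] = names.pop("/old.txt")        # a metadata operation
assert blocks[names["/new.txt"]] == [b"payload"]  # blocks untouched
```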
09-14-2016
06:15 PM
Is this on some sort of cloud provider?
There's a known issue when DNS is not set up so that forward and reverse resolution agree.
09-09-2016
05:03 PM
> Are there any properties which affect the creation and write performance of partitions?
Yes. Compare the values of set hive.optimize.sort.dynamic.partition;
09-06-2016
08:47 PM
1 Kudo
Is this running in a UTC timezone?
In non-UTC timezones, that particular overflow error can happen (+14 hours is Kiribati, which would overflow).
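A small sketch of why a +14 offset is the worst case for this class of bug: local midnight of 1970-01-01 in Kiribati falls *before* the UTC epoch, so the corresponding timestamp is negative and can trip code that assumes timestamps are non-negative (the exact overflow path in Hive may differ; this only illustrates the sign flip):

```python
from datetime import datetime, timedelta, timezone

kiribati = timezone(timedelta(hours=14))  # UTC+14
ts = datetime(1970, 1, 1, tzinfo=kiribati).timestamp()
print(ts)  # 14 hours before the UTC epoch, i.e. negative
```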
08-16-2016
07:12 PM
2 Kudos
The beeline client doesn't actually have a clean way of doing that unlike the in-place CLI UI.
The current method is to run "explain <query>" and look for the LLAP annotation next to the vectorization.
08-12-2016
07:21 PM
1 Kudo
Your question seems to be that count(distinct Id) != count(id) ?
08-04-2016
08:07 PM
Can you do a "dfs -ls" on the output for Spark job?
The total # of files might be very different between SparkSQL and Hive-Tez.
07-20-2016
06:14 PM
Looks like your datanodes are dying from too many open files - check the nofile setting for the "hdfs" user in /etc/security/limits.d/
If you want to bypass that particular problem by changing the query plan, try with
set hive.optimize.sort.dynamic.partition=true;
07-15-2016
08:15 PM
The inputformat name matters in this case - the NPE comes from Text.writeString(out, wrappedInputFormatName). The table's "desc formatted" output is more relevant than the query pattern.
07-12-2016
06:33 PM
1 Kudo
> A cartesian join.

+1, right on.

> still needs to do 2b computations but at least he doesn't need to shuffle 2b rows around.

Even without any modifications, the shuffle won't move 2b rows - it will move exactly 3 aggregates per a.name, because the map-side aggregation will fold that away into the sum().

> tez.grouping.max-size

Rather than playing with the split sizes, which are fragile, you can shuffle the 54,000-row set instead - the SORT BY can do that more predictably:

set hive.exec.reducers.bytes.per.reducer=4096;
select sum() ... (select x.name, <gis-func>() from (select name, lon, lat from a sort by a.name) x, b) y;

I tried it out with a sample query and it works more predictably this way:

hive (tpcds_bin_partitioned_orc_200)> select count(1) from (select d_date, t_time_id from (select d_date from date_dim sort by d_date) d, time_dim) x;
6,311,433,600
Time taken: 20.412 seconds, Fetched: 1 row(s)
06-07-2016
06:15 AM
2 Kudos
To add to @emaxwell:

> If I am only concerned with performance

The bigger win is being able to skip decompressing blocks entirely - if you have hive.optimize.index.filter=true, that will kick in.

> a few items in a where clause

That's where the ORC indexes matter - if you have orc.create.index=true and orc.bloom.filter.columns contains those columns specifically (using "*" is easy, but it slows down ETL when tables are wide and the measures are random).

Clustering and sorting on the most common column in the filter can give you 2-3 orders of magnitude of performance (sorting specifically, because the min/max are stored in the footer of a file - this happens naturally in most ETL for date/timestamp columns, but for something randomly distributed like a location id it is a big win).

See, for example, this ETL script:
https://github.com/t3rmin4t0r/all-airlines-data/blob/master/ddl/orc.sql#L78
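A sketch of those table settings together (the table, columns, and bucket count below are hypothetical):

```sql
-- hypothetical table; note the sort on the hot filter column
CREATE TABLE events (
  location_id BIGINT,
  event_time  TIMESTAMP,
  payload     STRING
)
CLUSTERED BY (location_id) SORTED BY (location_id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  'orc.create.index' = 'true',
  'orc.bloom.filter.columns' = 'location_id'
);
```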
06-04-2016
05:34 AM
ORC is considering adding a faster decompression codec in 2016 - zstd (ZStandard). The enum values for it have already been reserved, but we still need to work through the trade-offs involved in ZStd - more on that sometime later this year.
https://issues.apache.org/jira/browse/ORC-46
But bigger wins are in motion for ORC with LLAP: the in-memory format for LLAP isn't compressed at all, so it performs like ORC without the compression overheads, while letting the cold data on disk sit around in Zlib.
06-02-2016
08:22 AM
2 Kudos
floor(datediff(to_date(from_unixtime(unix_timestamp())), to_date(birthdate)) / 365.25)
That unix_timestamp() could turn off a few optimizations in the planner, which might not be related to this issue. Use CURRENT_DATE instead of the unix_timestamp() call for faster queries.
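A plain-Python equivalent of that age expression, floor(datediff(current_date, birthdate) / 365.25), with hypothetical dates chosen to make the result deterministic:

```python
from datetime import date
from math import floor

def age_in_years(birthdate: date, today: date) -> int:
    # datediff(today, birthdate) in Hive is a day count; dividing by
    # 365.25 absorbs leap days before flooring to whole years.
    return floor((today - birthdate).days / 365.25)

print(age_in_years(date(1980, 6, 2), date(2016, 6, 2)))
```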
03-22-2016
08:12 PM
1 Kudo
Is there more information on this?
There are 2 forms of logical predicate-pushdown and 2 forms of physical predicate-pushdown in Hive.
02-24-2016
02:08 AM
1 Kudo
Ignoring the actual backtrace (which is a bug), I have seen issues with uncompressed text tables in Tez related to Hive's use of Hadoop-1 APIs. Try re-running with
set mapreduce.input.fileinputformat.split.minsize=67108864;
or, alternatively, compress the files with gzip before loading, with something like this:
https://gist.github.com/t3rmin4t0r/49e391eab4fbdfdc8ce1
02-24-2016
02:02 AM
This looks an awful lot like an HDFS bug, entirely unrelated to Tez.
The IndexOutOfBounds is thrown from the HDFS local block readers.