Member since: 09-28-2015
Posts: 41
Kudos Received: 44
Solutions: 12
My Accepted Solutions
Views | Posted |
---|---|
3138 | 04-12-2017 12:19 PM |
3250 | 01-26-2017 04:38 AM |
817 | 01-10-2017 10:39 PM |
2123 | 08-16-2016 07:12 PM |
14448 | 07-20-2016 06:14 PM |
06-07-2016
06:15 AM
2 Kudos
To add to @emaxwell:

> If I am only concerned with performance

The bigger win is being able to skip decompressing blocks entirely. If you have hive.optimize.index.filter=true, that will kick in.

> a few items in a where clause

That's where the ORC indexes matter, provided you have orc.create.index=true and orc.bloom.filter.columns contains those columns specifically (using "*" is easy, but slows down ETL when tables are wider and the measures are random).

Clustering and sorting on the most common column in the filter can give you 2-3 orders of magnitude of performance. Sorting matters specifically, because the min/max values are stored in the footer of a file. This happens naturally in most ETL for date/timestamp columns, but for something randomly distributed like a location id, it is a big win. See, for example, this ETL script:
https://github.com/t3rmin4t0r/all-airlines-data/blob/master/ddl/orc.sql#L78
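
As a rough sketch of how those settings fit together (this is not the linked script; the table names events_orc and events_staging and all column names are hypothetical), the table could be declared with a bloom filter on the filter column and loaded in sorted order:

```sql
-- Hypothetical table and column names, for illustration only.
-- Let the reader use ORC indexes to skip row groups at query time.
SET hive.optimize.index.filter=true;

CREATE TABLE events_orc (
  event_ts    TIMESTAMP,
  location_id INT,
  amount      DOUBLE
)
STORED AS ORC
TBLPROPERTIES (
  "orc.create.index"="true",
  -- Bloom filter only on the column that appears in WHERE clauses.
  "orc.bloom.filter.columns"="location_id"
);

-- Sort within each writer so the min/max ranges per file are narrow.
INSERT OVERWRITE TABLE events_orc
SELECT event_ts, location_id, amount
FROM events_staging
SORT BY location_id;
```

A filter such as WHERE location_id = 42 can then skip any stripe or row group whose min/max range or bloom filter rules the value out.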
06-04-2016
05:34 AM
ORC is considering adding a faster decompression codec in 2016 - zstd (ZStandard). The enum value for it has already been reserved, but it will not land until we work through the trade-offs involved in ZStd. More on that sometime later this year.
https://issues.apache.org/jira/browse/ORC-46
But bigger wins are in motion for ORC with LLAP. The in-memory format for LLAP isn't compressed at all, so it performs like ORC without compression overheads, while letting the cold data on disk sit around in Zlib.
06-02-2016
08:22 AM
2 Kudos
> floor(datediff(to_date(from_unixtime(unix_timestamp())), to_date(birthdate)) / 365.25)

That unix_timestamp() could turn off a few optimizations in the planner, which might not be related to this issue. Start using CURRENT_DATE instead of the unix_timestamp() call, for faster queries.
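
A sketch of that rewrite, assuming a hypothetical table named people with a birthdate column, and a Hive version that has CURRENT_DATE (1.2.0 or later, as far as I know):

```sql
-- Hypothetical table name; birthdate is the column from the original expression.
-- CURRENT_DATE is fixed at query compile time, unlike unix_timestamp().
SELECT floor(datediff(CURRENT_DATE, to_date(birthdate)) / 365.25) AS age
FROM people;
```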
03-22-2016
08:12 PM
1 Kudo
Is there more information on this?
There are 2 forms of logical predicate-pushdown and 2 forms of physical predicate-pushdown in Hive.
02-24-2016
02:08 AM
1 Kudo
Ignoring the actual backtrace (which is a bug), I have seen issues with uncompressed text tables in Tez related to Hive's use of Hadoop-1 APIs. Try re-running with set mapreduce.input.fileinputformat.split.minsize=67108864; or alternatively, gzip the files before loading, with something like this:
https://gist.github.com/t3rmin4t0r/49e391eab4fbdfdc8ce1
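
For the first option, the setting can be applied per session right before the failing query. A minimal sketch, where my_text_table is a hypothetical stand-in for the uncompressed text table:

```sql
-- 67108864 bytes = 64 * 1024 * 1024, i.e. a 64 MiB minimum split size.
SET mapreduce.input.fileinputformat.split.minsize=67108864;

-- Hypothetical table name; re-run the original query here instead.
SELECT count(*) FROM my_text_table;
```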
02-24-2016
02:02 AM
This looks awfully like an HDFS bug and entirely unrelated to Tez.
The IndexOutOfBounds is thrown from HDFS block local readers.
01-27-2016
09:54 PM
1 Kudo
Workarounds are specific to the actual problem, not a symptom like running out of memory.
There are parts of the system which do use a lot of memory - the usual set of workarounds is to disable memory-hungry features like map-joins and map-side hash aggregations. Alternatively, there are a few scalability features which reduce the total memory required, but are disabled by default since they degrade performance on large-RAM clusters (like dynamic partitioned insert optimizations). There are also configuration issues which go unnoticed, like allocating 60% of a container as a single sort buffer.
At the very least, I ask people to submit a jmap hprof or a jmap -histo so these can be diagnosed.
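
For reference, the features mentioned above map, to the best of my knowledge, onto the session settings below. This is only a sketch of which knobs exist, not a recommendation; which one helps depends on the actual problem.

```sql
-- Disable automatic conversion of joins into in-memory map-joins.
SET hive.auto.convert.join=false;

-- Disable map-side hash aggregation.
SET hive.map.aggr=false;

-- Enable the sorted dynamic-partition insert optimization,
-- which trades speed for a much smaller memory footprint.
SET hive.optimize.sort.dynamic.partition=true;

-- Print the current sort buffer and container sizes to compare them
-- (SET with no value shows the current setting).
SET tez.runtime.io.sort.mb;
SET hive.tez.container.size;
```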
01-07-2016
11:47 PM
1 Kudo
@Ryan Tomczik: I can confirm this as a bug, even without views.
Filed https://issues.apache.org/jira/browse/HIVE-12808
Please have a look & see if that describes the issue clearly.
01-06-2016
08:46 AM
The ideal rewrite for such a query is
select * from latestposition where regionid='1d6a0be1-6366-4692-9597-ebd5cd0f01d1' and id=1422792010 and deviceid='6c5d1a30-2331-448b-a726-a380d6b3a432' order by ts limit 1;
Is the table partitioned on ts?
01-06-2016
12:30 AM
This is definitely a quoting bug in beeline on the hive-1.2.x branch - https://github.com/apache/hive/commit/36f7ed781271...
I tried this in the latest builds and it worked (though the code needs to be collapsed into a single-line compile command), but it does not work with old beeline + new HS2.
Beeline version 2.1.0-SNAPSHOT by Apache Hive
0: jdbc:hive2://localhost:10003> compile `import org.apache.hadoop.hive.ql.exec.UDF \; import groovy.json.JsonSlurper \; import org.apache.hadoop.io.Text \; public class JsonExtract extends UDF { public int evaluate(Text a){ def jsonSlurper = new JsonSlurper() \; def obj = jsonSlurper.parseText(a.toString())\; return obj.val1\; } } ` AS GROOVY NAMED json_extract.groovy;
No rows affected (1.092 seconds)
0: jdbc:hive2://localhost:10003> CREATE TEMPORARY FUNCTION json_extract as 'JsonExtract';
No rows affected (1.421 seconds)
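
Once the function is registered, it should be callable like any other UDF in the same session. A sketch, assuming an input document that has an integer val1 field (the JSON literal here is made up for illustration):

```sql
-- json_extract is the temporary function created above;
-- it returns the integer stored under the "val1" key.
SELECT json_extract('{"val1": 42}');
```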