Member since: 09-28-2015
Posts: 41
Kudos Received: 44
Solutions: 12
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
| | 3135 | 04-12-2017 12:19 PM |
| | 3239 | 01-26-2017 04:38 AM |
| | 817 | 01-10-2017 10:39 PM |
| | 2120 | 08-16-2016 07:12 PM |
| | 14418 | 07-20-2016 06:14 PM |
06-07-2016
06:15 AM
2 Kudos
To add to @emaxwell:

> If I am only concerned with performance

The bigger win is being able to skip decompressing blocks entirely - if you have hive.optimize.index.filter=true, that will kick in.

> a few items in a where clause

That's where the ORC indexes matter - if you have orc.create.index=true and orc.bloom.filter.columns contains those columns specifically (using "*" is easy, but it slows down ETL when tables are wide and the measures are random).

Clustering and sorting on the column most commonly used in the filter can give you 2-3 orders of magnitude of performance (sorting specifically, because the min/max values are stored in the file footer - this happens naturally in most ETL for date/timestamp columns, but for something randomly distributed like a location id, it is a big win). See for example this ETL script:
https://github.com/t3rmin4t0r/all-airlines-data/blob/master/ddl/orc.sql#L78
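A minimal sketch of the settings described above, pulled together into one DDL flow. The table and column names (positions_orc, location_id, etc.) are hypothetical, not from the linked script:

```sql
-- Hypothetical table: ORC indexes plus a bloom filter on the column
-- most commonly used in point-lookup WHERE clauses.
CREATE TABLE positions_orc (
  location_id BIGINT,
  ts TIMESTAMP,
  payload STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.create.index' = 'true',
  'orc.bloom.filter.columns' = 'location_id'
);

-- Sorting on the filter column keeps the stripe-level min/max stats
-- selective, so whole stripes can be skipped at read time.
INSERT OVERWRITE TABLE positions_orc
SELECT location_id, ts, payload
FROM positions_staging
SORT BY location_id;

-- Make sure the reader actually consults the indexes:
set hive.optimize.index.filter=true;
```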
06-04-2016
05:34 AM
ORC is considering adding a faster decompression codec in 2016 - zstd (ZStandard). The enum value for it has already been reserved, but we still need to work through the trade-offs involved in ZStd - more on that sometime later this year.
https://issues.apache.org/jira/browse/ORC-46 But bigger wins are in motion for ORC with LLAP: the in-memory format for LLAP isn't compressed at all, so it performs like ORC without the compression overhead, while letting the cold data on disk sit around in Zlib.
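Until zstd lands, the on-disk codec is a per-table choice; a hedged sketch (table name hypothetical):

```sql
-- Cold data on disk in Zlib: the default trade-off (smaller, slower).
CREATE TABLE events_orc (id BIGINT, msg STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

-- Or trade some disk space for faster decompression today:
--   'orc.compress' = 'SNAPPY'
-- or skip compression entirely with 'orc.compress' = 'NONE'.
```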
06-02-2016
08:22 AM
2 Kudos
floor(datediff(to_date(from_unixtime(unix_timestamp())), to_date(birthdate)) / 365.25)

That unix_timestamp() call can turn off a few optimizations in the planner, which may be unrelated to this issue. Use CURRENT_DATE instead of the unix_timestamp() call for faster queries.
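The rewrite, side by side. The people table and its birthdate column are hypothetical stand-ins:

```sql
-- Before: unix_timestamp() is non-deterministic from the planner's
-- point of view, which can block constant folding and related optimizations.
select floor(datediff(to_date(from_unixtime(unix_timestamp())), to_date(birthdate)) / 365.25)
from people;

-- After: CURRENT_DATE is resolved once at plan time.
select floor(datediff(CURRENT_DATE, to_date(birthdate)) / 365.25)
from people;
```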
03-22-2016
08:12 PM
1 Kudo
Is there more information on this?
There are 2 forms of logical predicate-pushdown and 2 forms of physical predicate-pushdown in Hive.
02-24-2016
02:08 AM
1 Kudo
Ignoring the actual backtrace (which is a bug), I have seen issues with uncompressed text tables in Tez related to Hive's use of Hadoop-1 APIs. Try re-running with

set mapreduce.input.fileinputformat.split.minsize=67108864;

or, alternatively, gzip-compress the files before loading, with something like this:
https://gist.github.com/t3rmin4t0r/49e391eab4fbdfdc8ce1
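A sketch of the first workaround in session form (the table name is hypothetical):

```sql
-- Force a minimum input split size of 64 MB for the uncompressed text
-- table, so Tez does not create one task per tiny file or block.
set mapreduce.input.fileinputformat.split.minsize=67108864;

select count(*) from my_text_table;
```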
02-24-2016
02:02 AM
This looks awfully like an HDFS bug and entirely unrelated to Tez.
The IndexOutOfBounds is thrown from HDFS block local readers.
01-27-2016
09:54 PM
1 Kudo
Workarounds are specific to the actual problem, not a symptom like running out of memory.
There are parts of the system which do use a lot of memory - the usual set of workarounds is to disable memory hungry features like map-joins, map-side hash aggregations. Alternatively, there are a few scalability features which reduce total memory required, but are disabled since they degrade performance on large RAM clusters (like dynamic partitioned insert optimizations). There are configuration issues which go unnoticed, like allocating 60% of a container as a single sort buffer.
At the very least, please attach a jmap heap dump (hprof) or a jmap -histo output so that the memory use can actually be diagnosed.
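For reference, the memory-hungry features mentioned above map to these session settings. This is a sketch; verify the property names against your Hive version:

```sql
-- Disable map-joins (broadcast hash tables held in task memory):
set hive.auto.convert.join=false;

-- Disable map-side hash aggregation:
set hive.map.aggr=false;

-- Sort-based dynamic partition insert: slower, but keeps only one
-- partition writer open at a time instead of one per partition:
set hive.optimize.sort.dynamic.partition=true;
```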
01-07-2016
11:47 PM
1 Kudo
@Ryan Tomczik: I can confirm this as a bug, even without views.
Filed https://issues.apache.org/jira/browse/HIVE-12808
Please have a look & see if that describes the issue clearly.
01-06-2016
08:46 AM
The ideal rewrite for such a query is
select * from latestposition
where regionid='1d6a0be1-6366-4692-9597-ebd5cd0f01d1'
  and id=1422792010
  and deviceid='6c5d1a30-2331-448b-a726-a380d6b3a432'
order by ts limit 1;
Is the table partitioned on ts?
01-06-2016
12:30 AM
This is definitely a quoting bug in beeline in the hive-1.2.x branch - https://github.com/apache/hive/commit/36f7ed781271...
I tried this in the latest builds and it worked (though the code needs to be collapsed into a one-line compile command), but it does not work with an old beeline + a new HS2.
Beeline version 2.1.0-SNAPSHOT by Apache Hive
0: jdbc:hive2://localhost:10003> compile `import org.apache.hadoop.hive.ql.exec.UDF \; import groovy.json.JsonSlurper \; import org.apache.hadoop.io.Text \; public class JsonExtract extends UDF { public int evaluate(Text a){ def jsonSlurper = new JsonSlurper() \; def obj = jsonSlurper.parseText(a.toString())\; return obj.val1\; } } ` AS GROOVY NAMED json_extract.groovy;
No rows affected (1.092 seconds)
0: jdbc:hive2://localhost:10003> CREATE TEMPORARY FUNCTION json_extract as 'JsonExtract';
No rows affected (1.421 seconds)