About Tim Armstrong

Tim Armstrong · ‎04-28-2016

It looks like it's some kind of benchmark data. Is it a publicly available benchmark? It seems strange that they would have such skewed data. I think even if it did stay in memory it would run for an incredibly long time: if 2 billion rows with one key in one table match 2 billion rows with the same key in the other table, then you will get on the order of 10^18 output rows (about 4 × 10^18, over a quintillion). So I think either there is something strange with the benchmark data or it doesn't make sense to join the tables. We don't have the exact join algorithm documented aside from in the Impala code. It's a version of the hybrid hash join: https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
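
To make the arithmetic concrete, here is a minimal Python sketch of how the output size of an inner equi-join follows from the duplicate counts per key. The function name and the toy inputs are illustrative assumptions, not anything from Impala; the large number is the row count reported in the error message for this query, assumed here to appear on both sides with a single key.

```python
from collections import Counter

def join_output_rows(left_keys, right_keys):
    """Count the rows an inner equi-join would emit, without materializing them."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Each key contributes (left duplicates) * (right duplicates) output rows.
    return sum(n * right_counts[k] for k, n in left_counts.items())

# Toy skewed input: one key repeated 1000 times on each side, plus one
# unmatched key per side.
small = join_output_rows(["k"] * 1000 + ["a"], ["k"] * 1000 + ["b"])

# Same arithmetic at the reported scale (assumption: every row on both
# sides shares one key).
rows_per_side = 2_052_516_353
skewed_output = rows_per_side * rows_per_side

print(small)
print(f"{skewed_output:.2e}")
```

Even the toy input turns about 2000 input rows into a million output rows, which is why a join on a heavily duplicated key is rarely what you want.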

Tim Armstrong · ‎04-27-2016

It looks like you have extreme skew in the key that you're joining on (~2 billion duplicates). The error message is: "Cannot perform hash join at node with id 2. Repartitioning did not reduce the size of a spilled partition. Repartitioning level 8. Number of rows 2052516353." In order for the hash join to join on a key, all the values for that key on the right side of the join need to be able to fit in memory. Impala has several ways to avoid this problem, but it looks like you defeated them all. First, it tries to put the smaller input on the right side of the join, but both your inputs are the same size, so that doesn't help. Second, it will try to spill some of the rows to disk and process a subset of those. Third, it will try to repeatedly split ("repartition") the right-side input based on the join key to try to get it small enough to fit in memory. Based on the error message, it tried to do that 8 times but still has 2 billion rows in one of the partitions, which probably means there are 2 billion rows with the same key.
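
As a rough illustration of why repartitioning cannot help with duplicate keys, here is a simplified Python sketch. It is not Impala's implementation: the fanout of 16, the SHA-256 hash, the function names and the in-memory lists are all assumptions made for the example. The only idea taken from the explanation above is that rows are split by a hash of the join key, with a limit of 8 levels.

```python
import hashlib

def partition_of(key, level, fanout=16):
    """Pick a partition for a key; each level mixes in a different seed."""
    digest = hashlib.sha256(f"{level}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % fanout

def repartition(keys, max_rows, level=0, max_level=8, fanout=16):
    """Split rows (represented by their join keys) until each partition
    holds at most max_rows, or raise once max_level is reached."""
    if len(keys) <= max_rows:
        return [keys]
    if level >= max_level:
        raise RuntimeError(f"gave up at level {level} with {len(keys)} rows")
    buckets = {}
    for key in keys:
        buckets.setdefault(partition_of(key, level, fanout), []).append(key)
    result = []
    for bucket in buckets.values():
        result.extend(repartition(bucket, max_rows, level + 1, max_level, fanout))
    return result

# Distinct keys spread across partitions and quickly fit the limit.
parts = repartition(list(range(10000)), max_rows=1000)
print(max(len(p) for p in parts) <= 1000)

# Rows that all share one key always hash to the same partition,
# so no level of repartitioning makes it smaller.
try:
    repartition(["dup"] * 5000, max_rows=1000)
except RuntimeError as err:
    print(err)
```

Identical keys produce identical hashes at every level, so the oversized partition is passed down unchanged until the level limit is hit.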

Tim Armstrong · ‎04-22-2016

Are you sure it's Impala that's triggering it? I don't think Impala would use du for anything. HDFS apparently does and Cloudera Manager might use it. Have you tried tracing back what is running 'du'? E.g. run "ps auxf" to get a tree-view of processes.

Tim Armstrong · ‎04-22-2016

This looks like the canonical JIRA for the problem: https://issues.cloudera.org/browse/IMPALA-2184 If you can't upgrade a major version, the JIRA says the fix is also being backported to Impala 2.3.4 (CDH 5.5.4), so if you upgrade to 2.3.4 when that is released, you will get the fix.

Tim Armstrong · ‎04-09-2016

If you set num_nodes=1 that will force it to run on a single node.

Tim Armstrong · ‎04-08-2016

If the query runs with multiple fragments, e.g. on 5 nodes, you can get one file per fragment. If you look at the query plan (i.e. explain <the query text>), it will show the degree of parallelism for the insert.

Tim Armstrong · ‎03-03-2016

The Hive UDF won't help you. If you look at the Hive issue tracker https://issues.apache.org/jira/browse/HIVE-1304, row_sequence() was added as a workaround because they didn't support the row_number() analytic function at that point in time. We support the row_number() analytic function in Impala, so there's no reason to try to use that UDF. If you want to start from a particular number, can't you just add it like I suggested in my previous answer?

Tim Armstrong · ‎03-01-2016

You could possibly use the ROW_NUMBER() analytic function as part of the solution: http://www.cloudera.com/documentation/archive/impala/2-x/2-0-x/topics/impala_analytic_functions.html#row_number_unique_1 E.g. select t2.last_id + 1 + row_number() over (order by table1.col1), table1.col1 from table1 inner join (select max(id) last_id from table2) t2 This wouldn't be safe if you are running multiple inserts concurrently.
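
The idea behind that query can be sketched in plain Python. This only emulates the numbering logic and is not Impala code; `assign_ids` and the sample lists are hypothetical stand-ins for table1 and table2. One detail worth knowing: row_number() starts at 1, so `last_id + 1 + row_number()` makes the first new id `last_id + 2`. The sketch uses `last_id + row_number()` to keep the ids contiguous; either form gives unique ids.

```python
def assign_ids(new_values, last_id):
    """Emulate last_id + row_number() over (order by value)."""
    # enumerate(..., start=1) mirrors row_number(), which starts at 1.
    return [(last_id + n, v) for n, v in enumerate(sorted(new_values), start=1)]

existing_ids = [1, 2, 3]             # stands in for table2.id
new_rows = ["pear", "apple", "fig"]  # stands in for table1.col1

assigned = assign_ids(new_rows, max(existing_ids))
print(assigned)

# Two "concurrent" inserts that both read the same max(id) collide,
# which is why this approach is unsafe without serializing the inserts.
first = assign_ids(["x"], max(existing_ids))
second = assign_ids(["y"], max(existing_ids))
print(first[0][0] == second[0][0])
```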

Tim Armstrong · ‎02-23-2016

Hi, there are many possible variables, including the exact version of Impala, the operating system it was built on, the build flags and environment variables, and what version/build of dependencies you're using. I think the specific thing you're probably seeing with file sizes is that in the CDH distribution the debug symbols are stripped from the binaries and included in separate impalad.debug files. Are you running into some error when trying to run your custom build of Impala? It probably makes more sense to debug that problem rather than trying to exactly reproduce Cloudera's build.

Tim Armstrong · ‎01-27-2016

You are most likely running into this bug with the aggregation: https://issues.cloudera.org/browse/IMPALA-2352 We fixed it in CDH5.5/Impala 2.3 but the change wasn't backported because it was deemed too risky for a maintenance release.

Last Visited	‎02-11-2021 06:07 PM

Member Since	‎07-29-2015 04:07 PM
Posts	535
Kudos received	141

Cloudera Community

Re: Impala Queries which were previously working a...

Re: Impala queries are not distributing to all the...

Re: impala - `recover partitions` points to old da...

Re: impala catalog server JVM

Re: Impala - On-demand metadata

Re: impala memory limit exceed

Re: impala memory limit exceed

Re: What triggers du in Impala?

Re: query causes Impala to crash

Re: combine small parquet files

Re: combine small parquet files

Re: Sequence number generation in impala

Re: Sequence number generation in impala

Re: problem about building impala

Re: Unexpected Spill to Disk Activity