Member since: 07-29-2015
Posts: 535
Kudos Received: 140
Solutions: 103

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 5813 | 12-18-2020 01:46 PM |
|  | 3740 | 12-16-2020 12:11 PM |
|  | 2653 | 12-07-2020 01:47 PM |
|  | 1897 | 12-07-2020 09:21 AM |
|  | 1229 | 10-14-2020 11:15 AM |
04-28-2016
08:48 AM
It looks like it's some kind of benchmark data? Is it a publicly available benchmark? It seems strange that they would have such skewed data. I think even if it did stay in memory it would run for an incredibly long time: if 2 billion rows with the same key in one table are matched against 2 billion rows with that key in another table, you get 2×10^9 * 2×10^9 = 4×10^18 output rows (over a quintillion). So either there is something strange with the benchmark data or it doesn't make sense to join the tables. We don't have the exact join algorithm documented anywhere outside the Impala code; it's a version of the hybrid hash join: https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
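As a sanity check before running the join, you can estimate the output cardinality by multiplying the per-key row counts on each side. A minimal sketch, assuming the tables are `t1` and `t2` and the join column is `join_key` (all placeholder names):

```sql
-- For each key the join emits (rows with that key in t1) * (rows with that key in t2),
-- so summing that product over all matching keys gives the total output row count.
SELECT SUM(l.cnt * r.cnt) AS estimated_output_rows
FROM (SELECT join_key, COUNT(*) AS cnt FROM t1 GROUP BY join_key) l
JOIN (SELECT join_key, COUNT(*) AS cnt FROM t2 GROUP BY join_key) r
  ON l.join_key = r.join_key;
```

If that number comes out in the quintillions, the join itself is the problem, not the memory limit.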
04-27-2016
04:00 PM
It looks like you have extreme skew in the key that you're joining on (~2 billion duplicates). The error message is: "Cannot perform hash join at node with id 2. Repartitioning did not reduce the size of a spilled partition. Repartitioning level 8. Number of rows 2052516353." For the hash join to join on a key, all the values for that key on the right side of the join need to fit in memory. Impala has several ways to avoid this problem, but it looks like you defeated them all. First, it tries to put the smaller input on the right side of the join, but both your inputs are the same size, so that doesn't help. Second, it will spill some of the rows to disk and process a subset of them at a time. Third, it will repeatedly split ("repartition") the right-side input based on the join key to try to get it small enough to fit in memory. Based on the error message, it tried that 8 times and still had 2 billion rows in one partition, which almost certainly means there are ~2 billion rows with the same key.
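You can confirm the skew by counting rows per join key. A minimal sketch, assuming the right-side table is `t2` and the join column is `join_key` (placeholder names):

```sql
-- The most heavily duplicated join keys; a single key with ~2 billion rows
-- will show up at the top of this list.
SELECT join_key, COUNT(*) AS num_rows
FROM t2
GROUP BY join_key
ORDER BY num_rows DESC
LIMIT 10;
```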
04-22-2016
09:21 AM
Are you sure it's Impala that's triggering it? I don't think Impala uses du for anything. HDFS apparently does, and Cloudera Manager might. Have you tried tracing back what is running 'du'? E.g. run "ps auxf" to get a tree view of processes.
04-22-2016
09:16 AM
This looks like the canonical JIRA for the problem: https://issues.cloudera.org/browse/IMPALA-2184
If you can't upgrade a major version, the JIRA says the fix is also being backported to Impala 2.3.4 (CDH 5.5.4), so if you upgrade to 2.3.4 when it is released, you will get the fix.
04-09-2016
10:44 AM
1 Kudo
If you set num_nodes=1, that will force the query to run on a single node.
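For example, in impala-shell you can set the option for the session and then run the statement (the table names here are placeholders):

```sql
SET num_nodes=1;  -- run the whole plan on a single node
INSERT INTO target_table
SELECT * FROM source_table;
```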
04-08-2016
05:58 PM
If the query runs with multiple fragments, e.g. on 5 nodes, you can get one file per fragment. If you look at the query plan (i.e. explain <the query text>), it will show the degree of parallelism for the insert.
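For example (with placeholder table names), you can inspect the plan before running the insert:

```sql
-- The plan output shows which operators run in parallel and on how many hosts,
-- which tells you how many files the insert can produce.
EXPLAIN INSERT INTO target_table SELECT * FROM source_table;
```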
03-03-2016
09:46 AM
1 Kudo
The Hive UDF won't help you. If you look at the Hive issue tracker (https://issues.apache.org/jira/browse/HIVE-1304), row_sequence() was added as a workaround because Hive didn't support the row_number() analytic function at that point in time. We support the row_number() analytic function in Impala, so there's no reason to use that UDF. If you want to start from a particular number, can't you just add it like I suggested in my previous answer?
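For instance, to start the sequence at 1000 (a sketch; `my_table` and `col1` are placeholder names):

```sql
-- ROW_NUMBER() starts at 1, so the first row gets 999 + 1 = 1000.
SELECT 999 + ROW_NUMBER() OVER (ORDER BY col1) AS id,
       col1
FROM my_table;
```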
03-01-2016
07:12 AM
1 Kudo
You could possibly use the ROW_NUMBER() analytic function as part of the solution: http://www.cloudera.com/documentation/archive/impala/2-x/2-0-x/topics/impala_analytic_functions.html#row_number_unique_1

E.g. (row_number() starts at 1, so this continues the sequence from the current max id; note the cross join, since the subquery returns a single row):

select t2.last_id + row_number() over (order by table1.col1) as id,
       table1.col1
from table1
cross join (select max(id) as last_id from table2) t2

This wouldn't be safe if you are running multiple inserts concurrently.
02-23-2016
08:39 PM
Hi, there are many possible variables, including the exact version of Impala, the operating system it was built on, the build flags and environment variables, and the versions/builds of the dependencies you're using. The specific thing you're probably seeing with file sizes is that in the CDH distribution the debug symbols are stripped from the binaries and shipped in separate impalad.debug files. Are you running into some error when trying to run your custom build of Impala? It probably makes more sense to debug that problem than to try to exactly reproduce Cloudera's build.
01-27-2016
04:09 PM
You are most likely running into this bug in the aggregation: https://issues.cloudera.org/browse/IMPALA-2352
We fixed it in CDH 5.5 / Impala 2.3, but the change wasn't backported because it was deemed too risky for a maintenance release.