Member since: 12-07-2015
Posts: 83
Kudos Received: 23
Solutions: 10
My Accepted Solutions
Views | Posted
---|---
2966 | 07-11-2018 02:42 PM
7961 | 12-10-2017 08:26 PM
2240 | 11-14-2017 12:17 PM
16361 | 03-29-2017 06:42 AM
2195 | 02-22-2017 01:43 PM
11-14-2017
12:17 PM
Hi mauricio, Impala currently does not support graceful node decommissioning. We're tracking work on this feature in IMPALA-1760, but it is not currently targeted for a particular release. Unfortunately, that leaves killing the daemon as the only option. Cheers, Lars
08-14-2017
01:39 PM
1 Kudo
I just had a look but couldn't spot an obvious problem. The HDFS scanner fragments read at around 15 MB/s, which seems reasonable to me given how computationally intensive Parquet decoding is. There also doesn't seem to be any considerable skew. Each of your 5 nodes reads ~100 GB of data in 134 s, so the overall throughput is around 764 MB/s. I suggest having a look at the performance improvements for Parquet files in CDH 5.12 that I mentioned in an earlier reply.
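A quick sketch of that throughput arithmetic, for anyone who wants to plug in their own profile numbers. It assumes "GB" is read as 1024 MB, which is an assumption about the units rather than something stated in the profile; the function name is purely illustrative.

```python
def throughput_mb_per_s(data_gb: float, seconds: float, mb_per_gb: int = 1024) -> float:
    """Return read throughput in MB/s for `data_gb` gigabytes read in `seconds`.

    mb_per_gb is an assumption: 1024 for binary units, 1000 for decimal units.
    """
    return data_gb * mb_per_gb / seconds


# Numbers taken from the reply above: ~100 GB read in 134 s.
rate = throughput_mb_per_s(100, 134)
print(f"{rate:.0f} MB/s")  # prints "764 MB/s"
```

With decimal units (`mb_per_gb=1000`) the same inputs give roughly 746 MB/s, so the figure depends on which convention is used.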
08-11-2017
10:50 AM
That shouldn't be too many files. Impala also processes files in parallel locally, so you should see higher utilization on each node. Can you post a profile of one of the slow queries?
08-11-2017
10:08 AM
I'd try to reduce the file size to 256 MB and make sure that the block size is at least that large, too. That way you should end up with 32 GB × 4 files per GB = 128 files per partition. That should allow you to exploit parallelism across all your nodes. You can also try 512 MB per file and see if that improves things, but I suspect it won't. Btw, we're currently working on improving the ETL performance. You may want to look at the "SORT BY" clause that is included in Impala 2.9 and how it lets you write data in a way that allows Impala to skip row groups much more effectively. You can find more information in the umbrella JIRA: https://issues.apache.org/jira/browse/IMPALA-2522
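The file count above can be sketched as a small calculation. The 32 GB partition size and the 256 MB / 512 MB file sizes come from the reply; the helper name and the 1024 MB-per-GB convention are assumptions made for illustration.

```python
import math


def files_per_partition(partition_gb: float, file_mb: float) -> int:
    """Estimate how many files a partition is split into at a given target file size.

    Assumes 1 GB = 1024 MB and rounds up, since a partial file still counts as a file.
    """
    return math.ceil(partition_gb * 1024 / file_mb)


print(files_per_partition(32, 256))  # 128 files at 256 MB each
print(files_per_partition(32, 512))  # 64 files at 512 MB each
```

More files per partition means more units of work that can be spread across nodes, which is the reason for preferring the smaller file size here.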
08-11-2017
09:41 AM
Hi Shannon, Impala does not split up Parquet files over several readers when reading them. Instead, only one daemon will be assigned for each file and will read the whole file. Therefore it is recommended to have only one block per file. Otherwise some of the blocks can be on remote nodes and remote reads will slow down your queries. See this page for more information: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_perf_cookbook.html Cheers, Lars
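To check whether a file fits into a single block, a minimal sketch that counts how many blocks a file of a given size spans. The helper name and the sample sizes are hypothetical; any result above 1 means part of the file may live on a remote node.

```python
import math


def blocks_per_file(file_mb: float, block_mb: float) -> int:
    """Return the number of HDFS blocks a file of `file_mb` occupies at `block_mb` block size."""
    return math.ceil(file_mb / block_mb)


# Sample sizes are illustrative, not taken from the thread.
print(blocks_per_file(256, 256))   # 1: the whole file sits in one block
print(blocks_per_file(1024, 128))  # 8: blocks may be spread over several nodes
```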
07-12-2017
03:19 PM
@adi91 - How did you set --mem_limit? What value did you pass to it? What did http://hostname:25000/memz?detailed=true say after applying --mem_limit to the command line options? Did your value show up there?
07-08-2017
12:24 PM
1 Kudo
After more investigation I found that this is already documented as a Known Issue in CM: "Known Issues and Workarounds in Cloudera Manager 5". For Impala, I opened IMPALA-5631 to explain the problem and possible solutions in the docs.
07-08-2017
11:57 AM
@mbigelow - Thank you for keeping the JIRA updated - I'm glad you found the solution through support. It looks like you are hitting a bug in CM, and we are working on fixing it. I will reach out to our documentation team to point out this issue in the docs and the release notes of 5.11.1. I'm sorry for the trouble this has caused you.
05-31-2017
09:16 AM
num_nodes=1 forces Impala to execute the query on a single node (machine), which will then write only a single Parquet file per partition.
04-11-2017
06:05 AM
Hi imad87, Your question looks related to Solr, so I think it may fit better into the "Search" community: http://community.cloudera.com/t5/Cloudera-Search-Apache-SolrCloud/bd-p/Search Cheers, Lars