Member since
12-07-2015
83
Posts
23
Kudos Received
10
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 1941 | 07-11-2018 02:42 PM |
|  | 5640 | 12-10-2017 08:26 PM |
|  | 1578 | 11-14-2017 12:17 PM |
|  | 12208 | 03-29-2017 06:42 AM |
|  | 1448 | 02-22-2017 01:43 PM |
07-24-2019
11:26 AM
5 Kudos
This is a known issue and has been fixed in CM 6.2. Here is the relevant item in the release notes. Cheers, Lars
07-11-2018
02:42 PM
Yes, creating two clusters is what you could try. I'm no expert in setting this up and unfortunately I also don't have good advice on which tooling to use. distcp certainly could be worth a try. Within a country your experience will depend on where your machines are, and you'll likely also be affected by reduced bandwidth between data centers. I'm not sure about other services' behavior when running across racks. Impala is not (yet) rack-aware in its scheduling and exchanges. However, even once we get to adding support for rack-awareness, we might assume that the racks are within a single data-center.
07-11-2018
08:52 AM
This sounds like a result of the drastically increased link latency between your two "racks". While within a single rack you should see latencies less than a millisecond, US-EU latencies will be around 150ms, depending on where in the US and EU your machines are located. Bandwidth between your locations is likely also much lower than between the racks. Impala currently does not do any rack-aware scheduling of I/O and data exchanges. In addition it is not optimized for high variance in link latencies and throughput. HDFS itself to my knowledge also makes no optimizations for such a case. Frankly, I don't think you will see good performance in such a scenario. If you want to increase data availability, you could explore replicating the data between your locations while running queries in only one at a time. If you want to increase service availability, you can look into using a load balancer and switching from one cluster to the other in case of failure.
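To make the latency point concrete, here is a rough back-of-the-envelope sketch; the number of synchronous round trips is an illustrative assumption, not a measured value:

```python
# Illustrative only: effect of link latency on a query that performs
# many small synchronous exchanges between nodes.
round_trips = 200          # assumed number of synchronous exchanges
intra_rack_rtt_ms = 1      # ~1 ms within a rack
us_eu_rtt_ms = 150         # ~150 ms US-EU round trip

print(round_trips * intra_rack_rtt_ms)  # 200 ms of pure latency overhead
print(round_trips * us_eu_rtt_ms)       # 30000 ms, i.e. 30 s of overhead
```

The same query shape that finishes with negligible latency overhead within a rack can spend tens of seconds just waiting on the transatlantic link, before bandwidth differences are even considered.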
04-27-2018
11:34 AM
Thank you Chris for providing more information. It looks like it crashed in the code that writes Parquet files (HdfsParquetTableWriter::ColumnWriter<impala::StringValue>::ProcessValue). However, your query should not write any data: " SELECT a.topLevelField, b.priceFromNestedField FROM db.table a LEFT JOIN a.nestedField b" I also noticed that the stack looks like it has been overwritten by something. I don't recall any recent issues in that method and will have a look at the code to see if I can spot anything obvious. In the meantime, can you double check that this query caused the crash and no other query was running? Thanks, Lars
04-26-2018
03:41 PM
Let's see what the hs_err_pid file contains next. Additionally, would you be willing to share the Minidump or a core dump with us in private? Please be aware that Minidumps contain process memory of each thread's stack, and core dumps contain all of the process's memory. Let me know if you'd like to do that and I'll share a private upload link with you. Alternatively you can follow these instructions to resolve the minidump yourself and share the contained stack traces: https://cwiki.apache.org/confluence/display/IMPALA/Debugging+Impala+Minidumps
04-26-2018
10:44 AM
Hi, Can you post the ends of the INFO and ERROR logs? Can you also post the content of the hs_err_pid<pid>.log file? Thanks, Lars
03-12-2018
12:18 PM
1 Kudo
For Python it makes a difference whether output gets printed to the terminal (which in this case likely supports unicode) or output is redirected to a file (which means it needs to be encoded in ASCII). This post on StackOverflow seems to describe the issue well. I linked the post in the JIRA for future reference. Cheers, Lars
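A minimal Python sketch of the failure mode described in the linked post; the sample string is made up, and the point is simply that the ASCII codec (which Python 2 falls back to for redirected output) cannot represent non-ASCII characters:

```python
# Encoding non-ASCII text with the ASCII codec raises UnicodeEncodeError,
# which is what happens implicitly in Python 2 when stdout is redirected.
text = u"r\u00e9sum\u00e9"  # contains 'é', not representable in ASCII
try:
    text.encode("ascii")
    ascii_ok = False  # would only get here if encoding succeeded
except UnicodeEncodeError:
    ascii_ok = True   # ASCII cannot encode 'é'
print(ascii_ok)

# UTF-8, by contrast, round-trips the string without loss.
print(text.encode("utf-8").decode("utf-8") == text)
```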
03-09-2018
02:40 PM
Hi GeKas, I'm not sure I understood your question. In general, writing to stdout should respect the locale settings of your shell:

$ echo $LANG
en_US.UTF-8

Writing to a file, however, does not need to respect these, so its behavior may be different.
03-08-2018
10:48 AM
This looks like IMPALA-2717 to me. The JIRA has a patch attached to it, but no one ever seems to have pushed a code review for it. Unfortunately there's no targeted release for this issue. Contributions are always welcome; let me know if you want to give it a shot. Cheers, Lars
12-10-2017
08:26 PM
Hi Davood, Impala needs a column type for column3 and NULL does not allow the planner to infer the type. Using a cast to specify the type will work: create table v as select i, cast(null as int) as j from t; Cheers, Lars
11-14-2017
12:17 PM
Hi mauricio, Impala currently does not support graceful node decommissioning. We're tracking work on this feature in IMPALA-1760, but we currently are not targeting it for a particular release. Unfortunately that only leaves the option of killing the daemon. Cheers, Lars
11-02-2017
10:17 AM
Can you share a query profile? That could give insights into where Impala is spending the time.
11-01-2017
02:29 PM
Hi hrishi1dypim, Have you restarted all Impala roles including statestored and catalogd after the upgrade? Cheers, Lars
11-01-2017
02:25 PM
Hi yehudaks, How long does your processing step take? I.e., what do you mean by "takes a lot of time"? Can you share a query profile? Cheers, Lars
08-14-2017
01:39 PM
1 Kudo
I just had a look, but I couldn't spot an obvious problem. The HDFS scanner fragments read around 15 MB/s, which seems reasonable to me given how computationally intensive Parquet decoding is. There also doesn't seem to be any considerable skew. Each of your 5 nodes reads ~100 GB of data in 134 s, so the per-node throughput is around 764 MB/s. I suggest having a look at the perf improvements around Parquet files in CDH 5.12 that I mentioned in an earlier reply.
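For reference, the throughput arithmetic above works out as follows:

```python
# Back-of-the-envelope check: ~100 GB read per node in 134 s
# corresponds to roughly 764 MB/s per node.
bytes_per_node = 100 * 1024**3   # ~100 GB
seconds = 134
mb_per_s = bytes_per_node / seconds / 1024**2
print(round(mb_per_s))  # 764
```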
08-11-2017
10:50 AM
That number of files shouldn't be too many. Impala also processes files in parallel locally, so you should see higher utilization on each node. Can you post a profile of one of the slow queries?
08-11-2017
10:08 AM
I'd try to reduce the file size to 256 MB and make sure that the block size is at least that large, too. That way you should end up with 32 GB / 256 MB = 128 files per partition, which should allow you to exploit parallelism across all your nodes. You can also try 512 MB per file and see if that improves things, but I suspect it won't. Btw, we're currently working on improving ETL performance. You may want to look at the "SORT BY" clause included in Impala 2.9 and how it allows you to write data in a way that lets Impala skip row groups much more effectively. You can find more information in the umbrella JIRA: https://issues.apache.org/jira/browse/IMPALA-2522
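The file-count arithmetic can be checked quickly:

```python
# A 32 GB partition written as 256 MB files yields 128 files per partition.
partition_bytes = 32 * 1024**3
target_file_bytes = 256 * 1024**2
print(partition_bytes // target_file_bytes)  # 128
```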
08-11-2017
09:41 AM
Hi Shannon, Impala does not split up Parquet files over several readers when reading them. Instead, only one daemon will be assigned for each file and will read the whole file. Therefore it is recommended to have only one block per file. Otherwise some of the blocks can be on remote nodes and remote reads will slow down your queries. See this page for more information: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_perf_cookbook.html Cheers, Lars
07-12-2017
03:19 PM
@adi91 - How did you set --mem_limit? What value did you pass to it? What did http://hostname:25000/memz?detailed=true say after applying --mem_limit to the command line options? Did your value show up there?
07-08-2017
12:24 PM
1 Kudo
After more investigation I found that this is already documented as a Known Issue in CM: "Known Issues and Workarounds in Cloudera Manager 5" (Impala section). I opened IMPALA-5631 to explain the problem and possible solutions in the docs.
07-08-2017
11:57 AM
@mbigelow - Thank you for keeping the JIRA updated - I'm glad you found the solution through support. It looks like you are hitting a bug in CM and we are working on fixing it. I will reach out to our documentation team to point out this issue in the docs and the release notes of 5.11.1. I'm sorry for the troubles this has caused you.
06-20-2017
11:43 AM
Steps to generate SSL certificates are generally independent of Impala and well documented. One such place where you can find more information is here: https://www.digitalocean.com/community/tutorials/openssl-essentials-working-with-ssl-certificates-private-keys-and-csrs
06-14-2017
10:48 AM
1 Kudo
Hi, You would do something like 'select * from A a left outer join B b...'. You can find more documentation on the topic here: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_joins.html Cheers, Lars
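Since the join syntax here is standard SQL, the semantics of a LEFT OUTER JOIN can be illustrated with a self-contained sqlite3 sketch (table names and data are made up for illustration; Impala's syntax for this query is the same):

```python
import sqlite3

# LEFT OUTER JOIN keeps every row of the left table; right-side columns
# become NULL (None in Python) where no match exists.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (id INTEGER, name TEXT);
    CREATE TABLE B (id INTEGER, score INTEGER);
    INSERT INTO A VALUES (1, 'x'), (2, 'y');
    INSERT INTO B VALUES (1, 10);
""")
rows = conn.execute(
    "SELECT a.id, a.name, b.score "
    "FROM A a LEFT OUTER JOIN B b ON a.id = b.id ORDER BY a.id"
).fetchall()
print(rows)  # [(1, 'x', 10), (2, 'y', None)]
```

Row (2, 'y') has no match in B, so it still appears, with NULL for b.score.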
05-31-2017
09:16 AM
num_nodes=1 forces Impala to execute the query on a single node (machine), which will then write only a single Parquet file per partition.
04-11-2017
06:05 AM
Hi imad87, Your question looks related to Solr, so I think it may fit better into the "Search" community: http://community.cloudera.com/t5/Cloudera-Search-Apache-SolrCloud/bd-p/Search Cheers, Lars
04-04-2017
04:32 AM
1 Kudo
Hi Alon, Have you tried the inc_stats_size_limit_bytes command line flag as suggested by Tim? It is supported in CDH 5.10.0. Here's the full help text from impalad:

-inc_stats_size_limit_bytes (Maximum size of incremental stats the catalog is allowed to serialize per table. This limit is set as a safety check, to prevent the JVM from hitting a maximum array limit of 1GB (or OOM) while building the thrift objects to send to impalads. By default, it's set to 200MB) type: int64 default: 209715200

This should allow you to increase the limit you are hitting. Cheers, Lars
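As a quick sanity check, the flag's default of 209715200 bytes is indeed the 200 MB mentioned in the help text:

```python
# The documented default for -inc_stats_size_limit_bytes.
default_bytes = 209715200
print(default_bytes == 200 * 1024 * 1024)  # True
```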
04-04-2017
03:37 AM
Thank you for catching this Tim! The "SORTBY()" hint was added in IMPALA-4163, which was not included in Impala 2.8.0. It is currently being reworked into a SQL clause (IMPALA-4166), so I cannot make promises as to which release will contain this feature. My apologies for the confusion. I will make sure the documentation gets updated.
04-01-2017
05:30 AM
2 Kudos
Hi imad87, What is the purpose of specifying " ROW FORMAT DELIMITED" without a delimiter character? On first glance it looks like your data file contains the substring "\N" (the \ character followed by the N character) to delimit lines, instead of the "\n" character (ASCII 0xA). Can you double check the file in a hex editor? Cheers, Lars
03-30-2017
07:58 AM
Unfortunately I don't know how to speed up your particular query. You may want to have a look at Impala's query hints, especially the section about "Hints for join queries" on this page: https://www.cloudera.com/documentation/enterprise/latest/topics/impala_hints.html If that doesn't get you anywhere, maybe someone else here has an idea. Cheers, Lars
03-29-2017
06:42 AM
Hi Amit, Your first question has already been discussed in this thread:

There's a bit of a story there. When we started preparing the CDH 5.10 release, the Apache Impala 2.8 release was not ready, so we had to call it "Impala 2.7" in the version number. Impala 2.8 was officially released after we finished putting together the CDH 5.10 release, too late to bump the version in all places. CDH 5.10 Impala is almost exactly the same as 2.8, plus or minus a few patches, so in most of the announcements we've just called it 2.8. You can find a full list of commits in CDH 5.10.0 here: https://github.com/cloudera/Impala/commits/cdh5-2.7.0_5.10.0 The full list of commits in Impala 2.8 is here: https://github.com/apache/incubator-impala/commits/branch-2.8.0

To your second question: Impala indeed does not support the DELETE command for non-Kudu tables. You can use the TRUNCATE command to completely delete all data in a table. Cheers, Lars