05-25-2016
03:18 AM
Right. I've just accepted your reply as the solution. Thanks!
05-24-2016
08:08 AM
Alright, quick update: with the slow start setting set to "1", Impala in its own pool, and YARN with a bit less memory, there are no more deadlocks. Jobs ran a little slower, but that is to be expected. We're planning the upgrade to CDH 5.7 this week and will set slow start back to 0.8 once upgraded. I'll keep this thread updated.
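For reference, the "slow start" knob here is, as far as I understand, the standard MapReduce property mapreduce.job.reduce.slowstart.completedmaps. A minimal Java sketch of setting it per job on the client side (the job name "etl-job" is just a placeholder), assuming you'd rather not change it cluster-wide:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SlowStartExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // 1.0 = only schedule reducers once ALL mappers have finished, so running
            // reducers can never hold the containers the remaining mappers are waiting for.
            conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 1.0f);
            Job job = Job.getInstance(conf, "etl-job");
            // ...set mapper/reducer classes and input/output paths, then job.waitForCompletion(true)
        }
    }

The trade-off is exactly what's described above: the shuffle only starts after the map phase, so the job finishes a bit later, but reducers can no longer starve pending mappers.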
05-19-2016
07:04 AM
We must be very unlucky, because our specific pattern of ETLs made the deadlock almost 100% reproducible when YARN had fewer resources. I thought I had fully rolled back the config, but it turns out I had missed 2 settings, and we still had problems (not 100% of the time, but it hung around 10% of the time):
- Container Memory Maximum: the wizard set it to 85 GB, put it back down to 64 GB
- Container Virtual CPU Cores Maximum: the wizard set it to 24, put it back up to 32
With a proper rollback to the previous config, last night's run went well. I'll monitor the situation, then switch back to static pools and make sure to double-check everything, including the slow start parameter. The whole plan is to upgrade to CDH 5.7, the first step being dropping Llama. I'll follow up once the change is done. Thanks
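If it helps anyone verify a rollback: I'm assuming the Cloudera Manager names "Container Memory Maximum" and "Container Virtual CPU Cores Maximum" map to yarn.scheduler.maximum-allocation-mb and yarn.scheduler.maximum-allocation-vcores. A quick Java sketch that prints what the deployed client config actually contains (note the scheduler maximums live on the ResourceManager, so this only shows what CM pushed to the classpath's yarn-site.xml):

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class DumpYarnMaximums {
        public static void main(String[] args) {
            // Picks up yarn-default.xml and yarn-site.xml from the classpath
            // (e.g. a gateway's /etc/hadoop/conf).
            YarnConfiguration conf = new YarnConfiguration();
            String[] keys = {
                "yarn.scheduler.maximum-allocation-mb",
                "yarn.scheduler.maximum-allocation-vcores",
                "yarn.nodemanager.resource.memory-mb",
                "yarn.nodemanager.resource.cpu-vcores"
            };
            for (String key : keys) {
                System.out.println(key + " = " + conf.getInt(key, -1));
            }
        }
    }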
05-18-2016
01:40 PM
Hi,

We have a 12-node cluster running CDH 5.4.9 under CM 5.4.5. Each node has 12 CPUs (24 vcores with HT), 128 GB of RAM and 6 HDDs, and runs an HDFS DataNode daemon, a YARN NodeManager and an Impala daemon. The cluster is pretty simple: no HBase, no Sentry, no Kerberos.

Since we are on 5.4.9, we were running Impala's resource management inside of YARN. Because this setup is not supported in CDH 5.5, we wanted to separate Impala from YARN and reconfigure the static pools accordingly. These are the steps that were taken on May 17th:
1- Shut down Impala
2- Delete the 2 Llama role instances
3- Modify the Impala config to remove YARN as a resource manager
4- Start the Static Pool config wizard
5- Set the percentages to 5% for HDFS, 10% for Impala and 85% for YARN (it was 5% HDFS and 95% YARN before)
6- Restart everything

After the restart, everything looked like it was running fine: Hive and Impala queries ran without any errors whatsoever. Then this morning (May 18th), we noticed that most of our ETLs (mainly Hive jobs) did not run. Looking at the "YARN Applications" page in Cloudera Manager, there were 3 running applications and a huge list of pending ones. The 3 running applications had the EXACT same problem:
1- The M/R job was started with 3000+ mappers and 1099 reducers
2- Most of the mappers completed successfully
3- The reducers started the copy phase while the rest of the mappers continued their work
4- Then, at one point, the job hung because, for some reason, the pending mappers were never started

So we get stuck with a job that has 2-3 pending mappers and 100+ running reducers, and it stays like that forever because the pending mappers are never scheduled. At first I suspected the ResourceManager failover we had during the night, but it was unrelated: re-running the query hits the exact same problem!

What I've tried:
1- Disabled cgroups altogether --> all M/R jobs now fail
2- Rolled back my config by manually putting Impala back inside of YARN using Llama, re-enabling cgroups and rolling back the following settings:
   - back to default values for: Container Memory Maximum, Container Virtual CPU Cores Maximum, Cgroup CPU Shares, Cgroup I/O Weight
   - back to old values for: Default Number of Reduce Tasks per Job, Container Memory

#2 did the trick, but I'm back to square one where I can't upgrade to CDH 5.5 or newer. Any clues on where to start the investigation? It's a bit of a pain since this behavior cannot be replicated on our test cluster (not enough data), only in Production...

Thanks!
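For context on step 5, on 128 GB nodes those percentages work out roughly as in the sketch below (just back-of-the-envelope arithmetic, not necessarily the exact values the wizard writes):

    public class StaticPoolSplit {
        public static void main(String[] args) {
            final double nodeRamGb = 128.0;                // RAM per node
            final String[] services = {"HDFS", "Impala", "YARN"};
            final double[] shares = {0.05, 0.10, 0.85};    // static pool percentages
            for (int i = 0; i < services.length; i++) {
                // e.g. YARN -> 0.85 * 128 = 108.8 GB per node
                System.out.printf("%-6s -> %5.1f GB per node%n", services[i], shares[i] * nodeRamGb);
            }
        }
    }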