Contributor
Posts: 28
Registered: ‎05-12-2015

Cloudera agent creates failed data dirs on root partition, causing node failure.

Hello, we have an issue with the root partition filling up whenever a DataNode data directory fails and is unmounted. The Cloudera agent recreates the data directories on a server power cycle or a Cloudera agent restart.

 

Our environment:
CDH 4.7.1
CM 4.8.5

 

Our current data dir setup:

dfs.datanode.data.dir = /hadoopX/data
mapred.local.dir = /hadoopX/local

 

Whenever a drive /hadoopX fails and is unmounted for repair, the Cloudera agent creates the /hadoopX/data and /hadoopX/local directories on the root partition. Because of running jobs, the root partition (200 GB) fills up quickly, which results in service (DataNode, TaskTracker) failures.

 

Is there a workaround? How can we stop the Cloudera agent from creating data directories on the root partition? I see that Ambari had a similar issue, and it has been fixed: https://issues.apache.org/jira/browse/AMBARI-7506. Please suggest any workaround. Thank you, we appreciate your help.


-
Thanks
Vganji

Engineer
Posts: 1,730
Kudos: 357
Solutions: 274
Registered: ‎07-31-2013

Re: Cloudera agent creates failed data dirs on root partition, causing node failure.

Create the mount directories (before the mount is done) with 700 permissions (root owned). This way the agent will not overwrite or create them, and the DN will not be able to run on them. Once the real mount is in place, the permissions will be acceptable again, and only then will things work.

We do have an internal JIRA tracking an improvement to detect such a situation, and it will be done in a future CM release.

Do you mount your disks manually, BTW? You could have it done via fstab before the agent starts up in the init order, which might avoid the problem altogether.
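
One way to make the init-order idea concrete is a small pre-start check that refuses to start the agent while any expected data volume is unmounted. This is only a sketch under assumptions: the `all_mounted` helper and the /hadoop1 to /hadoop3 mount list are illustrative names, not part of Cloudera Manager, and it assumes the `mountpoint` utility from util-linux is installed.

```shell
#!/bin/sh
# Hypothetical pre-start check: the function name and the example mount
# list are illustrative, not part of Cloudera Manager.

# Return 0 only if every given path is currently a mount point.
all_mounted() {
    rc=0
    for dir in "$@"; do
        # mountpoint -q exits non-zero when the path is not a mount point
        # (or does not exist at all).
        if ! mountpoint -q "$dir" 2>/dev/null; then
            echo "not mounted: $dir" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example (as root), assuming the data disks are mounted at /hadoop1../hadoop3:
#   all_mounted /hadoop1 /hadoop2 /hadoop3 && service cloudera-scm-agent start
```

Note that a check like this keeps the agent down whenever any single disk is missing, so it only suits setups that prefer a stopped node over a filling root partition.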

Note that if you are unmounting data directory volumes for repair on DNs, you should probably also reconfigure the DN: http://blog.cloudera.com/blog/2015/05/new-in-cdh-5-4-how-swapping-of-hdfs-datanode-drives/
Contributor
Posts: 28
Registered: ‎05-12-2015

Re: Cloudera agent creates failed data dirs on root partition, causing node failure.

Harsh, thank you for your comments.

 

I tried changing the permissions of /hadoopX to 700 and started the DataNode service on it, but on DataNode service start the cm-agent still creates /hadoopX/data and /hadoopX/local on the root partition. Here is the log:

 

[07/Aug/2015 11:19:14 +0000] 14569 MainThread agent INFO Activating Process 608-hdfs-DATANODE
[07/Aug/2015 11:19:14 +0000] 14569 MainThread agent INFO Created /var/run/cloudera-scm-agent/process/608-hdfs-DATANODE
[07/Aug/2015 11:19:14 +0000] 14569 MainThread agent INFO Chowning /var/run/cloudera-scm-agent/process/608-hdfs-DATANODE to apps (513) apps (515)
[07/Aug/2015 11:19:14 +0000] 14569 MainThread agent INFO Chmod'ing /var/run/cloudera-scm-agent/process/608-hdfs-DATANODE to 0751
[07/Aug/2015 11:19:14 +0000] 14569 MainThread parcel INFO prepare_environment begin: {u'CDH': u'4.7.1-1.cdh4.7.1.p0.47'}, [u'cdh'], [u'cdh-plugin', u'hdfs-plugin']
[07/Aug/2015 11:19:14 +0000] 14569 MainThread parcel INFO The following requested parcels are not available: {}
[07/Aug/2015 11:19:14 +0000] 14569 MainThread parcel INFO Obtained tags ['cdh'] for parcel CDH
[07/Aug/2015 11:19:14 +0000] 14569 MainThread parcel INFO prepare_environment end: {'CDH': '4.7.1-1.cdh4.7.1.p0.47'}
[07/Aug/2015 11:19:14 +0000] 14569 MainThread util INFO Extracted 8 files and 0 dirs to /var/run/cloudera-scm-agent/process/608-hdfs-DATANODE.
[07/Aug/2015 11:19:14 +0000] 14569 MainThread agent INFO Created /var/run/cloudera-scm-agent/process/608-hdfs-DATANODE/logs
[07/Aug/2015 11:19:14 +0000] 14569 MainThread agent INFO Chowning /var/run/cloudera-scm-agent/process/608-hdfs-DATANODE/logs to apps (513) apps (515)
[07/Aug/2015 11:19:14 +0000] 14569 MainThread agent INFO Chmod'ing /var/run/cloudera-scm-agent/process/608-hdfs-DATANODE/logs to 0751
[07/Aug/2015 11:19:14 +0000] 14569 MainThread agent INFO Created /hadoop6/data
[07/Aug/2015 11:19:14 +0000] 14569 MainThread agent INFO Chowning /hadoop6/data to apps (513) hadoop (493)
[07/Aug/2015 11:19:14 +0000] 14569 MainThread agent INFO Chmod'ing /hadoop6/data to 0700
[07/Aug/2015 11:19:14 +0000] 14569 MainThread agent INFO Triggering supervisord update.
[07/Aug/2015 11:19:14 +0000] 14569 MainThread abstract_monitor INFO Refreshing DataNodeMonitor for None

 

And we mount disks using fstab only. Our main goal is to avoid decommissioning services on a node whose disk failures are <= dfs.datanode.failed.volumes.tolerated (2).

 

Could you help with some other workaround to stop the cm-agent from creating data directories on the root partition, maybe by adding a check in the agent at $CMF_PATH/agent/src/cmf/agent.py?

Engineer
Cloudera Employee
Posts: 225
Registered: ‎09-23-2013

Re: Cloudera agent creates failed data dirs on root partition, causing node failure.

Show us the fstab, please.

Cloudera Employee
Posts: 225
Registered: ‎09-23-2013

Re: Cloudera agent creates failed data dirs on root partition, causing node failure.

(Never mind my post just now regarding what X is. I found your paths are hadoop#, not literally hadoopX.)

Explorer
Posts: 24
Registered: ‎07-18-2016

Re: Cloudera agent creates failed data dirs on root partition, causing node failure.

Hi @Harsh J,

 

I know I'm reviving an old thread, but can you please comment on the fact that this "fix" still does not work in CDH 5.12.1, managed by CM? Even if the folders are owned by root, with 700, the folders still get created, and data is being written to an underlying FS, often /, which is not really good, don't you agree?

 

Thanks,

Milan

Posts: 1,730
Kudos: 357
Solutions: 274
Registered: ‎07-31-2013

Re: Cloudera agent creates failed data dirs on root partition, causing node failure.

There isn't a CM improvement in for this corner case yet; it's still in the works.

To be clear, I meant 700-ing + root-owning the actual target path instead of just the mount point path.

For instance, you may be mounting your disks at /data1, /data2, etc.

And you may be configuring DataNodes to use /data1/dfs/dn, /data2/dfs/dn, etc.

What I suggested earlier here is that, in an unmounted state, do the following for each path:

# As root
mkdir -p /data1/dfs/dn
chmod 700 /data1/dfs/dn

The 700 permission set is not to be applied on the /data1 parent path, but on the path the agent actually tries to create for you if absent.

Now when the /data1 path does not get mounted, the agent will see that the path already exists (on the / mount) and will skip ahead to the DN start. The DN start will fail because it cannot look inside /data1/dfs/dn.
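
For several disks, the steps above could be scripted roughly as follows. This is a sketch rather than a tested Cloudera procedure: `guard_data_dir` is a made-up helper name, the paths are the example paths from this post, and it assumes the `mountpoint` utility from util-linux is available. The helper only creates the placeholder while the disk is unmounted, so a real data directory on a mounted disk is never touched.

```shell
#!/bin/sh
# Hypothetical helper: create a mode-700 placeholder for a data directory on
# the underlying filesystem while its disk is NOT mounted. Run it as root so
# that the placeholder ends up root-owned.
# usage: guard_data_dir <mount point> <data dir path under that mount point>
guard_data_dir() {
    mount_point=$1
    data_dir=$2
    # Leave mounted disks alone: their data directory is the real one.
    if mountpoint -q "$mount_point" 2>/dev/null; then
        echo "skip: $mount_point is mounted" >&2
        return 0
    fi
    mkdir -p "$data_dir" && chmod 700 "$data_dir"
}

# Example (as root, with the disks unmounted):
#   for n in 1 2; do guard_data_dir "/data$n" "/data$n/dfs/dn"; done
```

If the agent has already created the directory on the root partition under the DataNode's own user, the placeholder will not block anything until that directory is removed or re-owned by root.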

Does this make sense?
Explorer
Posts: 24
Registered: ‎07-18-2016

Re: Cloudera agent creates failed data dirs on root partition, causing node failure.

Interesting, @Harsh J, we'll try that as well, and post back.

Edit: it seems that this workaround indeed works - thanks again.

Let's hope it gets patched soon - it seems relatively trivial to resolve, but I might be wrong. :)

 

Cheers
