Why do we need multiple values for YARN_LOCAL_DIR & YARN_LOCAL_LOG_DIR when configuring YARN?

Contributor

Why do we need multiple values for YARN_LOCAL_DIR & YARN_LOCAL_LOG_DIR when configuring YARN?

Example: YARN_LOCAL_DIR: /u01/app/hadoop/yarn/local, /u02/app/hadoop/yarn/local, /u03/app/hadoop/yarn/local

For DFS_NAME_DIR, it makes sense. But not for YARN_LOCAL_DIR & YARN_LOCAL_LOG_DIR; I felt one value would be enough.

Please clarify

Thanks

JJ

1 ACCEPTED SOLUTION

Having multiple values here gives YARN better scalability and performance for intermediate reads and writes. Just as HDFS uses multiple directories (preferably on different mount points/physical drives), the YARN local dirs can be spread across several drives to distribute the I/O load.

I have also seen customers use SSDs for the YARN local dirs, which can significantly improve job performance. For example, on a 12-drive system, 8 drives are SATA drives for the HDFS directories and 4 drives are smaller, faster SSDs for the YARN local dirs.

4 REPLIES 4

Contributor

Hi David,

For DFS_NAME_DIR, if we have multiple values, the stored data is redundant: each directory holds a copy of the fsimage and edits files. Even if one disk gets corrupted, another is available.

Do YARN_LOCAL_DIR and YARN_LOCAL_LOG_DIR also hold redundant copies of the same data? Is the idea that if one disk gets corrupted, another is available?

Thanks

JJ

Cloudera Employee

YARN_LOCAL_DIR can be configured with multiple dirs. This allows YARN to choose one dir at random from the set of good dirs. YARN uses roulette selection so that all dirs listed in LOCAL_DIR fill up at a similar rate.
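
To make the roulette selection mentioned above concrete, here is a minimal Python sketch of the general technique: each dir is picked with probability proportional to its free space, so emptier dirs are chosen more often. This is an illustration only, not YARN's actual code; the function name `pick_dir` and the free-space numbers are made up, and the paths are the ones from the question.

```python
import random


def pick_dir(free_bytes, rng=random):
    """Pick one dir with probability proportional to its free space.

    free_bytes maps a directory path to its free space (hypothetical input).
    """
    # Dirs with no free space never get picked.
    candidates = [(path, free) for path, free in free_bytes.items() if free > 0]
    if not candidates:
        raise ValueError("no free space in any configured directory")

    total = sum(free for _, free in candidates)
    point = rng.random() * total  # a random point on the "wheel"
    for path, free in candidates:
        if point < free:
            return path
        point -= free
    # Only reachable through floating-point rounding; fall back to the last dir.
    return candidates[-1][0]


# Hypothetical free space per dir, in GB.
free = {
    "/u01/app/hadoop/yarn/local": 300,
    "/u02/app/hadoop/yarn/local": 100,
    "/u03/app/hadoop/yarn/local": 100,
}
chosen = pick_dir(free)
```

With these numbers the first dir would be chosen about 60% of the time, which is what gradually evens out the fill level across drives.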

When YARN wants to read a file from the local dirs, it knows only the suffix part of the path (the file name, etc.), so it searches each configured dir for that suffix until it finds a match.
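
The lookup described above can be sketched roughly as follows in Python. Again this is only an illustration of the idea, not YARN's implementation; `find_in_local_dirs` is a hypothetical helper name.

```python
import os


def find_in_local_dirs(local_dirs, relative_path):
    """Return the first existing full path for relative_path, or None.

    relative_path is the known suffix (for example a cache file name) and
    must not start with "/", otherwise os.path.join would ignore the base dir.
    """
    for base in local_dirs:
        candidate = os.path.join(base, relative_path)
        if os.path.exists(candidate):
            return candidate
    return None
```

Because the file could have been written to any of the dirs, the read side has to probe them in turn; the cost is a few extra existence checks per lookup.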

Jacqualin,

Yes, the local dir and log dir both support multiple locations, and I advise using multiple locations to scale better. These directories aren't in HDFS and therefore don't get HDFS replication, but that's OK. They're used for file caches and intermediate data. If you lose a drive in the middle of processing, only the "task" using it is affected, and it may fail. In that case, the task is rescheduled somewhere else. So the job would be affected only to the extent of that retried task.

A failed drive in yarn_local_dir is OK, as the NodeManager will tag it and not use it going forward. That's one more reason to specify more than one drive here.
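
As a rough picture of what "tag it and not use it" means, here is a small Python sketch that splits the configured dirs into usable and failed ones. It is a simplification under my own assumptions (a dir counts as good if it exists and is readable, writable and traversable); the NodeManager's real health check is more thorough, and `split_good_and_failed` is a hypothetical name.

```python
import os


def split_good_and_failed(local_dirs):
    """Split configured dirs into (good, failed) using a basic access check."""
    good, failed = [], []
    for path in local_dirs:
        # Assumed health criteria: the dir exists and allows read/write/traverse.
        usable = os.path.isdir(path) and os.access(
            path, os.R_OK | os.W_OK | os.X_OK
        )
        (good if usable else failed).append(path)
    return good, failed
```

With a single configured dir, one failure leaves the good list empty and the node has nowhere to write; with several, work continues on the remaining drives.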

BUT, in older versions of YARN, a failed drive can prevent the NodeManager from "starting" or "restarting." If the NodeManager ever has trouble starting, the cause is usually pretty clear in its logs. YARN also indicates drive failures in the ResourceManager UI.

Newer versions of YARN are a bit more forgiving on startup.