Created 02-26-2018 10:53 AM
Hi,
I am using HDP 2.6 with Spark 2.1 (and also Spark 1.6), with YARN as the resource manager. I am trying out TeraSort benchmarking jobs on an experimental cluster.
I want to run 'hdfs dfs -du' (or 'hadoop fs -du') every time before starting a Spark job, to analyse the available disk space on the data nodes.
From the following question I understand that running these commands is expensive on the cluster.
So I wanted to know whether Hadoop automatically runs the 'dfs -du' command in the background whenever a new Spark job is started, or whether I need to run it manually.
Thanks,
Steev
Created 02-26-2018 01:33 PM
Perhaps you can provide some context on why you think an 'hdfs dfs -du' is needed at the start of each job?
In any case, I am sure that Spark will not run 'hdfs dfs -du' automatically at job start: a Spark job doesn't necessarily access HDFS, and Spark can also be operated without HDFS.
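If a usage check really needs to happen before each run, one option is to do it manually from the Spark driver via the Hadoop FileSystem API rather than shelling out to the CLI. A minimal sketch, assuming a Spark 2.x SparkSession named spark and a placeholder input path /benchmarks/terasort (with Spark 1.6 you would use sc.hadoopConfiguration instead):

import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration Spark already carries, so the check hits the same HDFS
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Equivalent of 'hdfs dfs -du -s /benchmarks/terasort': bytes stored under that path
val summary = fs.getContentSummary(new Path("/benchmarks/terasort"))
println(s"Logical size: ${summary.getLength} bytes, raw with replication: ${summary.getSpaceConsumed} bytes")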
Created 02-27-2018 06:11 AM
Thank you for the information.
The need for 'dfs -du' is to check how much disk space is available before starting the job, and to see how much data the job generates as it runs.
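For the free-space part, 'hdfs dfs -df' (or FileSystem.getStatus from code) may be the closer match, since -du reports what a path already occupies. A rough sketch of how both checks could be wired from the Spark driver, again assuming a SparkSession named spark and a hypothetical output path /benchmarks/terasort-out:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Equivalent of 'hdfs dfs -df': overall HDFS capacity, used and remaining bytes
val status = fs.getStatus
println(s"Capacity=${status.getCapacity} Used=${status.getUsed} Remaining=${status.getRemaining}")

// After the job finishes, the same kind of call on the output directory
// shows how much data the job actually generated
val out = fs.getContentSummary(new Path("/benchmarks/terasort-out"))
println(s"Job output: ${out.getLength} bytes (raw: ${out.getSpaceConsumed} bytes)")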