Archives of Support Questions (Read Only)

This board is archived and read-only for historical reference. Information and links may no longer be available or relevant.

Does Hadoop run dfs -du automatically when a new job starts?


Hi,

I am using HDP 2.6 with Spark 2.1 (and also Spark 1.6), with YARN as the resource manager. I am trying out TeraSort benchmarking jobs on an experimental cluster.

I want to run the 'hdfs dfs -du' (or 'hadoop fs -du') command every time before starting a Spark job, to check how much disk space is available on the data nodes.

From the following question I understand that running these commands can be expensive on the cluster:

https://community.hortonworks.com/questions/92214/can-hdfs-dfsadmin-and-hdfs-dsfs-du-be-taxing-on-my...

So I wanted to know: does Hadoop automatically run the dfs -du command in the background whenever a new Spark job is started, or do I need to run it manually?

Thanks,
Steev

1 ACCEPTED SOLUTION

Super Collaborator

Perhaps you can provide some context on why you think an hdfs dfs -du is needed at the start of each job?
In any case, I am sure that Spark does not run hdfs dfs -du automatically at job start: a Spark job does not necessarily access HDFS, and Spark can even be operated without HDFS entirely.
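
If the goal is simply to know how much free space HDFS has before you submit work, a much cheaper call than walking the namespace with -du is a single filesystem status request, which reports the same numbers as 'hdfs dfs -df'. Below is a minimal Scala sketch using Hadoop's FileSystem API, assuming HDFS is the default filesystem and the Hadoop configuration files are on the classpath; the object name is just a placeholder:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileSystem

    object SpaceCheck {
      def main(args: Array[String]): Unit = {
        // Reads core-site.xml / hdfs-site.xml from the classpath
        val fs = FileSystem.get(new Configuration())

        // One NameNode call; same figures as 'hdfs dfs -df'
        val status = fs.getStatus
        val gb = 1024.0 * 1024 * 1024
        println(f"capacity: ${status.getCapacity / gb}%.1f GB, " +
                f"used: ${status.getUsed / gb}%.1f GB, " +
                f"remaining: ${status.getRemaining / gb}%.1f GB")
      }
    }

Unlike 'hdfs dfs -du', which has to traverse the directory tree on the NameNode, getStatus is a single lightweight RPC, so it is cheap enough to run before every job.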


2 REPLIES



Thank you for the information.

The need for dfs -du is to check how much disk space is available before starting the job, and to see how much data the job generates as it runs.
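
For the second part, measuring how much data a run generated, per-path usage can also be read programmatically with getContentSummary, which reports the same numbers as 'hdfs dfs -du -s'. A sketch, with a hypothetical output path:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())

    // Hypothetical TeraSort output directory
    val out = new Path("/benchmarks/terasort/output")

    // Same numbers as 'hdfs dfs -du -s <path>': logical bytes,
    // plus raw usage including replication
    val summary = fs.getContentSummary(out)
    println(s"written: ${summary.getLength} bytes (logical), " +
            s"${summary.getSpaceConsumed} bytes including replication")

The caveat from the linked question still applies: getContentSummary walks the whole subtree on the NameNode, so running it over a single output directory after a job is fine, while running it over large trees before every job is exactly the expensive pattern to avoid.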