Created 02-26-2018 10:53 AM
Hi,
I am using HDP 2.6 with Spark 2.1 (and also Spark 1.6), with YARN as the resource manager. I am trying out TeraSort benchmarking jobs on an experimental cluster.
I want to run 'hdfs dfs -du' (or 'hadoop fs -du') every time before starting a Spark job, to analyse the available disk space on the data nodes.
From the following question I understand that running these commands is expensive on the cluster.
So I wanted to know whether Hadoop automatically runs the 'dfs -du' command in the background whenever a new Spark job is started, or whether I need to run it manually.
Thanks,
Steev
Created 02-26-2018 01:33 PM
Perhaps you can provide some context on why you think an 'hdfs dfs -du' is needed at the start of each job?
In any case, I am sure that Spark will not run 'hdfs dfs -du' automatically at job start: a Spark job doesn't necessarily access HDFS, and Spark can also be operated without HDFS.
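If a usage check really needs to happen before each run, one option is to do it manually from the Spark driver via the Hadoop FileSystem API rather than shelling out to the CLI. A minimal sketch, assuming a Spark 2.x SparkSession named spark and a placeholder input path /benchmarks/terasort (with Spark 1.6 you would use sc.hadoopConfiguration instead):

import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration Spark already carries, so the check hits the same HDFS
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Equivalent of 'hdfs dfs -du -s /benchmarks/terasort': bytes stored under that path
val summary = fs.getContentSummary(new Path("/benchmarks/terasort"))
println(s"Logical size: ${summary.getLength} bytes, raw with replication: ${summary.getSpaceConsumed} bytes")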
Created 02-27-2018 06:11 AM
Thank you for the information.
The need for 'dfs -du' is to check how much disk space is available before starting the job, and to see how much data the job generates as it runs.
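For the free-space part, 'hdfs dfs -df' (or FileSystem.getStatus from code) may be the closer match, since -du reports what a path already occupies. A rough sketch of how both checks could be wired from the Spark driver, again assuming a SparkSession named spark and a hypothetical output path /benchmarks/terasort-out:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Equivalent of 'hdfs dfs -df': overall HDFS capacity, used and remaining bytes
val status = fs.getStatus
println(s"Capacity=${status.getCapacity} Used=${status.getUsed} Remaining=${status.getRemaining}")

// After the job finishes, the same kind of call on the output directory
// shows how much data the job actually generated
val out = fs.getContentSummary(new Path("/benchmarks/terasort-out"))
println(s"Job output: ${out.getLength} bytes (raw: ${out.getSpaceConsumed} bytes)")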