Archives of Support Questions (Read Only)

This board is archived and read-only for historical reference. Information and links may no longer be available or relevant.

Does Hadoop run dfs -du automatically when a new job starts?


Hi,

I am using HDP 2.6 with Spark 2.1 (and also Spark 1.6), with YARN as the resource manager. I am trying out TeraSort benchmarking jobs on an experimental cluster.

I want to run the 'hdfs dfs -du' (or 'hadoop fs -du') command every time before starting a Spark job, to check how much disk space is available on the data nodes.

From the following question I understand that running these commands can be expensive on the cluster:

https://community.hortonworks.com/questions/92214/can-hdfs-dfsadmin-and-hdfs-dsfs-du-be-taxing-on-my...

So I wanted to know: does Hadoop automatically run the dfs -du command in the background whenever a new Spark job is started, or do I need to run it manually?

Thanks,
Steev

1 ACCEPTED SOLUTION

Super Collaborator

Perhaps you can provide some context on why you think an hdfs dfs -du is needed at the start of each job?
In any case, I am sure that Spark does not run hdfs dfs -du automatically at job start: a Spark job does not necessarily access HDFS, and Spark can even be operated without HDFS entirely.
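
If the goal is simply to know how much free space HDFS has before you submit work, a much cheaper call than walking the namespace with -du is a single filesystem status request, which reports the same numbers as 'hdfs dfs -df'. Below is a minimal Scala sketch using Hadoop's FileSystem API, assuming HDFS is the default filesystem and the Hadoop configuration files are on the classpath; the object name is just a placeholder:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileSystem

    object SpaceCheck {
      def main(args: Array[String]): Unit = {
        // Reads core-site.xml / hdfs-site.xml from the classpath
        val fs = FileSystem.get(new Configuration())

        // One NameNode call; same figures as 'hdfs dfs -df'
        val status = fs.getStatus
        val gb = 1024.0 * 1024 * 1024
        println(f"capacity: ${status.getCapacity / gb}%.1f GB, " +
                f"used: ${status.getUsed / gb}%.1f GB, " +
                f"remaining: ${status.getRemaining / gb}%.1f GB")
      }
    }

Unlike 'hdfs dfs -du', which has to traverse the directory tree on the NameNode, getStatus is a single lightweight RPC, so it is cheap enough to run before every job.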


2 REPLIES



Thank you for the information.

The need for dfs -du is to check how much disk space is available before starting the job, and to see how much data the job generates as it runs.
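
For the second part, measuring how much data a run generated, per-path usage can also be read programmatically with getContentSummary, which reports the same numbers as 'hdfs dfs -du -s'. A sketch, with a hypothetical output path:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new Configuration())

    // Hypothetical TeraSort output directory
    val out = new Path("/benchmarks/terasort/output")

    // Same numbers as 'hdfs dfs -du -s <path>': logical bytes,
    // plus raw usage including replication
    val summary = fs.getContentSummary(out)
    println(s"written: ${summary.getLength} bytes (logical), " +
            s"${summary.getSpaceConsumed} bytes including replication")

The caveat from the linked question still applies: getContentSummary walks the whole subtree on the NameNode, so running it over a single output directory after a job is fine, while running it over large trees before every job is exactly the expensive pattern to avoid.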