How to clean datanodes / nodemanagers data after multiple spark-submits?

Hello !

 

I am running extensive experiments on my 3-node cluster of VMs. Each VM has 50 GB of disk space, and after 10 spark-submit application submissions the NameNode UI (localhost:9870) shows that the disks are almost full. How can I delete the data created by these runs without restarting and reformatting HDFS?

Is there a DataNode cleanup command I could use here?

 

Thanks.


Cloudera Employee

Hello!

 

You can remove the data from HDFS using the following commands:

 

#hdfs dfs -rm -R -skipTrash <Extra-Data-folder>

#hdfs dfs -rm -r /tmp/spark
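
A cautious way to apply these commands is to check how much space a directory holds before deleting it, and to confirm afterwards that the space was released. The sketch below assumes a bash shell with the hdfs client on the PATH. The run wrapper, the DRY_RUN variable and the function name are hypothetical helpers, and /tmp/spark is only the example path from above. Keep in mind that -rm without -skipTrash moves files to the trash when trash is enabled, so the space is not released until the trash is purged.

```shell
#!/usr/bin/env bash
# Sketch only: inspect, delete, then re-check HDFS usage.
# With DRY_RUN=1 the commands are printed instead of executed.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

hdfs_cleanup() {
  local target="$1"                       # HDFS path to delete (assumption: you no longer need it)
  run hdfs dfs -du -s -h "$target"        # summarized size of the directory
  run hdfs dfs -rm -r -skipTrash "$target" # delete recursively, bypassing the trash
  run hdfs dfsadmin -report               # capacity and usage per DataNode
}
```

Running `DRY_RUN=1 hdfs_cleanup /tmp/spark` prints the three commands without touching the cluster, which is a reasonable first step before deleting anything.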

 

When too many DataNodes have very high disk utilization, fewer DataNodes remain available for write requests.
As a result, the DataNodes that can still accept writes are targeted at a higher rate, which increases their transceiver (xceiver) activity to the point of being "overloaded".

  • To correct this, reduce the disk utilization of the DataNodes in the cluster whose disk capacity limits have been reached.
    Adding drives to increase storage space, deleting unwanted or non-critical data from HDFS, and adding DataNodes to the cluster are all worthwhile solutions to this problem.
  • There is also a workaround for DataNode rejections caused by higher-than-normal transceiver volumes. Note that it is not a long-term solution and should only be used temporarily:
      • Set the 'dfs.namenode.replication.considerLoad' parameter to 'false' under HDFS > Configurations > "NameNode Advanced Configuration Snippet (Safety Valve)" in Cloudera Manager. This tells the NameNode to ignore current transceiver activity when choosing a DataNode for block placement. Leaving it set to 'false' permanently can have unintended consequences, because the NameNode can overwhelm DataNodes with too many requests; the considerLoad parameter exists to prevent that.
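
Before and after changing this setting, it can help to check which value the configuration files actually carry. This is a minimal sketch, assuming a bash shell with the hdfs client on the PATH. `hdfs getconf -confKey` prints the value from the configuration visible on the host where it runs, which is not necessarily what a running NameNode has loaded, and newer Hadoop releases may use a different name for this key. The run wrapper, the DRY_RUN variable and the function name are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Sketch only: print the considerLoad value from the local HDFS configuration.
# With DRY_RUN=1 the command is printed instead of executed.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

show_consider_load() {
  # Key name taken from the reply above; pass another key as $1 to override it.
  local key="${1:-dfs.namenode.replication.considerLoad}"
  run hdfs getconf -confKey "$key"
}
```

Calling `show_consider_load` on a NameNode host after the NameNode has been restarted with the new configuration is one way to confirm that the workaround is in place, and later that it has been reverted.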

Hopefully the provided solution will help resolve the issue.

 

Regards,

Vaishnavi Nalawade

 

Thanks, but I want to remove the data that results from running Spark applications with the spark-submit command, not from HDFS. Could you confirm that these are the commands to use in this case?
