Support Questions

yagoaparecidoti · ‎07-06-2023

Hello Cloudera community,

I need to add a new datanode to HDFS, and I need to know how much space will be used on this new datanode after running a rebalance on HDFS.

for example, I have the following scenario:

datanode 1 has 36tb total to use, 30tb is being used, containing about 2.5 million blocks
datanode 2 has 36tb total to use, 30tb is being used, containing about 2.5 million blocks
datanode 3 has 36tb total to use, 30tb is being used, containing about 2.5 million blocks
datanode 4 has 36tb total to use, 30tb is being used, containing about 2.5 million blocks
datanode 5 has 36tb total to use, 30tb is being used, containing about 2.5 million blocks
datanode 6 has 36tb total to use, 30tb is being used, containing about 2.5 million blocks

I will add "datanode 7", which also has 36tb total. After running the rebalance in HDFS, how much data will this new datanode 7 receive?

ChethanYM · ‎03-04-2024

@yagoaparecidoti

To estimate how much data the new DataNode 7 will receive after a rebalance in HDFS, consider the current data distribution across the existing DataNodes and how the HDFS balancer redistributes blocks.

Even Data Distribution: The balancer aims to even out storage utilization (the percentage of capacity in use) across all DataNodes in the cluster, rather than to equalize block counts. It moves existing blocks from the fuller DataNodes to the emptier ones, including the new DataNode 7.
Redistribution Strategy: The balancer compares each DataNode's utilization with the cluster-wide average and moves blocks until every DataNode is within the configured threshold of that average. Only a portion of the data on each existing DataNode is moved to DataNode 7, not all of it.
Optimization and Efficiency: The balancer tries to reach a balanced state with as little data movement and disruption as possible. Its transfers are throttled by the balancer bandwidth setting (dfs.datanode.balance.bandwidthPerSec), so that block moves do not starve regular cluster traffic of network and disk I/O.

Given these considerations, it's difficult to give an exact figure for how much data DataNode 7 will receive without knowing the details of the cluster configuration, in particular the balancer threshold in use. As a rough guide, DataNode 7 should end up close to the cluster-wide average utilization, receiving a share of the existing blocks from each of the other DataNodes.
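
For the scenario in the question, a back-of-the-envelope estimate can be sketched in Python as below. This is only a rough sketch under stated assumptions, not what the balancer itself computes: it assumes the full 36 TB of each DataNode is usable HDFS capacity, that the balancer runs to completion, and that the threshold is 10 percent (the usual balancer default). The function name estimate_new_node_usage and its parameters are made up for illustration.

```python
def estimate_new_node_usage(used_tb, capacity_tb, new_capacity_tb, threshold_pct=10.0):
    """Rough estimate of the data a new, empty DataNode holds after a rebalance.

    Assumption: the balancer stops once every DataNode's utilization is
    within threshold_pct percentage points of the cluster-wide average.
    """
    total_used = sum(used_tb)                            # data does not change
    total_capacity = sum(capacity_tb) + new_capacity_tb  # capacity grows
    avg_util = total_used / total_capacity               # cluster-wide average

    # Perfect balance: the new node sits exactly at the average utilization.
    target_tb = avg_util * new_capacity_tb
    # Lowest fill level at which the balancer would already consider it balanced.
    floor_tb = max(0.0, avg_util - threshold_pct / 100.0) * new_capacity_tb
    return avg_util, floor_tb, target_tb


# Scenario from the question: six DataNodes with 30 TB used of 36 TB each,
# plus one new, empty 36 TB DataNode.
avg_util, floor_tb, target_tb = estimate_new_node_usage([30.0] * 6, [36.0] * 6, 36.0)
print(f"cluster average utilization: {avg_util:.1%}")
print(f"DataNode 7 at perfect balance: {target_tb:.1f} TB")
print(f"DataNode 7 at minimum (10% threshold): {floor_tb:.1f} TB")
```

Under these assumptions the cluster average drops from about 83 percent to about 71 percent, so DataNode 7 would hold somewhere between roughly 22 TB and 26 TB, and each of the six existing DataNodes would fall from 30 TB to roughly 26 TB. A smaller threshold narrows that range towards the upper figure.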

Regards,

Chethan YM

yagoaparecidoti · ‎03-04-2024

Hi @ChethanYM

Thank you for your attention.

At the time I opened this question in the community, I made a calculation, and the actual values ended up matching what it predicted.
