
Storing HDFS data only on specific nodes

Expert Contributor


We have a 30-node production cluster. We want to add 5 data nodes for additional storage to handle an interim spike of data (around 2 TB). This data is to be stored temporarily, and we want to get rid of it after 15 days.

Is it possible to make sure that the interim data (2 TB) coming in will be stored only on the newly added data nodes?

I am looking for something similar to YARN node labelling.




Expert Contributor

Hi SS,

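You could try declaring the disks of the additional nodes as SSD tier and flagging the temporary data with an SSD storage policy. Note that One_SSD places only one replica on SSD and the remaining replicas on DISK, so if every replica is to stay on the new nodes, All_SSD is the policy to look at. This way, data should only reside on the declared "SSD disks" and, by that, on the "burst nodes".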
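As a rough sketch of what that could involve (not a tested recipe): the data directories on the new nodes are tagged with a storage type in dfs.datanode.data.dir, and the target directory is given a storage policy with the hdfs storagepolicies command. The small Python helper below only builds the config value and the command lines. The directory paths and /tmp/interim are made-up placeholders, and the policy name is passed in because One_SSD and All_SSD differ in how many replicas they place on SSD.

```python
import shlex


def tagged_data_dirs(dirs, storage_type="SSD"):
    """Build a dfs.datanode.data.dir value with every directory tagged,
    e.g. "[SSD]/data/1/dfs/dn". Untagged directories default to DISK."""
    return ",".join(f"[{storage_type}]{d}" for d in dirs)


def set_policy_cmd(path, policy):
    """Command that assigns a storage policy to an HDFS path."""
    return ["hdfs", "storagepolicies", "-setStoragePolicy",
            "-path", path, "-policy", policy]


def get_policy_cmd(path):
    """Command that shows the storage policy currently set on a path."""
    return ["hdfs", "storagepolicies", "-getStoragePolicy", "-path", path]


if __name__ == "__main__":
    # Hypothetical data directories on the five new nodes.
    print(tagged_data_dirs(["/data/1/dfs/dn", "/data/2/dfs/dn"]))
    # Hypothetical HDFS directory that receives the interim data.
    print(shlex.join(set_policy_cmd("/tmp/interim", "All_SSD")))
    print(shlex.join(get_policy_cmd("/tmp/interim")))
```

The first value would go into hdfs-site.xml on the new data nodes only. The policy applies to files written after it is set. Blocks that already exist under the path would need the HDFS mover to be relocated.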

However, keep in mind the performance implications of storing data only on a subset of your cluster. Jobs that primarily use that data might create heavier network load and suffer from lower aggregate I/O bandwidth, leading to degraded performance.


