We have a 30-node production cluster. We want to add 5 data nodes for additional storage to handle an interim spike of data (around 2 TB). This data is to be stored temporarily, and we want to get rid of it after 15 days.
Is it possible to make sure that the interim data (2 TB) coming in will be stored only on the newly added data nodes?
I am looking for something similar to YARN node labelling.
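You could try declaring the disks of the additional nodes as the SSD storage tier and flagging the temporary data with the All_SSD storage policy. Note that One_SSD places only one replica on SSD and the remaining replicas on DISK, so with One_SSD part of the data would still land on the existing nodes. With All_SSD, the data should reside only on the declared "SSD disks", and thereby on the "burst nodes", as long as those disks have enough free space; if they run full, HDFS falls back to DISK storage.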
However, keep in mind the performance implications of storing data on only a subset of your cluster. Jobs that primarily use that data might create heavier network load and suffer from lower aggregate I/O bandwidth, leading to degraded performance.
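
The storage-tier approach described above could be set up roughly as follows. This is only a sketch under assumptions: the data directory /grid/dn/data0 and the HDFS path /data/interim are hypothetical placeholders, and the disks on the existing 30 nodes are assumed to keep the default DISK type. On each of the 5 new data nodes, tag the data directories as SSD in hdfs-site.xml:

```xml
<!-- hdfs-site.xml on the new data nodes only; the directory is a placeholder -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[SSD]file:///grid/dn/data0</value>
</property>
```

Then assign a storage policy to the directory that will receive the interim data:

```shell
# /data/interim is a hypothetical path; substitute your own directory.
hdfs dfs -mkdir -p /data/interim

# All_SSD keeps every replica on SSD-tagged storage
# (One_SSD would place only one replica there and the rest on DISK).
hdfs storagepolicies -setStoragePolicy -path /data/interim -policy All_SSD

# Verify which policy is in effect.
hdfs storagepolicies -getStoragePolicy -path /data/interim

# Only needed if data was written before the policy was set:
# the mover migrates existing blocks to match the policy.
hdfs mover -p /data/interim

# After the 15 days, remove the interim data and bypass the trash.
hdfs dfs -rm -r -skipTrash /data/interim
```

The policy only affects where new blocks are placed, so it is worth setting it before the spike arrives. Storage policies must also be enabled on the cluster (dfs.storage.policy.enabled, which defaults to true).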