Here are some tips to help answer your question.
First, use cloud storage services, like S3 on AWS or ADLS on Azure, for keeping important data such as Navigator lineage information. Those services provide high availability and durability automatically. Hadoop and other services can be configured in various ways to use cloud storage instead of local block (hard drive) storage.
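As a sketch of what that configuration can look like, here is a minimal core-site.xml fragment for Hadoop's S3A connector. The bucket and key values are placeholders, and in production you would prefer IAM instance roles or a credential provider over plaintext keys:

```xml
<!-- core-site.xml (sketch): credentials for Hadoop's S3A connector. -->
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

With something like this in place, jobs can address cloud data directly using `s3a://bucket/path` URIs.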
Relying exclusively on cloud storage isn't always efficient or fast enough, though. For example, a typical data analysis job has several stages where data is read, processed, and written back, and those round trips to a storage service can be slower, and cost more, than local drive access. So, consider adjusting how data is managed so that intermediate data resides in local block storage, while final results are sent to cloud storage for safekeeping.
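The pattern above can be sketched in a few lines of Python. This is a hypothetical illustration, not a real pipeline: `upload_to_cloud` is a stand-in for an actual SDK call (e.g. an S3 or ADLS upload), and here it just copies into a local directory so the sketch runs anywhere:

```python
import shutil
import tempfile
from pathlib import Path


def upload_to_cloud(local_file: Path, cloud_dir: Path) -> Path:
    """Stand-in for a real cloud-storage upload call."""
    cloud_dir.mkdir(parents=True, exist_ok=True)
    dest = cloud_dir / local_file.name
    shutil.copy(local_file, dest)
    return dest


def run_job(records, cloud_dir: Path) -> Path:
    # Stages 1 and 2: read and process, keeping intermediate output on
    # fast local block storage instead of round-tripping to the cloud.
    with tempfile.TemporaryDirectory() as scratch:
        intermediate = Path(scratch) / "stage1.txt"
        intermediate.write_text("\n".join(r.upper() for r in records))

        final = Path(scratch) / "result.txt"
        final.write_text(intermediate.read_text()[::-1])  # further processing

        # Stage 3: only the final result goes to durable cloud storage;
        # the scratch directory is discarded when the job finishes.
        return upload_to_cloud(final, cloud_dir)
```

The key design point is that the scratch directory lives and dies with the job, while the uploaded result survives it.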
Once all of the important data is safe in cloud storage, it becomes less important to keep cluster instances running. You can even destroy entire clusters, including master/manager nodes, knowing that the data is safe in cloud storage. At this point, you will want to use automation tools, like Altus Director or Cloudbreak, so that you can easily spin up new clusters that are configured to pull their initial data from cloud storage. Then, you only run clusters when you need them.
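To give a feel for what such automation looks like, here is an abridged, hypothetical cluster definition loosely modeled on Altus Director's sample HOCON configuration files; the key names are from memory, so check them against the sample configs shipped with Director before using anything like this:

```
# Hypothetical, abridged Director-style cluster definition (HOCON).
provider {
    type: aws
    region: us-east-1
}

ssh {
    username: ec2-user
    privateKey: /path/to/key.pem   # placeholder path
}

cluster {
    products {
        CDH: 5
    }
    services: [HDFS, YARN, SPARK_ON_YARN]

    workers {
        count: 10
        instance {
            type: m4.xlarge
        }
    }
}
```

The point is that the whole cluster is described declaratively, so destroying and recreating it becomes a routine operation rather than a rebuild by hand.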
If that isn't feasible, you can still do something like what you suggest, with clusters that have some permanent nodes and some transient ones. If so, ensure that the transient nodes do not hold important state that isn't safe elsewhere. For example, YARN node managers are stateless, so scaling nodes that house only those ("worker" nodes) is an easy goal to achieve. By contrast, HDFS datanodes store file data, so those aren't as easy to scale down; you can, though, as long as they are decommissioned properly through Cloudera Manager or Ambari, which the cloud automation tools handle for you.
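For reference, decommissioning a datanode by hand (which is what those tools do under the covers) is driven by an exclude file named in hdfs-site.xml; the file path here is a placeholder:

```xml
<!-- hdfs-site.xml: point HDFS at an exclude file listing hosts
     to decommission. The path shown is a placeholder. -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>
```

You add the datanode's hostname to that exclude file and run `hdfs dfsadmin -refreshNodes`; HDFS then re-replicates the node's blocks elsewhere, and only once decommissioning completes is it safe to terminate the instance.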