How to write HDFS data to a specific device

Contributor

My existing EBS volumes are transparently encrypted. I added an extra volume that is not encrypted. Now I want to be able to control where HDFS writes a file. I think it must be possible because heterogeneous storage policies tell HDFS where to write. How can I do this?

1 ACCEPTED SOLUTION

Guru

Hi @Peter Coates

HDFS does support heterogeneous storage types, but defining your own storage type is not supported. You need to use one of the predefined types (ARCHIVE, DISK, SSD and RAM_DISK). Storage policies are defined in terms of these types and determine where a file's blocks and their replicas are placed.

So only if you can differentiate between your encrypted and non-encrypted volumes using these storage types can you control where HDFS writes a file.
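As a rough sketch of what that could look like: the Archival Storage documentation referenced below describes tagging each entry of dfs.datanode.data.dir with a storage type and assigning a policy to a path with hdfs storagepolicies. The small Python helper below only builds those two strings. The mount points, the /unencrypted directory, and the choice of ARCHIVE with the COLD policy for the unencrypted volume are assumptions for illustration, not something taken from your cluster.

```python
# Storage types predefined by HDFS (custom types are not supported).
STORAGE_TYPES = {"ARCHIVE", "DISK", "SSD", "RAM_DISK"}


def tagged_data_dirs(volumes):
    """Build a dfs.datanode.data.dir value from (storage_type, uri) pairs."""
    parts = []
    for storage_type, uri in volumes:
        if storage_type not in STORAGE_TYPES:
            raise ValueError(f"unsupported storage type: {storage_type}")
        parts.append(f"[{storage_type}]{uri}")
    return ",".join(parts)


def set_policy_command(path, policy):
    """Return the argv that assigns a storage policy to an HDFS path."""
    return ["hdfs", "storagepolicies", "-setStoragePolicy",
            "-path", path, "-policy", policy]


# Hypothetical mount points: the encrypted EBS volume keeps the default
# DISK type, and the unencrypted volume is tagged ARCHIVE.
volumes = [
    ("DISK", "file:///data/encrypted0"),
    ("ARCHIVE", "file:///data/plain0"),
]
data_dir_value = tagged_data_dirs(volumes)

# COLD places all replicas on ARCHIVE storage; /unencrypted is a made-up path.
command = set_policy_command("/unencrypted", "COLD")

print(data_dir_value)
print(" ".join(command))
```

The first printed string would go into dfs.datanode.data.dir in hdfs-site.xml on each DataNode, and the second is the command to run once the DataNodes have picked up the new configuration. Blocks written before the policy was set are not moved automatically; the same documentation describes the hdfs mover tool for that.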

Hope this helps.

Reference: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html


2 REPLIES

Contributor

I feared as much. Thank you for your suggestion. I think it will work for us, as this is a cloud cluster and we can archive to S3, which obviates the need to use heterogeneous storage for its intended purpose. However, I would like to suggest a Jira ticket to add a storage class for this purpose. There are significant use cases where it would be useful to know that a subset of your data is confined to specific drives (a) without the restrictions of the existing policies and (b) without abusing an existing storage class.