
How to define HDFS storage tiers and storage policies in CDH 5.4.x

New Contributor

In order to use Hadoop 2.6 storage policies, you must specify the storage type (DISK, ARCHIVE, RAM_DISK, SSD) for each mount point in dfs.datanode.data.dir. If editing the hdfs-site.xml file directly, I would do:

 

<property>
     <name>dfs.datanode.data.dir</name>
     <value>[ARCHIVE]file:///mnt/archive/dfs/dn,[SSD]file:///mnt/flash/dfs/dn,[DISK]file:///mnt/disk/dfs/dn</value>
</property>
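
For context, my understanding is that once the tiers are defined, the policies themselves are assigned to HDFS paths from the command line, roughly like below (the paths are made-up examples, and on a stock Hadoop 2.6.0 build the same operations may only be exposed through hdfs dfsadmin rather than the hdfs storagepolicies tool):

# list the storage policies the cluster knows about
hdfs storagepolicies -listPolicies

# pin a directory to the ARCHIVE tier (COLD policy), then verify
hdfs storagepolicies -setStoragePolicy -path /data/cold -policy COLD
hdfs storagepolicies -getStoragePolicy -path /data/cold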

However, if I try to use this format in the CM GUI, I get the following errors:

  • DataNode Data Directory: Path [ARCHIVE]file:///mnt/archive/dfs/dn does not conform to the pattern "(/[-+=_.a-zA-Z0-9]+)+(/)*"
  • DataNode Data Directory: Path [DISK]file:///mnt/disk/dfs/dn does not conform to the pattern "(/[-+=_.a-zA-Z0-9]+)+(/)*"
  • DataNode Data Directory: Path [SSD]file:///mnt/flash/dfs/dn does not conform to the pattern "(/[-+=_.a-zA-Z0-9]+)+(/)*"

 

Does anyone know the correct format for specifying storage tiers in the GUI, or how to bypass the GUI and configure this manually?

 

Thank you 

 

Daniel

 

CM_Tiers_Error.png

 


4 REPLIES

Mentor
CM currently lacks support for defining storage types. If you'd like to use this feature in the meantime, place your XML override in the "DataNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml" instead, which accepts <property/> tags.
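
For reference, the snippet pasted into that safety valve would simply be the <property/> block from the original post (the mount paths are of course specific to that setup), and it should take precedence over whatever is set in the regular DataNode Data Directory field:

<property>
     <name>dfs.datanode.data.dir</name>
     <value>[ARCHIVE]file:///mnt/archive/dfs/dn,[SSD]file:///mnt/flash/dfs/dn,[DISK]file:///mnt/disk/dfs/dn</value>
</property>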

New Contributor

Thanks - that worked.

 

I'm assuming that I should leave the mounts in the dfs.datanode.data.dir section, so that CM knows to monitor the mounts.

Mentor
Yes, that'd be a good idea.

Glad to hear it worked! Feel free to also mark the discussion as solved so others looking at similar issues may find this thread faster.

Explorer

Some more questions based on this thread:

 

Once the storage configuration is defined and SSDs/disks are identified by HDFS:

  1. Are all drives (SSDs + disks) used as a single virtual storage pool?
    1. If yes, does that mean that while running jobs/queries some data blocks would be fetched from disks while others come from SSDs?
  2. Or are there two different virtual storage pools, hot and cold?
    1. If yes, while copying/generating data in HDFS, will there be 3 copies of data spread across disks and SSDs, or 3 copies on disks and 3 copies on SSDs, for a total of 6?
    2. How do I force data to be read from SSDs only or from disks only when submitting jobs/queries with the various tools (Hive, Impala, Spark, etc.)?