Cloudbreak + multiple data disks

New Contributor

Hi,

I'm using Cloudbreak 2.6.0 to deploy my HDP clusters in Azure. I think it's an amazing tool, but I'm struggling to configure storage properly.

When I configure a cluster I set 4 data disks (each 1 TB, as that's the maximum allowed) for my data nodes (I have two of them).
When Ambari starts it reports:

Disk Remaining: 1.6 TB / 1.7 TB (93.88%)


That's not what I wanted. I want more disk space!
The value of dfs.datanode.data.dir is as follows:

/hadoop/hdfs/data,/hadoopfs/fs1/hadoop/hdfs/data,/mnt/resource/hadoop/hdfs/data

And this is how Cloudbreak mounts the disks:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        30G   21G  9.4G  69% /
devtmpfs         14G     0   14G   0% /dev
tmpfs            14G   24K   14G   1% /dev/shm
tmpfs            14G  113M   14G   1% /run
tmpfs            14G     0   14G   0% /sys/fs/cgroup
tmpfs           1.0G  108K  1.0G   1% /tmp
/dev/sda1       497M  105M  392M  22% /boot
/dev/sdb1       197G   61M  187G   1% /mnt/resource
/dev/sdc       1007G  2.2G  954G   1% /hadoopfs/fs1
/dev/sdd       1007G   77M  956G   1% /hadoopfs/fs2
/dev/sde       1007G   77M  956G   1% /hadoopfs/fs3
/dev/sdf       1007G   77M  956G   1% /hadoopfs/fs4
tmpfs           2.8G     0  2.8G   0% /run/user/1006
tmpfs           2.8G     0  2.8G   0% /run/user/1029
tmpfs           2.8G     0  2.8G   0% /run/user/1025
tmpfs           2.8G     0  2.8G   0% /run/user/1001

Not sure if this info is useful, but:

ls /hadoopfs/fs1/

outputs this: hbase hdfs logs lost+found yarn


And this:

ls /hadoopfs/fs2/

outputs this: lost+found

What changes do I need to make to get more storage?


Re: Cloudbreak + multiple data disks

Rising Star

Hi @Jakub Igla,

You are checking the HDFS disk space, which doesn't equal the sum of all the attached disks. HDFS is a redundant file system and replicates data across the attached disks, so you can't use the entire raw capacity unless you lower the block replication below the default (which is 3, I think), and that is not recommended.

You can check the "Block replication" value under HDFS config -> Advanced in the Ambari UI.
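As a back-of-the-envelope illustration of the raw-vs-usable distinction this reply describes (a sketch only; the numbers are taken from the question, and the function is not part of any HDFS tooling):

```python
def usable_hdfs_capacity_tb(datanodes, disks_per_node, disk_tb, replication=3):
    """Rough usable HDFS capacity: raw capacity divided by the replication factor."""
    raw_tb = datanodes * disks_per_node * disk_tb
    return raw_tb / replication

# The setup from the question: 2 datanodes, 4 x 1 TB disks each, default replication 3.
print(round(usable_hdfs_capacity_tb(2, 4, 1.0), 2))  # -> 2.67
```

So even with all 8 disks mounted as data dirs, roughly 2.67 TB of the 8 TB raw would be usable at replication 3.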

Re: Cloudbreak + multiple data disks

New Contributor

Thanks @Tamas Bihari

However, this wasn't the subject of my question, as I'm aware of how HDFS works. I've finally found a value that works for my setup:

/hadoopfs/fs1/hadoop/hdfs/data,/hadoopfs/fs2/hadoop/hdfs/data,/hadoopfs/fs3/hadoop/hdfs/data,/hadoopfs/fs4/hadoop/hdfs/data

I'm also wondering what other settings I should configure to make my cluster resilient, like "dfs.datanode.failed.volumes.tolerated".

Same for the namenode.
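For reference, resilience-related settings like the one mentioned above live in hdfs-site. A hedged sketch of what such a fragment could look like (the values and namenode paths are illustrative assumptions, not recommendations for this specific cluster):

```json
{
  "hdfs-site": {
    "properties": {
      "dfs.datanode.failed.volumes.tolerated": "1",
      "dfs.namenode.name.dir": "/hadoopfs/fs1/hadoop/hdfs/namenode,/hadoopfs/fs2/hadoop/hdfs/namenode"
    }
  }
}
```

"dfs.datanode.failed.volumes.tolerated" lets a datanode keep running after that many data dirs fail, and a comma-separated "dfs.namenode.name.dir" makes the namenode write redundant copies of its metadata to each listed directory.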


Re: Cloudbreak + multiple data disks

@Jakub Igla

As your comment indicates, you need to update the configuration of the data dirs to include the missing paths alongside the current value: /hadoop/hdfs/data,/hadoopfs/fs1/hadoop/hdfs/data,/mnt/resource/hadoop/hdfs/data.

To automate this, you could update the blueprint to include these extra paths for the HDFS configuration. You would want to ensure the blueprint is used on nodes where you know these paths will be available.
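A minimal sketch of what that blueprint addition could look like, using the paths from the question (the surrounding blueprint fields are omitted, and the exact placement may differ in your blueprint):

```json
"configurations": [
  {
    "hdfs-site": {
      "properties": {
        "dfs.datanode.data.dir": "/hadoopfs/fs1/hadoop/hdfs/data,/hadoopfs/fs2/hadoop/hdfs/data,/hadoopfs/fs3/hadoop/hdfs/data,/hadoopfs/fs4/hadoop/hdfs/data"
      }
    }
  }
]
```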

Re: Cloudbreak + multiple data disks

Expert Contributor

@Jakub Igla thanks for reporting this. I was able to reproduce it, and this is clearly a bug in Cloudbreak 2.6.0: it does not generate the hdfs-site properties into the blueprint. If you have 4 disks, Cloudbreak is expected to automatically generate the following fragment into the blueprint, but that does not happen in 2.6.0:

{
  "hdfs-site": {
    "properties": {
      "dfs.datanode.data.dir": "/hadoopfs/fs1/hdfs/datanode,/hadoopfs/fs2/hdfs/datanode,/hadoopfs/fs3/hdfs/datanode,/hadoopfs/fs4/hdfs/datanode"
    }
  }
}

Cloudbreak 2.4.2 (GA) does not suffer from this problem, so this is a newly introduced issue. Will be fixed with top prio.

Re: Cloudbreak + multiple data disks

New Contributor

Hi @Attila Kanto

Thanks for your response. The problem is that if I provide this value (either via the blueprint or a manual change in Ambari), it doesn't change anything in terms of storage capacity.
I got it working once, but then couldn't reproduce it again... Are there any other settings I need to provide?
I also created the /hdfs/datanode folders under /hadoopfs/fs{n}, but it didn't help.

Also, could you tell me what I will miss in terms of functionality if I migrate to 2.4.2?

Re: Cloudbreak + multiple data disks

Rising Star

Hi @Jakub Igla,

After you update the "dfs.datanode.data.dir" property in Ambari and save the config, you should restart the entire HDFS service for your modification to take effect on your cluster. It should work automatically if you extended your blueprint with the necessary configs.
I created a blocker ticket for this issue, so it should be fixed soon.

Here you can find the release notes of 2.6.0, which mostly cover the new features compared to the 2.4.2 version:
https://docs.hortonworks.com/HDPDocuments/Cloudbreak/Cloudbreak-2.6.0/content/releasenotes/index.htm
Downgrading is not a supported feature, so I can only recommend creating a new Cloudbreak deployment to play with version 2.4.2.

Re: Cloudbreak + multiple data disks

New Contributor

Thanks @Tamas Bihari
I obviously restarted the HDFS service by clicking Service Actions -> Restart All.

The disk capacity remains the same, both in Ambari metrics and when running the hdfs dfs -df command.
Is it possible it might be cached or something?

Re: Cloudbreak + multiple data disks

Rising Star

It worked for us when we updated the "dfs.datanode.data.dir" in Ambari and restarted the necessary services.

[screenshot: 76598-datanode-update-hg-datadir.png]

Updating the "DataNode directories" field alone doesn't solve the issue; we had to click "Switch to 'hdp26-data....k2:worker'" and update the settings for the host group that contains the datanodes in the text area that is not editable by default. Then a save and restart updated the available DFS space in Ambari.
Could you please check the mentioned steps again?

We didn't run any additional dfs related command.

Re: Cloudbreak + multiple data disks

Rising Star

@Jakub Igla

The post-cluster-install recipe could work, but it looks like a dirty workaround, because the Ambari credentials are needed in the script to be able to communicate with the Ambari server.

Cloudbreak adds the "dfs.datanode.data.dir" property to a "configurations" section for every host group in the "host_groups" array. You can add the attached disks' data dirs to the host group that contains the datanodes, like this:

  "host_groups": [
    {
      "name": "master",
      "configurations": [
        {
          "hdfs-site": {
            "dfs.datanode.data.dir": "/hadoopfs/fs1/hdfs/datanode"
          }
        }
      ],
      "components": [
        {
          "name": "APP_TIMELINE_SERVER"
        },
        {
          "name": "HCAT"
        },
        {
          "name": "HDFS_CLIENT"
        },
        {
          "name": "HISTORYSERVER"
        }
      ]
    },
    {
      "name": "compute",
.......