Purpose of master, worker, and compute nodes in Cloudbreak Hortonworks Data Platform on Azure

Expert Contributor

Hi,

I have the following questions regarding Cloudbreak HDP on Azure:

  1. Does Azure Blob storage work as the default HDFS for the cluster?
  2. What is the difference between a worker node and a compute node?

Suppose I am pulling data from an SFTP source and I want to store it in HDFS. Where will the data be stored?

When I want to process this data, on which node does it get processed: the worker node or the compute node?

Could someone please help me clear up these doubts?

Thank you

8 REPLIES


@heta desai

You can connect ADLS or WASB to your cluster to copy or access data stored there, but this storage should not be used as the default file system. I believe that some people use WASB for this purpose, but it is not officially supported by Hortonworks.
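
For illustration, here is roughly what accessing attached WASB storage could look like from the cluster's command line (a hedged sketch: the storage account, container, and paths are placeholders, and the account key is assumed to be already configured, e.g. by Cloudbreak or in core-site.xml):

    # List data in an attached WASB container (assumes the account key
    # fs.azure.account.key.<account>.blob.core.windows.net is already set)
    hdfs dfs -ls wasb://mycontainer@mystorageaccount.blob.core.windows.net/data/

    # Copy data from the cluster's default HDFS into WASB
    hadoop distcp /user/heta/input wasb://mycontainer@mystorageaccount.blob.core.windows.net/input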

The difference between worker and compute is that no data is stored on compute nodes. If you look at one of the default workload cluster blueprints, the difference between these two host groups is the "DATANODE" component, which is included in worker nodes but not in compute nodes.

    {
      "name": "worker",
      "configurations": [],
      "components": [
        {
          "name": "HIVE_CLIENT"
        },
        {
          "name": "TEZ_CLIENT"
        },
        {
          "name": "SPARK_CLIENT"
        },
        {
          "name": "DATANODE"
        },
        {
          "name": "METRICS_MONITOR"
        },
        {
          "name": "NODEMANAGER"
        }
      ],
      "cardinality": "1+"
    },
    {
      "name": "compute",
      "configurations": [],
      "components": [
        {
          "name": "HIVE_CLIENT"
        },
        {
          "name": "TEZ_CLIENT"
        },
        {
          "name": "SPARK_CLIENT"
        },
        {
          "name": "METRICS_MONITOR"
        },
        {
          "name": "NODEMANAGER"
        }
      ],
      "cardinality": "1+"
    }

Hope this helps!

Expert Contributor

@Dominika Bialek

So I can use WASB as the default storage, right? And compute nodes can read data from WASB for processing?

Expert Contributor

@Dominika Bialek How will YARN know when it is the right time to move the workload to the compute nodes? If you keep a NodeManager on the DataNode, it is still a worker. Also, sending the data to a compute node will impact performance. Is that correct?

Is it possible to use a blueprint to configure WASB or ADLS as secondary storage for the cluster?

Thanks,

Andrzej


I do not know the answer to the first question; perhaps someone else can answer. Regarding WASB or ADLS, you can use Cloudbreak to configure access (https://docs.hortonworks.com/HDPDocuments/Cloudbreak/Cloudbreak-2.9.0/create-cluster-azure/content/c...), but I am not sure about defining it in a blueprint.

Master Mentor

@heta desai

The simple answer is YES

The hadoop-azure file system layer simulates HDFS folders on top of Azure storage. Windows Azure Storage Blob (WASB) is an extension built on top of the HDFS APIs. It in many ways "is" HDFS. However, WASB creates a layer of abstraction that enables the separation of storage from compute. This separation is what enables your data to persist even when no clusters currently exist, and enables multiple clusters plus other applications to access a single piece of data at the same time. This increases functionality and flexibility while reducing costs and reducing the time from question to insight.

HDInsight, which is a Hortonworks-based offering on Azure, runs against WASB.

Azure Blob storage doesn't have the notion of a directory. However, parsing the file name gives the tree structure, because Hadoop recognizes that a slash "/" is an indication of a directory.

Blob address syntax:

    # Fully qualified name (local HDFS)
    hdfs://<namenodehost>/<path>

    # WASB syntax (global)
    wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>

    # Example
    wasb://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net/SomeDirectory/ASubDirectory/AFile.txt
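
As a hedged illustration (the container and account names above are placeholders), the same paths work with the ordinary Hadoop file system commands:

    # Read a blob exactly as if it were an HDFS file
    hdfs dfs -cat wasb://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net/SomeDirectory/ASubDirectory/AFile.txt

    # The "directory" is just the slash-separated prefix of the blob name
    hdfs dfs -ls wasb://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net/SomeDirectory/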

Hope that sheds some light!

Expert Contributor

So there is no replication mechanism if I use WASB as the default storage?

Master Mentor

@heta desai

How do I manage and configure block/chunk size and the replication factor with WASB?

You don't, and it's not generally necessary. The data is stored in the Azure storage accounts, remaining accessible to many applications at once. Each blob (file) is replicated 3x within the data center. If you choose to use geo-replication on your account, you also get 3 copies of the data in a data center in the paired region.

The data is chunked and distributed to nodes when a job is run. If you need to change the chunk size for memory-related performance at run time, that is still an option. You can pass in any Hadoop configuration parameter when you create the cluster, or you can use the SET command for a given job.
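
For example (a hedged sketch; the property names, values, and table name are illustrative, so check the hadoop-azure documentation for your version), a per-job override could look like:

    # Override a setting for a single job via Hadoop's generic -D option
    hadoop distcp -D fs.azure.block.size=134217728 \
        wasb://mycontainer@mystorageaccount.blob.core.windows.net/input /user/heta/input

    # Or, inside Hive, set it for a given session/job
    hive -e "SET mapreduce.input.fileinputformat.split.maxsize=134217728; SELECT COUNT(*) FROM mytable;"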

Reference: Understanding WASB and Hadoop Storage in Azure

Expert Contributor

When I use WASB as storage, while creating the cluster I need to have only the master node and compute nodes, right? There is no need for worker nodes since I am using WASB, not HDFS?