Created on 02-08-2019 05:56 AM - edited 09-16-2022 07:08 AM
Hi,
I have the following confusions regarding Cloudbreak HDP on Azure:
Suppose I am pulling data from an SFTP source and I want to store it in HDFS. Where will the data be stored?
When I want to process this data, on which node does it get processed, the worker node or the compute node?
Please can someone help me clear these doubts.
Thank you
Created 02-08-2019 06:34 PM
You can connect ADLS or WASB to your cluster to copy or access data stored there, but this storage should not be used as the default file system. I believe that some people use WASB for this purpose, but it is not officially supported by Hortonworks.
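For example, here is a minimal sketch of copying attached WASB data into HDFS with DistCp (the container, account, and paths below are placeholders, not values from this thread):
# copy a folder from the attached WASB container into the cluster's HDFS
hadoop distcp wasb://mycontainer@myaccount.blob.core.windows.net/incoming \
    hdfs://<namenodehost>/user/hdfs/incoming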
The difference between worker and compute is that no data is stored on compute nodes. If you look at one of the default workload cluster blueprints, the difference between these two host groups is the "DATANODE" component, which is included in the worker group but not in compute.
{ "name": "worker", "configurations": [], "components": [ { "name": "HIVE_CLIENT" }, { "name": "TEZ_CLIENT" }, { "name": "SPARK_CLIENT" }, { "name": "DATANODE" }, { "name": "METRICS_MONITOR" }, { "name": "NODEMANAGER" } ], "cardinality": "1+" }, { "name": "compute", "configurations": [], "components": [ { "name": "HIVE_CLIENT" }, { "name": "TEZ_CLIENT" }, { "name": "SPARK_CLIENT" }, { "name": "METRICS_MONITOR" }, { "name": "NODEMANAGER" } ], "cardinality": "1+" }
Hope this helps!
Created 02-11-2019 07:19 AM
So I can use WASB as the default storage, right? And can the compute nodes read data from WASB for processing?
Created 02-08-2019 06:51 PM
@Dominika Bialek How will YARN know when it's the right time to move the workload to the compute nodes? If you keep a NodeManager on the DATANODE host, it's still a WORKER. Also, sending the data to a COMPUTE node will impact performance. Is that correct?
Is it possible to use a blueprint to configure WASB or ADLS as secondary storage for the cluster?
Thanks,
Andrzej
Created 02-08-2019 07:10 PM
I do not know the answer to the first question; perhaps someone else can answer. Regarding WASB or ADLS, you can use Cloudbreak to configure access: https://docs.hortonworks.com/HDPDocuments/Cloudbreak/Cloudbreak-2.9.0/create-cluster-azure/content/c... I am not sure about defining it in a blueprint.
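To illustrate what the configured access looks like from the Hadoop side, here is a rough sketch (the account, container, and key are placeholders; the property name is the standard hadoop-azure account-key setting, so double-check it against your HDP version):
# list a WASB container, supplying the storage account key as a per-command override
hadoop fs -D fs.azure.account.key.myaccount.blob.core.windows.net=<ACCESS_KEY> \
    -ls wasb://mycontainer@myaccount.blob.core.windows.net/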
Created 02-11-2019 07:45 AM
The simple answer is YES
The hadoop-azure file system layer simulates HDFS folders on top of Azure storage. Windows Azure Storage Blob (WASB) is an extension built on top of the HDFS APIs. In many ways it "is" HDFS. However, WASB creates a layer of abstraction that enables the separation of storage from compute. This separation is what allows your data to persist even when no clusters exist, and it lets multiple clusters plus other applications access a single piece of data at the same time. This increases functionality and flexibility while reducing costs and the time from question to insight.
HDInsight, which is the Hortonworks-based offering in Azure, runs against WASB.
Azure Blob storage doesn't have the notion of a directory. However, parsing the file name gives the tree structure, because Hadoop recognizes a slash "/" as an indication of a directory.
Blob address:
# Fully qualified name (local HDFS)
hdfs://<namenodehost>/<path>
# WASB syntax (global), as used by HDInsight
wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>
# Example
wasb://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net/SomeDirectory/ASubDirectory/AFile.txt
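As a quick usage sketch (the container and account names are the same placeholders as above), once this is wired up WASB behaves like any other Hadoop file system:
# list the container root
hadoop fs -ls wasb://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net/
# upload a local file into the simulated directory tree
hadoop fs -put AFile.txt wasb://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net/SomeDirectory/ASubDirectory/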
Hope that helps clarify things!
Created 02-11-2019 09:37 AM
So there is no replication mechanism to manage if I use WASB as the default storage?
Created 02-11-2019 10:07 AM
How do I manage and configure block/chunk size and the replication factor with WASB?
You don't. It's not generally necessary. The data is stored in Azure storage accounts, remaining accessible to many applications at once. Each blob (file) is replicated 3x within the data center. If you choose to use geo-replication on your account, you also get 3 copies of the data in a secondary (paired) data center.
The data is chunked and distributed to nodes when a job is run. If you need to change the chunk size for memory-related performance at run time, that is still an option: you can pass in any Hadoop configuration parameter when you create the cluster, or you can use the SET command for a given job.
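As a rough illustration (the property names assume the stock hadoop-azure and MapReduce settings, so verify them against your HDP version; the paths are placeholders), an override can be passed per job rather than set cluster-wide:
# override the block size WASB reports to Hadoop for one DistCp run (256 MB here)
hadoop distcp -D fs.azure.block.size=268435456 \
    wasb://mycontainer@myaccount.blob.core.windows.net/data /user/hdfs/data
# or, inside Hive, scope a setting to the current session with SET
# hive> SET mapreduce.input.fileinputformat.split.maxsize=268435456;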
Created 02-12-2019 07:39 AM
When I use WASB as the storage, do I only need a master node and compute nodes when creating the cluster? Is there no need for worker nodes, since I am using WASB and not HDFS?