Created 01-27-2017 12:05 PM
Hello,
I have a bunch of questions about Hadoop cluster hardware configuration, mostly about storage configuration.
- According to public documents, the storage requirement depends on the workload: if the workload needs performance, fast (SAS) disks are the way to go; if it needs capacity, SATA disks can be used. Many documents also say that smaller disk capacities are better, but most of those documents are two or more years old. Today a single disk can hold 8 TB or more. What do you think about using such disks? It seems that if a disk this large fails, healing would take longer, so does that affect cluster performance?
- What are the storage considerations for Apache Spark? It is documented that Spark can use disks when tasks don't fit in memory and for intermediate output between stages. How frequent are these operations in a Spark job? Do disk speed and capacity matter?
- Another issue about Spark: according to my reading, a JVM with more than 200 GB of memory may not behave well, so serialization is recommended. Does that mean Spark is also CPU intensive? Roughly speaking, leaving the JVM issue aside, can we say Spark is CPU intensive?
- Is there a calculation for the NameNode storage requirement? For example, how much metadata space is required for 100 TB of Hadoop data?
- According to the Hadoop documents, storage tiering is possible. Have you ever tried it? Does it allow using heterogeneous disk types, in different racks or in the same rack, for different data types?
- My last question is about edge nodes and master nodes. As far as I know, an edge node is a gateway between the Hadoop cluster and the outside network, so if I use an edge node, the slave and master nodes wouldn't need to connect to the outside network (except for administration), and data transfer can go through the edge node. Is that true? Also, are there any considerations about the number of master nodes? How can I decide whether I need more than two?
Thanks for your help.
Created 01-27-2017 08:44 PM
- According to public documents, the storage requirement depends on the workload: if the workload needs performance, fast (SAS) disks are the way to go; if it needs capacity, SATA disks can be used. Many documents also say that smaller disk capacities are better, but most of those documents are two or more years old. Today a single disk can hold 8 TB or more. What do you think about using such disks? It seems that if a disk this large fails, healing would take longer, so does that affect cluster performance?
Answer:
Yes, you are correct! With larger disks the healing time after a disk failure is longer, and there is extra overhead for the NameNode to re-replicate a large number of blocks. It is better to have, for example, 4 x 2 TB disks rather than a single 8 TB disk, in order to spread disk I/O, improve write performance, and minimize downtime. I would still stick with a larger number of smaller disks rather than a minimum number of large-capacity disks.
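To make the trade-off concrete, here is a rough back-of-the-envelope sketch (not from any official guide; the 200 MB/s re-replication bandwidth and the disk sizes are my own assumptions) showing why losing an 8 TB disk keeps the cluster busy much longer than losing a 2 TB disk:

```python
def rereplication_hours(disk_tb, rebuild_mb_per_s):
    """Estimate how long HDFS needs to re-create the replicas that
    lived on a failed disk.

    disk_tb         -- data that was on the failed disk, in TB
    rebuild_mb_per_s -- aggregate bandwidth the cluster can spare for
                        re-replication (assumption)
    """
    data_mb = disk_tb * 1024 * 1024          # TB -> MB
    return data_mb / rebuild_mb_per_s / 3600

# Assume the cluster can spare ~200 MB/s in total for re-replication.
for disk in (2, 8):
    print(f"{disk} TB disk lost -> ~{rereplication_hours(disk, 200):.1f} h of healing")
# A single 8 TB disk failure puts roughly 4x more data back on the wire
# than a 2 TB disk failure, which is why many small disks heal faster.
```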
.
- What are the storage considerations for Apache Spark? It is documented that Spark can use disks when tasks don't fit in memory and for intermediate output between stages. How frequent are these operations in a Spark job? Do disk speed and capacity matter?
Answer:
For speed and capacity, you can refer to my answer above.
You can also refer to the 'Local Disks' section of the documentation below:
http://spark.apache.org/docs/latest/hardware-provisioning.html
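As a small illustration of the 'Local Disks' advice, here is a minimal PySpark sketch (the paths are placeholders I made up) that spreads shuffle and spill files across several local disks via spark.local.dir:

```python
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("local-disk-example")
    # One directory per physical disk, comma separated, so spill and
    # shuffle I/O is striped across spindles instead of hitting one disk.
    # Note: on YARN this setting is usually superseded by the node
    # manager's local-dirs configuration, so treat this as a
    # standalone-mode sketch.
    .set("spark.local.dir", "/data1/spark-tmp,/data2/spark-tmp,/data3/spark-tmp")
)
sc = SparkContext(conf=conf)
```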
.
- Another issue about Spark: according to my reading, a JVM with more than 200 GB of memory may not behave well, so serialization is recommended. Does that mean Spark is also CPU intensive? Roughly speaking, leaving the JVM issue aside, can we say Spark is CPU intensive?
Answer:
Yes, a Spark job can be bottlenecked by any resource: CPU, network bandwidth, or memory itself.
Please refer to the documentation below for more details:
https://spark.apache.org/docs/latest/tuning.html
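To connect this to the serialization point in your question: serialization cost shows up as CPU time, and the tuning guide above suggests Kryo to reduce it. A minimal sketch (the class name registered below is a placeholder, not a real class):

```python
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("kryo-example")
    # Kryo is faster and more compact than the default Java serialization,
    # trading a little setup for less CPU and less memory per object.
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Registering classes avoids shipping full class names with every record.
    .set("spark.kryo.classesToRegister", "com.example.MyRecord")
)
sc = SparkContext(conf=conf)
```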
.
- Is there a calculation for the NameNode storage requirement? For example, how much metadata space is required for 100 TB of Hadoop data?
Answer:
Please see the documentation below on calculating the NameNode Java heap size.
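While that document covers heap sizing, a rough worked example may help. This sketch uses the rules of thumb that circulate in the community, so treat every constant as an assumption: roughly 150 bytes of NameNode memory per namespace object, about 1 GB of provisioned heap per million files/blocks, 128 MB blocks, and an average file size of 512 MB.

```python
BLOCK_SIZE_MB = 128        # default HDFS block size
AVG_FILE_SIZE_MB = 512     # assumption: average file is ~512 MB
BYTES_PER_OBJECT = 150     # approximate NameNode memory per namespace object

data_tb = 100
data_mb = data_tb * 1024 * 1024

files = data_mb / AVG_FILE_SIZE_MB
blocks = data_mb / BLOCK_SIZE_MB          # replicas add no extra objects
objects = files + blocks                  # directories ignored for brevity

raw_metadata_gb = objects * BYTES_PER_OBJECT / 1024**3
provisioned_heap_gb = objects / 1_000_000  # ~1 GB per million objects

print(f"~{files:,.0f} files, ~{blocks:,.0f} blocks")
print(f"raw metadata footprint ~{raw_metadata_gb:.2f} GB")
print(f"heap to provision      ~{provisioned_heap_gb:.1f} GB")
```

The takeaway is that NameNode memory (and the on-disk fsimage) scales with the number of files and blocks, not with raw terabytes, so 100 TB made of large files needs only on the order of 1 GB.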
.
- According to the Hadoop documents, storage tiering is possible. Have you ever tried it? Does it allow using heterogeneous disk types, in different racks or in the same rack, for different data types?
Answer:
HDFS has supported tiered storage since Hadoop 2.3. Please have a look at the blog below on how eBay manages tiered storage on its Hadoop cluster:
http://www.ebaytechblog.com/2015/01/12/hdfs-storage-efficiency-using-tiered-storage/
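If you want to try it, the flow below is roughly how it is driven once the DataNode data directories are tagged with storage types such as [DISK] and [ARCHIVE] in dfs.datanode.data.dir. The /data/cold path is a made-up example; this is just a thin Python wrapper around the standard hdfs CLI commands:

```python
import subprocess

def run(cmd):
    """Echo and execute an hdfs CLI command."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Pin rarely-read data to ARCHIVE-tagged (dense, slower) disks.
run(["hdfs", "storagepolicies", "-setStoragePolicy",
     "-path", "/data/cold", "-policy", "COLD"])

# Confirm which policy is now in effect.
run(["hdfs", "storagepolicies", "-getStoragePolicy", "-path", "/data/cold"])

# Migrate blocks that were written before the policy change.
run(["hdfs", "mover", "-p", "/data/cold"])
```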
.
- My last question is about edge nodes and master nodes. As far as I know, an edge node is a gateway between the Hadoop cluster and the outside network, so if I use an edge node, the slave and master nodes wouldn't need to connect to the outside network (except for administration), and data transfer can go through the edge node. Is that true? Also, are there any considerations about the number of master nodes? How can I decide whether I need more than two?
Answer:
Yes, your understanding is correct. You can access HDFS data from the edge node, and client applications can be run from the edge node.
Regarding master nodes:
It is always better, and recommended, to configure HA for critical master components such as the NameNode and ResourceManager in production clusters. I believe Hadoop 3.0 lets you configure more than two NameNodes.
Please refer to the JIRA below for more details:
https://issues.apache.org/jira/browse/HDFS-6440
.
Please accept this answer if it is helpful. Happy Hadooping!! 🙂
Created 01-30-2017 05:58 AM
Hello,
Thanks for the answers. Can you clarify these points?
- I couldn't get the metadata calculation for the NameNode; the document is about calculating the Java heap size. What I roughly need is the storage requirement of the NameNode.
- Can we say "more serialization means more CPU" in Spark?