Support Questions

alexmc · ‎12-11-2015

I've been looking around but am not sure what the answer to this is.

A client who is speccing up a Hadoop cluster is keen to use SSDs. They have experience of using them for a variety of other applications like databases, OS, etc. I am happy for them to use SSDs for the OS, temp files, spill, log files, and so on, but I am still pushing them towards spindle disks for HDFS data as the best performance-to-cost ratio.

Is that everyone's opinion ("spindle HDDs for storage"), or should I just let them spend the extra money on SSDs for storage?

PS I have seen the question

https://community.hortonworks.com/questions/1405/c...

Can you please advise about how best to use this SSD storage to boost performance in HDP on Azure?

The second part of this question is: if we use both SSD and HDD for storage, is there a good way of mixing them in a cluster? I am thinking that *possibly* we might set up the SSD disks on data nodes which pretend that they are on a different rack from the HDD ones. That way they should always have at least one copy of each block. HOWEVER, that won't work if I am trying to have HDDs on the same physical boxes. I can't have two different data nodes running on the same box with the same IP address. I would also need separate compute/YARN services on each box, one for each data node.

Thanks

orenault · ‎12-11-2015

The uses for SSD that you've described (spill, temp files, ...) do make sense. We've also seen a good performance benefit from using them for ZooKeeper disks. IMHO, it would be better to take the SSD budget and invest it in additional servers. Having said that, I know a few customers who have decided to use SSDs for their HBase clusters in order to get the best performance.

The second part of your question is much easier to answer. HDFS supports tiered storage, which lets you define different classes of storage. You can find some further information on heterogeneous storage at: http://hortonworks.com/blog/heterogeneous-storages...

http://www.ebaytechblog.com/2015/01/12/hdfs-storag...

http://www.slideshare.net/Hadoop_Summit/reduce-sto...
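
To make the mixed SSD/HDD case concrete, here is a minimal sketch, assuming a Hadoop release that supports heterogeneous storage policies. A single DataNode per box tags each data directory with its storage type in dfs.datanode.data.dir, and a storage policy such as ONE_SSD asks HDFS to keep one replica on SSD and the rest on DISK, so no fake racks or second DataNode are needed. The mount points and the HDFS path are hypothetical, and the exact command for setting a policy differs between Hadoop releases, so check the documentation for yours.

```python
# Sketch only: the mount points and HDFS path below are hypothetical.

def data_dir_value(dirs_by_type):
    """Build a dfs.datanode.data.dir value with a [TYPE] tag on each directory."""
    entries = []
    for storage_type, paths in dirs_by_type.items():
        for path in paths:
            entries.append(f"[{storage_type}]{path}")
    return ",".join(entries)


def set_policy_command(path, policy):
    """Return the CLI invocation that assigns a storage policy to an HDFS path.

    Assumes the `hdfs storagepolicies` subcommand is available in your release.
    """
    return f"hdfs storagepolicies -setStoragePolicy -path {path} -policy {policy}"


# One box with one SSD and two HDDs, all served by the same DataNode.
layout = {
    "SSD": ["/grid/ssd0/hdfs/data"],
    "DISK": ["/grid/hdd0/hdfs/data", "/grid/hdd1/hdfs/data"],
}

print(data_dir_value(layout))
print(set_policy_command("/apps/hbase", "ONE_SSD"))
```

The first printed line is the value you would put in hdfs-site.xml; the second is the command to run once for each directory that should get an SSD replica.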

snichols · ‎12-11-2015

Typical usage of HDFS is large blocks that are accessed sequentially. In this scenario seek time has negligible cost, and throughput is the only significant factor that determines speed. Hard drives typically have high sequential transfer rates, so this is an ideal situation. Other files that depend on sequential access are swap files and some temp files, if they are large and being produced all at once. Log files also work well here.

SSDs make an excellent choice for a relational database because their access pattern is one of random reads and writes of small blocks. When reading and writing small blocks in random order, seek time is the major cost while throughput is relatively insignificant. Since SSDs have effectively zero seek time, they are perfect for relational databases despite (traditionally) having lower throughput. They are also good for large collections of small files such as the operating system binaries, config files, and collections of temporary files.
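
The seek-versus-throughput argument above can be sketched with a very simple cost model: one seek per request plus transfer time. The drive figures (10 ms average seek, 150 MB/s sequential transfer) and the request sizes are illustrative assumptions, not measurements of any particular disk.

```python
def read_time_s(total_bytes, request_bytes, seek_s, throughput_bps):
    """Estimate read time as one seek per request plus raw transfer time."""
    requests = -(-total_bytes // request_bytes)  # ceiling division
    return requests * seek_s + total_bytes / throughput_bps


MB = 1024 ** 2
GB = 1024 ** 3

# Hypothetical spinning disk: 10 ms average seek, 150 MB/s sequential transfer.
SEEK_S = 0.010
THROUGHPUT = 150 * MB

for label, request in [("128 MB blocks", 128 * MB), ("8 KB pages", 8 * 1024)]:
    total = read_time_s(1 * GB, request, SEEK_S, THROUGHPUT)
    transfer = 1 * GB / THROUGHPUT
    print(f"{label}: {total:.1f} s total, {total - transfer:.1f} s of that seeking")
```

Under these assumptions, reading 1 GB in HDFS-sized blocks spends under a tenth of a second seeking, while reading the same gigabyte as random 8 KB pages spends over twenty minutes seeking. Removing seek time barely changes the first case and transforms the second.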

Different cloud storage providers aggregate drive throughput differently and therefore provide different performance guarantees, so you will need to read the fine print to determine what metrics are actually specified. SSD throughput scales up faster than spindle throughput because individual SSDs are relatively small compared to hard drives. For the same amount of space there will be more SSDs striped together, which means the throughput can be higher than a hard drive of the same size.
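
As a rough illustration of the striping point, the sketch below counts how many drives are needed to reach a given capacity and multiplies by per-drive throughput. It assumes ideal linear scaling across the stripe, which real providers may not deliver, and all drive sizes and speeds are made-up example figures.

```python
import math


def aggregate_throughput_mbps(capacity_gb, drive_size_gb, drive_mbps):
    """Throughput of enough equal drives, striped together, to reach capacity_gb.

    Assumes throughput adds up linearly across drives (an idealisation).
    """
    drives = math.ceil(capacity_gb / drive_size_gb)
    return drives * drive_mbps


# Hypothetical figures: one 4 TB HDD versus eight 500 GB SSDs for the same 4 TB.
hdd = aggregate_throughput_mbps(4000, 4000, 150)
ssd = aggregate_throughput_mbps(4000, 500, 100)

print(f"HDD volume: {hdd} MB/s, SSD volume: {ssd} MB/s")
```

Even with each SSD assumed slower than the HDD, the striped SSD volume comes out ahead here simply because it contains more devices.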

tl;dr SSD is only faster for typical HDFS usage if the storage provider offers higher throughput than HDD.

Cloudera Community

Do people see benefit from SSDs and HDFS? Can we mix and match?