
Can you please advise about how best to use this SSD storage to boost performance in HDP on Azure?

New Contributor

We're running HDP 2.3 on Windows Azure virtual machines. These machines come with a 400GB temporary SSD disk (which gets wiped after restart). What is the best way to use this SSD storage to boost performance? For example, which config params should we change to point to locations on the SSD disk to boost HDFS / Tez / YARN / Hive performance?

1 ACCEPTED SOLUTION

Expert Contributor

SSDs are best suited to storing intermediate shuffle data and on-disk logs. For Tez/YARN/Hive, the main parameters to change are in yarn-site.xml (Tez uses sub-dirs of the YARN local dirs):

yarn.nodemanager.local-dirs=file://d:/yarn/data

yarn.nodemanager.log-dirs=file://d:/yarn/logs
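
As a rough sketch, the two settings above would look something like this as yarn-site.xml entries. The paths are the ones given in this answer and assume the temporary SSD is mounted as D:. The exact URI form your HDP build accepts on Windows may differ, so treat the values as illustrative.

```xml
<!-- yarn-site.xml (sketch): NodeManager scratch and log dirs on the temporary SSD.
     Assumes the SSD is the D: drive; paths are copied from the answer above. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>file://d:/yarn/data</value>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>file://d:/yarn/logs</value>
</property>
```

Both properties take a comma-separated list, so existing directories on other disks can stay alongside the SSD path. The NodeManager has to be restarted for the change to take effect.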

Also, confirm that TRIM is enabled for the SSD with the fsutil behavior command (for example, fsutil behavior query DisableDeleteNotify; a value of 0 means TRIM is enabled). You can also use the SSD to accelerate temporary tables in Hive by exposing it through HDFS. To do that, dfs.datanode.data.dir needs an entry tagged with the SSD storage type, such as "[SSD]file://d:/hdfs/data" (to store the SSD data on d:\hdfs\data), and hive-site.xml needs hive.exec.temporary.table.storage=SSD.

Then you can use "create temporary table xyz stored as orc as select.... from table where ...;" to create temporary tables cached on SSDs.
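
A minimal sketch of the HDFS and Hive settings described above, written as XML properties. The [SSD] entry and the Hive value come from this answer. The [DISK] entry on c: is a hypothetical placeholder for whatever data directory the DataNode already uses and should be replaced with your real path.

```xml
<!-- hdfs-site.xml (sketch): keep the existing data dir and add the SSD,
     tagged with its storage type. The c: path is hypothetical. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[DISK]file://c:/hdfs/data,[SSD]file://d:/hdfs/data</value>
</property>

<!-- hive-site.xml (sketch): ask Hive to place temporary tables on SSD storage. -->
<property>
  <name>hive.exec.temporary.table.storage</name>
  <value>SSD</value>
</property>
```

Because the Azure temporary disk is wiped after a restart, this layout assumes that anything written to the [SSD] directory is disposable, which is the case for temporary tables.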


3 REPLIES

Super Collaborator

You could use those disks for the temporary data of your MapReduce/Tez processes (the intermediate data written during the shuffle and sort phase). That should boost your performance considerably.

See the benchmarks in this paper (look at figure 17, tmpSSD vs HD):

https://peerj.com/preprints/1320.pdf


Master Mentor

@cliu@hortonworks.com

These are very helpful benchmarks posted by AMPLab. Click