Support Questions

cliu · ‎10-14-2015

We're running HDP 2.3 on Windows Azure virtual machines. These machines come with a 400GB temporary SSD disk (which gets wiped after a restart). I'd like advice on how best to use this SSD storage to boost performance. For example, which config params should we point at locations on the SSD disk to speed up HDFS / Tez / YARN / Hive?

gopalv · ‎10-21-2015

SSDs are best suited for shuffle intermediate data storage and on-disk logging. For Tez/YARN/Hive, the main parameters to modify are in yarn-site.xml (Tez uses sub-directories of these):

yarn.nodemanager.local-dirs=file://d:/yarn/data

yarn.nodemanager.log-dirs=file://d:/yarn/logs
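As a rough sketch, the two yarn-site.xml properties above would be written as property elements like the ones below. The paths are simply the example values from this post, so adjust them to your own drive layout. Both properties accept a comma-separated list if you want to spread the data over more than one directory.

```xml
<!-- yarn-site.xml fragment (sketch only; paths are the example values from the post) -->
<property>
  <!-- Container working directories, including shuffle intermediate data -->
  <name>yarn.nodemanager.local-dirs</name>
  <value>file://d:/yarn/data</value>
</property>
<property>
  <!-- Container log directories -->
  <name>yarn.nodemanager.log-dirs</name>
  <value>file://d:/yarn/logs</value>
</property>
```

The NodeManagers need a restart before the new directories are picked up.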

Also, check that TRIM is enabled for the SSD using the fsutil behavior command (e.g. "fsutil behavior query DisableDeleteNotify"; a result of 0 means TRIM is enabled). You can also use the SSD to accelerate temporary tables in Hive by exposing it through HDFS. The dfs.datanode.data.dir setting needs an entry like "[SSD]file://d:/hdfs/data" (to store the SSD data on d:\hdfs\data), and hive-site.xml needs hive.exec.temporary.table.storage=SSD
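For illustration, the HDFS and Hive settings described above might look like this in XML form. This is only a sketch built from the values in this post. Since the temporary disk is wiped on restart, it is presumably safer to add the SSD directory next to your existing data directories than to replace them.

```xml
<!-- hdfs-site.xml fragment (sketch only): tag the SSD-backed directory with the
     [SSD] storage type. The value is copied from the post; list any existing
     data directories alongside it, comma-separated, rather than replacing them. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[SSD]file://d:/hdfs/data</value>
</property>

<!-- hive-site.xml fragment (sketch only): ask Hive to place temporary tables
     on SSD-tagged storage. -->
<property>
  <name>hive.exec.temporary.table.storage</name>
  <value>SSD</value>
</property>
```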

Then you can use "create temporary table xyz stored as orc as select ... from table where ...;" to create temporary tables cached on SSDs.

sluangsay · ‎10-14-2015

You could use those disks for the temporary data of your MapReduce/Tez processes (the intermediate data written during the shuffle and sort phase). That should boost your performance quite a lot.

See the benchmarks in this paper (look at figure 17, tmpSSD vs HD):

https://peerj.com/preprints/1320.pdf

nsabharwal · ‎10-21-2015

@cliu@hortonworks.com

These are very helpful benchmarks posted by AMPLab.
