Created on 10-14-2015 04:23 AM - edited 09-16-2022 02:44 AM
We're running HDP 2.3 on Windows Azure virtual machines. These machines come with a 400GB temporary SSD disk (which gets wiped after a restart). I'd like advice on how best to use this SSD storage to boost performance. For example, which config params should we change to point to locations on the SSD disk to boost HDFS / Tez / YARN / Hive performance?
Created 10-21-2015 04:40 AM
SSDs are best suited to storing intermediate shuffle data and to on-disk logging. For Tez/YARN/Hive, the main parameters to modify are in yarn-site.xml (Tez uses sub-directories of these locations):
yarn.nodemanager.local-dirs=file://d:/yarn/data
yarn.nodemanager.log-dirs=file://d:/yarn/logs
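As a rough sketch, the two settings above would look like this as yarn-site.xml property entries. The values are copied as posted; Hadoop file URIs are commonly written with three slashes (file:///d:/...), so the exact form may need adjusting for your cluster, and the d: drive letter is an assumption about where the temporary SSD is mounted.

```xml
<!-- yarn-site.xml: point NodeManager scratch space and logs at the SSD.
     Paths are the ones from the post; the drive letter is an assumption. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>file://d:/yarn/data</value>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>file://d:/yarn/logs</value>
</property>
```

Both properties accept a comma-separated list, so existing non-SSD directories can be kept alongside the SSD path if needed.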
Also, check that TRIM is enabled on the SSD with the fsutil behavior command (fsutil behavior query DisableDeleteNotify; a value of 0 means TRIM is enabled). You can also use the SSD to accelerate temporary tables in Hive by exposing it via HDFS. dfs.datanode.data.dir needs an entry like "[SSD]file://d:/hdfs/data" (to store the SSD data on d:\hdfs\data), and hive-site.xml needs hive.exec.temporary.table.storage=SSD;
Then you can use "create temporary table xyz stored as orc as select.... from table where ...;" to create temporary tables cached on SSDs.
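A minimal sketch of the HDFS and Hive settings described above, written as property entries. The data directory is the one from the post; any existing DataNode directories are assumed to stay in the same comma-separated value. The Hive documentation lists the accepted values for this property as memory, ssd and default, so the lowercase form is used here.

```xml
<!-- hdfs-site.xml: tag the SSD-backed directory with the SSD storage type.
     Keep any existing data directories in the same comma-separated list. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[SSD]file://d:/hdfs/data</value>
</property>

<!-- hive-site.xml: place temporary tables on SSD-tagged storage -->
<property>
  <name>hive.exec.temporary.table.storage</name>
  <value>ssd</value>
</property>
```

The DataNodes need a restart to pick up the new storage type, after which temporary tables created in a Hive session should be written to the SSD-tagged volumes.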
Created 10-14-2015 07:36 AM
You could use those disks for the temporary data of your MapReduce/Tez processes (the intermediate data written during the shuffle and sort phase). That should boost your performance quite a lot.
See some benchmarks in that paper (look at figure 17, tmpSSD vs HD):
Created 10-21-2015 09:47 AM
These are very helpful benchmarks posted by AMPLab.