
Can you please advise about how best to use this SSD storage to boost performance in HDP on Azure?


New Contributor

We're running HDP 2.3 on Windows Azure virtual machines. These machines come with a 400GB temporary SSD disk (which gets wiped after restart). How can we best use this SSD storage to boost performance? For example, which config params should we change to point to locations on the SSD disk to speed up HDFS / Tez / YARN / Hive?

1 ACCEPTED SOLUTION


Re: Can you please advise about how best to use this SSD storage to boost performance in HDP on Azure?

Rising Star

SSDs are best suited to shuffle intermediate data storage and on-disk logging. For Tez/YARN/Hive, the main parameters to modify are in yarn-site.xml (Tez uses sub-dirs of the YARN local dirs):

yarn.nodemanager.local-dirs=file://d:/yarn/data

yarn.nodemanager.log-dirs=file://d:/yarn/logs
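As a rough sketch, the two settings above would look something like this in yarn-site.xml property form. The paths are simply the ones given in this answer; they assume the temporary SSD is mounted as d: and should be adjusted to the actual drive letter.

```xml
<!-- yarn-site.xml (sketch): paths copied from the answer, assuming the temporary SSD is d: -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <!-- shuffle and other intermediate data; Tez writes into sub-dirs here -->
  <value>file://d:/yarn/data</value>
</property>
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <!-- container logs written by the NodeManager -->
  <value>file://d:/yarn/logs</value>
</property>
```

The NodeManagers would need a restart to pick up the new directories.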

Also, check that TRIM is enabled for the SSD using the fsutil behavior command (for example, "fsutil behavior query DisableDeleteNotify", where 0 means TRIM is enabled). You can also use the SSD to accelerate temporary tables in Hive by exposing it through HDFS. dfs.datanode.data.dir needs an entry like "[SSD]file://d:/hdfs/data" (to store the SSD data on d:\hdfs\data), and hive-site.xml needs:

hive.exec.temporary.table.storage=SSD

Then you can use "create temporary table xyz stored as orc as select.... from table where ...;" to create temporary tables cached on SSDs.
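A minimal sketch of the HDFS and Hive settings described above, in property form. The [SSD] entry and the Hive value come from this answer; the [DISK] entry on c: is a hypothetical stand-in for whatever persistent data directories the cluster already uses. Keeping those in the list is an assumption, made because the temporary disk is wiped on restart.

```xml
<!-- hdfs-site.xml (sketch) -->
<property>
  <name>dfs.datanode.data.dir</name>
  <!-- c:/hdfs/data is hypothetical: replace it with the existing persistent directories -->
  <value>[DISK]file://c:/hdfs/data,[SSD]file://d:/hdfs/data</value>
</property>

<!-- hive-site.xml (sketch) -->
<property>
  <name>hive.exec.temporary.table.storage</name>
  <!-- ask Hive to place temporary table data on SSD-tagged storage -->
  <value>SSD</value>
</property>
```

With both in place, temporary tables created as in the statement above should be written to the directories tagged [SSD].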


Re: Can you please advise about how best to use this SSD storage to boost performance in HDP on Azure?

Expert Contributor

You could use those disks for the temporary data of your MapReduce/Tez processes (intermediate data during the shuffle and sort phase). That should boost your performance considerably.

See the benchmarks in this paper (look at figure 17, tmpSSD vs HD):

https://peerj.com/preprints/1320.pdf


Re: Can you please advise about how best to use this SSD storage to boost performance in HDP on Azure?

@cliu@hortonworks.com

These are very helpful benchmarks posted by AMPLab.
