
How to Install Hortonworks entire ecosystem without HDFS?


New Contributor

I am planning to install a Hortonworks cluster with YARN, Tez, Hive, MapReduce, Pig, ZooKeeper, Spark, etc. without HDFS, using NFS storage instead. Any pointers would be great. Can this be done using Ambari?

1 ACCEPTED SOLUTION

Re: How to Install Hortonworks entire ecosystem without HDFS?

New Contributor

That should be doable. I have been running some tests with high-performance enterprise NFSv3 storage and Spark, and it worked like a charm. I still kept an HDFS filesystem for logs and historical data (as a kind of tier-2) and used the high-performance NFS storage for the tier-1 datasets that needed more performance and lower response times. Ironically, I found that this NFS storage solution performed similarly to, or slightly better than, HDFS on massive reads, but clearly outperformed HDFS on writes, especially when the jobs had a lot of shuffle and spill to disk.

The key thing when using external, high-performance NFS storage is to make sure that all the nodes in the cluster have a persistent mount to the NFS filesystem and that all of them use the same mountpoint. When you submit your Spark jobs, you then use "file:///" paths instead of HDFS paths, for example: "file:///mnt_bigdata/datasets/x".
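To make the "same mountpoint on every node" requirement concrete, here is a minimal sketch. The NFS server name, export path, and mountpoint are hypothetical, and the mount options are generic NFSv3 choices, not a tested recommendation:

```
# /etc/fstab entry, identical on every cluster node
# (hypothetical server "nfs-server.example.com" exporting "/bigdata")
nfs-server.example.com:/bigdata  /mnt_bigdata  nfs  vers=3,proto=tcp,hard,noatime  0 0

# Mount it on each node (or reboot)
mount /mnt_bigdata

# Submit the Spark job with a file:// URI instead of an hdfs:// one
spark-submit --master yarn --deploy-mode cluster \
  my_job.py "file:///mnt_bigdata/datasets/x"
```

Because every executor resolves "file:///mnt_bigdata/..." through its local mount table, the path only works if the mount is present and identical everywhere, which is exactly the point above.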

The big open questions here are:

(1) Does Hortonworks support this?

(2) Is there any kind of generic NFS integration/deployment/best-practices guide?

(3) Is there a procedure to move all of the cluster services' and resources' file dependencies out of HDFS to NFS?

7 REPLIES

Re: How to Install Hortonworks entire ecosystem without HDFS?

Cloudera Employee
@Krishna S

Yes, this can be done. You can install all the required components using Ambari with HDFS. The default storage (default file system) can later be changed to NFS. This is a doable configuration, but it can't be posted here. You can email me with your specific requirements.
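For context, "changing the default file system" generally comes down to repointing fs.defaultFS in core-site.xml. This is only a hedged sketch of the idea, not the specific procedure referred to above, and the mountpoint is hypothetical; individual services typically need further reconfiguration as well:

```xml
<!-- core-site.xml: point the default filesystem at the shared NFS mount.
     file:/// resolves through the local filesystem, so every node must
     see the same NFS mount at the same path (hypothetical: /mnt_bigdata). -->
<property>
  <name>fs.defaultFS</name>
  <value>file:///</value>
</property>
```

With this in place, unqualified paths resolve against the local (NFS-backed) filesystem instead of HDFS, which is why the mount must be identical on all nodes.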

Re: How to Install Hortonworks entire ecosystem without HDFS?

New Contributor

@nkumar I'm interested in changing the default filesystem for the entire HDP to NFS, can you please share this?

Re: How to Install Hortonworks entire ecosystem without HDFS?

New Contributor

@nkumar I'm also interested in HDP on NFS instead of HDFS. Could you please share what needs to be done?

Re: How to Install Hortonworks entire ecosystem without HDFS?

New Contributor

@nkumar That would be great. Thank you very much! How can I reach you?

Re: How to Install Hortonworks entire ecosystem without HDFS?

Cloudera Employee

Hi @Krishna S

Let me know your email ID and I can email you there directly. Or, if you are from Hortonworks, you can certainly find me via HipChat.

Re: How to Install Hortonworks entire ecosystem without HDFS?

Super Guru

@Krishna S

To use these components without HDFS, you need a file system that implements the Hadoop FileSystem API. Some such systems are Amazon S3, WASB, EMC Isilon, and a few others (these systems might not implement 100 percent of the Hadoop API, so please verify). You can also install Hadoop in standalone mode, which does not use HDFS.
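The same Hadoop tooling works against any FileSystem implementation, selected by the URI scheme. As a rough illustration (the host, bucket, and account names are hypothetical, and each scheme requires its connector to be installed and configured):

```
hadoop fs -ls hdfs://namenode:8020/data    # HDFS
hadoop fs -ls s3a://my-bucket/data         # Amazon S3 via the s3a connector
hadoop fs -ls wasb://container@myaccount.blob.core.windows.net/data  # Azure WASB
hadoop fs -ls file:///mnt_bigdata/data     # local filesystem (e.g. an NFS mount)
```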

I am not sure NFS on its own supports the Hadoop API, but using the Hadoop NFS gateway, you can mount HDFS as a client's local file system. Here is a link on using this feature.

https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html
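As a rough sketch of what the gateway approach looks like (the gateway host name and mountpoint are hypothetical, and the gateway must first be configured in hdfs-site.xml / core-site.xml; see the linked guide for the authoritative steps):

```
# On the gateway host: start the portmap and nfs3 services in the foreground
# (the linked guide also describes running them as daemons)
hdfs portmap
hdfs nfs3

# On a client: mount the HDFS namespace over NFSv3
mount -t nfs -o vers=3,proto=tcp,nolock,sync nfsgw.example.com:/ /hdfs_mount

# HDFS files now appear under the local mountpoint
ls /hdfs_mount
```

Note this is the inverse of the original question: it exposes HDFS over NFS rather than replacing HDFS with NFS storage.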
