How to Install Hortonworks entire ecosystem without HDFS?

Explorer

I am planning to install a Hortonworks cluster with YARN, Tez, Hive, MapReduce, Pig, ZooKeeper, Spark, etc. without HDFS, using NFS storage instead. Any pointers would be great. Can this be done using Ambari?

1 ACCEPTED SOLUTION


That should be doable. I have been running some tests with high-performance enterprise NFSv3 storage and Spark, and it worked like a charm. I still kept an HDFS filesystem for logs and historical data (as a kind of tier 2) and used the high-performance NFS storage for the tier-1 datasets that needed more performance and lower response times. Ironically, I found that this NFS storage solution performed similarly to or slightly better than HDFS on massive reads, and clearly outperformed HDFS on writes, especially when the jobs had a lot of shuffle and spill to disk.

The key to using external, high-performance NFS storage is to make sure all the nodes in the cluster have a persistent mount of the NFS filesystem and that all of them use the same mountpoint. When you submit your Spark jobs, you just use the "file:///" scheme instead, for example: "file:///mnt_bigdata/datasets/x".
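As a rough illustration of the two points above (the same mountpoint on every node, and "file:///" paths for jobs), here is a minimal Python sketch that could be run on each node before submitting a job. It assumes a Linux node where mounts are listed in /proc/mounts format; the function names and the server name `filer01` are hypothetical and not part of any Hortonworks or Spark tooling.

```python
from pathlib import PurePosixPath


def parse_mount_points(mounts_text):
    """Return {mount_point: fs_type} from text in /proc/mounts format."""
    mounts = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            # Fields are: device, mount point, filesystem type, options, ...
            mounts[fields[1]] = fields[2]
    return mounts


def is_nfs_mounted(mount_point, mounts_text):
    """True if mount_point is listed with an NFS filesystem type (nfs, nfs4)."""
    fs_type = parse_mount_points(mounts_text).get(mount_point, "")
    return fs_type.startswith("nfs")


def to_file_uri(mount_point, relative_path):
    """Build the file:// URI for a dataset under the shared mountpoint."""
    return (PurePosixPath(mount_point) / relative_path).as_uri()


# Hypothetical /proc/mounts entry; on a real node, read the file instead:
#   mounts_text = open("/proc/mounts").read()
sample = "filer01:/export/bigdata /mnt_bigdata nfs rw,vers=3 0 0\n"

if is_nfs_mounted("/mnt_bigdata", sample):
    print(to_file_uri("/mnt_bigdata", "datasets/x"))
```

If the check fails on any node, a job reading "file:///mnt_bigdata/..." would see a different (or empty) directory on that node, which is why the identical mountpoint matters.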

The great questions here are:

(1) Is Hortonworks supporting this?

(2) Is there any kind of generic NFS integration/deployment/best-practice guide?

(3) Is there a procedure to completely move all cluster services and their file dependencies from HDFS to NFS?


7 REPLIES

Contributor
@Krishna S

Yes, this can be done. You can install all the required components using Ambari with HDFS. The default storage (default file system) can later be changed to NFS. This is a doable configuration, but it can't be posted here. You can email me with your specific requirements.
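The procedure mentioned above was not posted in this thread, so the following is only a sketch of the general idea and not that procedure. In Hadoop, the default file system is named by the `fs.defaultFS` property in core-site.xml; the Python snippet below just renders such a property fragment. The helper name is hypothetical, and whether an Ambari-managed HDP stack runs correctly with this value pointed at a shared NFS mount is an assumption that this thread does not confirm.

```python
import xml.etree.ElementTree as ET


def core_site_with_default_fs(default_fs):
    """Render a minimal core-site.xml fragment setting fs.defaultFS.

    default_fs is a file system URI, e.g. "file:///" for the local
    file system (which would include a locally mounted NFS export).
    """
    root = ET.Element("configuration")
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = "fs.defaultFS"
    ET.SubElement(prop, "value").text = default_fs
    return ET.tostring(root, encoding="unicode")


# Assumption: the shared NFS export is mounted identically on every node,
# so the local file system scheme resolves to the same data everywhere.
print(core_site_with_default_fs("file:///"))
```

Services that expect HDFS-specific behavior may need further changes beyond this one property, so treat the fragment as a starting point for testing rather than a supported configuration.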


@nkumar I'm interested in changing the default filesystem for the entire HDP stack to NFS. Can you please share this?

New Contributor

@nkumar I'm also interested in HDP on NFS instead of HDFS. Could you please share what needs to be done?

Explorer

@nkumar That would be great. Thank you very much. How can I reach you?

Contributor

Hi @Krishna S

Let me know your email address and I can email you directly. Or, if you are from Hortonworks, you can find me on HipChat.

Super Guru

@Krishna S

To use these components without HDFS, you need a file system that supports the Hadoop API. Some such systems are Amazon S3, WASB, EMC Isilon and a few others (these systems might not implement 100 percent of the Hadoop API, so please verify). You can also install Hadoop in standalone mode, which does not use HDFS.
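To make the idea of interchangeable file systems a bit more concrete: Hadoop-compatible storage is normally addressed by URI scheme, and jobs pick the backend from the scheme of the path they are given. The Python sketch below only classifies a path by its scheme; the mapping table and the function name are illustrative assumptions, not a Hadoop API, and the list of schemes is not exhaustive.

```python
from urllib.parse import urlparse

# Illustrative mapping of common Hadoop URI schemes to storage backends.
KNOWN_SCHEMES = {
    "hdfs": "HDFS",
    "s3a": "Amazon S3 (S3A connector)",
    "wasb": "Azure Blob Storage (WASB)",
    "file": "local file system, including a locally mounted NFS export",
}


def describe_storage(uri):
    """Return a short description of the backend implied by the URI scheme."""
    scheme = urlparse(uri).scheme
    return KNOWN_SCHEMES.get(scheme, "unknown scheme: " + repr(scheme))


for path in ("hdfs://namenode:8020/data", "file:///mnt_bigdata/datasets/x"):
    print(path, "->", describe_storage(path))
```

Here `namenode:8020` is a placeholder host and port, not a value taken from this thread.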

I am not sure whether NFS on its own supports the Hadoop API, but with the Hadoop NFS gateway you can mount HDFS as part of the client's local file system. Here is a link on using this feature.

https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.htm
