
Best practice for file location in an HDP cluster

Rising Star

Hello,

I deployed my first HDP cluster a month ago, and it is now used by my whole department.

I want to store various files in the cluster, but I don't know the best practice for doing so. Can I store files on a master node? An edge node? A data node? ...

Examples of files I want to store are:

- files for proofs of concept

- jar files for applications like Spark

- files for the Teradata client

- ifexp files

1 ACCEPTED SOLUTION

Guru

1. Never use master or data node local storage

Best practice is definitely not to touch the master nodes or data nodes for local filesystem storage or command-line work. Instead, use the edge node CLI, or work from your local machine via Ambari Views or an integration through the Knox gateway.
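
For example, with Knox in place a remote machine can manage HDFS over REST without shell access to any cluster node. A minimal sketch using the WebHDFS API through Knox (the hostname, port, "default" topology name, and credentials are assumptions; adjust them to your gateway setup):

    # List an HDFS directory through the Knox gateway via WebHDFS.
    # knox-host, the topology name, and the credentials are placeholders.
    curl -k -u myuser:mypassword \
      'https://knox-host:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS'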

2. Third-party tools

Third-party tools will specify where to locate their files/jars.

3. Edge node

If you need files (typically jars) for a client interface to the cluster, place them on the edge node and use the client there.
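
For example, a Spark application jar would sit on the edge node's local disk and be submitted from there. A minimal sketch (the host, paths, user, and class name are hypothetical placeholders):

    # Copy the application jar to the edge node, then submit it from there.
    scp my-spark-app.jar myuser@edge-node:/home/myuser/apps/
    ssh myuser@edge-node
    spark-submit --master yarn --deploy-mode cluster \
      --class com.example.MyApp /home/myuser/apps/my-spark-app.jar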

If you simply want to archive files (e.g. POC work), you can do this on the edge node's local file system.

4. HDFS

If you are archiving files on the edge node and it does not have high availability or backup (e.g. auto-replication of mounts), and you want that protection, putting them into HDFS is a good idea, since each file is replicated 3x by default.

When putting files into HDFS, from a client perspective there is no choosing a name node or data node: you interact with the NameNode, and it stores the data on the data nodes. The NameNode is your interface to the data nodes.

In HDFS, you could define a path like /misc and store these files there. You can also manage read-write permissions on this folder.

You can manage files (make a directory, put a file, get a file) in HDFS through the command line (the edge node is a good host for this) or the Ambari Files view.
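
For example, a minimal sketch of that workflow from the edge node command line (the file name and the user/group are just illustrations; /misc is the path suggested above):

    # Create the archive directory and restrict it to your group.
    hdfs dfs -mkdir -p /misc
    hdfs dfs -chown myuser:mygroup /misc   # hypothetical user/group
    hdfs dfs -chmod 770 /misc              # rwx for owner and group only

    # Put a local file in, list the directory, and get the file back.
    hdfs dfs -put poc-results.csv /misc/
    hdfs dfs -ls /misc
    hdfs dfs -get /misc/poc-results.csv /tmp/

    # Confirm the replication factor (3 by default on most clusters).
    hdfs dfs -stat 'replication: %r' /misc/poc-results.csv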

See: http://hortonworks.com/hadoop-tutorial/using-commandline-manage-files-hdfs/

http://hortonworks.com/blog/best-practices-in-hdfs-authorization-with-apache-ranger/


7 REPLIES

Rising Star

Can I create a Linux file system to store all the files in? And which node can I use? Thanks

Guru

If you do anything with the Linux file system, it should be on the edge node only. See the accepted solution above for the fuller answer.

Rising Star

Hello Greg,

Thanks for your answers. I am not talking about data files that I can store in HDFS, but about files like application jars (jars for a Spark application) or Teradata-generated files.

Thanks

Guru

As mentioned in the previous comment: you should only store files in the local file system of the edge node. You should never use the actual cluster (master and data nodes) for local file storage. The fuller answer above gives the benefit of HDFS if you are worried about automatic backup of files. (I have seen edge nodes go down and everything lost; so either have automatic backup or put files you want backed up into HDFS.)
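
If you do keep working copies on the edge node, a scheduled copy into HDFS is a cheap way to get that automatic backup. A minimal sketch as a crontab entry on the edge node (all paths are assumptions):

    # One-time setup: create the backup target in HDFS.
    hdfs dfs -mkdir -p /misc/backup
    # Hypothetical crontab entry: every night at 02:00, push the local
    # archive directory into HDFS, overwriting any older copies.
    0 2 * * * hdfs dfs -put -f /home/myuser/archive/* /misc/backup/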

Rising Star

OK! Thanks very much.

Guru

If you feel you have everything you need, let me know by accepting the answer; otherwise, it is fine to wait for additional answers or to follow up with more questions.