
How do I import an existing repository's contents (files) into HDFS?

Rising Star

Hi guys,

I want to import/migrate my existing TBs of content (documents) into HDFS. How can I do that as easily as possible?

1 ACCEPTED SOLUTION


@Viraj Vekaria

Option 2

I noticed you mentioned WinSCP, so I’m assuming that the import job is for the initial data load only (or may run occasionally) and will be a manual process. If that is the case then the easiest thing to do is copy the files over to the cluster’s local file system and then use the command line to put the files into HDFS.

1) Copy files from your Windows machine to the cluster’s Linux file system using WinSCP

2) Create a directory in HDFS using the “hadoop fs -mkdir” command

  • Takes one or more path URIs as arguments and creates the corresponding directories (see the note on the -p flag after the example).
# hadoop fs -mkdir <paths> 
# Example:

        hadoop fs -mkdir /user/hadoop
        hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2 /user/hadoop/dir3

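If the parent directories don't exist yet, the -p flag creates the full path in one go. A minimal sketch, with example paths only:

# hadoop fs -mkdir -p <paths>
# Example:

        hadoop fs -mkdir -p /user/hadoop/dir1/incoming /user/hadoop/dir2/incoming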

3) Copy the files from the local file system to HDFS using the “hadoop fs -put” command

  • Copies one or more source files from the local file system to HDFS (a directory example follows below).
# hadoop fs -put <local-src> ... <HDFS_dest_path> 
# Example:

        hadoop fs -put popularNames.txt /user/hadoop/dir1/popularNames.txt
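
The -put command also accepts directories and multiple sources, which is handy once WinSCP has dropped a whole folder of documents on the Linux box. A quick sketch, assuming the files landed under /data/docs (adjust to your actual path):

# Example: copy an entire local directory into HDFS, then verify

        hadoop fs -put /data/docs /user/hadoop/dir1
        hadoop fs -ls /user/hadoop/dir1/docs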

For more command-line operations, such as deleting and listing files, take a look at the links below; a few common examples follow the links.

http://hortonworks.com/hadoop-tutorial/using-commandline-manage-files-hdfs/

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
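
As a quick reference, a few of the standard FsShell commands covered in those links (the paths are placeholders):

        hadoop fs -ls /user/hadoop/dir1                              # list files in an HDFS directory
        hadoop fs -du -h /user/hadoop/dir1                           # show space used (human-readable)
        hadoop fs -get /user/hadoop/dir1/popularNames.txt /tmp/     # copy a file back to local disk
        hadoop fs -rm /user/hadoop/dir1/popularNames.txt            # delete a file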


6 REPLIES

Guru

Is your content on a local file system, or on HDFS in some other cluster?

Rising Star

@Saurabh Kumar, it is on a different Windows machine, and the HDFS cluster is in a completely new place, so I need to import that Windows repository content into HDFS.

I have tried using the WinSCP tool, but it only copies flat files; it does not write them into HDFS with its block structure, etc.

Rising Star

@Neeraj Sabharwal, do you have any idea about this?


One way would be to use NiFi/HDF. You would create a ListFile processor to list the files in a folder and pass them to a FetchFile processor (if you want to keep the original files), or use a GetFile processor on its own (if you want the originals removed after pickup). You would then use a PutHDFS processor to land the files in HDFS.

GetFile: Streams the contents of a file from a local disk (or network-attached disk) into NiFi and then deletes the original file. This Processor is expected to move the file from one location to another location and is not to be used for copying the data.

FetchFile: Reads the contents of a file from disk and streams it into the contents of an incoming FlowFile. Once this is done, the file is optionally moved elsewhere or deleted to help keep the file system organized.

ListHDFS: ListHDFS monitors a user-specified directory in HDFS and emits a FlowFile containing the filename for each file that it encounters. It then persists this state across the entire NiFi cluster by way of a Distributed Cache. These FlowFiles can then be fanned out across the cluster and sent to the FetchHDFS/GetFile Processor, which is responsible for fetching the actual content of those files and emitting FlowFiles that contain the content fetched from HDFS.

PutHDFS: Writes FlowFile data to the Hadoop Distributed File System (HDFS).

The flow you create in NiFi would continuously monitor your folder for new files and move them over. If it's only a one-time ingest that you're interested in, then you can just disable NiFi after you're done.
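
As a rough sketch, the flow could be wired up like the outline below. The directories and property values are placeholders only, so check the processor documentation in your NiFi version for the exact property names:

        ListFile   (Input Directory: /data/docs)
           --> FetchFile  (Completion Strategy: None, i.e. keep the original file)
           --> PutHDFS   (Directory: /user/hadoop/docs,
                          Hadoop Configuration Resources: /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml)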

Resources:

https://nifi.apache.org/docs.html

https://nifi.apache.org/docs/nifi-docs/html/getting-started.html


Rising Star

@Eyad Garelnabi, yes, this is for initial loading only, as we already have a Windows file server with my DMS.

So I need to do that initial loading, and then we'll transfer the flow of our app over to HDFS.

By the way, thanks for the suggestion; I'll look into it. 🙂