
Can I change the contents of a file stored in HDFS? If yes, how, and what are the pros and cons?

Contributor

Hi,

I have a text file stored in HDFS and I want to append some rows to it.

How can I complete this task?

Thanks,

1 ACCEPTED SOLUTION

Master Guru
@Rakesh AN

Yes, you can append rows to an existing text file in HDFS using the appendToFile shell command.

appendToFile

Usage: hdfs dfs -appendToFile <localsrc> ... <dst>

Append single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and appends to destination file system.

  • hdfs dfs -appendToFile localfile /user/hadoop/hadoopfile
  • hdfs dfs -appendToFile localfile1 localfile2 /user/hadoop/hadoopfile
  • hdfs dfs -appendToFile localfile hdfs://nn.example.com/hadoop/hadoopfile
  • hdfs dfs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile Reads the input from stdin.
  • echo "hi"|hdfs dfs -appendToFile - /user/hadoop/hadoopfile

Pros:

A small file is one that is significantly smaller than the HDFS block size. Every file, directory, and block in HDFS is represented as an object in the NameNode's memory, so HDFS cannot handle large numbers of small files well; it is better to keep large files in HDFS than many small ones, and appending to an existing file avoids creating new small files.
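
As a rough rule of thumb that is often cited (an estimate, not an exact figure), each of those NameNode objects costs on the order of 150 bytes of heap. So 10 million small files, each occupying one block, is roughly 20 million objects, or about 20,000,000 × 150 B ≈ 3 GB of NameNode heap, however little data the files actually hold; appending rows to one existing file adds no new objects.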


Cons:

When we want to append to an HDFS file, the client must first obtain a lease, which is essentially a lock, to ensure single-writer semantics: only one client can write to a given file at a time.
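
One practical consequence: if a writer crashes without closing its stream, the lease stays held until it expires on the NameNode. Recent Hadoop releases (2.7+) ship a manual recovery command; a minimal sketch, using a hypothetical stuck file:

hdfs debug recoverLease -path /user/hadoop/hadoopfile -retries 3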

In addition, if you have n part files in an HDFS directory and want to merge them into 1 file, you can run a Hadoop Streaming job with a single reducer:

hadoop jar /usr/hdp/2.5.3.0-37/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.3.0-37.jar \
  -Dmapred.reduce.tasks=1 \
  -input "<path-to-input-directory>" \
  -output "<path-to-output-directory>" \
  -mapper cat \
  -reducer cat

Check which version of the Hadoop streaming jar you have by looking under /usr/hdp. Then give the input path, and make sure the output directory does not already exist; the job merges the files and creates the output directory for you.
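
If routing the merged result through the local filesystem is acceptable, a lighter-weight alternative that avoids launching a MapReduce job is getmerge (a sketch; paths are placeholders, and /tmp/merged.txt is a local scratch file):

hdfs dfs -getmerge <path-to-input-directory> /tmp/merged.txt
hdfs dfs -put /tmp/merged.txt <path-to-merged-output-file>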

Here is what I tried:

# hdfs dfs -ls /user/yashu/folder2/
Found 2 items
-rw-r--r--   3 hdfs hdfs  150 2017-09-26 17:55 /user/yashu/folder2/part1.txt
-rw-r--r--   3 hdfs hdfs   20 2017-09-27 09:07 /user/yashu/folder2/part1_sed.txt
# hadoop jar /usr/hdp/2.5.3.0-37/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.3.0-37.jar \
>   -Dmapred.reduce.tasks=1 \
>   -input "/user/yashu/folder2/" \
>   -output "/user/yashu/folder1/" \
>   -mapper cat \
>   -reducer cat

folder2 has 2 files. After running the above command, the merged output is stored in the folder1 directory, and the 2 files got merged into 1 file, as you can see below.

# hdfs dfs -ls /user/yashu/folder1/
Found 2 items
-rw-r--r--   3 hdfs hdfs    0 2017-10-09 16:00 /user/yashu/folder1/_SUCCESS
-rw-r--r--   3 hdfs hdfs  174 2017-10-09 16:00 /user/yashu/folder1/part-00000
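
To double-check the result, you can print the single merged part file (the path comes from the listing above):

hdfs dfs -cat /user/yashu/folder1/part-00000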

If the answer helped to resolve your issue, click the Accept button below to accept it. That helps community users find solutions to these kinds of questions quickly.


4 REPLIES


Contributor

@Shu

I have one doubt: if we change the contents of a file, will this affect the metadata information stored on the NameNode?

What happens if we keep appending data to the same file on a daily basis? Also, if we append large files, will this reduce performance?

Do you recommend appending data to the existing file or creating a new file?

Thanks,

Master Guru

@Rakesh AN

Yes, it needs to update the metadata. Let's assume your existing file in HDFS is 127 MB and you append 3 MB to it, i.e. 130 MB in total. HDFS now splits the 130 MB file into 2 blocks (128 MB + 2 MB) and makes sure all the replicas are also updated with the new data.

Example:-

$ hdfs dfs -ls /user/yashu/test4/
Found 1 items
-rw-r--r--   3 hdfs hdfs         21 2017-12-11 15:42 /user/yashu/test4/sam.txt
$ hadoop fs -appendToFile sam.txt /user/yashu/test4/sam.txt
$ hdfs dfs -ls /user/yashu/test4/
Found 1 items
-rw-r--r--   3 hdfs hdfs         30 2017-12-12 09:19 /user/yashu/test4/sam.txt
$ echo "hi"|hdfs dfs -appendToFile - /user/yashu/test4/sam.txt
$ hdfs dfs -ls /user/yashu/test4/
Found 1 items
-rw-r--r--   3 hdfs hdfs         33 2017-12-12 09:20 /user/yashu/test4/sam.txt

In the above example you can see that my HDFS file had size 21 and timestamp 2017-12-11 15:42; after I appended to it, the size and date changed. The NameNode needs to update the file's new metadata and the replicated blocks as well.
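
You can also inspect the resulting block layout directly with fsck, which prints the file's blocks and their sizes (a sketch using the same path as above; the exact output wording varies by version):

hdfs fsck /user/yashu/test4/sam.txt -files -blocks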

It won't reduce performance even when file sizes grow large. I recommend appending new data to the existing file rather than creating new small files.

https://community.hortonworks.com/questions/16278/best-practises-beetwen-size-block-size-file-and-re...

Contributor

Thanks for the clarification @Shu