Created 12-11-2017 10:31 AM
Hi,
I have a text file stored in HDFS and I want to append some rows to it.
How can I complete this task?
Thanks,
Created 12-11-2017 09:32 PM
Yes, you can append rows to an existing text file in HDFS.
appendToFile
Usage: hdfs dfs -appendToFile <localsrc> ... <dst>
Append single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and appends to destination file system.
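For example, a minimal sketch (newrows.txt is a hypothetical local file; the destination is one of the files shown in the listing later in this answer):
$ hdfs dfs -appendToFile newrows.txt /user/yashu/folder2/part1.txt            # append a local file
$ echo "one more row" | hdfs dfs -appendToFile - /user/yashu/folder2/part1.txt  # append from stdin (use - as the source)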
Pros:
A small file is one that is significantly smaller than the HDFS block size. Every file, directory, and block in HDFS is represented as an object in the NameNode's memory, so HDFS does not handle lots of small files well; it is better to keep large files in HDFS instead of many small files, and appending helps avoid creating small files.
Cons:
When we want to append to an HDFS file, the client must first obtain a lease, which is essentially a lock, to ensure single-writer semantics.
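A rough illustration of what the single-writer lease means in practice (the file names here are hypothetical, and the exact error text depends on your Hadoop version):
$ hdfs dfs -appendToFile big_local_file.txt /tmp/example/data.txt &
$ hdfs dfs -appendToFile another_file.txt /tmp/example/data.txt
# while the first append still holds the lease, the second one can be rejected
# with a lease-related error (such as AlreadyBeingCreatedException) until the lease is released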
In addition, if you have n part files in an HDFS directory and want to merge them into one file, you can run a Hadoop streaming job:
hadoop jar /usr/hdp/2.5.3.0-37/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.3.0-37.jar \
  -Dmapred.reduce.tasks=1 \
  -input "<path-to-input-directory>" \
  -output "<path-to-output-directory>" \
  -mapper cat \
  -reducer cat
Check which version of the Hadoop streaming jar you have by looking under /usr/hdp.
Then give the input path, and make sure the output directory does not already exist, as the job will merge the files and create the output directory for you.
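For instance, a quick way to locate the jar (the HDP version shown is the one used in this answer; yours may differ):
$ ls /usr/hdp/*/hadoop-mapreduce/hadoop-streaming-*.jar
/usr/hdp/2.5.3.0-37/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.3.0-37.jar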
Here is what I tried:
# hdfs dfs -ls /user/yashu/folder2/
Found 2 items
-rw-r--r--   3 hdfs hdfs        150 2017-09-26 17:55 /user/yashu/folder2/part1.txt
-rw-r--r--   3 hdfs hdfs         20 2017-09-27 09:07 /user/yashu/folder2/part1_sed.txt
# hadoop jar /usr/hdp/2.5.3.0-37/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.3.0-37.jar \
>   -Dmapred.reduce.tasks=1 \
>   -input "/user/yashu/folder2/" \
>   -output "/user/yashu/folder1/" \
>   -mapper cat \
>   -reducer cat
folder2 had 2 files. After running the above command, the merged output is stored in the folder1 directory, and the 2 files were merged into 1 file, as you can see below.
# hdfs dfs -ls /user/yashu/folder1/
Found 2 items
-rw-r--r--   3 hdfs hdfs          0 2017-10-09 16:00 /user/yashu/folder1/_SUCCESS
-rw-r--r--   3 hdfs hdfs        174 2017-10-09 16:00 /user/yashu/folder1/part-00000
If the answer helped to resolve your issue, click the Accept button below to accept it. That helps Community users find solutions to these kinds of issues quickly.
Created 12-12-2017 10:43 AM
I have one doubt: if we change the contents of a file, will this affect the metadata information stored on the NameNode?
What happens if we keep appending data to the same file on a daily basis? Also, if we append large files, will this reduce performance?
Do you recommend appending data to the existing file or creating a new file?
Thanks,
Created 12-12-2017 02:32 PM
Yes, the NameNode needs to update the metadata. For example, assume your existing file in HDFS is 127 MB and you append a 3 MB file to it, making it 130 MB. The 130 MB file is now split across 2 blocks (128 MB + 2 MB), and all the replicas must also be updated with the new data.
Example:
$ hdfs dfs -ls /user/yashu/test4/
Found 1 items
-rw-r--r--   3 hdfs hdfs         21 2017-12-11 15:42 /user/yashu/test4/sam.txt
$ hadoop fs -appendToFile sam.txt /user/yashu/test4/sam.txt
$ hdfs dfs -ls /user/yashu/test4/
Found 1 items
-rw-r--r--   3 hdfs hdfs         30 2017-12-12 09:19 /user/yashu/test4/sam.txt
$ echo "hi" | hdfs dfs -appendToFile - /user/yashu/test4/sam.txt
$ hdfs dfs -ls /user/yashu/test4/
Found 1 items
-rw-r--r--   3 hdfs hdfs         33 2017-12-12 09:20 /user/yashu/test4/sam.txt
In the example above you can see my HDFS file had size 21 and date 2017-12-11 15:42; after I appended to the file, the size and date changed. The NameNode needs to update the file's new metadata and update the replicated blocks as well.
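If you want to confirm how the appended file is laid out, a quick sketch using the standard fsck tool (the path matches the example above):
$ hdfs fsck /user/yashu/test4/sam.txt -files -blocks -locations
# reports the file's length, the block(s) it occupies, and the DataNodes holding each replica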
It won't reduce performance even with large file sizes. I recommend appending new data to the existing file.
Created 12-13-2017 12:50 PM
Thanks for the clarification @Shu