Can I change the contents of a file stored in HDFS? If yes, how, and what are the pros and cons?
- Labels: Apache Hadoop
Created ‎12-11-2017 10:31 AM
Hi,
I have a text file stored in HDFS and I want to append some rows to it.
How can I accomplish this?
Thanks,
Created ‎12-11-2017 09:32 PM
Yes, you can append rows to an existing text file in HDFS.
appendToFile
Usage: hdfs dfs -appendToFile <localsrc> ... <dst>
Appends a single src, or multiple srcs, from the local file system to the destination file system. It can also read input from stdin and append it to the destination file system.
- hdfs dfs -appendToFile localfile /user/hadoop/hadoopfile
- hdfs dfs -appendToFile localfile1 localfile2 /user/hadoop/hadoopfile
- hdfs dfs -appendToFile localfile hdfs://nn.example.com/hadoop/hadoopfile
- hdfs dfs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile (reads the input from stdin)
- echo "hi"|hdfs dfs -appendToFile - /user/hadoop/hadoopfile
Pros:
Appending keeps your data in fewer, larger files. A small file is one that is significantly smaller than the HDFS block size, and every file, directory, and block in HDFS is represented as an object in the NameNode's memory, so HDFS does not cope well with a very large number of small files. It is better to have large files in HDFS than many small ones.
Cons:
To append to an HDFS file, the client must first obtain a lease on the file, which is essentially a lock, to ensure single-writer semantics; only one client can append at a time.
In addition, if you have n part files in an HDFS directory and want to merge them into one file, you can run a Hadoop streaming job:
hadoop jar /usr/hdp/2.5.3.0-37/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.3.0-37.jar \
  -Dmapred.reduce.tasks=1 \
  -input "<path-to-input-directory>" \
  -output "<path-to-output-directory>" \
  -mapper cat \
  -reducer cat
Check which version of the Hadoop streaming jar you have by looking under /usr/hdp, then give the input path and make sure the output directory does not already exist; the job merges the files and creates the output directory for you.
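If you are not sure where the streaming jar lives, something like the following should list it; the exact directory layout and version string depend on your HDP release:
ls /usr/hdp/*/hadoop-mapreduce/hadoop-streaming-*.jar   # shows the installed streaming jar(s) and their versions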
Here is what I tried:
# hdfs dfs -ls /user/yashu/folder2/
Found 2 items
-rw-r--r--   3 hdfs hdfs        150 2017-09-26 17:55 /user/yashu/folder2/part1.txt
-rw-r--r--   3 hdfs hdfs         20 2017-09-27 09:07 /user/yashu/folder2/part1_sed.txt
# hadoop jar /usr/hdp/2.5.3.0-37/hadoop-mapreduce/hadoop-streaming-2.7.3.2.5.3.0-37.jar \
>   -Dmapred.reduce.tasks=1 \
>   -input "/user/yashu/folder2/" \
>   -output "/user/yashu/folder1/" \
>   -mapper cat \
>   -reducer cat
folder2 has 2 files; after running the above command, the merged output is stored in the folder1 directory, and the two files are merged into one file, as you can see below.
# hdfs dfs -ls /user/yashu/folder1/
Found 2 items
-rw-r--r--   3 hdfs hdfs          0 2017-10-09 16:00 /user/yashu/folder1/_SUCCESS
-rw-r--r--   3 hdfs hdfs        174 2017-10-09 16:00 /user/yashu/folder1/part-00000
If this answer helped resolve your issue, click the Accept button below to accept it. That helps other community users find the solution quickly.
Created ‎12-12-2017 10:43 AM
I have one doubt: if we change the contents of a file, will this affect the metadata stored on the NameNode?
What happens if we keep appending data to the same file on a daily basis? Also, if we append large files, will that reduce performance?
Do you recommend appending data to the existing file or creating a new file?
Thanks,
Created ‎12-12-2017 02:32 PM
Yes, the NameNode needs to update the metadata. For example, assume your existing file in HDFS is 127 MB and you append 3 MB to it, making it 130 MB. With a 128 MB block size, HDFS now stores the file as two blocks (128 MB + 2 MB) and makes sure all the replicas are also updated with the new data.
Example:-
$ hdfs dfs -ls /user/yashu/test4/
Found 1 items
-rw-r--r--   3 hdfs hdfs         21 2017-12-11 15:42 /user/yashu/test4/sam.txt
$ hadoop fs -appendToFile sam.txt /user/yashu/test4/sam.txt
$ hdfs dfs -ls /user/yashu/test4/
Found 1 items
-rw-r--r--   3 hdfs hdfs         30 2017-12-12 09:19 /user/yashu/test4/sam.txt
$ echo "hi" | hdfs dfs -appendToFile - /user/yashu/test4/sam.txt
$ hdfs dfs -ls /user/yashu/test4/
Found 1 items
-rw-r--r--   3 hdfs hdfs         33 2017-12-12 09:20 /user/yashu/test4/sam.txt
In the example above, the HDFS file initially has size 21 and date 2017-12-11 15:42; after I appended to it, the size and date changed. The NameNode has to record the new metadata for the file and update the replicated blocks as well (see the HDFS metadata documentation).
Appending does not reduce performance even when the files are large, so appending new data to the existing file is fine.
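If you want to see how the blocks are laid out after an append, a quick check with fsck (using the file from the example above) would look like this sketch:
hdfs fsck /user/yashu/test4/sam.txt -files -blocks   # lists the file's blocks, their sizes, and replication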
Created ‎12-13-2017 12:50 PM
Thanks for the clarification @Shu
