
Need to move all the individual text files into one text file in HDFS


Contributor

1) When I run the command below, I get a lot of text files. Each file is very small, so I need to combine the data from all of them into one larger text file.

Command Given on the Terminal: hadoop fs -ls hadoopcli2/queue_paths

Found 13 items

-rw-r--r-- 3 sg865w hdfs 113 2016-05-12 13:02 hadoopcli2/queue_paths/2016-02-12-.txt

-rw-r--r-- 3 sg865w hdfs 114 2016-05-12 13:02 hadoopcli2/queue_paths/2016-02-12-01-02.txt

-rw-r--r-- 3 sg865w hdfs 114 2016-05-12 13:04 hadoopcli2/queue_paths/2016-04-12-01-4.txt

-rw-r--r-- 3 sg865w hdfs 112 2016-05-12 13:06 hadoopcli2/queue_paths/2016-05-12-01-06.txt

-rw-r--r-- 3 sg865w hdfs 111 2016-05-12 14:31 hadoopcli2/queue_paths/2016-05-12-01-17.txt

-rw-r--r-- 3 sg865w hdfs 112 2016-05-12 13:21 hadoopcli2/queue_paths/2016-05-12-01-21.txt

-rw-r--r-- 3 sg865w hdfs 111 2016-05-12 13:31 hadoopcli2/queue_paths/2016-05-12-01-31.txt

-rw-r--r-- 3 sg865w hdfs 113 2016-05-12 14:53 hadoopcli2/queue_paths/2016-05-12-02-53.txt

-rw-r--r-- 3 sg865w hdfs 112 2016-05-12 16:10 hadoopcli2/queue_paths/2016-05-12-04-10.txt

-rw-r--r-- 3 sg865w hdfs 112 2016-05-12 13:03 hadoopcli2/queue_paths/2016-3-12-01-03.txt

-rw-r--r-- 3 sg865w hdfs 113 2016-05-12 13:06 hadoopcli2/queue_paths/2016-5-12-01-06.txt

-rw-r--r-- 3 sg865w hdfs 114 2016-05-12 12:57 hadoopcli2/queue_paths/2016-57-12-.txt

-rw-r--r-- 3 sg865w hdfs 113 2016-05-12 12:58 hadoopcli2/queue_paths/2016-58-12-.txt

Now, when I run this command to merge all these text files into one text file, I do not get the result. Please help me with this.

Command I am Giving: hadoop fs -getmerge hadoopcli2/queue_paths/*.txt testing/queue_paths/2016-05-19-16-05.txt

2) The text files shown above are generated every minute. If the above process is successful, how can I write a cron job that runs it every 15 minutes? Can you please help me with this?

5 REPLIES

Re: Need to move all the individual text files into one text file in HDFS

Expert Contributor

@shyam gurram Is there an automated process putting this many small files on HDFS? Can you change that process to write larger chunks of data, or collect the data on your local FS and push it to HDFS only once it reaches a size threshold?

1. getmerge combines the files and writes the result to your local FS, not HDFS.

2. You can write a small MR job that reads all these files and writes one bigger file.

3. Use the SequenceFile format to store the files.

See also: http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/
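
To illustrate point 1, here is a rough sketch (not a tested production script) of merging on the client and then pushing the single file back to HDFS. The function name merge_via_client and the HADOOP_BIN override are hypothetical, added only so the function can be dry-run. Note that the second argument to getmerge is a local path, so its parent directory must exist locally.

```shell
# Sketch: merge small HDFS files on the client, then upload the result to HDFS.
# HADOOP_BIN is a hypothetical override; by default the real "hadoop" CLI is used.
merge_via_client() {
  src="$1"     # HDFS directory or glob, e.g. hadoopcli2/queue_paths
  dest="$2"    # HDFS target file, e.g. testing/queue_paths/2016-05-19-16-05.txt
  hadoop_bin="${HADOOP_BIN:-hadoop}"

  # getmerge writes to the LOCAL file system, so stage the file in a temp dir.
  tmp_dir="$(mktemp -d)" || return 1

  "$hadoop_bin" fs -getmerge "$src" "$tmp_dir/merged.txt" &&
    "$hadoop_bin" fs -put -f "$tmp_dir/merged.txt" "$dest"
  status=$?

  # Clean up the local staging copy whether or not the upload succeeded.
  rm -rf "$tmp_dir"
  return "$status"
}
```

Usage would look like: merge_via_client hadoopcli2/queue_paths testing/queue_paths/2016-05-19-16-05.txt. All the data passes through the client machine, so this only suits small volumes.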


Re: Need to move all the individual text files into one text file in HDFS

And you could schedule the job with Oozie or Falcon to run regularly.
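
If Oozie or Falcon is more than you need, plain cron on an edge node may be enough for the 15-minute schedule asked about. Below is a minimal sketch of a wrapper script; the script name, its install path, and the log path are all made up for illustration. The timestamp is built inside the script rather than in the crontab line because % has a special meaning in crontab entries.

```shell
#!/bin/sh
# merge_queue_paths.sh (hypothetical name and location)
#
# Example crontab entry (crontab -e) to run it every 15 minutes:
#   */15 * * * * /path/to/merge_queue_paths.sh >> /tmp/merge_queue_paths.log 2>&1

# Build a file name in the same style as the question, e.g. 2016-05-19-16-05.txt
merged_file_name() {
  date +%Y-%m-%d-%H-%M.txt
}

DEST="testing/queue_paths/$(merged_file_name)"

# Placeholder: replace this echo with whichever merge command you settle on.
echo "would merge hadoopcli2/queue_paths into $DEST"
```

One thing this sketch does not handle: unless the merged source files are removed or moved aside after each run, every run will merge the same small files again.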

Re: Need to move all the individual text files into one text file in HDFS

Contributor
@Pradeep Bhadani

Thanks for the reply. Yes, this is an automated process. Every minute some log files land on HDFS, and we need to merge all these small log text files into one text file. I am using the -getmerge command, but as I understand it that works between the local file system and HDFS, whereas my log files are landing on HDFS. Please help me with this.


Re: Need to move all the individual text files into one text file in HDFS

Guru

getmerge performs the merge on the client, which is an issue. Your command should have worked (I tested the same thing on a sandbox and it wrote a local file).

However, a better approach is to make the merge happen on the cluster rather than on the client. You can try something like this:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -Dmapred.reduce.tasks=1 -input /tmp/testgetmerge -output /tmp/getmergeoutput -mapper cat -reducer cat

There are other ways to do this, but this is one of the simpler approaches, and your data will not leave the cluster during the merge. If you have control over the ingest process, you can also tweak it to write merged files in the first place.
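
Since the streaming job writes a directory rather than a single named file, a small wrapper can move the one reducer output into place afterwards. This is only a sketch: merge_on_cluster and HADOOP_BIN are hypothetical names, the jar path is the HDP one from the command above, and part-00000 is the output name I would expect from a single-reducer streaming job, so check it on your cluster.

```shell
# Jar location copied from the command above; adjust for your distribution.
STREAMING_JAR="${STREAMING_JAR:-/usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar}"

# Sketch: merge on the cluster, then rename the single part file.
# HADOOP_BIN is a hypothetical override; by default the real "hadoop" CLI is used.
merge_on_cluster() {
  input_dir="$1"   # HDFS directory holding the small files
  work_dir="$2"    # HDFS scratch directory; must not exist before the job runs
  dest_file="$3"   # final HDFS path for the merged file
  hadoop_bin="${HADOOP_BIN:-hadoop}"

  # One reducer means one output file under work_dir.
  "$hadoop_bin" jar "$STREAMING_JAR" -Dmapred.reduce.tasks=1 \
      -input "$input_dir" -output "$work_dir" -mapper cat -reducer cat || return 1

  # Move the reducer output to its final name, then drop the scratch directory.
  "$hadoop_bin" fs -mv "$work_dir/part-00000" "$dest_file" || return 1
  "$hadoop_bin" fs -rm -r "$work_dir"
}
```

Be aware that the shuffle sorts the lines, so the merged file will not preserve the original order, and streaming may treat the text before the first tab on each line as a key. Compare the output against a few input files before relying on it.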


Re: Need to move all the individual text files into one text file in HDFS

Expert Contributor

@shyam gurram Can you update your auto-process, or use a tool like Flume to collect the logs and write them to HDFS?

http://www.rittmanmead.com/2014/05/trickle-feeding-webserver-log-files-to-hdfs-using-apache-flume/