1) When I run the command below, I see a lot of very small text files. I need to merge the data from all of these text files into a single, larger text file.
Command Given on the Terminal: hadoop fs -ls hadoopcli2/queue_paths
Found 13 items
-rw-r--r-- 3 sg865w hdfs 113 2016-05-12 13:02 hadoopcli2/queue_paths/2016-02-12-.txt
-rw-r--r-- 3 sg865w hdfs 114 2016-05-12 13:02 hadoopcli2/queue_paths/2016-02-12-01-02.txt
-rw-r--r-- 3 sg865w hdfs 114 2016-05-12 13:04 hadoopcli2/queue_paths/2016-04-12-01-4.txt
-rw-r--r-- 3 sg865w hdfs 112 2016-05-12 13:06 hadoopcli2/queue_paths/2016-05-12-01-06.txt
-rw-r--r-- 3 sg865w hdfs 111 2016-05-12 14:31 hadoopcli2/queue_paths/2016-05-12-01-17.txt
-rw-r--r-- 3 sg865w hdfs 112 2016-05-12 13:21 hadoopcli2/queue_paths/2016-05-12-01-21.txt
-rw-r--r-- 3 sg865w hdfs 111 2016-05-12 13:31 hadoopcli2/queue_paths/2016-05-12-01-31.txt
-rw-r--r-- 3 sg865w hdfs 113 2016-05-12 14:53 hadoopcli2/queue_paths/2016-05-12-02-53.txt
-rw-r--r-- 3 sg865w hdfs 112 2016-05-12 16:10 hadoopcli2/queue_paths/2016-05-12-04-10.txt
-rw-r--r-- 3 sg865w hdfs 112 2016-05-12 13:03 hadoopcli2/queue_paths/2016-3-12-01-03.txt
-rw-r--r-- 3 sg865w hdfs 113 2016-05-12 13:06 hadoopcli2/queue_paths/2016-5-12-01-06.txt
-rw-r--r-- 3 sg865w hdfs 114 2016-05-12 12:57 hadoopcli2/queue_paths/2016-57-12-.txt
-rw-r--r-- 3 sg865w hdfs 113 2016-05-12 12:58 hadoopcli2/queue_paths/2016-58-12-.txt
Now, when I run the following command to merge all these text files into one, I do not get the expected result. Please help me with this.
Command I am Giving: hadoop fs -getmerge hadoopcli2/queue_paths/*.txt testing/queue_paths/2016-05-19-16-05.txt
2) All the text files shown above are generated every minute. If the merge above works, how can I write a cron job that runs it every 15 minutes? Can you please help me with this?
@shyam gurram Is there an automated process writing these many small files to HDFS? Can you update that process to write bigger chunks of data, or collect the data on your local FS and only push it to HDFS once it reaches a threshold?
1. getmerge combines the files and places the result on your local FS, not HDFS.
2. You can write a small MR job that reads all these files and writes out one bigger file.
3. Use the SequenceFile format to store the files.
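For option 1, if the merged file ultimately needs to live on HDFS, you can round-trip it through the client: getmerge to local disk, then -put the result back. A hedged sketch (the scratch path and the timestamped output filename are examples, not from the thread):

```shell
#!/bin/sh
# Sketch: merge small HDFS files on the client, then push the result back to HDFS.
# Exit quietly when the hadoop CLI is unavailable (e.g. when trying this locally).
command -v hadoop >/dev/null 2>&1 || exit 0
set -e

ts=$(date +%Y-%m-%d-%H-%M)            # e.g. 2016-05-19-16-05
local_tmp="/tmp/queue_paths-$ts.txt"  # scratch file on the client's local FS

# getmerge writes to the LOCAL filesystem, so merge there first...
hadoop fs -getmerge "hadoopcli2/queue_paths" "$local_tmp"

# ...then push the single big file back to HDFS and clean up locally.
hadoop fs -mkdir -p "testing/queue_paths"
hadoop fs -put -f "$local_tmp" "testing/queue_paths/$ts.txt"
rm -f "$local_tmp"
```

Note that this still moves all the data through the client, so it only makes sense for modest volumes; the MR/streaming approach below keeps the merge on the cluster.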
Thanks for the reply. Yes, this is an automated process: every minute, some log files land on HDFS. We need to merge all of these small log text files into one text file. I am using the -getmerge command, but that merges to the local file system, while my log files are landing on HDFS. Please help me with this.
The getmerge process merges on the client, which is the issue here. Your command should have worked, though (I tested the same thing on a sandbox and it produces a local file).
However, a better approach is to make the merge happen on the cluster rather than on the client. You can try something like this:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -Dmapred.reduce.tasks=1 -input /tmp/testgetmerge -output /tmp/getmergeoutput -mapper cat -reducer cat
There are other ways to do this as well, but this is one of the simpler approaches, and your data never leaves the cluster during the merge. You can also tweak your ingest process to write pre-merged files if you have control over ingest.
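For your second question, running such a merge every 15 minutes is just a crontab entry. A hedged sketch, assuming you wrap the streaming command above in a shell script (the script path and log path here are assumptions, not from the thread):

```
# crontab -e
# Every 15 minutes, run a hypothetical wrapper script around the hadoop
# streaming merge; append its output to a log file for troubleshooting.
*/15 * * * * /home/sg865w/merge_queue_paths.sh >> /tmp/merge_queue_paths.log 2>&1
```

Keep in mind the streaming job will fail if its -output directory already exists, so the wrapper script should write each run to a fresh (e.g. timestamped) output path.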
@shyam gurram Can you update your automated process, or use tools like Flume to collect the logs and put them on HDFS?