
How to join multiple CSV files in a folder into one output file?

Explorer

How can I join all the files in one folder into one single CSV file?

I have a folder called Folder1 and I want to combine all the files in it into a file called "output.csv".

I tried:

hadoop fs -getmerge Folder1 /user/maria_dev/output.csv

But I get the error:

getmerge: Mkdirs failed to create file:/user/maria_dev (exists=false, cwd=file:/home/maria_dev)

I also tried:

hadoop fs -cat Folder1 /output.csv 

But I receive the error: No such file or directory.

Thanks

1 ACCEPTED SOLUTION

Master Mentor

@Matt

Are you looking at the correct directory?

Can you please share the complete path of the directory, along with one screenshot of all the commands you are running from that directory?

.


16 REPLIES

Master Mentor

@Matt

The "getmerge" command treats the first argument "Folder1" as the HDFS source directory and the second argument "/user/maria_dev/output.csv" as a destination on the local filesystem. Since "/user/maria_dev" does not exist as a local directory, you see this error.

Here is a complete example which should help in understanding "getmerge".

Syntax:

[-getmerge [-nl]   <src>   <localdst>]

.
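The root cause can be reproduced without Hadoop at all: the destination directory in the second argument must already exist on the *local* filesystem. A minimal sketch (the path is taken from the error message above; on most machines it does not exist locally):

```shell
#!/bin/sh
# "/user/maria_dev" is a conventional HDFS home path, but as a *local*
# path it normally does not exist -- which is exactly why Mkdirs fails.
DEST_DIR=/user/maria_dev
if [ -d "$DEST_DIR" ]; then
  echo "local directory $DEST_DIR exists; getmerge could write there"
else
  echo "local directory $DEST_DIR is missing; getmerge cannot create the output file"
fi
```

Running this on a typical Sandbox prints the "missing" branch, matching the `Mkdirs failed` error.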


1. I have 3 files in the Sandbox:

[maria_dev@sandbox ~]$ cat /tmp/aa.txt
aa
[maria_dev@sandbox ~]$ cat /tmp/bb.txt
bb
[maria_dev@sandbox ~]$ cat /tmp/cc.txt
cc


2. I have placed those files into the HDFS directory "/user/maria_dev/test":

[maria_dev@sandbox ~]$ hdfs dfs -mkdir /user/maria_dev/test
[maria_dev@sandbox ~]$ hdfs dfs -put /tmp/aa.txt /user/maria_dev/test
[maria_dev@sandbox ~]$ hdfs dfs -put /tmp/bb.txt /user/maria_dev/test
[maria_dev@sandbox ~]$ hdfs dfs -put /tmp/cc.txt /user/maria_dev/test



3. The following files are now present on HDFS:

[maria_dev@sandbox ~]$ hdfs dfs -ls /user/maria_dev/test
Found 3 items
-rw-r--r--  1 maria_dev hadoop  3 2018-01-05 23:39 /user/maria_dev/test/aa.txt
-rw-r--r--  1 maria_dev hadoop  3 2018-01-05 23:39 /user/maria_dev/test/bb.txt
-rw-r--r--  1 maria_dev hadoop  3 2018-01-05 23:39 /user/maria_dev/test/cc.txt



4. Now do a "getmerge". The following command merges the contents of all the files in the HDFS directory "/user/maria_dev/test/" into the local-filesystem file "/tmp/test.txt":

[maria_dev@sandbox ~]$ hdfs dfs -getmerge /user/maria_dev/test/* /tmp/test.txt
[maria_dev@sandbox ~]$ cat /tmp/test.txt 
aa
bb
cc

.

5. Now put the merged file back onto HDFS:

[maria_dev@sandbox ~]$ hdfs dfs -put /tmp/test.txt /user/maria_dev/test/

[maria_dev@sandbox ~]$ hdfs dfs -ls /user/maria_dev/test
Found 4 items
-rw-r--r--   1 maria_dev hadoop          3 2018-01-05 23:39 /user/maria_dev/test/aa.txt
-rw-r--r--   1 maria_dev hadoop          3 2018-01-05 23:39 /user/maria_dev/test/bb.txt
-rw-r--r--   1 maria_dev hadoop          3 2018-01-05 23:39 /user/maria_dev/test/cc.txt
-rw-r--r--   1 maria_dev hadoop          9 2018-01-05 23:55 /user/maria_dev/test/test.txt

.

.

Explorer

@Jay Kumar SenSharma

Thank you. Two questions:

1. Is there a way to merge the files directly on HDFS, or do you need to merge them to the local filesystem and then put them back on HDFS?

2. I was following your instructions, but at step 4 with getmerge I used this:

hdfs dfs -getmerge /user/maria_dev/Folder1/* /maria_dev/Folder1/output.csv

I have a folder called Folder1 on HDFS (it also exists on the local filesystem under the maria_dev folder as Folder1), but I get the same error:

getmerge: Mkdirs failed to create file:/maria_dev/Folder1 (exists=false, cwd=file:/home/maria_dev)

Have I missed a step or written this incorrectly?

Thanks

Explorer

@Jay Kumar SenSharma

Thank you. Two questions:

1. Is there a way to merge the files directly on HDFS, or do you need to merge them to the local filesystem and then put them back on HDFS?

2. I followed your instructions, but at step 4 I used:

hdfs dfs -getmerge /user/maria_dev/Folder1/* /Folder1/output.csv

I have a folder called Folder1 on HDFS, and the same folder exists on the local system, but I got the same error:

getmerge: Mkdirs failed to create file:/Folder1 (exists=false, cwd=file:/home/maria_dev)

Not sure why this occurred. Have I missed a step or typed something incorrectly?

Thanks
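On question 1 above: as far as I know, the usual way to merge directly on HDFS without the local round trip is to pipe `hdfs dfs -cat` into `hdfs dfs -put -` (with "-", put reads from stdin). The same pipe idea, sketched with local files so it can be run anywhere (all paths below are made up for illustration):

```shell
#!/bin/sh
# Local stand-in for an HDFS source directory.
SRC=/tmp/hdfs_sim_folder1
mkdir -p "$SRC"
printf 'a,1\n' > "$SRC/part-0001.csv"
printf 'b,2\n' > "$SRC/part-0002.csv"

# On a real cluster the analogous one-liner would be (untested here):
#   hdfs dfs -cat /user/maria_dev/Folder1/* | hdfs dfs -put - /user/maria_dev/output.csv
cat "$SRC"/*.csv > "$SRC/output.csv"

cat "$SRC/output.csv"
```

The merged file contains both rows, in the shell glob's sorted order, and nothing ever touches a separate staging directory.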

Master Mentor

@Matt

When you run the command as follows:

hdfs dfs -getmerge /user/maria_dev/Folder1/* /Folder1/output.csv

then it expects the second argument's directory, "/Folder1/", to be a valid directory on your local filesystem.

Hence you will first need to create the "/Folder1" directory on your local machine:

# mkdir "/Folder1/"

Then you should be able to run:

# hdfs dfs -getmerge /user/maria_dev/Folder1/* /Folder1/output.csv

.

Master Mentor

@Matt

Similarly, if you want to run the following command:

# hdfs dfs -getmerge /user/maria_dev/Folder1/* /maria_dev/Folder1/output.csv

.

Then you will need to make sure that the path "/maria_dev/Folder1/" exists on your local machine (sandbox):

# mkdir -p  /maria_dev/Folder1/
# hdfs dfs -getmerge /user/maria_dev/Folder1/* /maria_dev/Folder1/output.csv

.

Another Example:

# mkdir -p /tmp/aa/bb/cc/dd/Folder1/
# hdfs dfs -getmerge /user/maria_dev/Folder1/* /tmp/aa/bb/cc/dd/Folder1/output.csv

.

Master Mentor

@Matt

The user running the "hdfs" command must have WRITE permission on the local filesystem in order to create the directories.

Otherwise you will see the following error:

[maria_dev@sandbox ~]$ hdfs dfs -getmerge /user/maria_dev/Folder1/* /maria_dev/Folder1/output.csv
getmerge: Mkdirs failed to create file:/maria_dev/Folder1 (exists=false, cwd=file:/home/maria_dev)

.

This error indicates that the Sandbox user "maria_dev" does not have privileges to create that directory on the local filesystem:

[maria_dev@sandbox ~]$ mkdir /Folder1
mkdir: cannot create directory `/Folder1': Permission denied

.

So you will need to ensure two things:

Thumb rule:

1. The PATH given as the second argument of the "getmerge" command exists.

2. The operating-system user (like "maria_dev") running the "getmerge" command has read and write permission on that PATH.
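The thumb rule above can be turned into a quick pre-flight check before running "getmerge". A sketch, using a made-up destination path (substitute your own):

```shell
#!/bin/sh
# Hypothetical local destination for getmerge; adjust to your setup.
LOCAL_DEST=/tmp/Folder1/output.csv
DEST_DIR=$(dirname "$LOCAL_DEST")

# Rule 1: the destination directory must exist.
mkdir -p "$DEST_DIR"

# Rule 2: the current user must be able to write into it.
if [ -d "$DEST_DIR" ] && [ -w "$DEST_DIR" ]; then
  echo "ok: $DEST_DIR exists and is writable, safe to run getmerge"
else
  echo "create $DEST_DIR (or fix its permissions) before running getmerge"
fi
```

Under /tmp the `mkdir -p` succeeds for any user, so the check passes; for a root-owned path like "/Folder1" it would print the second message, matching the Permission denied case shown later in this thread.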

Master Mentor

@Matt

Example:

[maria_dev@sandbox ~]$ ls -l /Folder1/
ls: cannot access /Folder1/: No such file or directory

As the above directory does not exist, we see the following error:

[maria_dev@sandbox ~]$ hdfs dfs -getmerge /user/maria_dev/test/* /Folder1/merged_files.txt
getmerge: Mkdirs failed to create file:/Folder1 (exists=false, cwd=file:/home/maria_dev)

.
Since the directory does not exist, we will need to create that PATH first:

[maria_dev@sandbox ~]$ mkdir /Folder1
mkdir: cannot create directory `/Folder1': Permission denied

.
Running the directory-creation command as "root":

[maria_dev@sandbox ~]$ exit

[root@sandbox ~]# mkdir -p /Folder1
[root@sandbox ~]# chmod 777 -R /Folder1/


Now that the "maria_dev" user has read-write permission on "/Folder1", we can run the command:

[root@sandbox ~]# su - maria_dev
[maria_dev@sandbox ~]$ hdfs dfs -getmerge /user/maria_dev/test/* /Folder1/merged_files.txt

.

Explorer

@Jay Kumar SenSharma

Thank you. The Folder1 folder does indeed exist. This worked for me:

hadoop fs -getmerge /user/maria_dev/Folder1/* output.csv

I cannot seem to get any of the "hdfs dfs" commands above to work.

The above gave me an output file, but it contained only the first file; it did not join the second file to it?
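One possible explanation for the "only the first file" symptom (an assumption, since the actual file contents are not shown in the thread): getmerge concatenates files byte for byte, so if the first CSV does not end with a newline, the first row of the second file is fused onto the last row of the first, and at a glance the output can look like just one file. The `-nl` option of getmerge adds a newline after each file to avoid this. The fusing effect, sketched with local files and plain `cat`:

```shell
#!/bin/sh
# Demo directory and file names are made up for illustration.
DIR=/tmp/csv_newline_demo
mkdir -p "$DIR"
printf 'row1\nrow2'   > "$DIR/first.csv"   # note: NO trailing newline
printf 'row3\nrow4\n' > "$DIR/second.csv"

# Byte-for-byte concatenation, which is what getmerge does without -nl.
cat "$DIR/first.csv" "$DIR/second.csv" > "$DIR/merged.csv"

cat "$DIR/merged.csv"
```

In the merged output, "row2" and "row3" appear fused on one line, so it is worth checking whether the first CSV ends with a newline, or simply re-running with `hadoop fs -getmerge -nl ...`.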

Master Mentor

@Matt

Please refer to my last example and explanation (which I posted a few seconds ago) about the user's permission on the second-argument path.