Support Questions
Find answers, ask questions, and share your expertise

How to join multiple csv files in folder to one output file?

Explorer

How can I join all the files in one folder into one single csv file?

I have a folder called Folder1 and I want to combine all the files in it into a single file called "output.csv".

I tried:

hadoop fs -getmerge Folder1 /user/maria_dev/output.csv

But I get the error:

getmerge: Mkdirs failed to create file:/user/maria_dev (exists=false, cwd=file:/home/maria_dev)

I also tried:

hadoop fs -cat Folder1 /output.csv 

But receive error: No such file or directory.

Thanks

1 ACCEPTED SOLUTION

Super Mentor

@Matt

Are you looking at the correct directory?

Can you please share one screenshot showing the complete PATH of the directory, along with all the commands that you are running from that directory.

.

View solution in original post

16 REPLIES

Super Mentor

@Matt

The "getmerge" command assumes that "Folder1" is an HDFS source directory and that the second argument "/user/maria_dev/output.csv" is a destination on the local filesystem, hence you see this error.

Here is a complete example which will help in understanding "getmerge":

Syntax:

[-getmerge [-nl]   <src>   <localdst>]

.


1. I have 3 Files in Sandbox as following:

[maria_dev@sandbox ~]$ cat /tmp/aa.txt
aa
[maria_dev@sandbox ~]$ cat /tmp/bb.txt
bb
[maria_dev@sandbox ~]$ cat /tmp/cc.txt
cc


2. I have placed those files to HDFS "/user/maria_dev/test" directory as following:

[maria_dev@sandbox ~]$ hdfs dfs -mkdir /user/maria_dev/test
[maria_dev@sandbox ~]$ hdfs dfs -put /tmp/aa.txt /user/maria_dev/test
[maria_dev@sandbox ~]$ hdfs dfs -put /tmp/bb.txt /user/maria_dev/test
[maria_dev@sandbox ~]$ hdfs dfs -put /tmp/cc.txt /user/maria_dev/test



3. Following files are now present on HDFS.

[maria_dev@sandbox ~]$ hdfs dfs -ls /user/maria_dev/test
Found 3 items
-rw-r--r--  1 maria_dev hadoop  3 2018-01-05 23:39 /user/maria_dev/test/aa.txt
-rw-r--r--  1 maria_dev hadoop  3 2018-01-05 23:39 /user/maria_dev/test/bb.txt
-rw-r--r--  1 maria_dev hadoop  3 2018-01-05 23:39 /user/maria_dev/test/cc.txt



4. Now do a "getmerge" as follows. The following command will merge the contents of all the files present in the "/user/maria_dev/test/" HDFS directory into the ("local filesystem") file "/tmp/test.txt".

[maria_dev@sandbox ~]$ hdfs dfs -getmerge /user/maria_dev/test/* /tmp/test.txt
[maria_dev@sandbox ~]$ cat /tmp/test.txt 
aa
bb
cc
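(Side note: the merge itself is a plain byte-level concatenation of the source files, in sorted name order. The same behaviour can be sketched with only local files, no HDFS needed; the "/tmp/getmerge_demo" path below is made up for this illustration.)

```shell
# Create three small local files, mimicking the HDFS files above
mkdir -p /tmp/getmerge_demo
printf 'aa\n' > /tmp/getmerge_demo/aa.txt
printf 'bb\n' > /tmp/getmerge_demo/bb.txt
printf 'cc\n' > /tmp/getmerge_demo/cc.txt

# Concatenate them in name order, which is what getmerge does
# with its HDFS sources
cat /tmp/getmerge_demo/*.txt > /tmp/getmerge_demo/merged.out
cat /tmp/getmerge_demo/merged.out
```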

.

5. Now put the merged file back to HDFS.

[maria_dev@sandbox ~]$ hdfs dfs -put /tmp/test.txt /user/maria_dev/test/

[maria_dev@sandbox ~]$ hdfs dfs -ls /user/maria_dev/test
Found 4 items
-rw-r--r--   1 maria_dev hadoop          3 2018-01-05 23:39 /user/maria_dev/test/aa.txt
-rw-r--r--   1 maria_dev hadoop          3 2018-01-05 23:39 /user/maria_dev/test/bb.txt
-rw-r--r--   1 maria_dev hadoop          3 2018-01-05 23:39 /user/maria_dev/test/cc.txt
-rw-r--r--   1 maria_dev hadoop          9 2018-01-05 23:55 /user/maria_dev/test/test.txt

.

.

Explorer

@Jay Kumar SenSharma

thank you. Two questions:

1. Is there a way to merge the files directly from HDFS, or do you need to merge them to local file system and then back to HDFS?

2. I was following your instructions, but on point 4 with getmerge, I used this:

hdfs dfs -getmerge /user/maria_dev/Folder1/* /maria_dev/Folder1/output.csv

I have a folder called Folder1 (it also exists on the local file system under the maria_dev folder as Folder1) but I get the same error:

getmerge: Mkdirs failed to create file:/maria_dev/Folder1 (exists=false, cwd=file:/home/maria_dev)

Have I missed a step or written this incorrectly?

Thanks

Explorer

@Jay Kumar SenSharma

thank you. Two questions:

1. Is there a way to merge the files directly on HDFS, or do you need to merge to local file system then put back on HDFS?

2. I followed your instructions but on point no. 4 I used:

hdfs dfs -getmerge /user/maria_dev/Folder1/* /Folder1/output.csv

I have a folder called Folder1 on HDFS, and the same folder exists on the local system, but I got the same error:

getmerge: Mkdirs failed to create file:/Folder1 (exists=false, cwd=file:/home/maria_dev)

Not sure why this occurred. Have I missed a step or typed something incorrectly?

Thanks

Super Mentor

@Matt

When you run the command as following:
hdfs dfs -getmerge /user/maria_dev/Folder1/* /Folder1/output.csv

Then it expects the "/Folder1/" part of the second argument to be a valid directory on your local filesystem.

Hence you will first need to create the "/Folder1" directory on your local machine:

# mkdir "/Folder1/"

Then you should be able to run:

# hdfs dfs -getmerge /user/maria_dev/Folder1/* /Folder1/output.csv

.
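As for the other question, merging directly on HDFS without a local round trip: "getmerge" always writes to the local filesystem, but piping "hdfs dfs -cat" into "hdfs dfs -put -" (which reads from stdin) is a common alternative. A minimal sketch, assuming the Folder1 paths from this thread; the local-file part below just demonstrates the same pipe idea so it can be run anywhere:

```shell
# On a cluster, an HDFS-to-HDFS merge can be done with a pipe
# ("put -" reads from stdin); not run here:
#   hdfs dfs -cat /user/maria_dev/Folder1/* | hdfs dfs -put - /user/maria_dev/output.csv

# The same concatenation idea with ordinary local files
# ("/tmp/pipe_demo" is a made-up path for this sketch):
mkdir -p /tmp/pipe_demo
printf '1,a\n' > /tmp/pipe_demo/part1.csv
printf '2,b\n' > /tmp/pipe_demo/part2.csv
cat /tmp/pipe_demo/part*.csv > /tmp/pipe_demo/output.csv
cat /tmp/pipe_demo/output.csv
```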

Super Mentor

@Matt

Similarly if you want to run the following command:

# hdfs dfs -getmerge /user/maria_dev/Folder1/* /maria_dev/Folder1/output.csv

.

Then you will need to make sure that the PATH "/maria_dev/Folder1/" exists on your local machine (sandbox):

# mkdir -p  /maria_dev/Folder1/
# hdfs dfs -getmerge /user/maria_dev/Folder1/* /maria_dev/Folder1/output.csv

.

Another Example:

# mkdir -p /tmp/aa/bb/cc/dd/Folder1/
# hdfs dfs -getmerge /user/maria_dev/Folder1/* /tmp/aa/bb/cc/dd/Folder1/output.csv

.

Super Mentor

@Matt

The user who runs the "hdfs" command should have WRITE permission on the local filesystem in order to create the directories.

Else you will see the following error:

[maria_dev@sandbox ~]$ hdfs dfs -getmerge /user/maria_dev/Folder1/* /Folder1/output.csv
getmerge: Mkdirs failed to create file:/Folder1 (exists=false, cwd=file:/home/maria_dev)

.

This error indicates that the Sandbox user "maria_dev" does not have privileges to create that directory on the local filesystem.

[maria_dev@sandbox ~]$ mkdir /Folder1
mkdir: cannot create directory `/Folder1': Permission denied

.

So you will need to make sure of two things:

Thumb Rule:

1. The PATH mentioned in the second argument of the "getmerge" command exists.

2. The operating system user (like "maria_dev") who runs the "getmerge" command has read and write permission on that PATH.
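Both checks can be scripted before running "getmerge". A minimal sketch, using a made-up "/tmp/preflight_demo" destination:

```shell
# Pre-flight check for the getmerge destination: the parent directory
# must exist and be writable by the current user
DEST=/tmp/preflight_demo/output.csv
DEST_DIR=$(dirname "$DEST")

mkdir -p "$DEST_DIR"   # create it if missing (paths like /Folder1 may need root)
if [ -d "$DEST_DIR" ] && [ -w "$DEST_DIR" ]; then
  echo "ok: $DEST_DIR exists and is writable"
else
  echo "error: $DEST_DIR missing or not writable" >&2
fi
```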

Super Mentor

@Matt

Example:

[maria_dev@sandbox ~]$ ls -l /Folder1/
ls: cannot access /Folder1/: No such file or directory

As the above directory does not exist, we see the following error:

[maria_dev@sandbox ~]$ hdfs dfs -getmerge /user/maria_dev/test/* /Folder1/merged_files.txt
getmerge: Mkdirs failed to create file:/Folder1 (exists=false, cwd=file:/home/maria_dev)

.
As the directory does not exist, we will need to create that PATH first.

[maria_dev@sandbox ~]$ mkdir /Folder1
mkdir: cannot create directory `/Folder1': Permission denied

.
Running the directory creation command as "root":

[maria_dev@sandbox ~]$ exit

[root@sandbox ~]# mkdir -p /Folder1
[root@sandbox ~]# chmod 777 -R /Folder1/


Now that the "maria_dev" user has read-write permission on the PATH "/Folder1", we can run the command:

[root@sandbox ~]# su - maria_dev
[maria_dev@sandbox ~]$ hdfs dfs -getmerge /user/maria_dev/test/* /Folder1/merged_files.txt

.

Explorer

@Jay Kumar SenSharma

thank you. The Folder1 folder does indeed exist. This worked for me:

hadoop fs -getmerge /user/maria_dev/Folder1/* output.csv

I cannot seem to use any of the "hdfs dfs" commands above.

The above gave me an output file, but it contained only the first file, i.e. it did not join the second file to it.

Super Mentor

@Matt

Please refer to my last example and explanation (which I just posted a few seconds ago) about the user permission on the second argument path.

Explorer

@Jay Kumar SenSharma

thank you for your help with this. I am still not quite there.

The Folder1 definitely exists as shown in the attached (pic4) with two files.

I then tried:

chmod 777 -R Folder1

but then when I ran the getmerge command again, it produced the same error?

pic4.jpg

Super Mentor

@Matt

Your attached image "pic4.jpg" says that "maria_dev" user does not have WRITE permission on the "Folder1" contents. Only "root" user has WRITE permission on that directory.

-rw-r--r--

.

So you will need to make sure that when you run "ls -l" you see WRITE permission.

[root@sandbox ~]# mkdir -p /Folder1
[root@sandbox ~]# chmod 777 -R /Folder1/

(OR)

[root@sandbox ~]# chown maria_dev:hadoop -R /Folder1/

.

For testing, try creating a simple file inside the folder where you are getting the error (this is just to check whether the "maria_dev" user has permission to create files inside that folder):

[root@sandbox ~]# su - maria_dev

[maria_dev@sandbox ~]$ cd /Folder1/

[maria_dev@sandbox Folder1]$ echo "Hello"  > test.txt

.

Explorer

@Jay Kumar SenSharma

I changed the permissions so that "ls -l" now shows WRITE permission, as shown in the attached pic.

I also created test.txt (pic5.jpg) but where should I see it now? It is not showing in the list, and also not showing in my directory in WinSCP.

Super Mentor

@Matt

Are you looking at the correct directory?

Can you please share one screenshot showing the complete PATH of the directory, along with all the commands that you are running from that directory.

.

Explorer

@Jay Kumar SenSharma

do you mean from WinSCP? I am not sure. I have included pics in my previous post. If you could take a look, that would be most appreciated. When I create a Folder1 directory, it does not show up under root - home - maria_dev.

Explorer

I was looking at the wrong directory. This appears to have resolved the issue. Thank you.

Explorer

@Jay Kumar SenSharma

are you referring to the WinSCP directory? If so, please see the screenshot attached (pic6.jpg).

There is no Folder1 here so this is why I am still confused.

Pic7 (pic7.jpg) shows the directory under root (pic6 was as maria_dev). The Folder1 shows here, but I am still unsure how to progress.

Should I be logged into WinSCP as root or maria_dev?
