Explorer
Posts: 6
Registered: ‎07-07-2015
Accepted Solution

Duplicate Directories in HDFS

Hi All,

 

Our application team created HDFS directories with the script below.


hadoop fs -mkdir /a/b/c/d/20160208
hadoop fs -mkdir /a/b/c/d/20160208/s
hadoop fs -mkdir /a/b/c/d/20160208/s/inputmap
hadoop fs -mkdir /a/b/c/d/20160208/s/temp
hadoop fs -mkdir /a/b/c/d/20160208/s/map
hadoop fs -mkdir /a/b/c/d/20160208/s/input
hadoop fs -copyFromLocal /x/y/z/20160208.dat /a/b/c/d/20160208/s/inputmap

echo "Setup Complete"

 

The directories got created, but we get an error when we try to access them.

 

hdfs@hostname$ hadoop fs -ls /a/b/c/d/
Found 20 items
drwxr-xr-x - user group 0 2016-01-27 09:10 /a/b/c/d/20141211
drwxr-xr-x - user group 0 2016-01-06 01:03 /a/b/c/d/20141212
drwxr-xr-x - user group 0 2016-01-06 01:09 /a/b/c/d/20141213
drwxr-xr-x - user group 0 2015-11-12 08:53 /a/b/c/d/20151106
drwxr-xr-x - user group 0 2016-01-12 01:48 /a/b/c/d/20151118
drwxr-xr-x - user group 0 2015-12-04 04:21 /a/b/c/d/20151130
drwxrwxr-x - user group 0 2016-01-12 10:48 /a/b/c/d/20151221
drwxr-xr-x - user group 0 2016-01-19 11:23 /a/b/c/d/20160111
drwxr-xr-x - user group 0 2016-01-27 14:56 /a/b/c/d/20160112
drwxr-xr-x - user group 0 2016-02-02 16:12 /a/b/c/d/20160125
drwxr-xr-x - user group 0 2016-02-08 12:41 /a/b/c/d/20160126
drwxr-xr-x - user group 0 2016-02-08 10:26 /a/b/c/d/20160127
drwxr-xr-x - user group 0 2016-01-29 10:48 /a/b/c/d/20160129
drwxr-xr-x - user group 0 2016-02-09 02:43 /a/b/c/d/20160203
drwxr-xr-x - user group 0 2016-02-09 02:42 /a/b/c/d/20160204
drwxr-xr-x - user group 0 2016-02-08 15:38 /a/b/c/d/20160205
drwxr-xr-x - user group 0 2016-02-08 09:02 /a/b/c/d/20160205
drwxr-xr-x - user group 0 2016-02-08 07:00 /a/b/c/d/20160206
drwxr-xr-x - user group 0 2016-02-09 17:11 /a/b/c/d/20160208
drwxr-xr-x - user group 0 2016-02-08 11:07 /a/b/c/d/20160208
hdfs@hostname$ hadoop fs -ls /a/b/c/d/20160206
ls: `/a/b/c/d/20160206': No such file or directory

 

When we piped the "ls" output through cat -v, we found that a special character had been inserted into some of the directory names, as shown below.

 


hdfs@hostname$ hadoop fs -ls /a/b/c/d/ | cat -v
Found 20 items
drwxr-xr-x - user group 0 2016-01-27 09:10 /a/b/c/d//20141211
drwxr-xr-x - user group 0 2016-01-06 01:03 /a/b/c/d//20141212
drwxr-xr-x - user group 0 2016-01-06 01:09 /a/b/c/d//20141213
drwxr-xr-x - user group 0 2015-11-12 08:53 /a/b/c/d//20151106
drwxr-xr-x - user group 0 2016-01-12 01:48 /a/b/c/d//20151118
drwxr-xr-x - user group 0 2015-12-04 04:21 /a/b/c/d//20151130
drwxrwxr-x - user group 0 2016-01-12 10:48 /a/b/c/d//20151221
drwxr-xr-x - user group 0 2016-01-19 11:23 /a/b/c/d//20160111
drwxr-xr-x - user group 0 2016-01-27 14:56 /a/b/c/d//20160112
drwxr-xr-x - user group 0 2016-02-02 16:12 /a/b/c/d//20160125
drwxr-xr-x - user group 0 2016-02-08 12:41 /a/b/c/d//20160126
drwxr-xr-x - user group 0 2016-02-08 10:26 /a/b/c/d//20160127
drwxr-xr-x - user group 0 2016-01-29 10:48 /a/b/c/d//20160129
drwxr-xr-x - user group 0 2016-02-09 02:43 /a/b/c/d//20160203
drwxr-xr-x - user group 0 2016-02-09 02:42 /a/b/c/d//20160204
drwxr-xr-x - user group 0 2016-02-08 15:38 /a/b/c/d//20160205
drwxr-xr-x - user group 0 2016-02-08 09:02 /a/b/c/d//20160205^M
drwxr-xr-x - user group 0 2016-02-08 07:00 /a/b/c/d//20160206^M
drwxr-xr-x - user group 0 2016-02-09 17:11 /a/b/c/d//20160208
drwxr-xr-x - user group 0 2016-02-08 11:07 /a/b/c/d//20160208^M
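For reference, the ^M rendering itself is just how cat -v displays a carriage return; a minimal local illustration (plain printf, no HDFS involved):

```shell
# cat -v renders a carriage return (\r) as ^M
printf '20160205\r\n' | cat -v
# prints: 20160205^M
```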

 

Now I want to delete these duplicate entries. Can anyone help me with this?

 

Thanks

Srini

Cloudera Employee
Posts: 29
Registered: ‎08-19-2013

Re: Duplicate Directories in HDFS

You would handle this the same way as if the issue occurred on a Linux filesystem: use quotes around the filename and ctrl-v to insert the special characters.

 

In this case, I type ctrl-v then ctrl-m to insert ^M into my strings.

 

 

$ hdfs dfs -put /etc/group "/tmp/abc^M"

$ hdfs dfs -ls /tmp
Found 4 items
drwxrwxrwx   - hdfs   supergroup          0 2016-02-11 11:29 /tmp/.cloudera_health_monitoring_canary_files
-rw-r--r--   3 hdfs   supergroup        954 2016-02-11 11:30 /tmp/abc
drwx-wx-wx   - hive   supergroup          0 2016-01-11 12:10 /tmp/hive
drwxrwxrwt   - mapred hadoop              0 2016-01-11 12:08 /tmp/logs

$ hdfs dfs -ls /tmp | cat -v
Found 4 items
drwxrwxrwx   - hdfs   supergroup          0 2016-02-11 11:30 /tmp/.cloudera_health_monitoring_canary_files
-rw-r--r--   3 hdfs   supergroup        954 2016-02-11 11:30 /tmp/abc^M
drwx-wx-wx   - hive   supergroup          0 2016-01-11 12:10 /tmp/hive
drwxrwxrwt   - mapred hadoop              0 2016-01-11 12:08 /tmp/logs

$ hdfs dfs -mv "/tmp/abc^M" /tmp/abc

$ hdfs dfs -ls /tmp | cat -v
Found 4 items
drwxrwxrwx   - hdfs   supergroup          0 2016-02-11 11:31 /tmp/.cloudera_health_monitoring_canary_files
-rw-r--r--   3 hdfs   supergroup        954 2016-02-11 11:30 /tmp/abc
drwx-wx-wx   - hive   supergroup          0 2016-01-11 12:10 /tmp/hive
drwxrwxrwt   - mapred hadoop              0 2016-01-11 12:08 /tmp/logs

 

 

Cloudera Employee
Posts: 29
Registered: ‎08-19-2013

Re: Duplicate Directories in HDFS

In my example I used -mv.  You would use -rmdir.

 

hdfs dfs -rmdir "/a/b/c/d//20160205^M"

 

Remember, to get "^M" type ctrl-v ctrl-m.
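As a local analogue on a regular Linux filesystem, the same quoting approach works for mkdir/rmdir; here the carriage return is built with printf instead of typed with ctrl-v, and the directory name is made up for illustration:

```shell
# $(printf '\r') yields the CR byte; command substitution strips
# trailing newlines only, so the \r survives.
cr=$(printf '\r')
mkdir "dup20160208${cr}"
ls -d dup20160208* | cat -v    # shows: dup20160208^M
rmdir "dup20160208${cr}"
```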

Contributor
Posts: 47
Registered: ‎09-12-2014

Re: Duplicate Directories in HDFS

[ Edited ]

Thanks, Denloe, for your response. I have one more question.

 

What if some other text follows the ^M in the directory name? Should we use

 

hdfs dfs -rmdir "/a/b/c/d//20160205^Msomepart"

or do we need some escape sequence for this?

 

Moreover, when I press ctrl-v or ctrl-m, the command executes immediately (it does not let me type the second word).

Thanks,
Sathish (Satz)
Cloudera Employee
Posts: 29
Registered: ‎08-19-2013

Re: Duplicate Directories in HDFS

The non-printable character may be located anywhere in the filename.  You just need to insert it in the appropriate location when quoting the filename.

 

Using ctrl-v to insert special characters is the default for the bash shell, but your terminal emulator (especially if you are coming in from Windows) may be catching it instead.

 

Try using shift-insert instead of ctrl-v. If that fails, you may need to find an alternate method to embed the control characters, such as using vi to create a bash script and inserting them there.

Posts: 1,566
Kudos: 287
Solutions: 240
Registered: ‎07-31-2013

Re: Duplicate Directories in HDFS

If it helps, I prefer the simpler bash syntax of escaped special characters:

We know that ^M is the same as \r, which makes sense if you used Windows
Notepad to write the commands but forgot to convert the file via dos2unix:

~> echo $'\x0d' | cat -v
^M
~> echo -n $'\x0d' | od -c
0000000 \r
0000002

(The \x0D or \x0d is the hex equivalent of \r, per
http://www.asciitable.com/ (carriage return))

Therefore, you can use the $'' syntax to write a string that includes the
escape:

~> hadoop fs -ls $'/a/b/c/d/20160206\r'
Or,
~> hadoop fs -ls $'/a/b/c/d/20160206\x0d'

This works well regardless of the terminal emulator you are using, because we're escaping based on the byte's representation rather than relying on the emulator to understand the characters as typed input.
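To double-check the equivalence on any machine: ^M, \r, and \x0d all denote the same single byte, which a quick local sanity check confirms:

```shell
# $'\r' and $'\x0d' expand to the same byte, 0x0d
a=$'\r'
b=$'\x0d'
[ "$a" = "$b" ] && echo "same byte"
printf '%s' "$a" | od -An -tx1    # shows the hex value 0d
```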
Backline Customer Operations Engineer
New Contributor
Posts: 5
Registered: ‎03-08-2016

Re: Duplicate Directories in HDFS

There is a simple method to remove those.

 

1. Capture the directory listing in a text file, like below:

 

hadoop fs -ls /path > list

2. cat -t list will show you which entries carry the junk character.

3. In another shell, comment lines out with # to mark the exact culprits.

4. Run cat -t on the file again to confirm you commented the culprits.

5. Remove the original (clean) folders from the list, leaving only the bad paths.

6. Loop over the file and remove each entry:

for i in `cat list`; do hadoop fs -rmr "$i"; done
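The steps above can be sketched with a simulated listing (local illustration only; in the real case the input would come from hadoop fs -ls, the loop body would call hadoop fs -rmr, and the paths here are made up):

```shell
# Simulate a listing containing one clean and one CR-contaminated path
printf '/a/b/c/d/20160205\n/a/b/c/d/20160205\r\n' > listing

# Keep only the entries whose name ends in a carriage return
grep $'\r$' listing > list

# Echo what the removal loop would act on; IFS= and -r keep the \r intact
while IFS= read -r i; do
    printf 'would remove: %s\n' "$i" | cat -v
done < list
```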
