
parsing the HDFS dfs -count output

Master Collaborator

Hi,

 

I need to send the hdfs dfs -count output to Graphite, but I want to do this in one command rather than three separate commands: one for the folder count, one for the file count, and one for the size.

 

I can do this with separate commands like this:

 

hdfs dfs -ls /fawze/data | awk '{system("hdfs dfs -count " $8) }' | awk '{print $4,$2;}'

 

But I want the output to look like this:

 

fawze/data/x.folders    20

fawze/data/x.files    200 

fawze/data/x.size   2650

 

fawze/data/y.folders    25 

fawze/data/y.files    2450

fawze/data/y.size   23560
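In other words, each `hdfs dfs -count` line (DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME) should fan out into three `path.metric value` lines. A minimal awk sketch of just that reshaping, with `printf` standing in for the real `-count` output:

```shell
# Sketch: reshape one "dirs files bytes path" line into three
# "path.metric value" lines. printf simulates `hdfs dfs -count` output.
printf '20 200 2650 fawze/data/x\n' |
  awk '{ print $4 ".folders", $1
         print $4 ".files",   $2
         print $4 ".size",    $3 }'
```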

 

I'm not a Linux expert, so I'd appreciate any help.

 

 

1 ACCEPTED SOLUTION

Champion

Man, that took a bit of trial and error.

The issue with the first run is that it returns an empty line. I tried a few awk-specific ways to get around it, but they didn't work. So here is a hack, which also uses the variable within awk.

 

DC=PN
hdfs dfs -ls /lib/ | grep "drwx" | awk '{system("hdfs dfs -count " $8) }' | awk '{ gsub(/\/lib\//,"'$DC'"".hadoop.hdfs.",$4); print $4 ".folderscount",$1"\n"$4 ".filescount",$2"\n"$4 ".size",$3;}'

PN.hadoop.hdfs.archive.folderscount 9
PN.hadoop.hdfs.archive.filescount 103
PN.hadoop.hdfs.archive.size 928524788
PN.hadoop.hdfs.dae.folderscount 1
PN.hadoop.hdfs.dae.filescount 13
PN.hadoop.hdfs.dae.size 192504874
PN.hadoop.hdfs.schema.folderscount 1
PN.hadoop.hdfs.schema.filescount 14
PN.hadoop.hdfs.schema.size 45964

DC=VA

hdfs dfs -ls /lib/ | grep "drwx" | awk '{system("hdfs dfs -count " $8) }' | awk '{ gsub(/\/lib\//,"'$DC'"".hadoop.hdfs.",$4); print $4 ".folderscount",$1"\n"$4 ".filescount",$2"\n"$4 ".size",$3;}'

VA.hadoop.hdfs.archive.folderscount 9
VA.hadoop.hdfs.archive.filescount 103
VA.hadoop.hdfs.archive.size 928524788
VA.hadoop.hdfs.dae.folderscount 1
VA.hadoop.hdfs.dae.filescount 13
VA.hadoop.hdfs.dae.size 192504874
VA.hadoop.hdfs.schema.folderscount 1
VA.hadoop.hdfs.schema.filescount 14
VA.hadoop.hdfs.schema.size 45964
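An alternative to splicing the shell variable into the quoted awk program with `"'$DC'"` is awk's `-v` option, which should behave the same way here and avoids the quote juggling. A sketch, with `printf` standing in for one line of `hdfs dfs -count` output:

```shell
DC=PN
# Pass the shell variable into awk with -v instead of splicing quotes;
# printf simulates one "dirs files bytes path" line from `hdfs dfs -count`.
printf '9 103 928524788 /lib/archive\n' |
  awk -v dc="$DC" '{ gsub(/\/lib\//, dc ".hadoop.hdfs.", $4)
                     print $4 ".folderscount", $1
                     print $4 ".filescount",   $2
                     print $4 ".size",         $3 }'
```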


3 REPLIES

Champion
You were close.

hdfs dfs -ls /lib | awk '{system("hdfs dfs -count " $8) }' | awk '{print $4,$1"\n"$4,$2"\n"$4,$3;}'

This throws a usage error on the first run, and I haven't looked into why, but it prints all subdirectories: three entries per directory, one for each stat from -count.
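The usage error on the first run is most likely the "Found N items" summary line that `hdfs dfs -ls` prints before the entries: on that line `$8` is empty, so the inner command runs as `hdfs dfs -count` with no path. Skipping the first record avoids it; a sketch with simulated `ls` output (the entry line below is made up for illustration):

```shell
# Simulated `hdfs dfs -ls` output: a "Found N items" header, then entries.
# On the header line $8 is empty, which breaks the system() call; NR > 1
# skips it so only real paths reach the inner command.
printf 'Found 1 items\ndrwxr-xr-x - hdfs hdfs 0 2018-01-01 00:00 /lib/archive\n' |
  awk 'NR > 1 { print $8 }'
```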

Master Collaborator

@mbigelow That helped me a lot, thanks.

 

I made some small additions to the command:

 

hdfs dfs -ls /liveperson/data | grep -v storage | awk '{system("hdfs dfs -count " $8) }' | awk '{ gsub(/\/liveperson\/data\/server_/,"hadoop.hdfs.",$4); print $4 ".folderscount",$1"\n"$4 ".filescount",$2"\n"$4 ".size",$3;}'

 

I'm still investigating the usage error on the first run, and I also want to add a variable before the hadoop.hdfs. prefix.

 

Can you help with this?

 

I have a variable called DC, and I want to concatenate it to the path so the result looks like this (for example, with DC set to VA):

VA.hadoop.hdfs.$4.

 

I already defined $DC.
