Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

parsing the HDFS dfs -count output

Master Collaborator

Hi,

I need to send the hdfs dfs -count output to Graphite, but I want to do it with one command rather than three: one each for the folder count, the file count and the size.

I can do this with separate commands like this:

hdfs dfs -ls /fawze/data | awk '{system("hdfs dfs -count " $8) }' | awk '{print $4,$2;}'

But I want the output to look like this:

fawze/data/x.folders    20

fawze/data/x.files    200 

fawze/data/x.size   2650

 

fawze/data/y.folders    25 

fawze/data/y.files    2450

fawze/data/y.size   23560

I'm not a Linux expert, so I would appreciate any help.

1 ACCEPTED SOLUTION

Champion

Man, that took a bit of trial and error.

The issue with the first run is that it is returning an empty line. I tried a few awk-specific ways to get around it, but they didn't work, so here is a hack. It also uses the variable within awk.

DC=PN
hdfs dfs -ls /lib/ | grep "drwx" | awk '{system("hdfs dfs -count " $8) }' | awk '{ gsub(/\/lib\//,"'$DC'"".hadoop.hdfs.",$4); print $4 ".folderscount",$1"\n"$4 ".filescount",$2"\n"$4 ".size",$3;}'

PN.hadoop.hdfs.archive.folderscount 9
PN.hadoop.hdfs.archive.filescount 103
PN.hadoop.hdfs.archive.size 928524788
PN.hadoop.hdfs.dae.folderscount 1
PN.hadoop.hdfs.dae.filescount 13
PN.hadoop.hdfs.dae.size 192504874
PN.hadoop.hdfs.schema.folderscount 1
PN.hadoop.hdfs.schema.filescount 14
PN.hadoop.hdfs.schema.size 45964

DC=VA

hdfs dfs -ls /lib/ | grep "drwx" | awk '{system("hdfs dfs -count " $8) }' | awk '{ gsub(/\/lib\//,"'$DC'"".hadoop.hdfs.",$4); print $4 ".folderscount",$1"\n"$4 ".filescount",$2"\n"$4 ".size",$3;}'

VA.hadoop.hdfs.archive.folderscount 9
VA.hadoop.hdfs.archive.filescount 103
VA.hadoop.hdfs.archive.size 928524788
VA.hadoop.hdfs.dae.folderscount 1
VA.hadoop.hdfs.dae.filescount 13
VA.hadoop.hdfs.dae.size 192504874
VA.hadoop.hdfs.schema.folderscount 1
VA.hadoop.hdfs.schema.filescount 14
VA.hadoop.hdfs.schema.size 45964
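
The quote splice `"'$DC'"` works, but the same prefix can also be handed to awk through its standard `-v` option, which avoids breaking out of the single-quoted program. Below is a minimal sketch under that assumption. The function name `count_to_metrics` and its two arguments are made up for illustration, and it expects `hdfs dfs -count` lines (folder count, file count, size, path) on stdin:

```shell
# count_to_metrics PREFIX STRIP
# Reads `hdfs dfs -count` lines (dirs files bytes path) on stdin and prints
# three "name value" lines per path. STRIP is a literal leading path to drop.
count_to_metrics() {
    awk -v prefix="$1" -v strip="$2" '
        NF >= 4 {
            name = $4
            # Drop the leading directory only if the path starts with it.
            if (index(name, strip) == 1) name = substr(name, length(strip) + 1)
            name = prefix ".hadoop.hdfs." name
            print name ".folderscount", $1
            print name ".filescount", $2
            print name ".size", $3
        }'
}
```

It would slot in as the last stage of the pipeline above, for example `... | awk '{system("hdfs dfs -count " $8) }' | count_to_metrics "$DC" /lib/`. Because the prefix is matched as a literal string rather than a regex, the slashes need no escaping.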


3 REPLIES 3

Champion
You were close.

hdfs dfs -ls /lib | awk '{system("hdfs dfs -count " $8) }' | awk '{print $4,$1"\n"$4,$2"\n"$4,$3;}'

This throws a usage error on the first run, and I haven't looked into why, but it prints all subdirectories, with three entries for each: one per stat from -count.
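
A likely cause of that usage error, offered as an assumption rather than something confirmed in this thread: `hdfs dfs -ls` begins its output with a `Found N items` header, which has no eighth field, so the first `system()` call runs `hdfs dfs -count` with no path. A minimal sketch that skips such lines before `-count` is called (the function name `ls_paths` is hypothetical):

```shell
# ls_paths: read `hdfs dfs -ls` output on stdin and print only the path column.
# Lines with fewer than 8 fields (such as the "Found N items" header) are skipped.
ls_paths() {
    awk 'NF >= 8 { print $8 }'
}

# Intended use (needs a Hadoop client, so it is left commented out):
# hdfs dfs -ls /lib | ls_paths | while read -r p; do hdfs dfs -count "$p"; done
```

Like the original one-liner, this assumes paths without spaces, since the path is taken from a single whitespace-separated field.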

Master Collaborator

@mbigelow That helped me a lot, thanks.

I made some small additions to the command:

hdfs dfs -ls /liveperson/data | grep -v storage | awk '{system("hdfs dfs -count " $8) }' | awk '{ gsub(/\/liveperson\/data\/server_/,"hadoop.hdfs.",$4); print $4 ".folderscount",$1"\n"$4 ".filescount",$2"\n"$4 ".size",$3;}'

I'm still investigating the usage error on the first run, and I want to add a variable before the hadoop.hdfs. prefix.

Can you help with this?

I have a variable called DC, and I want to prepend it to the path so that it looks like this (for example, when DC is VA):

VA.hadoop.hdfs.$4.

I have already defined $DC.
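
Since the goal is to feed Graphite, it may help to note that its plaintext protocol expects one `metric.path value timestamp` line per data point, with the timestamp in Unix epoch seconds. Here is a small sketch that appends a timestamp to every `name value` line. `add_timestamp` is an illustrative name, and the host in the usage comment is a placeholder:

```shell
# add_timestamp [EPOCH]
# Reads "name value" lines on stdin and appends an epoch timestamp to each.
# The timestamp is taken once, so all lines of one run share the same value.
add_timestamp() {
    ts="${1:-$(date +%s)}"
    awk -v ts="$ts" 'NF == 2 { print $1, $2, ts }'
}

# Intended use (graphite.example.com is a placeholder; 2003 is carbon's
# default plaintext port, which a given installation may have changed):
# ... | add_timestamp | nc graphite.example.com 2003
```

Depending on the `nc` variant, an extra flag may be needed so the connection closes once stdin ends.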
