Parsing the hdfs dfs -count output
Labels: HDFS
Created 04-18-2017 06:46 AM
Hi,
I need to send the hdfs dfs -count output to Graphite, but I want to do this with one command rather than three: one for the folder count, one for the file count, and one for the size.
I can do it with separate commands like this:
hdfs dfs -ls /fawze/data | awk '{system("hdfs dfs -count " $8) }' | awk '{print $4,$2;}'
But I want the output to look like this:
fawze/data/x.folders 20
fawze/data/x.files 200
fawze/data/x.size 2650
fawze/data/y.folders 25
fawze/data/y.files 2450
fawze/data/y.size 23560
I'm not a Linux expert, so I'd appreciate any help.
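For reference, `hdfs dfs -count` prints four columns per path: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME. A minimal sketch of the reshaping step, run here against hard-coded sample output (the paths and numbers mirror the example above) since only the awk stage matters:

```shell
# hdfs dfs -count output format: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
# This sample text stands in for the real command output.
count_output='20 200 2650 /fawze/data/x
25 2450 23560 /fawze/data/y'

printf '%s\n' "$count_output" | awk '{
  sub(/^\//, "", $4)              # drop the leading slash from the path
  print $4 ".folders", $1         # directory count
  print $4 ".files",   $2         # file count
  print $4 ".size",    $3         # content size in bytes
}'
```

In the real pipeline, the output of `hdfs dfs -count` over the target directories would replace the hard-coded sample.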
Created 04-19-2017 02:39 PM
Man, this took a bit of trial and error.
The issue with the first run is that it returns an empty line. I tried a few awk-specific ways to get around it, but they didn't work. So here is a hack, using the variable within awk as well.
DC=PN
hdfs dfs -ls /lib/ | grep "drwx" | awk '{system("hdfs dfs -count " $8) }' | awk '{ gsub(/\/lib\//,"'$DC'"".hadoop.hdfs.",$4); print $4 ".folderscount",$1"\n"$4 ".filescount",$2"\n"$4 ".size",$3;}'
PN.hadoop.hdfs.archive.folderscount 9
PN.hadoop.hdfs.archive.filescount 103
PN.hadoop.hdfs.archive.size 928524788
PN.hadoop.hdfs.dae.folderscount 1
PN.hadoop.hdfs.dae.filescount 13
PN.hadoop.hdfs.dae.size 192504874
PN.hadoop.hdfs.schema.folderscount 1
PN.hadoop.hdfs.schema.filescount 14
PN.hadoop.hdfs.schema.size 45964
DC=VA
hdfs dfs -ls /lib/ | grep "drwx" | awk '{system("hdfs dfs -count " $8) }' | awk '{ gsub(/\/lib\//,"'$DC'"".hadoop.hdfs.",$4); print $4 ".folderscount",$1"\n"$4 ".filescount",$2"\n"$4 ".size",$3;}'
VA.hadoop.hdfs.archive.folderscount 9
VA.hadoop.hdfs.archive.filescount 103
VA.hadoop.hdfs.archive.size 928524788
VA.hadoop.hdfs.dae.folderscount 1
VA.hadoop.hdfs.dae.filescount 13
VA.hadoop.hdfs.dae.size 192504874
VA.hadoop.hdfs.schema.folderscount 1
VA.hadoop.hdfs.schema.filescount 14
VA.hadoop.hdfs.schema.size 45964
Note that the DC assignment has to be a separate command: written as a prefix (DC=PN hdfs dfs -ls ...) the shell expands '$DC' inside the awk program before the assignment takes effect.
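The `"'$DC'"` quote-splicing works, but it is fragile: the awk program only behaves if the shell expands the variable into it. awk's `-v` option is a cleaner way to hand a shell variable to awk. A sketch against one hard-coded sample line of `-count` output, since only the awk stage matters here:

```shell
DC=PN
# One sample line of `hdfs dfs -count` output stands in for the real pipeline.
echo '9 103 928524788 /lib/archive' | awk -v dc="$DC" '{
  gsub(/\/lib\//, dc ".hadoop.hdfs.", $4)   # /lib/archive -> PN.hadoop.hdfs.archive
  print $4 ".folderscount", $1
  print $4 ".filescount",   $2
  print $4 ".size",         $3
}'
```

With `-v`, the awk program stays entirely inside single quotes, so there is no quoting boundary to get wrong, and changing `DC` requires no edit to the awk code.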
Created 04-18-2017 09:11 AM
hdfs dfs -ls /lib | awk '{system("hdfs dfs -count " $8) }' | awk '{print $4,$1"\n"$4,$2"\n"$4,$3;}'
This throws a usage error for the first run and I haven't looked into why, but it prints out all subdirs; three entries for each stat from -count.
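A likely cause of that first-run usage error (an assumption, but it matches the symptom): `hdfs dfs -ls` prints a `Found N items` summary line first, which has no eighth field, so the inner `system()` call runs `hdfs dfs -count` with no path at all. Guarding on the field count skips it; sketched here against hard-coded sample `-ls` output:

```shell
# Sample `hdfs dfs -ls` output; the "Found 2 items" line has only 3 fields.
ls_output='Found 2 items
drwxr-xr-x   - hdfs hadoop 0 2017-04-18 06:46 /lib/archive
drwxr-xr-x   - hdfs hadoop 0 2017-04-18 06:46 /lib/dae'

# Only lines with a path in field 8 survive, so the summary line
# never reaches the count step.
printf '%s\n' "$ls_output" | awk 'NF >= 8 { print $8 }'
```

In the real pipeline the guarded action would be `system("hdfs dfs -count " $8)` instead of `print $8`.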
Created 04-19-2017 01:56 PM
@mbigelow That helped me a lot, thanks!
I made small additions to the command:
hdfs dfs -ls /liveperson/data | grep -v storage | awk '{system("hdfs dfs -count " $8) }' | awk '{ gsub(/\/liveperson\/data\/server_/,"hadoop.hdfs.",$4); print $4 ".folderscount",$1"\n"$4 ".filescount",$2"\n"$4 ".size",$3;}'
I'm still investigating the usage error on the first run, and I want to add a variable before the hadoop.hdfs. prefix.
Can you help with this?
I have a variable called DC, and I want to concatenate it to the path so it looks like this (for example, when DC is VA):
VA.hadoop.hdfs.$4.
I have already defined $DC.
