New Contributor
Posts: 3
Registered: ‎04-16-2018
Accepted Solution

HDFS File Record Counts

Hi Gurus,

 

We have an S3 location with multiple directories and files. We would like to get each filename and its corresponding record count. We were able to get the filename and file size using the command below:

 

hdfs dfs -ls -R /bucket_name/* | awk '{system("hdfs dfs -count " $8) }' | awk '{print $4,$3;}'

 

Output:

/bucket_name/Directory/File_name.txt 44998 --filesize

 

Thanks in advance!

 

Regards,

Surendran

Posts: 1,760
Kudos: 379
Solutions: 282
Registered: ‎07-31-2013

Re: HDFS File Record Counts

Record counting depends on understanding the file's format (text, Avro,
Parquet, etc.). HDFS and S3, being storage systems, are format-agnostic
and store no information about a file's contents beyond its size. To find
record counts, you will need to read the files directly with a program
suited to their format.

If they are simple text files, a very trivial example would be 'hadoop fs
-text FILE_URI | wc -l'. This of course does not scale for a large group of
files, as it is single-threaded - ideally you'd use MR or Spark to
generate the counts quicker.
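As a stopgap before moving to MR or Spark, the same per-file count can be parallelized with 'xargs -P'. A minimal local sketch of the pattern, where 'cat' stands in for 'hadoop fs -text' and the /tmp files and parallelism level are illustrative assumptions:

```shell
# Create two sample text files standing in for files on HDFS/S3.
printf 'a\nb\nc\n' > /tmp/f1.txt
printf 'x\ny\n'    > /tmp/f2.txt

# Count lines per file, up to 4 files at a time; on a real cluster,
# replace `ls` with your file-listing command and `cat` with `hadoop fs -text`.
ls /tmp/f1.txt /tmp/f2.txt \
  | xargs -P 4 -I{} sh -c 'echo "{} $(cat {} | wc -l)"'
```

This keeps the simple shell approach but runs several counts concurrently.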

Another trick to think of for speed: Parquet files carry a footer area with
stats about the written file, which can give you record counts without
having to read the whole file: https://github.com/apache/parquet-format#metadata
and
https://github.com/apache/parquet-mr/tree/master/parquet-tools#meta-legend,
but this does not apply to all file formats.
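For Parquet specifically, the parquet-tools 'meta' command documented at the second link above prints those footer stats; row counts appear under the 'RC' field in its legend. A sketch, assuming parquet-tools is available on the cluster and using a hypothetical file path:

hadoop jar parquet-tools-<version>.jar meta hdfs:///path/to/file.parquet
# Look for the "RC:" (row count) entries in the printed footer metadata.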
New Contributor
Posts: 3
Registered: ‎04-16-2018

Re: HDFS File Record Counts

Thanks for your reply Harsh.

I am able to get the file record counts, but can't get the filenames appended to them.

Any idea how we can tweak your code for simple text files to have the filenames as well?

Thanks,
Surendran
Posts: 1,760
Kudos: 379
Solutions: 282
Registered: ‎07-31-2013

Re: HDFS File Record Counts

For the trivial shell example you could just make echo print both with an
inlined sub-shell that does the counting:

for file in $(FILE_LIST_SUBCOMMAND)
do
  echo "${file} $(hadoop fs -text "${file}" | wc -l)"
done
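A quick local sanity check of that loop pattern, with 'cat' standing in for 'hadoop fs -text' and a fixed list of /tmp files standing in for FILE_LIST_SUBCOMMAND (both are illustrative assumptions):

```shell
# Sample files standing in for text files on HDFS/S3.
printf 'r1\nr2\n'     > /tmp/demo1.txt
printf 'r1\nr2\nr3\n' > /tmp/demo2.txt

# Print each filename followed by its line (record) count.
for file in /tmp/demo1.txt /tmp/demo2.txt
do
  echo "${file} $(cat "${file}" | wc -l)"
done
```

Each output line pairs the filename with its record count, which is the format asked for in the question.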