
HDFS File Record Counts

New Contributor

Hi Gurus,

 

We have an S3 location with multiple directories and files. We would like to get the filenames and their corresponding record counts. We were able to get the filenames and file sizes using the command below:

 

hdfs dfs -ls -R /bucket_name/* | awk '{system("hdfs dfs -count " $8) }' | awk '{print $4,$3;}'

 

Output:

/bucket_name/Directory/File_name.txt 44998 (file size)

 

Thanks in advance!

 

Regards,

Surendran


3 REPLIES

Mentor
Record counting depends on understanding the format of the file (text,
Avro, Parquet, etc.). HDFS and S3, being storage systems, are format-agnostic
and store no information about a file's contents beyond its size. To find
record counts, you will need to query the files directly with a program
suited to reading that format.

If they are simple text files, a very trivial example would be 'hadoop fs
-text FILE_URI | wc -l'. This of course does not scale to a large group of
files, as it is single-threaded; you'd ideally want to use MapReduce or
Spark to generate the counts more quickly.

Another trick to consider for speed: Parquet files carry a footer with
statistics about the written file, so they can give you record counts
without having to read the whole file: https://github.com/apache/parquet-format#metadata
and
https://github.com/apache/parquet-mr/tree/master/parquet-tools#meta-legend.
This does not apply to all file formats, of course.
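
For the Parquet case, a sketch of reading just the footer with the
parquet-tools jar from the second link (the jar name/version is
illustrative, and the jar must be available wherever you run this): each
row group in the 'meta' output carries an RC (row count) field, so summing
those gives the file's record count without scanning the data.

# Print the footer metadata, including per-row-group RC (row count) values.
hadoop jar parquet-tools-<VERSION>.jar meta s3a://bucket_name/Directory/file.parquet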

New Contributor
Thanks for your reply, Harsh.

I am able to get the file record counts, but can't get the filenames appended to them.

Any idea how we can tweak your code for simple text files to include the filenames as well?

Thanks,
Surendran

Mentor
For the trivial shell example you could just have echo print both, with an
inlined sub-shell doing the counting:

for file in $(FILE_LIST_SUBCOMMAND)
do
  echo "${file}" "$(hadoop fs -text "${file}" | wc -l)"
done
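
For instance, plugging the recursive listing from your question in as the
file-list subcommand (a sketch, assuming plain text files; the bucket path
is the one from your example):

# Keep only regular files, then print "path record_count" for each.
for file in $(hdfs dfs -ls -R /bucket_name/ | awk '$1 ~ /^-/ {print $8}')
do
  echo "${file}" "$(hadoop fs -text "${file}" | wc -l)"
done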