<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>HDFS File Record Counts in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66386#M77288</link>
    <description>&lt;P&gt;Hi Gurus,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We have an S3 location with multiple directories and files. We would like to get each filename and its corresponding record count. We were able to get the filename and file size with the command below:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;hdfs dfs -ls -R /bucket_name/* | awk '{system("hdfs dfs -count " $8) }' | awk '{print $4,$3;}'&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Output:&lt;/P&gt;&lt;P&gt;/bucket_name/Directory/File_name.txt 44998 -- &lt;EM&gt;file size&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Surendran&lt;/P&gt;</description>
    <pubDate>Mon, 16 Apr 2018 13:53:03 GMT</pubDate>
    <dc:creator>Naive</dc:creator>
    <dc:date>2018-04-16T13:53:03Z</dc:date>
    <item>
      <title>HDFS File Record Counts</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66386#M77288</link>
      <description>&lt;P&gt;Hi Gurus,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We have an S3 location with multiple directories and files. We would like to get each filename and its corresponding record count. We were able to get the filename and file size with the command below:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;hdfs dfs -ls -R /bucket_name/* | awk '{system("hdfs dfs -count " $8) }' | awk '{print $4,$3;}'&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Output:&lt;/P&gt;&lt;P&gt;/bucket_name/Directory/File_name.txt 44998 -- &lt;EM&gt;file size&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Surendran&lt;/P&gt;</description>
      <pubDate>Mon, 16 Apr 2018 13:53:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66386#M77288</guid>
      <dc:creator>Naive</dc:creator>
      <dc:date>2018-04-16T13:53:03Z</dc:date>
    </item>
    <item>
      <title>Re: HDFS File Record Counts</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66387#M77289</link>
      <description>Record counting depends on understanding the file's format (text, Avro, Parquet, etc.). HDFS and S3, being storage systems, are format-agnostic and store no information about a file's contents beyond its size. To find record counts, you will need to read the files directly with a program suited to their format.&lt;BR /&gt;&lt;BR /&gt;If they are simple text files, a trivial example would be 'hadoop fs -text FILE_URI | wc -l'. This of course does not scale to a large group of files, as it is single-threaded; ideally you would use MapReduce or Spark to generate the counts in parallel.&lt;BR /&gt;&lt;BR /&gt;Another trick for speed: Parquet files carry a footer with statistics about the written file, which can give you record counts without reading the whole file: &lt;A href="https://github.com/apache/parquet-format#metadata" target="_blank"&gt;https://github.com/apache/parquet-format#metadata&lt;/A&gt; and &lt;A href="https://github.com/apache/parquet-mr/tree/master/parquet-tools#meta-legend" target="_blank"&gt;https://github.com/apache/parquet-mr/tree/master/parquet-tools#meta-legend&lt;/A&gt;, but this does not apply to all file formats.&lt;BR /&gt;</description>
      <pubDate>Mon, 16 Apr 2018 14:09:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66387#M77289</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2018-04-16T14:09:47Z</dc:date>
    </item>
    <item>
      <title>Re: HDFS File Record Counts</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66391#M77290</link>
      <description>Thanks for your reply, Harsh.&lt;BR /&gt;&lt;BR /&gt;I am able to get the file record counts, but I can't get the filenames to appear alongside them.&lt;BR /&gt;&lt;BR /&gt;Any idea how we can tweak your code for simple text files to print the filenames as well?&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;BR /&gt;Surendran</description>
      <pubDate>Mon, 16 Apr 2018 14:35:00 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66391#M77290</guid>
      <dc:creator>Naive</dc:creator>
      <dc:date>2018-04-16T14:35:00Z</dc:date>
    </item>
    <item>
      <title>Re: HDFS File Record Counts</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66393#M77291</link>
      <description>For the trivial shell example, you could have echo print both the filename and the count by using an inlined sub-shell to do the counting:&lt;BR /&gt;&lt;BR /&gt;for file in $(FILE_LIST_SUBCOMMAND)&lt;BR /&gt;do&lt;BR /&gt;&amp;nbsp;&amp;nbsp;echo "${file}" $(hadoop fs -text "${file}" | wc -l)&lt;BR /&gt;done&lt;BR /&gt;</description>
      <pubDate>Mon, 16 Apr 2018 14:47:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66393#M77291</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2018-04-16T14:47:47Z</dc:date>
    </item>
  </channel>
</rss>
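Harsh's loop in the final reply can be fleshed out into a small, testable function. The sketch below is an illustration under assumptions not stated in the thread: it reads file paths from stdin (on a cluster you could feed it the last field of each `hdfs dfs -ls -R /bucket_name` line for non-directory entries), and `READ_CMD` is a hypothetical knob, not part of any Hadoop tooling, that defaults to `hdfs dfs -text` but can be set to `cat` to exercise the loop on local files without a cluster.

```shell
#!/bin/sh
# Print "<path> <record_count>" for each file path read from stdin.
# READ_CMD is a local convention for this sketch (not from the thread):
# it defaults to the HDFS text reader, and setting READ_CMD=cat runs
# the identical loop against ordinary local files.
count_records() {
  while read -r file; do
    # The unquoted command substitution lets the shell's word splitting
    # strip the padding some wc implementations print before the count.
    echo "$file" $(${READ_CMD:-hdfs dfs -text} "$file" | wc -l)
  done
}
```

As Harsh notes, this stays single-threaded; for a large set of files the same per-file count is better produced in parallel with MapReduce or Spark.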

