Record counting depends on the format of the file (text, Avro, Parquet,
etc.). HDFS and S3, being storage systems, are format-agnostic and store
no information about a file's contents beyond its size. To find record
counts, you will need to query the files directly with a program suited
to reading that format.
If they are simple text files, a trivial example would be 'hadoop fs
-text FILE_URI | wc -l'. This of course does not scale for a large set of
files, as it is single-threaded - you'd ideally want to use MapReduce or
Spark to generate the counts in parallel.
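To illustrate the parallel idea without a cluster (this is a plain-Python sketch, not MR/Spark code; the glob pattern and function names are my own), you can fan line counting out over a process pool. With Spark the equivalent would be roughly spark.read.text("...").count():

```python
import glob
from concurrent.futures import ProcessPoolExecutor

def count_lines(path):
    """Count newline-delimited records in one text file, reading in chunks."""
    with open(path, "rb") as f:
        return sum(chunk.count(b"\n")
                   for chunk in iter(lambda: f.read(1 << 20), b""))

def total_records(pattern):
    """Sum line counts over all files matching the glob, one worker per file."""
    paths = glob.glob(pattern)
    with ProcessPoolExecutor() as pool:
        return sum(pool.map(count_lines, paths))
```

This only helps on local or mounted storage; for files in HDFS/S3 proper, a Spark job keeps the reads close to the data instead of pulling everything through one machine.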
Another trick worth knowing for speed: Parquet files carry a footer with
statistics about the written file, including record counts, so you can
get the count without reading the whole file:
https://github.com/apache/parquet-format#metadata
This does not apply to all file formats, though.