<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>HDFS File Record Counts in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66386#M77288</link>
    <description>&lt;P&gt;Hi Gurus,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We have an S3 location with multiple directories and files. We would like to get each filename and its corresponding record count. We were able to get the filename and file size with the command below:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;hdfs dfs -ls -R /bucket_name/* | awk '{system("hdfs dfs -count " $8) }' | awk '{print $4,$3;}'&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Output:&lt;/P&gt;&lt;P&gt;/bucket_name/Directory/File_name.txt 44998 -- &lt;EM&gt;file size&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Surendran&lt;/P&gt;</description>
    <pubDate>Mon, 16 Apr 2018 13:53:03 GMT</pubDate>
    <dc:creator>Naive</dc:creator>
    <dc:date>2018-04-16T13:53:03Z</dc:date>
    <item>
      <title>HDFS File Record Counts</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66386#M77288</link>
      <description>&lt;P&gt;Hi Gurus,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;We have an S3 location with multiple directories and files. We would like to get each filename and its corresponding record count. We were able to get the filename and file size with the command below:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;hdfs dfs -ls -R /bucket_name/* | awk '{system("hdfs dfs -count " $8) }' | awk '{print $4,$3;}'&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Output:&lt;/P&gt;&lt;P&gt;/bucket_name/Directory/File_name.txt 44998 -- &lt;EM&gt;file size&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Surendran&lt;/P&gt;</description>
      <pubDate>Mon, 16 Apr 2018 13:53:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66386#M77288</guid>
      <dc:creator>Naive</dc:creator>
      <dc:date>2018-04-16T13:53:03Z</dc:date>
    </item>
    <item>
      <title>Re: HDFS File Record Counts</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66387#M77289</link>
      <description>Record counting depends on understanding the file's format (text, Avro, Parquet, etc.). HDFS and S3, being storage systems, are format-agnostic and store no information about a file's contents beyond its size. To find record counts, you will need to read the files directly with a program suited to their format.&lt;BR /&gt;&lt;BR /&gt;If they are simple text files, a trivial example would be 'hadoop fs -text FILE_URI | wc -l'. This of course does not scale to a large group of files, as it is single-threaded; ideally you would use MapReduce or Spark to generate the counts in parallel.&lt;BR /&gt;&lt;BR /&gt;Another trick for speed: Parquet files carry a footer with statistics about the written file, which can give you record counts without reading the whole file: &lt;A href="https://github.com/apache/parquet-format#metadata" target="_blank"&gt;https://github.com/apache/parquet-format#metadata&lt;/A&gt; and &lt;A href="https://github.com/apache/parquet-mr/tree/master/parquet-tools#meta-legend" target="_blank"&gt;https://github.com/apache/parquet-mr/tree/master/parquet-tools#meta-legend&lt;/A&gt;, but this does not apply to all file formats.&lt;BR /&gt;</description>
      <pubDate>Mon, 16 Apr 2018 14:09:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66387#M77289</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2018-04-16T14:09:47Z</dc:date>
    </item>
    <item>
      <title>Re: HDFS File Record Counts</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66391#M77290</link>
      <description>Thanks for your reply, Harsh.&lt;BR /&gt;&lt;BR /&gt;I am able to get the file record counts, but I can't get the filenames to appear alongside them.&lt;BR /&gt;&lt;BR /&gt;Any idea how we can tweak your code for simple text files to print the filenames as well?&lt;BR /&gt;&lt;BR /&gt;Thanks,&lt;BR /&gt;Surendran</description>
      <pubDate>Mon, 16 Apr 2018 14:35:00 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66391#M77290</guid>
      <dc:creator>Naive</dc:creator>
      <dc:date>2018-04-16T14:35:00Z</dc:date>
    </item>
    <item>
      <title>Re: HDFS File Record Counts</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66393#M77291</link>
      <description>For the trivial shell example, you could have echo print both the filename and the count by using an inlined sub-shell to do the counting:&lt;BR /&gt;&lt;BR /&gt;for file in $(FILE_LIST_SUBCOMMAND)&lt;BR /&gt;do&lt;BR /&gt;&amp;nbsp;&amp;nbsp;echo "${file}" $(hadoop fs -text "${file}" | wc -l)&lt;BR /&gt;done&lt;BR /&gt;</description>
      <pubDate>Mon, 16 Apr 2018 14:47:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDFS-File-Record-Counts/m-p/66393#M77291</guid>
      <dc:creator>Harsh J</dc:creator>
      <dc:date>2018-04-16T14:47:47Z</dc:date>
    </item>
  </channel>
</rss>
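Harsh's loop in the final reply can be fleshed out into a small, testable function. The sketch below is an illustration under assumptions not stated in the thread: it reads file paths from stdin (on a cluster you could feed it the last field of each `hdfs dfs -ls -R /bucket_name` line for non-directory entries), and `READ_CMD` is a hypothetical knob, not part of any Hadoop tooling, that defaults to `hdfs dfs -text` but can be set to `cat` to exercise the loop on local files without a cluster.

```shell
#!/bin/sh
# Print "<path> <record_count>" for each file path read from stdin.
# READ_CMD is a local convention for this sketch (not from the thread):
# it defaults to the HDFS text reader, and setting READ_CMD=cat runs
# the identical loop against ordinary local files.
count_records() {
  while read -r file; do
    # The unquoted command substitution lets the shell's word splitting
    # strip the padding some wc implementations print before the count.
    echo "$file" $(${READ_CMD:-hdfs dfs -text} "$file" | wc -l)
  done
}
```

As Harsh notes, this stays single-threaded; for a large set of files the same per-file count is better produced in parallel with MapReduce or Spark.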

