How to find the number of lines in all the files in a Hadoop directory?
- Labels: Apache Hadoop, Apache Hive
Created 08-11-2016 07:47 AM
Created 08-11-2016 07:53 AM
You can use the command below to check the number of lines in an HDFS file:
[hdfs@ssnode1 root]$ hdfs dfs -cat /tmp/test.txt |wc -l
23
Created 08-11-2016 07:56 AM
@Sindhu, I need to know the count for each file in a directory, not for a single file.
Created 08-11-2016 11:41 AM
You can try the command below:
for i in `hdfs dfs -ls -R <DIRECTORY_PATH> | awk '{print $8}'`; do echo $i ; hdfs dfs -cat $i | wc -l; done
It will recursively list the files in <DIRECTORY_PATH> and then print the number of lines in each file.
Created 08-11-2016 12:36 PM
Thanks ssharma, that helps. Is there any single command available to check the number of lines in each file in a directory, or even just in a single file?
Created 08-11-2016 03:58 PM
I don't think there's any single command to achieve this, either in HDFS or in regular Linux. So it's better to use multiple commands with pipes, or to create a simple script that produces the desired output.
Please accept the answer if it was helpful 🙂
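For example, such a script might look like the sketch below. It is demonstrated on local files (created in a temporary directory) so the pattern can be tried anywhere; on a cluster you would list files with `hdfs dfs -ls -R <DIRECTORY_PATH> | awk '{print $8}'` and read each one with `hdfs dfs -cat` instead of a local glob and `wc -l < file`:

```shell
# Demo setup: two local files standing in for HDFS contents (hypothetical names).
demo=$(mktemp -d)
printf 'a\nb\nc\n' > "$demo/part-00000"   # 3 lines
printf 'd\ne\n'    > "$demo/part-00001"   # 2 lines

# Per-file counts plus a running grand total.
total=0
for f in "$demo"/*; do
  n=$(wc -l < "$f")          # on a cluster: n=$(hdfs dfs -cat "$f" | wc -l)
  echo "$f: $n"
  total=$((total + n))
done
echo "total: $total"

rm -rf "$demo"
```

This prints one line per file followed by the overall total, which is the output the question asks for.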
Created 08-11-2016 05:49 PM
The above approach is pretty good and works very well when you have a small number of files, but what if you have thousands or millions of files in the directories? In that case it's better to use the Hadoop MapReduce framework to do the same job on large file sets in less time. Below is an example of counting lines using MapReduce.
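One way to sketch this without writing custom Java is Hadoop Streaming: each map task runs `wc -l` over its input split, and a single reducer sums the per-split counts. This is a sketch only — the streaming jar location varies by distribution, so the jar path below is a placeholder, as is `<DIRECTORY_PATH>`:

```
hadoop jar <PATH_TO_HADOOP_STREAMING_JAR> \
    -D mapreduce.job.reduces=1 \
    -input <DIRECTORY_PATH> \
    -output /tmp/linecount \
    -mapper 'wc -l' \
    -reducer "awk '{s += \$1} END {print s}'"

hdfs dfs -cat /tmp/linecount/part-00000
```

The final `-cat` prints the grand total of lines across all input files.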
Created 09-21-2018 02:53 PM
To get the sum of the line counts of all files in a directory, you can use the loop below.
a=0
for i in `hdfs dfs -ls -R <DIRECTORY_PATH> | awk '{print $8}'`;
do
  echo $i;
  b=`hdfs dfs -cat $i | wc -l`;
  a=`expr $a + $b`;
  echo $a;
done
Created 04-20-2020 09:47 AM
This filters the recursive listing down to the reducer output files (part-r-*), then pipes their combined contents through a single `wc -l` for the grand total:
hdfs dfs -ls -R <directory> | grep 'part-r' | awk '{print $8}' | xargs hdfs dfs -cat | wc -l
