
How to find the number of lines in all the files in a Hadoop directory?


8 REPLIES

@Bala Vignesh N V

You can use the command below to check the number of lines in an HDFS file:

[hdfs@ssnode1 root]$ hdfs dfs -cat /tmp/test.txt |wc -l

23

Sindhu, I need to know the count for each file in a directory, not just for a single file.

Super Collaborator (Accepted Solution)

@Bala Vignesh N V

You can try the command below:

for i in `hdfs dfs -ls -R <DIRECTORY_PATH> | awk '{print $8}'`; do echo $i ; hdfs dfs -cat $i | wc -l; done

It will recursively list the files in <DIRECTORY_PATH> and then print the number of lines in each file.

Thanks ssharma, that helps. Is there any single command to check the number of lines in each file in a directory, or even just in a single file?

Super Collaborator

@Bala Vignesh N V

I don't think there is any single command to achieve this, either in HDFS or in regular Linux. It's better to combine multiple commands with pipes, or to write a simple script that gives you the desired output.

Please accept the answer if it was helpful 🙂

Super Guru

Hi @Bala Vignesh N V

The approach above is good and works very well when you have a small number of files, but what if you have thousands or millions of files in the directories? In that case it's better to use the Hadoop MapReduce framework to do the same job on large files in less time. Below is an example of counting lines using MapReduce.

https://sites.google.com/site/hadoopandhive/home/hadoop-how-to-count-number-of-lines-in-a-file-using...
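
In case that page moves, here is a minimal sketch of such a map-only job in Java, assuming you only need the grand total; the class, counter, and jar names (LineCount, LINES, linecount.jar) are illustrative and not taken from the linked article:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class LineCount {

    // TextInputFormat feeds the mapper one line per record, so counting
    // records counts lines; the job is map-only and writes no output files.
    public static class LineMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
        public enum Counters { LINES }

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            context.getCounter(Counters.LINES).increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "line count");
        job.setJarByClass(LineCount.class);
        job.setMapperClass(LineMapper.class);
        job.setNumReduceTasks(0);                         // map-only
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(NullOutputFormat.class); // only the counter matters
        FileInputFormat.setInputDirRecursive(job, true);  // walk subdirectories too
        FileInputFormat.addInputPath(job, new Path(args[0]));

        boolean ok = job.waitForCompletion(true);
        long total = job.getCounters()
                        .findCounter(LineMapper.Counters.LINES).getValue();
        System.out.println("Total lines: " + total);
        System.exit(ok ? 0 : 1);
    }
}

You would package it into a jar and run something like: hadoop jar linecount.jar LineCount <DIRECTORY_PATH>; the total is read back from the LINES counter once the job finishes.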

New Contributor

To get the sum of the line counts of all files in a directory, you can do the following:

a=0
for i in `hdfs dfs -ls -R <DIRECTORY_PATH> | awk '{print $8}'`; do
    echo $i                          # file path
    b=`hdfs dfs -cat $i | wc -l`     # lines in this file
    a=`expr $a + $b`                 # add to the running total
    echo $a
done

New Contributor

Another option is to cat all the part files in the directory through a single wc -l to get the total line count:

hdfs dfs -ls -R <directory> | grep 'part-r' | awk '{print $8}' | xargs hdfs dfs -cat | wc -l