How to find no of lines in all the files in a hadoop directory?

Accepted Solutions

Re: How to find no of lines in all the files in a hadoop directory?

Expert Contributor

@Bala Vignesh N V

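You can try the command below:

for i in $(hdfs dfs -ls -R <DIRECTORY_PATH> | awk '/^-/ {print $8}'); do echo "$i"; hdfs dfs -cat "$i" | wc -l; done

It recursively lists the files in <DIRECTORY_PATH> and then prints the number of lines in each file. The /^-/ filter keeps only regular files, because directory entries (whose listing lines start with "d") cannot be passed to -cat.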
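
For repeated use, the loop above can be wrapped in a small shell function. This is a minimal sketch rather than a tested production script: the function name hdfs_line_counts is hypothetical, and it assumes that paths contain no whitespace (the path is read from the eighth column of the listing) and that the files are plain, uncompressed text.

```shell
# Hypothetical helper: print "<line count> <path>" for every regular file
# under an HDFS directory. Assumes paths contain no whitespace.
hdfs_line_counts() {
    # Listing lines for regular files start with "-" in the permissions
    # column; directory lines start with "d" and are skipped.
    hdfs dfs -ls -R "$1" | awk '/^-/ {print $8}' |
    while read -r path; do
        # </dev/null keeps the inner command from consuming the file list.
        # tr strips the padding that some wc implementations add.
        count=$(hdfs dfs -cat "$path" </dev/null | wc -l | tr -d ' ')
        printf '%s %s\n' "$count" "$path"
    done
}
```

Calling hdfs_line_counts <DIRECTORY_PATH> should print one line per file, which can then be sorted or summed with the usual tools. Note that every file still costs one hdfs dfs -cat invocation.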

Re: How to find no of lines in all the files in a hadoop directory?

@Bala Vignesh N V

You can use the command below to check the number of lines in an HDFS file:

[hdfs@ssnode1 root]$ hdfs dfs -cat /tmp/test.txt |wc -l

23

Re: How to find no of lines in all the files in a hadoop directory?

Sindhu, I need the count for each file in a directory, not for a single file.

Re: How to find no of lines in all the files in a hadoop directory?

Thanks, ssharma. It helps. Are there any commands available to check the number of lines in each file in a directory, or even just in a single file?

Re: How to find no of lines in all the files in a hadoop directory?

Expert Contributor

@Bala Vignesh N V

I don't think there's any single command to achieve this, not only in HDFS but also in regular Linux. So it's better to use multiple commands with pipes, or to create a simple script that gives you the desired output.

Please accept the answer if it was helpful :)

Re: How to find no of lines in all the files in a hadoop directory?

Hi @Bala Vignesh N V

The above approach is pretty good and works very well when you have a small number of files, but what if you have thousands or millions of files in the directories? In that case it's better to use the Hadoop MapReduce framework, which can do the same job on large inputs in less time. Below is an example that counts lines using MapReduce.

https://sites.google.com/site/hadoopandhive/home/hadoop-how-to-count-number-of-lines-in-a-file-using...
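
The linked example is not reproduced here. As a separate illustration of the same idea without writing Java, the sketch below uses Hadoop Streaming with /bin/cat as the mapper and wc -l as a single reducer, so it produces one total rather than per-file counts. The function name hdfs_count_lines_streaming and the STREAMING_JAR variable are hypothetical; the location of the hadoop-streaming jar depends on your installation, and the output directory must not already exist. As far as I know, an input directory is read non-recursively by default, so nested subdirectories may need extra configuration.

```shell
# Hypothetical helper: count all lines under an HDFS input path with a
# Hadoop Streaming job. STREAMING_JAR must point at the hadoop-streaming
# jar of your installation (its location varies; this is an assumption).
hdfs_count_lines_streaming() {
    input=$1    # HDFS file or directory to count
    output=$2   # HDFS output directory; must not exist yet
    hadoop jar "$STREAMING_JAR" \
        -input "$input" \
        -output "$output" \
        -mapper /bin/cat \
        -reducer 'wc -l' \
        -numReduceTasks 1 &&
    # With a single reducer, one part file holds the total. The glob is
    # quoted so that HDFS, not the local shell, expands it.
    hdfs dfs -cat "$output/part-*"
}
```

With one reducer, all lines pass through a single process, so this mainly helps by reading the input in parallel; a dedicated MapReduce job with a combiner would scale better.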

Re: How to find no of lines in all the files in a hadoop directory?

New Contributor

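To get the total number of lines across all files in a directory, you can use the following:

a=0
for i in $(hdfs dfs -ls -R <DIRECTORY_PATH> | awk '/^-/ {print $8}'); do
    echo "$i"                          # file being counted
    b=$(hdfs dfs -cat "$i" | wc -l)    # number of lines in this file
    a=$((a + b))                       # add it to the running total
    echo "$a"                          # running total so far
done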

Re: How to find no of lines in all the files in a hadoop directory?

New Contributor

hdfs dfs -ls -R <directory> | grep 'part-r' | awk '{print $8}' | xargs hdfs dfs -cat | wc -l
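
The one-liner above prints a single total for the reducer output files (those whose names contain part-r). A hedged sketch of the same pipeline as a reusable function follows; the name hdfs_total_lines_matching and its pattern argument are hypothetical, and it again assumes that paths contain no whitespace.

```shell
# Hypothetical helper: total line count of all regular files under DIR
# whose path matches PATTERN (an awk regular expression such as 'part-r-').
# Assumes paths contain no whitespace.
hdfs_total_lines_matching() {
    hdfs dfs -ls -R "$1" |
    # Keep regular files only ("-" in the permissions column) and match
    # the pattern against the path column rather than the whole line.
    awk -v pat="$2" '/^-/ && $8 ~ pat {print $8}' |
    # xargs passes many paths to each hdfs dfs -cat call, so far fewer
    # processes are started than with a per-file loop.
    xargs hdfs dfs -cat |
    wc -l | tr -d ' '
}
```

For example, hdfs_total_lines_matching <directory> 'part-r-' should give the same total as the one-liner. If nothing matches, some xargs implementations still run the command once with no arguments, which prints an error and yields a count of 0.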
