Created 08-11-2016 07:53 AM
You can use the command below to check the number of lines in an HDFS file:
[hdfs@ssnode1 root]$ hdfs dfs -cat /tmp/test.txt | wc -l
23
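If the file is compressed, a gzipped file for instance, -text decompresses it before counting (a sketch; /tmp/test.txt.gz is a hypothetical path):
hdfs dfs -text /tmp/test.txt.gz | wc -l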
Created 08-11-2016 07:56 AM
Sindhu, I need to know the count for each file in a directory, not just for a single file.
Created 08-11-2016 11:41 AM
You can try the command below:
for i in `hdfs dfs -ls -R <DIRECTORY_PATH> | awk '{print $8}'`; do echo $i ; hdfs dfs -cat $i | wc -l; done
It will recursively list the files in <DIRECTORY_PATH> and then print the number of lines in each file.
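One caveat: hdfs dfs -ls -R also lists directories, and -cat fails on those. Here is a variant of the same loop (a sketch) that keeps only regular files by skipping entries whose permission string starts with d:
hdfs dfs -ls -R <DIRECTORY_PATH> | awk '$1 !~ /^d/ {print $8}' | while read -r f; do
    printf '%s\t' "$f"
    hdfs dfs -cat "$f" | wc -l
done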
Created 08-11-2016 12:36 PM
Thanks ssharma, that helps. Are there any commands available to check the number of lines in each file in a directory, or even just in a single file?
Created 08-11-2016 03:58 PM
I don't think there's any single command to achieve this, in HDFS or in regular Linux, so it's better to combine multiple commands with pipes or to write a simple script that produces the desired output.
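For example, such a script might look like the sketch below (count_hdfs_lines.sh is a hypothetical name; it prints a count and path for every regular file under the directory passed as its first argument):
#!/bin/bash
# count_hdfs_lines.sh <DIRECTORY_PATH>
# Print "<lines> <path>" for every regular file under the given directory.
hdfs dfs -ls -R "$1" | awk '$1 !~ /^d/ {print $8}' | while read -r f; do
    echo "$(hdfs dfs -cat "$f" | wc -l) $f"
done
It just packages the pipe-based loop from the earlier reply into a reusable form.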
Please accept the answer if it was helpful 🙂
Created 08-11-2016 05:49 PM
The approach above is pretty good and works very well when you have a small number of files, but what if you have thousands or millions of files in the directories? In that case it's better to use the Hadoop MapReduce framework, which does the same job on large file sets in less time. Below is an example of counting lines with MapReduce.
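For instance, here is a minimal sketch using Hadoop Streaming, which lets the mapper and reducer stay plain shell (the streaming jar path varies by distribution, and /tmp/linecount_out is a hypothetical output directory that must not already exist). Each map task counts the lines of its input split with wc -l, and a single reduce task sums those counts:
# mapper.sh: count the lines of this task's input split
cat > mapper.sh <<'EOF'
#!/bin/bash
wc -l
EOF
# reducer.sh: sum the per-split counts into one total
cat > reducer.sh <<'EOF'
#!/bin/bash
awk '{ total += $1 } END { print total }'
EOF
# Generic options (-D, -files) must come before the streaming options.
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -D mapreduce.job.reduces=1 \
    -files mapper.sh,reducer.sh \
    -input <DIRECTORY_PATH> \
    -output /tmp/linecount_out \
    -mapper 'bash mapper.sh' \
    -reducer 'bash reducer.sh'
# The grand total ends up in the single reducer's output file.
hdfs dfs -cat /tmp/linecount_out/part-00000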
Created 09-21-2018 02:53 PM
To get the total line count across all files in a directory, you can do the following:
a=0
for i in `hdfs dfs -ls -R <DIRECTORY_PATH> | awk '{print $8}'`;
do
    echo "$i"
    b=`hdfs dfs -cat "$i" | wc -l`   # lines in this file
    a=`expr $a + $b`                 # add to the running total
    echo "$a"
done
Created 04-20-2020 09:47 AM
This one-liner totals the lines across all part-r output files under a directory:
hdfs dfs -ls -R <directory> | grep 'part-r' | awk '{print $8}' | xargs hdfs dfs -cat | wc -l