How to get the number of files from an HDFS directory using NiFi

s198
Rising Star

My requirement is to retrieve the total number of files in a given HDFS directory and, based on that count, proceed with the downstream flow.

I cannot use the ListHDFS processor because it does not allow inbound connections. The GetHDFSFileInfo processor generates a flowfile for each HDFS file, causing all downstream processors to execute once per file.

I have observed that we can use ExecuteStreamCommand to invoke a script that runs HDFS commands to get the number of files. Is there a way to obtain the count without using a script, or is there any other option besides the above?

1 REPLY


@s198 wrote:

My requirement is to retrieve the total number of files in a given HDFS directory and, based on that count, proceed with the downstream flow.

I cannot use the ListHDFS processor because it does not allow inbound connections. The GetHDFSFileInfo processor generates a flowfile for each HDFS file, causing all downstream processors to execute once per file.

I have observed that we can use ExecuteStreamCommand to invoke a script that runs HDFS commands to get the number of files. Is there a way to obtain the count without using a script, or is there any other option besides the above?


Hello,

To retrieve the total number of files in a given HDFS directory, you can use the ExecuteStreamCommand processor in Apache NiFi to run the hdfs CLI directly; unlike ListHDFS, it accepts inbound connections, so no separate script file is required. Alternatively, if you prefer to keep the logic inside NiFi itself, you can use the ExecuteScript processor with a short Groovy script embedded in the processor.
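If you go the ExecuteStreamCommand route, a minimal configuration along these lines would run hdfs dfs -count against the directory each time a flow file arrives. This is only a sketch: it assumes the hdfs client is on the NiFi node's PATH, the directory path and attribute name are placeholders, and the Output Destination Attribute property (where your NiFi version supports it) writes the command output to an attribute instead of replacing the flow file content:

Command Path: hdfs
Command Arguments: dfs;-count;/path/to/hdfs/directory
Argument Delimiter: ;
Output Destination Attribute: hdfs.count.output

The -count output has four whitespace-separated columns (DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME), so the second field is the file count.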

Here’s how you can do it using the ExecuteScript processor:

1. Add the ExecuteScript processor: drag and drop it onto your NiFi canvas.
2. Configure the processor: set the Script Engine to Groovy and, in the Script Body, use the following Groovy script to count the files in the HDFS directory:

def flowFile = session.get()
if (!flowFile) return

// HDFS directory whose files should be counted
def hdfsDir = '/path/to/hdfs/directory'

// Run the hdfs CLI; the -count output columns are: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
def process = ['hdfs', 'dfs', '-count', hdfsDir].execute()
def output = process.text   // read stdout; returns once the command has finished writing
process.waitFor()

if (process.exitValue() != 0) {
    // hdfs command failed (e.g. bad path or missing client configuration)
    session.transfer(flowFile, REL_FAILURE)
    return
}

// tokenize() splits on whitespace and skips the leading padding, so index 1 is FILE_COUNT
def fileCount = output.tokenize()[1]

flowFile = session.putAttribute(flowFile, 'file.count', fileCount)
session.transfer(flowFile, REL_SUCCESS)

3. Set the HDFS directory path: replace /path/to/hdfs/directory with the actual path to your HDFS directory.
4. Connect the processor: connect the ExecuteScript processor to the downstream processors that need the file count.
5. Run the flow: the ExecuteScript processor will count the files in the specified HDFS directory and add the count to the flow file as the file.count attribute.
This approach avoids the need for a separate script file and keeps everything within NiFi: the ExecuteScript processor runs the HDFS command, extracts the file count, and exposes it as the file.count attribute for your downstream processors. Note that the script still shells out to the hdfs command, so the HDFS client and its configuration must be available on the NiFi node.
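To branch the flow on that count, a downstream RouteOnAttribute processor (with the default Route to Property name strategy) can evaluate the attribute with NiFi Expression Language. For example, a dynamic property such as the following, where the name files.found is just an illustration:

files.found: ${file.count:gt(0)}

routes flow files to the files.found relationship only when the directory contains at least one file, so the rest of the flow runs only in that case.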

Hope this helps.
Best regards,
florence0239