
Python script to process files on HDFS


I'm fairly new to NiFi and am trying to execute a Python script, stored on the local file system, from NiFi. There are a couple of XLSB files stored in HDFS. I want to build a NiFi flow that reads the files from HDFS and passes each filename to the Python script, so that the script can convert them to CSV and store the results back in HDFS.

What flow should I use to get this working? I tried ListHDFS -> ExecuteStreamCommand but don't know if that's correct. Also, how can I inspect the output of ListHDFS while testing?

1 ACCEPTED SOLUTION


The usual way to process Excel files on HDFS needs only these NiFi processors, without any Python:

ListHDFS > FetchHDFS > ConvertExcelToCSV > PutHDFS

I recommend trying this first. The documentation does not explicitly say whether it works with XLSB, so you may still need the Python script for the conversion. In that case, the ExecuteStreamCommand processor would indeed be a logical choice.
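
If the Python route turns out to be necessary, the script could look roughly like the sketch below. It assumes a flow of ListHDFS > FetchHDFS > ExecuteStreamCommand > PutHDFS, so the script receives the file content on stdin rather than a filename, and whatever it prints to stdout becomes the content of the outgoing FlowFile. It also assumes the third-party pyxlsb package is installed on the NiFi host; the function names and the sheet-number argument are made up for illustration and are not from the original post.

```python
import csv
import os
import sys
import tempfile


def write_rows_as_csv(rows, out):
    """Write an iterable of rows to `out` as CSV; empty cells (None) become ''."""
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if value is None else value for value in row])


def read_xlsb_rows(path, sheet=1):
    """Yield each row of one worksheet as a list of cell values.

    Assumption: the third-party pyxlsb package is available.
    """
    from pyxlsb import open_workbook  # imported lazily: third-party dependency

    with open_workbook(path) as workbook:
        with workbook.get_sheet(sheet) as worksheet:  # sheets are 1-based
            for row in worksheet.rows():
                yield [cell.v for cell in row]


def main(sheet=1):
    # ExecuteStreamCommand pipes the FlowFile content to stdin. Spool it to a
    # temporary file, because the workbook reader needs random access.
    with tempfile.NamedTemporaryFile(suffix=".xlsb", delete=False) as tmp:
        tmp.write(sys.stdin.buffer.read())
    try:
        # Everything written to stdout becomes the outgoing FlowFile content.
        write_rows_as_csv(read_xlsb_rows(tmp.name, sheet), sys.stdout)
    finally:
        os.remove(tmp.name)


# Hypothetical convention: the 1-based sheet number is the only argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(int(sys.argv[1]))
```

In ExecuteStreamCommand, the Python interpreter would go in Command Path and the script path plus the sheet number in Command Arguments. Only one worksheet is converted per run here; a workbook with several sheets would need one output per sheet, which this sketch does not attempt.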

 

-----

Regarding the output of the first processor: in development, I find the most convenient way to see it is to stop the downstream processor, then right-click the queue, list its contents, and inspect the queued FlowFiles.

If stopping the downstream processor is not possible, you can also investigate via the data provenance view.


- Dennis Jaheruddin

If this answer helped, please mark it as 'solved' and/or if it is valuable for future readers please apply 'kudos'.



avatar
Explorer

Thanks, I will look into the ConvertExcelToCSV processor. Once the files are converted to CSV, I have to apply a couple of transformations, for which I am using a Python script. If I place the CSV in HDFS, how do I use a Python script to process data from HDFS? Are you suggesting that I use ExecuteStreamCommand to get the FlowFile content and process it, or is there a better way to do it?

avatar
If you are putting the data in HDFS first, I assume the Python script that follows is more of a batch job than a streaming one.

In that case, consider running it via a scheduler like Oozie.

Also, if you run into scalability issues with your script, consider using something like PySpark instead.

- Dennis Jaheruddin


avatar
Explorer

Alright, got it. Is there a way to access files on HDFS from Python without using PySpark?

avatar
A quick search suggests that libhdfs can do it, but I have not tried it myself.
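
As an untested alternative that needs only the Python standard library, the WebHDFS REST API can also read files. This is a different technique from libhdfs, and the sketch below assumes WebHDFS is enabled on the cluster and that no Kerberos authentication is required. The helper names, host, port, and user are placeholders, not values from this thread.

```python
import urllib.parse
import urllib.request


def webhdfs_url(host, port, hdfs_path, op, user=None):
    """Build a WebHDFS REST URL for an absolute HDFS path."""
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be absolute")
    params = {"op": op}
    if user:
        # Simple (non-Kerberos) authentication passes the user as a parameter.
        params["user.name"] = user
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        host, port, urllib.parse.quote(hdfs_path), urllib.parse.urlencode(params)
    )


def read_hdfs_file(host, port, hdfs_path, user=None):
    """Return the bytes of an HDFS file using the WebHDFS OPEN operation."""
    url = webhdfs_url(host, port, hdfs_path, "OPEN", user)
    # The NameNode replies with a redirect to a DataNode; urlopen follows it.
    with urllib.request.urlopen(url) as response:
        return response.read()
```

A call such as read_hdfs_file("namenode.example.com", 9870, "/data/in/report.csv", user="nifi") would then return the file content as bytes. The NameNode HTTP port depends on the Hadoop version and configuration, so check the cluster settings rather than relying on the value shown here.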


- Dennis Jaheruddin
