Created 08-27-2019 06:59 AM
I'm fairly new to NiFi and am trying to run a Python script stored on the local file system from NiFi. There are a couple of XLSB files stored in HDFS. I want to build a NiFi flow that reads the files from HDFS and passes each filename to the Python script, so that it can convert them to CSV and store the results back in HDFS.
What flow should I use to get this working? I tried ListHDFS -> ExecuteStreamCommand but don't know if that's correct. Also, how can I inspect the output of ListHDFS to test it?
Created 08-28-2019 10:09 AM
The normal way to process Excel files on HDFS is with just these NiFi processors; you would not need Python:
ListHDFS>FetchHDFS>ConvertExcelToCSV>PutHDFS
I would recommend trying this first. The documentation does not explicitly say whether it works with XLSB, so you may actually need the Python script for the conversion. In that case, the ExecuteStreamCommand processor would indeed be a logical choice.
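If the Python route turns out to be necessary, ExecuteStreamCommand hands the flow file content to the script on standard input and takes whatever the script writes to standard output as the new content, so the flow could be ListHDFS>FetchHDFS>ExecuteStreamCommand>PutHDFS. Below is a minimal sketch of such a script. It assumes the third-party pyxlsb package is installed on the NiFi host and that only the first worksheet matters; neither is stated in this thread, so treat both as assumptions.

```python
import csv
import os
import sys
import tempfile


def rows_to_csv(rows, out):
    """Write an iterable of rows (lists of cell values) to `out` as CSV."""
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        # The csv module writes None as an empty field.
        writer.writerow(row)


def xlsb_rows(path, sheet_index=1):
    """Yield the rows of one worksheet; pyxlsb sheet indexes start at 1."""
    # Assumption: pyxlsb (third-party) is installed next to NiFi.
    from pyxlsb import open_workbook

    with open_workbook(path) as workbook:
        with workbook.get_sheet(sheet_index) as sheet:
            for row in sheet.rows():
                yield [cell.v for cell in row]


def main():
    data = sys.stdin.buffer.read()  # flow file content piped in by NiFi
    if not data:
        return  # nothing to convert
    # Spool the bytes to a temporary file so the workbook can be opened by path.
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "input.xlsb")
        with open(path, "wb") as handle:
            handle.write(data)
        rows_to_csv(xlsb_rows(path), sys.stdout)


if __name__ == "__main__":
    main()
```

In ExecuteStreamCommand you would then set Command Path to your Python interpreter and Command Arguments to the location of the script (wherever you saved it on the local file system).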
-----
Regarding the output of the first processor: in development, I find the most convenient way to see it is to stop the downstream processor, then right-click the queue, list it, and inspect the flow files.
If stopping the downstream processor is not possible, you could also investigate via the provenance view.
Created 08-29-2019 12:08 AM
Thanks, I will explore the ConvertExcelToCSV processor. Once the files are converted to CSV, I have to do a couple of transformations, for which I am using a Python script. If I place the CSV in HDFS, how do I use the Python script to process data from HDFS? Are you suggesting using ExecuteStreamCommand to get the flow file content and process it, or is there a better way to do it?
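For context on the ExecuteStreamCommand option: the processor pipes the flow file content to the command's standard input and uses its standard output as the content of the outgoing flow file, so the script can be a plain stdin-to-stdout filter and never has to touch HDFS itself (FetchHDFS before it and PutHDFS after it do that part). A minimal sketch follows; the transformation shown (trimming cells and dropping blank rows) is a hypothetical placeholder, since the actual transformations are not described here.

```python
import csv
import sys


def transform_row(row):
    """Hypothetical transformation: trim whitespace around every cell."""
    return [cell.strip() for cell in row]


def transform(src, dst):
    """Read CSV rows from `src`, transform them, and write CSV to `dst`."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip rows that are empty or contain only whitespace
        writer.writerow(transform_row(row))


if __name__ == "__main__":
    # NiFi supplies the flow file content on stdin and collects stdout.
    transform(sys.stdin, sys.stdout)
```

Keeping the logic in a function that takes two streams also makes it easy to try the script outside NiFi, for example by redirecting a local CSV file into it.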
Created 08-29-2019 01:02 AM
Created 08-29-2019 01:56 AM
Alright, got it. Is there a way to access files on HDFS from Python without using PySpark?
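One option that needs nothing beyond the Python standard library is the WebHDFS REST API, provided it is enabled on the cluster. The sketch below is only an illustration under assumptions: the host name, port (9870 is the NameNode HTTP default on Hadoop 3, 50070 on Hadoop 2), path, and user are hypothetical, and it assumes simple authentication rather than Kerberos.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def webhdfs_url(host, port, hdfs_path, op, **params):
    """Build a WebHDFS URL for an absolute HDFS path and an operation."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


def read_hdfs_file(host, port, hdfs_path, user):
    """Return the bytes of an HDFS file via the WebHDFS OPEN operation."""
    url = webhdfs_url(host, port, hdfs_path, "OPEN", **{"user.name": user})
    # The NameNode answers with a redirect to a DataNode; urlopen follows it.
    with urlopen(url) as response:
        return response.read()


# Hypothetical usage (host, path, and user are placeholders):
# data = read_hdfs_file("namenode.example.com", 9870, "/data/in/report.csv", "nifi")
```

Other routes people use are a client library that wraps this same API, or shelling out to the `hdfs dfs -cat` command when the Hadoop client is installed on the machine running the script.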
Created 08-29-2019 03:11 AM