Support Questions

Teej · ‎08-27-2019

I'm fairly new to NiFi and am trying to execute a Python script stored on the local FS from NiFi. There are a couple of XLSB files stored in HDFS. I want to build a NiFi flow that reads the files from HDFS and passes each filename to the Python script, so that it can convert them to CSV and store the results back in HDFS.

What flow should I use to get this working? I tried ListHDFS -> ExecuteStream but don't know if that's correct. Also, how can I inspect the output of ListHDFS on its own to test it?

DennisJaheruddi · ‎08-28-2019

The normal way to process Excel files on HDFS uses only these NiFi processors, so you would not need Python:

ListHDFS>FetchHDFS>ConvertExcelToCSV>PutHDFS

I would recommend trying this first. However, the documentation does not say explicitly whether it works with XLSB, so you may still need the Python script for the conversion. In that case, the ExecuteStreamCommand processor would indeed be a logical choice.
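If you do end up needing the script, here is a rough sketch of what it could look like. ExecuteStreamCommand passes the FlowFile content to the command on stdin and uses whatever the command writes to stdout as the new content, so this assumes FetchHDFS runs before it. The third-party `pyxlsb` package, the function names, and the choice to convert only the first sheet are my assumptions, not something from your setup.

```python
import csv
import io
import sys
import tempfile


def rows_to_csv_text(rows):
    """Render an iterable of rows (lists of cell values) as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        # Empty cells come back as None; write them as empty fields.
        writer.writerow(["" if value is None else value for value in row])
    return buf.getvalue()


def read_xlsb_rows(path, sheet_index=1):
    """Yield the rows of one sheet of an XLSB file as lists of values."""
    # Assumption: pyxlsb is installed (pip install pyxlsb). It is imported
    # here so the CSV helper above works without it.
    from pyxlsb import open_workbook

    with open_workbook(path) as workbook:
        with workbook.get_sheet(sheet_index) as sheet:  # sheets are 1-based
            for row in sheet.rows():
                yield [cell.v for cell in row]


def main():
    # stdin carries the XLSB bytes of the FlowFile. They are spooled to a
    # temporary file so the workbook can be opened by path.
    with tempfile.NamedTemporaryFile(suffix=".xlsb") as tmp:
        tmp.write(sys.stdin.buffer.read())
        tmp.flush()
        sys.stdout.write(rows_to_csv_text(read_xlsb_rows(tmp.name)))
```

The real script would end with a call to `main()`. PutHDFS can then follow ExecuteStreamCommand to write the CSV back.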

-----

Regarding the output of the first processor: in development, I find the most convenient way to see the output is to stop the downstream processor, then right-click the queue between the two processors to list it and inspect the messages.

If stopping the downstream processor is not possible, you could also investigate via the provenance view.

- Dennis Jaheruddin

If this answer helped, please mark it as 'solved' and/or if it is valuable for future readers please apply 'kudos'.

Teej · ‎08-29-2019

Thanks, I will explore the ConvertExcelToCSV processor. Once the files are converted to CSV, I have to do a couple of transformations, for which I am using a Python script. If I place the CSV in HDFS, how do I use the Python script to process the data from HDFS? Are you suggesting I use ExecuteStreamCommand to get the session content and process it, or is there a better way to do it?

DennisJaheruddi · ‎08-29-2019

If you are putting the data in HDFS first, I assume the Python script that follows is more batch than streaming.

In that case, consider running it via a scheduler like Oozie.

Also, if you run into scalability issues with your script, consider using something like pyspark instead.

- Dennis Jaheruddin

Teej · ‎08-29-2019

Alright, got it. Is there a way to access files on HDFS in Python without using pyspark?

DennisJaheruddi · ‎08-29-2019

A quick search suggests that libhdfs can do it, but I have not tried it myself.
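Since I have not tried libhdfs, here is a sketch of a different route rather than a libhdfs example: shelling out to the `hdfs dfs` command line from Python. It assumes the Hadoop client is installed and on the PATH of the machine running the script; the function names and any paths you pass in are hypothetical.

```python
import subprocess


def hdfs_cat_cmd(path):
    """Command that streams an HDFS file to stdout."""
    return ["hdfs", "dfs", "-cat", path]


def hdfs_put_cmd(path):
    """Command that writes stdin to an HDFS file ('-' means stdin, -f overwrites)."""
    return ["hdfs", "dfs", "-put", "-f", "-", path]


def read_hdfs_text(path, encoding="utf-8"):
    """Return the content of an HDFS file as text."""
    result = subprocess.run(hdfs_cat_cmd(path), check=True, capture_output=True)
    return result.stdout.decode(encoding)


def write_hdfs_text(path, text, encoding="utf-8"):
    """Write text to an HDFS file, replacing it if it already exists."""
    subprocess.run(hdfs_put_cmd(path), input=text.encode(encoding), check=True)
```

This loads the whole file into memory, so it only suits files of modest size. With `check=True`, a failing `hdfs` command raises `subprocess.CalledProcessError` instead of returning silently.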

- Dennis Jaheruddin
