- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Python script to process files on HDFS
- Labels:
-
Apache Hadoop
-
Apache NiFi
Created ‎08-27-2019 06:59 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I'm fairly new to NiFi and trying to execute a Python script stored on Local FS using NiFi. There are couple of XLSB files stored in HDFS. I would want to build a NiFi flow that reads files from HDFS and passes the filename to Python script so that it can convert those to CSV and store it back to HDFS.
What should be the flow I need to use to get the above working. I tried using ListHDFS -> ExecuteStream but dont know if that's correct. Also, how do I just test the output of ListHDFS to see the output.
Created ‎08-28-2019 10:09 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The normal way to process Excel files on HDFS would be with just these NiFi processors, you would not need python:
ListHDFS>FetchHDFS>ConvertExcelToCSV>PutHDFS
I would recommend you to try this, the documentation does not mention explicitly whether this works with XLSB, so you may actually need the python script for the conversion. In this case the ExecuteStreamCommand processor would indeed be a logical choice.
-----
Regarding the output of the first processor: In development, I find the most convenient way to see the output, is by stopping the downstream processor and then right clicking on the que to list it and inspect the messages.
If stopping the queue is not possible, you could also investigate via the provenance view.
- Dennis Jaheruddin
If this answer helped, please mark it as 'solved' and/or if it is valuable for future readers please apply 'kudos'.
Created ‎08-28-2019 10:09 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The normal way to process Excel files on HDFS would be with just these NiFi processors, you would not need python:
ListHDFS>FetchHDFS>ConvertExcelToCSV>PutHDFS
I would recommend you to try this, the documentation does not mention explicitly whether this works with XLSB, so you may actually need the python script for the conversion. In this case the ExecuteStreamCommand processor would indeed be a logical choice.
-----
Regarding the output of the first processor: In development, I find the most convenient way to see the output, is by stopping the downstream processor and then right clicking on the que to list it and inspect the messages.
If stopping the queue is not possible, you could also investigate via the provenance view.
- Dennis Jaheruddin
If this answer helped, please mark it as 'solved' and/or if it is valuable for future readers please apply 'kudos'.
Created ‎08-29-2019 12:08 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks, I will explore on the XLStoCSV processor. Once converted to CSV, I have to do couple of transformations for which I am using Python script. If I place the CSV in HDFS, how do I use Python script to process data from HDFS.Are you suggesting to use ExecuteStream to get the session content and process it or is there a better way to do it.
Created ‎08-29-2019 01:02 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
In that case, consider running it via a scheduler like Oozie.
Also, if you run into scalability issues with your script, consider using something like pyspark instead.
- Dennis Jaheruddin
If this answer helped, please mark it as 'solved' and/or if it is valuable for future readers please apply 'kudos'.
Created ‎08-29-2019 01:56 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Alright, got it. Is there a way to access files on HDFS in Python without using pyspark.
Created ‎08-29-2019 03:11 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Dennis Jaheruddin
If this answer helped, please mark it as 'solved' and/or if it is valuable for future readers please apply 'kudos'.
