Using NiFi & PySpark to move & transform data on S3 - examples and resources


Hey all,

I'm after some information on how I can use NiFi to get a file from S3, send it to PySpark, transform it, and move it to another folder in a different bucket.

I have used this template to get data moving between buckets, which works fine.

But I'm a bit unsure of the next steps: how to pass a file to PySpark, run a script to transform it, then put it in another location. I have been looking at this, which I will try to understand.

If you know of or have any examples of how I might do this, or could describe how I might set it up, I'd appreciate it.





Hello. Using this template I managed to get a PySpark job running.

The Spark script doesn't accept data from a flowfile, however; it has hardcoded paths for the input and output files. I also had to tell Spark to use a specific Anaconda Python environment by setting PYSPARK_PYTHON, adding export PYSPARK_PYTHON="/path/to/python/env/python" to the spark conf/ file.

It would be nice to know how to create a script and template that accept flowfiles, though. If anyone has a template with an example of that, it would be great.
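
One way to get rid of the hardcoded paths, as a minimal sketch only: have the script take its input and output locations as command-line arguments, so whatever launches it can pass values in. The argument names (--input, --output), the app name, the CSV-to-Parquet choice, and the dropna() transformation are all placeholders I made up for illustration, and reading s3a:// paths assumes the cluster already has S3 access configured.

```python
import argparse
import sys


def parse_args(argv):
    """Parse the input and output paths passed on the command line."""
    parser = argparse.ArgumentParser(description="Transform one file with PySpark.")
    parser.add_argument("--input", required=True, help="source path, e.g. s3a://bucket/key")
    parser.add_argument("--output", required=True, help="destination path")
    return parser.parse_args(argv)


def transform(df):
    # Placeholder transformation (hypothetical): drop rows containing nulls.
    return df.dropna()


def main(argv):
    args = parse_args(argv)
    # Imported here so the argument handling can be checked without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("nifi-triggered-transform").getOrCreate()
    try:
        df = spark.read.csv(args.input, header=True, inferSchema=True)
        transform(df).write.mode("overwrite").parquet(args.output)
    finally:
        spark.stop()


# Run the job only when arguments were actually supplied, e.g. via spark-submit.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

It could then be started with something like spark-submit transform.py --input s3a://source-bucket/in.csv --output s3a://target-bucket/out, where the file name and bucket names are made up. On the NiFi side, a processor that runs an external command could presumably build those two arguments from flowfile attributes, but I have not tested that wiring.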





Super Guru

I would recommend trying the Apache NiFi ExecuteSparkInteractive processor.
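
For context, that processor runs code in a Spark session managed by Apache Livy. As a rough sketch of what ends up being sent, the snippet below builds the kind of JSON body that Livy's statements endpoint (POST /sessions/{id}/statements) accepts. The helper name, the paths, and the dropna() step are hypothetical, and it assumes the Livy PySpark session already provides a spark variable.

```python
import json


def build_livy_statement(input_path: str, output_path: str) -> str:
    """Return a JSON body containing PySpark code for a Livy statement request."""
    # The generated code assumes `spark` already exists in the Livy session.
    code = "\n".join([
        f"df = spark.read.csv({input_path!r}, header=True)",
        f"df.dropna().write.mode('overwrite').parquet({output_path!r})",
    ])
    return json.dumps({"kind": "pyspark", "code": code})


if __name__ == "__main__":
    # Hypothetical bucket names, for illustration only.
    print(build_livy_statement("s3a://source-bucket/in.csv", "s3a://target-bucket/out"))
```

With the processor you would not post this yourself; the point is only that the Spark code is plain text, so the paths can be filled in per flowfile rather than hardcoded.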


Thanks, I did see that, but it looked a bit hard to work out how to set it up from scratch.