
Using NiFi & PySpark to move & transform data on S3 - examples and resources

Explorer

Hey all,

I'm after some information on how I can use NiFi to get a file on S3, send it to PySpark, transform it, and move it to another folder in a different bucket.

I have used this template https://gist.github.com/ijokarumawak/26ff675039e252d177b1195f3576cf9a to get data moving between buckets, which works fine.

But I'm a bit unsure of the next steps: how to pass a file to PySpark, run a script to transform it, then put it in another location. I have been looking at this https://pierrevillard.com/2016/03/09/transform-data-with-apache-nifi/ which I will try to understand.

If you know of or have any examples of how I might do this, or could describe how I might set it up, that would be much appreciated.

Thanks,

Tim

3 REPLIES

Explorer

Hello. Using this template https://github.com/Teradata/kylo/blob/master/samples/templates/nifi-1.0/template-starter-pyspark.xml I managed to get a PySpark job running.

The Spark script doesn't accept data from a flowfile, however; it has hardcoded paths for the input and output files. I also had to tell Spark to use a specific Anaconda Python environment by setting PYSPARK_PYTHON in the Spark conf/spark-env.sh file, as follows: export PYSPARK_PYTHON="/path/to/python/env/python"
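
For anyone following along, here is a minimal sketch of that kind of hardcoded-path script. The bucket names, paths, and the transformation itself are placeholders, not my actual job, and it assumes the cluster already has S3 (s3a) credentials and the hadoop-aws connector configured:

```python
# Minimal PySpark job with hardcoded S3 paths, along the lines described above.
# Assumes s3a access is already configured on the cluster (hadoop-aws + credentials).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-transform").getOrCreate()

# Hardcoded input/output locations (hypothetical buckets and keys)
input_path = "s3a://source-bucket/incoming/data.csv"
output_path = "s3a://dest-bucket/processed/"

df = spark.read.csv(input_path, header=True, inferSchema=True)

# Placeholder transformation: drop rows containing nulls
df_clean = df.dropna()

df_clean.write.mode("overwrite").parquet(output_path)
spark.stop()
```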

It would be nice to know how to create a script and template that accept flowfiles, though. If anyone has a template with an example of that, it would be great.
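
One way around the hardcoded paths, untested on my end, would be to take them as command-line arguments so NiFi can fill them in from flowfile attributes (ExecuteStreamCommand's Command Arguments property accepts expression language such as ${filename}). A rough sketch, with hypothetical bucket names:

```python
# Rough sketch (untested): take S3 paths as arguments instead of hardcoding them,
# e.g. invoked by NiFi as:
#   spark-submit transform.py s3a://source-bucket/${filename} s3a://dest-bucket/processed/
import sys
from pyspark.sql import SparkSession

input_path, output_path = sys.argv[1], sys.argv[2]

spark = SparkSession.builder.appName("nifi-driven-transform").getOrCreate()
df = spark.read.csv(input_path, header=True)
df.dropna().write.mode("overwrite").parquet(output_path)  # placeholder transform
spark.stop()
```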

Cheers,

Tim


Master Guru

I would recommend trying the Apache NiFi ExecuteSparkInteractive processor:

https://community.hortonworks.com/articles/171787/hdf-31-executing-apache-spark-via-executesparkinte...
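
ExecuteSparkInteractive submits the code in its Code property to Spark through Apache Livy, so the PySpark side ends up being just a short snippet like the one below. This is a sketch with placeholder paths and logic, not taken from the article, and it assumes the Livy PySpark session already exposes the spark session object:

```python
# Sketch of a snippet for ExecuteSparkInteractive's Code property.
# `spark` should already be defined by the Livy PySpark session.
# Bucket names and the dropna() transform are placeholders.
df = spark.read.csv("s3a://source-bucket/incoming/data.csv", header=True)
df.dropna().write.mode("overwrite").parquet("s3a://dest-bucket/processed/")
```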

Explorer

Thanks, I did see that, but it looked a bit hard to follow how to set it up from scratch.