Created 02-01-2017 08:18 AM
Hello Friends,
I am absolutely new to Hadoop and Spark, so I am still building up my understanding of both. Currently I am facing a big problem with PySpark coding.
We have a use case of log analytics using Python which runs successfully. However, we are thinking of converting the code to PySpark to gain speed, but I am absolutely stuck on how to convert this Python code to PySpark.
I really need your help on how to do it, and I will use this learning experience on future assignments.
Uploading the log files and the .py script for reference: sample.zip.
Please help us with this conversion.
Created 02-01-2017 08:49 AM
Currently your parsing logic is based on a state machine. That approach won't work well with the way Spark operates.
In Spark you'd need to load your data into a Dataset/DataFrame (or RDD) and perform operations through that data structure.
I don't think anybody here will convert your code to Spark for you, and learning Spark would be inevitable anyway if you need to maintain the ported code.
The lowest-hanging fruit for you would be to try the PyPy interpreter, which is more performant than CPython.
I've noticed in your code that you are reading in the file content in one go:
lines = file.readlines()
It would be more memory-efficient to iterate through the file line by line:
for line in open("Virtual_Ports.log", "r"):
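For example, a minimal sketch of line-by-line processing with a context manager (the filename, keyword, and counting logic are illustrative placeholders, not from the original script):

```python
# Sketch: stream a log file line by line instead of loading it
# all at once with readlines(). Only one line is held in memory.
def count_matching_lines(path, keyword):
    count = 0
    with open(path, "r") as f:   # context manager closes the file for us
        for line in f:           # yields one line at a time
            if keyword in line:
                count += 1
    return count
```

The `with` block also guarantees the file handle is closed even if the loop raises an exception.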
I'd also suggest using a profiler to see where your hotspots are.
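A minimal profiling sketch using the standard-library cProfile module (parse_lines is a stand-in for the real parsing function, not code from the attached script):

```python
import cProfile
import io
import pstats

def parse_lines(lines):
    # Stand-in for the real parsing logic being profiled
    return [line.strip().split(" ") for line in lines]

profiler = cProfile.Profile()
profiler.enable()
parse_lines(["a b c"] * 10000)
profiler.disable()

# Print the five most expensive calls by cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

The report shows per-function call counts and timings, which tells you whether the time goes into parsing, I/O, or something else before you invest in a rewrite.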
Hope this helps,
Tibor
Created 02-01-2017 02:39 PM
So can you please guide me on how to convert this to a PySpark script that gives the same results?
I am very sorry for posting such a naive query, but I require help here.
Created 02-09-2017 02:44 AM
I had a similar use case recently. You have to approach this with the understanding that it's a different paradigm:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("CheckData") \
    .getOrCreate()

lines = spark.read.text("hdfs://[servername]/[path]/Virtual_Ports.log") \
    .rdd.map(lambda r: r[0])

def virtualPortFunction(line):
    # Do something, return the output of processing a line
    ...

virtualPortsSomething = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: virtualPortFunction(x))
This is a very simplistic way to put it, but it will give you a starting point if you decide to go down the PySpark route.
Also look at the PySpark samples that are part of the Spark distribution. A good place to start:
https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py