Need to convert Python code to a PySpark script

Expert Contributor

Hello Friends,

I am absolutely new to Hadoop and Spark and still building up my knowledge of both. Currently I am facing a big problem with PySpark coding.

We have a log-analytics use case written in Python which runs successfully. However, we are thinking of converting the code to PySpark to gain speed, but I am absolutely stuck on how to convert this Python code to PySpark.

I really need your help on how to do it, and I will use this learning experience on future assignments.

Uploading the log files and the .py script for reference: sample.zip.

Please help us with this conversion.

3 REPLIES

Expert Contributor

Currently your parsing logic is based on a state machine. That approach won't map well to the way Spark processes data.

In Spark you'd need to load your data into a Dataset/DataFrame (or RDD) and do your operations through that data structure.
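To give you a rough idea, a minimal sketch of that could look like the following (the HDFS path and the "port" filter are made-up placeholders, not taken from your script):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("LogAnalytics").getOrCreate()

# Each line of the file becomes one row in a single "value" column
logs = spark.read.text("hdfs:///tmp/Virtual_Ports.log")

# Work is expressed as operations on the DataFrame, not as a Python loop
port_lines = logs.filter(F.col("value").contains("port"))
print(port_lines.count())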

I don't think anybody here will convert your code to Spark for you, and learning Spark would be inevitable anyway if you need to maintain the ported code.

The lowest-hanging fruit for you would be to try the PyPy interpreter, which is generally faster than CPython:

http://pypy.org/

I've noticed in your code that you are reading in the file content in one go:

lines = file.readlines() 

It would be more efficient to iterate through the file line by line:

for line in open("Virtual_Ports.log", "r"):
    ...  # process one line at a time instead of loading the whole file

I'd also suggest using a profiler to see where your hotspots are.
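The standard library's cProfile module is enough for that; a minimal sketch (main() here just stands in for whatever your top-level function is):

import cProfile

def main():
    ...  # your existing parsing logic

# Print a per-function timing breakdown, sorted by cumulative time;
# alternatively, without touching the code: python -m cProfile -s cumulative your_script.py
cProfile.run("main()", sort="cumulative")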

Hope this helps,

Tibor

Expert Contributor

So can you please guide me on how to convert this to a PySpark script that gives the same results?

I am very sorry for posting this type of naive query, but I require help here.

Contributor

I had a similar use case recently. You have to approach this understanding that it's a different paradigm:

  • You can't do I/O the old-fashioned way; whatever dataset you're manipulating must be distributed, i.e. your log file should be on HDFS. So as a first step, opening the log file and creating an RDD would look something like this:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("CheckData") \
    .getOrCreate()

lines = spark.read.text("hdfs://[servername]/[path]/Virtual_Ports.log").rdd.map(lambda r: r[0])




  • You don't programmatically iterate on the data per se; instead, you supply a function to process each value (lines in this case). So your code where you iterate on lines could be put inside a function:
def virtualPortFunction(line):
    # Do something: return the processed output of a single line
    return line

virtualPortsSomething = lines.flatMap(lambda x: x.split(' ')) \
                             .map(lambda x: virtualPortFunction(x))

This is a very simplistic way to put it, but it will give you a starting point if you decide to go down the PySpark route. A minimal end-to-end skeleton combining the two pieces is sketched below.
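Here the HDFS path stays a placeholder and virtualPortFunction is just a stub; the real parsing logic from your original script would go there:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("VirtualPortsLog").getOrCreate()

lines = spark.read.text("hdfs://[servername]/[path]/Virtual_Ports.log").rdd.map(lambda r: r[0])

def virtualPortFunction(line):
    return line.strip()  # stub: per-line processing goes here

results = lines.map(virtualPortFunction)

# Nothing runs until an action is called; collect() brings the results back to the driver
for r in results.collect():
    print(r)

spark.stop()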

Also look at the PySpark examples that ship with the Spark distribution. It's a good place to start.

https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py