Created 02-01-2017 08:18 AM
Hello Friends,
I am absolutely new to Hadoop and Spark, so I am still building up my understanding of both. Currently I am facing a big problem with PySpark coding.
We have a use case of log analytics using Python which runs successfully. However, we are thinking of converting the code to PySpark to gain speed, but I am absolutely stuck on how to convert this Python code to PySpark.
I really need your help on how to do it, and I will use this learning experience on future assignments.
Uploading the log files and the .py script for reference: sample.zip.
Please help us with this conversion.
Created 02-01-2017 08:49 AM
Currently your parsing logic is based on a state machine. That approach won't work well with the way Spark operates.
In Spark you'd need to load your data into a Dataset/DataFrame (or RDD) and perform operations through that data structure.
I don't think anybody here will convert your code to Spark for you, and learning Spark would be inevitable anyway if you need to maintain the ported code.
The lowest-hanging fruit for you would be to try the PyPy interpreter, which is more performant than CPython.
I've noticed in your code that you are reading in the file content in one go:
lines = file.readlines()
It would be more memory-efficient to iterate through the file line by line:
for line in open("Virtual_Ports.log", "r"):
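For example, a minimal sketch of line-by-line processing with a context manager (the filename, keyword, and counting logic are illustrative placeholders, not from the original script):

```python
# Sketch: stream a log file line by line instead of loading it
# all at once with readlines(). Only one line is held in memory.
def count_matching_lines(path, keyword):
    count = 0
    with open(path, "r") as f:   # context manager closes the file for us
        for line in f:           # yields one line at a time
            if keyword in line:
                count += 1
    return count
```

The `with` block also guarantees the file handle is closed even if the loop raises an exception.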
I'd also suggest using a profiler to see where your hotspots are.
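A minimal profiling sketch using the standard-library cProfile module (parse_lines is a stand-in for the real parsing function, not code from the attached script):

```python
import cProfile
import io
import pstats

def parse_lines(lines):
    # Stand-in for the real parsing logic being profiled
    return [line.strip().split(" ") for line in lines]

profiler = cProfile.Profile()
profiler.enable()
parse_lines(["a b c"] * 10000)
profiler.disable()

# Print the five most expensive calls by cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

The report shows per-function call counts and timings, which tells you whether the time goes into parsing, I/O, or something else before you invest in a rewrite.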
Hope this helps,
Tibor
Created 02-01-2017 02:39 PM
So can you please guide me on how to convert this to a PySpark script that gives the same results?
I am very sorry for posting such a naive query, but I require help here.
Created 02-09-2017 02:44 AM
I had a similar use case recently. You have to approach this with the understanding that it's a different paradigm:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("CheckData") \
    .getOrCreate()

lines = spark.read.text("hdfs://[servername]/[path]/Virtual_Ports.log") \
    .rdd.map(lambda r: r[0])

def virtualPortFunction(line):
    # Do something, return the output of processing a line
    ...

virtualPortsSomething = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: virtualPortFunction(x))
This is a very simplistic way to put it, but it will give you a starting point if you decide to go down the PySpark route.
Also look at the PySpark samples that are part of the Spark distribution. A good place to start:
https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py