Need to convert a python code to pyspark script
Labels: Apache Spark
Created ‎02-01-2017 08:18 AM
Hello Friends,
I am absolutely new to Hadoop and Spark, so I am still building my understanding of both. Currently I am facing a big problem with pySpark coding.
We have a log-analytics use case in Python which runs successfully. However, we are thinking of converting the code to pySpark to gain speed, but I am absolutely stuck on converting this Python code to pySpark.
I really need your help on how to do it, and I will use this learning experience on future assignments.
Uploading the log files and py script for reference: sample.zip.
Please help us with this conversion.
Created ‎02-01-2017 08:49 AM
Currently your parsing logic is based on a state machine. That approach won't translate well to Spark's model.
In Spark you'd need to load your data into a Dataset/DataFrame (or RDD) and express your operations on that data structure.
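For illustration, here is a minimal sketch of that idea; the app name and the "port" filter are placeholder assumptions, not something taken from your script:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LogAnalytics").getOrCreate()
# spark.read.text loads the log as a DataFrame with one string column, "value"
df = spark.read.text("Virtual_Ports.log")
# Operations are declared over the whole dataset rather than line by line;
# the contains("port") filter is only a stand-in for your real parsing rules
port_lines = df.filter(df.value.contains("port"))
print(port_lines.count())
spark.stop()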
I don't think anybody will convert your code to Spark here, and learning Spark would be inevitable anyway if you need to maintain the ported code.
The lowest-hanging fruit for you would be to try the PyPy interpreter, which is generally faster than CPython.
I've noticed in your code that you are reading in the file content in one go:
lines = file.readlines()
It would be more efficient to iterate through the file line by line:
for line in open("Virtual_Ports.log", "r"):
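A with block is the more idiomatic form and also guarantees the file is closed; in this sketch, process_line is a hypothetical stand-in for your existing per-line parsing logic:

with open("Virtual_Ports.log", "r") as f:
    for line in f:
        process_line(line)  # hypothetical: apply your existing parsing here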
I'd also suggest using a profiler to see where your hotspots are.
Hope this helps,
Tibor
Created ‎02-01-2017 02:39 PM
So can you please guide me on how to convert this to a pySpark script that gives the same results?
I am very sorry for posting such a naive query, but I need help here.
Created ‎02-09-2017 02:44 AM
I had a similar use case recently. You have to approach this understanding that it's a different paradigm:
- You can't do I/O the old-fashioned way; whatever dataset you're manipulating must be distributed, i.e. your log file should be on HDFS. So as a first step, opening the log file and creating an RDD would look something like this:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("CheckData") \
    .getOrCreate()

lines = spark.read.text("hdfs://[servername]/[path]/Virtual_Ports.log") \
    .rdd.map(lambda r: r[0])
- You don't programmatically iterate over the data per se; instead you supply a function that Spark applies to each value (each line, in this case). So the code where you iterate over lines could be put inside a function:
def virtualPortFunction(line):
    # Do something, return the processed output of a line
    ...

virtualPortsSomething = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: virtualPortFunction(x))
This is a very simplistic way to put it, but it should give you a starting point if you decide to go down the PySpark route.
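To make that concrete, here is a minimal end-to-end sketch under the same assumptions; the HDFS path placeholders are kept as-is, and extract_port is a hypothetical parser standing in for your real state-machine logic, rewritten to work on a single line:

from pyspark.sql import SparkSession

def extract_port(line):
    # Hypothetical parser: pull a token out of a log line.
    # Replace with your real per-line parsing logic.
    return line.split(' ')[0]

spark = SparkSession.builder.appName("VirtualPorts").getOrCreate()
lines = spark.read.text("hdfs://[servername]/[path]/Virtual_Ports.log") \
    .rdd.map(lambda r: r[0])

# Count occurrences of each extracted token across the whole file
counts = lines.map(extract_port) \
    .map(lambda p: (p, 1)) \
    .reduceByKey(lambda a, b: a + b)

for port, n in counts.collect():
    print(port, n)

spark.stop()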
Also look at the PySpark examples that ship with the Spark distribution. Good place to start:
https://github.com/apache/spark/blob/master/examples/src/main/python/wordcount.py
