
How to import data from a URL through PySpark?

I want to import the data available at this link, "http://www.cricbuzz.com/cricket-series/2489/england-tour-of-india-2016-17/matches", into Spark using PySpark. Is there a way to download it directly into Spark?


Expert Contributor

@Bala Vignesh N V

If you want static data, you can fetch it on the driver with the third-party requests library or with Python's built-in urllib2 module (urllib.request in Python 3), then parse it and convert the result to a Spark RDD.
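As a rough sketch of the static approach, assuming Python 3 and only the standard library: the page is downloaded on the driver, reduced to its visible text fragments, and handed to Spark with parallelize. The names TextExtractor, fetch_html, parse_fragments and html_to_rdd are made up for this illustration, the User-Agent header is an assumption about what the site accepts, and sc stands for a SparkContext you already have (for example the one the pyspark shell provides).

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TextExtractor(HTMLParser):
    """Collects the visible text fragments of an HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.fragments.append(text)


def fetch_html(url, timeout=30):
    """Download the page on the driver and return it as a string."""
    # Assumption: the site wants a browser-like User-Agent.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse_fragments(html):
    """Return the non-empty text fragments found in the HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.fragments


def html_to_rdd(sc, html):
    """Turn the parsed fragments into an RDD; sc is an existing SparkContext."""
    return sc.parallelize(parse_fragments(html))
```

In the pyspark shell this would be used as rdd = html_to_rdd(sc, fetch_html(url)). Keep in mind the download itself runs on the driver, so this only suits pages small enough to fit in driver memory. For real scraping you will probably want a proper HTML parser that targets the match table rather than all visible text.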

But if you want to build a streaming application, then as per the official documentation:

Spark Streaming provides two categories of built-in streaming sources.

  • Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, and socket connections.
  • Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. These require linking against extra dependencies as discussed in the linking section.

If you're using Kafka:

You need a producer that reads the data from the URL and writes it to a Kafka topic.

See http://saurzcode.in/2015/02/kafka-producer-using-twitter-stream/ for reference.
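A producer like that could look roughly like the sketch below. It assumes the third-party kafka-python package for the real producer (its KafkaProducer class with send, flush and close), and it simply polls the URL at a fixed interval. The function names, the 60-second interval and the User-Agent header are all assumptions for illustration, not something the linked article prescribes.

```python
import time
from urllib.request import Request, urlopen


def fetch_bytes(url, timeout=30):
    """Download the raw page body as bytes."""
    # Assumption: the site wants a browser-like User-Agent.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def publish_url(producer, url, topic, fetch=fetch_bytes):
    """Fetch the URL once and write the body to the topic.

    producer is any object with send(topic, value=...) and flush(),
    such as kafka-python's KafkaProducer. Returns the payload size.
    """
    payload = fetch(url)
    producer.send(topic, value=payload)
    producer.flush()  # block until the message has been handed to the broker
    return len(payload)


def poll_forever(bootstrap_servers, url, topic, interval_seconds=60):
    """Hypothetical driver loop: publish the page every interval_seconds."""
    # Imported lazily so the helpers above work without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    try:
        while True:
            publish_url(producer, url, topic)
            time.sleep(interval_seconds)
    finally:
        producer.close()
```

Passing the producer in as an argument keeps the fetch-and-publish step easy to try out with a stand-in object before pointing it at a real broker.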

You can then use KafkaInputDStream in Spark, which is an abstraction over the Kafka consumer, to create a Spark DStream from that topic.
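From PySpark you would not touch KafkaInputDStream directly. A minimal sketch of the consuming side, assuming a Spark 1.x/2.x install with the spark-streaming-kafka package on the classpath, uses KafkaUtils.createDirectStream from pyspark.streaming.kafka instead. Note that this is the direct (receiver-less) variant rather than the receiver-based stream named above, and as far as I know the Python module is not available in Spark 3, where Structured Streaming's Kafka source is the replacement. The function names here are made up, and ssc is a StreamingContext you have already created.

```python
def decode_value(record):
    """Direct-stream records arrive as (key, value) pairs; keep only the value."""
    _key, value = record
    return value


def kafka_values_stream(ssc, brokers, topic):
    """Return a DStream of message values read from the given Kafka topic.

    ssc     -- an existing pyspark.streaming.StreamingContext
    brokers -- comma-separated host:port list, e.g. "localhost:9092"
    topic   -- name of the topic the producer writes to
    """
    # Imported lazily: this module ships with Spark 1.x/2.x only (assumption
    # about your install) and needs the Kafka integration jar at submit time.
    from pyspark.streaming.kafka import KafkaUtils

    stream = KafkaUtils.createDirectStream(
        ssc, [topic], {"metadata.broker.list": brokers}
    )
    return stream.map(decode_value)
```

After building the stream you would attach an action to it (for example pprint()) and then call ssc.start() and ssc.awaitTermination() as in any other Spark Streaming job.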

The link below should give you some idea of how the pieces fit together:

http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/