I want to import the data available at this link "http://www.cricbuzz.com/cricket-series/2489/england-tour-of-india-2016-17/matches" into Spark using PySpark. Is there a way to download it directly into Spark?
@Bala Vignesh N V
If you want static data, you can fetch it with plain Python, using either the third-party requests library or the standard urllib2 module (urllib.request in Python 3), then parse it and convert the result to a Spark RDD.
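A minimal sketch of the static approach, assuming Python 3 and only the standard library for the download and parsing. The names `LinkTextParser`, `extract_link_texts`, `fetch_html` and `to_rdd` are made up for illustration, and collecting the text of every `<a>` element is only a placeholder for whatever you actually need from the page (I have not checked the page's markup).

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

URL = "http://www.cricbuzz.com/cricket-series/2489/england-tour-of-india-2016-17/matches"


class LinkTextParser(HTMLParser):
    """Collects the visible text of every <a> element (illustrative choice)."""

    def __init__(self):
        super().__init__()
        self._inside = False   # True while we are inside an <a> element
        self._parts = []       # text fragments of the current <a>
        self.rows = []         # one cleaned-up string per non-empty <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside = True
            self._parts = []

    def handle_data(self, data):
        if self._inside:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self._inside = False
            # Collapse runs of whitespace and drop empty links.
            text = " ".join("".join(self._parts).split())
            if text:
                self.rows.append(text)


def extract_link_texts(html):
    """Parse an HTML string and return the list of link texts."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch_html(url, timeout=10):
    """Download the page on the driver and return it as a string."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def to_rdd(sc, url=URL):
    """Fetch, parse, and distribute the rows as an RDD on an existing SparkContext."""
    return sc.parallelize(extract_link_texts(fetch_html(url)))
```

From the pyspark shell, where `sc` already exists, `to_rdd(sc).take(5)` should show the first few rows. Note that the download happens once on the driver; Spark only takes over after `parallelize`.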
But if you want to build a streaming application, then as per the official documentation:
Spark Streaming provides two categories of built-in streaming sources.
If you're using Kafka:
You need a producer that reads the data from the URL and writes it to a topic.
See http://saurzcode.in/2015/02/kafka-producer-using-twitter-stream/ for reference.
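A rough sketch of such a producer, assuming the third-party kafka-python package (`pip install kafka-python`) and a broker on `localhost:9092`. The topic name, the broker address, and the helper names `fetch_lines`, `publish_lines` and `run_once` are placeholders, and sending the page line by line is only one possible way to split it into messages.

```python
from urllib.request import Request, urlopen

URL = "http://www.cricbuzz.com/cricket-series/2489/england-tour-of-india-2016-17/matches"
TOPIC = "cricket-matches"    # hypothetical topic name
BROKERS = "localhost:9092"   # hypothetical broker address


def fetch_lines(url, timeout=10):
    """Download the page and return its non-empty lines, stripped."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return [line.strip() for line in body.splitlines() if line.strip()]


def publish_lines(producer, topic, lines):
    """Send each line to the topic as UTF-8 bytes; return how many were sent."""
    count = 0
    for line in lines:
        producer.send(topic, line.encode("utf-8"))
        count += 1
    producer.flush()  # block until buffered messages are handed to the broker
    return count


def run_once():
    """Fetch the page once and publish it; call this on a schedule to poll."""
    from kafka import KafkaProducer  # imported here so the helpers work without Kafka

    producer = KafkaProducer(bootstrap_servers=BROKERS)
    try:
        return publish_lines(producer, TOPIC, fetch_lines(URL))
    finally:
        producer.close()
```

To turn this into a continuous feed you would call `run_once()` in a loop with a sleep between polls, ideally sending only what changed since the last poll.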
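You can then use KafkaInputDStream in Spark as an abstraction over the Kafka consumer to create a Spark DStream. From PySpark you do not instantiate it directly; you go through the KafkaUtils helper in pyspark.streaming.kafka, which is available in Spark versions that ship that module.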
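On the Spark side, a sketch of the consumer could look like the following. It assumes a Spark version that still ships `pyspark.streaming.kafka` (it was removed in Spark 3.0), that the matching spark-streaming-kafka package is on the classpath, and it uses the direct-stream variant of `KafkaUtils`. The topic, broker address, batch interval, and the names `record_value`, `build_stream` and `main` are placeholders.

```python
TOPIC = "cricket-matches"    # hypothetical topic name
BROKERS = "localhost:9092"   # hypothetical broker address


def record_value(record):
    """Elements of a Kafka DStream are (key, value) pairs; keep only the value."""
    return record[1]


def build_stream(ssc, topic=TOPIC, brokers=BROKERS):
    """Create a DStream of message values read from the Kafka topic."""
    from pyspark.streaming.kafka import KafkaUtils

    stream = KafkaUtils.createDirectStream(
        ssc, [topic], {"metadata.broker.list": brokers}
    )
    return stream.map(record_value)


def main():
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="cricbuzz-stream")
    ssc = StreamingContext(sc, 10)  # 10-second batches, an arbitrary choice
    build_stream(ssc).pprint()      # print a few records from every batch
    ssc.start()
    ssc.awaitTermination()
```

You would submit the script with spark-submit and call `main()` from it; the exact `--packages` coordinate depends on your Spark and Scala versions, so I have left it out.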
You can read the link below; it should give you some idea: