How to import data from a URL through PySpark?

I want to import the data available at this link "" into Spark using PySpark. Is there a way to download it directly into Spark?


If you want static data, you can fetch it on the driver with the requests library or Python's built-in urllib module (urllib2 in Python 2), then parse it and convert it to a Spark RDD.
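As a rough sketch of the static approach, the snippet below downloads a file on the driver, parses it, and hands the rows to Spark. It assumes the URL serves CSV with a header row; the function names (fetch_text, parse_csv, load_url_into_rdd) and the app name are made up for illustration.

```python
import csv
import io
import urllib.request


def fetch_text(url, timeout=30):
    """Download the resource at `url` on the driver and return it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def parse_csv(text):
    """Parse CSV text into (header, rows), skipping blank lines."""
    rows = [tuple(row) for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return [], []
    return list(rows[0]), rows[1:]


def load_url_into_rdd(url):
    """Fetch CSV from `url` and return (header, RDD of row tuples)."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("url-import").getOrCreate()
    header, rows = parse_csv(fetch_text(url))
    # The whole file passes through driver memory, so this only suits
    # data that fits comfortably on one machine.
    return header, spark.sparkContext.parallelize(rows)
```

If a DataFrame is more convenient than an RDD, spark.createDataFrame(rdd, schema=header) should give one with string columns named after the header.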

But if you want to build a streaming application, then as per the official documentation:

Spark Streaming provides two categories of built-in streaming sources.

  • Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, and socket connections.
  • Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. These require linking against extra dependencies as discussed in the linking section.

If you're using Kafka:

You need a producer that reads the data from the URL and writes it to a Kafka topic.
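A producer along those lines might look like the sketch below. It assumes the third-party kafka-python package, a broker at localhost:9092, and that each non-empty line of the downloaded text should become one message; the function names and the default broker address are illustrative, not taken from the question.

```python
import urllib.request


def text_to_messages(text):
    """Return one UTF-8 encoded message per non-empty line of `text`."""
    return [line.encode("utf-8") for line in text.splitlines() if line.strip()]


def publish_url_to_topic(url, topic, bootstrap_servers="localhost:9092"):
    """Download `url` once and write its lines to the Kafka `topic`."""
    # Assumption: kafka-python is installed (pip install kafka-python).
    from kafka import KafkaProducer

    with urllib.request.urlopen(url, timeout=30) as response:
        text = response.read().decode("utf-8")

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    try:
        for message in text_to_messages(text):
            producer.send(topic, value=message)
        # send() is asynchronous; flush() blocks until the buffer is drained.
        producer.flush()
    finally:
        producer.close()
```

For a source that changes over time, you would call publish_url_to_topic on a schedule (or loop with a sleep) rather than once.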

You can then use KafkaInputDStream in Spark, which is an abstraction over the Kafka consumer, to create a Spark DStream.
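From PySpark, the DStream is usually created through KafkaUtils rather than by instantiating KafkaInputDStream directly. The sketch below assumes Spark 1.x or 2.x with the spark-streaming-kafka-0-8 package on the classpath (the pyspark.streaming.kafka module was removed in Spark 3.0), a broker at localhost:9092, and comma-separated message values; build_stream and split_record are illustrative names.

```python
def split_record(record):
    """Split the value of a (key, value) Kafka record into fields."""
    _key, value = record
    # Assumption: each message value is one comma-separated line.
    return value.split(",")


def build_stream(topic, brokers="localhost:9092", batch_seconds=10):
    """Create a StreamingContext that prints parsed records from `topic`."""
    # Imported here so split_record can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="url-stream")
    ssc = StreamingContext(sc, batch_seconds)
    # Each element of the stream is a (key, value) pair decoded as UTF-8.
    stream = KafkaUtils.createDirectStream(
        ssc, [topic], {"metadata.broker.list": brokers}
    )
    stream.map(split_record).pprint()
    return ssc
```

To run it, call ssc = build_stream("my-topic"), then ssc.start() and ssc.awaitTermination(); "my-topic" is a placeholder for whatever topic the producer writes to.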

The link below should give you some idea: