
How to import data from a URL through PySpark?

I want to import the data available at this link "" into Spark using PySpark. Is there a way to download it directly into Spark?



@Bala Vignesh N V

If you want static data, you can fetch it with Python's third-party requests library or the standard-library urllib2 module (urllib.request in Python 3), then parse it and convert it to a Spark RDD.
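
As a rough illustration of the static approach, here is a minimal sketch. It assumes Python 3, that the data at the URL is CSV text (the original link is not shown, so the format is a guess), and that a SparkContext already exists. The function names and the example URL are hypothetical.

```python
import csv
import io
from urllib.request import urlopen  # Python 3; urllib2.urlopen in Python 2


def fetch_text(url, encoding="utf-8", timeout=30):
    """Download the body at `url` on the driver and return it as a string."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode(encoding)


def parse_csv_text(text):
    """Split CSV text into a list of rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def url_to_rdd(sc, url):
    """Fetch `url`, parse it as CSV (an assumption), and return an RDD of rows.

    `sc` is an existing pyspark.SparkContext.
    """
    rows = parse_csv_text(fetch_text(url))
    return sc.parallelize(rows)


# Usage, assuming a running SparkContext `sc` and a placeholder URL:
#   rdd = url_to_rdd(sc, "https://example.com/data.csv")
#   print(rdd.take(5))
```

Note that the download happens entirely on the driver, so this only suits data that fits comfortably in driver memory.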

But if you want to build a streaming application, then per the official documentation:

Spark Streaming provides two categories of built-in streaming sources.

  • Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, and socket connections.
  • Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. These require linking against extra dependencies as discussed in the linking section.

If you're using Kafka:

You need a producer that reads the data from the URL and writes it to a topic.
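
A producer along these lines could be sketched as follows. This assumes the third-party kafka-python package, a broker at localhost:9092, and line-oriented text at the URL. None of these come from the original post, and the function names are hypothetical.

```python
from urllib.request import urlopen


def publish_lines(producer, topic, lines):
    """Send each non-empty line to `topic` as UTF-8 bytes; return the count sent."""
    sent = 0
    for line in lines:
        line = line.strip()
        if line:
            producer.send(topic, line.encode("utf-8"))
            sent += 1
    producer.flush()  # block until buffered messages are delivered
    return sent


def stream_url_to_topic(url, topic, brokers="localhost:9092"):
    """Fetch `url` once and publish its lines to a Kafka topic."""
    # Imported here so publish_lines can be used without kafka-python installed.
    from kafka import KafkaProducer  # assumed: pip install kafka-python

    producer = KafkaProducer(bootstrap_servers=brokers)
    try:
        with urlopen(url, timeout=30) as response:
            text = response.read().decode("utf-8")
        return publish_lines(producer, topic, text.splitlines())
    finally:
        producer.close()


# Usage with placeholder values:
#   stream_url_to_topic("https://example.com/data.txt", "url-data")
```

For a source that keeps changing, the fetch would need to run on a schedule or in a loop; this sketch publishes a single snapshot.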

You can then use KafkaInputDStream in Spark, which acts as an abstraction over the Kafka consumer, to create a Spark DStream.
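
On the PySpark side, Kafka streams are normally created through the KafkaUtils helper rather than by instantiating KafkaInputDStream directly. The sketch below uses KafkaUtils.createDirectStream, which exists in Spark 1.x/2.x only (the pyspark.streaming.kafka module was removed in Spark 3) and needs the spark-streaming-kafka-0-8 package on the classpath. The topic name, broker address, and function names are assumptions for illustration.

```python
def value_of(record):
    """Kafka records arrive as (key, value) pairs; keep only the value."""
    return record[1]


def build_kafka_lines(ssc, topic, brokers="localhost:9092"):
    """Return a DStream of message values read from `topic`.

    `ssc` is an existing pyspark.streaming.StreamingContext.
    """
    # Imported lazily: this module is only present in Spark 1.x/2.x.
    from pyspark.streaming.kafka import KafkaUtils

    stream = KafkaUtils.createDirectStream(
        ssc, [topic], {"metadata.broker.list": brokers}
    )
    return stream.map(value_of)


# Usage, assuming Spark 2.x and a topic named "url-data" (hypothetical):
#   from pyspark import SparkContext
#   from pyspark.streaming import StreamingContext
#   sc = SparkContext(appName="url-stream")
#   ssc = StreamingContext(sc, 10)  # 10-second batches
#   build_kafka_lines(ssc, "url-data").pprint()
#   ssc.start()
#   ssc.awaitTermination()
```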

You can read the link below; it should give you some idea: