
Does Spark Standalone need HDFS?


Hi,

Does Apache 'Spark Standalone' need HDFS?

If it is required, how does Spark use the HDFS block size during application execution? In other words, I am trying to understand what role HDFS plays while a Spark application runs.

The Spark documentation says that processing parallelism is controlled through RDD partitions and the executors/cores.

Can anyone please help me understand?

ACCEPTED SOLUTION

Super Guru
@RAMESH K

Spark is the engine that processes data. The data it processes can be sitting in HDFS or in other file systems and data repositories that Spark supports.

For example, Spark can read and then process data from S3. HDFS is just one of the file systems that Spark supports. Similarly, Spark can read from JDBC data sources like Oracle. So HDFS is only one of the storage systems you can use with Spark.
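For instance, here is a minimal sketch of reading from each of those sources with the DataFrame API. The master URL, paths, table name, and credentials are placeholders rather than anything from this thread, and the S3 read assumes the hadoop-aws libraries and credentials are available:

```scala
import org.apache.spark.sql.SparkSession

// A session against a standalone master; no HDFS services are assumed.
val spark = SparkSession.builder()
  .appName("multi-source-read")
  .master("spark://master-host:7077") // hypothetical standalone master URL
  .getOrCreate()

// HDFS is just one possible source...
val fromHdfs = spark.read.textFile("hdfs://namenode:8020/data/events.txt")

// ...S3 works the same way (needs the hadoop-aws jars and credentials on the classpath)...
val fromS3 = spark.read.json("s3a://my-bucket/events/")

// ...and so does a JDBC source such as Oracle.
val fromOracle = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
  .option("dbtable", "EVENTS")
  .option("user", "scott")
  .option("password", "tiger")
  .load()
```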

When Spark is running in parallel across multiple machines, that is a Spark cluster. For example, you can have a Spark cluster that reads from S3 and processes data in parallel.

Similarly, you can have a Spark cluster that reads data from HDFS and processes it in parallel. In this case, Spark is processing data in parallel on a number of machines while HDFS is also being used to read data in parallel from different machines.

You need to distinguish between "reading data in parallel" (HDFS) and "processing data in parallel" (Spark).
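To make that distinction concrete, here is a rough sketch reusing the session from the snippet above; the file path, block size, and partition counts are only illustrative assumptions. By default Spark typically creates roughly one read partition per HDFS block, while the executors/cores you request determine how many partitions are processed at the same time:

```scala
// Reading from HDFS: roughly one input partition per HDFS block, so a
// 1 GB file with a 128 MB block size yields about 8 read partitions.
// This is where the HDFS block size shows up during a Spark job.
val lines = spark.sparkContext.textFile("hdfs://namenode:8020/data/big-file.txt")
println(lines.getNumPartitions)

// Processing parallelism, by contrast, comes from the executors/cores you
// ask for (e.g. spark-submit --total-executor-cores 8 on a standalone
// cluster gives 8 task slots) and from the partition count, which you can
// change independently of the HDFS block layout.
val repartitioned = lines.repartition(64)
```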


REPLIES


Here is my understanding of a case where HDFS is not required for Spark:

If we are migrating structured data from a database like Oracle to a NoSQL database like Cassandra using a Spark/Spark SQL job, then in this case we do not need any storage like HDFS.
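If I understand it right, that scenario would look roughly like the sketch below. The connection details and keyspace/table names are placeholders, and it assumes the DataStax spark-cassandra-connector package is on the classpath:

```scala
import org.apache.spark.sql.SparkSession

// Session configured for Cassandra; no HDFS path appears anywhere in the job.
val spark = SparkSession.builder()
  .appName("oracle-to-cassandra")
  .config("spark.cassandra.connection.host", "cassandra-host") // hypothetical host
  .getOrCreate()

// Read the structured data straight from Oracle over JDBC.
val customers = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
  .option("dbtable", "CUSTOMERS")
  .option("user", "scott")
  .option("password", "tiger")
  .load()

// Write it directly into a Cassandra table via the DataStax connector.
customers.write
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "sales")
  .option("table", "customers")
  .mode("append")
  .save()
```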

Please correct me if I am wrong. Thanks.


@RAMESH K

Spark can run without HDFS. HDFS is only one of quite a few data stores/sources for Spark.
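As a trivial illustration (the paths here are made up), a job in local or standalone mode can read and write using nothing but the local file system:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("no-hdfs")
  .master("local[*]") // or a standalone master URL; no Hadoop daemons are needed
  .getOrCreate()

// file:// URIs use the local file system, so no HDFS cluster is required.
val words = spark.read.textFile("file:///tmp/input.txt")
words.filter(_.contains("spark")).write.text("file:///tmp/output")
```

On a multi-node standalone cluster the file:// path would have to be visible on every worker (for example over NFS), which is one practical reason a shared store such as HDFS or S3 is often used anyway.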

Below are some links that answer your question in depth, from different perspectives, with explanations and comparisons:

http://stackoverflow.com/questions/32669187/is-hdfs-necessary-for-spark-workloads/34789554#34789554

http://stackoverflow.com/questions/32022334/can-apache-spark-run-without-hadoop

http://stackoverflow.com/questions/28664834/which-cluster-type-should-i-choose-for-spark/34657719#34...



Thank you 🙂