Support Questions
Find answers, ask questions, and share your expertise

Spark Standalone need of HDFS

Solved Go to solution

Spark Standalone need of HDFS

Hi,

Does Apache Spark in standalone mode need HDFS?

If it is required, how does Spark use the HDFS block size during Spark application execution? In other words, I am trying to understand what role HDFS plays while a Spark application runs.

The Spark documentation says that processing parallelism is controlled through RDD partitions and the executors/cores.

Can anyone please help me understand?

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Spark Standalone need of HDFS

Super Guru
@RAMESH K

Spark is the engine that processes data. The data it processes can sit in HDFS or in any other file system or data repository that Spark supports.

For example, Spark can read and then process data from S3. HDFS is just one of the file systems that Spark supports. Similarly, Spark can read from JDBC data sources like Oracle. So HDFS is only one of the storage systems you can use with Spark.

When Spark is running in parallel, that is a Spark cluster. For example, you can have a Spark cluster that reads from S3 and processes the data in parallel.

Similarly, you can have a Spark cluster that reads data from HDFS and processes it in parallel. In this case, Spark is processing data in parallel on a number of machines while HDFS is also being used to read data in parallel from different machines.

You need to distinguish between reading data in parallel (HDFS) and processing data in parallel (Spark).
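On the block-size point the original question asks about: when Spark reads a file from HDFS, each HDFS block typically becomes one input split, and hence one RDD partition and one task. The exact behavior depends on the input format and settings such as minPartitions, but a rough back-of-the-envelope sketch of the arithmetic (pure Python, with illustrative defaults) looks like this:

```python
import math

def estimate_partitions(file_size_bytes,
                        block_size_bytes=128 * 1024 * 1024,
                        min_partitions=2):
    """Rough model: about one RDD partition per HDFS block, with a floor.

    This is an illustration, not Spark's exact split logic; the 128 MiB
    default block size and the floor of 2 are assumptions for the sketch.
    """
    return max(min_partitions, math.ceil(file_size_bytes / block_size_bytes))

# A 1 GiB file with 128 MiB blocks -> about 8 partitions / tasks
print(estimate_partitions(1 * 1024**3))    # 8
# A small 10 MiB file still gets the minimum partition count
print(estimate_partitions(10 * 1024**2))   # 2
```

So in the HDFS case the block size influences the initial degree of read parallelism, while executors/cores and repartitioning control how that work is actually processed.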


4 REPLIES

Re: Spark Standalone need of HDFS

Here is my understanding of a case where HDFS is not required for Spark:

If we are migrating structured data from a database like Oracle to a NoSQL database like Cassandra using a Spark/Spark SQL job, then we do not need any storage like HDFS.

Please correct me if I am wrong. Thanks.
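For what it's worth, a submission for that kind of job on a standalone cluster could look roughly like the sketch below. The master URL, connector/driver coordinates, and script name are all hypothetical, not taken from this thread; the point is only that no HDFS path appears anywhere in the job:

```shell
# Illustrative sketch only: master URL, package versions, jar name and
# script name are hypothetical. Source is Oracle via JDBC, sink is
# Cassandra via the Spark Cassandra Connector; HDFS is not involved.
spark-submit \
  --master spark://master-host:7077 \
  --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 \
  --driver-class-path ojdbc8.jar \
  migrate_oracle_to_cassandra.py
```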


Re: Spark Standalone need of HDFS

@RAMESH K

Spark can run without HDFS. HDFS is only one of quite a few data stores/sources for Spark.

Below are some links that answer your question in depth from different perspectives with some explanations and comparisons:

http://stackoverflow.com/questions/32669187/is-hdfs-necessary-for-spark-workloads/34789554#34789554

http://stackoverflow.com/questions/32022334/can-apache-spark-run-without-hadoop

http://stackoverflow.com/questions/28664834/which-cluster-type-should-i-choose-for-spark/34657719#34...



Re: Spark Standalone need of HDFS

Thank you :)