
Does Spark Standalone need HDFS?


Hi,

Does Apache 'Spark Standalone' need HDFS?

If it is required, how does Spark use the HDFS block size during application execution? In other words, I am trying to understand what role HDFS plays while a Spark application runs.

The Spark documentation says that processing parallelism is controlled through RDD partitions and the executors/cores.

Can anyone please help me understand?

ACCEPTED SOLUTION

Super Guru
@RAMESH K

Spark is the engine that processes data. The data it processes can be sitting in HDFS or in other file systems and data repositories that Spark supports.

For example, Spark can read and then process data from S3. HDFS is just one of the file systems that Spark supports. Similarly, Spark can read from JDBC data sources like Oracle. So HDFS is only one of the storage systems you can use with Spark.
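For instance, here is a minimal sketch of reading from each of those sources with the DataFrame API. The master URL, paths, table name, and credentials are placeholders rather than anything from this thread, and the S3 read assumes the hadoop-aws libraries and credentials are available:

```scala
import org.apache.spark.sql.SparkSession

// A session against a standalone master; no HDFS services are assumed.
val spark = SparkSession.builder()
  .appName("multi-source-read")
  .master("spark://master-host:7077") // hypothetical standalone master URL
  .getOrCreate()

// HDFS is just one possible source...
val fromHdfs = spark.read.textFile("hdfs://namenode:8020/data/events.txt")

// ...S3 works the same way (needs the hadoop-aws jars and credentials on the classpath)...
val fromS3 = spark.read.json("s3a://my-bucket/events/")

// ...and so does a JDBC source such as Oracle.
val fromOracle = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
  .option("dbtable", "EVENTS")
  .option("user", "scott")
  .option("password", "tiger")
  .load()
```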

When Spark is running in parallel across multiple machines, that is a Spark cluster. For example, you can have a Spark cluster that reads from S3 and processes data in parallel.

Similarly, you can have a Spark cluster that reads data from HDFS and processes it in parallel. In this case, Spark is processing data in parallel on a number of machines while HDFS is also being used to read data in parallel from different machines.

You need to distinguish between "reading data in parallel" (HDFS) and "processing data in parallel" (Spark).
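To make that distinction concrete, here is a rough sketch reusing the session from the snippet above; the file path, block size, and partition counts are only illustrative assumptions. By default Spark typically creates roughly one read partition per HDFS block, while the executors/cores you request determine how many partitions are processed at the same time:

```scala
// Reading from HDFS: roughly one input partition per HDFS block, so a
// 1 GB file with a 128 MB block size yields about 8 read partitions.
// This is where the HDFS block size shows up during a Spark job.
val lines = spark.sparkContext.textFile("hdfs://namenode:8020/data/big-file.txt")
println(lines.getNumPartitions)

// Processing parallelism, by contrast, comes from the executors/cores you
// ask for (e.g. spark-submit --total-executor-cores 8 on a standalone
// cluster gives 8 task slots) and from the partition count, which you can
// change independently of the HDFS block layout.
val repartitioned = lines.repartition(64)
```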


REPLIES


Here is my understanding of a case where HDFS is not required for Spark:

If we are migrating structured data from a database like Oracle to a NoSQL database like Cassandra using a Spark/Spark SQL job, then in this case we do not need any storage like HDFS.
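If I understand it right, that scenario would look roughly like the sketch below. The connection details and keyspace/table names are placeholders, and it assumes the DataStax spark-cassandra-connector package is on the classpath:

```scala
import org.apache.spark.sql.SparkSession

// Session configured for Cassandra; no HDFS path appears anywhere in the job.
val spark = SparkSession.builder()
  .appName("oracle-to-cassandra")
  .config("spark.cassandra.connection.host", "cassandra-host") // hypothetical host
  .getOrCreate()

// Read the structured data straight from Oracle over JDBC.
val customers = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
  .option("dbtable", "CUSTOMERS")
  .option("user", "scott")
  .option("password", "tiger")
  .load()

// Write it directly into a Cassandra table via the DataStax connector.
customers.write
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "sales")
  .option("table", "customers")
  .mode("append")
  .save()
```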

Please correct me if I am wrong. Thanks.


@RAMESH K

Spark can run without HDFS. HDFS is only one of quite a few data stores/sources for Spark.
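As a trivial illustration (the paths here are made up), a job in local or standalone mode can read and write using nothing but the local file system:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("no-hdfs")
  .master("local[*]") // or a standalone master URL; no Hadoop daemons are needed
  .getOrCreate()

// file:// URIs use the local file system, so no HDFS cluster is required.
val words = spark.read.textFile("file:///tmp/input.txt")
words.filter(_.contains("spark")).write.text("file:///tmp/output")
```

On a multi-node standalone cluster the file:// path would have to be visible on every worker (for example over NFS), which is one practical reason a shared store such as HDFS or S3 is often used anyway.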

Below are some links that answer your question in depth, from different perspectives, with explanations and comparisons:

http://stackoverflow.com/questions/32669187/is-hdfs-necessary-for-spark-workloads/34789554#34789554

http://stackoverflow.com/questions/32022334/can-apache-spark-run-without-hadoop

http://stackoverflow.com/questions/28664834/which-cluster-type-should-i-choose-for-spark/34657719#34...



Thank you 🙂