How is Spark better than Hadoop?


New Contributor
 

Re: How is Spark better than Hadoop?

Super Mentor

@Sakina MIrza

Frankly, comparing Spark with HDFS is like comparing apples to oranges.


Spark: Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere can now create applications to exploit Spark’s power, derive insights, and enrich their data science workloads within a single, shared dataset in Hadoop. The Hadoop YARN-based architecture provides the foundation that enables Spark and other applications to share a common cluster and dataset while ensuring consistent levels of service and response. Spark is now one of many data access engines that work with YARN in HDP.

1. https://spark.apache.org/

2. https://hortonworks.com/apache/spark/#section_2



Hadoop (HDFS): HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage in a single cluster of 4,500 servers, supporting close to a billion files and blocks. When that quantity and quality of enterprise data is available in HDFS, and YARN enables multiple data access applications to process it, Hadoop users can confidently answer questions that eluded previous data platforms. HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow linearly with demand while remaining economical at every scale.

1. http://hadoop.apache.org/

2. https://hortonworks.com/apache/hdfs/#section_2


Please see the whole HDP ecosystem to understand where Hadoop (HDFS) and Spark are.

https://hortonworks.com/ecosystems/

[Image: HDP ecosystem diagram]


Re: How is Spark better than Hadoop?

New Contributor

The Apache Spark developers bill it as “a fast and general engine for large-scale data processing.” To put it as an analogy: if Hadoop’s Big Data framework is the 800-lb gorilla, then Spark is the 130-lb big data cheetah.


Re: How is Spark better than Hadoop?

New Contributor

Apache Spark comes with a very advanced Directed Acyclic Graph (DAG) data processing engine. This means that for every Spark job, a DAG of tasks is created to be executed by the engine. In mathematical parlance, a DAG consists of a set of vertices and directed edges connecting them, and the tasks are executed as per the DAG layout. In the MapReduce case, the DAG consists of only two vertices, with one vertex for the map task and the other for the reduce task.
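The lazy DAG idea can be sketched in plain Python (this is an illustration, not Spark's actual API): each transformation only adds a vertex with an edge back to its parent, and nothing runs until an action walks the graph.

```python
# Minimal sketch of a lazily-built DAG of transformations (not real Spark code).

class Node:
    """One vertex in the DAG: a transformation applied to a parent vertex."""
    def __init__(self, func, parent=None):
        self.func = func          # the transformation for this vertex
        self.parent = parent      # directed edge back to the upstream vertex

    def map(self, f):
        # Building the DAG: no data moves yet, we only record a new vertex.
        return Node(lambda data: [f(x) for x in data], parent=self)

    def filter(self, pred):
        return Node(lambda data: [x for x in data if pred(x)], parent=self)

    def collect(self, data):
        # The "action": walk the DAG from source to sink, running each vertex.
        chain = []
        node = self
        while node is not None:
            chain.append(node)
            node = node.parent
        for node in reversed(chain):
            data = node.func(data)
        return data

source = Node(lambda data: data)   # source vertex (identity)
dag = source.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(dag.collect([1, 2, 3, 4]))   # [4, 16]
```

In real Spark the engine also uses the DAG to pipeline stages and recompute lost partitions, but the core idea is the same: describe first, execute later.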

Re: How is Spark better than Hadoop?

New Contributor

Speed

  • Apache Spark – Spark is a lightning-fast cluster computing tool. Apache Spark runs applications up to 100x faster in memory and 10x faster on disk than Hadoop MapReduce. Spark makes this possible by reducing the number of read/write cycles to disk and storing intermediate data in memory.
  • Hadoop MapReduce – MapReduce reads from and writes to disk, which slows down the processing speed.
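The in-memory point can be illustrated with a plain-Python sketch (again, not Spark's API): without caching, every action recomputes the whole lineage, much like MapReduce re-reading from disk; caching the intermediate result pays the cost once.

```python
# Count how many times an "expensive" transformation actually runs,
# with and without keeping its result in memory.

calls = {"n": 0}

def expensive_transform(data):
    calls["n"] += 1              # stands in for a disk read/write cycle
    return [x * 2 for x in data]

data = [1, 2, 3]

# Without caching: each action recomputes the full lineage (MapReduce-style).
total = sum(expensive_transform(data))
count = len(expensive_transform(data))
assert calls["n"] == 2           # transformation ran twice

# With caching: compute once, keep it in memory, reuse it for both actions.
calls["n"] = 0
cached = expensive_transform(data)
total, count = sum(cached), len(cached)
assert calls["n"] == 1           # transformation ran only once
```

This mirrors what Spark's `cache()`/`persist()` does for an RDD that several actions depend on.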

Difficulty

  • Apache Spark – Spark is easy to program, since it offers tons of high-level operators on the RDD (Resilient Distributed Dataset) abstraction.
  • Hadoop MapReduce – In MapReduce, developers need to hand-code each and every operation, which makes it much harder to work with.
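The programming-model difference can be sketched in plain Python (neither snippet uses the real Spark or Hadoop APIs). The same task, summing the squares of the even numbers, is one declarative chain in the Spark style but a hand-wired mapper and reducer in the MapReduce style.

```python
# Task: sum of squares of the even numbers.

nums = [1, 2, 3, 4, 5, 6]

# Spark-style: compose high-level operators in one declarative chain.
spark_style = sum(x * x for x in nums if x % 2 == 0)

# MapReduce-style: hand-code a mapper and a reducer, then wire them together.
def mapper(record):
    if record % 2 == 0:
        yield record * record    # emit an intermediate value

def reducer(values):
    total = 0
    for v in values:
        total += v
    return total

intermediate = [v for record in nums for v in mapper(record)]
mapreduce_style = reducer(intermediate)

assert spark_style == mapreduce_style == 56
```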

Easy to Manage

  • Apache Spark – Spark is capable of performing batch processing, interactive queries, machine learning, and streaming all in the same cluster, which makes it a complete data analytics engine. There is no need to manage a different component for each need; installing Spark on a cluster is enough to handle all these requirements.
  • Hadoop MapReduce – MapReduce only provides a batch engine, so we depend on different engines, for example Storm, Giraph, Impala, etc., for other requirements. Managing so many components is very difficult.

For more, refer to the link below:

Spark vs Hadoop