<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: What are common use cases for Spark and Data science? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-are-common-use-cases-for-Spark-and-Data-science/m-p/98149#M11616</link>
    <description>&lt;P&gt;&lt;STRONG&gt;5 Common use cases for Apache Spark:&lt;/STRONG&gt;&lt;/P&gt;&lt;H3&gt;&lt;EM&gt;Streaming ingest and analytics&lt;/EM&gt;&lt;/H3&gt;&lt;P&gt;Spark isn’t the first big data tool for handling streaming ingest, but it is the first one to &lt;A href="https://spark.apache.org/streaming/"&gt;integrate it with the rest of the analytic environment&lt;/A&gt;. Spark is friendly with the rest of the streaming data ecosystem, supporting data sources including Flume, Kafka, ZeroMQ, and HDFS.&lt;/P&gt;&lt;H3&gt;&lt;EM&gt;Exploratory analytics&lt;/EM&gt;&lt;/H3&gt;&lt;P&gt;One of the headline benefits of using Spark is that you no longer need to maintain separate environments for exploratory and production work. The relatively long execution times of a Hadoop MapReduce job make hands-on exploration of data difficult: data scientists typically still must sample data if they want to move quickly. Thanks to the speed of Spark’s in-memory capabilities, interactive exploration can now happen completely within Spark, without the need for Java engineering or sampling of the data.&lt;/P&gt;&lt;H3&gt;&lt;EM&gt;Model building and machine learning&lt;/EM&gt;&lt;/H3&gt;&lt;P&gt;Spark’s status as a big data tool that data scientists find easy to use makes it ideal for building models for analytical purposes. In a pre-Spark world, big data modelers typically built their models in a language such as R or SAS, then handed them to data engineers to re-implement in Java for production on Hadoop. With Spark, models can be built and run in the same environment, removing that re-implementation step.&lt;/P&gt;&lt;H3&gt;&lt;EM&gt;Graph analysis&lt;/EM&gt;&lt;/H3&gt;&lt;P&gt;By incorporating the &lt;A href="https://spark.apache.org/docs/latest/graphx-programming-guide.html"&gt;GraphX&lt;/A&gt; component, Spark brings all the benefits of its environment to graph computation, enabling use cases such as social network analysis, fraud detection, and recommendations.&lt;/P&gt;&lt;H3&gt;&lt;EM&gt;Simpler, faster ETL&lt;/EM&gt;&lt;/H3&gt;&lt;P&gt;Though less glamorous than the analytical applications, ETL is often the lion’s share of data workloads. If the rest of your data pipeline is based on Spark, the &lt;A href="http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup"&gt;benefits of using Spark for ETL&lt;/A&gt; are obvious, with consequent gains in maintainability and code reuse.&lt;/P&gt;</description>
    <pubDate>Sat, 23 Sep 2017 13:20:33 GMT</pubDate>
    <dc:creator>dreamzdaisy</dc:creator>
    <dc:date>2017-09-23T13:20:33Z</dc:date>
    <item>
      <title>What are common use cases for Spark and Data science?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-are-common-use-cases-for-Spark-and-Data-science/m-p/98147#M11614</link>
      <description>&lt;P&gt;What are common use cases for Spark and Data science across different verticals?&lt;/P&gt;</description>
      <pubDate>Tue, 21 Apr 2026 13:30:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-are-common-use-cases-for-Spark-and-Data-science/m-p/98147#M11614</guid>
      <dc:creator>abajwa</dc:creator>
      <dc:date>2026-04-21T13:30:35Z</dc:date>
    </item>
    <item>
      <title>Re: What are common use cases for Spark and Data science?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-are-common-use-cases-for-Spark-and-Data-science/m-p/98148#M11615</link>
      <description>&lt;P&gt;Some common use cases for Spark:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Interactive SQL: small datasets and relatively simple queries where the response is expected in under a second. In this scenario the table is usually cached in memory.&lt;/LI&gt;&lt;LI&gt;ETL: use Spark for traditional ETL where MapReduce was used. Any use case where you previously used MapReduce is now a good fit for Spark.&lt;/LI&gt;&lt;LI&gt;Streaming: Spark Streaming can ingest data from a variety of sources, but it is most commonly used in conjunction with Kafka. Since Kafka can provide message replay, putting it in front of Spark (or Storm) helps reliability. Spark is a good fit for streaming when streaming is part of an overall data processing platform. If you need to build a specialized platform focused on streaming with millisecond latency, consider Storm; otherwise Spark is a good fit.&lt;/LI&gt;&lt;LI&gt;Predictive analytics: Spark makes data science and machine learning easier; with the built-in MLlib library and the ML Pipelines API for modeling workflows, predictive analytics is much more approachable.&lt;/LI&gt;&lt;LI&gt;A combination of the above in a single application.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;To make it more concrete, here are some examples from actual customers:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Predict at-risk shopping carts in an online session and offer coupons or other incentives to increase sales.&lt;/LI&gt;&lt;LI&gt;Process insurance claims coming from a traditional data pipeline, including textual claims information, using Spark Core; then use Spark’s built-in feature extraction facilities such as TF-IDF and Word2Vec for feature engineering, to predict insurance payment accuracy and flag certain claims for closer inspection.&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Tue, 08 Dec 2015 13:44:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-are-common-use-cases-for-Spark-and-Data-science/m-p/98148#M11615</guid>
      <dc:creator>vshukla</dc:creator>
      <dc:date>2015-12-08T13:44:13Z</dc:date>
    </item>
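The TF-IDF feature extraction mentioned above is provided in Spark by MLlib (HashingTF and IDF); as a plain-Python illustration of what that weighting computes, not the MLlib API itself, a minimal sketch on hypothetical claims tokens:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    # document frequency: how many documents contain each term
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        # term frequency scaled by inverse document frequency
        weights.append({t: (c / total) * math.log(n / df[t]) for t, c in tf.items()})
    return weights

# hypothetical tokenized claim descriptions
docs = [["claim", "water", "damage"], ["claim", "fire"], ["fire", "report"]]
w = tf_idf(docs)
# "claim" appears in two of the three documents, so it is weighted
# lower than the rarer, more distinctive term "water"
```

Terms that appear in nearly every document score near zero, which is why TF-IDF surfaces the distinctive vocabulary useful for flagging unusual claims.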
    <item>
      <title>Re: What are common use cases for Spark and Data science?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-are-common-use-cases-for-Spark-and-Data-science/m-p/98149#M11616</link>
      <description>&lt;P&gt;&lt;STRONG&gt;5 Common use cases for Apache Spark:&lt;/STRONG&gt;&lt;/P&gt;&lt;H3&gt;&lt;/H3&gt;&lt;P&gt;&lt;EM&gt;Streaming ingest and analytics&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Spark isn’t the first big data tool for handling streaming ingest, but it is the first one to &lt;A href="https://spark.apache.org/streaming/"&gt;integrate it with the rest of the analytic environment&lt;/A&gt;. Spark is friendly with the rest of the streaming data ecosystem, supporting data sources including Flume, Kafka, ZeroMQ, and HDFS.&lt;/P&gt;&lt;H3&gt;&lt;/H3&gt;&lt;P&gt;&lt;EM&gt;Exploratory analytics&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;One of the headline benefits of using Spark is that you no longer need to maintain different environments for exploratory and production work. The relatively long execution times of a Hadoop MapReduce job make it difficult for hands-on exploration of data: data scientists typically still must sample data if they want to move quickly. Thanks to the speed of Spark’s in-memory capabilities, interactive exploration can now happen completely within &lt;A target="_blank" href="https://mindmajix.com/apache-spark-training"&gt;Spark&lt;/A&gt; , without the need for Java engineering or sampling of the data. &lt;/P&gt;&lt;H3&gt;&lt;/H3&gt;&lt;P&gt;&lt;EM&gt;Model building and machine learning&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Spark’s status as a big data tool that data scientists find easy to use makes it ideal for building models for analytical purposes. In a pre-Spark world, big data modelers typically built their models in a language such as R or SAS, then threw them to data engineers to re-implement in Java for production on Hadoop. 
&lt;/P&gt;&lt;H3&gt;&lt;/H3&gt;&lt;P&gt;&lt;EM&gt;Graph analysis&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;By incorporating the &lt;A href="https://spark.apache.org/docs/latest/graphx-programming-guide.html"&gt;GraphX&lt;/A&gt; component, Spark brings all the benefits of using its environment to graph computation: enabling use cases such as social network analysis, fraud detection, and recommendations. &lt;/P&gt;&lt;H3&gt;&lt;/H3&gt;&lt;P&gt;&lt;EM&gt;Simpler, faster, ETL&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Though less glamorous than the analytical applications, ETL is often the lion’s share of data workloads. If the rest of your data pipeline is based on Spark, then the &lt;A href="http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup"&gt;benefits of using Spark for ETL&lt;/A&gt; are obvious, with consequent increases in maintainability and code-reuse.&lt;/P&gt;</description>
      <pubDate>Sat, 23 Sep 2017 13:20:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/What-are-common-use-cases-for-Spark-and-Data-science/m-p/98149#M11616</guid>
      <dc:creator>dreamzdaisy</dc:creator>
      <dc:date>2017-09-23T13:20:33Z</dc:date>
    </item>
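The graph-analysis use case above is served in Spark by GraphX, whose flagship algorithm is PageRank. As a plain-Python sketch of the idea on a toy three-node graph (GraphX performs the distributed equivalent), assuming the standard 0.85 damping factor:

```python
def pagerank(edges, n, d=0.85, iters=50):
    """Iterative PageRank over a directed graph given as (src, dst) pairs."""
    ranks = [1.0 / n] * n
    out = [[] for _ in range(n)]
    for src, dst in edges:
        out[src].append(dst)
    for _ in range(iters):
        contrib = [0.0] * n
        # each node splits its rank evenly among its out-links
        for src in range(n):
            if out[src]:
                share = ranks[src] / len(out[src])
                for dst in out[src]:
                    contrib[dst] += share
        # damped update: random-jump term plus received contributions
        ranks = [(1 - d) / n + d * c for c in contrib]
    return ranks

# hypothetical graph: nodes 0 and 1 both link to node 2; node 2 links back to 0
edges = [(0, 2), (1, 2), (2, 0)]
ranks = pagerank(edges, 3)
# node 2, with two in-links, ends up with the highest rank
```

In a fraud-detection or recommendation setting the same iteration runs over millions of vertices, which is where GraphX's distributed execution pays off.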
  </channel>
</rss>

