
What are common use cases for Spark and data science?

What are common use cases for Spark and data science across different verticals?

1 ACCEPTED SOLUTION

Some of the common use cases for Spark:

  • Interactive SQL: relatively simple SQL over a small dataset where the response is expected in under a second. In this scenario the table is usually cached in memory (a caching sketch follows this list).
  • ETL: use Spark for traditional ETL where MapReduce (MR) was used. Any use case where you used MR in the past is now a good fit for Spark.
  • Streaming: Spark Streaming can ingest data from a variety of sources, but most commonly it is used in conjunction with Kafka. Since Kafka can provide message replay, putting it in front of Spark (or Storm) helps reliability. Spark is a good fit for streaming where streaming is one part of the overall data processing platform. If you need a specialized platform focused on streaming with millisecond latency, consider Storm; otherwise Spark is a good fit.
  • Predictive analytics: Spark makes data science and machine learning easier. With its built-in MLlib library and the ML Pipeline API for modeling workflows, predictive analytics is much simpler.
  • Combinations of the above in a single application.
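To ground the interactive SQL bullet, here is a minimal spark-shell sketch (assuming Spark 2.x, where the spark session is predefined) of caching a table so simple queries return in well under a second; the path, table, and column names are hypothetical:

    // Register a dataset as a SQL table and pin it in memory
    val sales = spark.read.parquet("/data/sales")   // hypothetical path
    sales.createOrReplaceTempView("sales")
    spark.catalog.cacheTable("sales")               // subsequent queries hit memory

    spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()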

To make it more concrete, here are some examples from actual customers:

  • Predict at-risk shopping carts in an online session and offer coupons or other incentives to increase sales.
  • Process insurance claims coming from a traditional data pipeline, including textual claim information, using Spark Core; use Spark for feature engineering with built-in feature extraction facilities such as TF-IDF and Word2Vec to predict insurance payment accuracy and flag certain claims for closer inspection (a feature-extraction sketch follows this list).
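As a rough sketch of the feature-engineering step in the insurance example, using Spark ML's built-in TF-IDF support; the claim text here is invented for illustration:

    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

    // Toy claims standing in for real textual claim information
    val claims = spark.createDataFrame(Seq(
      (0L, "water damage to kitchen ceiling"),
      (1L, "rear end collision minor injury")
    )).toDF("id", "text")

    val words = new Tokenizer().setInputCol("text").setOutputCol("words").transform(claims)
    val tf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
      .setNumFeatures(1024).transform(words)
    val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)

    // The "features" column can feed a classifier that flags claims for inspection
    idfModel.transform(tf).select("id", "features").show(false)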

2 REPLIES

Explorer

5 common use cases for Apache Spark:

Streaming ingest and analytics

Spark isn’t the first big data tool for handling streaming ingest, but it is the first to integrate streaming with the rest of the analytic environment. Spark also plays well with the wider streaming data ecosystem, supporting sources including Flume, Kafka, ZeroMQ, and HDFS.
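A minimal sketch of Kafka ingest with Spark Streaming, assuming the spark-streaming-kafka-0-10 integration is on the classpath; the broker address, topic name, and group id are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    val conf = new SparkConf().setAppName("KafkaIngest")
    val ssc = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",           // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "spark-ingest",
      "auto.offset.reset" -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

    // Count events per micro-batch as a stand-in for real analytics
    stream.map(_.value).count().print()

    ssc.start()
    ssc.awaitTermination()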

Exploratory analytics

One of the headline benefits of using Spark is that you no longer need to maintain separate environments for exploratory and production work. The relatively long execution times of a Hadoop MapReduce job make hands-on exploration of data difficult: data scientists typically still must sample data if they want to move quickly. Thanks to the speed of Spark’s in-memory processing, interactive exploration can now happen completely within Spark, without the need for Java engineering or sampling of the data.
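As a loose illustration, an exploratory spark-shell session might look like the following; the dataset path and column name are hypothetical:

    // Explore the full dataset interactively, no sampling
    import spark.implicits._

    val events = spark.read.json("/data/events")    // hypothetical path
    events.printSchema()                            // inspect the inferred structure
    events.cache()                                  // keep it in memory for fast follow-ups
    events.groupBy("eventType")                     // hypothetical column
      .count()
      .orderBy($"count".desc)
      .show(10)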

Model building and machine learning

Spark’s status as a big data tool that data scientists find easy to use makes it ideal for building models for analytical purposes. In a pre-Spark world, big data modelers typically built their models in a language such as R or SAS, then handed them off to data engineers to re-implement in Java for production on Hadoop.
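As a rough sketch of keeping model building in one environment, here is a small ML Pipeline example; the input path and column names are hypothetical:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.VectorAssembler

    // Hypothetical labeled data with numeric columns and a 0/1 "label"
    val data = spark.read.parquet("/data/labeled")

    val assembler = new VectorAssembler()
      .setInputCols(Array("age", "income", "visits"))
      .setOutputCol("features")
    val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

    val pipeline = new Pipeline().setStages(Array(assembler, lr))
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

    // The same fitted pipeline object serves both exploration and production scoring
    val model = pipeline.fit(train)
    model.transform(test).select("label", "prediction").show(5)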

Graph analysis

By incorporating the GraphX component, Spark brings all the benefits of its environment to graph computation, enabling use cases such as social network analysis, fraud detection, and recommendations.
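A minimal GraphX sketch of the social-network case, run in spark-shell where the sc SparkContext is predefined; the users and edges are invented for illustration:

    import org.apache.spark.graphx.{Edge, Graph}

    // Tiny hypothetical social graph: vertices are users, edges are "follows"
    val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val follows = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))

    val graph = Graph(users, follows)

    // Tolerance-based PageRank to surface influential users
    val ranks = graph.pageRank(0.001).vertices
    ranks.join(users)
      .map { case (_, (rank, name)) => (name, rank) }
      .sortBy(_._2, ascending = false)
      .take(3)
      .foreach(println)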

Simpler, faster ETL

Though less glamorous than the analytical applications, ETL is often the lion’s share of data workloads. If the rest of your data pipeline is based on Spark, the benefits of using Spark for ETL are obvious, with consequent gains in maintainability and code reuse.
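To make the ETL case concrete, a minimal extract-transform-load sketch with the DataFrame API; the paths and column names are hypothetical:

    import org.apache.spark.sql.functions._
    import spark.implicits._

    // Extract: raw CSV landing zone (hypothetical path and columns)
    val raw = spark.read.option("header", "true").csv("/raw/transactions")

    // Transform: drop bad rows, fix types, derive a partition column
    val cleaned = raw
      .filter($"amount".isNotNull)
      .withColumn("amount", $"amount".cast("double"))
      .withColumn("event_date", to_date($"ts"))

    // Load: partitioned Parquet in the warehouse
    cleaned.write.mode("overwrite").partitionBy("event_date").parquet("/warehouse/transactions")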