Spark Load/Performance Testing using Gatling – PART I

njayakumar · 09-24-2017

This Spark load-testing framework is built on a number of distributed technologies, including Gatling, Livy, Akka, and HDP. Using an Akka server powered by Livy (Spark as a Service) provides the following benefits:

REST friendly and Docker friendly
Low-latency execution
Sharing cache across jobs
Separation of concerns
Multi-tenancy
Direct Spark SQL execution
Configuration in one place
Auditing and logging
Complete statement history and metrics

Livy Server

Livy is an open source REST interface for interacting with Apache Spark from anywhere. It supports executing snippets of code or programs in a Spark context that runs locally or in Apache Hadoop YARN.

Livy offers three modes to run Spark jobs:

Using the programmatic API
Running interactive statements through the REST API
Submitting batch applications with the REST API
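
As an illustration of the interactive REST mode, here is a minimal sketch using only the JDK 11+ HTTP client. It relies on Livy's documented endpoints (POST /sessions and POST /sessions/{id}/statements); the host name, the object name LivyRestSketch, and the helper names are assumptions made for this example, and 8998 is Livy's default port.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object LivyRestSketch {
  // Assumed Livy address: localhost is a placeholder, 8998 is Livy's default port.
  val livyUrl = "http://localhost:8998"

  // Wrap a snippet in a JSON string literal. Simplification: only backslashes
  // and double quotes are escaped, so multi-line code is not handled.
  def jsonString(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Build a POST request carrying a JSON body.
  def postJson(path: String, body: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(livyUrl + path))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()

  // POST /sessions starts an interactive session; kind "spark" is a Scala shell.
  def createSession: HttpRequest =
    postJson("/sessions", """{"kind": "spark"}""")

  // POST /sessions/{id}/statements runs one snippet inside that session.
  def runStatement(sessionId: Int, code: String): HttpRequest =
    postJson(s"/sessions/$sessionId/statements", s"""{"code": ${jsonString(code)}}""")

  def main(args: Array[String]): Unit = {
    // Needs a reachable Livy server; prints the JSON description of the new session.
    val client = HttpClient.newHttpClient()
    val response = client.send(createSession, HttpResponse.BodyHandlers.ofString())
    println(response.body())
  }
}
```

Once the session reports that it is idle, a call such as runStatement(0, "1 + 1") can be sent with the same client, and the result can be read back from the statement's URL.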

Livy provides the following features:

Interactive Scala, Python, and R shells
Batch submissions in Scala, Java, Python
Multiple users can share the same server (impersonation support)
Can be used for submitting jobs from anywhere with REST
Does not require any code change to your programs
Supports Spark 1 and Spark 2, and Scala 2.10 and 2.11, within one build

Livy provides the following advantages:

Upload a JAR file and run a job programmatically. Additional applications can connect to the same cluster and upload a JAR with their next job. With spark-submit, you must manually upload the JAR file to the cluster and run the command yourself, so everything must be prepared before the run.
Use Spark in interactive mode, which is hard to do at scale with spark-submit or the Thrift Server.
Security. Livy reduces the exposure of the cluster to the outside world.
Stability. Spark is a complex framework, and there are many factors that can affect its long-term performance and stability. Decoupling the Spark context from the application lets you handle Spark issues gracefully, without full downtime of the application.

Gatling Server

Gatling is a highly capable load testing tool, designed for ease of use, maintainability, and high performance. The Gatling server provides the following benefits:

Powerful scripting using Scala
Akka + Netty
Run multiple scenarios in one simulation
Scenarios = code + DSL
Graphical reports with clear & concise graphs

Gatling’s architecture is asynchronous as long as the underlying protocol, such as HTTP, can be implemented in a non-blocking way. This kind of architecture lets us implement virtual users as messages instead of dedicated threads, which makes them very cheap in terms of resources. As a result, running thousands of concurrent virtual users is not an issue.

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

val theScenarioBuilder =
    scenario("Interactive Spark Command Scenario Using LIVY Rest Services $sessionId").exec(
        /* "Interactive Spark Command Simulation" is a name that describes the request. */
        http("Interactive Spark Command Simulation")
            // ${sessionId} and ${bbox} are filled in from each virtual user's session
            .get("/insrun?sessionId=${sessionId}&statement=sparkSession.sql(%22%20select%20event.site_id%20from%20siteexposure_event%20as%20event%20where%20st_intersects(st_makeBBOX(${bbox})%2C%20geom)%20limit%205%20%22).show")
            .check(status.is(200))
    ).pause(4 seconds)
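
The statement in the URL above is percent-encoded by hand (%22 for a double quote, %20 for a space, %2C for a comma). A minimal sketch of doing the same with the JDK's URLEncoder follows; the object and method names are made up for this example, and only the /insrun path and its two parameter names come from the scenario.

```scala
import java.net.URLEncoder

object StatementEncoding {
  // Percent-encode a value for a query string. URLEncoder targets form
  // encoding and emits "+" for spaces, so those are rewritten to "%20".
  // A literal "+" in the input is already encoded as %2B at that point.
  def encode(value: String): String =
    URLEncoder.encode(value, "UTF-8").replace("+", "%20")

  // Build the request path for the /insrun endpoint used in the scenario.
  // The ${...} here is Scala string interpolation, not Gatling's expression language.
  def insrunPath(sessionId: String, statement: String): String =
    s"/insrun?sessionId=${encode(sessionId)}&statement=${encode(statement)}"
}
```

Unlike the hand-written URL, this helper also encodes parentheses (as %28 and %29), which a server decodes back to the same characters. Gatling placeholders such as ${bbox} should be substituted before encoding, because the encoder would escape the dollar sign and braces too.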

So, this is great: we can load test our Spark interactive command with one user! Let’s increase the number of users.

To increase the number of simulated users, all you have to do is change the configuration of the simulation as follows:

setUp(
    theScenarioBuilder.inject(atOnceUsers(10))
    ).protocols(theHttpProtocolBuilder)
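
The setUp call refers to theHttpProtocolBuilder, and the scenario reads sessionId and bbox from the virtual user's session, but this article does not show where either comes from. Here is a minimal sketch of one way to wire them up, assuming Gatling 2.x on the classpath. The class name, base URL, feeder values, and simplified request are placeholders for illustration, not the original setup.

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._

class LivyLoadSimulation extends Simulation {
  // Placeholder address of the Akka server that fronts Livy (assumption).
  // Gatling 2.x spells this baseURL; Gatling 3 renamed it to baseUrl.
  val theHttpProtocolBuilder = http.baseURL("http://localhost:8080")

  // Placeholder feeder: every virtual user gets the same session id and bounding box.
  val theFeeder = Iterator.continually(
    Map("sessionId" -> "0", "bbox" -> "-180,-90,180,90"))

  val theScenarioBuilder = scenario("Feeder sketch")
    .feed(theFeeder) // copies sessionId and bbox into the user's session
    .exec(http("Feeder sketch request")
      .get("/insrun")
      .queryParam("sessionId", "${sessionId}"))

  setUp(theScenarioBuilder.inject(atOnceUsers(10)))
    .protocols(theHttpProtocolBuilder)
}
```

A feeder backed by a file or by a list of real Livy session ids would let different virtual users hit different sessions.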

If you want to simulate 3000 users, you might not want them to start at the same time. Indeed, real users are more likely to connect to your web application gradually.

Gatling provides rampUsers to implement this behavior. The value of the ramp indicates the duration over which the users will be linearly started. In our scenario, let’s take 10 regular users and ramp them up over 10 seconds so we don’t hammer the Livy server:

setUp(
    theScenarioBuilder.inject(rampUsers(10) over (10 seconds))
    ).protocols(theHttpProtocolBuilder)
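
To get a feel for what a linear ramp means, the sketch below computes evenly spaced start times for a given number of users. It is an illustrative model that assumes the first user starts at time zero; it is not Gatling's internal scheduler, and the object and method names are made up for this example.

```scala
object RampSketch {
  // Start offsets in milliseconds for `users` virtual users spread evenly
  // over a ramp of `rampMillis`, with the first user starting at t = 0.
  def startOffsetsMillis(users: Int, rampMillis: Long): Seq[Long] =
    (0 until users).map(i => rampMillis * i / users)
}
```

With 10 users over 10 seconds this gives one new user per second. Under the same model, 3000 users ramped over 60 seconds would start 20 milliseconds apart.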
