
Unable to process large dataset through Spark in Zeppelin


I am trying to load an Oracle DB table into Spark through Zeppelin. I am using the code below to load the table:

val df = spark.read.format("jdbc")
        .option("url","jdbc:oracle:thin:****/***@hostip/appname")
        .option("driver", "oracle.jdbc.OracleDriver")
        .option("dbtable", "schema.tablename")
        .load()
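
For what it's worth, a partitioned read is an option I have been looking at for the large table (fieldname1 and the bounds below are placeholders; partitionColumn must be a numeric column):

val dfPartitioned = spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:****/***@hostip/appname")
        .option("driver", "oracle.jdbc.OracleDriver")
        .option("dbtable", "schema.tablename")
        // split the scan into parallel partitions across executors
        .option("partitionColumn", "fieldname1")
        .option("lowerBound", "1")
        .option("upperBound", "1000000")
        .option("numPartitions", "8")
        // rows fetched per round trip from Oracle
        .option("fetchsize", "1000")
        .load()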

The above code executes successfully in Zeppelin, with this message displayed:
df: org.apache.spark.sql.DataFrame = [fieldname1: int, fieldname2: string ... 92 more fields]

However, when I try to materialize the above df through df.count or df.collect, I get the error below. I get the same error if I do:
val df_count = spark.sql("select count (*) from count") 
df_count.show
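
(For the spark.sql query above to resolve, I am assuming df is first registered as a temp view named count, along these lines:)

// expose the DataFrame to Spark SQL under the name used in the query
df.createOrReplaceTempView("count")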

org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.util.NoSuchElementException: key not found: -1024
java.util.NoSuchElementException: key not found: -1024
	at scala.collection.MapLike$class.default(MapLike.scala:228)
	at scala.collection.AbstractMap.default(Map.scala:59)
	at scala.collection.mutable.HashMap.apply(HashMap.scala:65)

I am using the code below to set up the Spark session:

import org.apache.spark.sql._
import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder
      .appName("test_app")
      .master("local[*]")
      .config("spark.kryo.registrator", "org.bdgenomics.adam.serialization.ADAMKryoRegistrator")
      .getOrCreate()

    // implicits must be imported from the live session, so this comes after getOrCreate()
    import spark.implicits._

2 Replies

Re: Unable to process large dataset through Spark in Zeppelin


Hi @Rohan Naidu,

This looks like a serialization library compatibility issue. Could you please remove the explicit serializer so that Spark falls back to its internal one (Kryo is used by default for many types from Spark 2.0 onwards; see https://spark.apache.org/docs/latest/tuning.html#data-serialization) and proceed further? That is, drop this line from the session builder:

  .config("spark.kryo.registrator","org.bdgenomics.adam.serialization.ADAMKryoRegistrator")
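
The session setup would then look like this (same values as in the original post, minus the registrator):

    val spark = SparkSession
      .builder
      .appName("test_app")
      .master("local[*]")
      // no spark.kryo.registrator: Spark falls back to its built-in serializer
      .getOrCreate()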

Hope this helps!


Re: Unable to process large dataset through Spark in Zeppelin


@bkosaraju

I did try without the Kryo serializer. I am able to load the Oracle table, but the moment I use show or count, it goes on a never-ending run.
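
(One low-cost check, assuming the same df: print the query plan without triggering a job.)

// explain(true) only builds and prints the extended plan; it does not launch
// Spark tasks, which helps separate planning problems from execution hangs
df.explain(true)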