
Unable to process large dataset through Spark in Zeppelin

New Contributor
I am trying to load an Oracle DB table into Spark through Zeppelin. I am using the code below to load the table:

val df = spark.read.format("jdbc")
        .option("url","jdbc:oracle:thin:****/***@hostip/appname")
        .option("driver", "oracle.jdbc.OracleDriver")
        .option("dbtable", "schema.tablename")
        .load()

The above code executes successfully in Zeppelin, with the following message displayed:
df: org.apache.spark.sql.DataFrame = [fieldname1: int, fieldname2: string ... 92 more fields]

However, when I try to materialize the above df through df.count or df.collect, I get the error below. I get the same error if I do:

val df_count = spark.sql("select count (*) from count")
df_count.show

org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.util.NoSuchElementException: key not found: -1024
java.util.NoSuchElementException: key not found: -1024
	at scala.collection.MapLike$class.default(MapLike.scala:228)
	at scala.collection.AbstractMap.default(Map.scala:59)
	at scala.collection.mutable.HashMap.apply(HashMap.scala:65)

I am using the code below to set up the Spark session:

import org.apache.spark.sql._
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("test_app")
  .master("local[*]")
  .config("spark.kryo.registrator", "org.bdgenomics.adam.serialization.ADAMKryoRegistrator")
  .getOrCreate()

// import the implicits of the session actually being used,
// after that session exists
import spark.implicits._


Re: Unable to process large dataset through Spark in Zeppelin

Super Collaborator

Hi @Rohan Naidu,

This looks like a serialization library compatibility issue. Could you please remove the explicit serializer config so that Spark falls back to its internal serializer (Kryo is already used internally for simple types from Spark 2.0 onwards; see https://spark.apache.org/docs/latest/tuning.html#data-serialization) and proceed further, i.e. drop this line from the session builder:

  .config("spark.kryo.registrator","org.bdgenomics.adam.serialization.ADAMKryoRegistrator")

Hope this helps !!

Re: Unable to process large dataset through Spark in Zeppelin

New Contributor

@bkosaraju

I did try without the Kryo serializer. The issue is that I am able to load the Oracle table, but the moment I use show or count, it goes on a never-ending run.
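The next thing I plan to try is a partitioned read, since from what I understand the JDBC source otherwise pulls the whole table through a single connection in one partition, which could explain count/show running forever on a large table. The partition column, bounds, and partition count below are placeholders I would still need to fill in for my table:

// Sketch of a partitioned JDBC read; "ID_COLUMN", the bounds, and the
// partition count are placeholders that must match the real table.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:****/***@hostip/appname")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "schema.tablename")
  .option("partitionColumn", "ID_COLUMN") // a numeric, ideally indexed column
  .option("lowerBound", "1")              // min value of that column
  .option("upperBound", "1000000")        // max value of that column
  .option("numPartitions", "10")          // parallel JDBC connections
  .option("fetchsize", "10000")           // rows fetched per round trip
  .load()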
