Support Questions


Error processing XML data in Spark

New Contributor

Hi,

I'm getting the error below while processing an XML file with Spark. I submit the job as follows:

spark-submit --class csvdf /CSVDF/target/scala-2.11/misc-test_2.11-1.0.jar

Error:

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:594)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
    at csvdf$.main(csvdf.scala:45)
    at csvdf.main(csvdf.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.xml.DefaultSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25$$anonfun$apply$13.apply(DataSource.scala:579)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25$$anonfun$apply$13.apply(DataSource.scala:579)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25.apply(DataSource.scala:579)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$25.apply(DataSource.scala:579)
    at scala.util.Try.orElse(Try.scala:84)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:579)
    ... 16 more

Source Code:

import java.io.File

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SparkSession}
import com.databricks.spark.xml._
...
val sConf = new SparkConf().setAppName("Hive test").setMaster("local")
val sc = new SparkContext(sConf)
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession
  .builder()
  .appName("Hive test")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._
import spark.sql

// Test XML input file
val xml_df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "doc")
  .load("file:///Downloads/sample.xml")

xml_df.printSchema()
xml_df.createOrReplaceTempView("XML_DATA")
spark.sql("SELECT * FROM XML_DATA").show()

SBT file:

name := "MISC test"

version := "1.0"

scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.1"

libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.1.1"

libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.1.1"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.1.1" % "provided"

libraryDependencies += "com.databricks" %% "spark-xml" % "0.4.1" % "provided"

I don't know what I am doing wrong. Any suggestions to resolve this will be a great help.

Many Thanks

Satya

1 REPLY

Cloudera Employee

Have you tried the --packages option (e.g. --packages com.databricks:spark-xml_2.11:0.4.1)? Your build marks spark-xml as "provided", so it is not bundled into the application jar; --packages resolves the Maven coordinates at submit time and puts the package on the driver and executor classpaths:

spark-submit \
  --packages com.databricks:spark-xml_2.11:0.4.1 \
  --class csvdf \
  /CSVDF/target/scala-2.11/misc-test_2.11-1.0.jar
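Alternatively, if you'd rather not pass --packages on every submit, you could drop the "provided" scope from the spark-xml dependency and build a fat jar (e.g. with the sbt-assembly plugin) so the data source ships inside the application jar itself. A sketch of the changed build.sbt line, assuming the same version as in the question:

```scala
// build.sbt (sketch): without the "provided" scope, a fat-jar build
// such as sbt-assembly will bundle spark-xml into the application jar,
// so com.databricks.spark.xml.DefaultSource is found at runtime.
libraryDependencies += "com.databricks" %% "spark-xml" % "0.4.1"
```

Note that a plain `sbt package` still produces a thin jar; bundling dependencies requires a fat-jar plugin, which is why --packages is usually the quicker fix.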