Support Questions

Error processing XML data in Spark

New Contributor

Hi,

I'm getting the error below while processing an XML file with Spark. I don't know what I am doing wrong here; any suggestion to resolve this would be greatly appreciated -

spark-submit --class csvdf /CSVDF/target/scala-2.11/misc-test_2.11-1.0.jar

Error:

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:594)
  at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
  at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
  at csvdf$.main(csvdf.scala:45)
  at csvdf.main(csvdf.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:743)
  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.databricks.spark.xml.DefaultSource
  at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at org.apache.spark.sql.execution.datasources.DataSource$anonfun$25$anonfun$apply$13.apply(DataSource.scala:579)
  at org.apache.spark.sql.execution.datasources.DataSource$anonfun$25$anonfun$apply$13.apply(DataSource.scala:579)
  at scala.util.Try$.apply(Try.scala:192)
  at org.apache.spark.sql.execution.datasources.DataSource$anonfun$25.apply(DataSource.scala:579)
  at org.apache.spark.sql.execution.datasources.DataSource$anonfun$25.apply(DataSource.scala:579)
  at scala.util.Try.orElse(Try.scala:84)
  at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:579)
  ... 16 more

Source Code:

import java.io.File
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SparkSession}
import com.databricks.spark.xml._
.
.
.
val sConf = new SparkConf().setAppName("Hive test").setMaster("local")
val sc = new SparkContext(sConf)
val warehouseLocation = new File("spark-warehouse").getAbsolutePath
val spark = SparkSession.builder()
  .appName("Hive test")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._
import spark.sql


// Test XML input file
val xml_df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "doc")
  .load("file:///Downloads/sample.xml")
xml_df.printSchema()
xml_df.createOrReplaceTempView("XML_DATA")
spark.sql("SELECT * FROM XML_DATA").show()

SBT file:

name := "MISC test"

version := "1.0"

scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.1"

libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.1.1"

libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.1.1"

libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.1.1" % "provided"

libraryDependencies += "com.databricks" %% "spark-xml" % "0.4.1" % "provided"


Many Thanks

Satya

1 REPLY

Contributor

Have you tried the --packages option (e.g. --packages com.databricks:spark-xml_2.11:0.4.1)? The driver can't find com.databricks.spark.xml because spark-xml is marked "provided" in your SBT file, so it never ends up in your application jar; --packages tells spark-submit to download it and put it on the classpath:

spark-submit \
  --packages com.databricks:spark-xml_2.11:0.4.1 \
  --class csvdf \
  /CSVDF/target/scala-2.11/misc-test_2.11-1.0.jar
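
If you would rather not pull the package at submit time, another option is to bundle spark-xml inside your application jar. The sketch below assumes the sbt-assembly plugin (the plugin version and the jar name are illustrative, not taken from your build): remove the "provided" scope from spark-xml, mark the Spark artifacts themselves as provided so they are not packed into the fat jar, and submit the assembly jar.

// project/plugins.sbt (hypothetical plugin version)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

// build.sbt - only the dependency lines that change
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.1.1" % "provided"
libraryDependencies += "com.databricks" %% "spark-xml" % "0.4.1" // no "provided", so it is bundled

Then build and submit the fat jar (the exact jar name depends on your assembly settings):

sbt assembly
spark-submit --class csvdf target/scala-2.11/<your-assembly-jar>.jar

With --packages, on the other hand, you can leave spark-xml as provided in the SBT file, because spark-submit downloads the package and puts it on the driver and executor classpaths for you.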