Support Questions
Find answers, ask questions, and share your expertise

Process an RDD through the databricks xml jar.

Explorer

I need to process some XML data in Spark (1.6) using the databricks xml jar. My problem is that the data source adds "xmlns: /data/path/d" to the root element tag, and this extra text stops the databricks xml parser from parsing a node. If I remove the extra text and leave a normal tag like <tag1>, the parser parses fine. I would like to load the file into an RDD, run replaceAll on the offending text, and then run the RDD through the databricks xml parser to create a DataFrame. So the main question is that I'm not sure how to load an RDD into the databricks xml jar; I only see examples of files being loaded.

9 REPLIES

Re: Process an RDD through the databricks xml jar.

Explorer

@kenny creed,

Can you please share a sample xml file with "xmlns: /data/path/d" so I can try out a solution?

Thanks

Vinod


Re: Process an RDD through the databricks xml jar.

Explorer

Please try this and let me know. I've tested it in Spark 1.6.3.

./bin/spark-shell --packages com.databricks:spark-xml_2.10:0.4.1

scala> sqlContext.read.format("com.databricks.spark.xml").option("rowTag","WSAOnRoad").load("file:///root/problem.xml").show(false)

Re: Process an RDD through the databricks xml jar.

Explorer

@kenny creed

Using the sample XML given below, I'm able to parse it and get the result:

sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "root").load("file:///root/testxml/data.xml").show()
<root xmlns:h="http://www.w3.org/TR/html4/"
xmlns:f="https://www.abc.com/furniture">

<h:table>
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>

<f:table>
<f:name>Coffee Table</f:name>
<f:width>80</f:width>
<f:length>120</f:length>
</f:table>

</root>

Re: Process an RDD through the databricks xml jar.

Expert Contributor

If SLA is not a constraint, you can save the RDD as a temporary file and read it again via the databricks parser. If you are running it via a Zeppelin dashboard, you can invoke the shell interpreter and use sed to do an in-file replace on "xmlns:" prior to reading it into your dataframe.
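A minimal sketch of the sed approach, using a throwaway demo file (the path and XML content here are hypothetical stand-ins for the real problem file):

```shell
# Hypothetical demo file containing the stray namespace text the OP described
printf '<root xmlns: /data/path/d><tag1>hi</tag1></root>\n' > /tmp/problem.xml

# In-file replace: strip "xmlns: /data/path/d" before Spark reads the file.
# Using | as the sed delimiter avoids escaping the slashes in the path.
sed -i 's|xmlns: /data/path/d||g' /tmp/problem.xml

cat /tmp/problem.xml
```

After the replace, the root element is a plain tag that spark-xml can parse.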

Re: Process an RDD through the databricks xml jar.

Explorer

Yes, I realize I could write it to disk, but I was trying to avoid that if possible. I am not using Zeppelin.

Re: Process an RDD through the databricks xml jar.

Explorer

@kenny creed

Using a sample xml file with "xmlns: /data/path/d" in the root element tag, I'm able to parse it with this code:

sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "root").load("file:///testxml/data.xml").show()


//Sample XML file
<root xmlns:h="http://www.w3.org/TR/html4/"
xmlns:f="https://www.abc.com/furniture">

<h:table>
<h:tr>
<h:td>Apps</h:td>
<h:td>bean</h:td>
</h:tr>
</h:table>

<f:table>
<f:name>tables</f:name>
<f:width>80</f:width>
<f:length>120</f:length>
</f:table>
</root>

If this does not solve your current issue, please share your sample xml file so we can understand the xml content better.

Re: Process an RDD through the databricks xml jar.

Explorer

I went back and checked my uploaded file, and I too can parse it. I am thinking there may be some hidden characters somewhere in my file. It is too big to upload the complete file. Is there a way to map the file as an RDD, run replaceAll, and feed it through the Databricks parser?
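A sketch of that RDD approach, assuming the spark-xml package from earlier in the thread (started via `--packages com.databricks:spark-xml_2.10:0.4.1`): spark-xml's XmlReader accepts an RDD[String] directly via xmlRdd, so no temporary file is needed. The path and rowTag "tag1" are placeholders for your own file and row element.

```scala
import com.databricks.spark.xml.XmlReader

// Load the raw file as an RDD of lines and strip the stray namespace text.
// Note: a line-wise replaceAll only works if the offending text does not
// span a line break in the source file.
val cleaned = sc.textFile("file:///root/problem.xml")
  .map(_.replaceAll("xmlns: /data/path/d", ""))

// Feed the cleaned RDD straight to the spark-xml parser
val df = new XmlReader()
  .withRowTag("tag1")
  .xmlRdd(sqlContext, cleaned)

df.show()
```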

Re: Process an RDD through the databricks xml jar.

Explorer