Created 08-16-2017 07:17 PM
I have a need to process some xml data in Spark ( 1.6) using the databricks xml jar. My problem is the data source adds "xmlns: /data/path/d" to the root element tag and this extra verbiage makes the databricks xml parser not parse a node. If I remove the extra verbiage and leave a normal tag like <tag1\> , the parser parses fine. I would like to load the file to an RDD, replaceAll on the verbiage, and then run the RDD through the databricks xml parser to create a dataframe. So, the main question is I'm not sure how to load the RDD into the databricks xml jar. I only see examples of files being loaded.
Created 08-17-2017 04:16 PM
Can you please share a sample xml file with "xmlns: /data/path/d" to try out a solution?
Thanks
Vinod
Created 08-17-2017 06:12 PM
Created 08-18-2017 02:01 AM
Please try this and let me know. I've tested in spark 1.6.3.
./bin/spark-shell --packages com.databricks:spark-xml_2.10:0.4.1
scala> sqlContext.read.format("com.databricks.spark.xml").option("rowTag","WSAOnRoad").load("file:///root/problem.xml").show(false)
Created 08-18-2017 06:24 PM
Using sample xml given below, I'm able to parse and get the result
sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "root").load("file:///root/testxml/data.xml").show()
<root xmlns:h="http://www.w3.org/TR/html4/" xmlns:f="https://www.abc.com/furniture"> <h:table> <h:tr> <h:td>Apples</h:td> <h:td>Bananas</h:td> </h:tr> </h:table> <f:table> <f:name>Coffee Table</f:name> <f:width>80</f:width> <f:length>120</f:length> </f:table> </root
Created 08-17-2017 04:31 PM
if SLA is not the constraint, you can save RDD as temporary file and read it again via databricks. If you are running it via zeppelin dashboard, you can invoke shell interpreter and use sed to do an infile replace for xmlns: prior to reading it on your dataframe.
Created 08-17-2017 06:14 PM
Yes, I realize I could write it to disk but was trying to avoid that if possible. I am not using Zeppellin.
Created 08-18-2017 06:24 PM
@kenny creed
Using a sample xml file with "xmlns: /data/path/d" to the root element tag, I'm able to parse it with this code
sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "root").load("file:///testxml/data.xml").show() //Sample XML file <root xmlns:h="http://www.w3.org/TR/html4/" xmlns:f="https://www.abc.com/furniture"> <h:table> <h:tr> <h:td>Apps</h:td> <h:td>bean</h:td> </h:tr> </h:table> <f:table> <f:name>tables</f:name> <f:width>80</f:width> <f:length>120</f:length> </f:table> </root>
If it is not solving your current issue, please share your sample xml file to understand the xml content better.
Created 08-18-2017 08:28 PM
I went back and checked my uploaded file and I too can parse it. I am thinking there may be some type of hidden characters somewhere in my file. It is too big to upload the complete file. Is there a way to map the file as an RDD, replaceAll, and input it through the Databricks parser?
Created 08-20-2017 08:01 AM
You may try the APIs suggested in this link : https://github.com/databricks/spark-xml/issues/114#issuecomment-215301575