XML processing using Spark

Explorer

How can I process an XML file using Spark without the Databricks spark-xml package? Is there a standard approach that can be used in real, live projects?

7 REPLIES

Explorer

@Shu Can you help me with this? I need to run a PoC on my system.

Master Guru
@Satya G

AFAIK, yes: by using the Databricks spark-xml package, we can parse the XML file and create a DataFrame on top of the XML data.

Once the DataFrame is created, we can analyze the data using DataFrame API functions.

Refer to this and this link for more details regarding the usage and source code of the Spark XML package.

Explorer

I want to parse them using PySpark without using the Databricks package. Is there a way to do it? If yes, please give me some sample code.

Thank you.

New Contributor

Spark is great for XML processing. It is based on a massively parallel distributed compute paradigm. I think you can find some useful info in these examples:

https://stackoverflow.com/questions/33078221/xml-processing-in-spark

https://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html

Also, check https://anonymous-essay.com/ regarding XSD/XML complexity. And finally, you can view this thread to find out how to do it without the Databricks package.
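As a rough illustration of parsing without the spark-xml package, here is a minimal sketch that uses only Python's standard xml.etree.ElementTree. The sample document, the `book` row tag, and the function name `parse_records` are all made up for this example. The Spark wiring is shown only in comments, and it assumes each XML file is small enough to be read whole into memory by wholeTextFiles.

```python
import xml.etree.ElementTree as ET


def parse_records(xml_text, row_tag="book"):
    """Yield one dict per <row_tag> element found in an XML document string."""
    root = ET.fromstring(xml_text)
    for elem in root.iter(row_tag):
        record = dict(elem.attrib)      # attributes become columns
        for child in elem:              # direct child elements become columns
            record[child.tag] = (child.text or "").strip()
        yield record


# Hypothetical sample data, only for illustration.
SAMPLE = """<catalog>
  <book id="bk101"><author>Gambardella</author><price>44.95</price></book>
  <book id="bk102"><author>Ralls</author><price>5.95</price></book>
</catalog>"""

rows = list(parse_records(SAMPLE))
print(rows)

# With PySpark (not executed here; the path is a placeholder), the same
# function could be applied to every file in a directory:
#   from pyspark.sql import Row
#   rdd = spark.sparkContext.wholeTextFiles("hdfs:///path/to/xml/")
#   df = rdd.flatMap(lambda kv: parse_records(kv[1])) \
#           .map(lambda d: Row(**d)).toDF()
```

All values come out as strings, so numeric columns such as price would still need a cast after the DataFrame is built.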

Contributor

If you would like to use NiFi instead, you can try this Groovy script:

https://github.com/maxbback/nifi-xml

New Contributor

Hello,

For XML processing on Apache Spark, you can use the spark-xml library.

For Apache Spark 3.0, use the version spark-xml_2.12-0.10.0.jar

For Apache Spark 2.4, use the version spark-xml_2.11-0.6.0.jar

Regards.

New Contributor

The spark-xml package is a good option too. With all of these options, however, you are limited to processing simple XML that can be interpreted as a dataset with rows and columns. If the XML is even a little more complex, those options won’t be useful.
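To make the "rows and columns" limitation concrete, here is a small sketch of what flattening a nested document by hand might look like. The `orders` structure, its tag names, and the `flatten_orders` function are invented for this example. It only shows that repeating child elements have to be expanded into one row each, with the parent fields copied onto every row.

```python
import xml.etree.ElementTree as ET


def flatten_orders(xml_text):
    """Return one flat dict per <item>, repeating the parent <order> fields."""
    root = ET.fromstring(xml_text)
    rows = []
    for order in root.iter("order"):
        customer = order.findtext("customer", default="")
        # Each repeating <item> becomes its own row.
        for item in order.findall("items/item"):
            rows.append({
                "order_id": order.get("id"),
                "customer": customer,
                "sku": item.get("sku"),
                "qty": int(item.findtext("qty", default="0")),
            })
    return rows


# Hypothetical nested sample, only for illustration.
NESTED = """<orders>
  <order id="1">
    <customer>Ann</customer>
    <items>
      <item sku="A1"><qty>2</qty></item>
      <item sku="B7"><qty>1</qty></item>
    </items>
  </order>
  <order id="2">
    <customer>Raj</customer>
    <items><item sku="C3"><qty>5</qty></item></items>
  </order>
</orders>"""

flat = flatten_orders(NESTED)
print(flat)
```

The flattening logic is tied to this one document shape, which is the practical cost of complex XML: each new structure needs its own mapping to rows.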