Support Questions

Find answers, ask questions, and share your expertise

XML processing using Spark


How can I process an XML file in Spark without using the Databricks spark-xml package? Is there a standard approach that can be used in real-world production projects?



@Shu Can you help me with this? I need to run a POC on my system.

Super Guru
@Satya G

AFAIK, yes: using the Databricks spark-xml package, we can parse the XML file and create a DataFrame on top of the XML data.

Once we have the DataFrame, we can analyze the data using the DataFrame API functions.

Refer to this and this link for more details on the usage and source code of the spark-xml package.


I want to parse them using PySpark without using the Databricks package. Is there a way to do it? If so, please share some sample code.
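One common route without spark-xml is to read the file as plain text and parse each record with Python's standard-library `xml.etree.ElementTree` inside a `map`. A minimal sketch of the idea (the `<book>` schema, the `books.xml` path, and the `spark` session here are illustrative assumptions, not from this thread):

```python
import xml.etree.ElementTree as ET

def parse_book(xml_record):
    """Turn one <book> XML record into an (id, title, price) tuple."""
    root = ET.fromstring(xml_record)
    return (root.get("id"),
            root.findtext("title"),
            float(root.findtext("price")))

# Plain-Python check on a single record:
sample = '<book id="b1"><title>Learning Spark</title><price>9.99</price></book>'
print(parse_book(sample))  # ('b1', 'Learning Spark', 9.99)

# With a SparkSession named `spark`, the same function can be distributed
# over a one-record-per-line XML file and turned into a DataFrame:
#   rdd = spark.sparkContext.textFile("books.xml")
#   df = rdd.map(parse_book).toDF(["id", "title", "price"])
#   df.show()
```

This only works cleanly when each line of the input holds one complete XML record; multi-line records need `wholeTextFiles` or a custom input format instead.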

Thank you.

New Contributor

Spark is great for XML processing. It is based on a massively parallel distributed compute paradigm. I think you can find some useful info in these examples:

Also, check on the XSD/XML complexity. Finally, you can view this thread to find out how to do it without the Databricks package.

If you would like to use NiFi instead, you can try this Groovy script.

New Contributor


For XML processing on Apache Spark you can use the spark-xml library.

For Apache Spark 3.0, use version spark-xml_2.12-0.10.0.jar.

For Apache Spark 2.4, use version spark-xml_2.11-0.6.0.jar.
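For reference, the library is usually attached at launch time with `--packages` rather than by copying the jar manually. A sketch of the invocations (the Maven coordinates are inferred from the jar names above; verify them against the version you actually use):

```shell
# Spark 3.0 (Scala 2.12 build of spark-xml)
spark-shell --packages com.databricks:spark-xml_2.12:0.10.0

# Spark 2.4 (Scala 2.11 build)
pyspark --packages com.databricks:spark-xml_2.11:0.6.0
```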


New Contributor

The spark-xml package is a good option too. With all of these options, however, you are limited to processing simple XML that can be interpreted as a dataset of rows and columns. Once the XML becomes even slightly more complex, those options are no longer useful.
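One way to handle such nested XML without either package is to flatten it yourself before handing rows to Spark, for example with the standard-library parser inside a `flatMap`. A sketch under an assumed `<order>/<items>/<item>` layout (the schema and field names are hypothetical):

```python
import xml.etree.ElementTree as ET

doc = """
<order id="o1">
  <customer>Ana</customer>
  <items>
    <item sku="a" qty="2"/>
    <item sku="b" qty="1"/>
  </items>
</order>
"""

def flatten_order(xml_str):
    """Emit one (order_id, customer, sku, qty) row per nested <item>."""
    root = ET.fromstring(xml_str)
    customer = root.findtext("customer")
    return [(root.get("id"), customer, item.get("sku"), int(item.get("qty")))
            for item in root.find("items")]

rows = flatten_order(doc)
print(rows)  # [('o1', 'Ana', 'a', 2), ('o1', 'Ana', 'b', 1)]

# With Spark, the same function would be applied per record:
#   df = rdd.flatMap(flatten_order).toDF(["order_id", "customer", "sku", "qty"])
```

The flattening step decides the row/column shape up front, which is exactly the part that spark-xml cannot infer for you once the structure stops being tabular.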