XML processing using Spark

Explorer

How can I process an XML file using Spark without the Databricks spark-xml package? Is there a standard approach that can be used in real, live projects?

7 REPLIES

Re: XML processing using Spark

Explorer

@Shu Can you help me with this? I need to run a PoC on my system.


Re: XML processing using Spark

Super Guru
@Satya G

AFAIK, yes: with the Databricks spark-xml package we can parse the XML file and create a DataFrame on top of the XML data.

Once the DataFrame is created, we can analyze the data with the DataFrame API functions.

Refer to this and this link for more details on the usage and source code of the spark-xml package.


Re: XML processing using Spark

Explorer

I want to parse the files using PySpark without using the Databricks package. Is there a way to do it? If yes, please give me some sample code.

Thank you.


Re: XML processing using Spark

New Contributor

Spark is great for XML processing, since it is based on a massively parallel, distributed compute paradigm. I think you can find some useful info in these examples:

https://stackoverflow.com/questions/33078221/xml-processing-in-spark

https://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html

Also, check https://anonymous-essay.com/ on XSD/XML complexity. Finally, you can view this thread to find out how to do it without the Databricks package.
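For reference, here is a minimal sketch of one way to do this in PySpark without spark-xml: read each file whole and parse it with Python's built-in `xml.etree.ElementTree`. The `book`, `title`, and `price` tags, the `id` attribute, and the function names are hypothetical and only illustrate the pattern; they do not come from the linked threads.

```python
import xml.etree.ElementTree as ET


def parse_books(xml_text):
    """Parse one XML document into a list of (id, title, price) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for book in root.iter("book"):  # "book" is a hypothetical row tag
        rows.append((
            book.get("id"),                              # attribute value
            book.findtext("title"),                      # child element text
            float(book.findtext("price", default="0")),  # numeric child
        ))
    return rows


def xml_to_dataframe(spark, path):
    """Build a DataFrame from XML files using only core Spark APIs."""
    # wholeTextFiles yields (file_path, file_content) pairs, one per file,
    # so every file must fit in the memory of a single executor.
    files = spark.sparkContext.wholeTextFiles(path)
    rows = files.flatMap(lambda pair: parse_books(pair[1]))
    return spark.createDataFrame(rows, ["id", "title", "price"])
```

With an existing `SparkSession` named `spark`, calling `xml_to_dataframe(spark, "data/books/*.xml")` (the path is a placeholder) would return a regular DataFrame. This suits many small or medium files; a single very large XML file would need a splittable input format instead.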


Re: XML processing using Spark

If you would prefer to use NiFi instead, you can try this Groovy script:

https://github.com/maxbback/nifi-xml


Re: XML processing using Spark

New Contributor

Hello,

For XML processing on Apache Spark you can use the spark-xml library.

For Apache Spark 3.0, use version spark-xml_2.12-0.10.0.jar

For Apache Spark 2.4, use version spark-xml_2.11-0.6.0.jar

Regards.
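A rough sketch of how those jars might be used from PySpark follows. The helper names and the `row_tag` argument are hypothetical; the Maven coordinates simply restate the two jar versions listed above, and the `rowTag` option and `com.databricks.spark.xml` format name are the ones documented by spark-xml.

```python
# Maven coordinates matching the jar versions mentioned in this reply.
SPARK_XML_COORDINATES = {
    "3.0": "com.databricks:spark-xml_2.12:0.10.0",
    "2.4": "com.databricks:spark-xml_2.11:0.6.0",
}


def spark_xml_coordinate(spark_version):
    """Return the spark-xml coordinate for a Spark version such as '3.0.1'."""
    major_minor = ".".join(spark_version.split(".")[:2])
    return SPARK_XML_COORDINATES[major_minor]


def read_xml(spark, path, row_tag):
    """Load XML into a DataFrame, treating each <row_tag> element as a row."""
    # Requires the spark-xml jar on the classpath of the given session.
    return (
        spark.read.format("com.databricks.spark.xml")
        .option("rowTag", row_tag)
        .load(path)
    )
```

The coordinate can be passed to `pyspark --packages` or `spark-submit --packages` so that Spark downloads the jar, after which `read_xml(spark, "books.xml", "book")` (placeholder path and tag) would return a DataFrame.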


Re: XML processing using Spark

New Contributor

The spark-xml package is a good option too. With all of these options, however, you are limited to processing simple XML files that can be interpreted as a dataset with rows and columns. If the XML is even a little more complex, those options won’t be useful.
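To illustrate what "a little complex" can look like, here is a small sketch that flattens nested, repeated elements by hand into one row per inner element. The `order`, `customer`, `items`, and `item` tags and the function name are hypothetical, and this is only one possible workaround, not a general solution for arbitrary schemas.

```python
import xml.etree.ElementTree as ET


def flatten_orders(xml_text):
    """Return one (order_id, customer, sku, qty) tuple per nested <item>."""
    root = ET.fromstring(xml_text)
    rows = []
    for order in root.iter("order"):
        customer = order.findtext("customer/name")  # nested single value
        for item in order.findall("items/item"):    # nested repeated values
            rows.append((
                order.get("id"),
                customer,
                item.get("sku"),
                int(item.findtext("qty", default="0")),
            ))
    return rows
```

A function like this can be applied per file or per record inside `flatMap` on an RDD, and the resulting tuples passed to `createDataFrame`. Deeper or irregular nesting needs more of this hand-written logic, which is the limitation described above.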
