XML processing using Spark

New Contributor

How can I process an XML file using Spark without the databricks spark-xml package? Is there a standard approach that can be used in real, live projects?

5 REPLIES

Re: XML processing using Spark

New Contributor

@Shu Can you help me with this? I need to run a POC on my system.

Re: XML processing using Spark

Super Guru
@Satya G

AFAIK, yes: by using the databricks spark-xml package, we can parse the XML file and create a DataFrame on top of the XML data.

Once the DataFrame is created, we can analyze the data with the DataFrame API functions.

Refer to this and this link for more details regarding the usage and source code of the Spark XML package.
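
For reference, a minimal PySpark sketch of that approach is below. It assumes the spark-xml package is already on the classpath; the helper name read_xml, the file name and the row tag are hypothetical and only there for illustration.

```python
def read_xml(spark, path, row_tag):
    """Load XML into a DataFrame through the spark-xml data source.

    Assumes the spark-xml package is available to the Spark session.
    `row_tag` is the name of the XML element that represents one row.
    """
    return (
        spark.read.format("com.databricks.spark.xml")
        .option("rowTag", row_tag)  # element that marks one record
        .load(path)
    )


# Hypothetical usage, given an existing SparkSession named `spark`:
#   df = read_xml(spark, "books.xml", "book")
#   df.printSchema()
```

The package is usually supplied when the job is launched, for example with the --packages option of spark-submit or pyspark, using the spark-xml coordinates that match your Spark and Scala versions.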

Re: XML processing using Spark

New Contributor

I want to parse them using PySpark without using the databricks package. Is there a way to do it? If yes, please give me some sample code.

Thank you.

Re: XML processing using Spark

New Contributor

Spark is great for XML processing. It is based on a massively parallel distributed compute paradigm. I think you can find some useful info in these examples:

https://stackoverflow.com/questions/33078221/xml-processing-in-spark

https://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html

Also, check https://anonymous-essay.com/ on XSD/XML complexity. And finally, you can view this thread to find out how to do it without the databricks package.
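
One way to do it without the package, sketched here under some assumptions: read each file whole with wholeTextFiles and parse it with Python's built-in xml.etree.ElementTree on the executors. This is not necessarily what the linked threads do. The function names, the "book" row tag and the flat record layout are hypothetical, and it only suits files small enough to fit in memory one at a time.

```python
import xml.etree.ElementTree as ET


def parse_records(xml_text, row_tag="book"):
    """Turn one XML document (given as a string) into a list of flat dicts."""
    root = ET.fromstring(xml_text)
    rows = []
    for elem in root.iter(row_tag):
        row = dict(elem.attrib)  # attributes become columns
        for child in elem:
            # direct child elements become columns; nested structure is ignored
            row[child.tag] = (child.text or "").strip()
        rows.append(row)
    return rows


def xml_to_dataframe(spark, path, row_tag="book"):
    """Read whole XML files and parse each one in parallel.

    Assumes every record has the same fields, so that Spark can infer
    a schema from the resulting rows.
    """
    from pyspark.sql import Row  # lazy import: parse_records works without Spark

    files = spark.sparkContext.wholeTextFiles(path)  # (file name, content) pairs
    records = files.flatMap(lambda kv: parse_records(kv[1], row_tag))
    return spark.createDataFrame(records.map(lambda d: Row(**d)))
```

All values come out as strings, so columns would still need to be cast afterwards. A single very large XML file would not be split across executors this way, which is the case the spark-xml package is meant to handle.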

Re: XML processing using Spark

New Contributor

If you would like to use NiFi instead, you can try this Groovy script:

https://github.com/maxbback/nifi-xml