Created on 11-22-202311:47 PM - edited on 11-24-202312:59 AM by VidyaSargur
Objective
This article provides a quickstart for the SparkXML package in Cloudera Data Engineering. You can run the following commands to parse XML files with Spark in Cloudera Data Engineering (CDE).
CDE is the only cloud-native service purpose-built for enterprise data engineering teams. Building on Apache Spark, Data Engineering is an all-inclusive data engineering toolset that enables orchestration automation with Apache Airflow, advanced pipeline monitoring, visual troubleshooting, and comprehensive management tools to streamline ETL processes across enterprise analytics teams.
The CDE is fully integrated with the Cloudera Data Platform (CDP), enabling end-to-end visibility and security with SDX as well as seamless integrations with CDP services such as Data Warehouse and Machine Learning. Data Engineering on CDP powers consistent, repeatable, and automated data engineering workflows on a hybrid cloud platform anywhere
Spark-XML is a library for parsing and querying XML data with Apache Spark, for Spark SQL and DataFrames. Spark Data Engineers use the package in CDE to parse XML files at scale. In the rest of this tutorial, we will run a few commands to demonstrate basic functionality of this package in CDE.
A working installation of git in your local machine. Please clone this git repository and keep in mind all commands assume they are run in the project's main directory.
No code edits required but familiarity with Python, Spark and XML is recommended.