<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How to automatically convert huge and complex XML files to flat tables structure? in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372605#M241305</link>
    <description>&lt;P&gt;Check whether the below helps for your use case.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://github.com/databricks/spark-xml" target="_blank" rel="noopener"&gt;https://github.com/databricks/spark-xml&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;scala&amp;gt; import com.databricks.spark.xml.util.XSDToSchema
import com.databricks.spark.xml.util.XSDToSchema

scala&amp;gt; import java.nio.file.Paths
import java.nio.file.Paths

scala&amp;gt; val schema = XSDToSchema.read(Paths.get("/tmp/DRAFT1auth.099.001.04_1.3.0.xsd"))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(Document,StructType(StructField(ScrtstnNonAsstBckdComrclPprUndrlygXpsrRpt,StructType(StructField(NewCrrctn,StructType(StructField(ScrtstnRpt,StructType(StructField(ScrtstnIdr,StringType,false), StructField(CutOffDt,StringType,false), StructField(UndrlygXpsrRcrd,ArrayType(StructType(StructField(UndrlygXpsrId,StructType(StructField(NewUndrlygXpsrIdr,StringType,false), StructField(OrgnlUndrlygXpsrIdr,StringType,false), StructField(NewOblgrIdr,StringType,false), StructField(OrgnlOblgrIdr,StringType,false)),false), StructField(UndrlygXpsrData,StructType(StructField(ResdtlRealEsttLn,StructType(StructField(PrfrmgLn,StructType(StructField(UndrlygXpsrCmonData,StructType(StructField(ActvtyDtDtls,StructType(StructField(PoolAddt...


scala&amp;gt; import com.databricks.spark.xml._
import com.databricks.spark.xml._

scala&amp;gt; val df=spark.read.schema(schema).xml("/tmp/DRAFT1auth.099.001.04_non-ABCP_Underlying_Exposure_Report.xml")
23/06/14 13:53:19 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
df: org.apache.spark.sql.DataFrame = [Document: struct&amp;lt;ScrtstnNonAsstBckdComrclPprUndrlygXpsrRpt: struct&amp;lt;NewCrrctn: struct&amp;lt;ScrtstnRpt: struct&amp;lt;ScrtstnIdr: string, CutOffDt: string ... 1 more field&amp;gt;&amp;gt;, Cxl: struct&amp;lt;ScrtstnCxl: array&amp;lt;string&amp;gt;, UndrlygXpsrRptCxl: array&amp;lt;struct&amp;lt;ScrtstnIdr:string,CutOffDt:string&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;]&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;BR /&gt;The spark-shell command used:&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;spark-shell --jars /tmp/spark-xml_2.11-0.12.0.jar,/tmp/xmlschema-core-2.2.1.jar  --files "/tmp/DRAFT1auth.099.001.04_1.3.0.xsd"&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;FYI -&amp;nbsp;this example uses the open-source Databricks spark-xml library. I would request that you validate the data and store it in Parquet or ORC format, since Spark/Hive give better performance with Parquet/ORC formats than with XML.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 14 Jun 2023 14:05:24 GMT</pubDate>
    <dc:creator>ggangadharan</dc:creator>
    <dc:date>2023-06-14T14:05:24Z</dc:date>
    <item>
      <title>How to automatically convert huge and complex XML files to flat tables structure?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/371595#M241018</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;We have huge and complex XML files. For example: 15-20 levels in the XML tree structure, approximately 180 basic types and 200 complex types, and 1-to-many relations between nodes in the XML tree structure.&lt;/P&gt;&lt;P&gt;As the output we want to have tables in Hive or Impala and to use SQL to query these tables.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Could you please advise how to do that in the most effective way?&lt;/P&gt;&lt;P&gt;Effective - that is, reducing manual coding work.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Best regards&lt;/P&gt;</description>
      <pubDate>Tue, 21 Apr 2026 07:08:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/371595#M241018</guid>
      <dc:creator>Faflusniak</dc:creator>
      <dc:date>2026-04-21T07:08:42Z</dc:date>
    </item>
    <item>
      <title>Re: How to automatically convert huge and complex XML files to flat tables structure?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372545#M241271</link>
      <description>&lt;P&gt;Please share a sample data file with a minimum of 2 records, so that we can understand the structure.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jun 2023 07:16:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372545#M241271</guid>
      <dc:creator>ggangadharan</dc:creator>
      <dc:date>2023-06-13T07:16:17Z</dc:date>
    </item>
    <item>
      <title>Re: How to automatically convert huge and complex XML files to flat tables structure?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372559#M241277</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Thank you for the response.&lt;/P&gt;&lt;P&gt;Documentation and samples are available to download (XSD, XML sample, documentation in XLS):&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.esma.europa.eu/sites/default/files/library/disclosure_templates_1.3.1.zip" target="_blank"&gt;https://www.esma.europa.eu/sites/default/files/library/disclosure_templates_1.3.1.zip&lt;/A&gt;&lt;/P&gt;&lt;P&gt;There are 4 kinds of files, and the most problematic for us is "099": DRAFT1auth.099.001.04_1.3.0.xsd /&amp;nbsp;&amp;nbsp;DRAFT1auth.099.001.04_non-ABCP Underlying Exposure Report.xml&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Unpacked, the biggest XML file is up to 4.7 GB, but the main problem is not the size but the complex, nested structure.&lt;/P&gt;&lt;P&gt;Have you ever struggled with such complex and huge XMLs?&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jun 2023 13:36:31 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372559#M241277</guid>
      <dc:creator>Faflusniak</dc:creator>
      <dc:date>2023-06-13T13:36:31Z</dc:date>
    </item>
    <item>
      <title>Re: How to automatically convert huge and complex XML files to flat tables structure?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372605#M241305</link>
      <description>&lt;P&gt;Check whether the below helps for your use case.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://github.com/databricks/spark-xml" target="_blank" rel="noopener"&gt;https://github.com/databricks/spark-xml&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;scala&amp;gt; import com.databricks.spark.xml.util.XSDToSchema
import com.databricks.spark.xml.util.XSDToSchema

scala&amp;gt; import java.nio.file.Paths
import java.nio.file.Paths

scala&amp;gt; val schema = XSDToSchema.read(Paths.get("/tmp/DRAFT1auth.099.001.04_1.3.0.xsd"))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(Document,StructType(StructField(ScrtstnNonAsstBckdComrclPprUndrlygXpsrRpt,StructType(StructField(NewCrrctn,StructType(StructField(ScrtstnRpt,StructType(StructField(ScrtstnIdr,StringType,false), StructField(CutOffDt,StringType,false), StructField(UndrlygXpsrRcrd,ArrayType(StructType(StructField(UndrlygXpsrId,StructType(StructField(NewUndrlygXpsrIdr,StringType,false), StructField(OrgnlUndrlygXpsrIdr,StringType,false), StructField(NewOblgrIdr,StringType,false), StructField(OrgnlOblgrIdr,StringType,false)),false), StructField(UndrlygXpsrData,StructType(StructField(ResdtlRealEsttLn,StructType(StructField(PrfrmgLn,StructType(StructField(UndrlygXpsrCmonData,StructType(StructField(ActvtyDtDtls,StructType(StructField(PoolAddt...


scala&amp;gt; import com.databricks.spark.xml._
import com.databricks.spark.xml._

scala&amp;gt; val df=spark.read.schema(schema).xml("/tmp/DRAFT1auth.099.001.04_non-ABCP_Underlying_Exposure_Report.xml")
23/06/14 13:53:19 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
df: org.apache.spark.sql.DataFrame = [Document: struct&amp;lt;ScrtstnNonAsstBckdComrclPprUndrlygXpsrRpt: struct&amp;lt;NewCrrctn: struct&amp;lt;ScrtstnRpt: struct&amp;lt;ScrtstnIdr: string, CutOffDt: string ... 1 more field&amp;gt;&amp;gt;, Cxl: struct&amp;lt;ScrtstnCxl: array&amp;lt;string&amp;gt;, UndrlygXpsrRptCxl: array&amp;lt;struct&amp;lt;ScrtstnIdr:string,CutOffDt:string&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;]&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;BR /&gt;The spark-shell command used:&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;spark-shell --jars /tmp/spark-xml_2.11-0.12.0.jar,/tmp/xmlschema-core-2.2.1.jar  --files "/tmp/DRAFT1auth.099.001.04_1.3.0.xsd"&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;FYI -&amp;nbsp;this example uses the open-source Databricks spark-xml library. I would request that you validate the data and store it in Parquet or ORC format, since Spark/Hive give better performance with Parquet/ORC formats than with XML.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 14 Jun 2023 14:05:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372605#M241305</guid>
      <dc:creator>ggangadharan</dc:creator>
      <dc:date>2023-06-14T14:05:24Z</dc:date>
    </item>
    <item>
      <title>Re: How to automatically convert huge and complex XML files to flat tables structure?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372684#M241346</link>
      <description>&lt;P&gt;Thank you for your advice.&lt;/P&gt;&lt;P&gt;We will investigate the proposed solution with&amp;nbsp;&lt;A href="https://github.com/databricks/spark-xml" target="_blank" rel="noopener nofollow noreferrer"&gt;spark-xml&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;Best regards&lt;/P&gt;</description>
      <pubDate>Thu, 15 Jun 2023 11:26:02 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372684#M241346</guid>
      <dc:creator>Faflusniak</dc:creator>
      <dc:date>2023-06-15T11:26:02Z</dc:date>
    </item>
  </channel>
</rss>