Support Questions

Faflusniak · ‎05-26-2023

Hi,

We have huge and complex XML files. For example: 15-20 levels in XML tree structure, approximately 180 basic types and 200 complex types, 1 to many relations between nodes in XML tree structure.

As the output we want to have tables in Hive or Impala and to use SQL to query this tables.

Could you please advise how to that in the most effective way?

Effective - that is reducing manual coding works.

Best regards

ggangadharan · ‎06-13-2023

Share sample Data file with minimum of 2 records , to understand the structure.

Faflusniak · ‎06-13-2023

Hi,

Thank you for response.

Documentation and samples is available to download (XSD, XML sample, documentation in XLS):

https://www.esma.europa.eu/sites/default/files/library/disclosure_templates_1.3.1.zip

There are 4 kinds of files and most problematic for us is "099" DRAFT1auth.099.001.04_1.3.0.xsd / DRAFT1auth.099.001.04_non-ABCP Underlying Exposure Report.xml

Unpacked and biggest XML file is up to 4,7 GB but most problematic is not size but complex, nested structure.

Have you ever struggled with such complex and huge XMLs?

ggangadharan · ‎06-14-2023

Check the bellow helps for your usecase.

https://github.com/databricks/spark-xml

scala> import com.databricks.spark.xml.util.XSDToSchema
import com.databricks.spark.xml.util.XSDToSchema

scala> import java.nio.file.Paths
import java.nio.file.Paths

scala> val schema = XSDToSchema.read(Paths.get("/tmp/DRAFT1auth.099.001.04_1.3.0.xsd"))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(Document,StructType(StructField(ScrtstnNonAsstBckdComrclPprUndrlygXpsrRpt,StructType(StructField(NewCrrctn,StructType(StructField(ScrtstnRpt,StructType(StructField(ScrtstnIdr,StringType,false), StructField(CutOffDt,StringType,false), StructField(UndrlygXpsrRcrd,ArrayType(StructType(StructField(UndrlygXpsrId,StructType(StructField(NewUndrlygXpsrIdr,StringType,false), StructField(OrgnlUndrlygXpsrIdr,StringType,false), StructField(NewOblgrIdr,StringType,false), StructField(OrgnlOblgrIdr,StringType,false)),false), StructField(UndrlygXpsrData,StructType(StructField(ResdtlRealEsttLn,StructType(StructField(PrfrmgLn,StructType(StructField(UndrlygXpsrCmonData,StructType(StructField(ActvtyDtDtls,StructType(StructField(PoolAddt...


scala> import com.databricks.spark.xml._
import com.databricks.spark.xml._

scala> val df=spark.read.schema(schema).xml("/tmp/DRAFT1auth.099.001.04_non-ABCP_Underlying_Exposure_Report.xml")
23/06/14 13:53:19 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
df: org.apache.spark.sql.DataFrame = [Document: struct<ScrtstnNonAsstBckdComrclPprUndrlygXpsrRpt: struct<NewCrrctn: struct<ScrtstnRpt: struct<ScrtstnIdr: string, CutOffDt: string ... 1 more field>>, Cxl: struct<ScrtstnCxl: array<string>, UndrlygXpsrRptCxl: array<struct<ScrtstnIdr:string,CutOffDt:string>>>>>]

spark-shell command used

spark-shell --jars /tmp/spark-xml_2.11-0.12.0.jar,/tmp/xmlschema-core-2.2.1.jar  --files "/tmp/DRAFT1auth.099.001.04_1.3.0.xsd"

FYI - In these example , used opensource databrics sparxml libraries. Would request you to validate the data and store the data in parquet or orc format. Since Spark/hive gives better performance with parquet/orc formats over xml.

Faflusniak · ‎06-15-2023

Thank you for your advise.

We will investigate proposed solution with spark-xml

Best regards

Cloudera Community

Support Questions

How to automatically convert huge and complex XML files to flat tables structure?