Created 05-26-2023 02:47 AM
Hi,
We have huge and complex XML files. For example: 15-20 levels in XML tree structure, approximately 180 basic types and 200 complex types, 1 to many relations between nodes in XML tree structure.
As the output we want to have tables in Hive or Impala and to use SQL to query this tables.
Could you please advise how to that in the most effective way?
Effective - that is reducing manual coding works.
Best regards
Created 06-13-2023 12:16 AM
Share sample Data file with minimum of 2 records , to understand the structure.
Created 06-13-2023 06:36 AM
Hi,
Thank you for response.
Documentation and samples is available to download (XSD, XML sample, documentation in XLS):
https://www.esma.europa.eu/sites/default/files/library/disclosure_templates_1.3.1.zip
There are 4 kinds of files and most problematic for us is "099" DRAFT1auth.099.001.04_1.3.0.xsd / DRAFT1auth.099.001.04_non-ABCP Underlying Exposure Report.xml
Unpacked and biggest XML file is up to 4,7 GB but most problematic is not size but complex, nested structure.
Have you ever struggled with such complex and huge XMLs?
Created on 06-14-2023 07:04 AM - edited 06-14-2023 07:05 AM
Check the bellow helps for your usecase.
https://github.com/databricks/spark-xml
scala> import com.databricks.spark.xml.util.XSDToSchema
import com.databricks.spark.xml.util.XSDToSchema
scala> import java.nio.file.Paths
import java.nio.file.Paths
scala> val schema = XSDToSchema.read(Paths.get("/tmp/DRAFT1auth.099.001.04_1.3.0.xsd"))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(Document,StructType(StructField(ScrtstnNonAsstBckdComrclPprUndrlygXpsrRpt,StructType(StructField(NewCrrctn,StructType(StructField(ScrtstnRpt,StructType(StructField(ScrtstnIdr,StringType,false), StructField(CutOffDt,StringType,false), StructField(UndrlygXpsrRcrd,ArrayType(StructType(StructField(UndrlygXpsrId,StructType(StructField(NewUndrlygXpsrIdr,StringType,false), StructField(OrgnlUndrlygXpsrIdr,StringType,false), StructField(NewOblgrIdr,StringType,false), StructField(OrgnlOblgrIdr,StringType,false)),false), StructField(UndrlygXpsrData,StructType(StructField(ResdtlRealEsttLn,StructType(StructField(PrfrmgLn,StructType(StructField(UndrlygXpsrCmonData,StructType(StructField(ActvtyDtDtls,StructType(StructField(PoolAddt...
scala> import com.databricks.spark.xml._
import com.databricks.spark.xml._
scala> val df=spark.read.schema(schema).xml("/tmp/DRAFT1auth.099.001.04_non-ABCP_Underlying_Exposure_Report.xml")
23/06/14 13:53:19 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
df: org.apache.spark.sql.DataFrame = [Document: struct<ScrtstnNonAsstBckdComrclPprUndrlygXpsrRpt: struct<NewCrrctn: struct<ScrtstnRpt: struct<ScrtstnIdr: string, CutOffDt: string ... 1 more field>>, Cxl: struct<ScrtstnCxl: array<string>, UndrlygXpsrRptCxl: array<struct<ScrtstnIdr:string,CutOffDt:string>>>>>]
spark-shell command used
spark-shell --jars /tmp/spark-xml_2.11-0.12.0.jar,/tmp/xmlschema-core-2.2.1.jar --files "/tmp/DRAFT1auth.099.001.04_1.3.0.xsd"
FYI - In these example , used opensource databrics sparxml libraries. Would request you to validate the data and store the data in parquet or orc format. Since Spark/hive gives better performance with parquet/orc formats over xml.
Created 06-15-2023 04:26 AM