Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How to automatically convert huge and complex XML files to flat tables structure?

avatar
Explorer

Hi,

We have huge and complex XML files. For example: 15-20 levels in XML tree structure, approximately 180 basic types and 200 complex types, 1 to many relations between nodes in XML tree structure.

As the output we want to have tables in Hive or Impala and to use SQL to query this tables.

 

Could you please advise how to that in the most effective way?

Effective - that is reducing manual coding works.

 

Best regards

4 REPLIES 4

avatar
Super Collaborator

Share sample Data file with minimum of 2 records , to understand the structure. 

avatar
Explorer

Hi,

Thank you for response.

Documentation and samples is available to download (XSD, XML sample, documentation in XLS):

https://www.esma.europa.eu/sites/default/files/library/disclosure_templates_1.3.1.zip

There are 4 kinds of files and most problematic for us is "099" DRAFT1auth.099.001.04_1.3.0.xsd /  DRAFT1auth.099.001.04_non-ABCP Underlying Exposure Report.xml

 

Unpacked and biggest XML file is up to 4,7 GB but most problematic is not size but complex, nested structure.

Have you ever struggled with such complex and huge XMLs?

avatar
Super Collaborator

Check the bellow helps for your usecase. 

https://github.com/databricks/spark-xml

 

scala> import com.databricks.spark.xml.util.XSDToSchema
import com.databricks.spark.xml.util.XSDToSchema

scala> import java.nio.file.Paths
import java.nio.file.Paths

scala> val schema = XSDToSchema.read(Paths.get("/tmp/DRAFT1auth.099.001.04_1.3.0.xsd"))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(Document,StructType(StructField(ScrtstnNonAsstBckdComrclPprUndrlygXpsrRpt,StructType(StructField(NewCrrctn,StructType(StructField(ScrtstnRpt,StructType(StructField(ScrtstnIdr,StringType,false), StructField(CutOffDt,StringType,false), StructField(UndrlygXpsrRcrd,ArrayType(StructType(StructField(UndrlygXpsrId,StructType(StructField(NewUndrlygXpsrIdr,StringType,false), StructField(OrgnlUndrlygXpsrIdr,StringType,false), StructField(NewOblgrIdr,StringType,false), StructField(OrgnlOblgrIdr,StringType,false)),false), StructField(UndrlygXpsrData,StructType(StructField(ResdtlRealEsttLn,StructType(StructField(PrfrmgLn,StructType(StructField(UndrlygXpsrCmonData,StructType(StructField(ActvtyDtDtls,StructType(StructField(PoolAddt...


scala> import com.databricks.spark.xml._
import com.databricks.spark.xml._

scala> val df=spark.read.schema(schema).xml("/tmp/DRAFT1auth.099.001.04_non-ABCP_Underlying_Exposure_Report.xml")
23/06/14 13:53:19 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
df: org.apache.spark.sql.DataFrame = [Document: struct<ScrtstnNonAsstBckdComrclPprUndrlygXpsrRpt: struct<NewCrrctn: struct<ScrtstnRpt: struct<ScrtstnIdr: string, CutOffDt: string ... 1 more field>>, Cxl: struct<ScrtstnCxl: array<string>, UndrlygXpsrRptCxl: array<struct<ScrtstnIdr:string,CutOffDt:string>>>>>]

 



spark-shell command used 

 

spark-shell --jars /tmp/spark-xml_2.11-0.12.0.jar,/tmp/xmlschema-core-2.2.1.jar  --files "/tmp/DRAFT1auth.099.001.04_1.3.0.xsd"

 


FYI -  In these example , used opensource databrics sparxml libraries. Would request you to validate the data and store the data in parquet or orc format. Since Spark/hive gives better performance with parquet/orc formats over xml. 

avatar
Explorer

Thank you for your advise.

We will investigate proposed solution with spark-xml

Best regards