<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How to automatically convert huge and complex XML files to flat tables structure? in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372605#M241305</link>
    <description>&lt;P&gt;Check whether the below helps for your use case.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://github.com/databricks/spark-xml" target="_blank" rel="noopener"&gt;https://github.com/databricks/spark-xml&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;scala&amp;gt; import com.databricks.spark.xml.util.XSDToSchema
import com.databricks.spark.xml.util.XSDToSchema

scala&amp;gt; import java.nio.file.Paths
import java.nio.file.Paths

scala&amp;gt; val schema = XSDToSchema.read(Paths.get("/tmp/DRAFT1auth.099.001.04_1.3.0.xsd"))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(Document,StructType(StructField(ScrtstnNonAsstBckdComrclPprUndrlygXpsrRpt,StructType(StructField(NewCrrctn,StructType(StructField(ScrtstnRpt,StructType(StructField(ScrtstnIdr,StringType,false), StructField(CutOffDt,StringType,false), StructField(UndrlygXpsrRcrd,ArrayType(StructType(StructField(UndrlygXpsrId,StructType(StructField(NewUndrlygXpsrIdr,StringType,false), StructField(OrgnlUndrlygXpsrIdr,StringType,false), StructField(NewOblgrIdr,StringType,false), StructField(OrgnlOblgrIdr,StringType,false)),false), StructField(UndrlygXpsrData,StructType(StructField(ResdtlRealEsttLn,StructType(StructField(PrfrmgLn,StructType(StructField(UndrlygXpsrCmonData,StructType(StructField(ActvtyDtDtls,StructType(StructField(PoolAddt...


scala&amp;gt; import com.databricks.spark.xml._
import com.databricks.spark.xml._

scala&amp;gt; val df=spark.read.schema(schema).xml("/tmp/DRAFT1auth.099.001.04_non-ABCP_Underlying_Exposure_Report.xml")
23/06/14 13:53:19 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
df: org.apache.spark.sql.DataFrame = [Document: struct&amp;lt;ScrtstnNonAsstBckdComrclPprUndrlygXpsrRpt: struct&amp;lt;NewCrrctn: struct&amp;lt;ScrtstnRpt: struct&amp;lt;ScrtstnIdr: string, CutOffDt: string ... 1 more field&amp;gt;&amp;gt;, Cxl: struct&amp;lt;ScrtstnCxl: array&amp;lt;string&amp;gt;, UndrlygXpsrRptCxl: array&amp;lt;struct&amp;lt;ScrtstnIdr:string,CutOffDt:string&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;]&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;BR /&gt;The spark-shell command used:&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;spark-shell --jars /tmp/spark-xml_2.11-0.12.0.jar,/tmp/xmlschema-core-2.2.1.jar  --files "/tmp/DRAFT1auth.099.001.04_1.3.0.xsd"&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;FYI -&amp;nbsp;this example uses the open-source Databricks spark-xml library. I would request that you validate the data and store it in Parquet or ORC format, since Spark/Hive give better performance with Parquet/ORC formats than with XML.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 14 Jun 2023 14:05:24 GMT</pubDate>
    <dc:creator>ggangadharan</dc:creator>
    <dc:date>2023-06-14T14:05:24Z</dc:date>
    <item>
      <title>How to automatically convert huge and complex XML files to flat tables structure?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/371595#M241018</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;We have huge and complex XML files. For example: 15-20 levels in the XML tree structure, approximately 180 basic types and 200 complex types, and 1-to-many relations between nodes in the XML tree structure.&lt;/P&gt;&lt;P&gt;As the output we want to have tables in Hive or Impala and to use SQL to query these tables.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Could you please advise how to do that in the most effective way?&lt;/P&gt;&lt;P&gt;Effective - that is, reducing manual coding work.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Best regards&lt;/P&gt;</description>
      <pubDate>Tue, 21 Apr 2026 07:08:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/371595#M241018</guid>
      <dc:creator>Faflusniak</dc:creator>
      <dc:date>2026-04-21T07:08:42Z</dc:date>
    </item>
    <item>
      <title>Re: How to automatically convert huge and complex XML files to flat tables structure?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372545#M241271</link>
      <description>&lt;P&gt;Please share a sample data file with a minimum of 2 records, so that we can understand the structure.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jun 2023 07:16:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372545#M241271</guid>
      <dc:creator>ggangadharan</dc:creator>
      <dc:date>2023-06-13T07:16:17Z</dc:date>
    </item>
    <item>
      <title>Re: How to automatically convert huge and complex XML files to flat tables structure?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372559#M241277</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;Thank you for the response.&lt;/P&gt;&lt;P&gt;Documentation and samples are available to download (XSD, XML sample, documentation in XLS):&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.esma.europa.eu/sites/default/files/library/disclosure_templates_1.3.1.zip" target="_blank"&gt;https://www.esma.europa.eu/sites/default/files/library/disclosure_templates_1.3.1.zip&lt;/A&gt;&lt;/P&gt;&lt;P&gt;There are 4 kinds of files, and the most problematic for us is "099": DRAFT1auth.099.001.04_1.3.0.xsd /&amp;nbsp;&amp;nbsp;DRAFT1auth.099.001.04_non-ABCP Underlying Exposure Report.xml&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Unpacked, the biggest XML file is up to 4.7 GB, but the main problem is not the size but the complex, nested structure.&lt;/P&gt;&lt;P&gt;Have you ever struggled with such complex and huge XMLs?&lt;/P&gt;</description>
      <pubDate>Tue, 13 Jun 2023 13:36:31 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372559#M241277</guid>
      <dc:creator>Faflusniak</dc:creator>
      <dc:date>2023-06-13T13:36:31Z</dc:date>
    </item>
    <item>
      <title>Re: How to automatically convert huge and complex XML files to flat tables structure?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372605#M241305</link>
      <description>&lt;P&gt;Check whether the below helps for your use case.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://github.com/databricks/spark-xml" target="_blank" rel="noopener"&gt;https://github.com/databricks/spark-xml&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;scala&amp;gt; import com.databricks.spark.xml.util.XSDToSchema
import com.databricks.spark.xml.util.XSDToSchema

scala&amp;gt; import java.nio.file.Paths
import java.nio.file.Paths

scala&amp;gt; val schema = XSDToSchema.read(Paths.get("/tmp/DRAFT1auth.099.001.04_1.3.0.xsd"))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(Document,StructType(StructField(ScrtstnNonAsstBckdComrclPprUndrlygXpsrRpt,StructType(StructField(NewCrrctn,StructType(StructField(ScrtstnRpt,StructType(StructField(ScrtstnIdr,StringType,false), StructField(CutOffDt,StringType,false), StructField(UndrlygXpsrRcrd,ArrayType(StructType(StructField(UndrlygXpsrId,StructType(StructField(NewUndrlygXpsrIdr,StringType,false), StructField(OrgnlUndrlygXpsrIdr,StringType,false), StructField(NewOblgrIdr,StringType,false), StructField(OrgnlOblgrIdr,StringType,false)),false), StructField(UndrlygXpsrData,StructType(StructField(ResdtlRealEsttLn,StructType(StructField(PrfrmgLn,StructType(StructField(UndrlygXpsrCmonData,StructType(StructField(ActvtyDtDtls,StructType(StructField(PoolAddt...


scala&amp;gt; import com.databricks.spark.xml._
import com.databricks.spark.xml._

scala&amp;gt; val df=spark.read.schema(schema).xml("/tmp/DRAFT1auth.099.001.04_non-ABCP_Underlying_Exposure_Report.xml")
23/06/14 13:53:19 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
df: org.apache.spark.sql.DataFrame = [Document: struct&amp;lt;ScrtstnNonAsstBckdComrclPprUndrlygXpsrRpt: struct&amp;lt;NewCrrctn: struct&amp;lt;ScrtstnRpt: struct&amp;lt;ScrtstnIdr: string, CutOffDt: string ... 1 more field&amp;gt;&amp;gt;, Cxl: struct&amp;lt;ScrtstnCxl: array&amp;lt;string&amp;gt;, UndrlygXpsrRptCxl: array&amp;lt;struct&amp;lt;ScrtstnIdr:string,CutOffDt:string&amp;gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;]&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;BR /&gt;The spark-shell command used:&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;spark-shell --jars /tmp/spark-xml_2.11-0.12.0.jar,/tmp/xmlschema-core-2.2.1.jar  --files "/tmp/DRAFT1auth.099.001.04_1.3.0.xsd"&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;FYI -&amp;nbsp;this example uses the open-source Databricks spark-xml library. I would request that you validate the data and store it in Parquet or ORC format, since Spark/Hive give better performance with Parquet/ORC formats than with XML.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 14 Jun 2023 14:05:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372605#M241305</guid>
      <dc:creator>ggangadharan</dc:creator>
      <dc:date>2023-06-14T14:05:24Z</dc:date>
    </item>
    <item>
      <title>Re: How to automatically convert huge and complex XML files to flat tables structure?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372684#M241346</link>
      <description>&lt;P&gt;Thank you for your advice.&lt;/P&gt;&lt;P&gt;We will investigate the proposed solution with&amp;nbsp;&lt;A href="https://github.com/databricks/spark-xml" target="_blank" rel="noopener nofollow noreferrer"&gt;spark-xml&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;Best regards&lt;/P&gt;</description>
      <pubDate>Thu, 15 Jun 2023 11:26:02 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-automatically-convert-huge-and-complex-XML-files-to/m-p/372684#M241346</guid>
      <dc:creator>Faflusniak</dc:creator>
      <dc:date>2023-06-15T11:26:02Z</dc:date>
    </item>
  </channel>
</rss>