<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: com.databricks.spark.xml parsing xml takes a very long time in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/com-databricks-spark-xml-parsing-xml-takes-a-very-long-time/m-p/130449#M93135</link>
    <description>&lt;P&gt;I gave it a quick try and created 50 XML files according to your structure, each 60 MB in size. I tested on 3 workers (7 cores and 26 GB each).&lt;/P&gt;&lt;P&gt;1) The tar.gz file was 450 MB and took 14 min with 1 (!) executor. Because it is a single tar.gz archive (gzip is not splittable), only one executor reads the file.&lt;/P&gt;&lt;P&gt;2) After putting all files as individual xml.gz files in one folder and starting the job again, 3 executors were involved and the job finished in under 5 min (roughly 14 min / 3, since no shuffle is required).&lt;/P&gt;&lt;P&gt;So I see two issues here:&lt;/P&gt;&lt;P&gt;1) Don't use tar.gz.&lt;/P&gt;&lt;P&gt;2) Your 50 min compared to my 14 min: how fast is your machine (cores, ...)?&lt;/P&gt;</description>
    <pubDate>Mon, 16 Jan 2017 20:27:39 GMT</pubDate>
    <dc:creator>bwalter1</dc:creator>
    <dc:date>2017-01-16T20:27:39Z</dc:date>
  </channel>
</rss>

