question Re: com.databricks.spark.xml parsing xml takes a very long time in Support Questions

https://community.cloudera.com/t5/Support-Questions/com-databricks-spark-xml-parsing-xml-takes-a-very-long-time/m-p/130449#M93135

I gave it a quick try and created 50 XML files following your structure, each 60 MB. I tested on 3 workers (7 cores and 26 GB each).

1) The tar.gz file was 450 MB and took 14 min with 1 (!) executor. Since it is a tar file, only one executor reads the file.
2) Putting all files as individually gzipped xml.gz files in one folder and starting the job again, I had 3 executors involved and the job finished in under 5 min (roughly 14 min / 3, since no shuffle is required).

So I see two issues here:

1) Don't use tar.gz.
2) 50 min compared to 14 min: how fast is your machine (cores, ...)?

Mon, 16 Jan 2017 20:27:39 GMT, bwalter1
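
The advice to avoid tar.gz could be applied with a small repacking step before the Spark job runs. The following is only a minimal sketch using the Python standard library, not something from the test described above: the function name `repack_tar_gz` and the paths in the usage note are hypothetical, and it assumes the archive fits on local disk and that member base names are unique.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def repack_tar_gz(archive_path, out_dir):
    """Write each .xml member of a tar.gz archive as its own .xml.gz file.

    Hypothetical helper: separate .xml.gz files can be read by separate
    executors, whereas a single tar.gz is read by only one.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            # Skip directories, links, and anything that is not an XML file.
            if not member.isfile() or not member.name.endswith(".xml"):
                continue
            # Assumption: base names are unique; nested folders are flattened.
            target = out / (Path(member.name).name + ".gz")
            with tar.extractfile(member) as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)  # stream, do not load into memory
            written.append(target)
    return sorted(written)
```

For example, `repack_tar_gz("data.tar.gz", "xml_gz")` would leave one `.xml.gz` per XML file in `xml_gz`, and that folder could then be uploaded and used as the job's input path. Each gzip file is still not splittable, so the parallelism is bounded by the number of files.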