<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question com.databricks.spark.xml parsing xml takes a very long time in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/com-databricks-spark-xml-parsing-xml-takes-a-very-long-time/m-p/130447#M93133</link>
    <description>&lt;P&gt;Hello All,&lt;/P&gt;&lt;P&gt;I need to import and parse XML files in Hadoop.&lt;/P&gt;&lt;P&gt;I have an old Pig 'REGEX_EXTRACT' script that works fine but takes some time to run, around 10-15 minutes.&lt;/P&gt;&lt;P&gt;In the last 6 months I have started to use Spark, with great success in improving run time, so I am trying to move the old Pig script to Spark using the Databricks XML parser mentioned in the following posts:
&lt;A href="http://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html" target="_blank"&gt;http://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html&lt;/A&gt;
&lt;A href="http://community.hortonworks.com/questions/66678/how-to-convert-spark-dataframes-into-xml-files.html" target="_blank"&gt;http://community.hortonworks.com/questions/66678/how-to-convert-spark-dataframes-into-xml-files.html&lt;/A&gt;
The version used is:
&lt;A href="http://github.com/databricks/spark-xml/tree/branch-0.3" target="_blank"&gt;http://github.com/databricks/spark-xml/tree/branch-0.3&lt;/A&gt; &lt;/P&gt;&lt;P&gt;The script I am trying to run is similar to: &lt;/P&gt;&lt;PRE&gt;import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
import org.apache.hadoop.fs._
import com.databricks.spark
import com.databricks.spark.xml
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}


    // create a HiveContext from the existing SparkContext
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // drop the target table if it already exists
    val dfremove = hiveContext.sql("DROP TABLE IF EXISTS FileExtract")
    // Create schema
    val xmlSchema = StructType(Array(
        StructField("Text1", StringType, nullable = false),
        StructField("Text2", StringType, nullable = false),
        StructField("Text3", StringType, nullable = false),
        StructField("Text4", StringType, nullable = false),
        StructField("Text5", StringType, nullable = false),
        StructField("Num1", IntegerType, nullable = false),
        StructField("Num2", IntegerType, nullable = false),
        StructField("Num3", IntegerType, nullable = false),
        StructField("Num4", IntegerType, nullable = false),
        StructField("Num5", IntegerType, nullable = false),
        StructField("Num6", IntegerType, nullable = false),
        StructField("AnotherText1", StringType, nullable = false),
        StructField("Num7", IntegerType, nullable = false),
        StructField("Num8", IntegerType, nullable = false),
        StructField("Num9", IntegerType, nullable = false), 
        StructField("AnotherText2", StringType, nullable = false)        
        ))
    // Read file
    val df = hiveContext.read.format("com.databricks.spark.xml").option("rootTag", "File").option("rowTag", "row").schema(xmlSchema).load("hdfs://MyCluster/RawXMLData/RecievedToday/File/Files.tar.gz")
    // select
    val selectedData = df.select("Text1",
                                 "Text2",
                                 "Text3",
                                 "Text4",
                                 "Text5",
                                 "Num1",
                                 "Num2",
                                 "Num3",
                                 "Num4",
                                 "Num5",
                                 "Num6",
                                 "AnotherText1",
                                 "Num7",
                                 "Num8",
                                 "Num9",
                                 "AnotherText2"
                                )
    selectedData.write.format("orc").mode(SaveMode.Overwrite).saveAsTable("FileExtract")    
&lt;/PRE&gt;&lt;P&gt;The xml file looks similar to:&lt;/P&gt;&lt;PRE&gt;&amp;lt;?xml version="1.0"?&amp;gt;
&amp;lt;File&amp;gt;
  &amp;lt;row&amp;gt;
    &amp;lt;Text1&amp;gt;something here&amp;lt;/Text1&amp;gt;
    &amp;lt;Text2&amp;gt;something here&amp;lt;/Text2&amp;gt;
    &amp;lt;Text3&amp;gt;something here&amp;lt;/Text3&amp;gt;
    &amp;lt;Text4&amp;gt;something here&amp;lt;/Text4&amp;gt;
    &amp;lt;Text5&amp;gt;something here&amp;lt;/Text5&amp;gt;
    &amp;lt;Num1&amp;gt;2&amp;lt;/Num1&amp;gt;
    &amp;lt;Num2&amp;gt;1&amp;lt;/Num2&amp;gt;
    &amp;lt;Num3&amp;gt;1&amp;lt;/Num3&amp;gt;
    &amp;lt;Num4&amp;gt;0&amp;lt;/Num4&amp;gt;
    &amp;lt;Num5&amp;gt;1&amp;lt;/Num5&amp;gt;
    &amp;lt;Num6&amp;gt;0&amp;lt;/Num6&amp;gt;
    &amp;lt;AnotherText1&amp;gt;something here&amp;lt;/AnotherText1&amp;gt;
    &amp;lt;Num7&amp;gt;2&amp;lt;/Num7&amp;gt;
    &amp;lt;Num8&amp;gt;0&amp;lt;/Num8&amp;gt;
    &amp;lt;Num9&amp;gt;0&amp;lt;/Num9&amp;gt;
    &amp;lt;AnotherText2&amp;gt;something here&amp;lt;/AnotherText2&amp;gt;
  &amp;lt;/row&amp;gt;
  &amp;lt;row&amp;gt;
    &amp;lt;Text1&amp;gt;something here&amp;lt;/Text1&amp;gt;
    &amp;lt;Text2&amp;gt;something else here&amp;lt;/Text2&amp;gt;
    &amp;lt;Text3&amp;gt;something new here&amp;lt;/Text3&amp;gt;
    &amp;lt;Text4&amp;gt;something here&amp;lt;/Text4&amp;gt;
    &amp;lt;Text5&amp;gt;something here&amp;lt;/Text5&amp;gt;
    &amp;lt;Num1&amp;gt;2&amp;lt;/Num1&amp;gt;
    &amp;lt;Num2&amp;gt;1&amp;lt;/Num2&amp;gt;
    &amp;lt;Num3&amp;gt;1&amp;lt;/Num3&amp;gt;
    &amp;lt;Num4&amp;gt;0&amp;lt;/Num4&amp;gt;
    &amp;lt;Num5&amp;gt;1&amp;lt;/Num5&amp;gt;
    &amp;lt;Num6&amp;gt;0&amp;lt;/Num6&amp;gt;
    &amp;lt;AnotherText1&amp;gt;something here&amp;lt;/AnotherText1&amp;gt;
    &amp;lt;Num7&amp;gt;2&amp;lt;/Num7&amp;gt;
    &amp;lt;Num8&amp;gt;0&amp;lt;/Num8&amp;gt;
    &amp;lt;Num9&amp;gt;0&amp;lt;/Num9&amp;gt;
    &amp;lt;AnotherText2&amp;gt;something here&amp;lt;/AnotherText2&amp;gt;
  &amp;lt;/row&amp;gt;
...
...
&amp;lt;/File&amp;gt;&lt;/PRE&gt;&lt;P&gt;Many XML files are archived together, hence the tar.gz file.&lt;/P&gt;&lt;P&gt;This runs; however, for a 400 MB file it takes 50 minutes to finish.&lt;/P&gt;&lt;P&gt;Does anyone have an idea why it is so slow, or how I may speed it up?
I am running on a 7-machine cluster with about 120 GB of YARN memory, on Hortonworks HDP-2.5.3.0 and Spark 1.6.2.&lt;/P&gt;&lt;P&gt;Many thanks in advance!&lt;/P&gt;</description>
    <pubDate>Mon, 16 Jan 2017 18:03:21 GMT</pubDate>
    <dc:creator>antin_leszczysz</dc:creator>
    <dc:date>2017-01-16T18:03:21Z</dc:date>
  </channel>
</rss>

