<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question com.databricks.spark.xml parsing xml takes a very long time in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/com-databricks-spark-xml-parsing-xml-takes-a-very-long-time/m-p/130447#M93133</link>
    <description>&lt;P&gt;Hello All,&lt;/P&gt;&lt;P&gt;I need to import and parse XML files in Hadoop.&lt;/P&gt;&lt;P&gt;I have an old Pig 'REGEX_EXTRACT' script that works fine but takes some time to run, around 10-15 minutes.&lt;/P&gt;&lt;P&gt;In the last 6 months I have started to use Spark, with great success in improving run time, so I am trying to move the old Pig script to Spark using the Databricks XML parser mentioned in the following posts:
&lt;A href="http://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html" target="_blank"&gt;http://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html&lt;/A&gt;
&lt;A href="http://community.hortonworks.com/questions/66678/how-to-convert-spark-dataframes-into-xml-files.html" target="_blank"&gt;http://community.hortonworks.com/questions/66678/how-to-convert-spark-dataframes-into-xml-files.html&lt;/A&gt;
The version used is:
&lt;A href="http://github.com/databricks/spark-xml/tree/branch-0.3" target="_blank"&gt;http://github.com/databricks/spark-xml/tree/branch-0.3&lt;/A&gt; &lt;/P&gt;&lt;P&gt;The script I am trying to run is similar to: &lt;/P&gt;&lt;PRE&gt;import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
import org.apache.hadoop.fs._
import com.databricks.spark
import com.databricks.spark.xml
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}


    // create a HiveContext from the existing SparkContext
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // drop the target table if it already exists
    val dfremove = hiveContext.sql("DROP TABLE IF EXISTS FileExtract")
    // Create schema
    val xmlSchema = StructType(Array(
        StructField("Text1", StringType, nullable = false),
        StructField("Text2", StringType, nullable = false),
        StructField("Text3", StringType, nullable = false),
        StructField("Text4", StringType, nullable = false),
        StructField("Text5", StringType, nullable = false),
        StructField("Num1", IntegerType, nullable = false),
        StructField("Num2", IntegerType, nullable = false),
        StructField("Num3", IntegerType, nullable = false),
        StructField("Num4", IntegerType, nullable = false),
        StructField("Num5", IntegerType, nullable = false),
        StructField("Num6", IntegerType, nullable = false),
        StructField("AnotherText1", StringType, nullable = false),
        StructField("Num7", IntegerType, nullable = false),
        StructField("Num8", IntegerType, nullable = false),
        StructField("Num9", IntegerType, nullable = false), 
        StructField("AnotherText2", StringType, nullable = false)        
        ))
    // Read file
    val df = hiveContext.read.format("com.databricks.spark.xml").option("rootTag", "File").option("rowTag", "row").schema(xmlSchema).load("hdfs://MyCluster/RawXMLData/RecievedToday/File/Files.tar.gz")
    // select
    val selectedData = df.select("Text1",
                                 "Text2",
                                 "Text3",
                                 "Text4",
                                 "Text5",
                                 "Num1",
                                 "Num2",
                                 "Num3",
                                 "Num4",
                                 "Num5",
                                 "Num6",
                                 "AnotherText1",
                                 "Num7",
                                 "Num8",
                                 "Num9",
                                 "AnotherText2"
                                )
    selectedData.write.format("orc").mode(SaveMode.Overwrite).saveAsTable("FileExtract")    
&lt;/PRE&gt;&lt;P&gt;The xml file looks similar to:&lt;/P&gt;&lt;PRE&gt;&amp;lt;?xml version="1.0"?&amp;gt;
&amp;lt;File&amp;gt;
  &amp;lt;row&amp;gt;
    &amp;lt;Text1&amp;gt;something here&amp;lt;/Text1&amp;gt;
    &amp;lt;Text2&amp;gt;something here&amp;lt;/Text2&amp;gt;
    &amp;lt;Text3&amp;gt;something here&amp;lt;/Text3&amp;gt;
    &amp;lt;Text4&amp;gt;something here&amp;lt;/Text4&amp;gt;
    &amp;lt;Text5&amp;gt;something here&amp;lt;/Text5&amp;gt;
    &amp;lt;Num1&amp;gt;2&amp;lt;/Num1&amp;gt;
    &amp;lt;Num2&amp;gt;1&amp;lt;/Num2&amp;gt;
    &amp;lt;Num3&amp;gt;1&amp;lt;/Num3&amp;gt;
    &amp;lt;Num4&amp;gt;0&amp;lt;/Num4&amp;gt;
    &amp;lt;Num5&amp;gt;1&amp;lt;/Num5&amp;gt;
    &amp;lt;Num6&amp;gt;0&amp;lt;/Num6&amp;gt;
    &amp;lt;AnotherText1&amp;gt;something here&amp;lt;/AnotherText1&amp;gt;
    &amp;lt;Num7&amp;gt;2&amp;lt;/Num7&amp;gt;
    &amp;lt;Num8&amp;gt;0&amp;lt;/Num8&amp;gt;
    &amp;lt;Num9&amp;gt;0&amp;lt;/Num9&amp;gt;
    &amp;lt;AnotherText2&amp;gt;something here&amp;lt;/AnotherText2&amp;gt;
  &amp;lt;/row&amp;gt;
  &amp;lt;row&amp;gt;
    &amp;lt;Text1&amp;gt;something here&amp;lt;/Text1&amp;gt;
    &amp;lt;Text2&amp;gt;something else here&amp;lt;/Text2&amp;gt;
    &amp;lt;Text3&amp;gt;something new here&amp;lt;/Text3&amp;gt;
    &amp;lt;Text4&amp;gt;something here&amp;lt;/Text4&amp;gt;
    &amp;lt;Text5&amp;gt;something here&amp;lt;/Text5&amp;gt;
    &amp;lt;Num1&amp;gt;2&amp;lt;/Num1&amp;gt;
    &amp;lt;Num2&amp;gt;1&amp;lt;/Num2&amp;gt;
    &amp;lt;Num3&amp;gt;1&amp;lt;/Num3&amp;gt;
    &amp;lt;Num4&amp;gt;0&amp;lt;/Num4&amp;gt;
    &amp;lt;Num5&amp;gt;1&amp;lt;/Num5&amp;gt;
    &amp;lt;Num6&amp;gt;0&amp;lt;/Num6&amp;gt;
    &amp;lt;AnotherText1&amp;gt;something here&amp;lt;/AnotherText1&amp;gt;
    &amp;lt;Num7&amp;gt;2&amp;lt;/Num7&amp;gt;
    &amp;lt;Num8&amp;gt;0&amp;lt;/Num8&amp;gt;
    &amp;lt;Num9&amp;gt;0&amp;lt;/Num9&amp;gt;
    &amp;lt;AnotherText2&amp;gt;something here&amp;lt;/AnotherText2&amp;gt;
  &amp;lt;/row&amp;gt;
...
...
&amp;lt;/File&amp;gt;&lt;/PRE&gt;&lt;P&gt;Many XML files are archived together, hence the tar.gz file.&lt;/P&gt;&lt;P&gt;This runs; however, for a 400 MB file it takes 50 minutes to finish.&lt;/P&gt;&lt;P&gt;Does anyone have an idea why it is so slow, or how I may speed it up?
I am running on a 7-machine cluster with about 120 GB of YARN memory, on Hortonworks HDP-2.5.3.0 and Spark 1.6.2.&lt;/P&gt;&lt;P&gt;Many thanks in advance!&lt;/P&gt;</description>
    <pubDate>Mon, 16 Jan 2017 18:03:21 GMT</pubDate>
    <dc:creator>antin_leszczysz</dc:creator>
    <dc:date>2017-01-16T18:03:21Z</dc:date>
  </channel>
</rss>

