Support Questions

Find answers, ask questions, and share your expertise

com.databricks.spark.xml parsing xml takes a very long time

Contributor

Hello All,

I need to import and parse XML files in Hadoop.

I have an old Pig 'REGEX_EXTRACT' script parser that works fine but takes some time to run, around 10-15 minutes.

In the last 6 months I have started to use Spark, with great success in improving run times. So I am trying to move the old Pig script to Spark using the Databricks XML parser, mentioned in the following posts:

http://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html
http://community.hortonworks.com/questions/66678/how-to-convert-spark-dataframes-into-xml-files.html

The version used is: http://github.com/databricks/spark-xml/tree/branch-0.3

The script I try to run is similar to:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import com.databricks.spark.xml._


    // drop table
    val dfremove = hiveContext.sql("DROP TABLE FileExtract")
    // Create schema
    val xmlSchema = StructType(Array(
        StructField("Text1", StringType, nullable = false),
        StructField("Text2", StringType, nullable = false),
        StructField("Text3", StringType, nullable = false),
        StructField("Text4", StringType ,nullable = false),
        StructField("Text5", StringType, nullable = false),
        StructField("Num1", IntegerType, nullable = false),
        StructField("Num2", IntegerType, nullable = false),
        StructField("Num3", IntegerType, nullable = false),
        StructField("Num4", IntegerType, nullable = false),
        StructField("Num5", IntegerType, nullable = false),
        StructField("Num6", IntegerType, nullable = false),
        StructField("AnotherText1", StringType, nullable = false),
        StructField("Num7", IntegerType, nullable = false),
        StructField("Num8", IntegerType, nullable = false),
        StructField("Num9", IntegerType, nullable = false), 
        StructField("AnotherText2", StringType, nullable = false)        
        ))
    // Read file
    val df = hiveContext.read
        .format("com.databricks.spark.xml")
        .option("rootTag", "File")
        .option("rowTag", "row")
        .schema(xmlSchema)
        .load("hdfs://MyCluster/RawXMLData/RecievedToday/File/Files.tar.gz")
    // select
    val selectedData = df.select("Text1",
                                 "Text2",
                                 "Text3",
                                 "Text4",
                                 "Text5",
                                 "Num1",
                                 "Num2",
                                 "Num3",
                                 "Num4",
                                 "Num5",
                                 "Num6",
                                 "AnotherText1",
                                 "Num7",
                                 "Num8",
                                 "Num9",
                                 "AnotherText2"
                                )
    selectedData.write.format("orc").mode(SaveMode.Overwrite).saveAsTable("FileExtract")    

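Note that the snippet above assumes a `hiveContext` is already in scope (as it is in `spark-shell` on HDP). Run as a standalone Spark 1.6 application, it would be created roughly like this (a sketch; the application name is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Assumption: Spark 1.6-era API, where HiveContext wraps a SparkContext.
// In spark-shell these are provided automatically as sc and sqlContext.
val conf = new SparkConf().setAppName("XmlExtract")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
```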
The xml file looks similar to:

<?xml version="1.0"?>
<File>
  <row>
    <Text1>something here</Text1>
    <Text2>something here</Text2>
    <Text3>something here</Text3>
    <Text4>something here</Text4>
    <Text5>something here</Text5>
    <Num1>2</Num1>
    <Num2>1</Num2>
    <Num3>1</Num3>
    <Num4>0</Num4>
    <Num5>1</Num5>
    <Num6>0</Num6>
    <AnotherText1>something here</AnotherText1>
    <Num7>2</Num7>
    <Num8>0</Num8>
    <Num9>0</Num9>
    <AnotherText2>something here</AnotherText2>
  </row>
  <row>
    <Text1>something here</Text1>
    <Text2>something else here</Text2>
    <Text3>something new here</Text3>
    <Text4>something here</Text4>
    <Text5>something here</Text5>
    <Num1>2</Num1>
    <Num2>1</Num2>
    <Num3>1</Num3>
    <Num4>0</Num4>
    <Num5>1</Num5>
    <Num6>0</Num6>
    <AnotherText1>something here</AnotherText1>
    <Num7>2</Num7>
    <Num8>0</Num8>
    <Num9>0</Num9>
    <AnotherText2>something here</AnotherText2>
  </row>
...
...
</File>

Many XML files are archived and compressed together, hence the tar.gz file.

This runs. However, for a 400MB file it takes 50 minutes to finish.

Does anyone have an idea why it is so slow, or how I might speed it up? I am running on a 7-machine cluster with about 120GB of YARN memory, with Hortonworks HDP-2.5.3.0 and Spark 1.6.2.

Many thanks in advance!

10 Replies

New Contributor

Hi Mark,

The com.databricks.spark.xml loader expects to find XML files at the path, not sequence files. How will spark-xml deal with a SequenceFileInputFormat instead of com.databricks.spark.xml.XmlInputFormat?
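For reference, spark-xml's record reader can also be driven directly through the RDD API rather than the DataFrame reader. A minimal sketch, assuming spark-xml 0.3.x is on the classpath, `sc` is an existing SparkContext, and the HDFS path (a placeholder here) holds plain, uncompressed .xml files:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import com.databricks.spark.xml.XmlInputFormat

// XmlInputFormat splits the input on a configured start/end tag pair,
// so each record handed to Spark is one complete <row>...</row> element.
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<row>")
conf.set(XmlInputFormat.END_TAG_KEY, "</row>")

val rows = sc.newAPIHadoopFile(
  "hdfs://MyCluster/RawXMLData/extracted",  // hypothetical path with plain .xml files
  classOf[XmlInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf
).map { case (_, text) => text.toString }  // copy out of the reused Text buffer
```

Each element of `rows` is then the raw XML string for one row, which can be parsed with an ordinary XML library or fed back into spark-xml.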