Created 01-16-2017 10:03 AM
Hello All,
I need to import and parse XML files in Hadoop.
I have an old Pig 'REGEX_EXTRACT' script parser that works fine but takes some time to run, around 10-15 minutes.
In the last 6 months I have started to use Spark, with great success in improving run times, so I am trying to move the old Pig script to Spark using the Databricks XML parser mentioned in the following posts:
http://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html
http://community.hortonworks.com/questions/66678/how-to-convert-spark-dataframes-into-xml-files.html
The version used is http://github.com/databricks/spark-xml/tree/branch-0.3
The script I am trying to run is similar to:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.hive.HiveContext
import org.apache.hadoop.fs._
import com.databricks.spark.xml._
// In spark-shell on HDP a Hive-enabled SQL context is already available;
// otherwise create one from the SparkContext
val hiveContext = new HiveContext(sc)
// Drop the old table before rewriting it
val dfremove = hiveContext.sql("DROP TABLE IF EXISTS FileExtract")
// Create schema
val xmlSchema = StructType(Array(
StructField("Text1", StringType, nullable = false),
StructField("Text2", StringType, nullable = false),
StructField("Text3", StringType, nullable = false),
StructField("Text4", StringType ,nullable = false),
StructField("Text5", StringType, nullable = false),
StructField("Num1", IntegerType, nullable = false),
StructField("Num2", IntegerType, nullable = false),
StructField("Num3", IntegerType, nullable = false),
StructField("Num4", IntegerType, nullable = false),
StructField("Num5", IntegerType, nullable = false),
StructField("Num6", IntegerType, nullable = false),
StructField("AnotherText1", StringType, nullable = false),
StructField("Num7", IntegerType, nullable = false),
StructField("Num8", IntegerType, nullable = false),
StructField("Num9", IntegerType, nullable = false),
StructField("AnotherText2", StringType, nullable = false)
))
// Read file
val df = hiveContext.read
  .format("com.databricks.spark.xml")
  .option("rootTag", "File")
  .option("rowTag", "row")
  .schema(xmlSchema)
  .load("hdfs://MyCluster/RawXMLData/RecievedToday/File/Files.tar.gz")
// select
val selectedData = df.select("Text1",
"Text2",
"Text3",
"Text4",
"Text5",
"Num1",
"Num2",
"Num3",
"Num4",
"Num5",
"Num6",
"AnotherText1",
"Num7",
"Num8",
"Num9",
"AnotherText2"
)
selectedData.write.format("orc").mode(SaveMode.Overwrite).saveAsTable("FileExtract")
The XML file looks similar to:
<?xml version="1.0"?>
<File>
  <row>
    <Text1>something here</Text1>
    <Text2>something here</Text2>
    <Text3>something here</Text3>
    <Text4>something here</Text4>
    <Text5>something here</Text5>
    <Num1>2</Num1>
    <Num2>1</Num2>
    <Num3>1</Num3>
    <Num4>0</Num4>
    <Num5>1</Num5>
    <Num6>0</Num6>
    <AnotherText1>something here</AnotherText1>
    <Num7>2</Num7>
    <Num8>0</Num8>
    <Num9>0</Num9>
    <AnotherText2>something here</AnotherText2>
  </row>
  <row>
    <Text1>something here</Text1>
    <Text2>something else here</Text2>
    <Text3>something new here</Text3>
    <Text4>something here</Text4>
    <Text5>something here</Text5>
    <Num1>2</Num1>
    <Num2>1</Num2>
    <Num3>1</Num3>
    <Num4>0</Num4>
    <Num5>1</Num5>
    <Num6>0</Num6>
    <AnotherText1>something here</AnotherText1>
    <Num7>2</Num7>
    <Num8>0</Num8>
    <Num9>0</Num9>
    <AnotherText2>something here</AnotherText2>
  </row>
  ...
  ...
</File>
Many XML files are zipped together, hence the tar.gz file.
This runs, but for a 400 MB file it takes 50 minutes to finish.
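A quick way to check how much parallelism the read actually gets is to look at the DataFrame's partition count (Spark 1.6 RDD API):
// One partition here would mean the whole file is handled by a single task
println(s"input partitions: ${df.rdd.partitions.length}")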
Does anyone have an idea why it is so slow, or how I might speed it up? I am running on a 7-machine cluster with about 120 GB of YARN memory, on Hortonworks HDP 2.5.3.0 with Spark 1.6.2.
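My suspicion is the compression: a .gz archive is not splittable, so Spark may be decompressing and parsing the whole 400 MB in a single task. As a sketch of a possible workaround (untested; it assumes org.apache.commons.compress is on the classpath, that each archive entry is one complete XML file, and that XmlReader.xmlRdd is available in this spark-xml version), the archive could be unpacked manually and the XML strings handed to spark-xml:
import java.io.ByteArrayOutputStream
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import com.databricks.spark.xml.XmlReader
// One record per archive entry. Decompression is still single-threaded
// (gzip is not splittable), but the XML parsing can be spread out afterwards.
val xmlStrings = sc.binaryFiles("hdfs://MyCluster/RawXMLData/RecievedToday/File/Files.tar.gz")
  .flatMap { case (_, portableStream) =>
    val tar = new TarArchiveInputStream(
      new GzipCompressorInputStream(portableStream.open()))
    Iterator.continually(tar.getNextTarEntry)
      .takeWhile(_ != null)
      .filter(!_.isDirectory)
      .map { _ =>
        // Read the current entry into a string
        val out = new ByteArrayOutputStream()
        val buf = new Array[Byte](4096)
        Iterator.continually(tar.read(buf))
          .takeWhile(_ != -1)
          .foreach(n => out.write(buf, 0, n))
        out.toString("UTF-8")
      }
  }
  .repartition(16) // spread the parse work over the cluster
// Parse the in-memory XML strings (xmlRdd is assumed to exist in this version)
val parsedDf = new XmlReader()
  .withRowTag("row")
  .xmlRdd(hiveContext, xmlStrings)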
Many thanks in advance!
Created 04-13-2017 04:17 PM
Hi Mark,
The com.databricks.spark.xml loader expects XML files at the given path, not sequence files. How will spark-xml deal with a SequenceFileInputFormat instead of com.databricks.spark.xml.XmlInputFormat?
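Something like the following is what I have in mind — a sketch only, assuming each value in the SequenceFile holds one complete XML document, that XmlReader.xmlRdd is available in this spark-xml version, and with a made-up path:
import com.databricks.spark.xml.XmlReader
// Load the SequenceFile ourselves and keep only the XML payload (the values)
val xmlDocs = sc.sequenceFile[String, String]("hdfs://MyCluster/SomeSequenceFile")
  .map { case (_, xmlDoc) => xmlDoc }
// Hand the in-memory XML strings to spark-xml instead of using XmlInputFormat
val df = new XmlReader()
  .withRowTag("row")
  .xmlRdd(sqlContext, xmlDocs)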