Created 01-16-2017 10:03 AM
Hello All,
I need to import and parse XML files in Hadoop.
I have an old Pig 'REGEX_EXTRACT' script parser that works fine but takes some time to run, around 10-15 minutes.
In the last 6 months I have started to use Spark, with great success in improving run time, so I am trying to move the old Pig script into Spark using the Databricks XML parser, mentioned in the following posts:
http://community.hortonworks.com/questions/71538/parsing-xml-in-spark-rdd.html
http://community.hortonworks.com/questions/66678/how-to-convert-spark-dataframes-into-xml-files.html
The version used is: http://github.com/databricks/spark-xml/tree/branch-0.3
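For reference, I pull the parser into the shell with roughly the following (the exact 0.3.x artifact coordinates are an assumption on my part; check what matches your build):

spark-shell --packages com.databricks:spark-xml_2.10:0.3.3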
The script I am trying to run is similar to:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
import org.apache.hadoop.fs._
import com.databricks.spark
import com.databricks.spark.xml
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// sc is provided by the spark-shell
val hiveContext = new HiveContext(sc)

// Drop the old table
val dfremove = hiveContext.sql("DROP TABLE IF EXISTS FileExtract")

// Create the schema
val xmlSchema = StructType(Array(
  StructField("Text1", StringType, nullable = false),
  StructField("Text2", StringType, nullable = false),
  StructField("Text3", StringType, nullable = false),
  StructField("Text4", StringType, nullable = false),
  StructField("Text5", StringType, nullable = false),
  StructField("Num1", IntegerType, nullable = false),
  StructField("Num2", IntegerType, nullable = false),
  StructField("Num3", IntegerType, nullable = false),
  StructField("Num4", IntegerType, nullable = false),
  StructField("Num5", IntegerType, nullable = false),
  StructField("Num6", IntegerType, nullable = false),
  StructField("AnotherText1", StringType, nullable = false),
  StructField("Num7", IntegerType, nullable = false),
  StructField("Num8", IntegerType, nullable = false),
  StructField("Num9", IntegerType, nullable = false),
  StructField("AnotherText2", StringType, nullable = false)
))

// Read the file
val df = hiveContext.read
  .format("com.databricks.spark.xml")
  .option("rootTag", "File")
  .option("rowTag", "row")
  .schema(xmlSchema)
  .load("hdfs://MyCluster/RawXMLData/RecievedToday/File/Files.tar.gz")

// Select the columns and save as an ORC table
val selectedData = df.select("Text1", "Text2", "Text3", "Text4", "Text5",
  "Num1", "Num2", "Num3", "Num4", "Num5", "Num6",
  "AnotherText1", "Num7", "Num8", "Num9", "AnotherText2")
selectedData.write.format("orc").mode(SaveMode.Overwrite).saveAsTable("FileExtract")
The XML file looks similar to:
<?xml version="1.0"?>
<File>
  <row>
    <Text1>something here</Text1>
    <Text2>something here</Text2>
    <Text3>something here</Text3>
    <Text4>something here</Text4>
    <Text5>something here</Text5>
    <Num1>2</Num1>
    <Num2>1</Num2>
    <Num3>1</Num3>
    <Num4>0</Num4>
    <Num5>1</Num5>
    <Num6>0</Num6>
    <AnotherText1>something here</AnotherText1>
    <Num7>2</Num7>
    <Num8>0</Num8>
    <Num9>0</Num9>
    <AnotherText2>something here</AnotherText2>
  </row>
  <row>
    <Text1>something here</Text1>
    <Text2>something else here</Text2>
    <Text3>something new here</Text3>
    <Text4>something here</Text4>
    <Text5>something here</Text5>
    <Num1>2</Num1>
    <Num2>1</Num2>
    <Num3>1</Num3>
    <Num4>0</Num4>
    <Num5>1</Num5>
    <Num6>0</Num6>
    <AnotherText1>something here</AnotherText1>
    <Num7>2</Num7>
    <Num8>0</Num8>
    <Num9>0</Num9>
    <AnotherText2>something here</AnotherText2>
  </row>
  ...
  ...
</File>
Many XML files are zipped together, hence the tar.gz file.
This runs; however, for a 400 MB file it takes 50 minutes to finish.
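For what it's worth, a gzipped archive is not splittable, so I suspect the whole file ends up in a single task. A quick way to check (df is the DataFrame from the script above):

// If this prints 1, one task is parsing the entire 400 MB archive serially
println(df.rdd.partitions.length)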
Does anyone have an idea why it is so slow, or how I might speed it up? I am running on a 7-machine cluster with about 120 GB of YARN memory, on Hortonworks HDP 2.5.3.0 with Spark 1.6.2.
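One idea I have not tried yet, sketched below: un-tar the archive first so each XML document sits as its own file (the extracted/ directory here is hypothetical), which should let Spark split the read across files and tasks:

// Hypothetical layout: the archive unpacked into individual .xml files
val dfSplit = hiveContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "row")
  .schema(xmlSchema)
  .load("hdfs://MyCluster/RawXMLData/RecievedToday/File/extracted/*.xml")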
Many thanks in advance!
Created 04-13-2017 04:17 PM
Hi Mark,
The com.databricks.spark.xml loader expects XML files at the given path, not sequence files. How will spark-xml deal with a SequenceFileInputFormat instead of com.databricks.spark.xml.XmlInputFormat?
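What I have in mind is something like the following (untested sketch: it assumes each value in the sequence file holds one complete <row> fragment, and that your spark-xml build exposes XmlReader's withRowTag and xmlRdd methods, so check your version; the path is only a placeholder):

import org.apache.hadoop.io.{LongWritable, Text}
import com.databricks.spark.xml.XmlReader

// Read the sequence file; copy each Text value into a String right away,
// since Hadoop reuses Writable objects between records.
val xmlFragments = sc
  .sequenceFile("hdfs://MyCluster/path/to/seqfile", classOf[LongWritable], classOf[Text])
  .map { case (_, value) => value.toString }

// Hand the RDD[String] to spark-xml instead of a file path
val parsed = new XmlReader()
  .withRowTag("row")
  .xmlRdd(sqlContext, xmlFragments)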