New Contributor
Posts: 1
Registered: ‎11-15-2016

Semi Structured data with xml embedded

[ Edited ]


I am very new to Spark.  I am trying to handle a semi structure file which is '^' delimited and has xml tags embedded in one of the columns.  Please let me know what is the best way to process this file.


Sampe Data:


11001^1^100^2015-06-05 22:35:21.543^<d><nv n="ExtStationID" v="Station/FYI Television, Inc./25102" /><nv n="MediaDesc" v="19b8f4c0-92ce-44a7-a403-df4ee413aca9" /><nv n="ChannelNumber" v="1366" /><nv n="Duration" v="24375" /><nv n="IsTunedToService" v="True" /><nv n="StreamSelection" v="FULLSCREEN_PRIMARY" /><nv n="ChannelType" v="LiveTVMediaChannel" /><nv n="TuneID" v="636007629215440000" /></d>^0122648d-4352-4eec-9327-effae0c34ef2^2016060601


This what I am doing currently to process it:",","")).map(_.replace("^",",")).map(_.replace("<d><nv","")).map(_.replace("v=",",")).map(_.replace("n=",",")).map(_.replace("\"","")).map(_.replace("/><nv","")).map(_.replace("/></d>","")).saveAsTextFile("out7")


Please let me know if there is a better way to handle this type of data.




Cloudera Employee
Posts: 97
Registered: ‎05-10-2016

Re: Semi Structured data with xml embedded

Hi Sudha,


I believe you have a couple options.


1.  Use spark-csv package to intially process the delimited portion or simply map over RDD and use splitBy but ensure delimiter does not exist within column.  You can then map over RDD again and use an XML parser on columns that contain XML.

2.  Again use spark-csv or splitBy to initially process data.  Next map over results and select only the xml column.   Next use spark-xml package to convert xml RDD into DataFrame.  The example shows reading a file, but you can also process an RDD using XMLReaders xmlRdd method.

3.  Create a spark udf.  You may be able to find a udf for parsing xml within spark already, but I haven't looked myself.  This would be similar to #1 but a bit more reusable with a little extra effort upfront.

New Contributor
Posts: 1
Registered: ‎12-21-2017

Re: Semi Structured data with xml embedded

I don't know if this is the best way. I think there are better ways available than the below method.

I tried not to use dataframe explode method and parsed the file using functions available at RDD level.

Please correct/suggest any improvements


//define a case class for schema match with data input

case class stb (server_unique_id:Int,request_type:Int,event_id:Int,stb_timestamp:String,stb_xml:String,device_id:String,secondary_timestamp: String)


val data =;///read data from supplied path from CLI


//check for ^ delimiter and 7 fields, else filter out

var clean_Data = data.filter { line => {line.trim().contains("^")}}
.map { line => {line.split("\\^")}}
.filter{ line => line.length == 7}


//match the schema and filter out data having event id = 100 and the tag having Duration

 var tup_Map ={ line => stb (line(0).toInt,line(1).toInt,line(2).toInt,line(3),line(4),line(5),line(6))}
.filter(line => (line.event_id == 100 && line.stb_xml.contains("Duration")));


//xml is of name-value format, hence the attrbutes are all same(n,v)

//parse through the xml structure and find out necessary data

//xmlnv will parse top level to nodeseq having 8 different data like duration,channel in self closing tags

//and name-value format

var xml_Map ={line =>
var xmld = XML.loadString(line.stb_xml);
var xmlnv = xmld \\ "nv";

var duration = 0;
for { i <- 0 to xmlnv.length-1 if xmlnv(i).attributes.toString().contains("Duration") } duration = (xmlnv(i) \\ "@v").text.toInt;

var channelNum = 0;
for { i <- 0 to xmlnv.length-1 if xmlnv(i).attributes.toString().contains("ChannelNumber") } channelNum = (xmlnv(i) \\ "@v").text.toInt;

var channelType = "";
for { i <- 0 to xmlnv.length-1 if xmlnv(i).attributes.toString().contains("ChannelType") } channelType = (xmlnv(i) \\ "@v").text;

(duration, channelNum, channelType,line.device_id)


//persist xml_Map for further operations