Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Semi Structured data with xml embedded

Highlighted

Semi Structured data with xml embedded

New Contributor

Hi,

I am very new to Spark.  I am trying to handle a semi structure file which is '^' delimited and has xml tags embedded in one of the columns.  Please let me know what is the best way to process this file.

 

Sampe Data:

 

11001^1^100^2015-06-05 22:35:21.543^<d><nv n="ExtStationID" v="Station/FYI Television, Inc./25102" /><nv n="MediaDesc" v="19b8f4c0-92ce-44a7-a403-df4ee413aca9" /><nv n="ChannelNumber" v="1366" /><nv n="Duration" v="24375" /><nv n="IsTunedToService" v="True" /><nv n="StreamSelection" v="FULLSCREEN_PRIMARY" /><nv n="ChannelType" v="LiveTVMediaChannel" /><nv n="TuneID" v="636007629215440000" /></d>^0122648d-4352-4eec-9327-effae0c34ef2^2016060601

 

This what I am doing currently to process it:

data.map(_.replace(",","")).map(_.replace("^",",")).map(_.replace("<d><nv","")).map(_.replace("v=",",")).map(_.replace("n=",",")).map(_.replace("\"","")).map(_.replace("/><nv","")).map(_.replace("/></d>","")).saveAsTextFile("out7")

 

Please let me know if there is a better way to handle this type of data.

 

Thanks,

Sudha.

2 REPLIES 2

Re: Semi Structured data with xml embedded

Expert Contributor

Hi Sudha,

 

I believe you have a couple options.

 

1.  Use spark-csv package to intially process the delimited portion or simply map over RDD and use splitBy but ensure delimiter does not exist within column.  You can then map over RDD again and use an XML parser on columns that contain XML.

2.  Again use spark-csv or splitBy to initially process data.  Next map over results and select only the xml column.   Next use spark-xml package to convert xml RDD into DataFrame.  The example shows reading a file, but you can also process an RDD using XMLReaders xmlRdd method.

3.  Create a spark udf.  You may be able to find a udf for parsing xml within spark already, but I haven't looked myself.  This would be similar to #1 but a bit more reusable with a little extra effort upfront.

Highlighted

Re: Semi Structured data with xml embedded

New Contributor

I don't know if this is the best way. I think there are better ways available than the below method.

I tried not to use dataframe explode method and parsed the file using functions available at RDD level.

Please correct/suggest any improvements

 

//define a case class for schema match with data input

case class stb (server_unique_id:Int,request_type:Int,event_id:Int,stb_timestamp:String,stb_xml:String,device_id:String,secondary_timestamp: String)

 

val data = spark.read.textFile(args(0)).rdd;///read data from supplied path from CLI

 

//check for ^ delimiter and 7 fields, else filter out

var clean_Data = data.filter { line => {line.trim().contains("^")}}
.map { line => {line.split("\\^")}}
.filter{ line => line.length == 7}

 

//match the schema and filter out data having event id = 100 and the tag having Duration

 var tup_Map = clean_Data.map{ line => stb (line(0).toInt,line(1).toInt,line(2).toInt,line(3),line(4),line(5),line(6))}
.filter(line => (line.event_id == 100 && line.stb_xml.contains("Duration")));

 

//xml is of name-value format, hence the attrbutes are all same(n,v)

//parse through the xml structure and find out necessary data

//xmlnv will parse top level to nodeseq having 8 different data like duration,channel in self closing tags

//and name-value format

var xml_Map = tup_Map.map{line =>
var xmld = XML.loadString(line.stb_xml);
var xmlnv = xmld \\ "nv";

var duration = 0;
for { i <- 0 to xmlnv.length-1 if xmlnv(i).attributes.toString().contains("Duration") } duration = (xmlnv(i) \\ "@v").text.toInt;

var channelNum = 0;
for { i <- 0 to xmlnv.length-1 if xmlnv(i).attributes.toString().contains("ChannelNumber") } channelNum = (xmlnv(i) \\ "@v").text.toInt;

var channelType = "";
for { i <- 0 to xmlnv.length-1 if xmlnv(i).attributes.toString().contains("ChannelType") } channelType = (xmlnv(i) \\ "@v").text;

(duration, channelNum, channelType,line.device_id)
}

 

//persist xml_Map for further operations

xml_Map.persist();

Don't have an account?
Coming from Hortonworks? Activate your account here