
How to use StreamXmlRecordReader to parse single & multiline xml records within a single file

I have a text input file (.txt) with the following content:

<a><b><c>val1</c></b></a>||<a><b><c>val2</c></b></a>||<a><b>
<c>val3</c></b></a>||<a></b><c>val4-c-1</c><c>val4-c-2</c></b><d>val-d-1</d></a>

If you look at the input carefully, the third XML record (the one after the second '||') is split across two lines.
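To make the layout of the sample concrete, here is a small sketch that splits it on the '||' delimiter and reports which records contain an embedded newline. The variable names are only for illustration, and the sample string is copied from the input shown above.

```python
# Sample input copied from the post; the physical line break falls inside record 3.
sample = (
    "<a><b><c>val1</c></b></a>||<a><b><c>val2</c></b></a>||<a><b>\n"
    "<c>val3</c></b></a>||<a></b><c>val4-c-1</c><c>val4-c-2</c></b><d>val-d-1</d></a>\n"
)

# Split into logical records on the '||' delimiter.
records = sample.strip().split("||")

# 1-based positions of records that span more than one physical line.
multiline = [i for i, rec in enumerate(records, start=1) if "\n" in rec]

print(len(records))
print(multiline)
```

So a reader that works on physical lines sees two lines, while the file holds four logical records.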

 

I want to use Hadoop Streaming's StreamXmlRecordReader to parse this file, with the following option:

-inputreader "org.apache.hadoop.streaming.StreamXmlRecordReader,begin=<a>,end=</a>,slowmatch=true"

However, the third record is not parsed correctly.

 

I get the following error:

Traceback (most recent call last):
  File "/home/rsome/test/code/m1.py", line 13, in <module>
    root = ET.fromstring(xml_str.getvalue())
  File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 964, in XML
    return parser.close()
  File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 1254, in close
    self._parser.Parse("", 1) # end of data
xml.parsers.expat.ExpatError: no element found: line 1, column 18478
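The "no element found" message is what the parser reports when the input ends before the root element is closed. Here is a minimal sketch that reproduces it with the first half of the third sample record. The helper name try_parse is hypothetical, and the real data is evidently longer than the sample, given the column number in the traceback.

```python
import xml.etree.ElementTree as ET


def try_parse(fragment):
    """Return None if the fragment parses, otherwise the parser's error message."""
    try:
        ET.fromstring(fragment)
        return None
    except Exception as exc:  # ExpatError on Python 2.6, ParseError on 2.7+ and 3.x
        return str(exc)


# First physical line of record 3 on its own: the root element is never closed.
fragment_error = try_parse("<a><b>")

# The same record with both physical lines joined parses cleanly.
whole_error = try_parse("<a><b><c>val3</c></b></a>")

print(fragment_error)
print(whole_error)
```

This suggests the script is being handed only part of a record at a time, not that the XML in the record is itself broken.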

I also tried slowmatch=true, but it made no difference.

My actual output is below; the third XML record is treated as two records:

$ hdfs dfs -text /user/rsome/poc/testout001/part-*
rec::1::mapper1
<a><b><c>val1</c></b></a>
rec::2::mapper1
<a><b><c>val2</c></b></a>
rec::3::mapper1
<a><b>
rec::4::mapper1
<c>val3</c></b></a>
rec::1::mapper2
<a></b><c>val4-c-1</c><c>val4-c-2</c></b><d>val-d-1</d></a>

 

My expected output is:

$ hdfs dfs -text /user/rsome/poc/testout001/part-*
rec::1::mapper1
<a><b><c>val1</c></b></a>
rec::2::mapper1
<a><b><c>val2</c></b></a>
rec::3::mapper1
<a><b><c>val3</c></b></a>
rec::1::mapper2
<a></b><c>val4-c-1</c><c>val4-c-2</c></b><d>val-d-1</d></a>
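One possible way to get output like the above is to reassemble records inside the mapper. This is only a sketch, not a confirmed fix. It assumes the record reader emits each record intact and that the split happens because the mapper reads stdin line by line, so a record with an embedded newline arrives as two lines. The names END_TAG, iter_records and run are hypothetical.

```python
import sys

END_TAG = "</a>"  # assumption: same value as end= in the -inputreader option


def iter_records(lines, end_tag=END_TAG):
    """Join consecutive input lines until one ends with end_tag."""
    buffered = []
    for line in lines:
        piece = line.strip()  # drop the newline and any trailing tab separator
        if not piece:
            continue
        buffered.append(piece)
        if piece.endswith(end_tag):
            yield "".join(buffered)
            buffered = []
    if buffered:
        # Leftover fragment with no end tag; yielded unchanged for inspection.
        yield "".join(buffered)


def run(stream_in=sys.stdin, stream_out=sys.stdout):
    """Hypothetical mapper body: write one reassembled record per line."""
    for record in iter_records(stream_in):
        stream_out.write(record + "\n")
```

In a mapper script this would be driven by calling run() with the default streams. Note that the fourth sample record as posted has </b> where <b> seems to be intended, so an XML parser would reject it even after reassembly.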

Any help with this would be greatly appreciated.