06-11-2016 09:40 PM
I have an input file (txt) as below
<a><b><c>val1</c></b></a>||<a><b><c>val2</c></b></a>||<a><b>
<c>val3</c></b></a>||<a></b><c>val4-c-1</c><c>val4-c-2</c></b><d>val-d-1</d></a>
If you look at the input carefully, the third XML record (the one containing val3, after the second '||') is split across two lines.
I want to use Hadoop Streaming's StreamXmlRecordReader to parse this file, with the following option:
-inputreader "org.apache.hadoop.streaming.StreamXmlRecordReader,begin=<a>,end=</a>,slowmatch=true"
However, the 3rd record is not parsed correctly.
I am getting the below error
Traceback (most recent call last):
  File "/home/rsome/test/code/m1.py", line 13, in <module>
    root = ET.fromstring(xml_str.getvalue())
  File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 964, in XML
    return parser.close()
  File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 1254, in close
    self._parser.Parse("", 1) # end of data
xml.parsers.expat.ExpatError: no element found: line 1, column 18478
I have used slowmatch=true as well but still no luck.
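For reference, a "no element found" error of this kind can be reproduced outside Hadoop by handing ElementTree only the first half of a record. The snippet below is a minimal sketch, not the code from m1.py; the helper name try_parse is made up for illustration, and it only assumes that the mapper ends up calling ET.fromstring on an incomplete fragment such as "<a><b>".

```python
import xml.etree.ElementTree as ET


def try_parse(fragment):
    """Return the root tag if the fragment parses, else an error string."""
    try:
        return ET.fromstring(fragment).tag
    except Exception as exc:
        # Broad catch on purpose: Python 2.6 raises xml.parsers.expat.ExpatError,
        # newer versions raise xml.etree.ElementTree.ParseError.
        return "error: %s" % exc


# A complete record parses; a record cut off at the line break does not.
print(try_parse("<a><b><c>val3</c></b></a>"))
print(try_parse("<a><b>"))
```

If the truncated fragment fails in the same way, that would suggest the mapper is receiving half records rather than that the XML itself is invalid.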
My output is shown below; the third XML record is being treated as two records:
$ hdfs dfs -text /user/rsome/poc/testout001/part-*
rec::1::mapper1
<a><b><c>val1</c></b></a>
rec::2::mapper1
<a><b><c>val2</c></b></a>
rec::3::mapper1
<a><b>
rec::4::mapper1
<c>val3</c></b></a>
rec::1::mapper2
<a></b><c>val4-c-1</c><c>val4-c-2</c></b><d>val-d-1</d></a>
My expected output is
$ hdfs dfs -text /user/rsome/poc/testout001/part-*
rec::1::mapper1
<a><b><c>val1</c></b></a>
rec::2::mapper1
<a><b><c>val2</c></b></a>
rec::3::mapper1
<a><b><c>val3</c></b></a>
rec::1::mapper2
<a></b><c>val4-c-1</c><c>val4-c-2</c></b><d>val-d-1</d></a>
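To make the expected grouping concrete, here is a minimal plain-Python sketch of the record boundaries wanted: everything from a begin marker to the next end marker is one record, regardless of line breaks in between. This is not Hadoop code and does not show how StreamXmlRecordReader works internally; the function name extract_records and the sample string are made up for illustration, and the sketch assumes the whole text is available in memory and that records are not nested.

```python
def extract_records(text, begin="<a>", end="</a>"):
    """Return every begin...end span in text, with embedded line breaks removed."""
    records = []
    pos = 0
    while True:
        start = text.find(begin, pos)
        if start == -1:
            break
        stop = text.find(end, start + len(begin))
        if stop == -1:
            break  # incomplete trailing record; ignored in this sketch
        stop += len(end)
        # Drop the line break that splits a record across two lines.
        records.append(text[start:stop].replace("\r", "").replace("\n", ""))
        pos = stop
    return records


# Hypothetical sample: the third record is split across two lines.
sample = (
    "<a><b><c>val1</c></b></a>||<a><b><c>val2</c></b></a>||<a><b>\n"
    "<c>val3</c></b></a>"
)

for number, record in enumerate(extract_records(sample), 1):
    print("rec::%d" % number)
    print(record)
```

A mapper that buffers its input like this would only be a workaround within a single input split; a record that straddles two splits would still need to be handled by the input reader.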
Any help on this would be greatly appreciated.