New Contributor
Posts: 3
Registered: ‎01-08-2016

Xml parsing error in cloudera search

Hi, 

I have many XML files and no control over how they are generated. Cloudera Search works fine for well-formed XML. However, during bulk indexing almost all of the mappers fail, which causes the whole job to fail with the following exception.

 

Caused by: org.kitesdk.morphline.api.MorphlineRuntimeException: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 3)) at [row,col {unknown-source}]: [13220,0]

 

After some research, I understand that these characters are not allowed in XML 1.0 but are allowed in XML 1.1. How can I tell the morphline to use XML 1.1 so that those characters are read properly?

 

I tried the following after looking at this class:

http://www.saxonica.com/html/documentation/javadoc/net/sf/saxon/lib/FeatureKeys.html#XML_VERSION

 

xquery {
    features: {"XML_VERSION":"1.1"}
}

 

But I got this error:

 

Error: java.lang.IllegalArgumentException: Unknown configuration option XML_VERSION

 

Also, would the above option be the best solution, or are there other options?

Cloudera Employee
Posts: 146
Registered: ‎08-21-2013

Re: Xml parsing error in cloudera search

Configuring Saxon won't help, because the Woodstox XML parser is invoked before Saxon even comes into play. Maybe there's a Java system property for Woodstox that enables XML 1.1? Or maybe a more recent version of Woodstox understands XML 1.1 out of the box?

(As an aside, you'd use "http://saxon.sf.net/feature/xml-version" instead of "XML_VERSION", because FeatureKeys declares
public final static String XML_VERSION = "http://saxon.sf.net/feature/xml-version"; )

Wolfgang
New Contributor
Posts: 3
Registered: ‎01-08-2016

Re: Xml parsing error in cloudera search

I tried xml-version too, but no luck. Is there any alternative that lets the morphline stage handle the special characters and move on?
We need to either skip the bad XML files, or make the parser accept the special characters so that those docs get indexed as well.
New Contributor
Posts: 3
Registered: ‎01-08-2016

Re: Xml parsing error in cloudera search

This issue is still not resolved. I tried removing the control characters, and now I am getting an error about surrogate characters. There should be a way to tell the XML parser to use UTF-16 encoding and accept all these characters, rather than having to replace each one. The xquery command does not expose such an option. Can anyone help resolve this?
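
For reference, the character clean-up described above could be sketched roughly as follows. This is only a minimal sketch, not something Morphlines or Woodstox provides: the class name XmlSanitizer and its methods are hypothetical, and it assumes the file content can be run through such a filter before it reaches the XML parser. It keeps only code points matched by the XML 1.0 Char production and walks the text by code point, so valid surrogate pairs are kept while unpaired surrogates and control characters such as code 3 are dropped.

```java
// Hypothetical helper (not part of Kite Morphlines or Woodstox):
// drops every code point that the XML 1.0 "Char" production does not allow.
final class XmlSanitizer {

    // XML 1.0 Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
    //                | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String sanitize(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            // codePointAt joins a valid surrogate pair into one code point;
            // an unpaired surrogate comes back as itself (0xD800-0xDFFF)
            // and is rejected by isValidXml10.
            int cp = in.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The control character (code 3) between "a" and "b" is removed.
        System.out.println(sanitize("a\u0003b"));
    }
}
```

Filtering by code point rather than by char is what avoids the surrogate problem: removing chars one at a time can split a valid pair and leave a lone surrogate behind, which the parser then rejects.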
