
Xml parsing error in cloudera search



New Contributor


I have many XML files and no control over how they are generated. Cloudera Search works fine for well-formed XML. During bulk indexing, however, almost all the mappers fail, which causes the whole job to fail with the following exception:


Caused by: org.kitesdk.morphline.api.MorphlineRuntimeException: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 3)) at [row,col {unknown-source}]: [13220,0]


After careful research, I understood that these characters are not allowed in XML 1.0 but are allowed in XML 1.1. How can I tell the morphline to use XML 1.1 so that those characters are read properly?


After looking into the class below, I tried the following:


xquery {
    features: {"XML_VERSION":"1.1"}
}


But I got this error:


Error: java.lang.IllegalArgumentException: Unknown configuration option XML_VERSION


Also, would the above option be the best solution, or are there other options?


Re: Xml parsing error in cloudera search

Expert Contributor
Configuring Saxon won't help because the Woodstox XML parser is invoked before Saxon even comes into play. Maybe there's a Java system property for Woodstox that enables XML 1.1? Or maybe a more recent version of Woodstox understands XML 1.1 out of the box?

(As an aside, you'd use "" instead of "XML_VERSION" because
public final static String XML_VERSION = ""; )


Re: Xml parsing error in cloudera search

New Contributor
I tried xml-version too, but no luck. Is there any alternative that lets the morphline stage handle the special characters and move on?
We need to either skip those bad XML files or make the parser understand the special characters so that those documents are indexed as well.
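
For the "skip those bad XML files" route, one option is to screen each document before it reaches the morphline. The following is only a minimal sketch, assuming the screening runs as a separate Java step and uses the standard javax.xml.stream (StAX) API. XmlPrecheck and isWellFormed are hypothetical names and are not part of Kite or Cloudera Search. The parser that actually runs depends on which StAX implementation is on the classpath.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: reports whether a document parses cleanly,
// so that malformed files can be set aside instead of failing a mapper.
final class XmlPrecheck {

    static boolean isWellFormed(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Only well-formedness matters here, so DTDs and external entities are not resolved.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                // Walk every event; illegal characters or broken structure throw here.
                while (reader.hasNext()) {
                    reader.next();
                }
            } finally {
                reader.close();
            }
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }
}
```

Files for which isWellFormed returns false could be moved to a quarantine directory and left out of the bulk indexing job. The cost is that every file is parsed twice.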

Re: Xml parsing error in cloudera search

New Contributor

This issue is still not resolved. I tried removing the control characters, and now I am getting errors on surrogate characters. There should be a way to tell the XML parser to use UTF-16 encoding and accept all these characters, rather than replacing each one individually. The xquery command does not expose that as a configurable option. Can anyone help resolve this?
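
If the files have to be cleaned rather than skipped, control characters and stray surrogates can be handled in one pass by filtering on code points instead of individual chars. The following is only a sketch, assuming the cleanup runs as a preprocessing step before indexing and that dropping the offending characters is acceptable. XmlSanitizer and its methods are hypothetical names. The ranges follow the Char production of the XML 1.0 specification.

```java
// Hypothetical helper: removes everything the XML 1.0 Char production forbids.
final class XmlSanitizer {

    // XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    static String stripInvalidXml10(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // codePointAt joins a valid surrogate pair into one supplementary code point;
            // an unpaired surrogate comes back as itself and fails the range check.
            int codePoint = input.codePointAt(i);
            if (isValidXml10(codePoint)) {
                out.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }
}
```

Because the loop advances by Character.charCount, a properly paired surrogate is kept as a single supplementary character, while a lone surrogate is removed along with control characters such as code 3.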
