XML parsing error in Cloudera Search

Hi, 

I have many XML files and no control over how they are generated. Cloudera Search worked fine for well-formed XML files. However, during bulk indexing almost all the mappers fail, which causes the job to fail with the following exception:

 

Caused by: org.kitesdk.morphline.api.MorphlineRuntimeException: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 3)) at [row,col {unknown-source}]: [13220,0]

 

After some research, I understand that these characters are not allowed in XML 1.0 but are allowed in XML 1.1. How can I tell the morphline code to use XML 1.1 so that those characters are read properly?
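
To make the character issue concrete, here is a minimal Java sketch of the XML 1.0 `Char` rule that the parser is enforcing. The class and method names (`Xml10Filter`, `isLegalXml10`, `stripIllegal`) are hypothetical and not part of morphlines or Cloudera Search; this only illustrates which code points are rejected and how a pre-processing step might drop them before indexing. Dropping characters changes the data, so it may not suit every case.

```java
public final class Xml10Filter {

    // True if the code point matches the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input with every illegal code point removed.
    // Hypothetical helper: assumes dropping the characters is acceptable.
    static String stripIllegal(String in) {
        StringBuilder out = new StringBuilder(in.length());
        in.codePoints()
          .filter(Xml10Filter::isLegalXml10)
          .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // (char) 3 is the CTRL-CHAR with code 3 from the exception above.
        String dirty = "<a>bad" + (char) 3 + "char</a>";
        System.out.println(stripIllegal(dirty));
    }
}
```

Something along these lines would have to run on the raw files before the morphline reads them, which is an assumption about the pipeline rather than a built-in feature.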

 

After looking at the class below, I tried the following:

http://www.saxonica.com/html/documentation/javadoc/net/sf/saxon/lib/FeatureKeys.html#XML_VERSION

 

xquery {
    features: {"XML_VERSION":"1.1"}
}

 

But I got this error:

 

Error: java.lang.IllegalArgumentException: Unknown configuration option XML_VERSION

 

Also, would the above option be the best solution, or are there other options?
