Xml parsing error in cloudera search

Hi, 

I have many XML files and no control over how they are generated. Cloudera Search was working fine for well-formed XML. However, during bulk indexing almost all the mappers fail, which causes the whole job to fail with the following exception:

 

Caused by: org.kitesdk.morphline.api.MorphlineRuntimeException: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 3)) at [row,col {unknown-source}]: [13220,0]

 

After some research, my understanding is that these characters are not allowed in XML 1.0 but are allowed in XML 1.1. How can I tell the morphline to use XML 1.1 so that those characters are read properly?

 

I tried the following after looking at this class:

http://www.saxonica.com/html/documentation/javadoc/net/sf/saxon/lib/FeatureKeys.html#XML_VERSION

 

xquery {
    features: {"XML_VERSION":"1.1"}
}

 

But I got this error:

 

Error: java.lang.IllegalArgumentException: Unknown configuration option XML_VERSION

 

Also, would the above option be the best solution, or are there other options?
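
One option that does not depend on the parser's XML version is to strip the offending characters before the files reach the morphline. Note that XML 1.1 only admits control characters such as code 3 as character references (for example `&#3;`), not as raw characters, so switching versions may not help with raw bytes in the input anyway. Below is a minimal sketch in Java of such a filter, based on the `Char` production of the XML 1.0 specification. The class and method names (`XmlSanitizer`, `stripInvalidXml10Chars`) are hypothetical, and how to hook this into the indexing job (for instance as a preprocessing pass over the files) is not shown and would need to be worked out for your setup.

```java
// Hypothetical helper, not part of Kite Morphlines or Cloudera Search.
final class XmlSanitizer {
    private XmlSanitizer() {}

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    public static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with every character illegal in XML 1.0 dropped. */
    public static String stripInvalidXml10Chars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);      // handles surrogate pairs
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u0003 is the CTRL-CHAR (code 3) reported in the exception above.
        String dirty = "<doc>abc\u0003def</doc>";
        System.out.println(stripInvalidXml10Chars(dirty)); // prints <doc>abcdef</doc>
    }
}
```

Dropping characters changes the indexed content slightly, so this only makes sense if the control characters carry no meaning in the data.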
