
Duplicate documents in high volume write scenario

New Contributor

Hi,

 

I'm seeing duplicate documents being created in a high-volume write scenario. I have multiple processes updating the same document at the same time.

I've also added a dedupe request processor (shown below) to the update chain. The id field holds a unique value and is a byte array. However, I still see duplicate documents when querying. My autoSoftCommit interval is 2 minutes and my autoCommit interval is 6 hours. I also have a 3-node setup with 3 shards. Any idea what is causing this issue? Could it be the 4.10.3 version of Solr?

 

 

  <requestHandler name="/update" class="solr.UpdateRequestHandler">

    <!-- See below for information on defining

         updateRequestProcessorChains that can be used by name

         on each Update Request

      -->

       <lst name="defaults">

         <str name="update.chain">dedupe</str>

       </lst>

  </requestHandler>

  <requestHandler name="/update/javabin" class="solr.BinaryUpdateRequestHandler">

       <lst name="defaults">

         <str name="update.chain">dedupe</str>

       </lst>

  </requestHandler>

 

     <updateRequestProcessorChain name="dedupe">

       <processor class="solr.processor.SignatureUpdateProcessorFactory">

         <bool name="enabled">true</bool>

         <str name="signatureField">signature</str>

         <bool name="overwriteDupes">true</bool>

         <str name="fields">id</str>

         <str name="signatureClass">solr.processor.Lookup3Signature</str>

       </processor>

       <processor class="solr.RunUpdateProcessorFactory" />

     </updateRequestProcessorChain>
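
For reference, here's roughly how each writer process sends its updates (a minimal SolrJ sketch; the ZooKeeper address, collection name, and the payload field are illustrative placeholders, not my real values):

  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class WriterProcess {
      public static void main(String[] args) throws Exception {
          // Cloud-aware SolrJ 4.x client; ZooKeeper host and collection are placeholders
          CloudSolrServer server = new CloudSolrServer("zkhost:2181");
          server.setDefaultCollection("mycollection");

          byte[] rawId = new byte[] {0x0A, 0x1B, 0x2C, 0x3D};  // illustrative binary id

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", rawId);               // byte[] id, the field under discussion
          doc.addField("payload_s", "new value");  // illustrative data field
          server.add(doc);                         // no explicit commit; relying on autoSoftCommit
          server.shutdown();
      }
  }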

 

Thanks,

Karthik. 


Re: Duplicate documents in high volume write scenario

Contributor

Can you try adding UUIDUpdateProcessorFactory as well, as I don't see that in your solrconfig.

Also try having the id field as a string type.

 

<updateRequestProcessorChain name="uuid">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
      <lst name="defaults">
        <str name="update.chain">uuid</str>
      </lst>
</requestHandler>

Re: Duplicate documents in high volume write scenario

New Contributor

I tried UUIDUpdateProcessorFactory and it still didn't work. UUIDUpdateProcessorFactory only creates a UUID if my field doesn't have a value, and my id field already has a unique value and is never blank.

I can't have id as a string type. Are you implying that only strings will work as unique keys in Solr?

Re: Duplicate documents in high volume write scenario

Contributor

Can you share your solrconfig.xml and schema.xml?

No, the id should be a long, not a string; the string suggestion was a typo.

More details regarding the unique key are here:

https://wiki.apache.org/solr/UniqueKey

Re: Duplicate documents in high volume write scenario

Contributor

If you already have an id field in your data file, then you don't need the UUID approach.

All you need is the id field as a long. Look at the doc I linked above.

Hope this helps.

Re: Duplicate documents in high volume write scenario

New Contributor

Here's my schema.xml: https://dl.dropboxusercontent.com/u/4755535/schema.xml

Here's the solrconfig.xml: https://dl.dropboxusercontent.com/u/4755535/solrconfig.xml

My id field is a byte array. The id field can't be a string or a long in my case.

Re: Duplicate documents in high volume write scenario

Contributor

Can you also confirm what you are using to index documents? Are you using the MapReduceIndexerTool?

Re: Duplicate documents in high volume write scenario

New Contributor
I'm using SolrJ version 4.10.3-cdh5.5.0 to index the documents.

Re: Duplicate documents in high volume write scenario

New Contributor

Binary fields are not currently supported for the "id" field.

Best practice is to leave the id field as originally specified:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 

If you have a binary ID field, simply encode it using base64 to convert it to a string.
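
For example, a minimal sketch of that encoding step (the raw id bytes here are illustrative; java.util.Base64 requires Java 8, while javax.xml.bind.DatatypeConverter.printBase64Binary is the Java 7 equivalent):

  import java.util.Base64;
  import org.apache.solr.common.SolrInputDocument;

  public class IdEncodingExample {
      public static void main(String[] args) {
          byte[] rawId = new byte[] {0x0A, 0x1B, 0x2C, 0x3D};  // illustrative binary id

          // Base64-encode the binary id so it fits the string-typed "id" field above
          String stringId = Base64.getEncoder().encodeToString(rawId);

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", stringId);  // string id, matching the field definition above
          System.out.println("id = " + stringId);
      }
  }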

Given that you already have unique ID fields, you don't need SignatureUpdateProcessorFactory or anything else.

Re: Duplicate documents in high volume write scenario

New Contributor

Thanks for the response, Yonik.