New Contributor
Posts: 5
Registered: ‎01-06-2016

Duplicate documents in high volume write scenario


Hi,

 

I'm seeing duplicate documents being created in my high-volume write scenario. I have multiple processes updating the same document at the same time.

 

I've also added a dedupe request processor to the update chain, as shown below. The id field holds a unique value and is a byte array. However, I still see duplicate documents when querying. My auto soft commit interval is 2 minutes and my auto commit interval is 6 hours. I have a 3-node setup with 3 shards. Any idea what is causing this issue? Could it be the 4.10.3 version of Solr?

  <requestHandler name="/update" class="solr.UpdateRequestHandler">
    <!-- See below for information on defining
         updateRequestProcessorChains that can be used by name
         on each Update Request
      -->
    <lst name="defaults">
      <str name="update.chain">dedupe</str>
    </lst>
  </requestHandler>

  <requestHandler name="/update/javabin" class="solr.BinaryUpdateRequestHandler">
    <lst name="defaults">
      <str name="update.chain">dedupe</str>
    </lst>
  </requestHandler>

  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">id</str>
      <str name="signatureClass">solr.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

 

Thanks,

Karthik. 

Cloudera Employee
Posts: 25
Registered: ‎08-22-2014

Re: Duplicate documents in high volume write scenario

Can you try adding a UUIDUpdateProcessorFactory as well? I don't see that in your solrconfig.

 

Also, try having the id field as a string type.

 

<updateRequestProcessorChain name="uuid">
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
      <lst name="defaults">
        <str name="update.chain">uuid</str>
      </lst>
</requestHandler>

New Contributor
Posts: 5
Registered: ‎01-06-2016

Re: Duplicate documents in high volume write scenario

I tried UUIDUpdateProcessorFactory and it still didn't work. UUIDUpdateProcessorFactory only creates a UUID if the field doesn't have a value, and my id field already has a unique value and is never blank.

 

I can't have id as a string type. Are you implying that only strings will work as unique keys in Solr?

Cloudera Employee
Posts: 25
Registered: ‎08-22-2014

Re: Duplicate documents in high volume write scenario

Can you share your solrconfig.xml and schema.xml?

 

No, id should be a long, not a string; that was a typo.

 

There are more details about unique keys here:

https://wiki.apache.org/solr/UniqueKey

Cloudera Employee
Posts: 25
Registered: ‎08-22-2014

Re: Duplicate documents in high volume write scenario

If you already have an id field in your data, then you don't need the UUID approach.

 

All you need is the id field as a long. Look at the doc I linked above.

Hope this helps.

New Contributor
Posts: 5
Registered: ‎01-06-2016

Re: Duplicate documents in high volume write scenario

Here's my schema.xml - https://dl.dropboxusercontent.com/u/4755535/schema.xml

 

Here's the solrconfig.xml - https://dl.dropboxusercontent.com/u/4755535/solrconfig.xml

 

My id field is a byte array; it can't be a string or a long in my case.

Cloudera Employee
Posts: 25
Registered: ‎08-22-2014

Re: Duplicate documents in high volume write scenario

Can you also confirm what you are using to index documents? Are you using the MapReduceIndexerTool?

New Contributor
Posts: 5
Registered: ‎01-06-2016

Re: Duplicate documents in high volume write scenario

I'm using SolrJ version 4.10.3-cdh5.5.0 to index the documents.
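
For reference, the write path looks roughly like this (a simplified sketch; the ZooKeeper address, collection name, and payload field are placeholders, not my real values):

import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class Indexer {
    // Placeholder ZooKeeper address and collection name.
    private final CloudSolrServer server = new CloudSolrServer("zkhost:2181/solr");

    public Indexer() {
        server.setDefaultCollection("my_collection");
    }

    // Called concurrently by multiple processes, often for the same key.
    public void index(byte[] id, String payload) throws SolrServerException, IOException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);           // byte[] id -- the field under discussion
        doc.addField("payload", payload); // placeholder content field
        server.add(doc);                  // no explicit commit; autoSoftCommit 2 min, autoCommit 6 h
    }
}
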
New Contributor
Posts: 1
Registered: ‎01-11-2016

Re: Duplicate documents in high volume write scenario

Binary fields are not currently supported for the "id" field.

Best practice is to leave the id field as originally specified:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 

If you have a binary ID field, simply encode it using base64 to convert it to a string.
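
For example, a minimal sketch with SolrJ (the class and field names here are just illustrative; java.util.Base64 needs Java 8, and commons-codec's Base64 does the same job on older JVMs):

import java.util.Base64;

import org.apache.solr.common.SolrInputDocument;

public class BinaryIds {
    // Base64-encode the binary key so it fits a string uniqueKey field.
    static String encodeId(byte[] rawId) {
        return Base64.getEncoder().encodeToString(rawId);
    }

    static SolrInputDocument docFor(byte[] rawId, String payload) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", encodeId(rawId)); // same bytes always yield the same string id
        doc.addField("payload", payload);    // placeholder content field
        return doc;
    }
}

Since the encoding is deterministic, re-adding a document with the same binary key yields the same string id, and normal overwrite-by-uniqueKey handles the duplicates.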

Given that you already have unique ID fields, you don't need SignatureUpdateProcessorFactory or anything else.

New Contributor
Posts: 5
Registered: ‎01-06-2016

Re: Duplicate documents in high volume write scenario

Thanks for the response, Yonik.