New Contributor
Posts: 2
Registered: 10-15-2013

Way to avoid duplicated documents after batch indexing with GoLive feature

Hi experts,

 

I have been trying Cloudera Search 1.0 and have a question about the behavior of the GoLive feature of batch indexing.

 

By following the tutorial, I have confirmed that the command in the section "Batch Indexing into Online Solr Servers Using GoLive Feature" successfully pushes 2106 documents into the Solr servers.

 

However, when I invoked the same command again, the document count on the Solr cluster doubled to 4212. Each subsequent run increased the count by another 2106.

 

Looking through some query results, I noticed that the index contains duplicate documents with the same id (uniqueKey) but different _version_ values. I expected a document in the current index to be overwritten by the one with the same uniqueKey in the newly pushed index. Is it possible to achieve this behavior with the GoLive feature?

Cloudera Employee
Posts: 146
Registered: 08-21-2013

Re: Way to avoid duplicated documents after batch indexing with GoLive feature

This is the expected behavior. MapReduceIndexerTool with golive can only be used to insert new documents into Solr, not to update or delete existing documents in Solr. This limitation stems from the way Lucene segment merges work. You can update existing Solr documents later via the standard Solr NRT API, of course.
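To illustrate the NRT path: a standard update sent to Solr's JSON update handler does replace the stored document that has the same uniqueKey. A minimal sketch of the request that handler expects, using only the standard library; the collection URL, document ids, and field names are placeholders, not values from this thread:

```python
import json

def build_upsert_request(solr_url, docs, commit=True):
    """Build the URL, body, and headers of a POST to Solr's JSON
    update handler. Unlike a GoLive segment merge, this path replaces
    any stored document with the same uniqueKey (last-write-wins)."""
    url = solr_url.rstrip("/") + "/update" + ("?commit=true" if commit else "")
    body = json.dumps(docs).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return url, body, headers

# Hypothetical collection URL; send the result with urllib.request or curl:
# url, body, headers = build_upsert_request(
#     "http://localhost:8983/solr/collection1",
#     [{"id": "doc1", "title": "updated title"}])
```

Sending such a request after the batch load updates documents in place instead of adding duplicates.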

Wolfgang.

New Contributor
Posts: 2
Registered: 10-15-2013

Re: Way to avoid duplicated documents after batch indexing with GoLive feature

Hi Wolfgang,

Thank you for the reply.

 

I now understand that the behavior I saw is expected and unavoidable, because it comes from the underlying mechanism of Lucene index merging.

I will look into the Solr NRT API and Flume NRT Indexing to see how I can make this product fit my use case.

 

Thanks!

Cloudera Employee
Posts: 146
Registered: 08-21-2013

Re: Way to avoid duplicated documents after batch indexing with GoLive feature

What's the use case? If it's a batch use case, why does it need batch updates, and why don't MR indexing into a new, empty collection or inserts into an existing collection cover it?

Again, once you are done with MR batch inserts into a collection, you can update that same collection with the standard NRT Solr API.

New Contributor
Posts: 5
Registered: 02-08-2017

Re: Way to avoid duplicated documents after batch indexing with GoLive feature

Hi,

I have this same problem, in a batch use case.

 

I maintain a Solr index built from data that arrives, say, once per day.

However, some of the incoming data updates existing data in the cluster, so updates (upserts) are necessary.

A full reload (regenerating the index) is not feasible, since the data to be indexed would have to be downloaded from an external system each time.

 

We tested the application locally and upserts worked fine; however, when running it on the cluster with MapReduceIndexerTool (MRIT), it just inserted documents with duplicate values in the unique field.
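One batch-only workaround, sketched here under the assumption that the ids of each day's batch are known up front: delete those ids from the live collection (and commit) before the go-live merge, so the freshly merged segments become the only copies. A minimal helper that builds the Solr JSON delete-by-id request; the URL and ids are placeholders:

```python
import json

def build_delete_request(solr_url, ids):
    """Build a delete-by-id request for Solr's JSON update handler,
    intended to run (and commit) before a MapReduceIndexerTool
    --go-live run re-inserts those same ids."""
    url = solr_url.rstrip("/") + "/update?commit=true"
    body = json.dumps({"delete": list(ids)}).encode("utf-8")
    return url, body, {"Content-Type": "application/json"}
```

This keeps the heavy indexing work in MapReduce and uses Solr's normal update handler only for the lightweight deletes.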

 

 

I understand this thread is quite old.

Is there any new feature that makes this possible?

It was suggested above to use the Solr NRT API. We don't have NRT requirements, though, so I guess this does not apply.

Is there an easy way to make this possible, and a resource that explains how to do it and the steps involved?

 

Would a solution be to merge the live Solr indexes with the indexes produced by MRIT "by hand"? Something like this... but would it help with upserting, or would it just keep the duplicates? Do we need to do anything special to remove duplicates and to keep the fresher version from a given index?
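On the merge question: a Lucene-level index merge concatenates segments and never consults the uniqueKey, so merging "by hand" would keep the duplicates just like GoLive does; only the NRT update path deduplicates. A toy model of the two behaviors, with plain dicts standing in for indexed documents (not actual Solr API calls):

```python
def segment_merge(index, new_docs):
    """What GoLive / manual index merging does: append everything,
    never consulting the uniqueKey."""
    return index + new_docs

def nrt_upsert(index, new_docs):
    """What a standard Solr update does: last write wins per uniqueKey."""
    by_id = {d["id"]: d for d in index}
    for d in new_docs:
        by_id[d["id"]] = d
    return list(by_id.values())

batch = [{"id": "doc1", "_version_": 1}]
print(len(segment_merge(batch, batch)))                           # 2: duplicate survives
print(len(nrt_upsert(batch, [{"id": "doc1", "_version_": 2}])))   # 1: replaced
```

This is why the document count doubles after every re-run of the batch job, regardless of how the merge is performed.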

 

Thanks a lot. 

 

Cloudera Employee
Posts: 146
Registered: 08-21-2013

Re: Way to avoid duplicated documents after batch indexing with GoLive feature
