Way to avoid duplicated documents after batch indexing with GoLive feature

New Contributor

Hi experts,

 

I have been trying Cloudera Search 1.0, and I have a question about the behavior of the GoLive feature of batch indexing.

 

Following the tutorial, I confirmed that the command in the section "Batch Indexing into Online Solr Servers Using GoLive Feature" successfully pushes 2106 documents into the Solr servers.

 

However, when I re-invoked the same command, the document count on the Solr cluster doubled (i.e., it showed 4212 documents). Each time I invoked the command, the document count increased by 2106.

 

I have looked through some query results and noticed that the index contains duplicate documents with the same id (uniqueKey) but different _version_ values. I expected a document in the current index to be overwritten by the document with the same uniqueKey in the newly pushed index. Is it possible to achieve this behavior with the GoLive feature?
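In other words, what I expected is essentially upsert semantics keyed on the uniqueKey. A minimal sketch of that expectation (the field names are made up for illustration):

```python
# Sketch of the overwrite-by-uniqueKey behavior I expected from GoLive:
# re-pushing a batch with the same ids should replace documents,
# not append second copies. (Field names here are hypothetical.)

def merge_batch(index, batch):
    """Merge a freshly built batch into the live index, overwriting
    any document that shares the same uniqueKey ('id')."""
    for doc in batch:
        index[doc["id"]] = doc  # overwrite, not append
    return index

live = {}
batch = [{"id": str(i), "text": "doc %d" % i} for i in range(2106)]
live = merge_batch(live, batch)
live = merge_batch(live, batch)  # re-pushing the same batch

print(len(live))  # 2106 either way, not 4212
```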

5 Replies

Re: Way to avoid duplicated documents after batch indexing with GoLive feature

Expert Contributor
This is the expected behavior. MapReduceIndexerTool with GoLive can only insert new documents into Solr; it cannot update or delete existing documents. This limitation stems from the way Lucene segment merges work. You can, of course, update existing Solr documents later via the standard Solr NRT API.
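For example, with the standard update handler a re-added document replaces the existing one with the same uniqueKey. A sketch, assuming a collection named collection1 on localhost and a hypothetical text field:

```shell
# Re-add a document through Solr's NRT update handler; Solr replaces
# any existing document with the same uniqueKey ("id" here).
curl 'http://localhost:8983/solr/collection1/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "doc1", "text": "updated body"}]'
```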

Wolfgang.

Re: Way to avoid duplicated documents after batch indexing with GoLive feature

New Contributor

Hi Wolfgang,

Thank you for the reply.

 

I now understand that the behavior I saw is expected and unavoidable, since it comes from the underlying mechanism of Lucene index merging.

I will look into the Solr NRT API and Flume NRT indexing to see how this product could fit my use case.

 

Thanks!

Re: Way to avoid duplicated documents after batch indexing with GoLive feature

Expert Contributor
What's the use case? If it's a batch use case, why does it need batch updates, and why doesn't MR indexing into a new empty collection, or inserting into an existing collection, cover it?

Again, once you are done with MR batch inserts into a collection, you can update that same collection with the standard NRT Solr API.
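To make that concrete: after the MR batch load, individual documents can be modified in place with Solr's atomic updates (collection and field names below are assumptions; atomic updates require the updated fields to be stored):

```shell
# Atomic update: set one field of an existing document, selected by
# uniqueKey, without re-sending the whole document.
curl 'http://localhost:8983/solr/collection1/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "doc1", "text": {"set": "new value"}}]'
```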

Re: Way to avoid duplicated documents after batch indexing with GoLive feature

New Contributor

Hi,

I have this same problem, in a batch use case.

 

I maintain a Solr index built on data that arrives, say, once per day.

However, some of that data may update existing data in the cluster, so updates (upserts) are necessary.

Doing a full reload (re-generating the index from scratch) is not feasible, since the data to be indexed would have to be re-downloaded from an external system each time.

 

We tested the application locally and upserts ran fine; however, when running it in the cluster with MapReduceIndexerTool (MRIT), it simply inserted duplicates on the unique key field.

 

 

I understand this thread is quite old.

Has anything been added since then that makes this possible?

The Solr NRT API was suggested above; we don't have NRT requirements, though, so I guess that does not apply.

Is there an easy way to make this possible, and a resource that explains how to do it and the steps required?

 

Would a solution be to merge the Solr indexes with the indexes produced by MRIT "by hand"? Something like this... but would that help with upserting, or would it just keep the duplicates? Would we need to do anything special to remove duplicates, and to keep the fresher version from a given index?

 

Thanks a lot. 

 

Re: Way to avoid duplicated documents after batch indexing with GoLive feature

Expert Contributor