I have been trying Cloudera Search 1.0 and have a question about the behavior of the GoLive feature of batch indexing.
Following the tutorial, I confirmed that the command in the section "Batch Indexing into Online Solr Servers Using GoLive Feature" successfully pushes 2106 documents into the Solr servers.
However, when I invoked the same command again, the document count on the Solr cluster doubled (i.e. it showed 4212 documents). Each time I invoked the command, the number of documents in the Solr cluster increased by 2106.
I looked through some query results and noticed that the index contains duplicate documents with the same id (uniqueKey) but different _version_ values. I had expected a document in the current index to be overwritten by the one with the same uniqueKey in the newly pushed index. Is it possible to achieve this behavior with the GoLive feature?
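For anyone who wants to check for the same symptom, here is a minimal sketch of a query that lists ids occurring more than once. It only builds the URL and does not send it. The host, the collection name and the helper name duplicate_id_query are my own placeholders; the facet parameters are standard Solr ones. Faceting on the uniqueKey can be expensive on a large index, so this is meant for small test collections like the tutorial one.

```python
from urllib.parse import urlencode


def duplicate_id_query(base_url, collection, unique_key="id"):
    """Build a Solr select URL that facets on the uniqueKey field and
    returns only values that occur in at least two documents."""
    params = {
        "q": "*:*",
        "rows": 0,              # only the facet counts are needed
        "facet": "true",
        "facet.field": unique_key,
        "facet.mincount": 2,    # hide ids that occur only once
        "facet.limit": -1,      # no cap on the number of facet values
        "wt": "json",
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical host and collection name, adjust to your cluster.
url = duplicate_id_query("http://localhost:8983/solr", "collection1")
print(url)
```

Any id that shows up in the facet counts of the response exists in more than one document.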
Thank you for the reply.
I now understand that the behavior I saw is expected and unavoidable, because it comes from the underlying mechanism of Lucene index merging.
I will check the Solr NRT API and Flume NRT indexing to see how I could make this product fit my use case.
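As far as I understand it, the difference can be illustrated with a toy model in plain Python. This is not Lucene or Solr code, and the function names merge_segments and upsert are made up for illustration. A merge appends documents without ever looking at the uniqueKey, while an update through the normal indexing path first removes any document with the same key.

```python
def merge_segments(index, new_segment):
    """Toy model of a segment-level merge: documents are appended as-is
    and the uniqueKey is never consulted."""
    return index + new_segment


def upsert(index, new_docs, key="id"):
    """Toy model of an update by uniqueKey: any existing document with
    the same key is removed before the new one is added."""
    new_keys = {doc[key] for doc in new_docs}
    kept = [doc for doc in index if doc[key] not in new_keys]
    return kept + new_docs


# The same batch of 2106 documents, pushed twice.
batch = [{"id": str(i), "text": "doc"} for i in range(2106)]

merged = merge_segments(merge_segments([], batch), batch)
upserted = upsert(upsert([], batch), batch)

print(len(merged))    # 4212
print(len(upserted))  # 2106
```

The merged count matches what I saw after running the GoLive command twice.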
I have the same problem in a batch use case.
I keep a Solr index built on data that arrives, say, once per day.
However, some of the incoming data updates existing data in the cluster, so updates (upserts) are necessary.
Doing a full reload of the data (regenerating the index) is not feasible, since the data to be indexed would have to be downloaded from an external system each time.
We tested the application locally and upserts ran fine. However, when we ran it on the cluster with MapReduceIndexerTool (MRIT), it just inserted duplicates on the unique field.
I understand this thread is quite old.
Is there any new addition that makes this possible?
It was suggested to use the Solr NRT API. We don't have NRT requirements, though, so I guess this does not apply.
Is there an easy way to make this possible, and a resource that explains how to do it and the steps necessary?
Would a solution be to merge the Solr indexes with the indexes produced by MRIT "by hand"? Something like this... but would it help with UPSERTing, or would it just keep the duplicates? Do we need to do anything special to remove duplicates and keep the fresher version from a certain index?
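To make the last question concrete, this is roughly the deduplication logic I have in mind, as a plain Python sketch over already-fetched documents rather than anything Solr-specific. The helper name keep_freshest is made up, and it assumes that a field such as _version_ (or a timestamp field of our own) really orders the documents by freshness, which I have not verified for indexes produced by MRIT.

```python
def keep_freshest(docs, key="id", version="_version_"):
    """Return one document per key, keeping the one with the highest
    value in the version field (assumed to mean 'freshest')."""
    freshest = {}
    for doc in docs:
        current = freshest.get(doc[key])
        if current is None or doc[version] > current[version]:
            freshest[doc[key]] = doc
    return list(freshest.values())


# Made-up sample: id "a" exists in both the old and the new index.
docs = [
    {"id": "a", "_version_": 1, "text": "old"},
    {"id": "b", "_version_": 1, "text": "only"},
    {"id": "a", "_version_": 2, "text": "new"},
]
result = keep_freshest(docs)
print(sorted(doc["text"] for doc in result))
```

If a merge alone cannot do this, I assume the stale documents would have to be deleted from the live index by id before the new index is merged in.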
Thanks a lot.