Community Articles

Find and share helpful community-sourced technical articles.
Labels (1)
avatar

Joining Collections in SOLR (Part 1)

Sometimes you may want to inner join data from one solr connection to another. There is a facility to perform this action using a join query in SOLR. The easiest way to perform the join is by linking a single attribute from one collection to another attribute in another collection. This join works very well for standalone indexes, but does not work well for distributed indexes. To do this in a distributed index, we’ll perform that in part II of this article.

To demonstrate, let’s say we have two collections. Sales, which contains the amount of sales by region. And in the other collection called People, which has people categorized by their region and a flag if they are a manager. Let’s say our goal is to find all of the sales by manager. To do this, we will join the collections using region as our join key, and also filter the people data by if they are a manager or not.

6363-screen-shot-2016-08-02-at-82311-pm.png

Here is the filter query (fq) in solr on how to make this happen:
fq={!join from=region_s to=region_s fromIndex=people}mgr_s:yes

Let's use an actual example to show the functionality...

First let’s create a sales collections and populate it:

curl "http://127.0.0.1:8983/solr/admin/cores?action=CREATE&name=sales&instanceDir=/opt/hostname-hdpsearch/solr/server/solr/sales&configSet=basic_configs" 

We'll populate it with data using the Solr Admin UI. Select the Sales core, then choose Documents. Document Type should be CSV, paste the values below into the text box and then click Submit Document. Very simple way to index sample data.

id,region_s,sales_i
1,east,100000 
2,west,200000 
3,north,300000 
4,south,400000 

6362-screen-shot-2016-08-03-at-114429-pm.png

Now create our second collection, people:

curl "http://127.0.0.1:8983/solr/admin/cores?action=CREATE&name=people&instanceDir=/Users/ccasano/Applications/solr/solr-5.2.1/server/solr/people&configSet=basic_configs" 

You can upload the following data as well into the people collection this time.

id,name_s,region_s,salary_i,mgr_s 
1,chris,east,100000,yes 
2,jen,west,200000,yes 
3,james,east,75000,no 
4,ruby,north,50000,yes 
5,charlotte,west,120000,yes 

Finally let’s run our join query to produce the results we are looking for.

http://localhost:8983/solr/sales/select?q=*:*&fq={!join from=region_s to=region_s fromIndex=people}mgr_s:yes

You should see the following results:

6361-screen-shot-2016-08-04-at-120421-am.png

If you would like to run the same functionality using compounded join keys (i.e. 2 or more join keys). The best things to do is concatenate those keys on ingest to create a single join key.

Additionally, this functionality does not work with distributed indexes, i.e. multiple shards. If you try to attempt this on a distributed index with multiple shards, you’ll get the following error message:

"error": { "msg": "SolrCloud join: multiple shards not yet supported people", "code": 400 

In Conclusion: Joins between SOLR collections are useful but should be taken with caution. As you can see, this query only works with simple non-distributed collections. Additionally, you can only display the fields from the sales collection and not the people collection which is a total bummer. A more common practice is to pre-join the information before it’s indexed. For joining collections with multiple shards, you could also try to attempt this with Spark. Stay tuned on how to do this in Part II of this post.

38,573 Views
Comments
avatar
New Contributor

Hi, Thanks for the guide.

Are there any updates regarding part II? Is there a built-in way in Solr to join across multiple collection or you've implemented it through using Spark?

avatar
New Contributor

Hey ,

Its a wonderful article. But can you share the link to part ii of this article?