Created on 11-01-2016 08:55 PM - edited 08-17-2019 08:31 AM
Cross Data Center Replication, commonly abbreviated as CDCR, is a new feature found in SolrCloud 6.x. This feature enables Solr to replicate data from one source collection to one or more target collections distributed between data centers. The current version provides an active-passive disaster recovery solution for Solr. Data updates, which include adds, updates, and deletes, are copied from the source collection to the target collection. This means the target collection should not be sent data updates outside of the CDRC functionality.
Prior to SolrCloud 6.x you had to manually design a strategy for replication across data centers. This tutorial will guide you through the process of enabling CDCR between two SolrCloud clusters, each with 1 server, in a Vagrant + VirtualBox environment.
NOTE: Solr 6 is being deployed as a standalone application. HDP 2.5 provides support for Solr 5.5.2 via HDPSearch which does not include CDCR functionality.
Vagrant plugin vagrant-hostmanager 1.8.5 ( vagrant-hostmanager)
You should have already downloaded the Apache Solr 6.2.1 release ( Apache Solr 6.2.1)
This tutorial was tested using the following environment and components:
 I like to create project directories. My Vagrant work goes under  ~/Vagrant/<project> and my Docker work goes under ~/Docker/<project>. This allows me to clearly identify which technology is associated with the projects and allows me to use various helper scripts to automate processes, etc. So let's create project directory for this tutorial.
mkdir -p ~/Vagrant/solrcloud-cdcr-tutorial && cd ~/Vagrant/solrcloud-cdcr-tutorial
The Vagrantfile tells Vagrant how to configure your virtual machines. You can copy/paste my Vagrantfile below or use the version in the attachments area of this tutorial. Here is the content from my file:
# -*- mode: ruby -*-
# vi: set ft=ruby :
# Using yaml to load external configuration files
require 'yaml'
Vagrant.configure(2) do |config|
  # Using the hostmanager vagrant plugin to update the host files
  config.hostmanager.enabled = true
  config.hostmanager.manage_host = true
  config.hostmanager.manage_guest = true
  config.hostmanager.ignore_private_ip = false
  # Loading in the list of commands that should be run when the VM is provisioned.
  commands = YAML.load_file('commands.yaml')
  commands.each do |command|
    config.vm.provision :shell, inline: command
  end
  # Loading in the VM configuration information
  servers = YAML.load_file('servers.yaml')
  servers.each do |servers|
    config.vm.define servers[name] do |srv|
      srv.vm.box = servers[box] # Speciy the name of the Vagrant box file to use
      srv.vm.hostname = servers[name] # Set the hostname of the VM
      srv.vm.network private_network, ip: servers[ip], :adapater=>2 # Add a second adapater with a specified IP
      srv.vm.network :forwarded_port, guest: 22, host: servers[port] # Add a port forwarding rule
      srv.vm.provision :shell, inline: "sed -i'' '/^127.0.0.1\t#{srv.vm.hostname}\t#{srv.vm.hostname}$/d' /etc/hosts" # Remove the extraneous first entry in /etc/hosts
      srv.vm.provider :virtualbox do |vb|
        vb.name = servers[name] # Name of the VM in VirtualBox
        vb.cpus = servers[cpus] # How many CPUs to allocate to the VM
        vb.memory = servers[ram] # How much memory to allocate to the VM
        vb.customize [modifyvm, :id, --cpuexecutioncap, 25]  # Limit to VM to 25% of available CPU
      end
    end
  end
end
The servers.yaml file contains the configuration information for our VMs. You can copy/paste my servers.yaml below or use the version in the attachments area of this tutorial. Here is the content from my file:
--- - name: solr-dc01 box: bento/centos-7.2 cpus: 2 ram: 2048 ip: 192.168.56.101 port: 10122 - name: solr-dc02 box: bento/centos-7.2 cpus: 2 ram: 2048 ip: 192.168.56.202 port: 20222
The commands.yaml file contains the list of commands that should be run on each VM when they are first provisioned. This allows us to automate configuration tasks that would otherwise be tedious and/or repetitive. You can copy/paste my commands.yaml below or use the version in the attachments area of this tutorial. Here is the content from my file:
- sudo yum -y install net-tools ntp wget java-1.8.0-openjdk java-1.8.0-openjdk-devel lsof - sudo systemctl enable ntpd && sudo systemctl start ntpd - sudo systemctl disable firewalld && sudo systemctl stop firewalld - sudo sed -i --follow-symlinks 's/^SELINUX=.*/SELINUX=disabled/g' /etc/sysconfig/selinux
 Our project directory is accessible to each of our Vagrant VMs via the  /vagrant mount point. This allows us to easily access files and data located in our project directory. Instead of using scp to copy the Apache Solr release file to each of the VMs and creating duplicate files, we'll use a single copy located in our project directory.
cp ~/Downloads/solr-6.2.1.tgz .
 NOTE: This assumes you are on a Mac and your downloads are in the ~/Downloads directory.
Now we are ready to start our 2 virtual machines for the first time. Creating the VMs for the first time and starting them every time after that uses the same command:
vagrant up
Once the process is complete you should have 2 servers running. You can verify by looking at VirtualBox. Notice I have 2 VMs running called solr-dc01 and solr-dc02:
    
You are able to login to each of the VMs via ssh using the vagrant ssh command. You must specify the name of the VM you want to connect to.
vagrant ssh solr-dc01
  Using another terminal window, repeat this process for   solr-dc02.  
The Solr release archive file contains an installation script. This installation script will do the following by default:
NOTE: This assumes that you downloaded Solr 6.2.1
solr user.  On   solr-dc01, run the following command:  
tar xvfz /vagrant/solr-6.2.1.tgz solr-6.2.1/bin/install_solr_service.sh --strip-components=2
  Repeat this process for   solr-dc02  
  This will create a file called install_solr_services.sh in your current directory, which should be the   /home/vagrant.  
Now we can install Solr using the script defaults:
sudo bash ./install_solr_service.sh /vagrant/solr-6.2.1.tgz
The command above is the same as if you had specified the default settings:
sudo bash ./install_solr_service.sh /vagrant/solr-6.2.1.tgz -i /opt -d /var/solr -u solr -s solr -p 8983
After running the command, you should see something similar to this:
id: solr: no such user
Creating new user: solr
Extracting /vagrant/solr-6.2.1.tgz to /opt
Installing symlink /opt/solr -> /opt/solr-6.2.1 ...
Installing /etc/init.d/solr script ...
Installing /etc/default/solr.in.sh ...
Waiting up to 30 seconds to see Solr running on port 8983 [/]
Started Solr server on port 8983 (pid=29168). Happy searching!
Found 1 Solr nodes:
Solr process 29168 running on port 8983
{
  solr_home:/var/solr/data,
  version:6.2.1 43ab70147eb494324a1410f7a9f16a896a59bc6f - shalin - 2016-09-15 05:20:53,
  startTime:2016-10-31T19:46:27.997Z,
  uptime:0 days, 0 hours, 0 minutes, 12 seconds,
  memory:13.4 MB (%2.7) of 490.7 MB}
Service solr installed.
If you run the following command, you can see the Solr process is running:
ps -ef | grep solr solr 28980 1 0 19:49 ? 00:00:11 java -server -Xms512m -Xmx512m -XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 -XX:+CMSScavengeBeforeRemark -XX:PretenureSizeThreshold=64m -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000 -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/solr/logs/solr_gc.log -Djetty.port=8983 -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Duser.timezone=UTC -Djetty.home=/opt/solr/server -Dsolr.solr.home=/var/solr/data -Dsolr.install.dir=/opt/solr -Dlog4j.configuration=file:/var/solr/log4j.properties -Xss256k -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs -jar start.jar --module=http
  Repeat this process for   solr-dc02  
It's more convenient to use the OS services infrastructure to manage running Solr processes than manually using scripts. The installation process creates a service script that starts Solr in single instance mode. To take advantage of CDCR, you must use SolrCloud mode. We need to make some changes to the service script for this to work.
  We'll be using the embedded Zookeeper instance for our tutorial. To do this, we need a zookeeper configuration file in our   /var/solr/data directory. We'll copy the default configuration file from /opt/solr/server/solr/zoo.cfg.  
sudo -u solr cp /opt/solr/server/solr/zoo.cfg /var/solr/data/zoo.cfg
  Now we need the   /etc/init.d/solr service script to run Solr in SolrCloud mode. This is done by adding the -c parameter to the start process. When no other parameters are specified, Solr will start an embedded Zookeeper instance on the Solr port + 1000. In our case, that should be 9983 because our default Solr port is 8983. Because this file is owned by root, we'll need to use sudo.  
exit sudo vi /etc/init.d/solr
Look near the end of the file for the line:
...
case $1 in
  start|stop|restart|status)
    SOLR_CMD=$1
...
  This is the section that defines the Solr command. We want to change the   SOLR_CMD=$1 line to look like this SOLR_CMD=$1 -c. This will tell Solr that it should start in cloud mode.  
  NOTE: In production, you would not use the embedded Zookeeper. You would update the /etc/defaults/solr.in.sh to set the ZK_HOST variable to the production Zookeeper instances. When this variable is set, Solr will not start the embedded Zookeeper.  
So the section of your file should now look like this:
...
case $1 in
  start|stop|restart|status)
    SOLR_CMD=$1 -c
...
Now save the file:
Press the `esc` KEY !wq
Let's stop Solr:
sudo service solr stop
Now we can start Solr using the new script:
sudo service solr start
Once the process is started, we can check the status:
sudo service solr status
Found 1 Solr nodes:
Solr process 29426 running on port 8983
{
  solr_home:/var/solr/data,
  version:6.2.1 43ab70147eb494324a1410f7a9f16a896a59bc6f - shalin - 2016-09-15 05:20:53,
  startTime:2016-10-31T22:16:22.116Z,
  uptime:0 days, 0 hours, 0 minutes, 14 seconds,
  memory:30.2 MB (%6.1) of 490.7 MB,
  cloud:{
    ZooKeeper:localhost:9983,
    liveNodes:1,
    collections:0}}
  As you can see, the process started successfully and there is a single cloud node running using Zookeeper on port   9983.  
  Repeat this process for   solr-dc02.  
  The   solr-dc01 Solr instance will be our source collection for replication. To enable CDCR we need to make a few changes to the solrconfig.xml configuration file. We'll use the data_driven_schema_configs as a base for our configuration. We need to create two different configurations because the source collection has a slightly different configuration than the target collection.  
  On the   solr-dc01 VM, copy the data_driven_schema_configs directory to the vagrant home directory. If you are following along, you should still be the vagrant user.  
cd /home/vagrant cp -r /opt/solr/server/solr/configsets/data_driven_schema_configs .
Edit the solrconfig.xml file:
vi data_driven_schema_configs/conf/solrconifg.xml
  The first thing we are going to do is update the   updateHandler definition; there is only one in the file. Find the section in the configuration file that looks like this:  
<updateHandler class=solr.DirectUpdateHandler2>
  We are going to change the   updateLog portion of the configuration. Remember that we are using vi as the text editor, so edit using the appropriate vi commands. Change this:  
    <updateLog>
      <str name=dir>${solr.ulog.dir:}</str>
      <int name=numVersionBuckets>${solr.ulog.numVersionBuckets:65536}</int>
    </updateLog>
to this:
   <updateLog class=solr.CdcrUpdateLog>
      <str name=dir>${solr.ulog.dir:}</str>
      <int name=numVersionBuckets>${solr.ulog.numVersionBuckets:65536}</int>
    </updateLog>
  Now we need to create a new   requestHandler definition. Find the section in the configuration file that looks like this:  
  <!-- A request handler that returns indented JSON by default -->
  <requestHandler name=/query class=solr.SearchHandler>
    <lst name=defaults>
      <str name=echoParams>explicit</str>
      <str name=wt>json</str>
      <str name=indent>true</str>
    </lst>
  </requestHandler>
  We are going to add our new definition just after the closing requestHandler. Add the following new definition:  
  <!-- A request handler for cross data center replication -->
  <requestHandler name=/cdcr class=solr.CdcrRequestHandler>
    <lst name=replica>
      <str name=zkHost>192.168.56.202:9983</str>
      <str name=source>collection1</str>
      <str name=target>collection1</str>
    </lst>
    <lst name=replicator>
      <str name=threadPoolSize>8</str>
      <str name=schedule>1000</str>
      <str name=batchSize>128</str>
    </lst>
    <lst name=updateLogSynchronizer>
      <str name=schedule>1000</str>
    </lst>
  </requestHandler>
Your updated file should now look like this:
...
  <!-- A request handler that returns indented JSON by default -->
  <requestHandler name=/query class=solr.SearchHandler>
    <lst name=defaults>
      <str name=echoParams>explicit</str>
      <str name=wt>json</str>
      <str name=indent>true</str>
    </lst>
  </requestHandler>
  <!-- A request handler for cross data center replication -->
  <requestHandler name=/cdcr class=solr.CdcrRequestHandler>
    <lst name=replica>
      <str name=zkHost>192.168.56.202:9983</str>
      <str name=source>collection1</str>
      <str name=target>collection1</str>
    </lst>
    <lst name=replicator>
      <str name=threadPoolSize>8</str>
      <str name=schedule>1000</str>
      <str name=batchSize>128</str>
    </lst>
    <lst name=updateLogSynchronizer>
      <str name=schedule>1000</str>
    </lst>
  </requestHandler>
...
  NOTE: The zkHost line should have the ip address and port of the Zookeeper instance of the target collection. Our target collection is on solr-dc02, so this ip and port are pointing to solr-dc02. When we create our collections in Solr, we'll use the name collection1.  
Now save the file:
Press the `esc` KEY !wq
  The   solr-dc02 Solr instance will be our target collection for replication. To enable CDCR we need to make a few changes to the solrconfig.xml configuration file. As above, we'll use the data_driven_schema_configs as a base for our configuration.  
  On   solr-dc02, copy the data_driven_schema_configs directory to the vagrant home directory. If you are following along, you should still be the vagrant user.  
cd /home/vagrant cp -r /opt/solr/server/solr/configsets/data_driven_schema_configs .
Edit the solrconfig.xml file:
vi data_driven_schema_configs/conf/solrconifg.xml
  The first thing we are going to do is update the   updateHandler definition; there is only one in the file. Find the section in the configuration file that looks like this:  
<updateHandler class=solr.DirectUpdateHandler2>
  We are going to change the   updateLog portion of the configuration. Remember that we are using vi as the text editor. Change this:  
    <updateLog>
      <str name=dir>${solr.ulog.dir:}</str>
      <int name=numVersionBuckets>${solr.ulog.numVersionBuckets:65536}</int>
    </updateLog>
to this:
   <updateLog class=solr.CdcrUpdateLog>
      <str name=dir>${solr.ulog.dir:}</str>
      <int name=numVersionBuckets>${solr.ulog.numVersionBuckets:65536}</int>
    </updateLog>
  Now we need to create a new   requestHandler definition. Find the section in the configuration file that looks like this:  
  <!-- A request handler that returns indented JSON by default -->
  <requestHandler name=/query class=solr.SearchHandler>
    <lst name=defaults>
      <str name=echoParams>explicit</str>
      <str name=wt>json</str>
      <str name=indent>true</str>
    </lst>
  </requestHandler>
  We are going to add our new definition just after the closing requestHandler. Add the following new definition:  
  <!-- A request handler for cross data center replication -->
  <requestHandler name=/cdcr class=solr.CdcrRequestHandler>
    <lst name=buffer>
      <str name=defaultState>disabled</str>
    </lst>
  </requestHandler>
  <!-- A request handler for cross data center replication -->
  <requestHandler name=/update class=solr.UpdateRequestHandler>
    <lst name=defaults>
      <str name=update.chain>cdcr-processor-chain</str>
    </lst>
  </requestHandler>
  <updateRequestProcessorChain name=cdcr-processor-chain>
    <processor class=solr.CdcrUpdateProcessorFactory/>
    <processor class=solr.RunUpdateProcessorFactory/>
  </updateRequestProcessorChain>
Your updated file should now look like this:
...
  <!-- A request handler that returns indented JSON by default -->
  <requestHandler name=/query class=solr.SearchHandler>
    <lst name=defaults>
      <str name=echoParams>explicit</str>
      <str name=wt>json</str>
      <str name=indent>true</str>
    </lst>
  </requestHandler>
  <!-- A request handler for cross data center replication -->
  <requestHandler name=/cdcr class=solr.CdcrRequestHandler>
    <lst name=buffer>
      <str name=defaultState>disabled</str>
    </lst>
  </requestHandler>
  <!-- A request handler for cross data center replication -->
  <requestHandler name=/update class=solr.UpdateRequestHandler>
    <lst name=defaults>
      <str name=update.chain>cdcr-processor-chain</str>
    </lst>
  </requestHandler>
  <updateRequestProcessorChain name=cdcr-processor-chain>
    <processor class=solr.CdcrUpdateProcessorFactory/>
    <processor class=solr.RunUpdateProcessorFactory/>
  </updateRequestProcessorChain>
...
Now save the file:
Press the `esc` KEY !wq
You should see how the two configurations are different between the source and target collections.
Now we should be able to create a collection using our update configuration. Because the two configurations are different, make sure you run this command on both the solr-dc01 and solr-dc02 VMs. This is creating the collections in our respective data centers.
/opt/solr/bin/solr create -c collection1 -d ./data_driven_schema_configs
NOTE: We are using the same collection name that has CDCR enabled in the configuration.
You should see something similar to this:
/opt/solr/bin/solr create -c collection1 -d ./data_driven_schema_configs Connecting to ZooKeeper at localhost:9983 ... Uploading /home/vagrant/data_driven_schema_configs/conf for config collection1 to ZooKeeper at localhost:9983 Creating new collection 'collection1' using command: http://localhost:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=1&replicationF... { responseHeader:{ status:0, QTime:3684}, success:{192.168.56.101:8983_solr:{ responseHeader:{ status:0, QTime:2546}, core:collection1_shard1_replica1}}}
  Now we can verify the collection exists in the Solr admin ui via:   http://192.168.56.101:8983/solr/#/~cloud  
You should see something similar to this:
    
  As you can see, there is a single collection named   collection1 which has 1 shard. You can repeat this process on solr-dc02 and see something similar.  
NOTE: Remember that solr-dc01 is 192.168.56.101 and solr-dc02 is 192.168.56.202.
Let's first check the status of replication. Each of these curl commands is interacting with the collection api. You can check the status of replication using the following command:
curl -XPOST 'http://192.168.56.101:8983/solr/collection1/cdcr?action=STATUS'
You should see something similar to this:
curl -XPOST 'http://192.168.56.101:8983/solr/collection1/cdcr?action=STATUS' <?xml version=1.0 encoding=UTF-8?> <response> <lst name=responseHeader><int name=status>0</int><int name=QTime>5</int></lst><lst name=status><str name=process>stopped</str><str name=buffer>enabled</str></lst> </response>
  You should notice the   process is displayed as stopped. We want to start the replication process.  
curl -XPOST 'http://192.168.56.101:8983/solr/collection1/cdcr?action=START'
You should see something similar to this:
curl -XPOST 'http://192.168.56.101:8983/solr/collection1/cdcr?action=START' <?xml version=1.0 encoding=UTF-8?> <response> <lst name=responseHeader><int name=status>0</int><int name=QTime>41</int></lst><lst name=status><str name=process>started</str><str name=buffer>enabled</str></lst> </response>
  You should notice the   process is now started. Now we need to disable the buffer on the target colleciton which will buffer the updates by default.  
curl -XPOST 'http://192.168.56.202:8983/solr/collection1/cdcr?action=DISABLEBUFFER'
You should see something similar to this:
curl -XPOST 'http://192.168.56.202:8983/solr/collection1/cdcr?action=DISABLEBUFFER' <?xml version=1.0 encoding=UTF-8?> <response> <lst name=responseHeader><int name=status>0</int><int name=QTime>7</int></lst><lst name=status><str name=process>started</str><str name=buffer>disabled</str></lst> </response>
  You should notice the   buffer is now disabled.  
Now we will add a couple of sample documents to collection1 in solr-dc01. Run the following command to add 2 sample documents:
curl -XPOST -H 'Content-Type: application/json' 'http://192.168.56.101:8983/solr/collection1/update' --data-binary '{
 add : {
  doc : {
   id : 1,
   text_ws : This is document number one.
  }
 },
 add : {
  doc : {
   id : 2,
   text_ws : This is document number two.
  }
 },
 commit : {}
}'
  You should notice the   commit command in the JSON above. That is because the default solrconfig.xml does not have automatic commits enabled. You should get a response back similar to this:  
{responseHeader:{status:0,QTime:362}}
Let's query collection1 on solr-dc01 to ensure the documents are present. Run the following command:
curl -XGET 'http://192.168.56.101:8983/solr/collection1/select?q=*:*&indent=true'
You should see something similar to this:
curl -XGET 'http://192.168.56.101:8983/solr/collection1/select?q=*:*&indent=true'
<?xml version=1.0 encoding=UTF-8?>
<response>
<lst name=responseHeader>
  <bool name=zkConnected>true</bool>
  <int name=status>0</int>
  <int name=QTime>17</int>
  <lst name=params>
    <str name=q>*:*</str>
    <str name=indent>true</str>
  </lst>
</lst>
<result name=response numFound=2 start=0>
  <doc>
    <str name=id>1</str>
    <str name=text_ws>This is document number one.</str>
    <long name=_version_>1549823582071160832</long></doc>
  <doc>
    <str name=id>2</str>
    <str name=text_ws>This is document number two.</str>
    <long name=_version_>1549823582135123968</long></doc>
</result>
</response>
  Before executing the query on   solr-dc02, we need to commit the changes. As mentioned above, automatic commits are not enabled in the default solrconfig.xml. Run the following command;  
curl -XPOST -H 'Content-Type: application/json' 'http://192.168.56.202:8983/solr/collection1/update' --data-binary '{
 commit : {}
}'
You should see a response similar to this:
{responseHeader:{status:0,QTime:5}}
Now we can run our query:
curl -XGET 'http://192.168.56.202:8983/solr/collection1/select?q=*:*&indent=true'
You should see something similar to this:
curl -XGET 'http://192.168.56.202:8983/solr/collection1/select?q=*:*&indent=true'
<?xml version=1.0 encoding=UTF-8?>
<response>
<lst name=responseHeader>
  <bool name=zkConnected>true</bool>
  <int name=status>0</int>
  <int name=QTime>17</int>
  <lst name=params>
    <str name=q>*:*</str>
    <str name=indent>true</str>
  </lst>
</lst>
<result name=response numFound=2 start=0>
  <doc>
    <str name=id>1</str>
    <str name=text_ws>This is document number one.</str>
    <long name=_version_>1549823582071160832</long></doc>
  <doc>
    <str name=id>2</str>
    <str name=text_ws>This is document number two.</str>
    <long name=_version_>1549823582135123968</long></doc>
</result>
</response>
You should notice that you have 2 documents, which have the same id and text_ws content as you pushed to solr-dc01.
If you followed along with this tutorial, you have successfully set up cross data center replication between two SolrCloud configurations.
Some important points to keep in mind:
For more information, read about Cross Data Center Replication https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462
