Member since
02-09-2016
559
Posts
422
Kudos Received
98
Solutions
11-08-2016
10:48 AM
1 Kudo
Objective
Many people work exclusively from a laptop where storage space is typically limited to 500GB of space or less. Over time, you may find your available storage space has become a regular concern. It's not uncommon to use an external hard drive to augment available storage space.
The current version of Docker for Mac (1.12.x) does not provide a configuration setting which allows users to change the location where the Docker virtual machine image is located. This means the image, which can grow up to 64GB in size by default, is located on your laptop's primary hard drive.
With the HDP 2.5 version of the Hortonworks sandbox available as a native Docker image, you may want to give Docker more room to work with. This tutorial will guide you through the process of moving your Docker virtual machine image to a different location, an external drive in this case. Doing so frees up as much as 64GB of space on your primary laptop hard drive and makes it possible to expand the size of the Docker image file later. This tutorial is the first in a two-part series.
Prerequisites
You should have already completed the following tutorial Installing Docker Version of Sandbox on Mac
You should have an external or secondary hard drive available.
Scope
Mac OS X 10.11.6 (El Capitan)
Docker for Mac 1.12.1
HDP 2.5 Docker Sandbox
Steps
Stop Docker for Mac
Before we can make any changes to the Docker virtual machine image, we need to stop Docker for Mac. There should be a Docker for Mac icon in the menu bar. You should see something similar to this:
You can also check from the command line with ps -ef | grep -i com.docker . You should see something similar to this:
ps -ef | grep -i com.docker
0 123 1 0 8:45AM ?? 0:00.01 /Library/PrivilegedHelperTools/com.docker.vmnetd
502 967 876 0 8:46AM ?? 0:00.08 /Applications/Docker.app/Contents/MacOS/com.docker.osx.hyperkit.linux -watchdog fd:0
502 969 967 0 8:46AM ?? 0:00.04 /Applications/Docker.app/Contents/MacOS/com.docker.osx.hyperkit.linux -watchdog fd:0
502 971 967 0 8:46AM ?? 0:07.96 com.docker.db --url fd:3 --git /Users/myoung/Library/Containers/com.docker.docker/Data/database
502 975 967 0 8:46AM ?? 0:03.40 com.docker.osx.hyperkit.linux
502 977 975 0 8:46AM ?? 0:00.03 /Applications/Docker.app/Contents/MacOS/com.docker.osx.hyperkit.linux
502 12807 967 0 9:17PM ?? 0:00.08 com.docker.osxfs --address fd:3 --connect /Users/myoung/Library/Containers/com.docker.docker/Data/@connect --control fd:4 --volume-control fd:5 --database /Users/myoung/Library/Containers/com.docker.docker/Data/s40
502 12810 967 0 9:17PM ?? 0:00.12 com.docker.slirp --db /Users/myoung/Library/Containers/com.docker.docker/Data/s40 --ethernet fd:3 --port fd:4 --vsock-path /Users/myoung/Library/Containers/com.docker.docker/Data/@connect --max-connections 900
502 12811 967 0 9:17PM ?? 0:00.19 com.docker.driver.amd64-linux -db /Users/myoung/Library/Containers/com.docker.docker/Data/s40 -osxfs-volume /Users/myoung/Library/Containers/com.docker.docker/Data/s30 -slirp /Users/myoung/Library/Containers/com.docker.docker/Data/s50 -vmnet /var/tmp/com.docker.vmnetd.socket -port /Users/myoung/Library/Containers/com.docker.docker/Data/s51 -vsock /Users/myoung/Library/Containers/com.docker.docker/Data -docker /Users/myoung/Library/Containers/com.docker.docker/Data/s60 -addr fd:3 -debug
502 12812 12811 0 9:17PM ?? 0:00.02 /Applications/Docker.app/Contents/MacOS/com.docker.driver.amd64-linux -db /Users/myoung/Library/Containers/com.docker.docker/Data/s40 -osxfs-volume /Users/myoung/Library/Containers/com.docker.docker/Data/s30 -slirp /Users/myoung/Library/Containers/com.docker.docker/Data/s50 -vmnet /var/tmp/com.docker.vmnetd.socket -port /Users/myoung/Library/Containers/com.docker.docker/Data/s51 -vsock /Users/myoung/Library/Containers/com.docker.docker/Data -docker /Users/myoung/Library/Containers/com.docker.docker/Data/s60 -addr fd:3 -debug
502 12814 12811 0 9:17PM ?? 0:16.48 /Applications/Docker.app/Contents/MacOS/com.docker.hyperkit -A -m 12G -c 6 -u -s 0:0,hostbridge -s 31,lpc -s 2:0,virtio-vpnkit,uuid=1f629fed-1ef6-4f34-8fce-753347e3b941,path=/Users/myoung/Library/Containers/com.docker.docker/Data/s50,macfile=/Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/mac.0 -s 3,virtio-blk,file:///Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2,format=qcow -s 4,virtio-9p,path=/Users/myoung/Library/Containers/com.docker.docker/Data/s40,tag=db -s 5,virtio-rnd -s 6,virtio-9p,path=/Users/myoung/Library/Containers/com.docker.docker/Data/s51,tag=port -s 7,virtio-sock,guest_cid=3,path=/Users/myoung/Library/Containers/com.docker.docker/Data,guest_forwards=2376;1525 -l com1,autopty=/Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/tty,log=/Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/console-ring -f kexec,/Applications/Docker.app/Contents/Resources/moby/vmlinuz64,/Applications/Docker.app/Contents/Resources/moby/initrd.img,earlyprintk=serial console=ttyS0 com.docker.driver="com.docker.driver.amd64-linux", com.docker.database="com.docker.driver.amd64-linux" ntp=gateway mobyplatform=mac -F /Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/hypervisor.pid
502 13790 876 0 9:52PM ?? 0:00.01 /Applications/Docker.app/Contents/MacOS/com.docker.frontend {"action":"vmstateevent","args":{"vmstate":"running"}}
502 13791 13790 0 9:52PM ?? 0:00.01 /Applications/Docker.app/Contents/MacOS/com.docker.frontend {"action":"vmstateevent","args":{"vmstate":"running"}}
502 13793 13146 0 9:52PM ttys000 0:00.00 grep -i com.docker
Now we are going to stop Docker for Mac. Before shutting down Docker, make sure all of your containers have been stopped. Using the menu shown above, click on the Quit Docker menu option. This will stop Docker for Mac. You should notice the Docker for Mac icon is no longer visible.
Now let's confirm the Docker processes we saw before are no longer running:
ps -ef | grep -i com.docker
0 123 1 0 8:45AM ?? 0:00.01 /Library/PrivilegedHelperTools/com.docker.vmnetd
502 13815 13146 0 9:54PM ttys000 0:00.00 grep -i com.docker
NOTE: It may take a minute or two before Docker completely shuts down.
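If you would rather script the wait than re-run ps by hand, a small loop like this will block until the hyperkit process exits (just a convenience sketch; pgrep ships with OS X):
while pgrep -f com.docker.hyperkit > /dev/null; do
  sleep 5
done
echo "Docker for Mac has stopped"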
Backup Docker virtual machine image
Before we make any changes to the Docker virtual machine image, we should back it up. This will temporarily use more space on your laptop hard drive. Make sure you have enough room to hold two copies of the data. As mentioned before, the Docker image can be up to 64GB by default. Let's check the current size of our image using du -sh . The Docker image file is located at ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/ by default.
du -sh ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/
64G /Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/
In my case, my image size is 64GB. You need to be sure you have room for 2 copies of the com.docker.driver.amd64-linux directory. Now we'll make a copy of our image:
cd ~/Library/Containers/com.docker.docker/Data/
cp -r com.docker.driver.amd64-linux com.docker.driver.amd64-linux.backup
This copy serves as our backup of the image.
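As a quick sanity check before moving on, you can compare the size of the backup against the original; both should report roughly the same size:
du -sh com.docker.driver.amd64-linux com.docker.driver.amd64-linux.backup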
Copy Docker virtual machine image to external drive
Now we can make a copy of our image on our external hard drive. I have a 1TB SSD mounted at /Volumes/Samsung . I am going to store my Docker virtual machine image in /Volumes/Samsung/Docker/image . You should store the image in a location that makes sense for you.
cp -r com.docker.driver.amd64-linux /Volumes/Samsung/Docker/image/
This process will take a few minutes. It will take longer if you are not using an SSD. Let's confirm the directory now exists on the external hard drive.
ls -la /Volumes/Samsung/Docker/image/
total 0
drwxr-xr-x 3 myoung staff 102 Nov 3 17:08 .
drwxr-xr-x 11 myoung staff 374 Nov 3 17:03 ..
drwxr-xr-x@ 11 myoung staff 374 Nov 7 21:53 com.docker.driver.amd64-linux
You can also check the size:
du -sh /Volumes/Samsung/Docker/image/
64G /Volumes/Samsung/Docker/image/
Create symbolic link for Docker virtual machine image
Now that we have a copy of the Docker image on the external hard drive, we will use a symbolic link to point from the image directory on the laptop hard drive to the image directory on the external hard drive. Before creating the link, we need to remove the current image directory on our laptop hard drive:
rm -rf ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux
Now let's create the symbolic link using the ln -s command. The syntax is ln -s <target> <link_name> , where target is the existing directory on the external drive and link_name is the path Docker expects on the internal drive.
ln -s /Volumes/Samsung/Docker/image/com.docker.driver.amd64-linux ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux
We can confirm the link was created:
ls -la ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux
lrwxr-xr-x 1 myoung staff 59 Nov 3 17:05 /Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux -> /Volumes/Samsung/Docker/image/com.docker.driver.amd64-linux
Restart Docker for Mac
Now we can restart Docker for Mac. This is done by running the application from the Applications folder in the Finder. You should see something similar to this:
Double-click on the Docker application to start it. You should notice the Docker for Mac icon is now back in the main menu bar. You can also check via ps -ef | grep -i com.docker . You should see something similar to this:
ps -ef | grep -i com.docker
0 123 1 0 8:45AM ?? 0:00.01 /Library/PrivilegedHelperTools/com.docker.vmnetd
502 14476 14465 0 10:42PM ?? 0:00.03 /Applications/Docker.app/Contents/MacOS/com.docker.osx.hyperkit.linux -watchdog fd:0
502 14479 14476 0 10:42PM ?? 0:00.01 /Applications/Docker.app/Contents/MacOS/com.docker.osx.hyperkit.linux -watchdog fd:0
502 14480 14476 0 10:42PM ?? 0:00.29 com.docker.db --url fd:3 --git /Users/myoung/Library/Containers/com.docker.docker/Data/database
502 14481 14476 0 10:42PM ?? 0:00.08 com.docker.osxfs --address fd:3 --connect /Users/myoung/Library/Containers/com.docker.docker/Data/@connect --control fd:4 --volume-control fd:5 --database /Users/myoung/Library/Containers/com.docker.docker/Data/s40
502 14482 14476 0 10:42PM ?? 0:00.04 com.docker.slirp --db /Users/myoung/Library/Containers/com.docker.docker/Data/s40 --ethernet fd:3 --port fd:4 --vsock-path /Users/myoung/Library/Containers/com.docker.docker/Data/@connect --max-connections 900
502 14483 14476 0 10:42PM ?? 0:00.05 com.docker.osx.hyperkit.linux
502 14484 14476 0 10:42PM ?? 0:00.08 com.docker.driver.amd64-linux -db /Users/myoung/Library/Containers/com.docker.docker/Data/s40 -osxfs-volume /Users/myoung/Library/Containers/com.docker.docker/Data/s30 -slirp /Users/myoung/Library/Containers/com.docker.docker/Data/s50 -vmnet /var/tmp/com.docker.vmnetd.socket -port /Users/myoung/Library/Containers/com.docker.docker/Data/s51 -vsock /Users/myoung/Library/Containers/com.docker.docker/Data -docker /Users/myoung/Library/Containers/com.docker.docker/Data/s60 -addr fd:3 -debug
502 14485 14483 0 10:42PM ?? 0:00.01 /Applications/Docker.app/Contents/MacOS/com.docker.osx.hyperkit.linux
502 14486 14484 0 10:42PM ?? 0:00.01 /Applications/Docker.app/Contents/MacOS/com.docker.driver.amd64-linux -db /Users/myoung/Library/Containers/com.docker.docker/Data/s40 -osxfs-volume /Users/myoung/Library/Containers/com.docker.docker/Data/s30 -slirp /Users/myoung/Library/Containers/com.docker.docker/Data/s50 -vmnet /var/tmp/com.docker.vmnetd.socket -port /Users/myoung/Library/Containers/com.docker.docker/Data/s51 -vsock /Users/myoung/Library/Containers/com.docker.docker/Data -docker /Users/myoung/Library/Containers/com.docker.docker/Data/s60 -addr fd:3 -debug
502 14488 14484 0 10:42PM ?? 0:07.90 /Applications/Docker.app/Contents/MacOS/com.docker.hyperkit -A -m 12G -c 6 -u -s 0:0,hostbridge -s 31,lpc -s 2:0,virtio-vpnkit,uuid=1f629fed-1ef6-4f34-8fce-753347e3b941,path=/Users/myoung/Library/Containers/com.docker.docker/Data/s50,macfile=/Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/mac.0 -s 3,virtio-blk,file:///Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2,format=qcow -s 4,virtio-9p,path=/Users/myoung/Library/Containers/com.docker.docker/Data/s40,tag=db -s 5,virtio-rnd -s 6,virtio-9p,path=/Users/myoung/Library/Containers/com.docker.docker/Data/s51,tag=port -s 7,virtio-sock,guest_cid=3,path=/Users/myoung/Library/Containers/com.docker.docker/Data,guest_forwards=2376;1525 -l com1,autopty=/Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/tty,log=/Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/console-ring -f kexec,/Applications/Docker.app/Contents/Resources/moby/vmlinuz64,/Applications/Docker.app/Contents/Resources/moby/initrd.img,earlyprintk=serial console=ttyS0 com.docker.driver="com.docker.driver.amd64-linux", com.docker.database="com.docker.driver.amd64-linux" ntp=gateway mobyplatform=mac -F /Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/hypervisor.pid
502 14559 14465 0 10:46PM ?? 0:00.01 /Applications/Docker.app/Contents/MacOS/com.docker.frontend {"action":"vmstateevent","args":{"vmstate":"running"}}
502 14560 14559 0 10:46PM ?? 0:00.01 /Applications/Docker.app/Contents/MacOS/com.docker.frontend {"action":"vmstateevent","args":{"vmstate":"running"}}
502 14562 13146 0 10:46PM ttys000 0:00.00 grep -i com.docker
You should notice the Docker processes are running again. You can also check the timestamp of files in the Docker image directory on the external hard drive:
ls -la /Volumes/Samsung/Docker/image/com.docker.driver.amd64-linux
total 134133536
drwxr-xr-x@ 12 myoung staff 408 Nov 7 22:42 .
drwxr-xr-x 3 myoung staff 102 Nov 3 17:08 ..
-rw-r--r-- 1 myoung staff 68676222976 Nov 7 22:45 Docker.qcow2
-rw-r--r-- 1 myoung staff 65536 Nov 7 22:42 console-ring
-rw-r--r-- 1 myoung staff 5 Nov 7 22:42 hypervisor.pid
-rw-r--r-- 1 myoung staff 0 Aug 24 16:06 lock
drwxr-xr-x 67 myoung staff 2278 Nov 5 22:00 log
-rw-r--r-- 1 myoung staff 17 Nov 7 22:42 mac.0
-rw-r--r-- 1 myoung staff 36 Aug 24 16:06 nic1.uuid
-rw-r--r-- 1 myoung staff 5 Nov 7 22:42 pid
-rw-r--r-- 1 myoung staff 59619 Nov 7 22:42 syslog
lrwxr-xr-x 1 myoung staff 12 Nov 7 22:42 tty -> /dev/ttys001
You should notice the timestamp of the Docker.qcow2 file has been updated which means Docker is now using this location for its image file.
Start a Docker container
You should attempt to start a Docker container to make sure everything is working. You can start the HDP sandbox via docker start sandbox if you've already installed it as described in the prerequisites. If everything works, you can delete the backup.
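A minimal verification might look like this, assuming the sandbox container from the prerequisite tutorial exists:
docker start sandbox
docker ps      # the sandbox container should show a status of Up
docker info    # confirms the daemon is responding from the new image location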
Delete Docker backup image
Now that everything is working using the new location, we can remove our backup.
rm -rf ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux.backup
Review
If you followed along with this tutorial, you have moved your Docker for Mac virtual machine image to an external hard drive, freeing up as much as 64GB of space on your laptop hard drive. Look for part two in the series to learn how to increase the size of your Docker image.
11-08-2016
02:25 AM
@Marcia Hon For both options a and b, did you remove the contents of the /var/lib/docker/ directory where the image file is created? You have to remove those files and then restart the daemons. If you don't, Docker will not increase the size automatically. You can run docker info before and after this process to see how much space is available for Docker. NOTE: You will lose your images and containers when you delete the contents of that directory, so back up any containers or images you want to keep.
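For reference, the sequence I have in mind is roughly the following. This is only a sketch: the service commands depend on your init system, and deleting that directory wipes all local images and containers.
sudo service docker stop         # or: sudo systemctl stop docker
sudo rm -rf /var/lib/docker/*    # WARNING: removes all images and containers
sudo service docker start        # or: sudo systemctl start docker
docker info                      # compare the reported storage space before and after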
11-04-2016
07:24 PM
Good catch. Tutorial has been updated to provide more links.
11-04-2016
02:29 AM
4 Kudos
Objective
In many organizations, "search" is a common requirement for a user-friendly means of accessing data. When people think of "search", they often think of Google. Many organizations use Solr as their enterprise search engine. It is commonly used to power public website search from within the site itself. Organizations will often build custom user interfaces to tailor queries to meet their internal or external end-user needs. In most of these scenarios, users are shielded from the complexity of the Solr query syntax.
Solr has a long list of features and capabilities; you can read more here Apache Solr Features. Solr 6 has a new feature which allows you to submit SQL queries via JDBC. This opens up new ways to interact with Solr. Using Zeppelin with SQL is now possible because of this new feature. This should make you more productive because you can use a language syntax with which you are already familiar: SQL!
This tutorial will guide you through the process of updating the Zeppelin JDBC interpreter configuration to enable submitting SQL queries to Solr via JDBC. We'll use the Hortonworks HDP 2.5 Docker sandbox and Apache Solr 6.2.1.
NOTE: Solr 6 is being deployed as a standalone application within the sandbox. HDP 2.5 ships with Solr 5.5.2 via HDPSearch, which does not include the JDBC SQL functionality.
Prerequisites
You should have already completed the following tutorial Installing Docker Version of Sandbox on Mac
You should have already downloaded Apache Solr 6.2.1: Apache Solr 6.2.1
Scope
Mac OS X 10.11.6 (El Capitan)
Docker for Mac 1.12.1
HDP 2.5 Docker Sandbox
Apache Solr 6.2.1
Steps
Start Sandbox
If you completed the tutorial listed in the prerequisites, then you should be ready to start up your Docker sandbox container.
docker start sandbox
NOTE: If your container is still running from performing the other tutorial, you do not need to start it again.
Once the container is started, you need to login:
ssh -p 2222 root@localhost
Now you can start the services
/etc/init.d/startup_scripts start
NOTE: This process will take several minutes.
Create Solr user in the sandbox
We will be running the Solr process as the solr user. Let's create that user in our sandbox:
useradd -d /home/solr -s /bin/bash -U solr
Copy Solr archive file to sandbox
You should already have the Solr archive file downloaded. We will use scp to copy the file to the sandbox. You should do this in another terminal window as your current window should be logged into the sandbox. From your Mac run the following command:
scp -P 2222 ~/Downloads/solr-6.2.1.tgz root@localhost:/root/
NOTE: The ssh and scp commands use different parameters to specify the port, and it's easy to confuse them. The ssh command uses -p to specify the port. The scp command uses -P to specify the port.
In my case, the Solr file was downloaded to ~/Downloads . Your location may be different.
Extract the Solr archive file
We'll run Solr out of the /opt directory. This makes things a bit cleaner than using the installation script, which places some files in /var .
cd /opt
tar xvfz /root/solr-6.2.1.tgz
Now we need to give the solr user ownership over the directory.
chown -R solr:solr /opt/solr-6.2.1/
Install JDK 8
Solr 6.x requires JDK 8, which is not on the current version of the sandbox. You will need to install it before you can run Solr.
yum install java-1.8.0-openjdk-devel
Start Solr
Now that Solr is installed, we can start up a SolrCloud instance. The Solr start script provides a handy way to start a 2 node SolrCloud cluster. The -e flag tells Solr to start the cloud example. The -noprompt flag tells Solr to use default values.
cd /opt/solr-6.2.1
bin/solr start -e cloud -noprompt
Welcome to the SolrCloud example!
Starting up 2 Solr nodes for your example SolrCloud cluster.
Creating Solr home directory /opt/solr-6.2.1/example/cloud/node1/solr
Cloning /opt/solr-6.2.1/example/cloud/node1 into
/opt/solr-6.2.1/example/cloud/node2
Starting up Solr on port 8983 using command:
bin/solr start -cloud -p 8983 -s "example/cloud/node1/solr"
Waiting up to 30 seconds to see Solr running on port 8983 [\]
Started Solr server on port 8983 (pid=4952). Happy searching!
Starting up Solr on port 7574 using command:
bin/solr start -cloud -p 7574 -s "example/cloud/node2/solr" -z localhost:9983
Waiting up to 30 seconds to see Solr running on port 7574 [|]
Started Solr server on port 7574 (pid=5175). Happy searching!
Connecting to ZooKeeper at localhost:9983 ...
Uploading /opt/solr-6.2.1/server/solr/configsets/data_driven_schema_configs/conf for config gettingstarted to ZooKeeper at localhost:9983
Creating new collection 'gettingstarted' using command:
http://localhost:8983/solr/admin/collections?action=CREATE&name=gettingstarted&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=gettingstarted
{
"responseHeader":{
"status":0,
"QTime":28324},
"success":{
"192.168.56.151:8983_solr":{
"responseHeader":{
"status":0,
"QTime":17801},
"core":"gettingstarted_shard1_replica1"},
"192.168.56.151:7574_solr":{
"responseHeader":{
"status":0,
"QTime":18096},
"core":"gettingstarted_shard1_replica2"}}}
Enabling auto soft-commits with maxTime 3 secs using the Config API
POSTing request to Config API: http://localhost:8983/solr/gettingstarted/config
{"set-property":{"updateHandler.autoSoftCommit.maxTime":"3000"}}
Successfully set-property updateHandler.autoSoftCommit.maxTime to 3000
SolrCloud example running, please visit: http://localhost:8983/solr
As you can see from the output, we have 2 Solr instances. One instance is listening on port 8983 and the other is listening on 7574 . They are using an embedded Zookeeper instance for coordination, and it is listening on port 9983 . If we were going to production, we would use the HDP cluster's Zookeeper instances for more reliability.
Index sample data
Now that our SolrCloud cluster is running, we can index sample data into the cluster. We'll execute our SQL queries against this data. Fortunately, Solr ships with a number of example data sets. For this tutorial, we'll index XML data which contains sample product information.
bin/post -c gettingstarted example/exampledocs/*.xml
This command posts the xml documents in the specified path. The -c option defines which collection to use. The command we used previously to create the SolrCloud cluster automatically created a gettingstarted collection using the data_driven_schema_configs configuration. This configuration is what we call schemaless because the fields are dynamically added to the collection. Without dynamic fields, you have to explicitly define every field you want to have in your collection.
You should see something like this:
bin/post -c gettingstarted example/exampledocs/*.xml
/usr/lib/jvm/java/bin/java -classpath /opt/solr-6.2.1/dist/solr-core-6.2.1.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool example/exampledocs/gb18030-example.xml example/exampledocs/hd.xml example/exampledocs/ipod_other.xml example/exampledocs/ipod_video.xml example/exampledocs/manufacturers.xml example/exampledocs/mem.xml example/exampledocs/money.xml example/exampledocs/monitor2.xml example/exampledocs/monitor.xml example/exampledocs/mp500.xml example/exampledocs/sd500.xml example/exampledocs/solr.xml example/exampledocs/utf8-example.xml example/exampledocs/vidcard.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update.
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file gb18030-example.xml (application/xml) to [base]
POSTing file hd.xml (application/xml) to [base]
POSTing file ipod_other.xml (application/xml) to [base]
POSTing file ipod_video.xml (application/xml) to [base]
POSTing file manufacturers.xml (application/xml) to [base]
POSTing file mem.xml (application/xml) to [base]
POSTing file money.xml (application/xml) to [base]
POSTing file monitor2.xml (application/xml) to [base]
POSTing file monitor.xml (application/xml) to [base]
POSTing file mp500.xml (application/xml) to [base]
POSTing file sd500.xml (application/xml) to [base]
POSTing file solr.xml (application/xml) to [base]
POSTing file utf8-example.xml (application/xml) to [base]
POSTing file vidcard.xml (application/xml) to [base]
14 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update.
Time spent: 0:00:02.379
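As an aside on the schemaless point above: if you would rather define a field explicitly instead of relying on dynamic fields, the Schema API can do that. The following is only an illustrative sketch with a made-up field name:
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field": {
    "name": "product_code",
    "type": "string",
    "stored": true
  }
}' http://localhost:8983/solr/gettingstarted/schema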
Query Solr data
Now we can use curl to run a test query against Solr. The following command will query the gettingstarted collection for all documents. It also returns the results as JSON instead of the default XML.
curl -XGET 'http://localhost:8983/solr/gettingstarted/select?q=*:*&wt=json&indent=true'
You should see something like this:
curl -XGET 'http://localhost:8983/solr/gettingstarted/select?q=*:*&wt=json&indent=true'
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":11,
"params":{
"q":"*:*",
"indent":"true",
"wt":"json"}},
"response":{"numFound":32,"start":0,"maxScore":1.0,"docs":[
{
"id":"GB18030TEST",
"name":["Test with some GB18030 encoded characters"],
"features":["No accents here",
"这是一个功能",
"This is a feature (translated)",
"这份文件是很有光泽",
"This document is very shiny (translated)"],
"price":[0.0],
"inStock":[true],
"_version_":1550023359021973504},
{
"id":"IW-02",
"name":["iPod & iPod Mini USB 2.0 Cable"],
"manu":["Belkin"],
"manu_id_s":"belkin",
"cat":["electronics",
"connector"],
"features":["car power adapter for iPod, white"],
"weight":[2.0],
"price":[11.5],
"popularity":[1],
"inStock":[false],
"store":["37.7752,-122.4232"],
"manufacturedate_dt":"2006-02-14T23:55:59Z",
"_version_":1550023359918505984},
{
"id":"MA147LL/A",
"name":["Apple 60 GB iPod with Video Playback Black"],
"manu":["Apple Computer Inc."],
"manu_id_s":"apple",
"cat":["electronics",
"music"],
"features":["iTunes, Podcasts, Audiobooks",
"Stores up to 15,000 songs, 25,000 photos, or 150 hours of video",
"2.5-inch, 320x240 color TFT LCD display with LED backlight",
"Up to 20 hours of battery life",
"Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video",
"Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication"],
"includes":["earbud headphones, USB cable"],
"weight":[5.5],
"price":[399.0],
"popularity":[10],
"inStock":[true],
"store":["37.7752,-100.0232"],
"manufacturedate_dt":"2005-10-12T08:00:00Z",
"_version_":1550023360204767232},
{
"id":"adata",
"compName_s":"A-Data Technology",
"address_s":"46221 Landing Parkway Fremont, CA 94538",
"_version_":1550023360573865984},
{
"id":"asus",
"compName_s":"ASUS Computer",
"address_s":"800 Corporate Way Fremont, CA 94539",
"_version_":1550023360584351744},
{
"id":"belkin",
"compName_s":"Belkin",
"address_s":"12045 E. Waterfront Drive Playa Vista, CA 90094",
"_version_":1550023360586448896},
{
"id":"maxtor",
"compName_s":"Maxtor Corporation",
"address_s":"920 Disc Drive Scotts Valley, CA 95066",
"_version_":1550023360587497472},
{
"id":"TWINX2048-3200PRO",
"name":["CORSAIR XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) Dual Channel Kit System Memory - Retail"],
"manu":["Corsair Microsystems Inc."],
"manu_id_s":"corsair",
"cat":["electronics",
"memory"],
"features":["CAS latency 2, 2-3-3-6 timing, 2.75v, unbuffered, heat-spreader"],
"price":[185.0],
"popularity":[5],
"inStock":[true],
"store":["37.7752,-122.4232"],
"manufacturedate_dt":"2006-02-13T15:26:37Z",
"payloads":["electronics|6.0 memory|3.0"],
"_version_":1550023360602177536},
{
"id":"VS1GB400C3",
"name":["CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail"],
"manu":["Corsair Microsystems Inc."],
"manu_id_s":"corsair",
"cat":["electronics",
"memory"],
"price":[74.99],
"popularity":[7],
"inStock":[true],
"store":["37.7752,-100.0232"],
"manufacturedate_dt":"2006-02-13T15:26:37Z",
"payloads":["electronics|4.0 memory|2.0"],
"_version_":1550023360647266304},
{
"id":"VDBDB1A16",
"name":["A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM"],
"manu":["A-DATA Technology Inc."],
"manu_id_s":"corsair",
"cat":["electronics",
"memory"],
"features":["CAS latency 3, 2.7v"],
"popularity":[0],
"inStock":[true],
"store":["45.18414,-93.88141"],
"manufacturedate_dt":"2006-02-13T15:26:37Z",
"payloads":["electronics|0.9 memory|0.1"],
"_version_":1550023360648314880}]
}}
By default Solr will return the top 10 documents. If you look at the top of the results, you will notice there are 32 documents in our collection.
...
"response":{"numFound":32,"start":0,"maxScore":1.0,"docs":[
...
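If you want to see more than the default 10 documents, you can ask for additional rows explicitly. For example, this returns all 32 documents from our collection:
curl -XGET 'http://localhost:8983/solr/gettingstarted/select?q=*:*&rows=32&wt=json&indent=true'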
Modify Zeppelin JDBC interpreter
Now we need to modify the existing JDBC interpreter in Zeppelin. By default, this interpreter will work with Hive, Postgres and Phoenix. We will be adding Solr to the configuration.
Open the Zeppelin UI. You can either use the link in Ambari or directly via http://localhost:9995 . You should see something like this:
Click on the user menu in the upper right. You are logged into Zeppelin as anonymous . You should see a menu like this:
Click on the Interpreter link. You should see something like this:
You should see the jdbc interpreter near the top of the list. If you don't, you can either scroll down or use the built-in search feature at the top of the page. Click on the edit button for the jdbc interpreter. You will notice the screen changes to allow you to add new properties or modify existing ones. You should see something like this:
Scroll down until you see the empty entry line. You should see something like this:
We need to add 3 properties/values here.
solr.url jdbc:solr://localhost:9983?collection=gettingstarted
solr.user solr
solr.driver org.apache.solr.client.solrj.io.sql.DriverImpl
Why are we using port 9983 ? That is because we are in SolrCloud mode. We are pointing to the Zookeeper instance. If one of the nodes goes down, Zookeeper will know and direct us to a node that is working.
Add each of these properties and click the + button after each entry. You should now have 3 new properties in your list:
Now we need to add an artifact to the Dependencies section. It's just below the properties. We are going to add the following:
org.apache.solr:solr-solrj:6.2.1
Click the + button. You should see something like this:
Now click the blue Save button to save the changes.
Create a new notebook
Now that we have our JDBC interpreter updated, we are going to create a new notebook. Click the Notebook drop down menu in the upper left. You should see something like this:
Click the + Create a new note link. You should see something like this:
Give the notebook the name Solr JDBC , then click the Create Note button.
You should see something like this:
We can query Solr using a jdbc prefix like %jdbc(solr) . The prefix refers to the property-name prefix we used in the JDBC interpreter settings. If you recall, there were properties like:
solr.url
phoenix.url
hive.url
psql.url
Our prefix is solr . Create the following query as the first note:
%jdbc(solr)
select name, price, inStock from gettingstarted
Now click the run arrow icon. This will run the query against Solr and return results if our configuration is correct. You should see something like this:
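If the paragraph does not return results, it can help to rule out Zeppelin by posting the same statement straight to Solr's /sql handler, which is what backs the JDBC driver. This is a hedged example; adjust the host and port if your setup differs:
curl --data-urlencode 'stmt=select name, price, inStock from gettingstarted limit 5' http://localhost:8983/solr/gettingstarted/sql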
Now add another note below our first one with the following query:
%jdbc(solr)
select name, price, inStock from gettingstarted where inStock = false
You should see something like this:
And finally add one more note below our second one with the following query:
%jdbc(solr)
select price, count(*) from gettingstarted group by price order by price desc
You should see something like this:
As you can see, it is easy to run simple queries and more complex aggregations using pure SQL. For comparison, here is the native Solr query that does the same thing as our second note:
curl -XGET 'http://localhost:8983/solr/gettingstarted/select?fl=price,name,inStock&indent=on&q=inStock:true&wt=json'
If you ran this command in the terminal, you should see something like this: curl -XGET 'http://localhost:8983/solr/gettingstarted/select?fl=price,name,inStock&indent=on&q=inStock:true&wt=json'
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":16,
"params":{
"q":"inStock:true",
"indent":"on",
"fl":"price,name,inStock",
"wt":"json"}},
"response":{"numFound":17,"start":0,"maxScore":0.2578291,"docs":[
{
"name":["Test with some GB18030 encoded characters"],
"price":[0.0],
"inStock":[true]},
{
"name":["Apple 60 GB iPod with Video Playback Black"],
"price":[399.0],
"inStock":[true]},
{
"name":["CORSAIR XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) Dual Channel Kit System Memory - Retail"],
"price":[185.0],
"inStock":[true]},
{
"name":["CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail"],
"price":[74.99],
"inStock":[true]},
{
"name":["A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM"],
"inStock":[true]},
{
"name":["One Dollar"],
"inStock":[true]},
{
"name":["One British Pound"],
"inStock":[true]},
{
"name":["Dell Widescreen UltraSharp 3007WFP"],
"price":[2199.0],
"inStock":[true]},
{
"name":["Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133"],
"price":[92.0],
"inStock":[true]},
{
"name":["Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300"],
"price":[350.0],
"inStock":[true]}]
}}
Now here is the query for the aggregations:
curl -XGET 'http://localhost:8983/solr/gettingstarted/select?facet.field=price&facet=on&fl=price&indent=on&q=*:*&wt=json'
Which do you find easier to use? My guess is the SQL syntax. 😉
Review
If you followed along with this tutorial, you installed Solr and ran it in SolrCloud mode, indexed some sample XML documents, updated the Zeppelin interpreter configuration to support Solr JDBC queries, created a notebook, and ran a few queries against Solr using SQL. Finally, you saw the comparatively more complex native Solr query syntax. You can read more here:
Solr SQL: https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface Zeppelin + Solr JDBC: https://cwiki.apache.org/confluence/display/solr/Solr+JDBC+-+Apache+Zeppelin
11-03-2016
09:10 PM
Awesome! Thank you for sharing the steps you followed.
11-03-2016
07:58 PM
1 Kudo
@Shankar P Running out of space for Docker seems to be a common problem and there doesn't seem to be a single good answer on how to solve it. I created a VirtualBox CentOS 7 VM via Vagrant. This is a 40GB disk image. I then increased the disk size to 100GB via VirtualBox tools. Even with 100GB, I get a similar error when I try to import the sandbox. I haven't yet tried to use the -g option. The problem is that Docker still uses a virtual machine behind the scenes to run containers. The default storage size for that VM appears to be 20GB. I've found a number of threads from people wanting to see that increased to 100 or 200GB, which seems reasonable to me. Unfortunately, I don't think that has been changed/released. Having said all of that, Amazon Linux uses upstart as the init system. Have you tried making your changes to /etc/default/docker instead of /etc/sysconfig/docker? This thread details some of the hassles people have gone through to figure out which configuration file to use: https://github.com/docker/docker/issues/9889
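For reference, the kind of daemon configuration change I have in mind would look roughly like this. This is only a sketch: the file name and variable name (OPTIONS vs. DOCKER_OPTS) depend on the distro and Docker version, and /data/docker is just an example path on a larger volume.
# /etc/sysconfig/docker on RHEL/CentOS-style systems, /etc/default/docker on upstart-based systems such as Amazon Linux
OPTIONS="-g /data/docker --storage-opt dm.basesize=100G"
# Restart the daemon and compare what docker reports before and after
sudo service docker restart
docker info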
11-02-2016
06:11 PM
@sai d You must have VT-x features enabled within your computer BIOS. This is a common requirement for most Virtual Machines these days. What kind of computer are you using? Have you enabled VT-x?
11-01-2016
08:55 PM
6 Kudos
Objective
Cross Data Center Replication, commonly abbreviated as CDCR, is a new feature found in SolrCloud 6.x. This feature enables Solr to replicate data from one source collection to one or more target collections distributed between data centers. The current version provides an active-passive disaster recovery solution for Solr. Data updates, which include adds, updates, and deletes, are copied from the source collection to the target collection. This means the target collection should not be sent data updates outside of the CDCR functionality. Prior to SolrCloud 6.x you had to manually design a strategy for replication across data centers.
This tutorial will guide you through the process of enabling CDCR between two SolrCloud clusters, each with 1 server, in a Vagrant + VirtualBox environment.
NOTE: Solr 6 is being deployed as a standalone application. HDP 2.5 provides support for Solr 5.5.2 via HDPSearch, which does not include CDCR functionality.
Prerequisites
You should have already installed the following:
VirtualBox 5.1.6 (VirtualBox)
Vagrant 1.8.6 (Vagrant)
Vagrant plugin vagrant-vbguest 0.13.x (vagrant-vbguest)
Vagrant plugin vagrant-hostmanager 1.8.5 (vagrant-hostmanager)
You should have already downloaded the Apache Solr 6.2.1 release (Apache Solr 6.2.1)
Scope
This tutorial was tested using the following environment and components:
Mac OS X 10.11.6 (El Capitan)
VirtualBox 5.1.6 (tutorial should work with any newer version)
Vagrant 1.8.6
vagrant-vbguest plugin 0.13.0
vagrant-hostmanager plugin 1.8.5
Apache Solr 6.2.1
Steps
Create Vagrant project directory
I like to create project directories. My Vagrant work goes under ~/Vagrant/<project> and my Docker work goes under ~/Docker/<project> . This allows me to clearly identify which technology is associated with each project and allows me to use various helper scripts to automate processes. So let's create a project directory for this tutorial.
mkdir -p ~/Vagrant/solrcloud-cdcr-tutorial && cd ~/Vagrant/solrcloud-cdcr-tutorial
Create Vagrantfile
The Vagrantfile tells Vagrant how to configure your virtual machines. You can copy/paste my Vagrantfile below or use the version in the attachments area of this tutorial. Here is the content from my file:
# -*- mode: ruby -*-
# vi: set ft=ruby :
# Using yaml to load external configuration files
require 'yaml'
Vagrant.configure(2) do |config|
  # Using the hostmanager vagrant plugin to update the host files
  config.hostmanager.enabled = true
  config.hostmanager.manage_host = true
  config.hostmanager.manage_guest = true
  config.hostmanager.ignore_private_ip = false
  # Loading in the list of commands that should be run when the VM is provisioned.
  commands = YAML.load_file('commands.yaml')
  commands.each do |command|
    config.vm.provision :shell, inline: command
  end
  # Loading in the VM configuration information
  servers = YAML.load_file('servers.yaml')
  servers.each do |server|
    config.vm.define server['name'] do |srv|
      srv.vm.box = server['box']            # Specify the name of the Vagrant box file to use
      srv.vm.hostname = server['name']      # Set the hostname of the VM
      srv.vm.network 'private_network', ip: server['ip'], :adapter => 2    # Add a second adapter with a specified IP
      srv.vm.network :forwarded_port, guest: 22, host: server['port']      # Add a port forwarding rule
      srv.vm.provision :shell, inline: "sed -i'' '/^127.0.0.1\t#{srv.vm.hostname}\t#{srv.vm.hostname}$/d' /etc/hosts"    # Remove the extraneous first entry in /etc/hosts
      srv.vm.provider :virtualbox do |vb|
        vb.name = server['name']            # Name of the VM in VirtualBox
        vb.cpus = server['cpus']            # How many CPUs to allocate to the VM
        vb.memory = server['ram']           # How much memory to allocate to the VM
        vb.customize ['modifyvm', :id, '--cpuexecutioncap', '25']    # Limit the VM to 25% of available CPU
      end
    end
  end
end
Create a servers.yaml file
The servers.yaml file contains the configuration information for our VMs. You can copy/paste my servers.yaml below or use the version in the attachments area of this tutorial. Here is the content from my file:
---
- name: solr-dc01
box: bento/centos-7.2
cpus: 2
ram: 2048
ip: 192.168.56.101
port: 10122
- name: solr-dc02
box: bento/centos-7.2
cpus: 2
ram: 2048
ip: 192.168.56.202
port: 20222
Create commands.yaml file
The commands.yaml file contains the list of commands that should be run on each VM when they are first provisioned. This allows us to automate configuration tasks that would otherwise be tedious and/or repetitive. You can copy/paste my commands.yaml below or use the version in the attachments area of this tutorial. Here is the content from my file:
- sudo yum -y install net-tools ntp wget java-1.8.0-openjdk java-1.8.0-openjdk-devel lsof
- sudo systemctl enable ntpd && sudo systemctl start ntpd
- sudo systemctl disable firewalld && sudo systemctl stop firewalld
- sudo sed -i --follow-symlinks 's/^SELINUX=.*/SELINUX=disabled/g' /etc/sysconfig/selinux
Copy Solr release file to our Vagrant project directory
Our project directory is accessible to each of our Vagrant VMs via the /vagrant mount point. This allows us to easily access files and data located in our project directory. Instead of using scp to copy the Apache Solr release file to each of the VMs and creating duplicate files, we'll use a single copy located in our project directory.
cp ~/Downloads/solr-6.2.1.tgz .
NOTE: This assumes you are on a Mac and your downloads are in the ~/Downloads directory.
Start virtual machines
Now we are ready to start our 2 virtual machines for the first time. Creating the VMs for the first time and starting them every time after that uses the same command:
vagrant up
Once the process is complete you should have 2 servers running. You can verify by looking at VirtualBox. Notice I have 2 VMs running called solr-dc01 and solr-dc02:
Connect to each virtual machine
You are able to log in to each of the VMs via ssh using the vagrant ssh command. You must specify the name of the VM you want to connect to.
vagrant ssh solr-dc01
Using another terminal window, repeat this process for solr-dc02 .
Extract Solr install scripts
The Solr release archive file contains an installation script. This installation script will do the following by default (NOTE: this assumes that you downloaded Solr 6.2.1):
Install Solr under /opt/solr-6.2.1
Create a symbolic link between /opt/solr and /opt/solr-6.2.1
Create a solr user
Store live data such as indexes and logs in /var/solr
On solr-dc01 , run the following command:
tar xvfz /vagrant/solr-6.2.1.tgz solr-6.2.1/bin/install_solr_service.sh --strip-components=2
Repeat this process for solr-dc02 . This will create a file called install_solr_service.sh in your current directory, which should be /home/vagrant .
Install Apache Solr
Now we can install Solr using the script defaults:
sudo bash ./install_solr_service.sh /vagrant/solr-6.2.1.tgz
The command above is the same as if you had specified the default settings:
sudo bash ./install_solr_service.sh /vagrant/solr-6.2.1.tgz -i /opt -d /var/solr -u solr -s solr -p 8983
After running the command, you should see something similar to this:
id: solr: no such user
Creating new user: solr
Extracting /vagrant/solr-6.2.1.tgz to /opt
Installing symlink /opt/solr -> /opt/solr-6.2.1 ...
Installing /etc/init.d/solr script ...
Installing /etc/default/solr.in.sh ...
Waiting up to 30 seconds to see Solr running on port 8983 [/]
Started Solr server on port 8983 (pid=29168). Happy searching!
Found 1 Solr nodes:
Solr process 29168 running on port 8983
{
  "solr_home":"/var/solr/data",
  "version":"6.2.1 43ab70147eb494324a1410f7a9f16a896a59bc6f - shalin - 2016-09-15 05:20:53",
  "startTime":"2016-10-31T19:46:27.997Z",
  "uptime":"0 days, 0 hours, 0 minutes, 12 seconds",
  "memory":"13.4 MB (%2.7) of 490.7 MB"}
Service solr installed.
If you run the following command, you can see the Solr process is running:
ps -ef | grep solr
solr 28980 1 0 19:49 ? 00:00:11 java -server -Xms512m -Xmx512m -XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 -XX:+CMSScavengeBeforeRemark -XX:PretenureSizeThreshold=64m -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000 -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/solr/logs/solr_gc.log -Djetty.port=8983 -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Duser.timezone=UTC -Djetty.home=/opt/solr/server -Dsolr.solr.home=/var/solr/data -Dsolr.install.dir=/opt/solr -Dlog4j.configuration=file:/var/solr/log4j.properties -Xss256k -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs -jar start.jar --module=http
Repeat this process for solr-dc02 .
Modify Solr service
It's more convenient to use the OS services infrastructure to manage running Solr processes than manually using scripts. The installation process creates a service script that starts Solr in single instance mode. To take advantage of CDCR, you must use SolrCloud mode. We need to make some changes to the service script for this to work. We'll be using the embedded Zookeeper instance for our tutorial. To do this, we need a Zookeeper configuration file in our /var/solr/data directory. We'll copy the default configuration file from /opt/solr/server/solr/zoo.cfg .
sudo -u solr cp /opt/solr/server/solr/zoo.cfg /var/solr/data/zoo.cfg
Now we need the /etc/init.d/solr service script to run Solr in SolrCloud mode. This is done by adding the -c parameter to the start process. When no other parameters are specified, Solr will start an embedded Zookeeper instance on the Solr port + 1000. In our case, that should be 9983 because our default Solr port is 8983 . Because this file is owned by root, we'll need to use sudo.
exit
sudo vi /etc/init.d/solr
Look near the end of the file for the line:
...
case "$1" in
  start|stop|restart|status)
    SOLR_CMD="$1"
...
This is the section that defines the Solr command. We want to change the SOLR_CMD="$1" line to look like this: SOLR_CMD="$1 -c" . This will tell Solr that it should start in cloud mode. NOTE: In production, you would not use the embedded Zookeeper. You would update /etc/default/solr.in.sh to set the ZK_HOST variable to the production Zookeeper instances. When this variable is set, Solr will not start the embedded Zookeeper. So the section of your file should now look like this:
...
case "$1" in
  start|stop|restart|status)
    SOLR_CMD="$1 -c"
...
Now save the file: press the Esc key, then type :wq and press Enter.
Let's stop Solr:
sudo service solr stop
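Before starting Solr again, you can confirm the edit took effect with a quick grep (just a sanity check):
grep -n 'SOLR_CMD=' /etc/init.d/solr
# the start|stop|restart|status case should now show: SOLR_CMD="$1 -c"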
Now we can start Solr using the new script: sudo service solr start
Once the process is started, we can check the status:
sudo service solr status
Found 1 Solr nodes:
Solr process 29426 running on port 8983
{
  "solr_home":"/var/solr/data",
  "version":"6.2.1 43ab70147eb494324a1410f7a9f16a896a59bc6f - shalin - 2016-09-15 05:20:53",
  "startTime":"2016-10-31T22:16:22.116Z",
  "uptime":"0 days, 0 hours, 0 minutes, 14 seconds",
  "memory":"30.2 MB (%6.1) of 490.7 MB",
  "cloud":{
    "ZooKeeper":"localhost:9983",
    "liveNodes":"1",
    "collections":"0"}}
As you can see, the process started successfully and there is a single cloud node running using Zookeeper on port 9983 . Repeat this process for solr-dc02 .
Create Solr dc01 configuration
The solr-dc01 Solr instance will be our source collection for replication. To enable CDCR we need to make a few changes to the solrconfig.xml configuration file. We'll use the data_driven_schema_configs as a base for our configuration. We need to create two different configurations because the source collection has a slightly different configuration than the target collection. On the solr-dc01 VM, copy the data_driven_schema_configs directory to the vagrant home directory. If you are following along, you should still be the vagrant user.
cd /home/vagrant
cp -r /opt/solr/server/solr/configsets/data_driven_schema_configs .
Edit the solrconfig.xml file:
vi data_driven_schema_configs/conf/solrconfig.xml
The first thing we are going to do is update the updateHandler definition; there is only one in the file. Find the section in the configuration file that looks like this:
<updateHandler class="solr.DirectUpdateHandler2">
We are going to change the updateLog portion of the configuration. Remember that we are using vi as the text editor, so edit using the appropriate vi commands. Change this:
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
</updateLog>
to this:
<updateLog class="solr.CdcrUpdateLog">
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
</updateLog>
Now we need to create a new requestHandler definition. Find the section in the configuration file that looks like this:
<!-- A request handler that returns indented JSON by default -->
<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
  </lst>
</requestHandler>
We are going to add our new definition just after the closing requestHandler . Add the following new definition:
<!-- A request handler for cross data center replication -->
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <str name="zkHost">192.168.56.202:9983</str>
    <str name="source">collection1</str>
    <str name="target">collection1</str>
  </lst>
  <lst name="replicator">
    <str name="threadPoolSize">8</str>
    <str name="schedule">1000</str>
    <str name="batchSize">128</str>
  </lst>
  <lst name="updateLogSynchronizer">
    <str name="schedule">1000</str>
  </lst>
</requestHandler>
Your updated file should now look like this:
...
<!-- A request handler that returns indented JSON by default -->
<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
  </lst>
</requestHandler>
<!-- A request handler for cross data center replication -->
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <str name="zkHost">192.168.56.202:9983</str>
    <str name="source">collection1</str>
    <str name="target">collection1</str>
  </lst>
  <lst name="replicator">
    <str name="threadPoolSize">8</str>
    <str name="schedule">1000</str>
    <str name="batchSize">128</str>
  </lst>
  <lst name="updateLogSynchronizer">
    <str name="schedule">1000</str>
  </lst>
</requestHandler>
...
NOTE: The zkHost line should have the IP address and port of the Zookeeper instance of the target collection. Our target collection is on solr-dc02 , so this IP and port point to solr-dc02. When we create our collections in Solr, we'll use the name collection1 .
Now save the file: press the Esc key, then type :wq and press Enter.
Create Solr dc02 configuration
The solr-dc02 Solr instance will be our target collection for replication. To enable CDCR we need to make a few changes to the solrconfig.xml configuration file. As above, we'll use the data_driven_schema_configs as a base for our configuration. On solr-dc02 , copy the data_driven_schema_configs directory to the vagrant home directory. If you are following along, you should still be the vagrant user.
cd /home/vagrant
cp -r /opt/solr/server/solr/configsets/data_driven_schema_configs .
Edit the solrconfig.xml file:
vi data_driven_schema_configs/conf/solrconfig.xml
The first thing we are going to do is update the updateHandler definition; there is only one in the file. Find the section in the configuration file that looks like this:
<updateHandler class="solr.DirectUpdateHandler2">
We are going to change the updateLog portion of the configuration. Remember that we are using vi as the text editor. Change this:
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
</updateLog>
to this:
<updateLog class="solr.CdcrUpdateLog">
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
</updateLog>
Now we need to create a new requestHandler definition. Find the section in the configuration file that looks like this:
<!-- A request handler that returns indented JSON by default -->
<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
  </lst>
</requestHandler>
We are going to add our new definition just after the closing requestHandler . Add the following new definition:
<!-- A request handler for cross data center replication -->
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="buffer">
    <str name="defaultState">disabled</str>
  </lst>
</requestHandler>
<!-- A request handler for cross data center replication -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">cdcr-processor-chain</str>
  </lst>
</requestHandler>
<updateRequestProcessorChain name="cdcr-processor-chain">
  <processor class="solr.CdcrUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
Your updated file should now look like this:
...
<!-- A request handler that returns indented JSON by default -->
<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
  </lst>
</requestHandler>
<!-- A request handler for cross data center replication -->
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="buffer">
    <str name="defaultState">disabled</str>
  </lst>
</requestHandler>
<!-- A request handler for cross data center replication -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">cdcr-processor-chain</str>
  </lst>
</requestHandler>
<updateRequestProcessorChain name="cdcr-processor-chain">
  <processor class="solr.CdcrUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
...
Now save the file: press the Esc key, then type :wq and press Enter.
You should see how the two configurations differ between the source and target collections.
Create Solr collection on solr-dc01 and solr-dc02
Now we should be able to create a collection using our updated configuration. Because the two configurations are different, make sure you run this command on both the solr-dc01 and solr-dc02 VMs. This creates the collections in our respective data centers.
/opt/solr/bin/solr create -c collection1 -d ./data_driven_schema_configs
NOTE: We are using the collection name ( collection1 ) that the CDCR configuration references. You should see something similar to this:
/opt/solr/bin/solr create -c collection1 -d ./data_driven_schema_configs
Connecting to ZooKeeper at localhost:9983 ...
Uploading /home/vagrant/data_driven_schema_configs/conf for config collection1 to ZooKeeper at localhost:9983
Creating new collection 'collection1' using command:
http://localhost:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=collection1
{
  "responseHeader":{
    "status":0,
    "QTime":3684},
  "success":{"192.168.56.101:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":2546},
      "core":"collection1_shard1_replica1"}}}
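You can also confirm the collection from the command line with the Collections API before opening the admin UI shown next (a hedged alternative check):
curl 'http://192.168.56.101:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true'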
Now we can verify the collection exists in the Solr admin UI via: http://192.168.56.101:8983/solr/#/~cloud
You should see something similar to this:
As you can see, there is a single collection named collection1 which has 1 shard. You can repeat this process on solr-dc02 and see something similar. NOTE: Remember that solr-dc01 is 192.168.56.101 and solr-dc02 is 192.168.56.202.
Turn on replication
Let's first check the status of replication. Each of these curl commands interacts with the CDCR API. You can check the status of replication using the following command:
curl -XPOST 'http://192.168.56.101:8983/solr/collection1/cdcr?action=STATUS'
You should see something similar to this:
curl -XPOST 'http://192.168.56.101:8983/solr/collection1/cdcr?action=STATUS'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">5</int></lst><lst name="status"><str name="process">stopped</str><str name="buffer">enabled</str></lst>
</response>
You should notice the process is displayed as stopped . We want to start the replication process.
curl -XPOST 'http://192.168.56.101:8983/solr/collection1/cdcr?action=START'
You should see something similar to this:
curl -XPOST 'http://192.168.56.101:8983/solr/collection1/cdcr?action=START'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">41</int></lst><lst name="status"><str name="process">started</str><str name="buffer">enabled</str></lst>
</response>
You should notice the process is now started . Now we need to disable the buffer on the target collection, which buffers the updates by default.
curl -XPOST 'http://192.168.56.202:8983/solr/collection1/cdcr?action=DISABLEBUFFER'
You should see something similar to this:
curl -XPOST 'http://192.168.56.202:8983/solr/collection1/cdcr?action=DISABLEBUFFER'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">7</int></lst><lst name="status"><str name="process">started</str><str name="buffer">disabled</str></lst>
</response>
You should notice the buffer is now disabled .
Add documents to source Solr collection in solr-dc01
Now we will add a couple of sample documents to collection1 in solr-dc01. Run the following command to add 2 sample documents:
curl -XPOST -H 'Content-Type: application/json' 'http://192.168.56.101:8983/solr/collection1/update' --data-binary '{
  "add" : {
    "doc" : {
      "id" : "1",
      "text_ws" : "This is document number one."
    }
  },
  "add" : {
    "doc" : {
      "id" : "2",
      "text_ws" : "This is document number two."
    }
  },
  "commit" : {}
}'
You should notice the commit command in the JSON above. That is because the default solrconfig.xml does not have automatic commits enabled. You should get a response back similar to this:
{"responseHeader":{"status":0,"QTime":362}}
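At this point you can also ask the source cluster how replication is progressing. The CDCR API exposes a QUEUES action that reports how many updates are waiting to be forwarded to the target (a hedged example; the queue size should drop toward 0 as updates reach solr-dc02):
curl -XPOST 'http://192.168.56.101:8983/solr/collection1/cdcr?action=QUEUES'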
Query solr-dc01 collection
Let's query collection1 on solr-dc01 to ensure the documents are present. Run the following command:
curl -XGET 'http://192.168.56.101:8983/solr/collection1/select?q=*:*&indent=true'
You should see something similar to this:
curl -XGET 'http://192.168.56.101:8983/solr/collection1/select?q=*:*&indent=true'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <bool name="zkConnected">true</bool>
  <int name="status">0</int>
  <int name="QTime">17</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="indent">true</str>
  </lst>
</lst>
<result name="response" numFound="2" start="0">
  <doc>
    <str name="id">1</str>
    <str name="text_ws">This is document number one.</str>
    <long name="_version_">1549823582071160832</long></doc>
  <doc>
    <str name="id">2</str>
    <str name="text_ws">This is document number two.</str>
    <long name="_version_">1549823582135123968</long></doc>
</result>
</response>
Query solr-dc02 collection
Before executing the query on solr-dc02 , we need to commit the changes. As mentioned above, automatic commits are not enabled in the default solrconfig.xml . Run the following command:
curl -XPOST -H 'Content-Type: application/json' 'http://192.168.56.202:8983/solr/collection1/update' --data-binary '{
  "commit" : {}
}'
You should see a response similar to this:
{"responseHeader":{"status":0,"QTime":5}}
Now we can run our query:
curl -XGET 'http://192.168.56.202:8983/solr/collection1/select?q=*:*&indent=true'
You should see something similar to this:
curl -XGET 'http://192.168.56.202:8983/solr/collection1/select?q=*:*&indent=true'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <bool name="zkConnected">true</bool>
  <int name="status">0</int>
  <int name="QTime">17</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="indent">true</str>
  </lst>
</lst>
<result name="response" numFound="2" start="0">
  <doc>
    <str name="id">1</str>
    <str name="text_ws">This is document number one.</str>
    <long name="_version_">1549823582071160832</long></doc>
  <doc>
    <str name="id">2</str>
    <str name="text_ws">This is document number two.</str>
    <long name="_version_">1549823582135123968</long></doc>
</result>
</response>
You should notice that you have 2 documents, which have the same id and text_ws content as you pushed to solr-dc01.
Review
If you followed along with this tutorial, you have successfully set up cross data center replication between two SolrCloud clusters. Some important points to keep in mind:
Because this is an active-passive approach, there is only a single source system. If the source system goes down, your ingest will stop, as the other data center is read-only and should not have updates pushed outside of the replication process. Work is being done to make Solr CDCR active-active.
Cross data center communications can be a potential bottleneck. If the cross data center connection cannot sustain sufficient throughput, the target data center(s) can fall behind in replication.
CDCR is not intended nor optimized for bulk inserts. If you have a need to do bulk inserts, first synchronize the indexes between the data centers outside of the replication process. Then enable replication for incremental updates.
For more information, read about Cross Data Center Replication: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462
10-28-2016
08:21 PM
My pleasure. If you believe either of my answers are helpful, "accept" them. This helps the community find answered questions.