Member since
02-09-2016
559
Posts
422
Kudos Received
98
Solutions
11-08-2016
10:48 AM
1 Kudo
Objective
Many people work exclusively from a laptop where storage space is typically limited to 500GB of space or less. Over time, you may find your available storage space has become a regular concern. It's not uncommon to use an external hard drive to augment available storage space.
The current version of Docker for Mac (1.12.x) does not provide a configuration setting which allows users to change the location where the Docker virtual machine image is located. This means the image, which can grow up to 64GB in size by default, is located on your laptop's primary hard drive.
With the HDP 2.5 version of the Hortonworks sandbox available as a native Docker image, you may want to give Docker more room to work with. This tutorial will guide you through the process of moving your Docker virtual machine image to a different location, an external drive in this case. Doing so frees up as much as 64GB of space on your primary laptop hard drive and makes it possible to expand the size of the Docker image file later. This tutorial is the first in a two-part series.
Prerequisites
You should have already completed the following tutorial Installing Docker Version of Sandbox on Mac
You should have an external or secondary hard drive available.
Scope
Mac OS X 10.11.6 (El Capitan)
Docker for Mac 1.12.1
HDP 2.5 Docker Sandbox
Steps
Stop Docker for Mac
Before we can make any changes to the Docker virtual machine image, we need to stop Docker for Mac. There should be a Docker for Mac icon in the menu bar. You should see something similar to this:
You can also check from the command line with ps -ef | grep -i com.docker . You should see something similar to this:
ps -ef | grep -i com.docker
0 123 1 0 8:45AM ?? 0:00.01 /Library/PrivilegedHelperTools/com.docker.vmnetd
502 967 876 0 8:46AM ?? 0:00.08 /Applications/Docker.app/Contents/MacOS/com.docker.osx.hyperkit.linux -watchdog fd:0
502 969 967 0 8:46AM ?? 0:00.04 /Applications/Docker.app/Contents/MacOS/com.docker.osx.hyperkit.linux -watchdog fd:0
502 971 967 0 8:46AM ?? 0:07.96 com.docker.db --url fd:3 --git /Users/myoung/Library/Containers/com.docker.docker/Data/database
502 975 967 0 8:46AM ?? 0:03.40 com.docker.osx.hyperkit.linux
502 977 975 0 8:46AM ?? 0:00.03 /Applications/Docker.app/Contents/MacOS/com.docker.osx.hyperkit.linux
502 12807 967 0 9:17PM ?? 0:00.08 com.docker.osxfs --address fd:3 --connect /Users/myoung/Library/Containers/com.docker.docker/Data/@connect --control fd:4 --volume-control fd:5 --database /Users/myoung/Library/Containers/com.docker.docker/Data/s40
502 12810 967 0 9:17PM ?? 0:00.12 com.docker.slirp --db /Users/myoung/Library/Containers/com.docker.docker/Data/s40 --ethernet fd:3 --port fd:4 --vsock-path /Users/myoung/Library/Containers/com.docker.docker/Data/@connect --max-connections 900
502 12811 967 0 9:17PM ?? 0:00.19 com.docker.driver.amd64-linux -db /Users/myoung/Library/Containers/com.docker.docker/Data/s40 -osxfs-volume /Users/myoung/Library/Containers/com.docker.docker/Data/s30 -slirp /Users/myoung/Library/Containers/com.docker.docker/Data/s50 -vmnet /var/tmp/com.docker.vmnetd.socket -port /Users/myoung/Library/Containers/com.docker.docker/Data/s51 -vsock /Users/myoung/Library/Containers/com.docker.docker/Data -docker /Users/myoung/Library/Containers/com.docker.docker/Data/s60 -addr fd:3 -debug
502 12812 12811 0 9:17PM ?? 0:00.02 /Applications/Docker.app/Contents/MacOS/com.docker.driver.amd64-linux -db /Users/myoung/Library/Containers/com.docker.docker/Data/s40 -osxfs-volume /Users/myoung/Library/Containers/com.docker.docker/Data/s30 -slirp /Users/myoung/Library/Containers/com.docker.docker/Data/s50 -vmnet /var/tmp/com.docker.vmnetd.socket -port /Users/myoung/Library/Containers/com.docker.docker/Data/s51 -vsock /Users/myoung/Library/Containers/com.docker.docker/Data -docker /Users/myoung/Library/Containers/com.docker.docker/Data/s60 -addr fd:3 -debug
502 12814 12811 0 9:17PM ?? 0:16.48 /Applications/Docker.app/Contents/MacOS/com.docker.hyperkit -A -m 12G -c 6 -u -s 0:0,hostbridge -s 31,lpc -s 2:0,virtio-vpnkit,uuid=1f629fed-1ef6-4f34-8fce-753347e3b941,path=/Users/myoung/Library/Containers/com.docker.docker/Data/s50,macfile=/Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/mac.0 -s 3,virtio-blk,file:///Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2,format=qcow -s 4,virtio-9p,path=/Users/myoung/Library/Containers/com.docker.docker/Data/s40,tag=db -s 5,virtio-rnd -s 6,virtio-9p,path=/Users/myoung/Library/Containers/com.docker.docker/Data/s51,tag=port -s 7,virtio-sock,guest_cid=3,path=/Users/myoung/Library/Containers/com.docker.docker/Data,guest_forwards=2376;1525 -l com1,autopty=/Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/tty,log=/Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/console-ring -f kexec,/Applications/Docker.app/Contents/Resources/moby/vmlinuz64,/Applications/Docker.app/Contents/Resources/moby/initrd.img,earlyprintk=serial console=ttyS0 com.docker.driver="com.docker.driver.amd64-linux", com.docker.database="com.docker.driver.amd64-linux" ntp=gateway mobyplatform=mac -F /Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/hypervisor.pid
502 13790 876 0 9:52PM ?? 0:00.01 /Applications/Docker.app/Contents/MacOS/com.docker.frontend {"action":"vmstateevent","args":{"vmstate":"running"}}
502 13791 13790 0 9:52PM ?? 0:00.01 /Applications/Docker.app/Contents/MacOS/com.docker.frontend {"action":"vmstateevent","args":{"vmstate":"running"}}
502 13793 13146 0 9:52PM ttys000 0:00.00 grep -i com.docker
Now we are going to stop Docker for Mac. Before shutting down Docker, make sure all of your containers have been stopped. Using the menu shown above, click on the Quit Docker menu option. This will stop Docker for Mac. You should notice the Docker for Mac icon is no longer visible.
Now let's confirm the Docker processes we saw before are no longer running:
ps -ef | grep -i com.docker
0 123 1 0 8:45AM ?? 0:00.01 /Library/PrivilegedHelperTools/com.docker.vmnetd
502 13815 13146 0 9:54PM ttys000 0:00.00 grep -i com.docker
NOTE: It may take a minute or two before Docker completely shuts down.
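If you would rather script the wait than re-run ps by hand, a small loop like this will block until the hyperkit process exits (just a convenience sketch; pgrep ships with OS X):
while pgrep -f com.docker.hyperkit > /dev/null; do
  sleep 5
done
echo "Docker for Mac has stopped"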
Backup Docker virtual machine image
Before we make any changes to the Docker virtual machine image, we should back it up. This will temporarily use more space on your laptop hard drive. Make sure you have enough room to hold two copies of the data. As mentioned before, the Docker image can be up to 64GB by default. Let's check the current size of our image using du -sh . The Docker image file is located at ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/ by default.
du -sh ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/
64G /Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/
In my case, my image size is 64GB. You need to be sure you have room for 2 copies of the com.docker.driver.amd64-linux directory. Now we'll make a copy of our image:
cd ~/Library/Containers/com.docker.docker/Data/
cp -r com.docker.driver.amd64-linux com.docker.driver.amd64-linux.backup
This copy serves as our backup of the image.
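As a quick sanity check before moving on, you can compare the size of the backup against the original; both should report roughly the same size:
du -sh com.docker.driver.amd64-linux com.docker.driver.amd64-linux.backup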
Copy Docker virtual machine image to external drive
Now we can make a copy of our image on our external hard drive. I have a 1TB SSD mounted at /Volumes/Samsung . I am going to store my Docker virtual machine image in /Volumes/Samsung/Docker/image . You should store the image in a location that makes sense for you.
cp -r com.docker.driver.amd64-linux /Volumes/Samsung/Docker/image/
This process will take a few minutes. It will take longer if you are not using an SSD. Let's confirm the directory now exists on the external hard drive.
ls -la /Volumes/Samsung/Docker/image/
total 0
drwxr-xr-x 3 myoung staff 102 Nov 3 17:08 .
drwxr-xr-x 11 myoung staff 374 Nov 3 17:03 ..
drwxr-xr-x@ 11 myoung staff 374 Nov 7 21:53 com.docker.driver.amd64-linux
You can also check the size:
du -sh /Volumes/Samsung/Docker/image/
64G /Volumes/Samsung/Docker/image/
Create symbolic link for Docker virtual machine image
Now that we have a copy of the Docker image on the external hard drive, we will use a symbolic link to point from the image directory on the laptop hard drive to the image directory on the external hard drive. Before creating the link, we need to remove the current image directory on our laptop hard drive:
rm -rf ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux
Now let's create the symbolic link using the ln -s command. The syntax is ln -s <target> <link_name> , where target is the existing directory on the external drive and link_name is the path Docker expects on the internal drive.
ln -s /Volumes/Samsung/Docker/image/com.docker.driver.amd64-linux ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux
We can confirm the link was created:
ls -la ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux
lrwxr-xr-x 1 myoung staff 59 Nov 3 17:05 /Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux -> /Volumes/Samsung/Docker/image/com.docker.driver.amd64-linux
Restart Docker for Mac
Now we can restart Docker for Mac. This is done by running the application from the Applications folder in the Finder. You should see something similar to this:
Double-click on the Docker application to start it. You should notice the Docker for Mac icon is now back in the main menu bar. You can also check via ps -ef | grep -i com.docker . You should see something similar to this:
ps -ef | grep -i com.docker
0 123 1 0 8:45AM ?? 0:00.01 /Library/PrivilegedHelperTools/com.docker.vmnetd
502 14476 14465 0 10:42PM ?? 0:00.03 /Applications/Docker.app/Contents/MacOS/com.docker.osx.hyperkit.linux -watchdog fd:0
502 14479 14476 0 10:42PM ?? 0:00.01 /Applications/Docker.app/Contents/MacOS/com.docker.osx.hyperkit.linux -watchdog fd:0
502 14480 14476 0 10:42PM ?? 0:00.29 com.docker.db --url fd:3 --git /Users/myoung/Library/Containers/com.docker.docker/Data/database
502 14481 14476 0 10:42PM ?? 0:00.08 com.docker.osxfs --address fd:3 --connect /Users/myoung/Library/Containers/com.docker.docker/Data/@connect --control fd:4 --volume-control fd:5 --database /Users/myoung/Library/Containers/com.docker.docker/Data/s40
502 14482 14476 0 10:42PM ?? 0:00.04 com.docker.slirp --db /Users/myoung/Library/Containers/com.docker.docker/Data/s40 --ethernet fd:3 --port fd:4 --vsock-path /Users/myoung/Library/Containers/com.docker.docker/Data/@connect --max-connections 900
502 14483 14476 0 10:42PM ?? 0:00.05 com.docker.osx.hyperkit.linux
502 14484 14476 0 10:42PM ?? 0:00.08 com.docker.driver.amd64-linux -db /Users/myoung/Library/Containers/com.docker.docker/Data/s40 -osxfs-volume /Users/myoung/Library/Containers/com.docker.docker/Data/s30 -slirp /Users/myoung/Library/Containers/com.docker.docker/Data/s50 -vmnet /var/tmp/com.docker.vmnetd.socket -port /Users/myoung/Library/Containers/com.docker.docker/Data/s51 -vsock /Users/myoung/Library/Containers/com.docker.docker/Data -docker /Users/myoung/Library/Containers/com.docker.docker/Data/s60 -addr fd:3 -debug
502 14485 14483 0 10:42PM ?? 0:00.01 /Applications/Docker.app/Contents/MacOS/com.docker.osx.hyperkit.linux
502 14486 14484 0 10:42PM ?? 0:00.01 /Applications/Docker.app/Contents/MacOS/com.docker.driver.amd64-linux -db /Users/myoung/Library/Containers/com.docker.docker/Data/s40 -osxfs-volume /Users/myoung/Library/Containers/com.docker.docker/Data/s30 -slirp /Users/myoung/Library/Containers/com.docker.docker/Data/s50 -vmnet /var/tmp/com.docker.vmnetd.socket -port /Users/myoung/Library/Containers/com.docker.docker/Data/s51 -vsock /Users/myoung/Library/Containers/com.docker.docker/Data -docker /Users/myoung/Library/Containers/com.docker.docker/Data/s60 -addr fd:3 -debug
502 14488 14484 0 10:42PM ?? 0:07.90 /Applications/Docker.app/Contents/MacOS/com.docker.hyperkit -A -m 12G -c 6 -u -s 0:0,hostbridge -s 31,lpc -s 2:0,virtio-vpnkit,uuid=1f629fed-1ef6-4f34-8fce-753347e3b941,path=/Users/myoung/Library/Containers/com.docker.docker/Data/s50,macfile=/Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/mac.0 -s 3,virtio-blk,file:///Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2,format=qcow -s 4,virtio-9p,path=/Users/myoung/Library/Containers/com.docker.docker/Data/s40,tag=db -s 5,virtio-rnd -s 6,virtio-9p,path=/Users/myoung/Library/Containers/com.docker.docker/Data/s51,tag=port -s 7,virtio-sock,guest_cid=3,path=/Users/myoung/Library/Containers/com.docker.docker/Data,guest_forwards=2376;1525 -l com1,autopty=/Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/tty,log=/Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/console-ring -f kexec,/Applications/Docker.app/Contents/Resources/moby/vmlinuz64,/Applications/Docker.app/Contents/Resources/moby/initrd.img,earlyprintk=serial console=ttyS0 com.docker.driver="com.docker.driver.amd64-linux", com.docker.database="com.docker.driver.amd64-linux" ntp=gateway mobyplatform=mac -F /Users/myoung/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/hypervisor.pid
502 14559 14465 0 10:46PM ?? 0:00.01 /Applications/Docker.app/Contents/MacOS/com.docker.frontend {"action":"vmstateevent","args":{"vmstate":"running"}}
502 14560 14559 0 10:46PM ?? 0:00.01 /Applications/Docker.app/Contents/MacOS/com.docker.frontend {"action":"vmstateevent","args":{"vmstate":"running"}}
502 14562 13146 0 10:46PM ttys000 0:00.00 grep -i com.docker
You should notice the Docker processes are running again. You can also check the timestamp of files in the Docker image directory on the external hard drive:
ls -la /Volumes/Samsung/Docker/image/com.docker.driver.amd64-linux
total 134133536
drwxr-xr-x@ 12 myoung staff 408 Nov 7 22:42 .
drwxr-xr-x 3 myoung staff 102 Nov 3 17:08 ..
-rw-r--r-- 1 myoung staff 68676222976 Nov 7 22:45 Docker.qcow2
-rw-r--r-- 1 myoung staff 65536 Nov 7 22:42 console-ring
-rw-r--r-- 1 myoung staff 5 Nov 7 22:42 hypervisor.pid
-rw-r--r-- 1 myoung staff 0 Aug 24 16:06 lock
drwxr-xr-x 67 myoung staff 2278 Nov 5 22:00 log
-rw-r--r-- 1 myoung staff 17 Nov 7 22:42 mac.0
-rw-r--r-- 1 myoung staff 36 Aug 24 16:06 nic1.uuid
-rw-r--r-- 1 myoung staff 5 Nov 7 22:42 pid
-rw-r--r-- 1 myoung staff 59619 Nov 7 22:42 syslog
lrwxr-xr-x 1 myoung staff 12 Nov 7 22:42 tty -> /dev/ttys001
You should notice the timestamp of the Docker.qcow2 file has been updated which means Docker is now using this location for its image file.
Start a Docker container
You should attempt to start a Docker container to make sure everything is working. You can start the HDP sandbox via docker start sandbox if you've already installed it as described in the prerequisites. If everything works, you can delete the backup.
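A minimal verification might look like this, assuming the sandbox container from the prerequisite tutorial exists:
docker start sandbox
docker ps      # the sandbox container should show a status of Up
docker info    # confirms the daemon is responding from the new image location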
Delete Docker backup image
Now that everything is working using the new location, we can remove our backup.
rm -rf ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux.backup
Review
If you followed along with this tutorial, you have moved your Docker for Mac virtual machine image to an external hard drive, freeing up as much as 64GB of space on your laptop hard drive. Look for part two in the series to learn how to increase the size of your Docker image.
11-08-2016
02:25 AM
@Marcia Hon For both options a and b, did you remove the contents of the /var/lib/docker/ directory where the image file is created? You have to remove those files and then restart the daemons. If you don't, Docker will not increase the size automatically. You can run docker info before and after this process to see how much space is available for Docker. NOTE: You will lose your images and containers when you delete the contents of that directory, so back up any containers or images you want to keep.
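For reference, the sequence I have in mind is roughly the following. This is only a sketch: the service commands depend on your init system, and deleting that directory wipes all local images and containers.
sudo service docker stop         # or: sudo systemctl stop docker
sudo rm -rf /var/lib/docker/*    # WARNING: removes all images and containers
sudo service docker start        # or: sudo systemctl start docker
docker info                      # compare the reported storage space before and after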
11-04-2016
07:24 PM
Good catch. Tutorial has been updated to provide more links.
11-04-2016
02:29 AM
4 Kudos
Objective
In many organizations, "search" is a common requirement for a user-friendly means of accessing data. When people think of "search", they often think of Google. Many organizations use Solr as their enterprise search engine. It is commonly used to power public website search from within the site itself. Organizations will often build custom user interfaces to tailor queries to meet their internal or external end-user needs. In most of these scenarios, users are shielded from the complexity of the Solr query syntax.
Solr has a long list of features and capabilities; you can read more here Apache Solr Features. Solr 6 has a new feature which allows you to submit SQL queries via JDBC. This opens up new ways to interact with Solr. Using Zeppelin with SQL is now possible because of this new feature. This should make you more productive because you can use a language syntax with which you are already familiar: SQL!
This tutorial will guide you through the process of updating the Zeppelin JDBC interpreter configuration to enable submitting SQL queries to Solr via JDBC. We'll use the Hortonworks HDP 2.5 Docker sandbox and Apache Solr 6.2.1.
NOTE: Solr 6 is being deployed as a standalone application within the sandbox. HDP 2.5 ships with Solr 5.5.2 via HDPSearch, which does not include the JDBC SQL functionality.
Prerequisites
You should have already completed the following tutorial Installing Docker Version of Sandbox on Mac
You should have already downloaded Apache Solr 6.2.1: Apache Solr 6.2.1
Scope
Mac OS X 10.11.6 (El Capitan)
Docker for Mac 1.12.1
HDP 2.5 Docker Sandbox
Apache Solr 6.2.1
Steps
Start Sandbox
If you completed the tutorial listed in the prerequisites, then you should be ready to start up your Docker sandbox container.
docker start sandbox
NOTE: If your container is still running from performing the other tutorial, you do not need to start it again.
Once the container is started, you need to login:
ssh -p 2222 root@localhost
Now you can start the services
/etc/init.d/startup_scripts start
NOTE: This process will take several minutes.
Create Solr user in the sandbox
We will be running the Solr process as the solr user. Let's create that user in our sandbox:
useradd -d /home/solr -s /bin/bash -U solr
Copy Solr archive file to sandbox
You should already have the Solr archive file downloaded. We will use scp to copy the file to the sandbox. You should do this in another terminal window as your current window should be logged into the sandbox. From your Mac run the following command:
scp -P 2222 ~/Downloads/solr-6.2.1.tgz root@localhost:/root/
NOTE: The ssh and scp commands use different parameters to specify the port, and it's easy to confuse them. The ssh command uses -p to specify the port. The scp command uses -P to specify the port.
In my case, the Solr file was downloaded to ~/Downloads . Your location may be different.
Extract the Solr archive file
We'll run Solr out of the /opt directory. This makes things a bit cleaner than using the installation script, which places some files in /var .
cd /opt
tar xvfz /root/solr-6.2.1.tgz
Now we need to give the solr user ownership over the directory.
chown -R solr:solr /opt/solr-6.2.1/
Install JDK 8
Solr 6.x requires JDK 8, which is not on the current version of the sandbox. You will need to install it before you can run Solr.
yum install java-1.8.0-openjdk-devel
Start Solr
Now that Solr is installed, we can start up a SolrCloud instance. The Solr start script provides a handy way to start a 2 node SolrCloud cluster. The -e flag tells Solr to start the cloud example. The -noprompt flag tells Solr to use default values.
cd /opt/solr-6.2.1
bin/solr start -e cloud -noprompt
Welcome to the SolrCloud example!
Starting up 2 Solr nodes for your example SolrCloud cluster.
Creating Solr home directory /opt/solr-6.2.1/example/cloud/node1/solr
Cloning /opt/solr-6.2.1/example/cloud/node1 into
/opt/solr-6.2.1/example/cloud/node2
Starting up Solr on port 8983 using command:
bin/solr start -cloud -p 8983 -s "example/cloud/node1/solr"
Waiting up to 30 seconds to see Solr running on port 8983 [\]
Started Solr server on port 8983 (pid=4952). Happy searching!
Starting up Solr on port 7574 using command:
bin/solr start -cloud -p 7574 -s "example/cloud/node2/solr" -z localhost:9983
Waiting up to 30 seconds to see Solr running on port 7574 [|]
Started Solr server on port 7574 (pid=5175). Happy searching!
Connecting to ZooKeeper at localhost:9983 ...
Uploading /opt/solr-6.2.1/server/solr/configsets/data_driven_schema_configs/conf for config gettingstarted to ZooKeeper at localhost:9983
Creating new collection 'gettingstarted' using command:
http://localhost:8983/solr/admin/collections?action=CREATE&name=gettingstarted&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=gettingstarted
{
"responseHeader":{
"status":0,
"QTime":28324},
"success":{
"192.168.56.151:8983_solr":{
"responseHeader":{
"status":0,
"QTime":17801},
"core":"gettingstarted_shard1_replica1"},
"192.168.56.151:7574_solr":{
"responseHeader":{
"status":0,
"QTime":18096},
"core":"gettingstarted_shard1_replica2"}}}
Enabling auto soft-commits with maxTime 3 secs using the Config API
POSTing request to Config API: http://localhost:8983/solr/gettingstarted/config
{"set-property":{"updateHandler.autoSoftCommit.maxTime":"3000"}}
Successfully set-property updateHandler.autoSoftCommit.maxTime to 3000
SolrCloud example running, please visit: http://localhost:8983/solr
As you can see from the output, we have 2 Solr instances. One instance is listening on port 8983 and the other is listening on 7574 . They are using an embedded Zookeeper instance for coordination, and it is listening on port 9983 . If we were going to production, we would use the HDP cluster's Zookeeper instances for more reliability.
Index sample data
Now that our SolrCloud cluster is running, we can index sample data into the cluster. We'll execute our SQL queries against this data. Fortunately, Solr ships with a number of example data sets. For this tutorial, we'll index XML data which contains sample product information.
bin/post -c gettingstarted example/exampledocs/*.xml
This command posts the xml documents in the specified path. The -c option defines which collection to use. The command we used previously to create the SolrCloud cluster automatically created a gettingstarted collection using the data_driven_schema_configs configuration. This configuration is what we call schemaless because the fields are dynamically added to the collection. Without dynamic fields, you have to explicitly define every field you want to have in your collection.
You should see something like this:
bin/post -c gettingstarted example/exampledocs/*.xml
/usr/lib/jvm/java/bin/java -classpath /opt/solr-6.2.1/dist/solr-core-6.2.1.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool example/exampledocs/gb18030-example.xml example/exampledocs/hd.xml example/exampledocs/ipod_other.xml example/exampledocs/ipod_video.xml example/exampledocs/manufacturers.xml example/exampledocs/mem.xml example/exampledocs/money.xml example/exampledocs/monitor2.xml example/exampledocs/monitor.xml example/exampledocs/mp500.xml example/exampledocs/sd500.xml example/exampledocs/solr.xml example/exampledocs/utf8-example.xml example/exampledocs/vidcard.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update.
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file gb18030-example.xml (application/xml) to [base]
POSTing file hd.xml (application/xml) to [base]
POSTing file ipod_other.xml (application/xml) to [base]
POSTing file ipod_video.xml (application/xml) to [base]
POSTing file manufacturers.xml (application/xml) to [base]
POSTing file mem.xml (application/xml) to [base]
POSTing file money.xml (application/xml) to [base]
POSTing file monitor2.xml (application/xml) to [base]
POSTing file monitor.xml (application/xml) to [base]
POSTing file mp500.xml (application/xml) to [base]
POSTing file sd500.xml (application/xml) to [base]
POSTing file solr.xml (application/xml) to [base]
POSTing file utf8-example.xml (application/xml) to [base]
POSTing file vidcard.xml (application/xml) to [base]
14 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update.
Time spent: 0:00:02.379
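As an aside on the schemaless point above: if you would rather define a field explicitly instead of relying on dynamic fields, the Schema API can do that. The following is only an illustrative sketch with a made-up field name:
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field": {
    "name": "product_code",
    "type": "string",
    "stored": true
  }
}' http://localhost:8983/solr/gettingstarted/schema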
Query Solr data
Now we can use curl to run a test query against Solr. The following command will query the gettingstarted collection for all documents. It also returns the results as JSON instead of the default XML.
curl -XGET 'http://localhost:8983/solr/gettingstarted/select?q=*:*&wt=json&indent=true'
You should see something like this:
curl -XGET 'http://localhost:8983/solr/gettingstarted/select?q=*:*&wt=json&indent=true'
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":11,
"params":{
"q":"*:*",
"indent":"true",
"wt":"json"}},
"response":{"numFound":32,"start":0,"maxScore":1.0,"docs":[
{
"id":"GB18030TEST",
"name":["Test with some GB18030 encoded characters"],
"features":["No accents here",
"这是一个功能",
"This is a feature (translated)",
"这份文件是很有光泽",
"This document is very shiny (translated)"],
"price":[0.0],
"inStock":[true],
"_version_":1550023359021973504},
{
"id":"IW-02",
"name":["iPod & iPod Mini USB 2.0 Cable"],
"manu":["Belkin"],
"manu_id_s":"belkin",
"cat":["electronics",
"connector"],
"features":["car power adapter for iPod, white"],
"weight":[2.0],
"price":[11.5],
"popularity":[1],
"inStock":[false],
"store":["37.7752,-122.4232"],
"manufacturedate_dt":"2006-02-14T23:55:59Z",
"_version_":1550023359918505984},
{
"id":"MA147LL/A",
"name":["Apple 60 GB iPod with Video Playback Black"],
"manu":["Apple Computer Inc."],
"manu_id_s":"apple",
"cat":["electronics",
"music"],
"features":["iTunes, Podcasts, Audiobooks",
"Stores up to 15,000 songs, 25,000 photos, or 150 hours of video",
"2.5-inch, 320x240 color TFT LCD display with LED backlight",
"Up to 20 hours of battery life",
"Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video",
"Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication"],
"includes":["earbud headphones, USB cable"],
"weight":[5.5],
"price":[399.0],
"popularity":[10],
"inStock":[true],
"store":["37.7752,-100.0232"],
"manufacturedate_dt":"2005-10-12T08:00:00Z",
"_version_":1550023360204767232},
{
"id":"adata",
"compName_s":"A-Data Technology",
"address_s":"46221 Landing Parkway Fremont, CA 94538",
"_version_":1550023360573865984},
{
"id":"asus",
"compName_s":"ASUS Computer",
"address_s":"800 Corporate Way Fremont, CA 94539",
"_version_":1550023360584351744},
{
"id":"belkin",
"compName_s":"Belkin",
"address_s":"12045 E. Waterfront Drive Playa Vista, CA 90094",
"_version_":1550023360586448896},
{
"id":"maxtor",
"compName_s":"Maxtor Corporation",
"address_s":"920 Disc Drive Scotts Valley, CA 95066",
"_version_":1550023360587497472},
{
"id":"TWINX2048-3200PRO",
"name":["CORSAIR XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) Dual Channel Kit System Memory - Retail"],
"manu":["Corsair Microsystems Inc."],
"manu_id_s":"corsair",
"cat":["electronics",
"memory"],
"features":["CAS latency 2, 2-3-3-6 timing, 2.75v, unbuffered, heat-spreader"],
"price":[185.0],
"popularity":[5],
"inStock":[true],
"store":["37.7752,-122.4232"],
"manufacturedate_dt":"2006-02-13T15:26:37Z",
"payloads":["electronics|6.0 memory|3.0"],
"_version_":1550023360602177536},
{
"id":"VS1GB400C3",
"name":["CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail"],
"manu":["Corsair Microsystems Inc."],
"manu_id_s":"corsair",
"cat":["electronics",
"memory"],
"price":[74.99],
"popularity":[7],
"inStock":[true],
"store":["37.7752,-100.0232"],
"manufacturedate_dt":"2006-02-13T15:26:37Z",
"payloads":["electronics|4.0 memory|2.0"],
"_version_":1550023360647266304},
{
"id":"VDBDB1A16",
"name":["A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM"],
"manu":["A-DATA Technology Inc."],
"manu_id_s":"corsair",
"cat":["electronics",
"memory"],
"features":["CAS latency 3, 2.7v"],
"popularity":[0],
"inStock":[true],
"store":["45.18414,-93.88141"],
"manufacturedate_dt":"2006-02-13T15:26:37Z",
"payloads":["electronics|0.9 memory|0.1"],
"_version_":1550023360648314880}]
}}
By default Solr will return the top 10 documents. If you look at the top of the results, you will notice there are 32 documents in our collection.
...
"response":{"numFound":32,"start":0,"maxScore":1.0,"docs":[
...
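If you want to see more than the default 10 documents, you can ask for additional rows explicitly. For example, this returns all 32 documents from our collection:
curl -XGET 'http://localhost:8983/solr/gettingstarted/select?q=*:*&rows=32&wt=json&indent=true'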
Modify Zeppelin JDBC interpreter
Now we need to modify the existing JDBC interpreter in Zeppelin. By default, this interpreter will work with Hive, Postgres and Phoenix. We will be adding Solr to the configuration.
Open the Zeppelin UI. You can either use the link in Ambari or directly via http://localhost:9995 . You should see something like this:
Click on the user menu in the upper right. You are logged into Zeppelin as anonymous . You should see a menu like this:
Click on the Interpreter link. You should see something like this:
You should see the jdbc interpreter near the top of the list. If you don't, you can either scroll down or use the built-in search feature at the top of the page. Click on the edit button for the jdbc interpreter. You will notice the screen changes to allow you to add new properties or modify existing ones. You should see something like this:
Scroll down until you see the empty entry line. You should see something like this:
We need to add 3 properties/values here.
solr.url jdbc:solr://localhost:9983?collection=gettingstarted
solr.user solr
solr.driver org.apache.solr.client.solrj.io.sql.DriverImpl
Why are we using port 9983 ? That is because we are in SolrCloud mode. We are pointing to the Zookeeper instance. If one of the nodes goes down, Zookeeper will know and direct us to a node that is working.
Add each of these properties and click the + button after each entry. You should now have 3 new properties in your list:
Now we need to add an artifact to the Dependencies section. It's just below the properties. We are going to add the following:
org.apache.solr:solr-solrj:6.2.1
Click the + button. You should see something like this:
Now click the blue Save button to save the changes.
Create a new notebook
Now that we have our JDBC interpreter updated, we are going to create a new notebook. Click the Notebook drop down menu in the upper left. You should see something like this:
Click the + Create a new note link. You should see something like this:
Give the notebook the name Solr JDBC , then click the Create Note button.
You should see something like this:
We can query Solr using a jdbc prefix like %jdbc(solr) . The prefix refers to the property-name prefix we used in the JDBC interpreter settings. If you recall, there were properties like:
solr.url
phoenix.url
hive.url
psql.url
Our prefix is solr . Create the following query as the first note:
%jdbc(solr)
select name, price, inStock from gettingstarted
Now click the run arrow icon. This will run the query against Solr and return results if our configuration is correct. You should see something like this:
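If the paragraph does not return results, it can help to rule out Zeppelin by posting the same statement straight to Solr's /sql handler, which is what backs the JDBC driver. This is a hedged example; adjust the host and port if your setup differs:
curl --data-urlencode 'stmt=select name, price, inStock from gettingstarted limit 5' http://localhost:8983/solr/gettingstarted/sql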
Now add another note below our first one with the following query:
%jdbc(solr)
select name, price, inStock from gettingstarted where inStock = false
You should see something like this:
And finally add one more note below our second one with the following query:
%jdbc(solr)
select price, count(*) from gettingstarted group by price order by price desc
You should see something like this:
As you can see, it is easy to run simple queries and more complex aggregations using pure SQL. For comparison, here is the native Solr query that does the same thing as our second note:
curl -XGET 'http://localhost:8983/solr/gettingstarted/select?fl=price,name,inStock&indent=on&q=inStock:true&wt=json'
If you ran this command in the terminal, you should see something like this: curl -XGET 'http://localhost:8983/solr/gettingstarted/select?fl=price,name,inStock&indent=on&q=inStock:true&wt=json'
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":16,
"params":{
"q":"inStock:true",
"indent":"on",
"fl":"price,name,inStock",
"wt":"json"}},
"response":{"numFound":17,"start":0,"maxScore":0.2578291,"docs":[
{
"name":["Test with some GB18030 encoded characters"],
"price":[0.0],
"inStock":[true]},
{
"name":["Apple 60 GB iPod with Video Playback Black"],
"price":[399.0],
"inStock":[true]},
{
"name":["CORSAIR XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) Dual Channel Kit System Memory - Retail"],
"price":[185.0],
"inStock":[true]},
{
"name":["CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail"],
"price":[74.99],
"inStock":[true]},
{
"name":["A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM"],
"inStock":[true]},
{
"name":["One Dollar"],
"inStock":[true]},
{
"name":["One British Pound"],
"inStock":[true]},
{
"name":["Dell Widescreen UltraSharp 3007WFP"],
"price":[2199.0],
"inStock":[true]},
{
"name":["Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133"],
"price":[92.0],
"inStock":[true]},
{
"name":["Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300"],
"price":[350.0],
"inStock":[true]}]
}}
Now here is the query for the aggregations:
curl -XGET 'http://localhost:8983/solr/gettingstarted/select?facet.field=price&facet=on&fl=price&indent=on&q=*:*&wt=json'
Which do you find easier to use? My guess is the SQL syntax. 😉
Review
If you followed along with this tutorial, you installed Solr and ran it in SolrCloud mode, indexed some sample XML documents, updated the Zeppelin interpreter configuration to support Solr JDBC queries, created a notebook, and ran a few queries against Solr using SQL. Finally, you saw the comparatively more complex native Solr query syntax. You can read more here:
Solr SQL: https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface Zeppelin + Solr JDBC: https://cwiki.apache.org/confluence/display/solr/Solr+JDBC+-+Apache+Zeppelin
11-03-2016
09:10 PM
Awesome! Thank you for sharing the steps you followed.
11-03-2016
07:58 PM
1 Kudo
@Shankar P Running out of space for Docker seems to be a common problem and there doesn't seem to be a single good answer on how to solve it. I created a VirtualBox CentOS 7 VM via Vagrant. This is a 40GB disk image. I then increased the disk size to 100GB via VirtualBox tools. Even with 100GB, I get a similar error when I try to import the sandbox. I haven't yet tried to use the -g option. The problem is that Docker still uses a virtual machine behind the scenes to run containers. The default storage size for that VM appears to be 20GB. I've found a number of threads from people wanting to see that increased to 100 or 200GB, which seems reasonable to me. Unfortunately, I don't think that has been changed/released. Having said all of that, Amazon Linux uses upstart as the init system. Have you tried making your changes to /etc/default/docker instead of /etc/sysconfig/docker? This thread details some of the hassles people have gone through to figure out which configuration file to use: https://github.com/docker/docker/issues/9889
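For reference, the kind of daemon configuration change I have in mind would look roughly like this. This is only a sketch: the file name and variable name (OPTIONS vs. DOCKER_OPTS) depend on the distro and Docker version, and /data/docker is just an example path on a larger volume.
# /etc/sysconfig/docker on RHEL/CentOS-style systems, /etc/default/docker on upstart-based systems such as Amazon Linux
OPTIONS="-g /data/docker --storage-opt dm.basesize=100G"
# Restart the daemon and compare what docker reports before and after
sudo service docker restart
docker info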
11-02-2016
06:11 PM
@sai d You must have VT-x features enabled within your computer BIOS. This is a common requirement for most Virtual Machines these days. What kind of computer are you using? Have you enabled VT-x?
11-01-2016
08:55 PM
6 Kudos
Objective
Cross Data Center Replication, commonly abbreviated as CDCR, is a new feature found in SolrCloud 6.x. This feature enables Solr to replicate data from one source collection to one or more target collections distributed between data centers. The current version provides an active-passive disaster recovery solution for Solr. Data updates, which include adds, updates, and deletes, are copied from the source collection to the target collection. This means the target collection should not be sent data updates outside of the CDCR functionality. Prior to SolrCloud 6.x you had to manually design a strategy for replication across data centers.
This tutorial will guide you through the process of enabling CDCR between two SolrCloud clusters, each with 1 server, in a Vagrant + VirtualBox environment.
NOTE: Solr 6 is being deployed as a standalone application. HDP 2.5 provides support for Solr 5.5.2 via HDPSearch, which does not include CDCR functionality.
Prerequisites
You should have already installed the following:
VirtualBox 5.1.6 (VirtualBox)
Vagrant 1.8.6 (Vagrant)
Vagrant plugin vagrant-vbguest 0.13.x (vagrant-vbguest)
Vagrant plugin vagrant-hostmanager 1.8.5 (vagrant-hostmanager)
You should have already downloaded the Apache Solr 6.2.1 release (Apache Solr 6.2.1)
Scope
This tutorial was tested using the following environment and components:
Mac OS X 10.11.6 (El Capitan)
VirtualBox 5.1.6 (tutorial should work with any newer version)
Vagrant 1.8.6
vagrant-vbguest plugin 0.13.0
vagrant-hostmanager plugin 1.8.5
Apache Solr 6.2.1
Steps
Create Vagrant project directory
I like to create project directories. My Vagrant work goes under ~/Vagrant/<project> and my Docker work goes under ~/Docker/<project> . This allows me to clearly identify which technology is associated with each project and allows me to use various helper scripts to automate processes. So let's create a project directory for this tutorial.
mkdir -p ~/Vagrant/solrcloud-cdcr-tutorial && cd ~/Vagrant/solrcloud-cdcr-tutorial
Create Vagrantfile
The Vagrantfile tells Vagrant how to configure your virtual machines. You can copy/paste my Vagrantfile below or use the version in the attachments area of this tutorial. Here is the content from my file:
# -*- mode: ruby -*-
# vi: set ft=ruby :
# Using yaml to load external configuration files
require 'yaml'
Vagrant.configure(2) do |config|
  # Using the hostmanager vagrant plugin to update the host files
  config.hostmanager.enabled = true
  config.hostmanager.manage_host = true
  config.hostmanager.manage_guest = true
  config.hostmanager.ignore_private_ip = false
  # Loading in the list of commands that should be run when the VM is provisioned.
  commands = YAML.load_file('commands.yaml')
  commands.each do |command|
    config.vm.provision :shell, inline: command
  end
  # Loading in the VM configuration information
  servers = YAML.load_file('servers.yaml')
  servers.each do |server|
    config.vm.define server['name'] do |srv|
      srv.vm.box = server['box']            # Specify the name of the Vagrant box file to use
      srv.vm.hostname = server['name']      # Set the hostname of the VM
      srv.vm.network 'private_network', ip: server['ip'], :adapter => 2    # Add a second adapter with a specified IP
      srv.vm.network :forwarded_port, guest: 22, host: server['port']      # Add a port forwarding rule
      srv.vm.provision :shell, inline: "sed -i'' '/^127.0.0.1\t#{srv.vm.hostname}\t#{srv.vm.hostname}$/d' /etc/hosts"    # Remove the extraneous first entry in /etc/hosts
      srv.vm.provider :virtualbox do |vb|
        vb.name = server['name']            # Name of the VM in VirtualBox
        vb.cpus = server['cpus']            # How many CPUs to allocate to the VM
        vb.memory = server['ram']           # How much memory to allocate to the VM
        vb.customize ['modifyvm', :id, '--cpuexecutioncap', '25']    # Limit the VM to 25% of available CPU
      end
    end
  end
end
Create a servers.yaml file
The servers.yaml file contains the configuration information for our VMs. You can copy/paste my servers.yaml below or use the version in the attachments area of this tutorial. Here is the content from my file:
---
- name: solr-dc01
box: bento/centos-7.2
cpus: 2
ram: 2048
ip: 192.168.56.101
port: 10122
- name: solr-dc02
box: bento/centos-7.2
cpus: 2
ram: 2048
ip: 192.168.56.202
port: 20222
Create commands.yaml file
The commands.yaml file contains the list of commands that should be run on each VM when they are first provisioned. This allows us to automate configuration tasks that would otherwise be tedious and/or repetitive. You can copy/paste my commands.yaml below or use the version in the attachments area of this tutorial. Here is the content from my file:
- sudo yum -y install net-tools ntp wget java-1.8.0-openjdk java-1.8.0-openjdk-devel lsof
- sudo systemctl enable ntpd && sudo systemctl start ntpd
- sudo systemctl disable firewalld && sudo systemctl stop firewalld
- sudo sed -i --follow-symlinks 's/^SELINUX=.*/SELINUX=disabled/g' /etc/sysconfig/selinux
Copy Solr release file to our Vagrant project directory
Our project directory is accessible to each of our Vagrant VMs via the /vagrant mount point. This allows us to easily access files and data located in our project directory. Instead of using scp to copy the Apache Solr release file to each of the VMs and creating duplicate files, we'll use a single copy located in our project directory.
cp ~/Downloads/solr-6.2.1.tgz .
NOTE: This assumes you are on a Mac and your downloads are in the ~/Downloads directory.
Start virtual machines
Now we are ready to start our 2 virtual machines for the first time. Creating the VMs for the first time and starting them every time after that uses the same command:
vagrant up
Once the process is complete you should have 2 servers running. You can verify by looking at VirtualBox. Notice I have 2 VMs running called solr-dc01 and solr-dc02:
Connect to each virtual machine
You are able to log in to each of the VMs via ssh using the vagrant ssh command. You must specify the name of the VM you want to connect to.
vagrant ssh solr-dc01
Using another terminal window, repeat this process for solr-dc02 .
Extract Solr install scripts
The Solr release archive file contains an installation script. This installation script will do the following by default (NOTE: this assumes that you downloaded Solr 6.2.1):
Install Solr under /opt/solr-6.2.1
Create a symbolic link between /opt/solr and /opt/solr-6.2.1
Create a solr user
Store live data such as indexes and logs in /var/solr
On solr-dc01 , run the following command:
tar xvfz /vagrant/solr-6.2.1.tgz solr-6.2.1/bin/install_solr_service.sh --strip-components=2
Repeat this process for solr-dc02 . This will create a file called install_solr_service.sh in your current directory, which should be /home/vagrant .
Install Apache Solr
Now we can install Solr using the script defaults:
sudo bash ./install_solr_service.sh /vagrant/solr-6.2.1.tgz
The command above is the same as if you had specified the default settings:
sudo bash ./install_solr_service.sh /vagrant/solr-6.2.1.tgz -i /opt -d /var/solr -u solr -s solr -p 8983
After running the command, you should see something similar to this:
id: solr: no such user
Creating new user: solr
Extracting /vagrant/solr-6.2.1.tgz to /opt
Installing symlink /opt/solr -> /opt/solr-6.2.1 ...
Installing /etc/init.d/solr script ...
Installing /etc/default/solr.in.sh ...
Waiting up to 30 seconds to see Solr running on port 8983 [/]
Started Solr server on port 8983 (pid=29168). Happy searching!
Found 1 Solr nodes:
Solr process 29168 running on port 8983
{
  "solr_home":"/var/solr/data",
  "version":"6.2.1 43ab70147eb494324a1410f7a9f16a896a59bc6f - shalin - 2016-09-15 05:20:53",
  "startTime":"2016-10-31T19:46:27.997Z",
  "uptime":"0 days, 0 hours, 0 minutes, 12 seconds",
  "memory":"13.4 MB (%2.7) of 490.7 MB"}
Service solr installed.
If you run the following command, you can see the Solr process is running:
ps -ef | grep solr
solr 28980 1 0 19:49 ? 00:00:11 java -server -Xms512m -Xmx512m -XX:NewRatio=3 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 -XX:+CMSScavengeBeforeRemark -XX:PretenureSizeThreshold=64m -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=50 -XX:CMSMaxAbortablePrecleanTime=6000 -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/solr/logs/solr_gc.log -Djetty.port=8983 -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Duser.timezone=UTC -Djetty.home=/opt/solr/server -Dsolr.solr.home=/var/solr/data -Dsolr.install.dir=/opt/solr -Dlog4j.configuration=file:/var/solr/log4j.properties -Xss256k -XX:OnOutOfMemoryError=/opt/solr/bin/oom_solr.sh 8983 /var/solr/logs -jar start.jar --module=http
Repeat this process for solr-dc02 .
Modify Solr service
It's more convenient to use the OS services infrastructure to manage running Solr processes than manually using scripts. The installation process creates a service script that starts Solr in single instance mode. To take advantage of CDCR, you must use SolrCloud mode. We need to make some changes to the service script for this to work. We'll be using the embedded Zookeeper instance for our tutorial. To do this, we need a Zookeeper configuration file in our /var/solr/data directory. We'll copy the default configuration file from /opt/solr/server/solr/zoo.cfg .
sudo -u solr cp /opt/solr/server/solr/zoo.cfg /var/solr/data/zoo.cfg
Now we need the /etc/init.d/solr service script to run Solr in SolrCloud mode. This is done by adding the -c parameter to the start process. When no other parameters are specified, Solr will start an embedded Zookeeper instance on the Solr port + 1000. In our case, that should be 9983 because our default Solr port is 8983 . Because this file is owned by root, we'll need to use sudo.
exit
sudo vi /etc/init.d/solr
Look near the end of the file for the line:
...
case "$1" in
  start|stop|restart|status)
    SOLR_CMD="$1"
...
This is the section that defines the Solr command. We want to change the SOLR_CMD="$1" line to look like this: SOLR_CMD="$1 -c" . This will tell Solr that it should start in cloud mode. NOTE: In production, you would not use the embedded Zookeeper. You would update /etc/default/solr.in.sh to set the ZK_HOST variable to the production Zookeeper instances. When this variable is set, Solr will not start the embedded Zookeeper. So the section of your file should now look like this:
...
case "$1" in
  start|stop|restart|status)
    SOLR_CMD="$1 -c"
...
Now save the file: press the Esc key, then type :wq and press Enter.
Let's stop Solr:
sudo service solr stop
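Before starting Solr again, you can confirm the edit took effect with a quick grep (just a sanity check):
grep -n 'SOLR_CMD=' /etc/init.d/solr
# the start|stop|restart|status case should now show: SOLR_CMD="$1 -c"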
Now we can start Solr using the new script: sudo service solr start
Once the process is started, we can check the status:
sudo service solr status
Found 1 Solr nodes:
Solr process 29426 running on port 8983
{
  "solr_home":"/var/solr/data",
  "version":"6.2.1 43ab70147eb494324a1410f7a9f16a896a59bc6f - shalin - 2016-09-15 05:20:53",
  "startTime":"2016-10-31T22:16:22.116Z",
  "uptime":"0 days, 0 hours, 0 minutes, 14 seconds",
  "memory":"30.2 MB (%6.1) of 490.7 MB",
  "cloud":{
    "ZooKeeper":"localhost:9983",
    "liveNodes":"1",
    "collections":"0"}}
As you can see, the process started successfully and there is a single cloud node running using Zookeeper on port 9983 . Repeat this process for solr-dc02 .
Create Solr dc01 configuration
The solr-dc01 Solr instance will be our source collection for replication. To enable CDCR we need to make a few changes to the solrconfig.xml configuration file. We'll use the data_driven_schema_configs as a base for our configuration. We need to create two different configurations because the source collection has a slightly different configuration than the target collection. On the solr-dc01 VM, copy the data_driven_schema_configs directory to the vagrant home directory. If you are following along, you should still be the vagrant user.
cd /home/vagrant
cp -r /opt/solr/server/solr/configsets/data_driven_schema_configs .
Edit the solrconfig.xml file:
vi data_driven_schema_configs/conf/solrconfig.xml
The first thing we are going to do is update the updateHandler definition; there is only one in the file. Find the section in the configuration file that looks like this:
<updateHandler class="solr.DirectUpdateHandler2">
We are going to change the updateLog portion of the configuration. Remember that we are using vi as the text editor, so edit using the appropriate vi commands. Change this:
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
</updateLog>
to this:
<updateLog class="solr.CdcrUpdateLog">
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
</updateLog>
Now we need to create a new requestHandler definition. Find the section in the configuration file that looks like this:
<!-- A request handler that returns indented JSON by default -->
<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
  </lst>
</requestHandler>
We are going to add our new definition just after the closing requestHandler . Add the following new definition:
<!-- A request handler for cross data center replication -->
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <str name="zkHost">192.168.56.202:9983</str>
    <str name="source">collection1</str>
    <str name="target">collection1</str>
  </lst>
  <lst name="replicator">
    <str name="threadPoolSize">8</str>
    <str name="schedule">1000</str>
    <str name="batchSize">128</str>
  </lst>
  <lst name="updateLogSynchronizer">
    <str name="schedule">1000</str>
  </lst>
</requestHandler>
Your updated file should now look like this:
...
<!-- A request handler that returns indented JSON by default -->
<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
  </lst>
</requestHandler>
<!-- A request handler for cross data center replication -->
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <str name="zkHost">192.168.56.202:9983</str>
    <str name="source">collection1</str>
    <str name="target">collection1</str>
  </lst>
  <lst name="replicator">
    <str name="threadPoolSize">8</str>
    <str name="schedule">1000</str>
    <str name="batchSize">128</str>
  </lst>
  <lst name="updateLogSynchronizer">
    <str name="schedule">1000</str>
  </lst>
</requestHandler>
...
NOTE: The zkHost line should have the IP address and port of the Zookeeper instance of the target collection. Our target collection is on solr-dc02 , so this IP and port point to solr-dc02. When we create our collections in Solr, we'll use the name collection1 .
Now save the file: press the Esc key, then type :wq and press Enter.
Create Solr dc02 configuration
The solr-dc02 Solr instance will be our target collection for replication. To enable CDCR we need to make a few changes to the solrconfig.xml configuration file. As above, we'll use the data_driven_schema_configs as a base for our configuration. On solr-dc02 , copy the data_driven_schema_configs directory to the vagrant home directory. If you are following along, you should still be the vagrant user.
cd /home/vagrant
cp -r /opt/solr/server/solr/configsets/data_driven_schema_configs .
Edit the solrconfig.xml file:
vi data_driven_schema_configs/conf/solrconfig.xml
The first thing we are going to do is update the updateHandler definition; there is only one in the file. Find the section in the configuration file that looks like this:
<updateHandler class="solr.DirectUpdateHandler2">
We are going to change the updateLog portion of the configuration. Remember that we are using vi as the text editor. Change this:
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
</updateLog>
to this:
<updateLog class="solr.CdcrUpdateLog">
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
</updateLog>
Now we need to create a new requestHandler definition. Find the section in the configuration file that looks like this:
<!-- A request handler that returns indented JSON by default -->
<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
  </lst>
</requestHandler>
We are going to add our new definition just after the closing requestHandler . Add the following new definition:
<!-- A request handler for cross data center replication -->
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="buffer">
    <str name="defaultState">disabled</str>
  </lst>
</requestHandler>
<!-- A request handler for cross data center replication -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">cdcr-processor-chain</str>
  </lst>
</requestHandler>
<updateRequestProcessorChain name="cdcr-processor-chain">
  <processor class="solr.CdcrUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
Your updated file should now look like this:
...
<!-- A request handler that returns indented JSON by default -->
<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
  </lst>
</requestHandler>
<!-- A request handler for cross data center replication -->
<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="buffer">
    <str name="defaultState">disabled</str>
  </lst>
</requestHandler>
<!-- A request handler for cross data center replication -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">cdcr-processor-chain</str>
  </lst>
</requestHandler>
<updateRequestProcessorChain name="cdcr-processor-chain">
  <processor class="solr.CdcrUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
...
Now save the file: press the Esc key, then type :wq and press Enter.
You should see how the two configurations differ between the source and target collections.
Create Solr collection on solr-dc01 and solr-dc02
Now we should be able to create a collection using our updated configuration. Because the two configurations are different, make sure you run this command on both the solr-dc01 and solr-dc02 VMs. This creates the collections in our respective data centers.
/opt/solr/bin/solr create -c collection1 -d ./data_driven_schema_configs
NOTE: We are using the collection name ( collection1 ) that the CDCR configuration references. You should see something similar to this:
/opt/solr/bin/solr create -c collection1 -d ./data_driven_schema_configs
Connecting to ZooKeeper at localhost:9983 ...
Uploading /home/vagrant/data_driven_schema_configs/conf for config collection1 to ZooKeeper at localhost:9983
Creating new collection 'collection1' using command:
http://localhost:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=1&replicationFactor=1&maxShardsPerNode=1&collection.configName=collection1
{
  "responseHeader":{
    "status":0,
    "QTime":3684},
  "success":{"192.168.56.101:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":2546},
      "core":"collection1_shard1_replica1"}}}
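You can also confirm the collection from the command line with the Collections API before opening the admin UI shown next (a hedged alternative check):
curl 'http://192.168.56.101:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json&indent=true'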
Now we can verify the collection exists in the Solr admin UI via: http://192.168.56.101:8983/solr/#/~cloud
You should see something similar to this:
As you can see, there is a single collection named collection1 which has 1 shard. You can repeat this process on solr-dc02 and see something similar. NOTE: Remember that solr-dc01 is 192.168.56.101 and solr-dc02 is 192.168.56.202.
Turn on replication
Let's first check the status of replication. Each of these curl commands interacts with the CDCR API. You can check the status of replication using the following command:
curl -XPOST 'http://192.168.56.101:8983/solr/collection1/cdcr?action=STATUS'
You should see something similar to this:
curl -XPOST 'http://192.168.56.101:8983/solr/collection1/cdcr?action=STATUS'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">5</int></lst><lst name="status"><str name="process">stopped</str><str name="buffer">enabled</str></lst>
</response>
You should notice the process is displayed as stopped . We want to start the replication process.
curl -XPOST 'http://192.168.56.101:8983/solr/collection1/cdcr?action=START'
You should see something similar to this:
curl -XPOST 'http://192.168.56.101:8983/solr/collection1/cdcr?action=START'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">41</int></lst><lst name="status"><str name="process">started</str><str name="buffer">enabled</str></lst>
</response>
You should notice the process is now started . Now we need to disable the buffer on the target collection, which buffers the updates by default.
curl -XPOST 'http://192.168.56.202:8983/solr/collection1/cdcr?action=DISABLEBUFFER'
You should see something similar to this:
curl -XPOST 'http://192.168.56.202:8983/solr/collection1/cdcr?action=DISABLEBUFFER'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">7</int></lst><lst name="status"><str name="process">started</str><str name="buffer">disabled</str></lst>
</response>
You should notice the buffer is now disabled .
Add documents to source Solr collection in solr-dc01
Now we will add a couple of sample documents to collection1 in solr-dc01. Run the following command to add 2 sample documents:
curl -XPOST -H 'Content-Type: application/json' 'http://192.168.56.101:8983/solr/collection1/update' --data-binary '{
  "add" : {
    "doc" : {
      "id" : "1",
      "text_ws" : "This is document number one."
    }
  },
  "add" : {
    "doc" : {
      "id" : "2",
      "text_ws" : "This is document number two."
    }
  },
  "commit" : {}
}'
You should notice the commit command in the JSON above. That is because the default solrconfig.xml does not have automatic commits enabled. You should get a response back similar to this:
{"responseHeader":{"status":0,"QTime":362}}
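At this point you can also ask the source cluster how replication is progressing. The CDCR API exposes a QUEUES action that reports how many updates are waiting to be forwarded to the target (a hedged example; the queue size should drop toward 0 as updates reach solr-dc02):
curl -XPOST 'http://192.168.56.101:8983/solr/collection1/cdcr?action=QUEUES'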
Query solr-dc01 collection
Let's query collection1 on solr-dc01 to ensure the documents are present. Run the following command:
curl -XGET 'http://192.168.56.101:8983/solr/collection1/select?q=*:*&indent=true'
You should see something similar to this:
curl -XGET 'http://192.168.56.101:8983/solr/collection1/select?q=*:*&indent=true'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <bool name="zkConnected">true</bool>
  <int name="status">0</int>
  <int name="QTime">17</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="indent">true</str>
  </lst>
</lst>
<result name="response" numFound="2" start="0">
  <doc>
    <str name="id">1</str>
    <str name="text_ws">This is document number one.</str>
    <long name="_version_">1549823582071160832</long></doc>
  <doc>
    <str name="id">2</str>
    <str name="text_ws">This is document number two.</str>
    <long name="_version_">1549823582135123968</long></doc>
</result>
</response>
Query solr-dc02 collection
Before executing the query on solr-dc02 , we need to commit the changes. As mentioned above, automatic commits are not enabled in the default solrconfig.xml . Run the following command:
curl -XPOST -H 'Content-Type: application/json' 'http://192.168.56.202:8983/solr/collection1/update' --data-binary '{
  "commit" : {}
}'
You should see a response similar to this:
{"responseHeader":{"status":0,"QTime":5}}
Now we can run our query:
curl -XGET 'http://192.168.56.202:8983/solr/collection1/select?q=*:*&indent=true'
You should see something similar to this:
curl -XGET 'http://192.168.56.202:8983/solr/collection1/select?q=*:*&indent=true'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <bool name="zkConnected">true</bool>
  <int name="status">0</int>
  <int name="QTime">17</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="indent">true</str>
  </lst>
</lst>
<result name="response" numFound="2" start="0">
  <doc>
    <str name="id">1</str>
    <str name="text_ws">This is document number one.</str>
    <long name="_version_">1549823582071160832</long></doc>
  <doc>
    <str name="id">2</str>
    <str name="text_ws">This is document number two.</str>
    <long name="_version_">1549823582135123968</long></doc>
</result>
</response>
You should notice that you have 2 documents, which have the same id and text_ws content as you pushed to solr-dc01.
Review
If you followed along with this tutorial, you have successfully set up cross data center replication between two SolrCloud clusters. Some important points to keep in mind:
Because this is an active-passive approach, there is only a single source system. If the source system goes down, your ingest will stop, as the other data center is read-only and should not have updates pushed outside of the replication process. Work is being done to make Solr CDCR active-active.
Cross data center communications can be a potential bottleneck. If the cross data center connection cannot sustain sufficient throughput, the target data center(s) can fall behind in replication.
CDCR is not intended nor optimized for bulk inserts. If you have a need to do bulk inserts, first synchronize the indexes between the data centers outside of the replication process. Then enable replication for incremental updates.
For more information, read about Cross Data Center Replication: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462
10-28-2016
08:21 PM
My pleasure. If you believe either of my answers are helpful, "accept" them. This helps the community find answered questions.