HDP Search provides the tools to index data from your HDP cluster into Solr. The connectors shipped with HDP Search let you index data from HDFS, Hive tables, and Spark DataFrames into Solr, after which searching and querying the data is straightforward. The official Hortonworks documentation for HDP Search 4.0 is available here: https://docs.hortonworks.com/HDPDocuments/HDPS/HDPS-4.0.0/bk_solr-search-installation/content/ch_hdp...

This article shows how to set up your HDP cluster with HDP Search using the Solr management pack shipped by Hortonworks, and how to use the Hive, Spark, and HDFS connectors to index data into Solr and query it. (This document assumes that you already have an Ambari 2.7.0 + HDP 3.0 cluster up and running.)


Setup

Install Management Pack:

  • Install mpack:
    • ambari-server install-mpack --mpack=solr-service-mpack-4.0.0.tar.gz
  • Restart ambari-server:
    • ambari-server restart

 

Add Solr to HDP Cluster:

Log in to the Ambari UI, then:

  • Choose Services > Add Service from the left navigation panel
  • Choose Solr on the Choose Services page and click Next
  • On the Assign Masters page, choose the number of Solr servers you need for your cluster (choose 2 or more for SolrCloud) and click Next. No changes are required on the Customize Services page; clicking Next again brings you to the Review page, where you can verify the hosts and the Solr package.
  • If everything looks good, click Deploy to add Solr to your cluster.

You can go to the Solr UI directly from Quick Links. If Kerberos is enabled, you have to enable SPNEGO authentication to access the Solr UI (instructions are in the documentation linked above). Your HDP Search cluster is now up and running.

Connector Usage

The following sections show how to use the connectors shipped with HDP Search to index data into Solr and query it from the Solr UI, covering the Hive, Spark, and HDFS connectors.

Hive Connector:

For the Hive connector to work in this version (HDP Search 4.0), the SerDe jar must be on Hive's classpath. You can set this up as follows:

  • Create a directory 'auxlib' in /usr/hdp/current/hive-server2
  • Copy the SerDe jar to auxlib:
    • cp /opt/lucidworks-hdpsearch/hive/solr-hive-serde-4.0.0.jar /usr/hdp/current/hive-server2/auxlib/
  • Restart Hive

Create a Collection:
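As with the Spark example later in this article, a collection named 'hivecollection' can be created via the Solr Collections API; the shard and replica counts below are just an example, and on a Kerberized cluster you would add --negotiate -u : as in the query examples:

curl -X GET "http://<solr_host>:8983/solr/admin/collections?action=CREATE&name=hivecollection&numShards=2&replicationFactor=1"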

Index Data

Since this is an example, we first create a table in Hive containing the data we want to index into Solr; for your own data, skip this step. We then create an external Solr-backed table and use it to index the data.

Create Table in Hive and insert data:

  • As the hive user, connect to Beeline (on a secure cluster, kinit first; a sample connection is sketched after this list):
CREATE TABLE books (id STRING, cat STRING, title STRING, price FLOAT, in_stock BOOLEAN, author STRING, series STRING, seq INT, genre STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
  • Load data from books.csv in the example directory (/opt/lucidworks-hdpsearch/solr/example/exampledocs/books.csv)
LOAD DATA LOCAL INPATH '/opt/lucidworks-hdpsearch/solr/example/exampledocs/books.csv' OVERWRITE INTO TABLE books;
  • Update the books table (if needed) so that the header line is skipped: ALTER TABLE books SET TBLPROPERTIES ("skip.header.line.count"="1");
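For reference, connecting to Beeline on a Kerberized cluster might look like the following sketch; the host name, keytab path, and realm are placeholders for your environment:

kinit -kt /etc/security/keytabs/hive.service.keytab hive/<hive_host>@EXAMPLE.COM
beeline -u "jdbc:hive2://<hive_host>:10000/default;principal=hive/<hive_host>@EXAMPLE.COM"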

Create External Table for Solr and index data to Solr:

CREATE EXTERNAL TABLE solr_sec (id STRING, cat_s STRING, title_s STRING, price_f STRING, in_stock_b STRING, author_s STRING, series_s STRING, seq_i INT, genre_s STRING) STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler' LOCATION '/tmp/solr' TBLPROPERTIES('solr.zkhost' = '<zk_connection_string>', 'solr.collection' = 'hivecollection', 'solr.query' = '*:*');
    • If this is a secure cluster, you have to provide the path to a jaas-client.conf file containing the service principal and keytab of a user with permission to read from and write to Solr and Hive.
    • To do this, append 'lww.jaas.file' = '/tmp/jaas-client.conf' to the TBLPROPERTIES of the CREATE EXTERNAL TABLE statement above (the resulting clause is sketched after the sample file below).
    • A sample jaas-client.conf looks like:

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/etc/security/keytabs/smokeuser.headless.keytab"
  storeKey=true
  useTicketCache=false
  debug=true
  principal="ambari-qa@EXAMPLE.COM";
};

The file should be owned by solr:hadoop and needs to be copied to every node where a NodeManager is running.
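For clarity, with the jaas file appended, the TBLPROPERTIES clause of the CREATE EXTERNAL TABLE statement above becomes (the ZooKeeper connection string remains a placeholder):

TBLPROPERTIES('solr.zkhost' = '<zk_connection_string>', 'solr.collection' = 'hivecollection', 'solr.query' = '*:*', 'lww.jaas.file' = '/tmp/jaas-client.conf')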

  • Insert the data you want to index from the books table into the Solr external table:
    INSERT OVERWRITE TABLE solr_sec SELECT b.* FROM books b;
  • You can issue a SELECT query to verify the data was inserted.
  • Now you should be able to see your data in the Solr 'hivecollection'. You can search it using the Solr UI or API calls.

Querying Data via Solr UI:

Alternatively, you can issue the same query from the command line on your cluster:

curl -v -i --negotiate -u : "http://<solr_host>:8983/solr/hivecollection/select?q=*:*&wt=json&indent=true"

Change the query parameter 'q' based on what you want to look for.

Spark Connector:

You first need to build the spark-solr jar from source:

  • From /opt/lucidworks-hdpsearch/spark/spark-solr, run:
    • mvn clean package -DskipTests
  • This creates the spark-solr jars in the target directory.
  • Now start the Spark shell with the shaded jar:
    • /usr/hdp/current/spark2-thriftserver/bin/spark-shell --jars target/spark-solr-3.5.6-shaded.jar

(If this is a secure cluster, also point the driver and executors at jaas-client.conf by adding --conf 'spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/tmp/jaas-client.conf' --conf 'spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/tmp/jaas-client.conf'. See the Hive section for details on jaas-client.conf.)

Create Collection:

Just as we created 'hivecollection', create a 'sparkcollection':

curl -X GET "http://<solr_host>:8983/solr/admin/collections?action=CREATE&name=sparkcollection&numShards=2&replicationFactor=1"

Index data to Solr:

(Below is the same example you will find in the Lucidworks documentation.)

The CSV file used in this sample is the one referenced in the Lucidworks documentation. Move it to the /tmp directory in HDFS and read it as a Spark DataFrame, as sketched below.

Then index the data into Solr and commit it; the commit can be triggered with http://<solr_host>:8983/solr/sparkcollection/update?commit=true
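A minimal sketch of these steps from the Spark shell follows; the CSV file name, the gen_uniq_key option, and the ZooKeeper connection string are placeholders/assumptions for your environment:

// Read the CSV from HDFS into a DataFrame (file name is a placeholder).
val csvDF = spark.read
  .option("header", "true")        // the first line contains column names
  .option("inferSchema", "true")   // let Spark infer the column types
  .csv("hdfs:///tmp/<your_csv_file>.csv")

// Write the DataFrame into 'sparkcollection' using the spark-solr data source.
val writeOptions = Map(
  "zkhost" -> "<zk_connection_string>",
  "collection" -> "sparkcollection",
  "gen_uniq_key" -> "true"         // generate a unique key if the data has no id field
)
csvDF.write.format("solr").options(writeOptions).mode("overwrite").save()

After the write completes, hit the commit URL above so the documents become searchable.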

  1. Run a query on the Solr UI to validate the setup: a *:* query should return all 999 indexed documents.
  2. A query for a particular pickup location should return a single document.
  3. You can also read the data back from Solr into Spark, for example projecting the tip and fare columns or the total amount and toll amount columns, as sketched below.
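A minimal sketch of reading the data back from Solr follows; the projected column names (tip_amount, fare_amount, total_amount, tolls_amount) are assumptions based on the descriptions above:

// Load the collection into a DataFrame via the spark-solr data source.
val solrDF = spark.read
  .format("solr")
  .options(Map("zkhost" -> "<zk_connection_string>", "collection" -> "sparkcollection"))
  .load()

// Example projections (column names assumed from the steps above).
solrDF.select("tip_amount", "fare_amount").show(5)
solrDF.select("total_amount", "tolls_amount").show(5)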

HDFS Connector:

The HDFS connector can index files of the following formats or contents:

  • CSV
  • Zip
  • WARC
  • Sequence
  • XML
  • SolrXML
  • Directories
  • Regex (lets you define a regular expression to filter and extract content from the incoming data)
  • Grok (indexes incoming data based on a grok configuration)

The job jar (which is the HDFS connector) provides different IngestMappers to handle these file types and formats. As an example, let's index the same books.csv from the Hive example using the CSVIngestMapper. This assumes you have already created a collection, as explained in the examples above.

Suppose books.csv resides in /user/solr/csvDir in HDFS. The first command below indexes this data into Solr; the remaining commands show the usage of the other IngestMapper types. For a secure cluster, add -Dlww.jaas.file=/tmp/jaas-client.conf to the commands and kinit as needed.

CSVIngestMapper

hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -DcsvDelimiter=, -DcsvFirstLineComment=true -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c csvCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/csvDir/* -zk <zk_connection_string>

RegexIngestMapper

hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -Dcom.lucidworks.hadoop.ingest.RegexIngestMapper.regex=".([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])." -Dcom.lucidworks.hadoop.ingest.RegexIngestMapper.groups_to_fields="0=data,1=ip1,2=ip2,3=ip3,4=ip4" -cls com.lucidworks.hadoop.ingest.RegexIngestMapper -c regexCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/regexDir/* -zk <zk_connection_string>

GrokIngestMapper

hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -Dgrok.uri=/tmp/grok.conf -cls com.lucidworks.hadoop.ingest.GrokIngestMapper -c grokCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/grokDir/*8.log -zk <zk_connection_string>

Sample grok.conf:

input {
  stdin { type => example }
}
filter {
  grok {
    match => [ "message", "%{IP:ip} %{WORD:log_code} %{GREEDYDATA:log_message}" ]
    add_field => [ "received_from_field", "%{ip}" ]
    add_field => [ "message_code", "%{log_code}" ]
    add_field => [ "message_field", "%{log_message}" ]
  }
}
output {
  stdout { codec => rubydebug }
}

SequenceFileIngestMapper

hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.SequenceFileIngestMapper -c seqCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/seqDir/*.seq -zk <zk_connection_string>

SolrXMLIngestMapper

hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.SolrXMLIngestMapper -c solrxmlCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/solrXmlDir/* -zk <zk_connection_string>

WarcIngestMapper

hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.WarcIngestMapper -c warcCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/warcDir/* -zk <zk_connection_string>

ZipIngestMapper

hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.ZipIngestMapper -c zipCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/test-zip/* -zk <zk_connection_string>

DirectoryIngestMapper

hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper -c dirCollection -i /user/solr/test-documents/hadoop-dir/* -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -zk <zk_connection_string>

XMLIngestMapper

hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.xml.start=root -Dlww.xml.end=root -Dlww.jaas.file=/tmp/jaas-client.conf -Dlww.xml.docXPathExpr=//doc -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.XMLIngestMapper -c xmlCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/xmlDir/* -zk <zk_connection_string> 

You should be able to query these collections via the Solr UI or the APIs, just as in the previous examples.
