Created on 10-19-2018 08:55 PM - edited on 01-27-2021 09:38 AM by VidyaSargur
HDP Search provides the tools to index data from your HDP cluster into Solr. You can use the connectors shipped with HDP Search to index data from HDFS, Hive tables, and Spark DataFrames into Solr. Once your data is in Solr, searching and querying become simpler. The official Hortonworks documentation for HDP Search 4.0 is available here: https://docs.hortonworks.com/HDPDocuments/HDPS/HDPS-4.0.0/bk_solr-search-installation/content/ch_hdp...
This article shows how to set up your HDP cluster with HDP Search using the Solr management pack shipped by Hortonworks. It also covers how to use the Hive, Spark, and HDFS connectors to index data into Solr and query it. (This document assumes you already have an Ambari 2.7.0 + HDP 3.0 cluster up and running.)
Navigate and Log in to the Ambari UI
You can go to the Solr UI directly from Quick Links. If Kerberos is enabled, you must enable SPNEGO authentication to access the Solr UI (instructions are mentioned in this document). Now you have your HDP Search cluster up and running.
The following shows how to use the connectors shipped with HDP Search to index data into Solr and query it via the Solr UI. The sections below cover the Hive, Spark, and HDFS connectors.
For the Hive connector to work in this version (HDP Search 4.0), the SerDe JAR must be on Hive's classpath. You can do this as follows:
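A minimal sketch of registering the SerDe JAR, assuming the jar lives under the default HDP Search install directory (the exact path and jar version are assumptions; verify them on your cluster):

```shell
# Hedged sketch: make the Lucidworks Hive SerDe visible to Hive.
# The jar path below is an assumption -- check your HDP Search install dir.
SERDE_JAR="/opt/lucidworks-hdpsearch/hive/solr-hive-serde-4.0.0.jar"

# Per-session option: print the ADD JAR statement to run inside Hive/Beeline.
echo "ADD JAR ${SERDE_JAR};"

# Permanent alternative (not executed here): in Ambari -> Hive -> Configs, set
#   hive.aux.jars.path=/opt/lucidworks-hdpsearch/hive/solr-hive-serde-4.0.0.jar
# and restart HiveServer2 so the jar is on the classpath for every session.
```

Either approach works; the Ambari config change survives restarts, while ADD JAR only lasts for the current session.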
Create a Collection:
http://<solr_host>:8983/solr/admin/collections?action=CREATE&name=hivecollection&numShards=2&replica...
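The URL above is truncated; a sketch of the complete call follows (replicationFactor=1 is an assumption, mirroring the sparkcollection example later in this article):

```shell
# Sketch of the full collection-creation call; replicationFactor=1 is an assumption.
SOLR_HOST="solr-host.example.com"   # replace with your Solr host
URL="http://${SOLR_HOST}:8983/solr/admin/collections?action=CREATE&name=hivecollection&numShards=2&replicationFactor=1"
echo "${URL}"
# curl -X GET "${URL}"              # run this against your cluster
```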
Index Data
Since this is an example, we first create a Hive table whose data we want indexed in Solr; you can skip this step for your real data. Then create an external table backed by Solr and proceed with indexing.
Create Table in Hive and insert data:
CREATE TABLE books (
  id STRING, cat STRING, title STRING, price FLOAT, in_stock BOOLEAN,
  author STRING, series STRING, seq INT, genre STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/opt/lucidworks-hdpsearch/solr/example/exampledocs/books.csv' OVERWRITE INTO TABLE books;
Create External Table for Solr and index data to Solr:
CREATE EXTERNAL TABLE solr_sec (
  id STRING, cat_s STRING, title_s STRING, price_f STRING, in_stock_b STRING,
  author_s STRING, series_s STRING, seq_i INT, genre_s STRING)
STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'
LOCATION '/tmp/solr'
TBLPROPERTIES('solr.zkhost' = '<zk_connection_string>',
              'solr.collection' = 'hivecollection',
              'solr.query' = '*:*');
For a Kerberized cluster, create a JAAS configuration file (for example, /tmp/jaas-client.conf) with the following contents:
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/etc/security/keytabs/smokeuser.headless.keytab"
  storeKey=true
  useTicketCache=false
  debug=true
  principal="ambari-qa@EXAMPLE.COM";
};
The owner of the file is solr:hadoop. This file needs to be copied to all nodes where a NodeManager is running.
INSERT OVERWRITE TABLE solr_sec SELECT b.* FROM books b;
Querying Data via Solr UI:
You can also issue the call below from your cluster's command line:
curl -v -i --negotiate -u : "http://<solr_host>:8983/solr/hivecollection/select?q=*:*&wt=json&indent=true"
Change the query parameter 'q' based on what you want to search for.
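As an illustration, a filtered query might look like the following (the field name and value here are assumptions based on the books table above, not taken from the original article):

```shell
# Sketch: query hivecollection for a specific field value instead of *:*.
SOLR_HOST="solr-host.example.com"   # replace with your Solr host
Q="cat_s:book"                      # illustrative field:value pair
URL="http://${SOLR_HOST}:8983/solr/hivecollection/select?q=${Q}&wt=json&indent=true"
echo "${URL}"
# curl -v -i --negotiate -u : "${URL}"   # secure-cluster form, as above
```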
For the Spark connector, you first need to build the spark-solr JAR from source.
(If this is a secure cluster, add jaas-client.conf via --conf 'spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/tmp/jaas-client.conf' --conf 'spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/tmp/jaas-client.conf'. See the Hive section for details on jaas-client.conf.)
Create Collection:
Just like we created hivecollection, create a 'sparkcollection':
curl -X GET "http://<solr_host>:8983/solr/admin/collections?action=CREATE&name=sparkcollection&numShards=2&replicationFactor=1"
Index data to Solr:
(This is the same example you will find in the Lucidworks documentation.)
The CSV file used in this sample is located here. Move the CSV file to the /tmp HDFS directory and read it as a Spark DataFrame.
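The code for this step did not survive in the article; a minimal sketch of what it likely looks like follows. The file name, CSV options, and jar path are assumptions, and the 'solr' data source is provided by the spark-solr library:

```shell
# Hedged sketch: write a small Scala script for spark-shell that reads the CSV
# from HDFS and indexes it into sparkcollection. File name and options are assumptions.
cat > /tmp/index_csv.scala <<'EOF'
// Read the CSV from HDFS as a Spark DataFrame
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("/tmp/data.csv")
// Write it to Solr via the spark-solr data source
df.write.format("solr")
  .option("zkhost", "<zk_connection_string>")
  .option("collection", "sparkcollection")
  .save()
EOF

# Launch spark-shell with the spark-solr jar you built (path is an assumption):
# spark-shell --jars /opt/spark-solr/spark-solr-shaded.jar -i /tmp/index_csv.scala
```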
After indexing this data to Solr, commit so the documents become searchable: http://<solr_host>:8983/solr/sparkcollection/update?commit=true
The HDFS connector provides the ability to index files of several formats (CSV, log files via regex or Grok patterns, sequence files, SolrXML, WARC, zip archives, directories, and XML).
The job JAR (which is the HDFS connector) ships different IngestMappers to handle these file types. As an example, let's index the same books.csv from the Hive example using CSVIngestMapper. This assumes you have already created a collection, as explained in the examples above.
Suppose books.csv resides in /user/solr/csvDir in HDFS. The first command below indexes this data into Solr; the remaining commands show the other IngestMapper types. For a secure cluster, add -Dlww.jaas.file=/tmp/jaas-client.conf to the commands and run kinit as needed.
CSVIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -DcsvDelimiter="," -DcsvFirstLineComment=true -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c csvCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/csvDir/* -zk <zk_connection_string>
RegexIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -Dcom.lucidworks.hadoop.ingest.RegexIngestMapper.regex=".([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])." -Dcom.lucidworks.hadoop.ingest.RegexIngestMapper.groups_to_fields="0=data,1=ip1,2=ip2,3=ip3,4=ip4" -cls com.lucidworks.hadoop.ingest.RegexIngestMapper -c regexCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/regexDir/* -zk <zk_connection_string>
GrokIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -Dgrok.uri=/tmp/grok.conf -cls com.lucidworks.hadoop.ingest.GrokIngestMapper -c grokCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/grokDir/*.log -zk <zk_connection_string>
Sample grok.conf:
input {
  stdin { type => example }
}
filter {
  grok {
    match => [ "message", "%{IP:ip} %{WORD:log_code} %{GREEDYDATA:log_message}" ]
    add_field => [ "received_from_field", "%{ip}" ]
    add_field => [ "message_code", "%{log_code}" ]
    add_field => [ "message_field", "%{log_message}" ]
  }
}
output {
  stdout { codec => rubydebug }
}
SequenceFileIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.SequenceFileIngestMapper -c seqCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/seqDir/*.seq -zk <zk_connection_string>
SolrXMLIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.SolrXMLIngestMapper -c solrxmlCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/solrXmlDir/* -zk <zk_connection_string>
WarcIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.WarcIngestMapper -c warcCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/warcDir/* -zk <zk_connection_string>
ZipIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.ZipIngestMapper -c zipCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/test-zip/* -zk <zk_connection_string>
DirectoryIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper -c dirCollection -i /user/solr/test-documents/hadoop-dir/* -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -zk <zk_connection_string>
XMLIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.xml.start=root -Dlww.xml.end=root -Dlww.jaas.file=/tmp/jaas-client.conf -Dlww.xml.docXPathExpr=//doc -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.XMLIngestMapper -c xmlCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/xmlDir/* -zk <zk_connection_string>
You should be able to issue queries to these collections via the Solr UI or the APIs, just as in the previous examples.