Member since
10-21-2015
26
Posts
77
Kudos Received
3
Solutions
10-23-2018
06:50 AM
Please see the latest documentation for HDP Search 4.0 at https://docs.hortonworks.com/HDPDocuments/HDPS/HDPS-4.0.0/bk_solr-search-installation/content/ch_hdp-search.html . If you are interested, this article explains how to add Solr to an HDP 3 cluster: https://community.hortonworks.com/articles/224593/hdp-search-40-deployment-and-basic-connector-usage.html . Hope this helps.
10-19-2018
08:55 PM
6 Kudos
HDP Search provides the tools to index data from your HDP cluster to Solr. The connectors shipped with HDP Search let you index data from HDFS, Hive tables, and Spark DataFrames. Once your data is in Solr, searching and querying it is simpler. The official Hortonworks documentation for HDP Search 4.0 is here: https://docs.hortonworks.com/HDPDocuments/HDPS/HDPS-4.0.0/bk_solr-search-installation/content/ch_hdp-search.html
This article shows how to set up HDP Search on your HDP cluster using the Solr management pack shipped by Hortonworks. It also shows how to use the Hive, Spark, and HDFS connectors to index data to Solr and then query it from Solr. (This document assumes that you already have an Ambari 2.7.0 + HDP 3.0 cluster up and running.)
Setup
Install Management Pack:
Download the Solr service mpack on the Ambari server node:
wget http://public-repo-1.hortonworks.com/HDP-SOLR/hdp-solr-ambari-mp/solr-service-mpack-4.0.0.tar.gz
Install mpack:
ambari-server install-mpack --mpack=solr-service-mpack-4.0.0.tar.gz
Restart ambari-server:
ambari-server restart
Add Solr to HDP Cluster:
Log in to the Ambari UI.
Choose Services - Add Service from the left navigation panel.
On the Choose Services page, select Solr and click Next.
On the Assign Masters page, choose how many Solr servers you need for your cluster. Choose 2 or more for SolrCloud, then click Next.
On the Customize Services page, no changes are required. Click Next.
On the Review page, verify the hosts and the Solr package.
If everything looks good, click Deploy to add Solr to your cluster.
You can open the Solr UI directly from Quick Links. If Kerberos is enabled, you must enable SPNEGO authentication to access the Solr UI (instructions are mentioned in this doc). Your HDP Search cluster is now up and running.
Connector Usage
The following sections show how to use the connectors shipped with HDP Search (Hive, Spark, and HDFS) to index data to Solr, and how to query that data using the Solr UI.
Hive Connector:
For the Hive connector to work in this version (HDP Search 4.0), the SerDe jar must be on Hive's classpath. You can do this as follows:
Create a directory 'auxlib' in /usr/hdp/current/hive-server2
Copy the SerDe jar to auxlib:
cp /opt/lucidworks-hdpsearch/hive/solr-hive-serde-4.0.0.jar /usr/hdp/current/hive-server2/auxlib/
Restart Hive
Create a Collection:
Call the Collections API URL below (for example from a browser or with curl) to create a new 'hivecollection' in Solr with 2 shards and a replication factor of 1.
http://<solr_host>:8983/solr/admin/collections?action=CREATE&name=hivecollection&numShards=2&replicationFactor=1
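The CREATE call above is just a URL with four query parameters, so it can be assembled programmatically when you create several collections. The following is a minimal sketch, not part of HDP Search; the function name create_collection_url and the host solr1.example.com are hypothetical, and it only builds the URL without sending any request.

```python
from urllib.parse import urlencode


def create_collection_url(solr_host, name, num_shards=2, replication_factor=1, port=8983):
    """Build a Solr Collections API CREATE URL like the one shown above."""
    params = urlencode({
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    })
    return f"http://{solr_host}:{port}/solr/admin/collections?{params}"


# solr1.example.com is a placeholder host name.
print(create_collection_url("solr1.example.com", "hivecollection"))
```

The same helper can be reused for the Spark example by passing a different collection name.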
Index Data
Because this is an example, we first create a Hive table that holds the data to be indexed in Solr. With real data you already have such a table, so you can skip that step, create the external table for Solr, and proceed with indexing.
Create Table in Hive and insert data:
As the hive user, connect to Beeline (run kinit first if this is a secure cluster):
CREATE TABLE books (id STRING, cat STRING, title STRING, price FLOAT, in_stock BOOLEAN, author STRING, series STRING, seq INT, genre STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Load data from books.csv in the example directory (/opt/lucidworks-hdpsearch/solr/example/exampledocs/books.csv)
LOAD DATA LOCAL INPATH '/opt/lucidworks-hdpsearch/solr/example/exampledocs/books.csv' OVERWRITE INTO TABLE books;
Update the books table (if needed) so that the header row with the column names is skipped:
ALTER TABLE books SET TBLPROPERTIES ("skip.header.line.count"="1");
Create External Table for Solr and index data to Solr:
CREATE EXTERNAL TABLE solr_sec (id STRING, cat_s STRING, title_s STRING, price_f STRING, in_stock_b STRING, author_s STRING, series_s STRING, seq_i INT, genre_s STRING) STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler' LOCATION '/tmp/solr' TBLPROPERTIES('solr.zkhost' = '<zk_connection_string>', 'solr.collection' = 'hivecollection', 'solr.query' = '*:*');
On a secure cluster you must also specify the path to a jaas-client.conf file that contains the service principal and keytab of a user who has permission to read from and write to both Solr and Hive.
To do so, add 'lww.jaas.file' = '/tmp/jaas-client.conf' to the TBLPROPERTIES of the CREATE EXTERNAL TABLE command above.
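To make the difference between the secure and non-secure variants concrete, here is a minimal sketch that renders the TBLPROPERTIES clause with and without 'lww.jaas.file'. The helper name solr_tblproperties and the ZooKeeper string zk1:2181/solr are hypothetical, and the sketch does not escape quotes inside values.

```python
def solr_tblproperties(zk_host, collection, query="*:*", jaas_file=None):
    """Render the TBLPROPERTIES clause for the Solr external table.

    jaas_file is only added when given, which mirrors the secure-cluster case.
    Values are inserted as-is; embedded single quotes are not escaped.
    """
    props = {
        "solr.zkhost": zk_host,
        "solr.collection": collection,
        "solr.query": query,
    }
    if jaas_file:
        props["lww.jaas.file"] = jaas_file
    pairs = ", ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"TBLPROPERTIES({pairs})"


# zk1:2181/solr is a placeholder ZooKeeper connection string.
print(solr_tblproperties("zk1:2181/solr", "hivecollection"))
print(solr_tblproperties("zk1:2181/solr", "hivecollection", jaas_file="/tmp/jaas-client.conf"))
```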
A sample jaas-client.conf looks like:
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="/etc/security/keytabs/smokeuser.headless.keytab"
storeKey=true
useTicketCache=false
debug=true
principal="ambari-qa@EXAMPLE.COM";
};
The owner of the file is solr:hadoop. This file needs to be copied to all nodes where a NodeManager is running.
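Because the same file has to exist on every NodeManager host, it can help to generate it from two inputs instead of editing it by hand. The sketch below renders the sample shown above from a keytab path and a principal; the function name render_jaas_client_conf is hypothetical, and distributing the file and setting its ownership are left out.

```python
def render_jaas_client_conf(keytab, principal, debug=True):
    """Return the text of a jaas-client.conf with the same entries as the sample."""
    lines = [
        "Client {",
        "  com.sun.security.auth.module.Krb5LoginModule required",
        "  useKeyTab=true",
        f'  keyTab="{keytab}"',
        "  storeKey=true",
        "  useTicketCache=false",
        f"  debug={'true' if debug else 'false'}",
        f'  principal="{principal}";',
        "};",
    ]
    return "\n".join(lines) + "\n"


# Values taken from the sample above.
print(render_jaas_client_conf(
    "/etc/security/keytabs/smokeuser.headless.keytab",
    "ambari-qa@EXAMPLE.COM",
))
```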
Insert data from books table which you want to index to solr external table:
INSERT OVERWRITE TABLE solr_sec SELECT b.* FROM books b;
You can issue a SELECT query on solr_sec to verify that the data was inserted.
You should now see your data in the Solr collection 'hivecollection'. You can search it using the Solr UI or API calls.
Querying Data via Solr UI:
You can issue the below call on your cluster command line as well:
curl -v -i --negotiate -u : "http://<solr_host>:8983/solr/hivecollection/select?q=*:*&wt=json&indent=true"
Change query ‘q’ based on what you want to look for.
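When you change 'q', the value needs to be URL-encoded if it contains characters such as ':' or spaces. The following minimal sketch builds the select URL used in the curl call above. The function name select_url is hypothetical, the query genre_s:fantasy is only an illustration against the genre_s column defined earlier, and the sketch does not perform the request or the Kerberos negotiation.

```python
from urllib.parse import urlencode


def select_url(solr_host, collection, q="*:*", port=8983):
    """Build a Solr select URL; urlencode percent-encodes the query string."""
    params = urlencode({"q": q, "wt": "json", "indent": "true"})
    return f"http://{solr_host}:{port}/solr/{collection}/select?{params}"


# solr1.example.com is a placeholder host name.
print(select_url("solr1.example.com", "hivecollection"))
print(select_url("solr1.example.com", "hivecollection", q="genre_s:fantasy"))
```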
Spark Connector:
First build the spark-solr jar from source.
From /opt/lucidworks-hdpsearch/spark/spark-solr, run:
mvn clean package -DskipTests
This creates the spark-solr jars in the target directory.
Now start the Spark shell with the shaded jar:
/usr/hdp/current/spark2-thriftserver/bin/spark-shell --jars ./spark-solr-3.5.6-shaded.jar
On a secure cluster, also pass jaas-client.conf to the driver and the executors by adding: --conf 'spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/tmp/jaas-client.conf' --conf 'spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/tmp/jaas-client.conf'. See the Hive section for details on jaas-client.conf.
Create Collection:
Just as we created hivecollection, create a 'sparkcollection':
curl -X GET "http://<solr_host>:8983/solr/admin/collections?action=CREATE&name=sparkcollection&numShards=2&replicationFactor=1"
Index data to Solr:
(Below is the same example you’ll find in Lucidworks doc as well)
The CSV file used in this sample is located here. Move the CSV file to the /tmp HDFS directory. Read it as a Spark DataFrame as shown below.
Index this data to Solr using the command: http://<solr_host>:8983/solr/sparkcollection/update?commit=true
Run a query on Solr UI to validate the setup: *:* returns all 999 docs indexed
Query for a particular pickup location returns one document
Reading data from Solr
Read from Spark (tip and fare):
Read from Spark (total amount and toll amount):
Hdfs Connector:
The HDFS connector can index files of the following formats or contents:
CSV
Zip
Warc
Sequence
XML
SolrXML
Directories
Regex (lets you define a regular expression on the incoming data and filter content)
Grok (indexes incoming data based on a grok configuration)
The job jar (which is the HDFS connector) provides a different IngestMapper for each of these file types and formats. As an example, let's index the same books.csv we used in the Hive example with CSVIngestMapper. This example assumes the collection already exists; creating one is explained in the examples above.
Suppose books.csv resides in /user/solr/csvDir in HDFS. The first command below indexes this data to Solr. The remaining commands show the other IngestMappers. On a secure cluster, add -Dlww.jaas.file=/tmp/jaas-client.conf to each command and run kinit as needed.
CSVIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -DcsvDelimiter=@ -DcsvFirstLineComment=true -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c csvCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/csvDir/* -zk <zk_connection_string>
RegexIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -Dcom.lucidworks.hadoop.ingest.RegexIngestMapper.regex=".([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])." -Dcom.lucidworks.hadoop.ingest.RegexIngestMapper.groups_to_fields="0=data,1=ip1,2=ip2,3=ip3,4=ip4" -cls com.lucidworks.hadoop.ingest.RegexIngestMapper -c regexCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/regexDir/* -zk <zk_connection_string>
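To see what the regex and the groups_to_fields setting in the command above describe, the sketch below applies the same pattern to one sample line with Python's re module and maps each capture group to its field name. This only illustrates the mapping; it is not the connector's implementation, and the sample line and the function name extract_fields are hypothetical.

```python
import re

# Same pattern and mapping as in the RegexIngestMapper command above.
PATTERN = (
    r".([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])"
    r"\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])."
)
GROUPS_TO_FIELDS = "0=data,1=ip1,2=ip2,3=ip3,4=ip4"


def extract_fields(line, pattern=PATTERN, mapping=GROUPS_TO_FIELDS):
    """Return {field_name: matched_text} for the first match in line."""
    match = re.search(pattern, line)
    if match is None:
        return {}
    fields = {}
    for pair in mapping.split(","):
        group, name = pair.split("=")
        fields[name] = match.group(int(group))
    return fields


# Hypothetical input line containing an IPv4 address.
print(extract_fields("host 192.168.1.10 up"))
```

Group 0 is the whole match, including the single character that the leading and trailing '.' consume, and groups 1 to 4 are the four octets.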
GrokIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -Dgrok.uri=/tmp/grok.conf -cls com.lucidworks.hadoop.ingest.GrokIngestMapper -c grokCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/grokDir/*8.log -zk <zk_connection_string>
Sample grok.conf:
input { stdin { type => example } }
filter {
  grok {
    match => [ "message", "%{IP:ip} %{WORD:log_code} %{GREEDYDATA:log_message}" ]
    add_field => [ "received_from_field", "%{ip}" ]
    add_field => [ "message_code", "%{log_code}" ]
    add_field => [ "message_field", "%{log_message}" ]
  }
}
output {
  stdout { codec => rubydebug }
}
SequenceFileIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.SequenceFileIngestMapper -c seqCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/seqDir/*.seq -zk <zk_connection_string>
SolrXMLIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.SolrXMLIngestMapper -c solrxmlCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/solrXmlDir/* -zk <zk_connection_string>
WarcIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.WarcIngestMapper -c warcCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/warcDir/* -zk <zk_connection_string>
ZipIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.ZipIngestMapper -c zipCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/test-zip/* -zk <zk_connection_string>
DirectoryIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper -c dirCollection -i /user/solr/test-documents/hadoop-dir/* -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -zk <zk_connection_string>
XMLIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.xml.start=root -Dlww.xml.end=root -Dlww.jaas.file=/tmp/jaas-client.conf -Dlww.xml.docXPathExpr=//doc -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.XMLIngestMapper -c xmlCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/xmlDir/* -zk <zk_connection_string>
You should be able to query these collections from the Solr UI or through the API, just as in the previous examples.
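All of the IngestJob commands above share the same skeleton and differ only in the mapper class, the collection, the input path, and a few -D properties. The minimal sketch below assembles such a command as an argument list. The function name ingest_command and the ZooKeeper string zk1:2181/solr are hypothetical; the sketch only builds and prints the command and does not run Hadoop.

```python
import shlex

JOB_JAR = "solr-hadoop-job-4.0.0.jar"
INGEST_JOB = "com.lucidworks.hadoop.ingest.IngestJob"
OUTPUT_FORMAT = "com.lucidworks.hadoop.io.LWMapRedOutputFormat"


def ingest_command(mapper, collection, input_path, zk, extra_props=None, jaas_file=None):
    """Build the argument list for one IngestJob run, following the commands above."""
    props = {"lww.commit.on.close": "true"}
    props.update(extra_props or {})
    if jaas_file:
        # Only needed on a secure cluster.
        props["lww.jaas.file"] = jaas_file
    args = ["hadoop", "jar", JOB_JAR, INGEST_JOB]
    args += [f"-D{key}={value}" for key, value in props.items()]
    args += [
        "-cls", f"com.lucidworks.hadoop.ingest.{mapper}",
        "-c", collection,
        "-of", OUTPUT_FORMAT,
        "-i", input_path,
        "-zk", zk,
    ]
    return args


# zk1:2181/solr is a placeholder ZooKeeper connection string.
cmd = ingest_command(
    "CSVIngestMapper", "csvCollection", "/user/solr/csvDir/*", "zk1:2181/solr",
    extra_props={"csvDelimiter": "@", "csvFirstLineComment": "true"},
)
print(shlex.join(cmd))
```

shlex.join quotes the input glob, so the local shell passes it through unexpanded when the printed command is pasted into a terminal.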
06-29-2018
10:07 PM
7 Kudos
Could you try escaping it with \u002f? So if you want a/b, try using a\\u002fb.
10-12-2017
08:22 PM
6 Kudos
Could you please try /api/v1/clusters/clusterName/hosts/hostName/host_components/componentName, adding the service and host name in the request body: {"{SERVICE}", serviceName, "{HOST}", hostName}. This article has sample calls that walk you through the process: https://cwiki.apache.org/confluence/display/AMBARI/Adding+a+New+Service+to+an+Existing+Cluster Hope this helps.
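As a small illustration of how the three placeholders fit into that path, here is a minimal sketch that fills them in and percent-encodes each segment. The function name host_component_path and the example values (cluster c1, host node1.example.com, component SOLR_SERVER) are hypothetical; the sketch builds only the path and leaves the HTTP method, headers, and body to the linked article.

```python
from urllib.parse import quote


def host_component_path(cluster, host, component):
    """Fill clusterName, hostName and componentName into the Ambari API path."""
    return (
        f"/api/v1/clusters/{quote(cluster, safe='')}"
        f"/hosts/{quote(host, safe='')}"
        f"/host_components/{quote(component, safe='')}"
    )


# All three values are placeholders.
print(host_component_path("c1", "node1.example.com", "SOLR_SERVER"))
```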
09-22-2017
02:44 AM
2 Kudos
Thanks @kramakrishnan. This works!
09-21-2017
10:19 PM
9 Kudos
I am looking for an API call that returns the property differences between two config versions of a service. It would be great if it could also cover config versions within a config group.
Labels: Apache Ambari
06-29-2017
08:26 PM
5 Kudos
@aswathy, by default you should be able to log in with admin/admin. Could you please try ...
05-02-2017
06:29 PM
1 Kudo
@cduby, which version of HDCloud are you using?
03-27-2017
07:57 PM
2 Kudos
Thanks @Ram Venkatesh. After registering the metastore I can see the named entry in my JSON. Thank you!