Member since 10-21-2015 · 26 Posts · 77 Kudos Received · 3 Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 626 | 06-29-2018 10:07 PM
 | 685 | 10-12-2017 08:22 PM
 | 10608 | 06-29-2017 08:26 PM
10-23-2018
06:50 AM
Please see the latest doc on HDP Search 4.0 at https://docs.hortonworks.com/HDPDocuments/HDPS/HDPS-4.0.0/bk_solr-search-installation/content/ch_hdp-search.html . In case you are interested, here is an article with details on how to add Solr to an HDP-3 cluster: https://community.hortonworks.com/articles/224593/hdp-search-40-deployment-and-basic-connector-usage.html. Hope this helps.
10-19-2018
08:55 PM
6 Kudos
HDP Search provides the tools to index data from your HDP cluster into Solr. You can use the connectors shipped with HDP Search to index data from HDFS, Hive tables, and Spark DataFrames into Solr. Once your data is in Solr, searching and querying become simpler. You can find the official Hortonworks documentation for HDP Search 4.0 here: https://docs.hortonworks.com/HDPDocuments/HDPS/HDPS-4.0.0/bk_solr-search-installation/content/ch_hdp-search.html
This article shows how to set up your HDP cluster with HDP Search using the Solr management pack shipped by Hortonworks. It also covers how to use the Hive, Spark, and HDFS connectors to index data into Solr and query it. (This document assumes you already have an Ambari-2.7.0 + HDP-3.0 cluster up and running.)
Setup
Install Management Pack:
Download Solr service mpack on Ambari server node:
wget http://public-repo-1.hortonworks.com/HDP-SOLR/hdp-solr-ambari-mp/solr-service-mpack-4.0.0.tar.gz
Install mpack:
ambari-server install-mpack --mpack=solr-service-mpack-4.0.0.tar.gz
Restart ambari-server:
ambari-server restart
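If you want to double-check that the mpack was picked up before restarting, one optional check (a sketch, assuming a default Ambari server installation layout) is to list the directory where management packs are unpacked:
# installed management packs are unpacked under the Ambari server resources directory (default path assumed)
ls /var/lib/ambari-server/resources/mpacks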
Add Solr to HDP Cluster:
Navigate to and log in to the Ambari UI
Choose Services > Add Service from the left navigation panel
Choose Solr on the Choose Services page and click Next
On the Assign Masters page, choose the number of Solr servers you need for your cluster (choose 2 or more for SolrCloud) and click Next. No changes are required on the Customize Services page. Next you land on the Review page, where you can verify the hosts and the Solr package.
If this looks good, click Deploy to get Solr added to your cluster.
You can open the Solr UI directly from Quick Links. If Kerberos is enabled, you have to enable SPNEGO authentication to access the Solr UI (instructions are in the HDP Search doc linked above). Now you have your HDP Search cluster up and running!
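On a Kerberized cluster, a quick sanity check (a sketch, assuming you already have a valid ticket for a user allowed to reach Solr) is to list the collections over SPNEGO with curl:
kinit <your_principal>
curl -i --negotiate -u : "http://<solr_host>:8983/solr/admin/collections?action=LIST&wt=json"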
Connector Usage
The following shows how you can use the connectors shipped with HDP Search to index data into Solr and query it via the Solr UI. Hive, Spark, and HDFS connector usage is covered below.
Hive Connector:
For the Hive connector to work in this version (HDP Search 4.0), the SerDe jar must be on Hive's classpath. You can do this as follows:
Create an 'auxlib' directory in /usr/hdp/current/hive-server2:
mkdir /usr/hdp/current/hive-server2/auxlib
Copy the SerDe jar to auxlib:
cp /opt/lucidworks-hdpsearch/hive/solr-hive-serde-4.0.0.jar /usr/hdp/current/hive-server2/auxlib/
Restart Hive
Create a Collection:
You can use the command below to create a new 'hivecollection' in Solr with 2 shards and a replication factor of 1:
curl -X GET "http://<solr_host>:8983/solr/admin/collections?action=CREATE&name=hivecollection&numShards=2&replicationFactor=1"
Index Data
Since this is an example, we will first create a table in Hive whose data we want indexed in Solr (you can skip this step for your real data). Then we create an external table backed by Solr and proceed with indexing.
Create Table in Hive and insert data:
As the hive user, connect to Beeline (run kinit before connecting if it is a secure cluster).
CREATE TABLE books (id STRING, cat STRING, title STRING, price FLOAT, in_stock BOOLEAN, author STRING, series STRING, seq INT, genre STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Load data from books.csv in the example directory (/opt/lucidworks-hdpsearch/solr/example/exampledocs/books.csv)
LOAD DATA LOCAL INPATH '/opt/lucidworks-hdpsearch/solr/example/exampledocs/books.csv' OVERWRITE INTO TABLE books;
Update the books table (if needed) so that the header line with column names is skipped: ALTER TABLE books SET TBLPROPERTIES ("skip.header.line.count"="1");
Create External Table for Solr and index data to Solr:
CREATE EXTERNAL TABLE solr_sec (id STRING, cat_s STRING, title_s STRING, price_f STRING, in_stock_b STRING, author_s STRING, series_s STRING, seq_i INT, genre_s STRING) STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler' LOCATION '/tmp/solr' TBLPROPERTIES('solr.zkhost' = '<zk_connection_string>', 'solr.collection' = 'hivecollection', 'solr.query' = '*:*');
If this is a secure cluster, you have to provide the path to a jaas-client.conf file that contains the service principal and keytab of a user with permission to read from and write to Solr and Hive.
In that case, append 'lww.jaas.file' = '/tmp/jaas-client.conf' to the TBLPROPERTIES of the CREATE EXTERNAL TABLE command above, as in the sketch below.
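For reference, the secured variant of the statement above would look like the following (a sketch; adjust <zk_connection_string> and the jaas path for your cluster, and drop the earlier solr_sec table first if you already created it):
CREATE EXTERNAL TABLE solr_sec (id STRING, cat_s STRING, title_s STRING, price_f STRING, in_stock_b STRING, author_s STRING, series_s STRING, seq_i INT, genre_s STRING) STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler' LOCATION '/tmp/solr' TBLPROPERTIES('solr.zkhost' = '<zk_connection_string>', 'solr.collection' = 'hivecollection', 'solr.query' = '*:*', 'lww.jaas.file' = '/tmp/jaas-client.conf');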
A sample jaas-client.conf looks like:
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="/etc/security/keytabs/smokeuser.headless.keytab"
storeKey=true
useTicketCache=false
debug=true
principal="ambari-qa@EXAMPLE.COM";
};
The file should be owned by solr:hadoop and needs to be copied to all nodes where a NodeManager is running.
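A minimal way to push the file out and set the ownership could look like the loop below (a sketch; the host names are placeholders for your NodeManager nodes, and passwordless ssh as root is assumed):
for host in <nm_host_1> <nm_host_2>; do
  scp /tmp/jaas-client.conf ${host}:/tmp/jaas-client.conf   # copy the jaas config to each NodeManager node
  ssh ${host} "chown solr:hadoop /tmp/jaas-client.conf"     # set the expected ownership
done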
Insert the data you want indexed from the books table into the Solr external table:
INSERT OVERWRITE TABLE solr_sec SELECT b.* FROM books b;
You can issue a SELECT query to verify that the data was inserted.
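For instance, a quick sanity check from Beeline (the column list and row limit are arbitrary):
SELECT id, title_s, author_s FROM solr_sec LIMIT 5;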
Now you should be able to see your data in the Solr 'hivecollection' and search for whatever you are looking for using the Solr UI or API calls.
Querying Data via Solr UI:
You can also issue the call below from the cluster command line:
curl -v -i --negotiate -u : "http://<solr_host>:8983/solr/hivecollection/select?q=*:*&wt=json&indent=true"
Change query ‘q’ based on what you want to look for.
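For example, to search the indexed books by author (the author_s field comes from the external table definition above; the author value is a placeholder):
curl -v -i --negotiate -u : "http://<solr_host>:8983/solr/hivecollection/select?q=author_s:<author_name>&wt=json&indent=true"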
Spark Connector:
To use the Spark connector, you first need to build the spark-solr jar from source.
From /opt/lucidworks-hdpsearch/spark/spark-solr:
mvn clean package -DskipTests
This will create the spark-solr jars in the target directory.
Now start the Spark shell with the shaded jar:
/usr/hdp/current/spark2-thriftserver/bin/spark-shell --jars ./spark-solr-3.5.6-shaded.jar
If this is a secure cluster, also pass jaas-client.conf to the driver and executors with --conf 'spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/tmp/jaas-client.conf' --conf 'spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/tmp/jaas-client.conf' (see the Hive section for details on jaas-client.conf). A complete launch command is sketched below.
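Putting that together, a secure-cluster launch could look like the following sketch (it assumes your current directory contains the shaded jar built above):
/usr/hdp/current/spark2-thriftserver/bin/spark-shell --jars ./spark-solr-3.5.6-shaded.jar --conf 'spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/tmp/jaas-client.conf' --conf 'spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/tmp/jaas-client.conf'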
Create Collection:
Just as we created the hivecollection, create a 'sparkcollection':
curl -X GET "http://<solr_host>:8983/solr/admin/collections?action=CREATE&name=sparkcollection&numShards=2&replicationFactor=1"
Index data to Solr:
(Below is the same example you will find in the Lucidworks documentation.)
The CSV file used in this sample is linked from the Lucidworks documentation. Move the CSV file to the /tmp HDFS directory and read it into a Spark DataFrame (the original post showed this step as a screenshot; see the sketch below).
After writing the DataFrame to Solr, commit the updates with: curl -X GET "http://<solr_host>:8983/solr/sparkcollection/update?commit=true"
Run a query in the Solr UI to validate the setup: *:* returns all 999 indexed docs, and a query for a particular pickup location returns one document.
Reading data from Solr
The original post showed two read-backs from Spark as screenshots: one selecting tip and fare amounts, the other selecting total amount and toll amount (see the sketch below).
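A minimal spark-shell sketch of the write and read steps follows; the CSV path, the gen_uniq_key option, and the column names (tip_amount, fare_amount) are assumptions based on the Lucidworks example and should be adjusted to your file:
// read the sample CSV from HDFS into a DataFrame (a header row is assumed)
val csvDF = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs:///tmp/<taxi_sample>.csv")
// index the DataFrame into 'sparkcollection' through the spark-solr data source
csvDF.write.format("solr").option("zkhost", "<zk_connection_string>").option("collection", "sparkcollection").option("gen_uniq_key", "true").mode("overwrite").save()
// read the collection back and inspect a couple of fields
val solrDF = spark.read.format("solr").option("zkhost", "<zk_connection_string>").option("collection", "sparkcollection").load()
solrDF.select("tip_amount", "fare_amount").show(5)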
HDFS Connector:
The HDFS connector provides the ability to index files of the following formats or contents:
CSV
Zip
War
Sequence
XML
SolrXML
Directories
Regex (lets you define a regular expression over the incoming data and filter content)
Grok (indexes incoming data based on a grok configuration)
The job jar (which is the HDFS connector) ships different IngestMappers to handle these file types and formats. As an example, let's index the same books.csv we used in the Hive example with the CSVIngestMapper. This assumes you have already created a collection, as explained in the examples above.
Suppose books.csv resides in /user/solr/csvDir in HDFS. The first command below indexes this data into Solr; the remaining commands show the other IngestMapper types. On a secure cluster, add -Dlww.jaas.file=/tmp/jaas-client.conf to the commands and run kinit as needed.
CSVIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -DcsvDelimiter=@ -DcsvFirstLineComment=true -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c csvCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/csvDir/* -zk <zk_connection_string>
RegexIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -Dcom.lucidworks.hadoop.ingest.RegexIngestMapper.regex=".([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])." -Dcom.lucidworks.hadoop.ingest.RegexIngestMapper.groups_to_fields="0=data,1=ip1,2=ip2,3=ip3,4=ip4" -cls com.lucidworks.hadoop.ingest.RegexIngestMapper -c regexCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/regexDir/* -zk <zk_connection_string>
GrokIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -Dgrok.uri=/tmp/grok.conf -cls com.lucidworks.hadoop.ingest.GrokIngestMapper -c grokCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/grokDir/*8.log -zk <zk_connection_string>
Sample grok.conf:
input { stdin { type => example } }
filter {
  grok {
    match => [ "message", "%{IP:ip} %{WORD:log_code} %{GREEDYDATA:log_message}" ]
    add_field => [ "received_from_field", "%{ip}" ]
    add_field => [ "message_code", "%{log_code}" ]
    add_field => [ "message_field", "%{log_message}" ]
  }
}
output { stdout { codec => rubydebug } }
SequenceFileIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.SequenceFileIngestMapper -c seqCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/seqDir/*.seq -zk <zk_connection_string>
SolrXMLIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.SolrXMLIngestMapper -c solrxmlCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/solrXmlDir/* -zk <zk_connection_string>
WarcIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.WarcIngestMapper -c warcCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/warcDir/* -zk <zk_connection_string>
ZipIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.ZipIngestMapper -c zipCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/test-zip/* -zk <zk_connection_string>
DirectoryIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper -c dirCollection -i /user/solr/test-documents/hadoop-dir/* -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -zk <zk_connection_string>
XMLIngestMapper
hadoop jar solr-hadoop-job-4.0.0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.xml.start=root -Dlww.xml.end=root -Dlww.jaas.file=/tmp/jaas-client.conf -Dlww.xml.docXPathExpr=//doc -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.XMLIngestMapper -c xmlCollection -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -i /user/solr/xmlDir/* -zk <zk_connection_string>
You should be able to query these collections via the Solr UI or the APIs, just as in the previous examples.
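For example, to spot-check the CSV ingest above (the collection name matches the -c flag used in that command; drop --negotiate on a non-secure cluster):
curl -i --negotiate -u : "http://<solr_host>:8983/solr/csvCollection/select?q=*:*&wt=json&indent=true"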
Labels: Data Processing, hdpsearch, How-ToTutorial, solr, solrcloud
08-27-2018
08:54 PM
6 Kudos
@Anurag Mishra In case you haven't found a solution: did you install the mpack and restart Ambari afterwards? Once the mpack is installed and ambari-server is restarted, you should be able to see Solr on the Choose Services page of the Ambari Install or Add Service wizard. Please see if this helps.
06-29-2018
10:07 PM
7 Kudos
Could you try escaping it with \u002f? So if you want a/b, try using a\\u002fb.
10-12-2017
08:22 PM
6 Kudos
Could you please try /api/v1/clusters/clusterName/hosts/hostName/host_components/componentName, adding the service and host name in the request body: {"{SERVICE}", serviceName, "{HOST}", hostName}. This article has sample calls that walk you through the process: https://cwiki.apache.org/confluence/display/AMBARI/Adding+a+New+Service+to+an+Existing+Cluster. Hope this helps.
09-22-2017
02:44 AM
2 Kudos
Thanks @kramakrishnan. This works!
09-21-2017
10:19 PM
9 Kudos
I am looking for an API call with which I can get the property differences between two config versions of a service. It would be great if it could also be extended to config versions in a config group.
Labels: Apache Ambari
06-29-2017
08:37 PM
6 Kudos
I did an Ambari setup with AD and synced a few users. I could see them synced correctly via Ambari. I followed https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_zeppelin-component-guide/content/config-secure-prod-ad.html to configure Zeppelin with this AD. But after this is done, I am not able to log in to the Zeppelin UI, even with the default admin/admin username/password. The following error is seen in the Zeppelin logs. Any clue what could have gone wrong? Caused by: javax.naming.AuthenticationException: [LDAP: error code 49 - 80090308: LdapErr: DSID-0C0903C8, comment: AcceptSecurityContext error, data 52e, v2580]
at com.sun.jndi.ldap.LdapCtx.mapErrorCode(LdapCtx.java:3136)
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:3082)
at com.sun.jndi.ldap.LdapCtx.processReturnCode(LdapCtx.java:2883)
at com.sun.jndi.ldap.LdapCtx.connect(LdapCtx.java:2797)
at com.sun.jndi.ldap.LdapCtx.<init>(LdapCtx.java:319)
at com.sun.jndi.ldap.LdapCtxFactory.getUsingURL(LdapCtxFactory.java:192)
at com.sun.jndi.ldap.LdapCtxFactory.getUsingURLs(LdapCtxFactory.java:210)
at com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxInstance(LdapCtxFactory.java:153)
at com.sun.jndi.ldap.LdapCtxFactory.getInitialContext(LdapCtxFactory.java:83)
at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:684)
at javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:313)
at javax.naming.InitialContext.init(InitialContext.java:244)
at javax.naming.ldap.InitialLdapContext.<init>(InitialLdapContext.java:154)
at org.apache.shiro.realm.ldap.JndiLdapContextFactory.createLdapContext(JndiLdapContextFactory.java:508)
at org.apache.shiro.realm.ldap.JndiLdapContextFactory.getLdapContext(JndiLdapContextFactory.java:495)
at org.apache.shiro.realm.ldap.JndiLdapRealm.queryForAuthenticationInfo(JndiLdapRealm.java:375)
at org.apache.shiro.realm.ldap.JndiLdapRealm.doGetAuthenticationInfo(JndiLdapRealm.java:295)
Labels: Apache Zeppelin
06-29-2017
08:26 PM
5 Kudos
@aswathy, by default you should be able to log in with admin/admin. Could you please try
05-02-2017
06:29 PM
1 Kudo
@cduby, which version of HDCloud are you using?
03-27-2017
07:57 PM
2 Kudos
Thanks @Ram Venkatesh. After registering the metastore I can see the named entry in my JSON. Thank you!
03-27-2017
07:56 PM
1 Kudo
@Dominika Bialek, yeah, as @Ram Venkatesh mentioned below, after I register the metastore I can see the named entry. Thank you!
03-27-2017
04:29 PM
2 Kudos
My CLI JSON doesn't have the name of the RDS either.
03-27-2017
05:06 AM
7 Kudos
My aim is to save a template from the HDC UI and reuse it to create a cluster via the CLI. So I chose the 'Create Cluster' option in the HDC UI, entered all the required fields, chose 'Register new Hive Metastore', and added a name, JDBC connection string, username, and password for an existing RDS instance. I clicked Create Cluster and chose 'SHOW CLI JSON' (I also tried saving the template). But I am not able to see any parameters for this Hive Metastore. Could you please let me know if that is expected or if I am missing something?
Labels: Hortonworks Cloudbreak
03-25-2017
06:20 PM
Yes, hdc list-cluster-types works, @Tamas Bihari. Thank you!
03-24-2017
06:31 PM
3 Kudos
I brought up a cloud controller using the AWS CLI. After ssh-ing to the controller instance, I downloaded the hdc CLI and configured hdc with the server address, username, and password. Now the attempt to create a cluster fails with a "Blueprint not found" error. If I just log in to the HDC UI once and then retry the same command via the hdc CLI, it goes through fine. I checked the ~/.hdc/config file; server, user, and password are all correct. Has anyone faced a similar issue, or any clue what I may be doing wrong? Which log files can I check to get more info on what is happening when a command is issued on the CLI? ./hdc create-cluster -cli-input-json /tmp/hdc-cli-deploy.json --wait true
ERROR: status code: 404, message: Blueprint 'EDW-ETL: Apache Hive 1.2.1, Apache Spark 2.1' not found.
Labels: Apache Hive, Apache Spark
03-22-2017
04:10 PM
Thank you, @khorvath!
03-22-2017
05:10 AM
4 Kudos
I could download the HDC CLI tar/tgz file from the HDC UI. Is there a corresponding API call or command to do the same once the controller is up?
Labels: Hortonworks Cloudbreak
03-16-2017
11:36 PM
1 Kudo
Thanks @bbihari. This works!
03-16-2017
09:27 PM
1 Kudo
I can see it from the HDC UI, but I wanted to get the same from the Cloudbreak shell too. I want to select a stack from the shell and see which blueprint is used by that stack. I will bring up a cluster, try the steps mentioned by @bbihari below, and respond.
03-13-2017
07:43 PM
1 Kudo
Thanks @Ayub Khan. This helps me get the list of all blueprints and choose a blueprint for deployment. But once a deployment is completed, how can I see the blueprint name that was used to deploy that stack?
03-13-2017
07:24 PM
1 Kudo
I installed the cloud controller and created a stack by choosing a cluster type from the HDC UI. I do see the blueprint name I selected on the HDC UI. How can I find the same info via the Cloudbreak shell?
Labels: Hortonworks Cloudbreak
03-12-2017
05:20 AM
1 Kudo
@Matt Foley Thanks for your reply. I have not seen any issues yet, but I was wondering how I can make sure it has indeed downloaded all the client configs. One way would be to understand which config files belong to a service's client; hence the question. Yeah, it was out of curiosity 🙂
03-09-2017
06:56 PM
1 Kudo
@Xiaoyu Yao, thanks for the reply. What I was looking for is how to identify client-specific files in a conf dir. For example, in /etc/hadoop/conf we have multiple config files. How can I find which ones are server-specific and which ones are client-specific?
03-09-2017
03:42 PM
4 Kudos
On a cluster I am downloading the client configs via the Ambari UI/API. How can I make sure all the client configs are actually downloaded? In other words, how do I know which config files belong to a client component?
Labels: Apache Ambari