Member since
09-17-2015
436
Posts
736
Kudos Received
81
Solutions
11-03-2015
04:19 AM
14 Kudos
Exploring Apache Flink with HDP

Apache Flink is an open source platform for distributed stream and batch data processing. More details on Flink and how it is being used in the industry today are available here: http://flink-forward.org/?post_type=session

There are a few ways you can explore Flink on HDP 2.3:

1. Compilation on HDP 2.3.2

To compile Flink from source on HDP 2.3 you can use these commands:
curl -o /etc/yum.repos.d/epel-apache-maven.repo https://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo
yum -y install apache-maven-3.2*
git clone https://github.com/apache/flink.git
cd flink
mvn clean install -DskipTests -Dhadoop.version=2.7.1.2.3.2.0-2950 -Pvendor-repos
Note that with this option I ran into a classpath bug and raised it here: https://issues.apache.org/jira/browse/FLINK-3032

2. Run using precompiled tarball

wget http://www.gtlib.gatech.edu/pub/apache/flink/flink-0.9.1/flink-0.9.1-bin-hadoop27.tgz
tar xvzf flink-0.9.1-bin-hadoop27.tgz
cd flink-0.9.1
export HADOOP_CONF_DIR=/etc/hadoop/conf
./bin/yarn-session.sh -n 1 -jm 768 -tm 1024

3. Using Ambari service (demo purposes only for now)

The Ambari service lets you easily install/compile Flink on HDP 2.3
Features:
By default, downloads a prebuilt package of Flink 0.9.1, but also gives the option to build the latest Flink from source instead
Exposes flink-conf.yaml in Ambari UI

Setup

Download HDP 2.3 sandbox VM image (Sandbox_HDP_2.3_1_VMware.ova) from Hortonworks website
Import Sandbox_HDP_2.3_1_VMware.ova into VMware and set the VM memory size to 8GB
Now start the VM
After it boots up, find the IP address of the VM and add an entry into your machine's hosts file. For example:
192.168.191.241 sandbox.hortonworks.com sandbox
Note that you will need to replace the above with the IP for your own VM
Connect to the VM via SSH (password hadoop):
ssh root@sandbox.hortonworks.com
To download the Flink service folder, run the below:
VERSION=`hdp-select status hadoop-client | sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'`
sudo git clone https://github.com/abajwa-hw/ambari-flink-service.git /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/FLINK
Restart Ambari:
#sandbox
service ambari restart
#non sandbox
sudo service ambari-server restart
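The VERSION line in the setup commands depends on `hdp-select status hadoop-client` printing a line of the form `hadoop-client - <build>`. Here is a minimal sketch of what the sed expression extracts. The sample build string is an assumption (borrowed from the Maven command earlier in the post), and `stack_version` is a hypothetical helper name, not part of hdp-select.

```shell
#!/bin/sh
# Hypothetical helper: reduce an hdp-select status line to the major.minor
# stack version that names the Ambari stacks directory (e.g. 2.3).
stack_version() {
  echo "$1" | sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'
}

# Assumed sample output of: hdp-select status hadoop-client
sample="hadoop-client - 2.3.2.0-2950"
stack_version "$sample"
```

If the line does not match the pattern, sed prints it unchanged, so it is worth echoing $VERSION before running the git clone.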
Then you can click on 'Add Service' from the 'Actions' dropdown menu in the bottom left of the Ambari dashboard:
On bottom left -> Actions -> Add service -> check Flink server -> Next -> Next -> Change any config you like (e.g. install dir, memory sizes, num containers or values in flink-conf.yaml) -> Next -> Deploy
By default:
Container memory is 1024 MB
Job manager memory is 768 MB
Number of YARN containers is 1
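As a rough sanity check on these defaults, the sketch below adds up the memory a YARN session would ask for. It assumes the request is simply the job manager memory plus the container memory times the number of containers, and it ignores any rounding YARN applies to reach its minimum allocation. `flink_yarn_memory_mb` is a hypothetical helper, not a Flink command.

```shell
#!/bin/sh
# Hypothetical helper: approximate total memory (MB) requested from YARN.
# Arguments: number of containers, job manager MB, per-container MB.
flink_yarn_memory_mb() {
  containers=$1
  jm_mb=$2
  tm_mb=$3
  echo $(( jm_mb + containers * tm_mb ))
}

# Defaults listed above: 1 container, 768 MB job manager, 1024 MB per container
flink_yarn_memory_mb 1 768 1024   # prints 1792
```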
On successful deployment you will see the Flink service as part of the Ambari stack and will be able to start/stop the service from there.
You can see the parameters you configured under the 'Configs' tab.
One benefit of wrapping the component in an Ambari service is that you can now monitor/manage this service remotely via REST API:
export SERVICE=FLINK
export PASSWORD=admin
export AMBARI_HOST=localhost
#detect name of cluster
output=`curl -u admin:$PASSWORD -i -H 'X-Requested-By: ambari' http://$AMBARI_HOST:8080/api/v1/clusters`
CLUSTER=`echo $output | sed -n 's/.*"cluster_name" : "\([^\"]*\)".*/\1/p'`
#get service status
curl -u admin:$PASSWORD -i -H 'X-Requested-By: ambari' -X GET http://$AMBARI_HOST:8080/api/v1/clusters/$CLUSTER/services/$SERVICE
#start service
curl -u admin:$PASSWORD -i -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo": {"context" :"Start '"$SERVICE"' via REST"}, "Body": {"ServiceInfo": {"state": "STARTED"}}}' http://$AMBARI_HOST:8080/api/v1/clusters/$CLUSTER/services/$SERVICE
#stop service
curl -u admin:$PASSWORD -i -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo": {"context" :"Stop '"$SERVICE"' via REST"}, "Body": {"ServiceInfo": {"state": "INSTALLED"}}}' http://$AMBARI_HOST:8080/api/v1/clusters/$CLUSTER/services/$SERVICE
...and also install via Blueprint. See example here on how to deploy custom services via Blueprints

Use Flink

Run word count job:
su flink
export HADOOP_CONF_DIR=/etc/hadoop/conf
cd /opt/flink
./bin/flink run ./examples/flink-java-examples-0.9.1-WordCount.jar
This should generate a series of word counts
Open the YARN ResourceManager UI. Notice Flink is running on YARN
Click the ApplicationMaster link to access the Flink web UI
Use the History tab to review details of the job that ran
View metrics in the Task Manager tab

Other things to try

Apache Zeppelin now also supports Flink. You can also install it via the Zeppelin Ambari service for visualization
More details on Flink and how it is being used in the industry today are available here: http://flink-forward.org/?post_type=session

Remove service

To remove the Flink service:
Stop the service via Ambari
Unregister the service:
export SERVICE=FLINK
export PASSWORD=admin
export AMBARI_HOST=localhost
#detect name of cluster
output=`curl -u admin:$PASSWORD -i -H 'X-Requested-By: ambari' http://$AMBARI_HOST:8080/api/v1/clusters`
CLUSTER=`echo $output | sed -n 's/.*"cluster_name" : "\([^\"]*\)".*/\1/p'`
curl -u admin:$PASSWORD -i -H 'X-Requested-By: ambari' -X DELETE http://$AMBARI_HOST:8080/api/v1/clusters/$CLUSTER/services/$SERVICE
#if above errors out, run below first to fully stop the service
#curl -u admin:$PASSWORD -i -H 'X-Requested-By: ambari' -X PUT -d '{"RequestInfo": {"context" :"Stop '"$SERVICE"' via REST"}, "Body": {"ServiceInfo": {"state": "INSTALLED"}}}' http://$AMBARI_HOST:8080/api/v1/clusters/$CLUSTER/services/$SERVICE
Remove artifacts:
rm -rf /opt/flink*
rm /tmp/flink.tgz
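The REST calls in this post rely on two pieces of shell plumbing: pulling the cluster name out of the Ambari response, and building the JSON body for a state change. The sketch below isolates both so they can be tried without a running Ambari server. The sample response is an assumed, trimmed shape and was not captured from a real cluster, and `extract_cluster_name` and `state_payload` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: pull the cluster name out of the JSON returned by
# GET /api/v1/clusters, using the same sed pattern as the commands above.
# Newlines are flattened first so the pattern sees a single line.
extract_cluster_name() {
  { echo "$1" | tr '\n' ' '; echo; } |
    sed -n 's/.*"cluster_name" : "\([^"]*\)".*/\1/p'
}

# Hypothetical helper: build the request body for a service state change.
# printf substitutes the action, service name and target state into the JSON.
state_payload() {
  service=$1
  state=$2
  action=$3
  printf '{"RequestInfo": {"context": "%s %s via REST"}, "Body": {"ServiceInfo": {"state": "%s"}}}\n' \
    "$action" "$service" "$state"
}

# Assumed shape of the Ambari response (trimmed for illustration)
sample='{
  "items" : [
    {
      "Clusters" : {
        "cluster_name" : "Sandbox",
        "version" : "HDP-2.3"
      }
    }
  ]
}'

extract_cluster_name "$sample"
state_payload FLINK INSTALLED Stop
```

The output of `state_payload` can be passed to curl with `-d "$(state_payload FLINK INSTALLED Stop)"`, which keeps the quoting in one place.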
10-30-2015
02:58 PM
Thanks @George Vetticaden! As @Jonas Straub and @Andrew Grande mentioned, in this example I used cloud mode (notice that Solr was started with -c -z arguments) but you can easily change the Solr processor to point to Solr standalone instance too
10-26-2015
04:29 PM
2 Kudos
I'm surprised it wasn't @bbende@hortonworks.com who wrote this article 😉
10-12-2015
08:41 AM
46 Kudos
Build flow in HDF/Nifi to push tweets to HDP

In this tutorial, we will learn how to use HDF to create a simple event processing flow by:
Installing HDF/Nifi on sandbox using the Ambari service
Setting up Solr/Banana/Hive table
Importing/instantiating a prebuilt Nifi template
Verifying tweets got pushed to HDFS, Hive using Ambari views
Visualizing tweets in Solr using Banana dashboard
Exploring provenance features of Nifi

Change log
9/30: Automation script to deploy HDP clusters (on any cloud) with this demo already set up is available here
9/15: Updated: Demo Ambari service for Nifi updated to support HDP 2.5 sandbox and Nifi 1.0. Steps to manually install demo artifacts remain unchanged (but the Nifi screenshots below need to be updated)

References
For a primer on HDF, you can refer to the materials here to get a basic background
Thanks to @bbende@hortonworks.com for his earlier blog post that helped make this tutorial possible

Pre-Requisites
The lab is designed for the HDP Sandbox VM. To run on Azure sandbox, Azure specific pre-req steps are provided here
Download the HDP Sandbox here, import into VMWare Fusion and start the VM
If running on VirtualBox you will need to forward port 9090. See here for detailed steps
After it boots up, find the IP address of the VM and add an entry into your machine's hosts file e.g.
192.168.191.241 sandbox.hortonworks.com sandbox
Connect to the VM via SSH (root/hadoop), correct the /etc/hosts entry:
ssh root@sandbox.hortonworks.com
If using HDP 2.5 sandbox, you will also need to SSH into the docker based sandbox container:
ssh root@127.0.0.1 -p 2222
Deploy/update Nifi Ambari service on sandbox by running below
Note: on HDP 2.5 sandbox, the Nifi service definition is already installed, so you can skip this and proceed to installing Nifi via 'Install Wizard'
VERSION=`hdp-select status hadoop-client | sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'`
rm -rf /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/NIFI
sudo git clone https://github.com/abajwa-hw/ambari-nifi-service.git /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/NIFI
#sandbox
service ambari restart
#non sandbox
service ambari-server restart
To install Nifi, start the 'Install Wizard': Open Ambari (http://sandbox.hortonworks.com:8080) then:
On bottom left -> Actions -> Add service -> check NiFi server -> Next -> Next -> Change any config you like (e.g. install dir, port, setup_prebuilt or values in nifi.properties) -> Next -> Deploy. This will kick off the install which will run for 5-10min.

Steps

Import a simple flow to read Tweets into HDFS/Solr and visualize using Banana dashboard
HDP sandbox comes with LW HDP search. Follow the steps below to use it to set up Banana, start SolrCloud and create a collection
On HDP 2.5 sandbox, HDPsearch can be installed via Ambari. Just use the same 'Install Wizard' used above and select all defaults
To install HDP search on non-sandbox, you can either:
install via Ambari (for this you will need to install its management pack in Ambari)
OR install HDPsearch manually:
yum install -y lucidworks-hdpsearch
sudo -u hdfs hadoop fs -mkdir /user/solr
sudo -u hdfs hadoop fs -chown solr /user/solr
Ensure no log files are owned by root (current sandbox version has files owned by root in log dir which causes problems when starting solr):
chown -R solr:solr /opt/lucidworks-hdpsearch/solr
Run solr setup steps as solr user:
su solr
Set up the Banana dashboard by copying default.json to dashboard dir:
cd /opt/lucidworks-hdpsearch/solr/server/solr-webapp/webapp/banana/app/dashboards/
mv default.json default.json.orig
wget https://raw.githubusercontent.com/abajwa-hw/ambari-nifi-service/master/demofiles/default.json
Edit solrconfig.xml by adding <str>EEE MMM d HH:mm:ss Z yyyy</str> under ParseDateFieldUpdateProcessorFactory so it looks like below. This is done to allow Solr to recognize the timestamp format of tweets.
vi /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs/conf/solrconfig.xml
<processor>
<arr name="format">
<str>EEE MMM d HH:mm:ss Z yyyy</str>
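The pattern `EEE MMM d HH:mm:ss Z yyyy` describes timestamps such as `Mon Oct 12 08:41:00 +0000 2015`. The sketch below only checks that a string has that general shape and does not validate the date itself. The sample values are made up for illustration and `looks_like_tweet_time` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the argument has the shape
# "Day Mon d HH:mm:ss +zzzz yyyy" (day name, month name, day of month,
# time, numeric UTC offset, four digit year).
looks_like_tweet_time() {
  echo "$1" | grep -Eq '^[A-Z][a-z]{2} [A-Z][a-z]{2} [0-9]{1,2} [0-9]{2}:[0-9]{2}:[0-9]{2} [+-][0-9]{4} [0-9]{4}$'
}

# Illustrative sample values (not taken from real tweets)
for ts in "Mon Oct 12 08:41:00 +0000 2015" "2015-10-12T08:41:00Z"; do
  if looks_like_tweet_time "$ts"; then
    echo "matches: $ts"
  else
    echo "no match: $ts"
  fi
done
```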
Start/Restart Solr in cloud mode
If you installed Solr via Ambari, just use the 'Service Actions' dropdown to restart it
Otherwise, if you installed manually, start Solr as below after setting JAVA_HOME to the right location:
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64
/opt/lucidworks-hdpsearch/solr/bin/solr start -c -z localhost:2181
Create a collection called tweets:
/opt/lucidworks-hdpsearch/solr/bin/solr create -c tweets -d data_driven_schema_configs -s 1 -rf 1
Solr setup is complete. Return to root user:
exit
Ensure the time on your sandbox is accurate or you will get errors using the GetTwitter processor. In case the time is not correct, run the below to fix it:
yum install -y ntp
service ntpd stop
ntpdate pool.ntp.org
service ntpd start
Now open the Nifi web UI (http://sandbox.hortonworks.com:9090/nifi) and run the remaining steps there:
Download the prebuilt Twitter_Dashboard.xml template to your laptop from here
Import flow template into Nifi:
Import the template by clicking on Templates (third icon from right) which will launch the 'Nifi Flow templates' popup
Browse and navigate to wherever you downloaded Twitter_Dashboard.xml on your local machine
Click Import. Now the template should appear
Close the popup
Instantiate the Twitter dashboard template:
Drag/drop the Template icon (7th icon from left) onto the canvas so that a picklist popup appears
Select 'Twitter dashboard' and click Add
This should create a box (i.e. process group) named 'Twitter Dashboard'. Double click it to drill into the actual flow
Configure GetTwitter processor:
Right click on 'GetTwitter' processor (near top) and click Configure
Under Properties:
Enter your Twitter key/secrets
Ensure the 'Twitter Endpoint' is set to 'Filter Endpoint'
Enter the search terms (e.g. AAPL,GOOG,MSFT,ORCL) under 'Terms to Filter on'
Configure PutSolrContentStream processor:
Writes the selected attributes to Solr. In this case, assuming Solr is running in cloud mode with a collection 'tweets'
Confirm the Solr Location property is updated to reflect your Zookeeper configuration (for SolrCloud) or Solr standalone instance
If you installed Solr via Ambari, you will need to append /solr to the ZK string in the 'Solr Location'
Review the other processors and modify properties as needed:
EvaluateJsonPath: Pulls out attributes of tweets
RouteOnAttribute: Ensures only tweets with non-empty messages are processed
ReplaceText: Formats each tweet as pipe (|) delimited line entry e.g. tweet_id|unixtime|humantime|user_handle|message|full_tweet
MergeContent: Merges tweets into a single file (either 20 tweets or 120s, whichever comes first) to avoid having a large number of small files in HDFS. These values can be configured.
PutFile: writes tweets to local disk under /tmp/tweets/
PutHDFS: writes tweets to HDFS under /tmp/tweets_staging
If set up correctly, the top left hand of each processor on the canvas will show a red square (indicating the flow is stopped)
Click the Start button (green triangle near top of screen) to start the flow
After a few seconds you will see tweets flowing
Create Hive table to be able to run queries on the tweets in HDFS:
sudo -u hdfs hadoop fs -chmod -R 777 /tmp/tweets_staging
hive> create table if not exists tweets_text_partition(
tweet_id bigint,
created_unixtime bigint,
created_time string,
displayname string,
msg string,
fulltext string
)
row format delimited fields terminated by "|"
location "/tmp/tweets_staging";
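To see how a pipe-delimited line produced by ReplaceText lines up with the six Hive columns, the sketch below splits one record in the shell. The record is a made-up example in the tweet_id|unixtime|humantime|user_handle|message|full_tweet layout, not real flow output. Note that a literal | inside the message would shift the later fields, both here and in the Hive table.

```shell
#!/bin/sh
# Made-up sample record in the layout written by ReplaceText:
# tweet_id|unixtime|humantime|user_handle|message|full_tweet
line='653000000000000001|1444639260|Mon Oct 12 08:41:00 +0000 2015|some_user|AAPL looks strong today|{"id":653000000000000001}'

# Split on "|" into variables named after the Hive columns.
# The last variable receives whatever remains of the line.
IFS='|' read -r tweet_id created_unixtime created_time displayname msg fulltext <<EOF
$line
EOF

echo "tweet_id=$tweet_id"
echo "created_unixtime=$created_unixtime"
echo "created_time=$created_time"
echo "displayname=$displayname"
echo "msg=$msg"
echo "fulltext=$fulltext"
```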
Viewing results
Verify that:
tweets appear under /tmp/tweets_staging dir in HDFS. You can see this via Files view in Ambari
tweets appear in Solr:
http://sandbox.hortonworks.com:8983/solr/tweets_shard1_replica1/select?q=*:*
http://sandbox.hortonworks.com:8983/solr/#/tweets_shard1_replica1/query
tweets appear in Banana:
http://sandbox.hortonworks.com:8983/solr/banana/index.html#/dashboard
To search for tweets by language (e.g. Italian) enter the below in the search text box:
language_s:it
To search for tweets by a particular user (e.g. warrenbuffett) enter the below in the search text box:
screenName_s:warrenbuffett
To search for tweets containing some text (e.g. tax) enter the below in the search text box:
text_t:tax
tweets appear in Hive:
http://sandbox.hortonworks.com:8080/#/main/views/HIVE/1.0.0/Hive

Other Nifi features
Flow statistics/graphs:
Right click on one of the processors (e.g. PutHDFS) and click 'Stats' to see a number of charts/metrics
You should also see Nifi metrics in Ambari (assuming you started Ambari metrics earlier)
Data provenance in Nifi:
In the Nifi home screen, click the Provenance icon (5th icon from top right corner) to open the Provenance page
Click the Show lineage icon (2nd icon from right) on any row
Right click Send > View details > Content
From here you can view the tweet itself by clicking Content > View > formatted
You can also replay the event by Replay > Submit
Close the provenance window using the x icon on the inner window
Notice the event was replayed
Re-open the provenance window on the row you had originally selected
Notice that by viewing and replaying the tweet, you changed the provenance graph of this event: Send and replay events were added to the lineage graph
Also notice the time slider on the bottom left of the page which allows users to 'rewind' time and 'replay' the provenance events as they happened
Right click on the Send event near the bottom of the flow and select Details
Notice that the details of the request to view the tweet are captured here (who requested it, at what time etc)
Exit the Provenance window by clicking the x icon on the outer window
You have successfully created a basic Nifi flow that performs simple event processing to ingest tweets into HDP. Why was the processing 'simple'? There were no complex features like alerting users based on time windows (e.g. if a particular topic was tweeted about more than x times in 30s), which require a higher fidelity form of transportation. For such functionality the recommendation would be to use Kafka/Storm. To see how you would use these technologies of the HDP stack to perform complex processing, take a look at the Twitter Storm demo at the Hortonworks Gallery under 'Sample Apps'

Other things to try
Learn more about Nifi expression language and how to get started building a custom Nifi processor: http://community.hortonworks.com/articles/4356/getting-started-with-nifi-expression-language-and.html
10-09-2015
01:37 AM
24 Kudos
Hbase indexing to Solr with HDP Search in HDP 2.3
Background: The HBase Indexer provides the ability to stream events from HBase to Solr for near real time searching. The HBase indexer is included with HDPSearch as an additional service. The indexer works by acting as an HBase replication sink. As updates are written to HBase, the events are asynchronously replicated to the HBase Indexer processes, which in turn create Solr documents and push them to Solr.
References:
https://doc.lucidworks.com/lucidworks-hdpsearch/2.3/Guide-Jobs.html#_hbase-indexer
https://github.com/NGDATA/hbase-indexer/wiki/Tutorial

Steps

Download and start HDP 2.3 sandbox VM which comes with LW HDP search installed (under /opt/lucidworks-hdpsearch) and run below to ensure no log files owned by root remain:
chown -R solr:solr /opt/lucidworks-hdpsearch/solr
If running on an Ambari installed HDP 2.3 cluster (instead of sandbox), run the below to install HDPsearch and set up the user dir in HDFS:
yum install -y lucidworks-hdpsearch
sudo -u hdfs hadoop fs -mkdir /user/solr
sudo -u hdfs hadoop fs -chown solr /user/solr
Point Solr to Zookeeper by configuring hbase-indexer-site.xml:
vi /opt/lucidworks-hdpsearch/hbase-indexer/conf/hbase-indexer-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>hbaseindexer.zookeeper.connectstring</name>
<value>sandbox.hortonworks.com:2181</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>sandbox.hortonworks.com</value>
</property>
</configuration>
In Ambari > HBase > Configs > Custom hbase-site add the below properties, but do not restart HBase just yet:
hbase.replication=true
replication.source.ratio=1.0
replication.source.nb.capacity=1000
replication.replicationsource.implementation=com.ngdata.sep.impl.SepReplicationSource
Copy Solr's HBase related libs to $HBASE_HOME/lib:
cp /opt/lucidworks-hdpsearch/hbase-indexer/lib/hbase-sep* /usr/hdp/current/hbase-master/lib/
Restart HBase
Copy hbase-site.xml to hbase-indexer's conf dir:
cp /etc/hbase/conf/hbase-site.xml /opt/lucidworks-hdpsearch/hbase-indexer/conf/
Start Solr in cloud mode (pointing to ZK):
cd /opt/lucidworks-hdpsearch/solr
bin/solr start -c -z sandbox.hortonworks.com:2181
Create collection:
bin/solr create -c hbaseCollection \
-d data_driven_schema_configs \
-n myCollConfigs \
-s 2 \
-rf 2
Start HBase indexer:
cd /opt/lucidworks-hdpsearch/hbase-indexer/bin/
./hbase-indexer server
In a second terminal, create the table to be indexed in HBase. Open hbase shell and run below to create a table named "indexdemo-user", with a single column family named "info". Note that the REPLICATION_SCOPE of the column family of the table must be set to 1:
create 'indexdemo-user', { NAME => 'info', REPLICATION_SCOPE => '1' }
!quit
Now we'll create an indexer that will index the indexdemo-user table as its contents are updated:
vi /opt/lucidworks-hdpsearch/hbase-indexer/indexdemo-indexer.xml
<?xml version="1.0"?>
<indexer table="indexdemo-user">
<field name="firstname_s" value="info:firstname"/>
<field name="lastname_s" value="info:lastname"/>
<field name="age_i" value="info:age" type="int"/>
</indexer>
The above file defines three pieces of information that will be used for indexing, how to interpret them, and how they will be stored in Solr.
Next, create an indexer based on the created indexer xml file:
/opt/lucidworks-hdpsearch/hbase-indexer/bin/hbase-indexer add-indexer -n hbaseindexer -c /opt/lucidworks-hdpsearch/hbase-indexer/indexdemo-indexer.xml -cp solr.zk=sandbox.hortonworks.com:2181 -cp solr.collection=hbaseCollection
Check it got created:
/opt/lucidworks-hdpsearch/hbase-indexer/bin/hbase-indexer list-indexers
Check that the index server output shows below:
INFO supervisor.IndexerSupervisor: Started indexer for hbaseindexer
Log back into the hbase shell and try adding some data to the indexdemo-user table:
hbase> put 'indexdemo-user', 'row1', 'info:firstname', 'John'
hbase> put 'indexdemo-user', 'row1', 'info:lastname', 'Smith'
Run commit:
curl http://sandbox.hortonworks.com:8983/solr/hbaseCollection/update?commit=true
Open Solr UI and notice under statistics the "Num Docs" has increased:
http://sandbox.hortonworks.com:8983/solr/#/hbaseCollection_shard1_replica1
Run query using Solr REST API:
http://sandbox.hortonworks.com:8983/solr/hbaseCollection_shard1_replica1/select?q=*%3A*&wt=json&indent=true
Now try updating the data you've just added in hbase shell and commit:
hbase> put 'indexdemo-user', 'row1', 'info:firstname', 'Jim'
curl http://sandbox.hortonworks.com:8983/solr/hbaseCollection/update?commit=true
Check the content in Solr:
http://sandbox.hortonworks.com:8983/solr/hbaseCollection_shard1_replica1/select?q=*%3A*&wt=json&indent=true
Note that the document's firstname_s field now contains the string "Jim".
Finally, delete the row from HBase and commit:
hbase> deleteall 'indexdemo-user', 'row1'
curl http://sandbox.hortonworks.com:8983/solr/hbaseCollection/update?commit=true
Check the content in Solr and notice that the document has been removed:
http://sandbox.hortonworks.com:8983/solr/hbaseCollection_shard1_replica1/select?q=*%3A*&wt=json&indent=true
You have successfully set up HBase indexing with HDP search
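The add, update and delete checks can also be scripted by reading the document count from the select response instead of eyeballing it in the browser. This is a minimal sketch that assumes the JSON response carries a `"numFound":N` entry. The sample responses are trimmed, assumed shapes and were not captured from a real Solr instance, and `num_found` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: extract the numFound count from a Solr JSON response.
# Newlines are removed first so indented output is handled as one line.
num_found() {
  { echo "$1" | tr -d '\n'; echo; } |
    sed -n 's/.*"numFound":\([0-9][0-9]*\).*/\1/p'
}

# Assumed response shapes after adding row1, and after deleting it
after_put='{"responseHeader":{"status":0},"response":{"numFound":1,"start":0,"docs":[{"id":"row1","firstname_s":"Jim"}]}}'
after_delete='{"responseHeader":{"status":0},"response":{"numFound":0,"start":0,"docs":[]}}'

echo "after put: $(num_found "$after_put")"
echo "after delete: $(num_found "$after_delete")"
```

Against a live instance, the argument would be the output of a `curl -s` call to one of the select URLs used in the steps.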
10-08-2015
07:10 PM
9 Kudos
Lab Overview
In this lab, we will learn to:
Configure Solr to store indexes in HDFS
Create a Solr cluster of 2 Solr instances running on ports 8983 and 8984
Index documents in HDFS using the Hadoop connectors
Use Solr to search documents

Pre-Requisite
The lab is designed for the HDP Sandbox. Download the HDP Sandbox here, import into VMWare Fusion and start the VM

LAB

Step 1 - Log into Sandbox
After it boots up, find the IP address of the VM and add an entry into your machine's hosts file e.g.
192.168.191.241 sandbox.hortonworks.com sandbox
Connect to the VM via SSH (root/hadoop), correct the /etc/hosts entry:
ssh root@sandbox.hortonworks.com
If running on an Ambari installed HDP 2.3 cluster (instead of sandbox), run the below to install HDPsearch:
yum install -y lucidworks-hdpsearch
sudo -u hdfs hadoop fs -mkdir /user/solr
sudo -u hdfs hadoop fs -chown solr /user/solr
If running on HDP 2.3 sandbox, run below:
chown -R solr:solr /opt/lucidworks-hdpsearch
Run remaining steps as solr:
su solr
Step 2 - Configure Solr to store index files in HDFS
For the lab, we will use schemaless configuration that ships with Solr
Schemaless configuration is a set of Solr features that allow one to index documents without pre-specifying the schema of indexed documents
Sample schemaless configuration can be found in the directory /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs
Let's create a copy of the sample schemaless configuration and modify it to store indexes in HDFS:
cp -R /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs
Open /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs/conf/solrconfig.xml in your favorite editor and make the following changes:
1- Replace the section:
<directoryFactory name="DirectoryFactory"
>
</directoryFactory>
with
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
<str name="solr.hdfs.home">hdfs://sandbox.hortonworks.com/user/solr</str>
<bool name="solr.hdfs.blockcache.enabled">true</bool>
<int name="solr.hdfs.blockcache.slab.count">1</int>
<bool name="solr.hdfs.blockcache.direct.memory.allocation">false</bool>
<int name="solr.hdfs.blockcache.blocksperbank">16384</int>
<bool name="solr.hdfs.blockcache.read.enabled">true</bool>
<bool name="solr.hdfs.blockcache.write.enabled">false</bool>
<bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
<int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
<int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>
2- Set lockType to <lockType>hdfs</lockType>
3- Save and exit the file

Step 3 - Start 2 Solr instances in solrcloud mode
mkdir -p ~/solr-cores/core1
mkdir -p ~/solr-cores/core2
cp /opt/lucidworks-hdpsearch/solr/server/solr/solr.xml ~/solr-cores/core1
cp /opt/lucidworks-hdpsearch/solr/server/solr/solr.xml ~/solr-cores/core2
#you may need to set JAVA_HOME
#export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64
/opt/lucidworks-hdpsearch/solr/bin/solr start -cloud -p 8983 -z sandbox.hortonworks.com:2181 -s ~/solr-cores/core1
/opt/lucidworks-hdpsearch/solr/bin/solr restart -cloud -p 8984 -z sandbox.hortonworks.com:2181 -s ~/solr-cores/core2
Step 4 - Create a Solr Collection named "labs" with 2 shards and a replication factor of 2
/opt/lucidworks-hdpsearch/solr/bin/solr create -c labs -d /opt/lucidworks-hdpsearch/solr/server/solr/configsets/data_driven_schema_configs_hdfs/conf -n labs -s 2 -rf 2
Step 5 - Validate that the labs collection got created
Using the browser, visit http://sandbox.hortonworks.com:8983/solr/#/~cloud. You should see the labs collection with 2 shards, each with a replication factor of 2.

Step 6 - Load documents to HDFS
Upload sample csv file to HDFS. We will index the file with Solr using the Solr Hadoop connectors:
hadoop fs -mkdir -p csv
hadoop fs -put /opt/lucidworks-hdpsearch/solr/example/exampledocs/books.csv csv/
Step 7 - Index documents with Solr using Solr Hadoop Connector
hadoop jar /opt/lucidworks-hdpsearch/job/lucidworks-hadoop-job-2.0.3.jar com.lucidworks.hadoop.ingest.IngestJob -DcsvFieldMapping=0=id,1=cat,2=name,3=price,4=instock,5=author -DcsvFirstLineComment -DidField=id -DcsvDelimiter="," -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c labs -i csv/* -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -zk localhost:2181
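The -DcsvFieldMapping value pairs each zero-based column position with a Solr field name. For a different CSV file the mapping string can be derived from a comma-separated list of field names, as in the sketch below. `csv_field_mapping` is a hypothetical helper and not part of the Lucidworks job jar. The field list is the one used in the command above.

```shell
#!/bin/sh
# Hypothetical helper: turn "id,cat,name" into "0=id,1=cat,2=name",
# i.e. the position=field pairs expected by -DcsvFieldMapping.
csv_field_mapping() {
  echo "$1" | awk -F',' '{
    for (i = 1; i <= NF; i++)
      printf "%s%d=%s", (i > 1 ? "," : ""), i - 1, $i
    print ""
  }'
}

csv_field_mapping "id,cat,name,price,instock,author"
# prints 0=id,1=cat,2=name,3=price,4=instock,5=author
```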
Step 8 - Search indexed documents
Search the indexed documents. Using the browser, visit the URL http://sandbox.hortonworks.com:8984/solr/labs/select?q=*:*
You will see the search results in the browser.

Step 9 - Lab Complete
You have successfully completed the lab and learnt how to:
Store Solr indexes in HDFS
Create a Solr Cluster
Index documents in HDFS using Solr Hadoop connectors
10-08-2015
06:35 PM
18 Kudos
There have been a number of questions recently on using AD/IPA with HDP 2.3 security:
How to set up Active Directory/IPA?
How to set up cluster OS to recognize users from AD using SSSD?
How to enable Kerberos for authentication?
How to install Ranger for authorization/audit and set up plugins for HDFS, Hive, HBase, Kafka, Storm, Yarn, Knox and test these components on a kerberized cluster?
How to set up Ranger user/group sync with AD/IPA?
How to integrate Knox with AD/IPA?
How to set up encryption at rest with Ranger KMS?
To help answer some of these questions, the partner team has prepared cheatsheets on security workshops. These are living materials with sample code snippets which are being updated/enhanced per feedback from the field, so rather than replicate the materials here, the latest materials can be referenced at the GitHub repo linked from here: https://community.hortonworks.com/repos/4465/workshops-on-how-to-setup-security-on-hadoop-using.html
To help get started with security, we have also made available secured sandbox and LDAP VMs built by running through the above steps. Note that these are unofficial; for the final word on security with HDP, the official docs should be referenced at: http://docs.hortonworks.com. For example:
http://docs.hortonworks.com/HDPDocuments/Ambari-2.1.1.0/bk_Ambari_Security_Guide/content/ch_amb_sec_guide.html
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_Ranger_Install_Guide/content/ch_overview_ranger_ambari_install.html
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_Ranger_KMS_Admin_Guide/content/ch_ranger_kms_overview.html
For help with the workshop materials please use GitHub issues: https://github.com/abajwa-hw/security-workshops/issues
10-08-2015
04:29 PM
Note that you would need to add this user to the list of sudoers first which the documentation hadn't mentioned. I ran into the same while building the ambari service. See https://issues.apache.org/jira/browse/NIFI-930
09-25-2015
09:16 PM
One good thing to show in the tutorial would be how this lets you manage multi-tenancy for Spark (currently only available via Spark on YARN) https://github.com/hortonworks-gallery/ambari-zeppelin-service/blob/master/README.md#zeppelin-yarn-integration
09-25-2015
09:11 PM
You can do this through the view as well. See this lab we recently put together: https://github.com/abajwa-hw/hdp22-hive-streaming/blob/master/LAB-STEPS.md