Member since
02-09-2016
559
Posts
422
Kudos Received
98
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2132 | 03-02-2018 01:19 AM
 | 3472 | 03-02-2018 01:04 AM
 | 2360 | 08-02-2017 05:40 PM
 | 2340 | 07-17-2017 05:35 PM
 | 1709 | 07-10-2017 02:49 PM
09-07-2016
03:48 PM
@Arun I'm glad to help. If you run across any other links you find helpful, come back and share them!
09-07-2016
02:43 PM
4 Kudos
@Arun What language are you interested in using for Pig UDFs? I personally prefer Python, as I'm most comfortable with it; however, you'll likely see better performance from Java-based UDFs. This site provides a decent overview: http://help.mortardata.com/technologies/pig/writing_python_udfs. I find I learn best from examples, and here is a link to some examples they have written: https://github.com/mortardata/mortar-examples/tree/master/udfs/python. You may find this link helpful as well: https://www.codementor.io/data-science/tutorial/extending-hadoop-apache-pig-with-python-udfs. There is also the Apache documentation: https://pig.apache.org/docs/r0.16.0/udf.html
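To give you a feel for the Python side, here is a minimal sketch of a Jython UDF (the script name, function name, and field name are placeholders I made up; the @outputSchema decorator is injected by Pig when the script is registered with USING jython, so no import is needed):
# my_udfs.py -- a minimal Python (Jython) UDF for Pig
@outputSchema("word:chararray")
def to_upper(word):
    # Upper-case the input string; pass nulls through untouched.
    if word is None:
        return None
    return word.upper()
You would then register and call it from a Pig script, assuming a relation named words with a chararray field named word:
REGISTER 'my_udfs.py' USING jython AS my_udfs;
upper_words = FOREACH words GENERATE my_udfs.to_upper(word);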
09-07-2016
02:32 PM
I'm very glad to hear that!
09-06-2016
04:30 PM
@Piyush Jhawar If @Laurence Da Luz answered your question, please accept the answer to help others in the community.
09-04-2016
07:29 PM
@Shashi Vish In my case, the value of twitter.text is set by a processor before the RouteOnAttribute. I use EvaluateJsonPath to pull data out of the flowfile content and set it as flowfile attributes. You can then use the expression language to work with those attributes. Here is a link to the documentation for the expression language: https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html In your case, where are these properties? Are they part of the flowfile content?
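For reference, here is a rough sketch of how my EvaluateJsonPath processor is configured (the JsonPath and attribute name are from my Twitter flow and will differ for your data):
Destination: flowfile-attribute
twitter.text: $.text (a dynamic property; it extracts the tweet text from the JSON content into a flowfile attribute)
Once the attribute exists, downstream processors can reference it with the expression language, for example ${twitter.text}.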
09-03-2016
04:45 PM
5 Kudos
@Shashi Vish RouteOnAttribute should be the correct processor to use here. How many possible values do you want to route on? If the number is small, you can define each of them as properties within the RouteOnAttribute processor. When you configure the processor, set the Routing Strategy to "Route to Property name", then add a property for each value you need to deal with, using the expression language to evaluate whether it is true. Here is an example where I'm checking whether twitter.text contains 'elasticsearch' or 'solr' (see the sketch below). Then, when I connect the RouteOnAttribute processor to each of my other processors, I can specify the condition on which flowfiles should be routed. For example, when I connect it to my PutElasticsearch processor, I only check the "elasticsearch" relationship. If I create a connection to the PutSolr processor, I would check the "solr" relationship. In your case, the default processor could be routed on the "unmatched" relationship. I hope this helps.
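As a rough sketch (the attribute name and values are from my Twitter example and are only illustrative), the two properties on the RouteOnAttribute processor look something like this:
elasticsearch: ${twitter.text:toLower():contains('elasticsearch')}
solr: ${twitter.text:toLower():contains('solr')}
Each property name becomes a relationship you can choose when you draw the connections to the downstream processors.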
09-03-2016
05:49 AM
6 Kudos
Objective:
The purpose of this tutorial is to walk you through the process of enabling the Elasticsearch interpreter for Zeppelin on the HDP 2.5 TP sandbox. As part of this process, we will install Elasticsearch and then use Zeppelin to index and query data. This is the first of two articles covering Elasticsearch on HDP. The second article covers pushing Twitter data to Elasticsearch using NiFi and provides a sample Zeppelin dashboard. You can find that article here: HCC Article
Note: The Zeppelin Elasticsearch interpreter is a community-provided interpreter. It is not yet considered GA by Hortonworks and should only be used for development and testing purposes.
Prerequisites:
You should already have installed the Hortonworks Sandbox (HDP 2.5 Tech Preview).
Note: While not required, I recommend using Vagrant to manage multiple versions of the Sandbox. Follow my tutorial here to set that up: HCC Article
Scope:
This tutorial was tested using the following environment and components:
Mac OS X 10.11.6
HDP 2.5 Tech Preview on Hortonworks Sandbox
Elasticsearch 2.3.5 and Elasticsearch 2.4.0
Note: This has also been tested on HDP 2.5 deployed with Cloudbreak on AWS. The specific steps may vary depending on your environment, but the high-level process is the same.
Steps:
Here is the online documentation for the Elasticsearch interpreter for Zeppelin:
Elasticsearch Interpreter. If you follow the steps provided in that documentation, you will find that adding the Elasticsearch interpreter is not possible as shown. That is because the interpreter is not enabled by default.
If you try to add the interpreter, you will see it is not in the list. You should see something similar to:
Verify Elasticsearch Interpreter is available
The first thing we are going to do is ensure the Elasticsearch interpreter is available within the Zeppelin installation. You can verify the Elasticsearch interpreter is available by looking in the interpreter directory:
$ ls -la /usr/hdp/current/zeppelin-server/interpreter/
total 76
drwxr-xr-x 19 zeppelin zeppelin 4096 2016-06-24 00:00 .
drwxr-xr-x 8 zeppelin zeppelin 4096 2016-08-31 02:57 ..
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-23 23:59 alluxio
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-23 23:59 angular
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 cassandra
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 elasticsearch
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 file
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 flink
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 hbase
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 ignite
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 jdbc
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 kylin
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 lens
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 livy
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 md
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 psql
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 python
drwxr-xr-x 2 zeppelin zeppelin 4096 2016-06-24 00:00 sh
drwxr-xr-x 3 zeppelin zeppelin 4096 2016-06-24 00:00 spark
Note: This process is easy on the sandbox. If you are using a different HDP environment, then you need to perform this step on the server on which Zeppelin is installed.
If you do not see a directory for elasticsearch, you may have to run an interpreter install script. Here are the steps to run the interpreter install script:
$ cd /usr/hdp/current/zeppelin-server/bin
$ sudo ./install-interpreter.sh --name elasticsearch
Add Elasticsearch Interpreter to the Zeppelin configuration
Now we need to add the Elasticsearch interpreter to the Zeppelin configuration, which enables access to it. You need to modify the zeppelin.interpreters parameter.
Click on the Zeppelin Notebook service in Ambari:
Now, click on the Configs link:
Expand Advanced zeppelin-config:
Add the following string to the end of the zeppelin.interpreters parameter:
,org.apache.zeppelin.elasticsearch.ElasticsearchInterpreter
Note: The comma is not a typo. It is required to separate our added value from the previous value.
It should look similar to this:
Now click the Save button to save the settings. You should see an indication that you need to restart the Zeppelin service. It should look similar to this:
Restart the Zeppelin Notebook service.
Configure Zeppelin Interpreter
Now you should be able to follow the documentation I linked previously for setting up the Elasticsearch interpreter. You should have something similar to this:
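The key interpreter properties (names taken from the Zeppelin Elasticsearch interpreter documentation; exact defaults may vary by Zeppelin version) look roughly like this:
elasticsearch.cluster.name = elasticsearch
elasticsearch.host = sandbox.hortonworks.com
elasticsearch.port = 9300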
The elasticsearch.host value will correspond to your IP address, or sandbox.hortonworks.com if you have edited your local /etc/hosts file.
Download Elasticsearch
Now that Zeppelin is configured, we need to download Elasticsearch. The latest version is 2.4.0. You can read more about Elasticsearch here:
Elasticsearch Website
You can use curl to download Elasticsearch to your sandbox.
$ cd ~
$ curl -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.4.0/elasticsearch-2.4.0.tar.gz
Note: If you are using Vagrant, you can download the file on your local computer and simply copy it to your Vagrant directory. The file will be visible within the sandbox in the /vagrant directory.
Install Elasticsearch
Next we need to extract Elasticsearch to the /opt directory, which is where we'll run it.
$ cd /opt
$ sudo tar xvfz ~/elasticsearch-2.4.0.tar.gz
Configure Elasticsearch
We need to make a couple of changes to the Elasticsearch configuration file /opt/elasticsearch-2.4.0/config/elasticsearch.yml.
$ cd elasticsearch-2.4.0/config
$ vi elasticsearch.yml
We need to set the cluster.name setting to "elasticsearch". This is the default that Zeppelin expects; however, you can change this value in the Zeppelin configuration.
cluster.name: elasticsearch
We need to set the network.host setting to our sandbox hostname or IP. Elasticsearch defaults to binding to 127.0.0.1, which won't allow us to easily access it from outside of the sandbox.
network.host: sandbox.hortonworks.com
Make sure you have removed the # character at the start of the line for these two settings. Once you have completed these two changes, save the file:
Press the esc key, then type :wq and press Enter to save and quit.
Create Elasticsearch user
We are going to create an elastic user to run the application.
$ sudo useradd elastic -d /home/elastic
Change Ownership of Elasticsearch directories
We are going to change the ownership of the Elasticsearch directory to the elastic user:
$ sudo chown -R elastic:elastic /opt/elasticsearch-2.4.0
Start Elasticsearch
We want to run Elasticsearch as the elastic user, so first we'll switch to that user.
$ sudo su - elastic
$ cd /opt/elasticsearch-2.4.0
$ bin/elasticsearch
You will see something similar to:
$ bin/elasticsearch
[2016-09-02 19:44:34,905][WARN ][bootstrap ] unable to install syscall filter: seccomp unavailable: CONFIG_SECCOMP not compiled into kernel, CONFIG_SECCOMP and CONFIG_SECCOMP_FILTER are needed
[2016-09-02 19:44:35,168][INFO ][node ] [Skyhawk] version[2.4.0], pid[22983], build[ce9f0c7/2016-08-29T09:14:17Z]
[2016-09-02 19:44:35,168][INFO ][node ] [Skyhawk] initializing ...
[2016-09-02 19:44:35,807][INFO ][plugins ] [Skyhawk] modules [lang-groovy, reindex, lang-expression], plugins [], sites []
[2016-09-02 19:44:35,856][INFO ][env ] [Skyhawk] using [1] data paths, mounts [[/ (/dev/mapper/vg_sandbox-lv_root)]], net usable_space [26.2gb], net total_space [42.6gb], spins? [possibly], types [ext4]
[2016-09-02 19:44:35,856][INFO ][env ] [Skyhawk] heap size [990.7mb], compressed ordinary object pointers [true]
[2016-09-02 19:44:35,856][WARN ][env ] [Skyhawk] max file descriptors [4096] for elasticsearch process likely too low, consider increasing to at least [65536]
[2016-09-02 19:44:38,032][INFO ][node ] [Skyhawk] initialized
[2016-09-02 19:44:38,032][INFO ][node ] [Skyhawk] starting ...
[2016-09-02 19:44:38,115][INFO ][transport ] [Skyhawk] publish_address {172.28.128.4:9300}, bound_addresses {172.28.128.4:9300}
[2016-09-02 19:44:38,119][INFO ][discovery ] [Skyhawk] elasticsearch/31d3OvlZT5WRnqYUW-GJwA
[2016-09-02 19:44:41,157][INFO ][cluster.service ] [Skyhawk] new_master {Skyhawk}{31d3OvlZT5WRnqYUW-GJwA}{172.28.128.4}{172.28.128.4:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2016-09-02 19:44:41,206][INFO ][http ] [Skyhawk] publish_address {172.28.128.4:9200}, bound_addresses {172.28.128.4:9200}
[2016-09-02 19:44:41,207][INFO ][node ] [Skyhawk] started
[2016-09-02 19:44:41,223][INFO ][gateway ] [Skyhawk] recovered [0] indices into cluster_state
Verify access to Elasticsearch
Using your web browser, verify you get a response from Elasticsearch by using the following address:
http://sandbox.hortonworks.com:9200
You should see something similar to:
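A rough sketch of the JSON response (the node name and version values come from the startup log above; the remaining fields are elided):
{
  "name" : "Skyhawk",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.4.0",
    "build_hash" : "ce9f0c7...",
    "build_timestamp" : "2016-08-29T09:14:17Z",
    ...
  },
  "tagline" : "You Know, for Search"
}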
Alternatively, you can use curl:
curl -XGET http://sandbox.hortonworks.com:9200
You will see similar JSON output.
Add data to Elasticsearch
Now we are going to create a notebook in Zeppelin, with a note for each index operation. Let's use the %elasticsearch interpreter and the index command to index some data:
%elasticsearch
index movies/default/1 {
"title": "The Godfather",
"director": "Francis Ford Coppola",
"year": 1972,
"genres": ["Crime", "Drama"]
}
%elasticsearch
index movies/default/2 {
"title": "Lawrence of Arabia",
"director": "David Lean",
"year": 1962,
"genres": ["Adventure", "Biography", "Drama"]
}
%elasticsearch
index movies/default/3 {
"title": "To Kill a Mockingbird",
"director": "Robert Mulligan",
"year": 1962,
"genres": ["Crime", "Drama", "Mystery"]
}
%elasticsearch
index movies/default/4 {
"title": "Apocalypse Now",
"director": "Francis Ford Coppola",
"year": 1979,
"genres": ["Drama", "War"]
}
%elasticsearch
index movies/default/5 {
"title": "Kill Bill: Vol. 1",
"director": "Quentin Tarantino",
"year": 2003,
"genres": ["Action", "Crime", "Thriller"]
}
%elasticsearch
index movies/default/6 {
"title": "The Assassination of Jesse James by the Coward Robert Ford",
"director": "Andrew Dominik",
"year": 2007,
"genres": ["Biography", "Crime", "Drama"]
}
You should have a notebook that looks similar to this:
For each of the index notes, click the play button to insert the data.
Query Elasticsearch data
Once the data is in Elasticsearch, we can search using Zeppelin like this:
%elasticsearch
search /movies/default
For this note, click the play button to run the query. You should see something similar to this:
The Elasticsearch interpreter has great support for the Elasticsearch Query DSL (Domain Specific Language). You can easily filter the fields returned and create buckets and aggregations.
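For example, the search command accepts a JSON query body. Here is a sketch (using the movies index created above) that returns only the Francis Ford Coppola movies:
%elasticsearch
search /movies/default { "query": { "match": { "director": "Francis Ford Coppola" } } }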
Review:
We have enabled the Elasticsearch interpreter in Zeppelin, indexed data into Elasticsearch, and queried that data from Elasticsearch using Zeppelin. Try indexing and querying your own data with a different index name.
09-02-2016
12:57 PM
@Amit Nandi
If you want to use libraries not included in the standard Python distribution, then you have to ensure those libraries are installed on every server where the Spark job is going to run.
As you may be aware, using something like Anaconda Python makes that process much easier. To ensure Zeppelin uses that version of Python when you use the Python interpreter, set the zeppelin.python setting to the path of the Anaconda python binary.
https://github.com/apache/zeppelin/blob/master/docs/interpreter/python.md
You should set your PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables so that Spark uses Anaconda on both the executors and the driver. You can get more information here:
https://spark.apache.org/docs/1.6.2/programming-guide.html
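As a minimal sketch, assuming Anaconda is installed at /opt/anaconda on every node (that path is an assumption; adjust it for your environment), you would export something like the following before launching Zeppelin or spark-submit:
export PYSPARK_PYTHON=/opt/anaconda/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/anaconda/bin/python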
09-01-2016
05:43 PM
1 Kudo
Yes, you can store UTF-8 characters in Hive tables and retrieve them. I have tested this with both left-to-right and right-to-left languages without any problems. You typically don't have to do anything special to get it to work. Try this:
1. Create a delimited text file with a couple of rows of data (including UTF-8 characters). You can use \t as the delimiter.
2. Make sure you save the file as a UTF-8 text file and push it to HDFS.
3. Create an external table in Hive that points to the directory where you placed that file (see the sketch below).
4. Run the same query as before to see if the data is displayed correctly.
I'm wondering if there is something happening in the environment when you do the insert. Did you do the insert from the command line, or did you use the Hive view? From the command line, try setting your environment before running hive or beeline:
export LANG=en_US.UTF-8
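Here is a minimal sketch of step 3 (the HDFS path, table name, and column names are hypothetical; adjust them to match your file):
CREATE EXTERNAL TABLE utf8_test (
  id INT,
  text_value STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/hive/utf8_test';
SELECT * FROM utf8_test;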
08-31-2016
04:48 PM
@Zack Riesland Have you tried setting the following?
set hive.exec.compress.output=false