Member since
07-30-2019
93
Posts
96
Kudos Received
2
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
| 1298 | 08-12-2016 02:57 PM
| 1799 | 05-02-2016 09:14 PM
08-16-2016
04:50 AM
6 Kudos
To begin, you have two options:

1. Download the Apache NiFi package onto hn0 of your HDI cluster and start NiFi with $NIFI_HOME/bin/nifi.sh start
2. Install NiFi through Ambari.

See this link for a walkthrough of getting the Ambari NiFi installer loaded: https://community.hortonworks.com/content/kbentry/1282/sample-hdfnifi-flow-to-push-tweets-into-solrbanana.html

To install NiFi, start the Install Wizard: open Ambari, then on the bottom left -> Actions -> Add Service -> check NiFi -> Next -> Next -> change any config you like (e.g. install dir, port, setup_prebuilt, or values in nifi.properties) -> Next -> Deploy. This kicks off the install, which runs for 5-10 minutes and leaves you with a NiFi instance on hn0 of your cluster.

Next, set up your laptop so it can connect to NiFi. First create an SSH tunnel into hn0 that acts as a SOCKS proxy:

ssh -D 9000 user@example.azurehdinsight.net

This routes all traffic sent to localhost port 9000 into Azure and also encrypts it. Next, go to your network settings, click the Proxy Settings tab, and create a new proxy with server localhost and port 9000. Now from your browser navigate to http://hn0-"serveraddress".com:8080/nifi (of course, substitute the proper "serveraddress"). From here you can begin building workflows and ingesting data into HDI.

If you wish to write to HDFS on your HDI cluster, you will also need to update some library files in NiFi in order to use WASB: https://github.com/jfrazee/nifi/releases/tag/v0.6.x Verify that the checksums are correct and replace the following two files in your $NIFI_HOME/lib directory.
62f1261b837f3b92f4ee1decc708494c  ./nifi-hadoop-libraries-nar-0.6.1.nar
efdb9941185100c1a2432f408b9db187  ./nifi-hadoop-nar-0.6.1.nar

And if the head node has unzip, check that

unzip -t ./nifi-assembly/target/nifi-0.6.1-bin/nifi-0.6.1/lib/nifi-hadoop-nar-0.6.1.nar | grep -i azure

returns:

testing: META-INF/bundled-dependencies/azure-storage-2.0.0.jar OK
testing: META-INF/bundled-dependencies/hadoop-azure-2.7.2.jar OK

Now, inside your PutHDFS processor, point to the location (full path) of your hdfs-site.xml and core-site.xml files, and set the directory you wish to write to. Tutorials on what to do next can be found all over HCC. Try importing Twitter data by following along with this tutorial: Twitter Ingest Tutorial
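If you'd rather script the checksum verification, here is a minimal sketch using Python's standard hashlib module; the file paths and expected digests mirror the ones listed above, and the paths are relative to wherever you downloaded the nars:

```python
import hashlib

# Expected MD5 checksums for the replacement nars (from the release above)
expected = {
    "./nifi-hadoop-libraries-nar-0.6.1.nar": "62f1261b837f3b92f4ee1decc708494c",
    "./nifi-hadoop-nar-0.6.1.nar": "efdb9941185100c1a2432f408b9db187",
}

def md5sum(path):
    """Compute the MD5 hex digest of a file, reading in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

for path, digest in expected.items():
    try:
        status = "OK" if md5sum(path) == digest else "MISMATCH"
    except OSError:
        status = "MISSING"
    print(path, status)
```

This prints OK, MISMATCH, or MISSING for each file, so you can run it from the directory holding the downloaded nars before copying them into $NIFI_HOME/lib.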
08-12-2016
02:57 PM
You must verify that the solrconfig.xml is well-formed XML and isn't missing entities. You can use this website to validate the XML: XML Validator. You can also get a fresh copy of the solrconfig file from GitHub if you have modified the original and don't have a backup: solrconfig.xml
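If you prefer to check locally rather than paste the file into a website, well-formedness can be verified with Python's standard library; the file name here is just an example:

```python
import xml.etree.ElementTree as ET

def is_well_formed(path):
    """Return (True, None) if the file parses as XML, else (False, error)."""
    try:
        ET.parse(path)
        return (True, None)
    except (ET.ParseError, OSError) as e:
        return (False, str(e))

# Example: check a local copy of solrconfig.xml
ok, err = is_well_formed("solrconfig.xml")
print("well-formed" if ok else "broken: %s" % err)
```

A parse error such as "mismatched tag" points you at the line and column where the structure breaks, which is usually enough to find the missing entity.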
08-12-2016
02:53 PM
If the following error shows up when you are trying to create a Solr core: Unable to create core [tweets_shard1_replica1] Caused by: XML document structures must start and end within the same entity.
How do you get past this issue?
Labels: Apache Solr
08-08-2016
11:39 PM
Has anybody seen the following error when trying to create the tweet shard? And is there a known solution?
Unable to create core [tweets_shard1_replica1] Caused by: XML document structures must start and end within the same entity.
05-25-2016
04:30 PM
17 Kudos
In NiFi, the data passed between processors is referred to as a FlowFile, and it can be accessed from various scripting languages in the ExecuteScript processor. To access the data in the FlowFile you need to understand a few requirements first. In this example we will access the JSON data passed into ExecuteScript from a GetTwitter processor, though it could be any data ingestion processor.

(Note: ExecuteScript uses Jython, which is good but limited. If you have a Python script that uses other libraries and produces an output, you can use ExecuteProcess instead, which runs the script on the machine with the full Python library set; its output becomes your FlowFile.)

In the ExecuteScript processor, set the script type to python and enter the following code in the "Script Body" section:

import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

class ModJSON(StreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        # Read the incoming FlowFile content and parse it as JSON
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        obj = json.loads(text)
        # Build a new JSON object from selected attributes of the source
        newObj = {
            "Source": "NiFi",
            "ID": obj['id'],
            "Name": obj['user']['screen_name']
        }
        outputStream.write(bytearray(json.dumps(newObj, indent=4).encode('utf-8')))

flowFile = session.get()
if flowFile is not None:
    flowFile = session.write(flowFile, ModJSON())
    flowFile = session.putAttribute(flowFile, "filename",
                                    flowFile.getAttribute('filename').split('.')[0] + '_translated.json')
    session.transfer(flowFile, REL_SUCCESS)
session.commit()
The import statements are required to take advantage of the NiFi components. The FlowFile is accessed through a global variable called "session" that NiFi makes available to the script: session.get() is how we grab the stream from NiFi, and session.write() passes it to the processing class. You can build up a new JSON object while referencing attributes from the source JSON, as in the script:

newObj = {
    "Source": "NiFi",
    "ID": obj['id'],
    "Name": obj['user']['screen_name']
}

This should help you get started with using Python scripts inside NiFi. I hope to see some posts of how you modify this to create more interesting flows.
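Outside NiFi, the transformation itself is plain Python, so you can prototype it locally before pasting it into ExecuteScript. A minimal sketch with a fabricated sample status (only the fields the script reads, with made-up values):

```python
import json

# Fabricated sample of an incoming Twitter status (illustrative values only)
text = json.dumps({
    "id": 1234567890,
    "user": {"screen_name": "example_user"},
    "text": "hello from the stream"
})

obj = json.loads(text)
newObj = {
    "Source": "NiFi",
    "ID": obj['id'],
    "Name": obj['user']['screen_name']
}
print(json.dumps(newObj, indent=4))
```

If this produces the JSON you expect, the same body will work inside the process() callback, with the input stream supplying the text instead of the hard-coded sample.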
05-18-2016
07:32 PM
In data science, R is commonly used for analytics and data exploration. When moving to a Hadoop architecture and a connected data platform, a big question is: what happens to my existing R scripts? You can transition nicely to Hadoop using the rHadoopClient package for R, which allows you to read from HDFS and get the data back into an R data frame. To enable this, first fetch the package from the CRAN archive:

wget https://cran.r-project.org/src/contrib/Archive/rHadoopClient/rHadoopClient_0.2.tar.gz

and then install it from the local file:

install.packages("/path/to/rHadoopClient_0.2.tar.gz", repos = NULL, type = "source")

Now you can read a file in using rHadoopClient:

rHadoopClient::read.hdfs("/path/to/data.csv")

That's all you need to get started. This lets you change the file-read steps in your R scripts to point to HDFS and still run your R scripts as you are used to doing.
05-09-2016
05:11 PM
2 Kudos
A question encountered by most organizations attempting new big data initiatives, or adapting current operations to keep pace with innovation, is what approach to take in architecting their streaming workflows and applications. Because Twitter data is free and easily accessible, I chose to examine Twitter events in this article; however, Twitter events are simply JSON messages, and they can stand in for machine-generated JSON packets in a more general streaming architecture. Let's move forward with an evaluation of NiFi vs. Python for ingesting streaming events from Twitter.
When approaching a problem, there are often several tools and methods that could ultimately yield similar results. In this article I look at the problem of ingesting streaming Twitter events and show how HDF and Python can each be used to create an application that achieves the goal. I will first lay out a solution on both platforms, then point out the different aspects and considerations.
To ingest data from Twitter, regardless of what solution you choose, API tokens are first required.
Get those here:
Twitter Developer Console
First, let's look at Python:
You can get the source for the basic ingestion application here:
Twitter Python Example
Tweepy is used to connect to Twitter and begin receiving status objects:

import tweepy
import threading, logging, time
import string
from kafka import KafkaClient, SimpleProducer  # kafka-python, used for the producer below

######################################################################
# Authentication details. To obtain these visit dev.twitter.com
######################################################################
consumer_key = '<insert your consumer key>'
consumer_secret = '<insert your consumer secret>'
access_token = '<insert your access token>'
access_token_secret = '<insert your access secret>'
Now we initiate the connection in the main loop:

if __name__ == '__main__':
    # StdOutListener is defined in the linked example source
    listener = StdOutListener()
    # sign OAuth cert
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    # uncomment to use the API in the stream for data send/retrieve algorithms
    # api = tweepy.API(auth)
    stream = tweepy.Stream(auth, listener)
    ######################################################################
    # sample() delivers a stream of 1% (random selection) of all tweets
    ######################################################################
    client = KafkaClient("localhost:9092")
    producer = SimpleProducer(client)
    stream.sample()

Parsing the messages must be done programmatically, and any change to the flow requires editing the source code.
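To illustrate that point, here is a minimal sketch of the message-handling logic such an application needs. In the real listener this would live in the on_data callback and hand its result to the Kafka producer; the function, field selection, and topic name here are all illustrative:

```python
import json

def handle_status(raw_json):
    """Extract the fields of interest from a raw Twitter status and
    return the (topic, message) pair to hand to the Kafka producer."""
    status = json.loads(raw_json)
    message = json.dumps({
        "id": status["id"],
        "user": status["user"]["screen_name"],
        "text": status["text"],
    })
    return ("tweets", message)  # hypothetical topic name

# Example with a fabricated status object:
topic, msg = handle_status(
    '{"id": 1, "user": {"screen_name": "demo"}, "text": "hi"}'
)
print(topic, msg)
```

Every change to which fields are extracted, or where they are routed, means editing and redeploying code like this, which is exactly the maintenance cost the NiFi approach below avoids.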
Now let's examine the process in NiFi. To avoid redundancy, look at this thorough tutorial, which illustrates the use of NiFi to ingest streaming Twitter data: Twitter NiFi Example

To create the connection, we simply enter the API credentials into the GetTwitter processor. Then we can write the JSON objects to HDFS using the PutHDFS processor, or parse out any information of interest while the message is in flight using the EvaluateJsonPath processor.

Looking at these two approaches, you can probably already see the benefits of choosing NiFi, which is part of the Hortonworks HDF "Data in Motion" platform. With NiFi, our architecture is not limited, and it can be easily maintained and extended. Note that to write to Kafka we can simply branch the stream into a PutKafka processor and leave our original workflow intact. Also note the degree of programming knowledge required to manage and extend the Python application, where NiFi can be managed visually with no need for code.

The key takeaways are the following. When making decisions about the technology to use in big data initiatives for streaming analytics platforms, you should put a great deal of thought into:

1. Ease of implementation
2. Ease of extensibility
3. Openness of the technology

The third point can't be emphasized enough. One of the biggest obstacles organizations face when attempting to implement or extend their existing analytics applications is the lack of openness and the lock-in caused by past technology decisions. Choosing platforms like Hortonworks HDF and HDP complements the architectures and technology already present in the organization, leaves the architecture open to future innovations, and allows the organization to keep up with the speed of innovation.
05-04-2016
06:44 PM
Good to know. Thanks for clearing that up for me.
05-04-2016
02:36 PM
When I have a policy in Atlas that blocks a user from selecting the ssn column of table1, I can create a view (view1) that includes ssn as a column, and that user, who would have been denied select * on table1, is then able to select * on view1. Is there a way to ensure that all downstream tables/views carry the same policy as their parent? Or is this the expected and desired behavior?
Labels: Apache Atlas
05-02-2016
09:14 PM
You mention that you start the Atlas server, but did you shut down Atlas before running the quick_start.py script? Perhaps there was a lock on files that needed to be updated. You may also need to verify that all variables and options are properly set. Here is a link to the install guide: http://atlas.incubator.apache.org/InstallationSteps.html