1973 Posts
1225 Kudos Received
124 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 1927 | 04-03-2024 06:39 AM |
|  | 3018 | 01-12-2024 08:19 AM |
|  | 1655 | 12-07-2023 01:49 PM |
|  | 2425 | 08-02-2023 07:30 AM |
|  | 3373 | 03-29-2023 01:22 PM |
01-11-2017
07:24 PM
Are you running Apache Atlas?
01-11-2017
06:49 PM
https://www.mail-archive.com/commits@ambari.apache.org/msg30743.html indicates that name was removed from the UI in Ambari.
01-11-2017
06:38 PM
HiveServer2 Interactive Host is set to localhost. There's no edit button, and I don't see that field anywhere else. How do I change that in Ambari?
Labels:
Apache Hive
01-11-2017
04:25 PM
ghost.xml NiFi Template
01-11-2017
08:37 AM
4 Kudos
There are a number of command-line tools that I like to use from NiFi as part of a big data flow. The first tool I wanted to use was Ghostscript. On CentOS/RHEL, you can install it via:
yum install ghostscript
I use Ghostscript to extract text content from PDFs (the file can be passed in from an existing flow using ExecuteStreamCommand). It then writes the text from those files to standard output.
run.sh
gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -dFirstPage=1 -dLastPage=500 -sOutputFile=- $@
Output from the Hadoop documentation:
GPL Ghostscript 9.07 (2013-02-14)
Copyright (C) 2012 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 8.
Page 1
Can't find (or can't open) font file NimbusSanL-ReguItal.
Querying operating system for font files...
Loading NimbusSanL-ReguItal font from /usr/share/fonts/default/Type1/n019023l.pfb... 3984660 2473586 2498328 1197484 3 done.
Loading NimbusSanL-Bold font from /usr/share/fonts/default/Type1/n019004l.pfb... 4025652 2573865 2498328 1199061 3 done.
Loading NimbusRomNo9L-Regu font from /usr/share/fonts/default/Type1/n021003l.pfb... 4072164 2714750 2518512 1214631 3 done.
Welcome to Apache™ Hadoop®!
Table of contents
1 What Is Apache Hadoop?.................................................................................................. 2
2 Getting Started .................................................................................................................. 3
3 Download Hadoop..............................................................................................................3
4 Who Uses Hadoop?............................................................................................................3
5 News................................................................................................................................... 3
Copyright © 2014 The Apache Software Foundation. All rights reserved.
Page 2
Loading NimbusSanL-Regu font from /usr/share/fonts/default/Type1/n019003l.pfb... 4243344 2902744 2478144 1170708 3 done.
Loading NimbusRomNo9L-Medi font from /usr/share/fonts/default/Type1/n021004l.pfb... 4410848 3063836 2518512 1208517 3 done.
Welcome to Apache™ Hadoop®!
1 What Is Apache Hadoop?
The Apache™ Hadoop® project develops open-source software for reliable, scalable,
distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the application layer, so delivering
a highly-available service on top of a cluster of computers, each of which may be prone to
failures.
The project includes these modules:
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides
high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Other Hadoop-related projects at Apache include:
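For quick testing outside of NiFi, the same gs invocation from run.sh can be driven from a short Python script. This is only a sketch of the idea; the input filename hadoop-docs.pdf is a hypothetical stand-in for whatever PDF your flow passes in.
import subprocess

# Same Ghostscript flags as run.sh: batch mode, text extraction device,
# pages 1-500, extracted text written to standard output.
cmd = ["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=txtwrite",
       "-dFirstPage=1", "-dLastPage=500", "-sOutputFile=-",
       "hadoop-docs.pdf"]  # hypothetical input PDF

text = subprocess.check_output(cmd).decode("utf-8", errors="replace")
print(text[:500])  # preview the start of the extracted text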
You will probably want to clean this up a little and remove some of the formatting; this can be done in NiFi, or later in Hive or Phoenix. Or you could send it as a message through Kafka and process it with Apache Storm, Apache Spark or other streaming tools.
For fans of old UNIX, everyone loved those fortunes. They are still available for install on CentOS:
yum install fortune-mod.x86_64
Below are the results of a flow calling fortune. It requires no parameters, so you just put fortune in the Command property of ExecuteStreamCommand. It writes its output to the console, which we extract into an attribute using (.*).+ and then convert to a JSON file for storage in HDFS.
Output JSON
{"fortune":"My little brother got this fortune"}
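Outside of NiFi, the ExtractText-plus-JSON step can be sketched in a few lines of Python. This is only an illustration of the same idea; it assumes fortune is installed, and the regex mirrors the (.*).+ pattern mentioned above.
import json
import re
import subprocess

# Run fortune (it needs no parameters) and capture its console output.
raw = subprocess.check_output(["fortune"]).decode("utf-8")

# Mimic the ExtractText step: pull the whole output into one value with a regex.
match = re.search(r"(.*).+", raw, re.DOTALL)
fortune_text = match.group(0).strip() if match else raw.strip()

# Mimic the attribute-to-JSON conversion before landing the file in HDFS.
print(json.dumps({"fortune": fortune_text}))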
Reference:
https://ghostscript.com/doc/current/Use.htm#Pipes
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ExecuteStreamCommand/
01-11-2017
01:04 AM
https://community.hortonworks.com/articles/73833/an-example-websocket-application-in-apache-nifi-11.html If clients are connected to the WebSocket, you can send them information from the flow. You could also package that information into Kafka or JMS and send it to a queue, and a backend will pull from the queue and send the WebSocket messages. That is probably the more common way to connect to the front-end web application.
01-11-2017
12:07 AM
Spark has lazy execution, so show() is where it actually tries to connect. Can you access that database from the sandbox command line? Check for errors, check Postgres permissions, and check the sandbox port mapping / firewalls.
01-10-2017
04:50 PM
https://help.sumologic.com/Send_Data/Sources/02Sources_for_Hosted_Collectors/HTTP_Source
01-08-2017
03:30 PM
2 Kudos
Introduction
For NLP, I mostly want to do two things:
- Entity recognition (people, facilities, organizations, locations, products, events, works of art, languages, groups, dates, times, percents, money, quantities, ordinals and cardinals)
- Sentiment analysis: basically, what is it and why don't people like it?
These two features are very useful as part of real-time streaming processing of social, email, log and semistructured document data. I can use both of them in Twitter ingest via Apache NiFi or Apache Spark. Don't confuse text entity recognition with the image recognition we looked at previously with TensorFlow. You can certainly add that to your flow as well, but that works with images, not text. My debate with sentiment analysis is whether to report numbers, really general terms like Neutral, Negative or Positive, or something more detailed like Stanford CoreNLP, which has multiple grades of each.
There are a lot of libraries available for NLP and sentiment analysis. The first decision is whether you want to run JVM programs (good for Hadoop MapReduce, Apache Spark, Apache Storm, enterprise applications, Spring applications, microservices, NiFi processors, Hive UDFs and Pig UDFs, with multiple programming language support: Java, Scala, ...) or run on Python, which is already well known by many data scientists and engineers, is simple to prototype with (no compiling), is very easy to call from NiFi and scripts, and also has a ton of great deep learning libraries and interfaces.
Python Libraries
Like most things in Python, you can use pip to install them. You will need a Python 2.7 or 3.x environment set up with pip to install and use the libraries I have looked at. spaCy requires NumPy, and so do many of the others.
spaCy
pip install -U spacy
python -m spacy.en.download all
Downloading parsing model
Downloading...
Downloaded 532.28MB 100.00% 9.59MB/s eta 0s
archive.gz checksum/md5 OK
Model successfully installed to /usr/lib64/python2.7/site-packages/spacy/data
Downloading GloVe vectors
Downloading...
Downloaded 708.08MB 100.00% 19.38MB/s eta 0s
archive.gz checksum/md5 OK
Model successfully installed to /usr/lib64/python2.7/site-packages/spacy/data
After you install, you need to download the text and models used by the tool.
import spacy
nlp = spacy.load('en')
doc5 = nlp(u"Timothy Spann is studying at Princeton University in New Jersey.")
# Named Entity Recognizer (NER)
for ent in doc5.ents:
    print ent, ent.label, ent.label_
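Since part of the appeal here is how easy these libraries are to call from NiFi and scripts, here is a minimal, hypothetical sketch that wraps the same NER loop as a stdin-to-JSON script (Python 2.7 and the spaCy 'en' model from above; the JSON field names are my own illustration, not part of spaCy).
import json
import sys

import spacy

nlp = spacy.load('en')  # the English model downloaded above

# Read the text to analyze from standard input (e.g. fed by ExecuteStreamCommand), Python 2.7 style.
doc = nlp(sys.stdin.read().decode('utf-8'))

# Emit each recognized entity and its label as JSON.
entities = [{"text": ent.text, "label": ent.label_} for ent in doc.ents]
print(json.dumps({"entities": entities}))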
NLTK
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import sys
sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(sys.argv[1])
if ss['compound'] == 0.00:
    print('Neutral')
elif ss['compound'] < 0.00:
    print('Negative')
else:
    print('Positive')
Another NLTK Option
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import sys
sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(sys.argv[1])
print('Compound {0} Negative {1} Neutral {2} Positive {3} '.format(ss['compound'],ss['neg'],ss['neu'],ss['pos']))
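Both NLTK snippets assume the VADER lexicon is already on the machine; if it is not, NLTK will complain about the missing resource, and a one-time download (assuming internet access) fixes that:
import nltk

# One-time download of the lexicon used by SentimentIntensityAnalyzer.
nltk.download('vader_lexicon')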
NLTK does sentiment analysis very easily, as shown above. It runs fairly quickly, so you can call this in a stream without too much overhead.
TextBlob
from textblob import TextBlob
b = TextBlob("Spellin iz vaerry haerd to do. I do not like this spelling product at all it is terrible and I am very mad.")
print(b.correct())
print(b.sentiment)
print(b.sentiment.polarity)
python tb.py
Spelling in very heard to do. I do not like this spelling product at all it is terrible and I am very mad.
Sentiment(polarity=-0.90625, subjectivity=1.0)
-0.90625
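TextBlob's polarity is a number between -1.0 and 1.0, so if you prefer the coarse Neutral/Negative/Positive labels from the NLTK example, a small mapping works. The thresholds below are my own choice, not something defined by TextBlob itself.
from textblob import TextBlob

def coarse_sentiment(text):
    # Map TextBlob's numeric polarity onto coarse labels; exactly 0.0 is treated as Neutral.
    polarity = TextBlob(text).sentiment.polarity
    if polarity == 0.0:
        return 'Neutral'
    elif polarity < 0.0:
        return 'Negative'
    return 'Positive'

print(coarse_sentiment("I do not like this spelling product at all."))  # likely 'Negative'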
TextBlob is a nice library that does sentiment analysis as well as spell checking and other useful text processing. The install will look familiar.
sudo pip install -U textblob
sudo python -m textblob.download_corpora
JVM
Natural Language Processing for JVM languages (NLP4J) is one option; I have not tried it yet.
Apache OpenNLP
This one is very widely used and is an Apache project, which makes the licensing ideal for most users. I have a long example of this in this article on Apache OpenNLP.
Pre-built training models for entity recognition in Apache OpenNLP:
http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin
http://opennlp.sourceforge.net/models-1.5/en-ner-organization.bin
http://opennlp.sourceforge.net/models-1.5/en-ner-money.bin
http://opennlp.sourceforge.net/models-1.5/en-ner-date.bin
StanfordNLP
I love StanfordNLP; it works very well, integrates into a Twitter processing flow and is very accurate. The only issue for many is that it is GPL-licensed and, for many use cases, will require purchasing a license. It is very easy to use Stanford CoreNLP from Java, Scala and Spark.
import java.util.Properties
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._
import com.vader.SentimentAnalyzer
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql._
import scala.collection.JavaConversions._
import scala.collection.mutable.ListBuffer
case class Tweet(coordinates: String, geo:String, handle: String, hashtags: String, language: String,
location: String, msg: String, time: String, tweet_id: String, unixtime: String, user_name: String, tag: String, profile_image_url: String,
source: String, place: String, friends_count: String, followers_count: String, retweet_count: String,
time_zone: String, sentiment: String, stanfordSentiment: String)
val message = convert(anyMessage)
val pipeline = new StanfordCoreNLP(nlpProps)
val annotation = pipeline.process(message)
var sentiments: ListBuffer[Double] = ListBuffer()
var sizes: ListBuffer[Int] = ListBuffer()
var longest = 0
var mainSentiment = 0
for (sentence <- annotation.get(classOf[CoreAnnotations.SentencesAnnotation])) {
val tree = sentence.get(classOf[SentimentCoreAnnotations.AnnotatedTree])
val sentiment = RNNCoreAnnotations.getPredictedClass(tree)
val partText = sentence.toString
if (partText.length() > longest) {
mainSentiment = sentiment
longest = partText.length()
}
sentiments += sentiment.toDouble
sizes += partText.length
}
val averageSentiment:Double = {
if(sentiments.nonEmpty) sentiments.sum / sentiments.size
else -1
}
val weightedSentiments = (sentiments, sizes).zipped.map((sentiment, size) => sentiment * size)
var weightedSentiment = weightedSentiments.sum / (sizes.fold(0)(_ + _))
if(sentiments.isEmpty) {
mainSentiment = -1
weightedSentiment = -1
}
weightedSentiment match {
case s if s <= 0.0 => NOT_UNDERSTOOD
case s if s < 1.0 => VERY_NEGATIVE
case s if s < 2.0 => NEGATIVE
case s if s < 3.0 => NEUTRAL
case s if s < 4.0 => POSITIVE
case s if s < 5.0 => VERY_POSITIVE
case s if s > 5.0 => NOT_UNDERSTOOD
}
trait SENTIMENT_TYPE
case object VERY_NEGATIVE extends SENTIMENT_TYPE
case object NEGATIVE extends SENTIMENT_TYPE
case object NEUTRAL extends SENTIMENT_TYPE
case object POSITIVE extends SENTIMENT_TYPE
case object VERY_POSITIVE extends SENTIMENT_TYPE
case object NOT_UNDERSTOOD extends SENTIMENT_TYPE
Summary
Do you have to use just one of these libraries? Of course not; I use different ones depending on my needs. Licensing, performance, accuracy on your dataset, programming language choice, enterprise environment, volume of data, your corpus, the human language involved and many other factors come into play. One size does not fit all. If you have sophisticated data scientists and strong machine learning pipelines, you may want to pick one and build up your own custom models and corpus. This will work with Hortonworks HDP 2.3 - HDP 2.6 and HDF 1.0 - 3.x.
References:
https://shirishkadam.com/2016/10/06/setting-up-natural-language-processing-environment-with-python/
https://explosion.ai/blog/spacy-deep-learning-keras
https://github.com/explosion/spaCy/blob/master/examples/deep_learning_keras.py
https://github.com/explosion/spaCy/tree/master/examples/sentiment
https://spacy.io/docs/usage/entity-recognition
https://spacy.io/
http://textminingonline.com/how-to-use-stanford-named-entity-recognizer-ner-in-python-nltk-and-other-programming-languages
https://www.quora.com/How-does-Googles-open-source-natural-language-parser-SyntaxNet-compare-with-spaCy-io-or-Stanfords-CoreNLP
https://github.com/explosion/spaCy/tree/master/examples/inventory_count
http://iamaaditya.github.io/2016/04/visual_question_answering_demo_notebook
https://github.com/iamaaditya/VQA_Demo
https://avisingh599.github.io/deeplearning/visual-qa/
https://github.com/avisingh599/visual-qa
https://github.com/chartbeat-labs/textacy
https://github.com/cytora/pycon-nlp-in-10-lines
https://nicschrading.com/project/Intro-to-NLP-with-spaCy/
http://textblob.readthedocs.io/en/dev/advanced_usage.html#sentiment-analyzers
https://github.com/sloria/TextBlob
https://textblob.readthedocs.io/en/latest/quickstart.html#quickstart
https://github.com/tensorflow/models/tree/master/syntaxnet#getting-started
https://explosion.ai/blog/syntaxnet-in-context
https://emorynlp.github.io/nlp4j/
https://github.com/emorynlp/nlp4j
https://github.com/robhinds/opennlp-ingredient-finder
https://aiaioo.wordpress.com/2016/01/13/naive-bayes-classifier-in-opennlp/
https://community.hortonworks.com/articles/36884/using-parsey-mcparseface-google-tensorflow-syntaxn.html
https://community.hortonworks.com/questions/2728/what-is-recommended-nlp-solution-on-top-of-hdp-sta.html
https://community.hortonworks.com/content/kbentry/52415/processing-social-media-feeds-in-stream-with-apach.html
https://community.hortonworks.com/articles/49465/how-to-add-sentiment-analytics-to-twitterapache-ni.html
http://www.nltk.org/api/nltk.sentiment.html
https://dzone.com/articles/in-progress-natural-language-processing
http://brnrd.me/social-sentiment-sentiment-analysis/
http://www.nltk.org/howto/twitter.html
https://marcobonzanini.com/2015/03/09/mining-twitter-data-with-python-part-2/
https://github.com/nltk/nltk/wiki/Sentiment-Analysis
https://community.hortonworks.com/content/kbentry/49465/how-to-add-sentiment-analytics-to-twitterapache-ni.html
https://community.hortonworks.com/articles/35568/python-script-in-nifi.html
https://community.hortonworks.com/articles/76240/using-opennlp-for-identifying-names-from-text.html
http://stanfordnlp.github.io/CoreNLP/
http://nlp.stanford.edu/software/CRF-NER.shtml
01-08-2017
11:24 AM
4 Kudos
I have created a simple Java 8 application to extract text from PDFs and then identify people's names. It can be used as part of a larger data processing pipeline or HDF flow by calling it via REST or the command line, or by converting it into a NiFi processor. Maven pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.dataflowdeveloper</groupId>
<artifactId>categorizer</artifactId>
<packaging>jar</packaging>
<version>1.0</version>
<name>categorizer</name>
<url>http://maven.apache.org</url>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>1.7.7</version>
</dependency>
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.7.0</version>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.8.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-core -->
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.14</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.14</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-langdetect</artifactId>
<version>1.14</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>3.5.0</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.3</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.3</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.3</version>
</dependency>
</dependencies>
</project>
Java Application
package com.dataflowdeveloper;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import com.google.gson.Gson;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.Span;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class App {
public static void main(String args[]) {
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = null;
try {
inputstream = new FileInputStream(new File(System.getProperty("user.dir") + "/testdocs/opennlp.pdf"));
} catch (FileNotFoundException e1) {
e1.printStackTrace();
}
ParseContext pcontext = new ParseContext();
// parsing the document using PDF parser
PDFParser pdfparser = new PDFParser();
try {
pdfparser.parse(inputstream, handler, metadata, pcontext);
} catch (SAXException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (TikaException e) {
e.printStackTrace();
}
NameFinder nameFinder = new NameFinder();
System.out.println(nameFinder.getPeople(handler.toString()));
}
}
package com.dataflowdeveloper;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import com.google.gson.Gson;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.Span;
/**
 * Uses Apache OpenNLP tokenizer and person-name-finder models to extract
 * people's names from a sentence and return them as JSON.
 */
public class NameFinder {
private static final String CURRENT_DIR = System.getProperty("user.dir");
private static final String OLD_FILE = "/Volumes/Transcend/projects/categorizer/input/en-ner-person.bin";
private static final String CURRENT_FILE = CURRENT_DIR + "/input/en-ner-person.bin";
private static final String CURRENT_TOKEN_FILE = CURRENT_DIR + "/input/en-token.bin";
/**
* sentence to people
* @param sentence
* @return JSON
*/
public String getPeople(String sentence) {
//
String outputJSON = "";
TokenNameFinderModel model = null;
InputStream tokenStream = null;
Tokenizer tokenizer = null;
try {
tokenStream = new FileInputStream( new File(CURRENT_TOKEN_FILE));
model = new TokenNameFinderModel(
new File(CURRENT_FILE));
TokenizerModel tokenModel = new TokenizerModel(tokenStream);
tokenizer = new TokenizerME(tokenModel);
} catch (InvalidFormatException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
// Create a NameFinder using the model
NameFinderME finder = new NameFinderME(model);
// Split the sentence into tokens
String[] tokens = tokenizer.tokenize(sentence);
// Find the names in the tokens and return Span objects
Span[] nameSpans = finder.find(tokens);
List<PersonName> people = new ArrayList<PersonName>();
String[] spanns = Span.spansToStrings(nameSpans, tokens);
for (int i = 0; i < spanns.length; i++) {
people.add(new PersonName(spanns[i]));
}
outputJSON = new Gson().toJson(people);
finder.clearAdaptiveData();
return "{\"names\":" + outputJSON + "}";
}
}
Process
1. We open the file stream for reading (this can be from HDFS, S3 or a regular file system).
2. We use Apache Tika's PDF parser to parse out the text. We also get the metadata for other processing.
3. Using OpenNLP, we parse out all the names from that text.
4. Using Google Gson, we then turn the names into JSON for easy usage.
References
https://raw.githubusercontent.com/apache/tika/master/tika-example/pom.xml
https://github.com/apache/tika/tree/master/tika-example