1973 Posts
1225 Kudos Received
124 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 1927 | 04-03-2024 06:39 AM |
|  | 3018 | 01-12-2024 08:19 AM |
|  | 1655 | 12-07-2023 01:49 PM |
|  | 2425 | 08-02-2023 07:30 AM |
|  | 3373 | 03-29-2023 01:22 PM |
01-11-2017
07:24 PM
Are you running Apache Atlas?
01-11-2017
06:49 PM
https://www.mail-archive.com/commits@ambari.apache.org/msg30743.html indicates that name was removed from the UI in Ambari.
01-11-2017
06:38 PM
HiveServer2 Interactive Host is set to localhost. There's no edit button, and I don't see that field anywhere else. How do I change that in Ambari?
Labels:
Apache Hive
01-11-2017
04:25 PM
ghost.xml NiFi Template
01-11-2017
08:37 AM
4 Kudos
There are a number of command-line tools that I like to use from NiFi as part of a big data flow. The first tool I wanted to use was Ghostscript. On CentOS/RHEL, you can install it via:
yum install ghostscript
I use Ghostscript to extract text content from PDFs (the file can be passed in from an existing flow using ExecuteStreamCommand). It then writes the text from those files to standard output.
run.sh
gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -dFirstPage=1 -dLastPage=500 -sOutputFile=- $@
Output from the Hadoop documentation:
GPL Ghostscript 9.07 (2013-02-14)
Copyright (C) 2012 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 8.
Page 1
Can't find (or can't open) font file NimbusSanL-ReguItal.
Querying operating system for font files...
Loading NimbusSanL-ReguItal font from /usr/share/fonts/default/Type1/n019023l.pfb... 3984660 2473586 2498328 1197484 3 done.
Loading NimbusSanL-Bold font from /usr/share/fonts/default/Type1/n019004l.pfb... 4025652 2573865 2498328 1199061 3 done.
Loading NimbusRomNo9L-Regu font from /usr/share/fonts/default/Type1/n021003l.pfb... 4072164 2714750 2518512 1214631 3 done.
Welcome to Apache™ Hadoop®!
Table of contents
1 What Is Apache Hadoop?.................................................................................................. 2
2 Getting Started .................................................................................................................. 3
3 Download Hadoop..............................................................................................................3
4 Who Uses Hadoop?............................................................................................................3
5 News................................................................................................................................... 3
Copyright © 2014 The Apache Software Foundation. All rights reserved.
Page 2
Loading NimbusSanL-Regu font from /usr/share/fonts/default/Type1/n019003l.pfb... 4243344 2902744 2478144 1170708 3 done.
Loading NimbusRomNo9L-Medi font from /usr/share/fonts/default/Type1/n021004l.pfb... 4410848 3063836 2518512 1208517 3 done.
Welcome to Apache™ Hadoop®!
1 What Is Apache Hadoop?
The Apache™ Hadoop® project develops open-source software for reliable, scalable,
distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the application layer, so delivering
a highly-available service on top of a cluster of computers, each of which may be prone to
failures.
The project includes these modules:
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides
high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Other Hadoop-related projects at Apache include:
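For quick testing outside of NiFi, the same gs invocation from run.sh can be driven from a short Python script. This is only a sketch of the idea; the input filename hadoop-docs.pdf is a hypothetical stand-in for whatever PDF your flow passes in.
import subprocess

# Same Ghostscript flags as run.sh: batch mode, text extraction device,
# pages 1-500, extracted text written to standard output.
cmd = ["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=txtwrite",
       "-dFirstPage=1", "-dLastPage=500", "-sOutputFile=-",
       "hadoop-docs.pdf"]  # hypothetical input PDF

text = subprocess.check_output(cmd).decode("utf-8", errors="replace")
print(text[:500])  # preview the start of the extracted text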
You will probably want to clean this up a little and remove some of the formatting; this can be done in NiFi, or later in Hive or Phoenix. Or you could send it as a message through Kafka and process it with Apache Storm, Apache Spark or other streaming tools.
For fans of old UNIX, everyone loved those fortunes. They are still available for install on CentOS:
yum install fortune-mod.x86_64
Below are the results of a flow calling fortune. It requires no parameters, so you just put fortune in the Command property of ExecuteStreamCommand. It writes its output to the console, which we extract into an attribute using (.*).+ and then convert to a JSON file for storage in HDFS.
Output JSON
{"fortune":"My little brother got this fortune"}
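Outside of NiFi, the ExtractText-plus-JSON step can be sketched in a few lines of Python. This is only an illustration of the same idea; it assumes fortune is installed, and the regex mirrors the (.*).+ pattern mentioned above.
import json
import re
import subprocess

# Run fortune (it needs no parameters) and capture its console output.
raw = subprocess.check_output(["fortune"]).decode("utf-8")

# Mimic the ExtractText step: pull the whole output into one value with a regex.
match = re.search(r"(.*).+", raw, re.DOTALL)
fortune_text = match.group(0).strip() if match else raw.strip()

# Mimic the attribute-to-JSON conversion before landing the file in HDFS.
print(json.dumps({"fortune": fortune_text}))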
Reference:
https://ghostscript.com/doc/current/Use.htm#Pipes
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ExecuteStreamCommand/
01-11-2017
01:04 AM
https://community.hortonworks.com/articles/73833/an-example-websocket-application-in-apache-nifi-11.html If clients are connected to the WebSocket, you can send them information from the flow. You could also package that information into Kafka or JMS and send it to a queue, and a backend will pull from the queue and send the WebSocket messages. That is probably the more common way to connect to the front-end web application.
01-11-2017
12:07 AM
Spark has lazy execution, so show() is where it actually tries to connect. Can you access that database from the sandbox command line? Check for errors, check Postgres permissions, and check the sandbox port mapping / firewalls.
01-10-2017
04:50 PM
https://help.sumologic.com/Send_Data/Sources/02Sources_for_Hosted_Collectors/HTTP_Source
01-08-2017
03:30 PM
2 Kudos
Introduction
For NLP, I mostly want to do two things:
- Entity recognition (people, facilities, organizations, locations, products, events, works of art, languages, groups, dates, times, percents, money, quantities, ordinals and cardinals)
- Sentiment analysis: basically, what is it and why don't people like it?
These two features are very useful as part of real-time streaming processing of social, email, log and semistructured document data. I can use both of them in Twitter ingest via Apache NiFi or Apache Spark. Don't confuse text entity recognition with the image recognition we looked at previously with TensorFlow. You can certainly add that to your flow as well, but that works with images, not text. My debate with sentiment analysis is whether to report numbers, really general terms like Neutral, Negative or Positive, or something more detailed like Stanford CoreNLP, which has multiple grades of each.
There are a lot of libraries available for NLP and sentiment analysis. The first decision is whether you want to run JVM programs (good for Hadoop MapReduce, Apache Spark, Apache Storm, enterprise applications, Spring applications, microservices, NiFi processors, Hive UDFs and Pig UDFs, with multiple programming language support: Java, Scala, ...) or run on Python, which is already well known by many data scientists and engineers, is simple to prototype with (no compiling), is very easy to call from NiFi and scripts, and also has a ton of great deep learning libraries and interfaces.
Python Libraries
Like most things in Python, you can use pip to install them. You will need a Python 2.7 or 3.x environment set up with pip to install and use the libraries I have looked at. spaCy requires NumPy, and so do many of the others.
spaCy
pip install -U spacy
python -m spacy.en.download all
Downloading parsing model
Downloading...
Downloaded 532.28MB 100.00% 9.59MB/s eta 0s
archive.gz checksum/md5 OK
Model successfully installed to /usr/lib64/python2.7/site-packages/spacy/data
Downloading GloVe vectors
Downloading...
Downloaded 708.08MB 100.00% 19.38MB/s eta 0s
archive.gz checksum/md5 OK
Model successfully installed to /usr/lib64/python2.7/site-packages/spacy/data
After you install, you need to download the text and models used by the tool.
import spacy
nlp = spacy.load('en')
doc5 = nlp(u"Timothy Spann is studying at Princeton University in New Jersey.")
# Named Entity Recognizer (NER)
for ent in doc5.ents:
    print ent, ent.label, ent.label_
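Since part of the appeal here is how easy these libraries are to call from NiFi and scripts, here is a minimal, hypothetical sketch that wraps the same NER loop as a stdin-to-JSON script (Python 2.7 and the spaCy 'en' model from above; the JSON field names are my own illustration, not part of spaCy).
import json
import sys

import spacy

nlp = spacy.load('en')  # the English model downloaded above

# Read the text to analyze from standard input (e.g. fed by ExecuteStreamCommand), Python 2.7 style.
doc = nlp(sys.stdin.read().decode('utf-8'))

# Emit each recognized entity and its label as JSON.
entities = [{"text": ent.text, "label": ent.label_} for ent in doc.ents]
print(json.dumps({"entities": entities}))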
NLTK
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import sys
sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(sys.argv[1])
if ss['compound'] == 0.00:
    print('Neutral')
elif ss['compound'] < 0.00:
    print('Negative')
else:
    print('Positive')
Another NLTK Option
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import sys
sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(sys.argv[1])
print('Compound {0} Negative {1} Neutral {2} Positive {3} '.format(ss['compound'],ss['neg'],ss['neu'],ss['pos']))
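Both NLTK snippets assume the VADER lexicon is already on the machine; if it is not, NLTK will complain about the missing resource, and a one-time download (assuming internet access) fixes that:
import nltk

# One-time download of the lexicon used by SentimentIntensityAnalyzer.
nltk.download('vader_lexicon')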
NLTK does sentiment analysis very easily, as shown above. It runs fairly quickly, so you can call this in a stream without too much overhead.
TextBlob
from textblob import TextBlob
b = TextBlob("Spellin iz vaerry haerd to do. I do not like this spelling product at all it is terrible and I am very mad.")
print(b.correct())
print(b.sentiment)
print(b.sentiment.polarity)
python tb.py
Spelling in very heard to do. I do not like this spelling product at all it is terrible and I am very mad.
Sentiment(polarity=-0.90625, subjectivity=1.0)
-0.90625
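TextBlob's polarity is a number between -1.0 and 1.0, so if you prefer the coarse Neutral/Negative/Positive labels from the NLTK example, a small mapping works. The thresholds below are my own choice, not something defined by TextBlob itself.
from textblob import TextBlob

def coarse_sentiment(text):
    # Map TextBlob's numeric polarity onto coarse labels; exactly 0.0 is treated as Neutral.
    polarity = TextBlob(text).sentiment.polarity
    if polarity == 0.0:
        return 'Neutral'
    elif polarity < 0.0:
        return 'Negative'
    return 'Positive'

print(coarse_sentiment("I do not like this spelling product at all."))  # likely 'Negative'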
TextBlob is a nice library that does sentiment analysis as well as spell checking and other useful text processing. The install will look familiar.
sudo pip install -U textblob
sudo python -m textblob.download_corpora
JVM
Natural Language Processing for JVM languages (NLP4J) is one option; I have not tried it yet.
Apache OpenNLP
This one is very widely used and is an Apache project, which makes the licensing ideal for most users. I have a long example of this in this article on Apache OpenNLP.
Pre-built training models for entity recognition in Apache OpenNLP:
http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin
http://opennlp.sourceforge.net/models-1.5/en-ner-organization.bin
http://opennlp.sourceforge.net/models-1.5/en-ner-money.bin
http://opennlp.sourceforge.net/models-1.5/en-ner-date.bin
StanfordNLP
I love StanfordNLP; it works very well, integrates into a Twitter processing flow and is very accurate. The only issue for many is that it is GPL-licensed and, for many use cases, will require purchasing a license. It is very easy to use Stanford CoreNLP from Java, Scala and Spark.
import java.util.Properties
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._
import com.vader.SentimentAnalyzer
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql._
import scala.collection.JavaConversions._
import scala.collection.mutable.ListBuffer
case class Tweet(coordinates: String, geo:String, handle: String, hashtags: String, language: String,
location: String, msg: String, time: String, tweet_id: String, unixtime: String, user_name: String, tag: String, profile_image_url: String,
source: String, place: String, friends_count: String, followers_count: String, retweet_count: String,
time_zone: String, sentiment: String, stanfordSentiment: String)
val message = convert(anyMessage)
val pipeline = new StanfordCoreNLP(nlpProps)
val annotation = pipeline.process(message)
var sentiments: ListBuffer[Double] = ListBuffer()
var sizes: ListBuffer[Int] = ListBuffer()
var longest = 0
var mainSentiment = 0
for (sentence <- annotation.get(classOf[CoreAnnotations.SentencesAnnotation])) {
val tree = sentence.get(classOf[SentimentCoreAnnotations.AnnotatedTree])
val sentiment = RNNCoreAnnotations.getPredictedClass(tree)
val partText = sentence.toString
if (partText.length() > longest) {
mainSentiment = sentiment
longest = partText.length()
}
sentiments += sentiment.toDouble
sizes += partText.length
}
val averageSentiment:Double = {
if(sentiments.nonEmpty) sentiments.sum / sentiments.size
else -1
}
val weightedSentiments = (sentiments, sizes).zipped.map((sentiment, size) => sentiment * size)
var weightedSentiment = weightedSentiments.sum / (sizes.fold(0)(_ + _))
if(sentiments.isEmpty) {
mainSentiment = -1
weightedSentiment = -1
}
weightedSentiment match {
case s if s <= 0.0 => NOT_UNDERSTOOD
case s if s < 1.0 => VERY_NEGATIVE
case s if s < 2.0 => NEGATIVE
case s if s < 3.0 => NEUTRAL
case s if s < 4.0 => POSITIVE
case s if s < 5.0 => VERY_POSITIVE
case s if s > 5.0 => NOT_UNDERSTOOD
}
trait SENTIMENT_TYPE
case object VERY_NEGATIVE extends SENTIMENT_TYPE
case object NEGATIVE extends SENTIMENT_TYPE
case object NEUTRAL extends SENTIMENT_TYPE
case object POSITIVE extends SENTIMENT_TYPE
case object VERY_POSITIVE extends SENTIMENT_TYPE
case object NOT_UNDERSTOOD extends SENTIMENT_TYPE
Summary
Do you have to use just one of these libraries? Of course not; I use different ones depending on my needs. Licensing, performance, accuracy on your dataset, programming language choice, enterprise environment, volume of data, your corpus, the human language involved and many other factors come into play. One size does not fit all. If you have sophisticated data scientists and strong machine learning pipelines, you may want to pick one and build up your own custom models and corpus. This will work with Hortonworks HDP 2.3 - HDP 2.6 and HDF 1.0 - 3.x.
References:
https://shirishkadam.com/2016/10/06/setting-up-natural-language-processing-environment-with-python/
https://explosion.ai/blog/spacy-deep-learning-keras
https://github.com/explosion/spaCy/blob/master/examples/deep_learning_keras.py
https://github.com/explosion/spaCy/tree/master/examples/sentiment
https://spacy.io/docs/usage/entity-recognition
https://spacy.io/
http://textminingonline.com/how-to-use-stanford-named-entity-recognizer-ner-in-python-nltk-and-other-programming-languages
https://www.quora.com/How-does-Googles-open-source-natural-language-parser-SyntaxNet-compare-with-spaCy-io-or-Stanfords-CoreNLP
https://github.com/explosion/spaCy/tree/master/examples/inventory_count
http://iamaaditya.github.io/2016/04/visual_question_answering_demo_notebook
https://github.com/iamaaditya/VQA_Demo
https://avisingh599.github.io/deeplearning/visual-qa/
https://github.com/avisingh599/visual-qa
https://github.com/chartbeat-labs/textacy
https://github.com/cytora/pycon-nlp-in-10-lines
https://nicschrading.com/project/Intro-to-NLP-with-spaCy/
http://textblob.readthedocs.io/en/dev/advanced_usage.html#sentiment-analyzers
https://github.com/sloria/TextBlob
https://textblob.readthedocs.io/en/latest/quickstart.html#quickstart
https://github.com/tensorflow/models/tree/master/syntaxnet#getting-started
https://explosion.ai/blog/syntaxnet-in-context
https://emorynlp.github.io/nlp4j/
https://github.com/emorynlp/nlp4j
https://github.com/robhinds/opennlp-ingredient-finder
https://aiaioo.wordpress.com/2016/01/13/naive-bayes-classifier-in-opennlp/
https://community.hortonworks.com/articles/36884/using-parsey-mcparseface-google-tensorflow-syntaxn.html
https://community.hortonworks.com/questions/2728/what-is-recommended-nlp-solution-on-top-of-hdp-sta.html
https://community.hortonworks.com/content/kbentry/52415/processing-social-media-feeds-in-stream-with-apach.html
https://community.hortonworks.com/articles/49465/how-to-add-sentiment-analytics-to-twitterapache-ni.html
http://www.nltk.org/api/nltk.sentiment.html
https://dzone.com/articles/in-progress-natural-language-processing
http://brnrd.me/social-sentiment-sentiment-analysis/
http://www.nltk.org/howto/twitter.html
https://marcobonzanini.com/2015/03/09/mining-twitter-data-with-python-part-2/
https://github.com/nltk/nltk/wiki/Sentiment-Analysis
https://community.hortonworks.com/content/kbentry/49465/how-to-add-sentiment-analytics-to-twitterapache-ni.html
https://community.hortonworks.com/articles/35568/python-script-in-nifi.html
https://community.hortonworks.com/articles/76240/using-opennlp-for-identifying-names-from-text.html
http://stanfordnlp.github.io/CoreNLP/
http://nlp.stanford.edu/software/CRF-NER.shtml
01-08-2017
11:24 AM
4 Kudos
I have created a simple Java 8 application to extract text from PDFs and then identify people's names. It can be used as part of a larger data processing pipeline or HDF flow by calling it via REST or the command line, or by converting it into a NiFi processor. Maven pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.dataflowdeveloper</groupId>
<artifactId>categorizer</artifactId>
<packaging>jar</packaging>
<version>1.0</version>
<name>categorizer</name>
<url>http://maven.apache.org</url>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>1.7.7</version>
</dependency>
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.7.0</version>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.8.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-core -->
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.14</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.14</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-langdetect</artifactId>
<version>1.14</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>3.5.0</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.3</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.3</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.3</version>
</dependency>
</dependencies>
</project>
Java Application
package com.dataflowdeveloper;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import com.google.gson.Gson;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.Span;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class App {
public static void main(String args[]) {
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = null;
try {
inputstream = new FileInputStream(new File(System.getProperty("user.dir") + "/testdocs/opennlp.pdf"));
} catch (FileNotFoundException e1) {
e1.printStackTrace();
}
ParseContext pcontext = new ParseContext();
// parsing the document using PDF parser
PDFParser pdfparser = new PDFParser();
try {
pdfparser.parse(inputstream, handler, metadata, pcontext);
} catch (SAXException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (TikaException e) {
e.printStackTrace();
}
NameFinder nameFinder = new NameFinder();
System.out.println(nameFinder.getPeople(handler.toString()));
}
}
package com.dataflowdeveloper;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import com.google.gson.Gson;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.Span;
/**
 * Uses Apache OpenNLP tokenizer and person-name-finder models to extract
 * people's names from a sentence and return them as JSON.
 */
public class NameFinder {
private static final String CURRENT_DIR = System.getProperty("user.dir");
private static final String OLD_FILE = "/Volumes/Transcend/projects/categorizer/input/en-ner-person.bin";
private static final String CURRENT_FILE = CURRENT_DIR + "/input/en-ner-person.bin";
private static final String CURRENT_TOKEN_FILE = CURRENT_DIR + "/input/en-token.bin";
/**
* sentence to people
* @param sentence
* @return JSON
*/
public String getPeople(String sentence) {
//
String outputJSON = "";
TokenNameFinderModel model = null;
InputStream tokenStream = null;
Tokenizer tokenizer = null;
try {
tokenStream = new FileInputStream( new File(CURRENT_TOKEN_FILE));
model = new TokenNameFinderModel(
new File(CURRENT_FILE));
TokenizerModel tokenModel = new TokenizerModel(tokenStream);
tokenizer = new TokenizerME(tokenModel);
} catch (InvalidFormatException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
// Create a NameFinder using the model
NameFinderME finder = new NameFinderME(model);
// Split the sentence into tokens
String[] tokens = tokenizer.tokenize(sentence);
// Find the names in the tokens and return Span objects
Span[] nameSpans = finder.find(tokens);
List<PersonName> people = new ArrayList<PersonName>();
String[] spanns = Span.spansToStrings(nameSpans, tokens);
for (int i = 0; i < spanns.length; i++) {
people.add(new PersonName(spanns[i]));
}
outputJSON = new Gson().toJson(people);
finder.clearAdaptiveData();
return "{\"names\":" + outputJSON + "}";
}
}
Process
1. We open the file stream for reading (this can be from HDFS, S3 or a regular file system).
2. We use Apache Tika's PDF parser to parse out the text. We also get the metadata for other processing.
3. Using OpenNLP, we parse out all the names from that text.
4. Using Google Gson, we then turn the names into JSON for easy usage.
References
https://raw.githubusercontent.com/apache/tika/master/tika-example/pom.xml
https://github.com/apache/tika/tree/master/tika-example