01-12-2017
09:01 AM
Protect Your Cloud Big Data Assets

Step 1: Do not put anything into the cloud unless you have: a CISO, a Chief Security Architect, a certified cloud administrator, a full understanding of your PII and private data, a lawyer to defend you against the coming lawsuits, a full understanding of Hadoop, Hadoop-certified administrators, a Hadoop premier support contract, a security plan, and a full understanding of your Hadoop architecture and layout.
Step 2: Study all running services in Ambari.
Step 3: Confirm and check all of your TCP/IP ports. Hadoop has a lot of them!
Step 4: If you are not using a service, do not run it.
Step 5: By default, deny all access to everything, always. Open a port only when a critical system or person genuinely needs it.
Step 6: SSL, SSH, VPN, and encryption everywhere.
Step 7: Run Knox, and set it up correctly.
Step 8: Run Kali and audit all your IPs and ports.
Step 9: Use Kali's hacking tools to attempt to access all your web ports, shells, and other access points.
Step 10: Run in a VPC.
Step 11: Set up security groups. Never open a group to 0.0.0.0/0, all ports, or all IPs!
Step 12: If this seems too hard, don't run in the cloud.
Step 14: Step 13 is unlucky; skip that one.
Step 15: Read all the recommended security documentation and use it.
Step 16: Kerberize everything.
Step 17: Run Metron.

My recommendation: get a professional services contract with an experienced Hadoop organization, or use a managed offering such as Microsoft HDInsight or HDC.

TCP/IP Ports
50070: NameNode Web UI
50470: NameNode HTTPS Web UI
8020, 8022, 9000: NameNode via HDFS
50075: DataNode Web UI
50475: DataNode HTTPS Web UI
50090: Secondary NameNode
60000: HBase Master
8080: HBase REST
9090: Thrift Server
50111: WebHCat
8005: Sqoop2
2181: ZooKeeper
9010: ZooKeeper JMX

Other ports in use across HDP services: 50020, 50010, 50030, 8021, 50060, 51111, 9083, 10000, 60010, 60020, 60030, 2888, 3888, 8660, 8661, 8662, 8663, 8651, 3306, 80, 8085, 1004, 1006, 8485, 8480, 2049, 4242, 14000, 14001, 9290, 8032, 8030, 8031, 8033, 8088, 8040, 8042, 8041, 10020, 13562, 19888, 9095, 16000, 12000, 12001, 3181, 4181, 8019, 8888, 11000, 11001, 7077, 7078, 18080, 18081, 50100. There are more of these if you are also running your own visualization tools, other data websites, other tools, Oracle, SQL Server, mail, NiFi, Druid, etc.

Reference
http://www.slideshare.net/bunkertor/hadoop-security-54483815
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_installing_manually_book/content/set_up_validate_knox_gateway_installation.html
https://aws.amazon.com/articles/1233/
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html
https://www.quora.com/What-are-the-best-practices-in-hardening-Amazon-EC2-instance
https://stratumsecurity.com/2012/12/03/practical-tactical-cloud-security-ec2/
http://hortonworks.com/solutions/security-and-governance/
http://metron.incubator.apache.org/
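The port audit in Steps 3 and 8 can be sketched with a plain TCP connect scan, no Kali required. This is a minimal sketch, not the author's tooling; `HADOOP_PORTS` below is a small illustrative subset of the ports listed above, and `check_port`/`audit` are hypothetical helper names.

```python
import json
import socket

# Illustrative subset of the Hadoop ports listed above.
HADOOP_PORTS = {
    50070: "NameNode Web UI",
    50470: "NameNode HTTPS Web UI",
    8020: "NameNode via HDFS",
    50075: "DataNode Web UI",
    2181: "ZooKeeper",
    60000: "HBase Master",
}

def check_port(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def audit(host, ports=HADOOP_PORTS):
    """Probe each port and report {port: {"service": name, "open": bool}}."""
    return {
        port: {"service": name, "open": check_port(host, port)}
        for port, name in ports.items()
    }

if __name__ == "__main__":
    print(json.dumps(audit("127.0.0.1"), indent=2))
```

Anything the audit reports as open that you cannot explain should be closed in your security groups, per Steps 5 and 11.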
01-11-2017
04:25 PM
ghost.xml NIFI Template
01-11-2017
08:37 AM
4 Kudos
There are a number of command-line tools that I like to use from NiFi as part of a big data flow. The first tool I wanted to use was Ghostscript. On CentOS/RHEL, you can install it via:

yum install ghostscript

I use Ghostscript to extract the text content from PDFs (the PDF can be passed in from an existing flow using ExecuteStreamCommand). It writes the extracted text to standard output. run.sh:
gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -dFirstPage=1 -dLastPage=500 -sOutputFile=- $@

Output from the Hadoop documentation:

GPL Ghostscript 9.07 (2013-02-14)
Copyright (C) 2012 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 8.
Page 1
Can't find (or can't open) font file NimbusSanL-ReguItal.
Querying operating system for font files...
Loading NimbusSanL-ReguItal font from /usr/share/fonts/default/Type1/n019023l.pfb... 3984660 2473586 2498328 1197484 3 done.
Loading NimbusSanL-Bold font from /usr/share/fonts/default/Type1/n019004l.pfb... 4025652 2573865 2498328 1199061 3 done.
Loading NimbusRomNo9L-Regu font from /usr/share/fonts/default/Type1/n021003l.pfb... 4072164 2714750 2518512 1214631 3 done.
Welcome to Apache™ Hadoop®!
Table of contents
1 What Is Apache Hadoop?.................................................................................................. 2
2 Getting Started .................................................................................................................. 3
3 Download Hadoop..............................................................................................................3
4 Who Uses Hadoop?............................................................................................................3
5 News................................................................................................................................... 3
Copyright © 2014 The Apache Software Foundation. All rights reserved.
Page 2
Loading NimbusSanL-Regu font from /usr/share/fonts/default/Type1/n019003l.pfb... 4243344 2902744 2478144 1170708 3 done.
Loading NimbusRomNo9L-Medi font from /usr/share/fonts/default/Type1/n021004l.pfb... 4410848 3063836 2518512 1208517 3 done.
Welcome to Apache™ Hadoop®!
1 What Is Apache Hadoop?
The Apache™ Hadoop® project develops open-source software for reliable, scalable,
distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the application layer, so delivering
a highly-available service on top of a cluster of computers, each of which may be prone to
failures.
The project includes these modules:
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides
high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Other Hadoop-related projects at Apache include:
You will probably want to clean this up a little and remove some of the formatting. That can be done in NiFi, or later in Hive or Phoenix for further cleaning; or you could send it as a message through Kafka and process it with Apache Storm, Apache Spark, or other streaming tools.

For fans of old UNIX, everyone loved those fortunes. They are still available for install on CentOS:

yum install fortune-mod.x86_64

The flow that calls fortune requires no parameters; just put fortune in the Command parameter. It writes its output to the console, which we extract using the regular expression (.*).+ into an attribute and then convert to a JSON file for storage in HDFS. Output JSON:
{"fortune":"My little brother got this fortune"}

Reference:
https://ghostscript.com/doc/current/Use.htm#Pipes
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ExecuteStreamCommand/
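The capture-stdout-and-wrap-in-JSON step above can be sketched outside NiFi with a few lines of Python. This is a hypothetical stand-in for the ExecuteStreamCommand/attribute-extraction flow, and `echo` stands in for fortune, which may not be installed:

```python
import json
import subprocess

def command_output_to_json(cmd, key):
    """Run a command, capture its stdout (as ExecuteStreamCommand would),
    and wrap the trimmed text in a one-field JSON document."""
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return json.dumps({key: out.strip()})

# `echo` stands in for `fortune` here.
print(command_output_to_json(["echo", "My little brother got this fortune"], "fortune"))
# → {"fortune": "My little brother got this fortune"}
```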
01-10-2017
04:50 PM
https://help.sumologic.com/Send_Data/Sources/02Sources_for_Hosted_Collectors/HTTP_Source
01-08-2017
03:30 PM
2 Kudos
Introduction

For NLP, I mostly want to do two things:

1. Entity recognition (people, facilities, organizations, locations, products, events, works of art, languages, groups, dates, times, percents, money, quantities, ordinals, and cardinals).
2. Sentiment analysis: basically, what is it and why don't people like it?

These two features are very useful as part of real-time streaming processing of social, email, log, and semi-structured document data. I can use both of them in a Twitter ingest via Apache NiFi or Apache Spark. Don't confuse text entity recognition with the image recognition we looked at previously with TensorFlow. You can certainly add that to your flow as well, but it works with images, not text.

My debate with sentiment analysis is whether to return numbers, very general labels such as Neutral, Negative, or Positive, or something more detailed like Stanford CoreNLP, which has several grades of each.

There are a lot of libraries available for NLP and sentiment analysis. The first decision is whether to run JVM programs (good for Hadoop MapReduce, Apache Spark, Apache Storm, enterprise applications, Spring applications, microservices, NiFi processors, Hive UDFs, and Pig UDFs, with support for multiple programming languages such as Java and Scala), or to run on Python, which is already well known by many data scientists and engineers, is simple to prototype with (no compiling), is very easy to call from NiFi and scripts, and also has a ton of great deep learning libraries and interfaces.

Python Libraries

Like most things in Python, you can use pip to install them. You will need a Python 2.7 or 3.x environment with pip set up to install and use the libraries I have looked at. spaCy requires numpy, and so do many of the others.

spaCy

pip install -U spacy
python -m spacy.en.download all
Downloading parsing model
Downloading...
Downloaded 532.28MB 100.00% 9.59MB/s eta 0s
archive.gz checksum/md5 OK
Model successfully installed to /usr/lib64/python2.7/site-packages/spacy/data
Downloading GloVe vectors
Downloading...
Downloaded 708.08MB 100.00% 19.38MB/s eta 0s
archive.gz checksum/md5 OK
Model successfully installed to /usr/lib64/python2.7/site-packages/spacy/data
After you install, you need to download the text and models used by the tool, as shown above. Then:

import spacy
nlp = spacy.load('en')
doc5 = nlp(u"Timothy Spann is studying at Princeton University in New Jersey.")
# Named Entity Recognizer (NER)
for ent in doc5.ents:
    print(ent, ent.label, ent.label_)
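The entity categories listed in the introduction map onto spaCy's label strings from its OntoNotes-trained models. As a reference (label descriptions paraphrased; verify against your spaCy version), a small lookup helper:

```python
# spaCy ent.label_ values (OntoNotes scheme) and what they mean.
SPACY_ENTITY_LABELS = {
    "PERSON": "people, including fictional",
    "NORP": "nationalities, religious and political groups",
    "FAC": "facilities (buildings, airports, highways)",
    "ORG": "organizations (companies, agencies, institutions)",
    "GPE": "countries, cities, states",
    "LOC": "non-GPE locations",
    "PRODUCT": "products (objects, vehicles, foods)",
    "EVENT": "named events",
    "WORK_OF_ART": "titles of books, songs, art",
    "LANGUAGE": "named languages",
    "DATE": "dates",
    "TIME": "times",
    "PERCENT": "percentages",
    "MONEY": "monetary values",
    "QUANTITY": "measurements",
    "ORDINAL": "ordinals (first, second, ...)",
    "CARDINAL": "other numerals",
}

def describe(label):
    """Human-readable description for an ent.label_ value."""
    return SPACY_ENTITY_LABELS.get(label, "unknown label")
```

For the sentence above, spaCy would tag Timothy Spann as PERSON, Princeton University as ORG, and New Jersey as GPE.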
NLTK

from nltk.sentiment.vader import SentimentIntensityAnalyzer
import sys
sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(sys.argv[1])
if ss['compound'] == 0.00:
    print('Neutral')
elif ss['compound'] < 0.00:
    print('Negative')
else:
    print('Positive')
Another NLTK Option

from nltk.sentiment.vader import SentimentIntensityAnalyzer
import sys

sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(sys.argv[1])
print('Compound {0} Negative {1} Neutral {2} Positive {3} '.format(ss['compound'],ss['neg'],ss['neu'],ss['pos']))
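The threshold logic from the first snippet can be factored into a reusable pure function (no NLTK needed, so it adds no overhead to a stream); `label_compound` is a hypothetical helper name:

```python
def label_compound(compound):
    """Map a VADER compound score (-1.0 to 1.0) to a coarse label,
    using the same thresholds as the snippet above."""
    if compound == 0.0:
        return 'Neutral'
    if compound < 0.0:
        return 'Negative'
    return 'Positive'
```

This is the coarse three-way labeling discussed in the introduction; swap in finer-grained buckets if you need CoreNLP-style grades.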
NLTK does sentiment analysis very easily, as shown above. It runs fairly quickly, so you can call it in a stream without too much overhead.

TextBlob

from textblob import TextBlob
b = TextBlob("Spellin iz vaerry haerd to do. I do not like this spelling product at all it is terrible and I am very mad.")
print(b.correct())
print(b.sentiment)
print(b.sentiment.polarity)
python tb.py
Spelling in very heard to do. I do not like this spelling product at all it is terrible and I am very mad.
Sentiment(polarity=-0.90625, subjectivity=1.0)
-0.90625
TextBlob is a nice library that does sentiment analysis as well as spell checking and other useful text processing. The install will look familiar:

sudo pip install -U textblob
sudo python -m textblob.download_corpora
JVM

Natural Language Processing for JVM languages (NLP4J) is one option; I have not tried it yet.

Apache OpenNLP

This one is very widely used and is an Apache project, which makes the licensing ideal for most users. I have a long example in my article on Apache OpenNLP.

Pre-built training models for entity recognition in Apache OpenNLP:
http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin
http://opennlp.sourceforge.net/models-1.5/en-ner-organization.bin
http://opennlp.sourceforge.net/models-1.5/en-ner-money.bin
http://opennlp.sourceforge.net/models-1.5/en-ner-date.bin

StanfordNLP

I love Stanford CoreNLP: it works very well, integrates into a Twitter processing flow, and is very accurate. The only issue for many is that it is GPL-licensed, which for many use cases will require purchasing a commercial license. It is very easy to use Stanford CoreNLP with Java, Scala, and Spark.

import java.util.Properties
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._
import com.vader.SentimentAnalyzer
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql._
import scala.collection.JavaConversions._
import scala.collection.mutable.ListBuffer
case class Tweet(coordinates: String, geo: String, handle: String, hashtags: String, language: String,
  location: String, msg: String, time: String, tweet_id: String, unixtime: String, user_name: String, tag: String, profile_image_url: String,
  source: String, place: String, friends_count: String, followers_count: String, retweet_count: String,
  time_zone: String, sentiment: String, stanfordSentiment: String)

val message = convert(anyMessage)
val pipeline = new StanfordCoreNLP(nlpProps)
val annotation = pipeline.process(message)
var sentiments: ListBuffer[Double] = ListBuffer()
var sizes: ListBuffer[Int] = ListBuffer()
var longest = 0
var mainSentiment = 0
for (sentence <- annotation.get(classOf[CoreAnnotations.SentencesAnnotation])) {
  val tree = sentence.get(classOf[SentimentCoreAnnotations.AnnotatedTree])
  val sentiment = RNNCoreAnnotations.getPredictedClass(tree)
  val partText = sentence.toString
  if (partText.length() > longest) {
    mainSentiment = sentiment
    longest = partText.length()
  }
  sentiments += sentiment.toDouble
  sizes += partText.length
}
val averageSentiment: Double = {
  if (sentiments.nonEmpty) sentiments.sum / sentiments.size
  else -1
}
val weightedSentiments = (sentiments, sizes).zipped.map((sentiment, size) => sentiment * size)
var weightedSentiment = weightedSentiments.sum / (sizes.fold(0)(_ + _))
if (sentiments.isEmpty) {
  mainSentiment = -1
  weightedSentiment = -1
}
weightedSentiment match {
  case s if s <= 0.0 => NOT_UNDERSTOOD
  case s if s < 1.0 => VERY_NEGATIVE
  case s if s < 2.0 => NEGATIVE
  case s if s < 3.0 => NEUTRAL
  case s if s < 4.0 => POSITIVE
  case s if s < 5.0 => VERY_POSITIVE
  case s if s > 5.0 => NOT_UNDERSTOOD
}
} // end of the enclosing method (fragment from a larger flow)
trait SENTIMENT_TYPE
case object VERY_NEGATIVE extends SENTIMENT_TYPE
case object NEGATIVE extends SENTIMENT_TYPE
case object NEUTRAL extends SENTIMENT_TYPE
case object POSITIVE extends SENTIMENT_TYPE
case object VERY_POSITIVE extends SENTIMENT_TYPE
case object NOT_UNDERSTOOD extends SENTIMENT_TYPE
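The length-weighted averaging in the Scala fragment can be sketched in a few lines of Python to make the logic easier to check in isolation. This is a minimal sketch, assuming per-sentence sentiment classes like CoreNLP's; `weighted_sentiment` and `classify` are hypothetical helper names:

```python
def weighted_sentiment(sentiments, sizes):
    """Average per-sentence sentiment classes, weighting each by
    sentence length; -1.0 signals 'no sentences', as above."""
    if not sentiments:
        return -1.0
    return sum(s * n for s, n in zip(sentiments, sizes)) / sum(sizes)

def classify(score):
    """Map a weighted score onto the SENTIMENT_TYPE buckets above.
    (A score of exactly 5.0 is unmatched in the Scala; we fold it
    into VERY_POSITIVE here.)"""
    if score <= 0.0 or score > 5.0:
        return "NOT_UNDERSTOOD"
    if score < 1.0:
        return "VERY_NEGATIVE"
    if score < 2.0:
        return "NEGATIVE"
    if score < 3.0:
        return "NEUTRAL"
    if score < 4.0:
        return "POSITIVE"
    return "VERY_POSITIVE"
```

Weighting by sentence length keeps one long, strongly worded sentence from being drowned out by several short neutral ones.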
Summary

Do you have to use just one of these libraries? Of course not; I use different ones depending on my needs. Licensing, performance, accuracy on your dataset, programming language choice, enterprise environment, volume of data, your corpus, the human language involved, and many other factors come into play. One size does not fit all. If you have sophisticated data scientists and strong machine learning pipelines, you may want to pick one and build up your own custom models and corpus. This will work with Hortonworks HDP 2.3 - HDP 2.6 and HDF 1.0 - 3.x.

References:
https://shirishkadam.com/2016/10/06/setting-up-natural-language-processing-environment-with-python/
https://explosion.ai/blog/spacy-deep-learning-keras
https://github.com/explosion/spaCy/blob/master/examples/deep_learning_keras.py
https://github.com/explosion/spaCy/tree/master/examples/sentiment
https://spacy.io/docs/usage/entity-recognition
https://spacy.io/
http://textminingonline.com/how-to-use-stanford-named-entity-recognizer-ner-in-python-nltk-and-other-programming-languages
https://www.quora.com/How-does-Googles-open-source-natural-language-parser-SyntaxNet-compare-with-spaCy-io-or-Stanfords-CoreNLP
https://github.com/explosion/spaCy/tree/master/examples/inventory_count
http://iamaaditya.github.io/2016/04/visual_question_answering_demo_notebook
https://github.com/iamaaditya/VQA_Demo
https://avisingh599.github.io/deeplearning/visual-qa/
https://github.com/avisingh599/visual-qa
https://github.com/chartbeat-labs/textacy
https://github.com/cytora/pycon-nlp-in-10-lines
https://nicschrading.com/project/Intro-to-NLP-with-spaCy/
http://textblob.readthedocs.io/en/dev/advanced_usage.html#sentiment-analyzers
https://github.com/sloria/TextBlob
https://textblob.readthedocs.io/en/latest/quickstart.html#quickstart
https://github.com/tensorflow/models/tree/master/syntaxnet#getting-started
https://explosion.ai/blog/syntaxnet-in-context
https://emorynlp.github.io/nlp4j/
https://github.com/emorynlp/nlp4j
https://github.com/robhinds/opennlp-ingredient-finder
https://aiaioo.wordpress.com/2016/01/13/naive-bayes-classifier-in-opennlp/
https://community.hortonworks.com/articles/36884/using-parsey-mcparseface-google-tensorflow-syntaxn.html
https://community.hortonworks.com/questions/2728/what-is-recommended-nlp-solution-on-top-of-hdp-sta.html
https://community.hortonworks.com/content/kbentry/52415/processing-social-media-feeds-in-stream-with-apach.html
https://community.hortonworks.com/articles/49465/how-to-add-sentiment-analytics-to-twitterapache-ni.html
http://www.nltk.org/api/nltk.sentiment.html
https://dzone.com/articles/in-progress-natural-language-processing
http://brnrd.me/social-sentiment-sentiment-analysis/
http://www.nltk.org/howto/twitter.html
https://marcobonzanini.com/2015/03/09/mining-twitter-data-with-python-part-2/
https://github.com/nltk/nltk/wiki/Sentiment-Analysis
https://community.hortonworks.com/content/kbentry/49465/how-to-add-sentiment-analytics-to-twitterapache-ni.html
https://community.hortonworks.com/articles/35568/python-script-in-nifi.html
https://community.hortonworks.com/articles/76240/using-opennlp-for-identifying-names-from-text.html
http://stanfordnlp.github.io/CoreNLP/
http://nlp.stanford.edu/software/CRF-NER.shtml
01-08-2017
11:24 AM
4 Kudos
To extract text from PDFs and then identify people's names, I have created a simple Java 8 data application. It can be used as part of a larger data processing pipeline or HDF flow by calling it via REST or the command line, or by converting it into a NiFi processor.

Maven pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.dataflowdeveloper</groupId>
<artifactId>categorizer</artifactId>
<packaging>jar</packaging>
<version>1.0</version>
<name>categorizer</name>
<url>http://maven.apache.org</url>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>1.7.7</version>
</dependency>
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.7.0</version>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.8.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-core -->
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-core</artifactId>
<version>1.14</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-parsers</artifactId>
<version>1.14</version>
</dependency>
<dependency>
<groupId>org.apache.tika</groupId>
<artifactId>tika-langdetect</artifactId>
<version>1.14</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>3.5.0</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.5</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.3</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.3</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.7.3</version>
</dependency>
</dependencies>
</project>
Java Application

package com.dataflowdeveloper;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class App {
    public static void main(String[] args) {
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        FileInputStream inputstream = null;
        try {
            inputstream = new FileInputStream(new File(System.getProperty("user.dir") + "/testdocs/opennlp.pdf"));
        } catch (FileNotFoundException e1) {
            e1.printStackTrace();
        }
        ParseContext pcontext = new ParseContext();
        // parsing the document using the PDF parser
        PDFParser pdfparser = new PDFParser();
        try {
            pdfparser.parse(inputstream, handler, metadata, pcontext);
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (TikaException e) {
            e.printStackTrace();
        }
        NameFinder nameFinder = new NameFinder();
        System.out.println(nameFinder.getPeople(handler.toString()));
    }
}
package com.dataflowdeveloper;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import com.google.gson.Gson;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.Span;
/**
 * NameFinder extracts people's names from text using Apache OpenNLP
 * and returns them as JSON.
 */
public class NameFinder {
    private static final String CURRENT_DIR = System.getProperty("user.dir");
    private static final String CURRENT_FILE = CURRENT_DIR + "/input/en-ner-person.bin";
    private static final String CURRENT_TOKEN_FILE = CURRENT_DIR + "/input/en-token.bin";

    /**
     * Extract the people named in a sentence.
     * @param sentence input text
     * @return JSON
     */
    public String getPeople(String sentence) {
        String outputJSON = "";
        TokenNameFinderModel model = null;
        InputStream tokenStream = null;
        Tokenizer tokenizer = null;
        try {
            tokenStream = new FileInputStream(new File(CURRENT_TOKEN_FILE));
            model = new TokenNameFinderModel(new File(CURRENT_FILE));
            TokenizerModel tokenModel = new TokenizerModel(tokenStream);
            tokenizer = new TokenizerME(tokenModel);
        } catch (InvalidFormatException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // Create a NameFinder using the model
        NameFinderME finder = new NameFinderME(model);
        // Split the sentence into tokens
        String[] tokens = tokenizer.tokenize(sentence);
        // Find the names in the tokens and return Span objects
        Span[] nameSpans = finder.find(tokens);
        List<PersonName> people = new ArrayList<PersonName>();
        String[] spanns = Span.spansToStrings(nameSpans, tokens);
        for (int i = 0; i < spanns.length; i++) {
            people.add(new PersonName(spanns[i]));
        }
        outputJSON = new Gson().toJson(people);
        finder.clearAdaptiveData();
        return "{\"names\":" + outputJSON + "}";
    }
}

Process

1. We open the file stream for reading (this can be from HDFS, S3, or a regular file system).
2. We use Apache Tika's PDF parser to parse out the text. We also get the metadata for other processing.
3. Using OpenNLP, we parse all of the names out of that text.
4. Using Google Gson, we turn the names into JSON for easy usage.

References

https://raw.githubusercontent.com/apache/tika/master/tika-example/pom.xml
https://github.com/apache/tika/tree/master/tika-example
01-04-2017
05:01 PM
6 Kudos
My first caveat is that in my tests, the pre-trained models miss a lot of names. If this is for a production workload, I recommend training your own models using your own data: perhaps your full corporate directory, client lists, Salesforce data, LinkedIn, and social media. Include full names, first names, and any commonly used nicknames.

The current version of Apache OpenNLP is 1.7.0, and the pre-trained 1.5.0 models still work with it. There are pre-trained models for a few human languages; I chose English (http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin).

Walk-through:
1. Create a TokenNameFinderModel from the pre-built person model.
2. Tokenize the input sentence.
3. Find the identified people.
4. Convert them to a JSON array.

You can easily plug this into a custom NiFi processor, a microservice, a command-line tool, or a routine in a larger Apache Storm or Apache Spark pipeline.

Code (JavaBean)

public class PersonName {
    private String name = "";

    public PersonName(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }
}

Code (getPeople)

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import com.google.gson.Gson;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InvalidFormatException;
import opennlp.tools.util.Span;
public String getPeople(String sentence) {
    String outputJSON = "";
    TokenNameFinderModel model = null;
    try {
        model = new TokenNameFinderModel(new File("en-ner-person.bin"));
    } catch (InvalidFormatException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    NameFinderME finder = new NameFinderME(model);
    Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize(sentence);
    Span[] nameSpans = finder.find(tokens);
    List<PersonName> people = new ArrayList<PersonName>();
    String[] spanns = Span.spansToStrings(nameSpans, tokens);
    for (int i = 0; i < spanns.length; i++) {
        people.add(new PersonName(spanns[i]));
    }
    outputJSON = new Gson().toJson(people);
    finder.clearAdaptiveData();
    return "{\"names\":" + outputJSON + "}";
}
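The span-to-JSON step at the end of getPeople can be mirrored in a few lines of Python, which is handy for prototyping the output format before wiring it into a flow. This is a hypothetical mirror of the Java logic (`names_to_json` is not from the original code); it takes the strings that Span.spansToStrings() would return and builds the same shape of document:

```python
import json

def names_to_json(found_names):
    """Wrap a list of name strings in a {"names":[{"name":...}]} document,
    matching the structure the Java code concatenates by hand."""
    return json.dumps({"names": [{"name": n} for n in found_names]})

print(names_to_json(["Tim Spann", "Peter Smith"]))
# → {"names": [{"name": "Tim Spann"}, {"name": "Peter Smith"}]}
```

Using a JSON library rather than string concatenation also takes care of escaping quotes inside names.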
I used Eclipse for building and testing; you can build it with mvn package.

Maven

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.dataflowdeveloper</groupId>
<artifactId>categorizer</artifactId>
<packaging>jar</packaging>
<version>1.0</version>
<name>categorizer</name>
<url>http://maven.apache.org</url>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>1.7.7</version>
</dependency>
<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.7.0</version>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.8.0</version>
</dependency>
</dependencies>
</project>
Run

Input: Tim Spann is going to the store. Peter Smith is using Hortonworks Hive.
Output: {"names":[{"name":"Tim Spann"},{"name":"Peter Smith"}]}

Reference:
http://opennlp.apache.org/
http://opennlp.apache.org/documentation/1.7.0/manual/opennlp.html#tools.namefind
https://www.packtpub.com/books/content/finding-people-and-things
http://opennlp.sourceforge.net/models-1.5/
12-30-2016
06:12 PM
https://community.hortonworks.com/questions/62213/nifi-putsql-row-length-exception-for-phoenix-upser.html
12-28-2016
06:33 PM
5 Kudos
Create a Box.com Application

https://YourCompany.app.box.com/developers/services/

Get your client API key, client secret, and developer token; use server authentication with OAuth 2.0 + JWT; and add a public key from your developer machine and server. This takes a few steps, and you have to create a private/public key pair:

openssl genrsa -aes256 -out private_key.pem 2048
openssl rsa -pubout -in private_key.pem -out public_key.pem

Anatomy of a Box.com Directory

https://myenterprise.app.box.com/files/0/f/26783331215/NIFITEST

You need the number in that URL (26783331215 here) to access the directory; it is the Folder ID.

Box.com Java SDK

<dependency>
<groupId>com.box</groupId>
<artifactId>box-java-sdk</artifactId>
<version>2.1.1</version>
</dependency> Create a New Java Maven Application mvn archetype:generate -DgroupId=com.yourenterprise -DartifactId=boxapp -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false Java Code package com.dataflowdeveloper;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import com.box.sdk.BoxAPIConnection;
import com.box.sdk.BoxFile;
import com.box.sdk.BoxFolder;
import com.box.sdk.BoxItem;
import com.box.sdk.BoxUser;
public final class Main {
// developer token expires in an hour
private static final String DEVELOPER_TOKEN = "somelongtokenlasts1hour";
private static final int MAX_DEPTH = 1;
private Main() { }
public static void main(String[] args) {
Logger.getLogger("com.box.sdk").setLevel(Level.ALL);
BoxAPIConnection api = new BoxAPIConnection(DEVELOPER_TOKEN);
BoxUser.Info userInfo = BoxUser.getCurrentUser(api).getInfo();
System.out.format("Welcome, %s <%s>!\n\n", userInfo.getName(), userInfo.getLogin());
// the example code lists everything from your root folder down; that could be
// a lot (I have 75K files)
// BoxFolder rootFolder = BoxFolder.getRootFolder(api);
// listFolder(rootFolder, 0);
BoxFile file = null;
// this is the id of the folder, you can get this two ways from either the URL or
// looking at the output of the root crawl
BoxFolder folder = new BoxFolder(api, "15296958056");
for (BoxItem.Info itemInfo : folder) {
if (itemInfo instanceof BoxFile.Info) {
BoxFile.Info fileInfo = (BoxFile.Info) itemInfo;
// let's look at all the attributes; many may be null
System.out.println("File:" + fileInfo.getCreatedAt() + "," +
fileInfo.getDescription() + "," +
fileInfo.getExtension() + ",name=" +
fileInfo.getName() + ",id=" +
fileInfo.getID() + "," +
fileInfo.getCreatedBy() + "," +
fileInfo.getSize() + "," +
fileInfo.getVersion().getName() + "," +
fileInfo.getCreatedAt() + "," +
fileInfo.getModifiedAt() + "," +
fileInfo.getModifiedBy() +
"");
// download all the pdfs
if ( fileInfo.getName() != null && fileInfo.getID() != null && fileInfo.getName().endsWith(".pdf")) {
file = new BoxFile(api, fileInfo.getID());
FileOutputStream stream = null;
try {
stream = new FileOutputStream(fileInfo.getName());
} catch (FileNotFoundException e) {
e.printStackTrace();
}
file.download(stream); // downloads to the current directory via the FileOutputStream above
// close the stream so the file is fully flushed before we read it back
try {
stream.close();
} catch (Exception e) {
e.printStackTrace();
}
//Input stream for the file in local file system to be written to HDFS
InputStream in = null;
try {
in = new BufferedInputStream(new FileInputStream(fileInfo.getName()));
} catch (FileNotFoundException e1) {
e1.printStackTrace();
}
try{
System.out.println("Save to HDFS " + fileInfo.getName());
//Destination file in HDFS
Configuration conf = new Configuration();
String dst = "hdfs://yourserver:8020/box/" + fileInfo.getName();
FileSystem fs = FileSystem.get(URI.create(dst), conf);
OutputStream out = fs.create(new Path(dst));
//Copy file from local to HDFS
IOUtils.copyBytes(in, out, 4096, true);
java.nio.file.Path path = FileSystems.getDefault().getPath(fileInfo.getName());
Files.delete(path);
} catch (Exception e) {
e.printStackTrace();
System.out.println("Error copying " + fileInfo.getName() + " to HDFS");
}
}
}
}
}
private static void listFolder(BoxFolder folder, int depth) {
for (BoxItem.Info itemInfo : folder) {
String indent = "";
for (int i = 0; i < depth; i++) {
indent += " ";
}
// you need this ID for accessing a folder
System.out.println(indent + itemInfo.getName() + ",ID=" + itemInfo.getID() );
if (itemInfo instanceof BoxFolder.Info) {
BoxFolder childFolder = (BoxFolder) itemInfo.getResource();
if (depth < MAX_DEPTH) {
listFolder(childFolder, depth + 1);
}
}
}
}
}
Caveats
By default you can only use the Developer Token, which lasts for just one hour and vanishes from the screen as soon as you save, so copy it first.
Reference:
http://opensource.box.com/box-java-sdk/
https://github.com/box/box-java-sdk/blob/master/doc/authentication.md
https://github.com/box/box-java-sdk/blob/master/doc/folders.md
https://github.com/box/box-java-sdk/blob/master/doc/files.md
https://github.com/tspannhw/boxprocessor
https://docs.box.com/v2.0/docs/configuring-box-platform
https://docs.box.com/docs/app-auth
https://github.com/box/box-java-sdk/blob/master/src/example/java/com/box/sdk/example/Main.java
https://github.com/box/box-java-sdk/blob/master/doc/folders.md#get-a-folders-items
https://github.com/box/box-java-sdk/blob/master/doc/files.md#download-a-file
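As a small illustration of the Folder ID convention described above, the numeric ID can be pulled out of a Box folder URL with plain string handling (the class and method names here are hypothetical helpers, not part of the Box SDK):

```java
public class BoxFolderId {
    // Extracts the numeric Folder ID that follows the "/f/" segment of a
    // Box folder URL, e.g.
    // https://myenterprise.app.box.com/files/0/f/26783331215/NIFITEST
    public static String folderIdFromUrl(String url) {
        int start = url.indexOf("/f/");
        if (start < 0) {
            return null; // not a folder URL
        }
        start += 3; // skip past "/f/"
        int end = url.indexOf('/', start);
        return end < 0 ? url.substring(start) : url.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(folderIdFromUrl(
            "https://myenterprise.app.box.com/files/0/f/26783331215/NIFITEST"));
        // prints 26783331215
    }
}
```

The extracted string is what you pass to the BoxFolder constructor, as in new BoxFolder(api, "15296958056") in the code above.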
Labels: