03-20-2018
05:47 PM
7 Kudos
Integrating lucene-geo-gazetteer For Geo Parsing with Apache NiFi
lucene-geo-gazetteer is a very cool Apache Tika, Apache Lucene and Apache OpenNLP tool that builds a fast index of geo data from a large dataset covering all countries. It then provides a REST API that we can easily integrate into a flow.
I have connected it to a NiFi flow to enrich Twitter data with geo data.
Example NiFi Flow To Convert Twitter Locations Into Geo Information
Downloading the Countries Data and Building the Geo Indexes
Calling the Local Geo Server
Example JSON Data Returned
Let's pull out the fields we want after the split
Let's build a new JSON file with just the fields we want, including the new geo ones.
Example JSON Processed
{"msg":"RT @pauljauregui: Cybersecurity Startups Struggle - https://t.co/wADHLyUEEB #CyberSecurity #AI #IoT #IIoT #IndustrialIoT #DataSecurity #Sec…","unixtime":"1516754942404","friends_count":"4293","sentiment":"NEGATIVE","geolongitude":"-98.5","hashtags":"[\"CyberSecurity\",\"AI\",\"IoT\",\"IIoT\",\"IndustrialIoT\",\"DataSecurity\"]","listed_count":"520","tweet_id":"955965632402485248","user_name":"Lee Weiden","favourites_count":"12454","source":"<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>","vadersentiment":"Compound -0.3182 Negative 0.161 Neutral 0.839 Positive 0.0 \n","placename":"United States","media_url":"[]","sentiment2":"Negative\n","retweet_count":"0","user_mentions_name":"[]","geo":"","urls":"[]","countryCode":"US","user_url":"","place":"","timestamp":"1516754942404","geolatitude":"39.76","coordinates":"","handle":"LeeWeiden","profile_image_url":"http://xxx.xxxx.com/profile_images/777401884629803009/dUOFoLnt_normal.jpg","time_zone":"Eastern Time (US & Canada)","ext_media":"[]","statuses_count":"186127","followers_count":"1461","location":"United States","time":"Wed Jan 24 00:49:02 +0000 2018","user_mentions":"[]","user_description":"Drivers, Entrepreneur, Family, Faith, Fun, Fitness, Technology, CRM, Social, Mobility & Customer Experience."}
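The step of building a new JSON record with the geo fields added can be sketched outside NiFi in a few lines of Python. This is only an illustration, not the flow itself: the function name enrich_tweet is hypothetical, and the mapping of the gazetteer's name, countryCode, latitude and longitude onto placename, countryCode, geolatitude and geolongitude is an assumption based on the field names in the processed example above. Values are written as strings because that is how they appear in that example.

```python
def enrich_tweet(tweet, geo_hit):
    """Return a copy of `tweet` with geo fields added from one gazetteer hit.

    `geo_hit` is assumed to be a dict shaped like one entry of the
    gazetteer's JSON response (name, countryCode, latitude, longitude).
    The target field names are an assumption taken from the processed
    example record; the original tweet dict is left untouched.
    """
    enriched = dict(tweet)
    if geo_hit is None:
        # No match for the location: pass the record through unchanged.
        return enriched
    enriched["placename"] = str(geo_hit.get("name", ""))
    enriched["countryCode"] = str(geo_hit.get("countryCode", ""))
    enriched["geolatitude"] = str(geo_hit.get("latitude", ""))
    enriched["geolongitude"] = str(geo_hit.get("longitude", ""))
    return enriched


if __name__ == "__main__":
    tweet = {"msg": "hello", "location": "Hightstown"}
    hit = {"name": "Hightstown", "countryCode": "US",
           "latitude": 40.26955, "longitude": -74.52321}
    print(enrich_tweet(tweet, hit))
```

Returning a copy rather than mutating the input keeps the original record available if the lookup has to be retried.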
Test The API
http://localhost:8765/api/search?s=Hightstown&s=New+Jersey
Build the Index From All Countries Dataset
./src/main/bin/lucene-geo-gazetteer -i geoIndex -b allCountries.txt
Run the REST Server
./src/main/bin/lucene-geo-gazetteer -server
Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardContext setPath
WARNING: A context path must either be an empty string or start with a '/' and do not end with a '/'. The path [/] does not meet these criteria and has been changed to []
Starting Embedded Tomcat on port : 8765
Mar 20, 2018 12:33:35 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-nio-8765"]
Mar 20, 2018 12:33:35 PM org.apache.tomcat.util.net.NioSelectorPool getSharedSelector
INFO: Using a shared selector for servlet write/read
Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardService startInternal
INFO: Starting service Tomcat
Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet Engine: Apache Tomcat/8.0.28
Mar 20, 2018 12:33:35 PM org.apache.cxf.transport.servlet.CXFNonSpringServlet loadBusNoConfig
INFO: Load the bus without application context
Mar 20, 2018 12:33:36 PM org.springframework.context.support.AbstractApplicationContext prepareRefresh
INFO: Refreshing org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9: display name [org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9]; startup date [Tue Mar 20 12:33:36 EDT 2018]; root of context hierarchy
Mar 20, 2018 12:33:36 PM org.apache.cxf.bus.spring.BusApplicationContext getConfigResources
INFO: No cxf.xml configuration file detected, relying on defaults.
Mar 20, 2018 12:33:36 PM org.springframework.context.support.AbstractApplicationContext obtainFreshBeanFactory
INFO: Bean factory for application context [org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9]: org.springframework.beans.factory.support.DefaultListableBeanFactory@3d3d719b
Mar 20, 2018 12:33:36 PM org.springframework.beans.factory.support.DefaultListableBeanFactory preInstantiateSingletons
INFO: Pre-instantiating singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@3d3d719b: defining beans [cxf,org.apache.cxf.bus.spring.BusApplicationListener,org.apache.cxf.bus.spring.BusWiringBeanFactoryPostProcessor,org.apache.cxf.bus.spring.Jsr250BeanPostProcessor,org.apache.cxf.bus.spring.BusExtensionPostProcessor,org.apache.cxf.resource.ResourceManager,org.apache.cxf.configuration.Configurer,org.apache.cxf.binding.BindingFactoryManager,org.apache.cxf.transport.DestinationFactoryManager,org.apache.cxf.transport.ConduitInitiatorManager,org.apache.cxf.wsdl.WSDLManager,org.apache.cxf.phase.PhaseManager,org.apache.cxf.workqueue.WorkQueueManager,org.apache.cxf.buslifecycle.BusLifeCycleManager,org.apache.cxf.endpoint.ServerRegistry,org.apache.cxf.endpoint.ServerLifeCycleManager,org.apache.cxf.endpoint.ClientLifeCycleManager,org.apache.cxf.transports.http.QueryHandlerRegistry,org.apache.cxf.endpoint.EndpointResolverRegistry,org.apache.cxf.headers.HeaderManager,org.apache.cxf.catalog.OASISCatalogManager,org.apache.cxf.endpoint.ServiceContractResolverRegistry,org.apache.cxf.jaxrs.JAXRSBindingFactory]; root of factory hierarchy
Mar 20, 2018 12:33:36 PM org.apache.cxf.transport.servlet.AbstractCXFServlet replaceDestinationFactory
INFO: Replaced the http destination factory with servlet transport factory
Mar 20, 2018 12:33:36 PM edu.usc.ir.geo.gazetteer.api.SearchResource <init>
INFO: Initialising searcher from index /Volumes/seagate/opensourcecomputervision/lucene-geo-gazetteer/src/main/bin/../../../geoIndex
Example Call
http://localhost:8765/api/search?s=Hightstown&s=New+Jersey
Example Results
{"Hightstown":[{"name":"Hightstown","countryCode":"US","admin1Code":"NJ","admin2Code":"021","latitude":40.26955,"longitude":-74.52321}],"New Jersey":[{"name":"New Jersey","countryCode":"US","admin1Code":"NJ","admin2Code":"","latitude":40.16706,"longitude":-74.49987}]}
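For testing the same call from a script instead of a browser, here is a minimal Python sketch using only the standard library. The URL and the repeated s parameter come from the example call above; the helper names (build_search_url, first_matches, lookup) and the choice to keep only the first candidate per place are my own assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example call in this post.
BASE_URL = "http://localhost:8765/api/search"


def build_search_url(places, base_url=BASE_URL):
    """Build the search URL with one `s` query parameter per place name."""
    return base_url + "?" + urlencode([("s", place) for place in places])


def first_matches(response_text):
    """Map each searched place to its first candidate, or None if no hits.

    Assumes the response is a JSON object keyed by place name, each value
    a list of candidates, as in the example results above.
    """
    results = json.loads(response_text)
    return {place: (hits[0] if hits else None)
            for place, hits in results.items()}


def lookup(places, timeout=10):
    """Call the local geo server (it must already be running)."""
    with urlopen(build_search_url(places), timeout=timeout) as resp:
        return first_matches(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(build_search_url(["Hightstown", "New Jersey"]))
```

With the REST server running, lookup(["Hightstown", "New Jersey"]) should return one dict per place with the latitude and longitude fields shown in the example results.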
Example NiFi Flow
example-geo.xml
Example Schema
{
"type": "record",
"name": "twitter",
"fields": [
{
"name": "msg",
"type": "string"
},
{
"name": "unixtime",
"type": "string"
},
{
"name": "friends_count",
"type": "string"
},
{
"name": "sentiment",
"type": "string"
},
{
"name": "geolongitude",
"type": "string"
},
{
"name": "hashtags",
"type": "string"
},
{
"name": "listed_count",
"type": "string"
},
{
"name": "tweet_id",
"type": "string"
},
{
"name": "user_name",
"type": "string"
},
{
"name": "favourites_count",
"type": "string"
},
{
"name": "source",
"type": "string"
},
{
"name": "vadersentiment",
"type": "string"
},
{
"name": "placename",
"type": "string"
},
{
"name": "media_url",
"type": "string"
},
{
"name": "sentiment2",
"type": "string"
},
{
"name": "retweet_count",
"type": "string"
},
{
"name": "user_mentions_name",
"type": "string"
},
{
"name": "geo",
"type": "string"
},
{
"name": "urls",
"type": "string"
},
{
"name": "countryCode",
"type": "string"
},
{
"name": "user_url",
"type": "string"
},
{
"name": "place",
"type": "string",
"doc": "Type inferred from '\"\"'"
},
{
"name": "timestamp",
"type": "string",
"doc": "Type inferred from '\"1516754942404\"'"
},
{
"name": "geolatitude",
"type": "string",
"doc": "Type inferred from '\"39.76\"'"
},
{
"name": "coordinates",
"type": "string",
"doc": "Type inferred from '\"\"'"
},
{
"name": "handle",
"type": "string",
"doc": "Type inferred from '\"LeeWeiden\"'"
},
{
"name": "profile_image_url",
"type": "string",
"doc": "Type inferred from '\"http://xxx.xxx.com/profile_images/777401884629803009/dUOFoLnt_normal.jpg\"'"
},
{
"name": "time_zone",
"type": "string",
"doc": "Type inferred from '\"Eastern Time (US & Canada)\"'"
},
{
"name": "ext_media",
"type": "string",
"doc": "Type inferred from '\"[]\"'"
},
{
"name": "statuses_count",
"type": "string",
"doc": "Type inferred from '\"186127\"'"
},
{
"name": "followers_count",
"type": "string",
"doc": "Type inferred from '\"1461\"'"
},
{
"name": "location",
"type": "string",
"doc": "Type inferred from '\"United States\"'"
},
{
"name": "time",
"type": "string",
"doc": "Type inferred from '\"Wed Jan 24 00:49:02 +0000 2018\"'"
},
{
"name": "user_mentions",
"type": "string",
"doc": "Type inferred from '\"[]\"'"
},
{
"name": "user_description",
"type": "string",
"doc": "Type inferred from '\"Drivers, Entrepreneur, Family, Faith, Fun, Fitness, Technology, CRM, Social, Mobility & Customer Experience.\"'"
}
]
}
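Since every field in the schema above is a plain string, a schema of the same shape can be generated from any flat record. The Python sketch below is an assumption-laden illustration (the function names string_schema and missing_or_non_string are hypothetical and this is not how the schema in this post was produced); it also includes a small check for records that are missing a field or carry a non-string value.

```python
import json


def string_schema(record, name="twitter"):
    """Build an Avro-style record schema that types every field as string.

    Mirrors the shape of the example schema; field order follows the
    key order of the input record.
    """
    return {
        "type": "record",
        "name": name,
        "fields": [{"name": key, "type": "string"} for key in record],
    }


def missing_or_non_string(record, schema):
    """List schema fields that are absent from `record` or not strings."""
    return [field["name"] for field in schema["fields"]
            if not isinstance(record.get(field["name"]), str)]


if __name__ == "__main__":
    sample = {"msg": "hello", "geolatitude": "39.76", "countryCode": "US"}
    print(json.dumps(string_schema(sample), indent=1))
```

A check like this is handy before handing records to a record processor, since a numeric latitude would not match a schema that declares it as a string.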
References
https://wiki.apache.org/tika/GeoTopicParser
https://github.com/chrismattmann/lucene-geo-gazetteer
http://www.geonames.org/
03-20-2018
01:40 PM
It's Java: if you aren't expecting nulls, expect crashes. First rule of Java: handle nulls. Second rule of Java: handle exceptions, because you forgot the first rule.
03-17-2018
09:14 PM
Copy to the NiFi lib directory and restart.
03-16-2018
03:15 PM
Now use a record processor.
03-16-2018
12:05 PM
I was able to download with Chrome with no issues. Try Chrome or Firefox, and make sure your corporate firewall doesn't block VirtualBox files. Check your browser for errors.
03-16-2018
12:02 PM
Does Attunity work with CSV, JSON, XML and other files?
03-16-2018
12:02 PM
Thanks! Merge seems to be recommended by a few sources.
03-16-2018
12:19 AM
Other Options: https://github.com/enahwe/Csv2Hive
03-16-2018
12:16 AM
2 Kudos
NiFi JSON to DDL Custom Processor
Java Class
JUnit
This is a further enhanced version of the idea started here: https://community.hortonworks.com/articles/154957/converting-json-to-sql-ddl.html?es_p=6294995
There was some discussion on LinkedIn about the previous article being a good processor, so I decided to do that. This is pretty basic, but it handles most types okay. Date and number processing is a bit hacky, but guesses some types.
To install, copy the NAR file that you build or download from GitHub to your NiFi/lib directories and restart those servers.
Add the New Processor to Your Flow
Configure the Processor with a table type (that is ignored in this version)
Configure the Processor with a table name (this is important)
JsonToDDLProcessor Generated Docs
I configured my table name to be the filename without an extension for JSON
Output in NiFi
Example Flow
Enhancements In Consideration:
Apache OpenNLP
Apache Tika
Attribute Cleaner Enhancement
Deep Learning for Determining Types
Machine Learning for Type Inference
MITIE
Apache MXNet
TensorFlow
Stanford CoreNLP
Kite SDK
Hive Tools
Spark Tools
Make Fields Even Sized or Learn What Sizes Are Common
Profiling Data
Call to the community: if this is interesting, please join. If you don't want to code, please suggest enhancements, open tickets on bugs, spread the word. Thanks.
Source Code:
https://github.com/tspannhw/nifi-convertjsontoddl-processor
mvn archetype:generate
Install the Pre-Built Nar
https://github.com/tspannhw/nifi-convertjsontoddl-processor/releases/tag/v1.0
Test JSON Files
https://github.com/tspannhw/nifi-convertjsontoddl-processor/tree/master/nifi-convertjsontoddl-processors/src/test/resources
Table Create DDL
generatedddl
CREATE TABLE simple ( EMPID INT, GENDER CHAR(1), DEPTID INT, FIRSTNAME VARCHAR(17), LASTNAME VARCHAR(15), TOTALSPENT INT )
generatedddl
CREATE TABLE complex ( EMPID INT, GENDER CHAR(1), DEPTID INT, FIRSTNAME VARCHAR(17), LASTNAME VARCHAR(15), TOTALSPENT INT, ALONGFIELDNAME VARCHAR(33), MYFIELDISALARGESTRINGGUESSWHATTYPE VARCHAR(141), day9 INT, day0 INT, day1 INT, day2 INT, day3 INT, day4 INT, day5 INT, day6 INT, day7 INT, day8 INT, day9 INT, day0 INT, day1 INT, day INT, day INT, day INT, day INT, day INT, day INT, day INT, day INT, day INT, day0 INT, day1 INT, day2 INT, day3 INT, day4 INT, day5 INT, day6 INT, day7 INT, day8 INT, swver VARCHAR(41), hwver VARCHAR(15), mac VARCHAR(29), type VARCHAR(31), hwId VARCHAR(44), fwId VARCHAR(44), oemId VARCHAR(44), devname VARCHAR(51), model VARCHAR(21), deviceId VARCHAR(52), alias VARCHAR(59), iconhash CHAR(1), relaystate INT, ontime INT, activemode VARCHAR(20), feature VARCHAR(19), updating INT, rssi INT, ledoff INT, latitude INT, longitude INT, index INT, zonestr VARCHAR(59), tzstr VARCHAR(34), dstoffset INT, month INT, month INT, month INT, current INT, voltage INT, power INT, total INT, time DATETIME, ledon BOOLEAN, systemtime DATETIME )
generatedddl
CREATE TABLE inception ( uuid VARCHAR(41), toppct VARCHAR(25), top VARCHAR(29), toppct VARCHAR(25), top VARCHAR(32), toppct VARCHAR(25), top VARCHAR(47), toppct VARCHAR(25), top VARCHAR(28), toppct VARCHAR(25), top VARCHAR(25), imagefilename VARCHAR(51), runtime CHAR(1) )
generatedddl
CREATE TABLE weather ( version VARCHAR(15), xsinoNamespaceSchemaLocation VARCHAR(63), credit VARCHAR(43), creditURL VARCHAR(31), url VARCHAR(50), title VARCHAR(43), link VARCHAR(30), suggestedpickup VARCHAR(37), suggestedpickup_period VARCHAR(14), location VARCHAR(58), stationid VARCHAR(16), latitude VARCHAR(19), longitude VARCHAR(20), observationtime VARCHAR(52), observationtime_rfc822 DATETIME, weather VARCHAR(20), windstring VARCHAR(74), winddir VARCHAR(16), winddegrees VARCHAR(15), windmph VARCHAR(16), windgust_mph VARCHAR(16), windkt CHAR(1), windgust_kt VARCHAR(14), pressurein VARCHAR(17), visibilitymi VARCHAR(17), iconurl_base VARCHAR(57), twoday_history_url VARCHAR(59), iconurl_name VARCHAR(19), oburl VARCHAR(56), disclaimerurl VARCHAR(46), copyrighturl VARCHAR(46), privacypolicy_url VARCHAR(42) )
Example Flow
jsontotable.xml
Resources:
https://github.com/fitzscott/AirQuality/blob/master/HiveDataTypeGuesser.java
https://github.com/manalishah/mitie-resources
https://github.com/scrapinghub/skinfer/blob/master/README.rst
https://github.com/quux00/hive-json-schema
https://github.com/mit-nlp/MITIE
https://opennlp.apache.org/docs/1.5.3/manual/opennlp.html#tools.namefind.recognition
http://opennlp.sourceforge.net/models-1.5/
https://github.com/scrapinghub/dateparser/blob/master/README.rst
http://kitesdk.org/docs/1.0.0/Inferring-a-Schema-from-an-Avro-Data-File.html
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala
http://thornydev.blogspot.in/2013/07/querying-json-records-via-hive.html
https://github.com/catherinedevlin/ddl-generator
https://github.com/edenzik/JSON-to-PostgreSQL/blob/master/src/main/Parser.java
https://nifi.apache.org/developer-guide.html
https://community.hortonworks.com/articles/116803/building-a-custom-processor-in-apache-nifi-12-for.html
https://community.hortonworks.com/articles/4318/build-custom-nifi-processor.html
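To make the type-guessing idea behind the JSON-to-DDL processor concrete, here is a rough Python sketch. It is not the processor's Java code and does not reproduce its sizing rules (the VARCHAR widths in the generated DDL above are clearly computed differently). The function names, the two date formats tried, the DOUBLE type for floats and the rule of sizing VARCHAR to the sample value's length are all assumptions made for illustration.

```python
import json
from datetime import datetime

# Date layouts tried by this sketch; an assumption, not the processor's list.
DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S")


def guess_sql_type(value):
    """Guess a SQL column type from one sample JSON value."""
    if isinstance(value, bool):  # bool must be tested before int
        return "BOOLEAN"
    if isinstance(value, int):
        return "INT"
    if isinstance(value, float):
        return "DOUBLE"
    # Strings are used as-is; anything else (null, list, object) is
    # measured by its JSON text.
    text = value if isinstance(value, str) else json.dumps(value)
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "DATETIME"
        except ValueError:
            continue
    if len(text) <= 1:
        return "CHAR(1)"
    return f"VARCHAR({len(text)})"


def json_to_ddl(table_name, json_text):
    """Build a CREATE TABLE statement from one flat JSON object."""
    record = json.loads(json_text)
    columns = ",\n".join(f"  {key} {guess_sql_type(val)}"
                         for key, val in record.items())
    return f"CREATE TABLE {table_name} (\n{columns}\n)"


if __name__ == "__main__":
    print(json_to_ddl("simple", '{"EMPID": 7, "GENDER": "F", "ledon": true}'))
```

Guessing from a single sample is fragile, which is why profiling data and learning common field sizes are on the enhancement list above.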
03-15-2018
06:18 PM
1 Kudo
Docker seems to be very dependent on the version of Docker in use and requires a lot of memory.