Created on 03-20-2018 05:47 PM - edited on 03-12-2020 04:11 AM by VidyaSargur
Integrating lucene-geo-gazetteer For Geo Parsing with Apache NiFi
lucene-geo-gazetteer is a very cool tool built around Apache Tika, Apache Lucene and Apache OpenNLP. It builds a fast index of geo data from a large all-countries dataset, then provides a REST API that we can easily integrate into a flow.
I have connected it to a NiFi flow to enrich Twitter data with geo data.
Example NiFi Flow To Convert Twitter Locations Into Geo Information
Downloading the Countries Data and Building the Geo Indexes
Calling the Local Geo Server
Example JSON Data Returned
Let's pull out the fields we want after the split
Let's build a new JSON file with just the fields we want, including the new geo ones.
Example JSON Processed
{"msg":"RT @pauljauregui: Cybersecurity Startups Struggle - https://t.co/wADHLyUEEB #CyberSecurity #AI #IoT #IIoT #IndustrialIoT #DataSecurity #Sec…","unixtime":"1516754942404","friends_count":"4293","sentiment":"NEGATIVE","geolongitude":"-98.5","hashtags":"[\"CyberSecurity\",\"AI\",\"IoT\",\"IIoT\",\"IndustrialIoT\",\"DataSecurity\"]","listed_count":"520","tweet_id":"955965632402485248","user_name":"Lee Weiden","favourites_count":"12454","source":"<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>","vadersentiment":"Compound -0.3182 Negative 0.161 Neutral 0.839 Positive 0.0 \n","placename":"United States","media_url":"[]","sentiment2":"Negative\n","retweet_count":"0","user_mentions_name":"[]","geo":"","urls":"[]","countryCode":"US","user_url":"","place":"","timestamp":"1516754942404","geolatitude":"39.76","coordinates":"","handle":"LeeWeiden","profile_image_url":"http://xxx.xxxx.com/profile_images/777401884629803009/dUOFoLnt_normal.jpg","time_zone":"Eastern Time (US & Canada)","ext_media":"[]","statuses_count":"186127","followers_count":"1461","location":"United States","time":"Wed Jan 24 00:49:02 +0000 2018","user_mentions":"[]","user_description":"Drivers, Entrepreneur, Family, Faith, Fun, Fitness, Technology, CRM, Social, Mobility & Customer Experience."}
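The geo fields in the record above (placename, countryCode, geolatitude, geolongitude) sit alongside the original tweet fields as strings. The following is a minimal Python sketch of that merge step, not the NiFi flow itself. The `enrich` function and the mapping of the gazetteer's `name` to `placename` are assumptions for illustration; the hit uses the key names shown under Example Results, with values copied from the record above.

```python
def enrich(tweet, hit):
    """Return a copy of the tweet with geo fields added as strings.

    `hit` is assumed to be one gazetteer match with the keys
    name, countryCode, latitude and longitude.
    """
    enriched = dict(tweet)  # leave the caller's record untouched
    enriched["placename"] = hit["name"]  # assumed mapping: name -> placename
    enriched["countryCode"] = hit["countryCode"]
    enriched["geolatitude"] = str(hit["latitude"])
    enriched["geolongitude"] = str(hit["longitude"])
    return enriched


# Illustrative inputs: values taken from the processed record above.
tweet = {"handle": "LeeWeiden", "location": "United States"}
hit = {"name": "United States", "countryCode": "US",
       "latitude": 39.76, "longitude": -98.5}

print(enrich(tweet, hit))
```

Keeping every value as a string matches the record above, where even counts and coordinates are quoted.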
Test The API
http://localhost:8765/api/search?s=Hightstown&s=New+Jersey
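The test URL repeats the `s` parameter once per place name and encodes spaces as `+`. As a rough sketch, the same URL can be assembled in Python with the standard library; the helper name `build_search_url` is hypothetical, and the host and port are simply those from the URL above.

```python
from urllib.parse import urlencode


def build_search_url(places, base="http://localhost:8765/api/search"):
    """Return a search URL with one `s` parameter per place name."""
    # A list of (key, value) pairs repeats the key and preserves order;
    # urlencode encodes spaces as '+' by default.
    query = urlencode([("s", place) for place in places])
    return f"{base}?{query}"


print(build_search_url(["Hightstown", "New Jersey"]))
```

Building the query from pairs rather than by string concatenation means place names with spaces or punctuation are escaped consistently.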
Build the Index From All Countries Dataset
./src/main/bin/lucene-geo-gazetteer -i geoIndex -b allCountries.txt
Run the REST Server
./src/main/bin/lucene-geo-gazetteer -server
Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardContext setPath
WARNING: A context path must either be an empty string or start with a '/' and do not end with a '/'. The path [/] does not meet these criteria and has been changed to []
Starting Embedded Tomcat on port : 8765
Mar 20, 2018 12:33:35 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-nio-8765"]
Mar 20, 2018 12:33:35 PM org.apache.tomcat.util.net.NioSelectorPool getSharedSelector
INFO: Using a shared selector for servlet write/read
Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardService startInternal
INFO: Starting service Tomcat
Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet Engine: Apache Tomcat/8.0.28
Mar 20, 2018 12:33:35 PM org.apache.cxf.transport.servlet.CXFNonSpringServlet loadBusNoConfig
INFO: Load the bus without application context
Mar 20, 2018 12:33:36 PM org.springframework.context.support.AbstractApplicationContext prepareRefresh
INFO: Refreshing org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9: display name [org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9]; startup date [Tue Mar 20 12:33:36 EDT 2018]; root of context hierarchy
Mar 20, 2018 12:33:36 PM org.apache.cxf.bus.spring.BusApplicationContext getConfigResources
INFO: No cxf.xml configuration file detected, relying on defaults.
Mar 20, 2018 12:33:36 PM org.springframework.context.support.AbstractApplicationContext obtainFreshBeanFactory
INFO: Bean factory for application context [org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9]: org.springframework.beans.factory.support.DefaultListableBeanFactory@3d3d719b
Mar 20, 2018 12:33:36 PM org.springframework.beans.factory.support.DefaultListableBeanFactory preInstantiateSingletons
INFO: Pre-instantiating singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@3d3d719b: defining beans [cxf,org.apache.cxf.bus.spring.BusApplicationListener,org.apache.cxf.bus.spring.BusWiringBeanFactoryPostProcessor,org.apache.cxf.bus.spring.Jsr250BeanPostProcessor,org.apache.cxf.bus.spring.BusExtensionPostProcessor,org.apache.cxf.resource.ResourceManager,org.apache.cxf.configuration.Configurer,org.apache.cxf.binding.BindingFactoryManager,org.apache.cxf.transport.DestinationFactoryManager,org.apache.cxf.transport.ConduitInitiatorManager,org.apache.cxf.wsdl.WSDLManager,org.apache.cxf.phase.PhaseManager,org.apache.cxf.workqueue.WorkQueueManager,org.apache.cxf.buslifecycle.BusLifeCycleManager,org.apache.cxf.endpoint.ServerRegistry,org.apache.cxf.endpoint.ServerLifeCycleManager,org.apache.cxf.endpoint.ClientLifeCycleManager,org.apache.cxf.transports.http.QueryHandlerRegistry,org.apache.cxf.endpoint.EndpointResolverRegistry,org.apache.cxf.headers.HeaderManager,org.apache.cxf.catalog.OASISCatalogManager,org.apache.cxf.endpoint.ServiceContractResolverRegistry,org.apache.cxf.jaxrs.JAXRSBindingFactory]; root of factory hierarchy
Mar 20, 2018 12:33:36 PM org.apache.cxf.transport.servlet.AbstractCXFServlet replaceDestinationFactory
INFO: Replaced the http destination factory with servlet transport factory
Mar 20, 2018 12:33:36 PM edu.usc.ir.geo.gazetteer.api.SearchResource <init>
INFO: Initialising searcher from index /Volumes/seagate/opensourcecomputervision/lucene-geo-gazetteer/src/main/bin/../../../geoIndex
Example Call
http://localhost:8765/api/search?s=Hightstown&s=New+Jersey
Example Results
{"Hightstown":[{"name":"Hightstown","countryCode":"US","admin1Code":"NJ","admin2Code":"021","latitude":40.26955,"longitude":-74.52321}],"New Jersey":[{"name":"New Jersey","countryCode":"US","admin1Code":"NJ","admin2Code":"","latitude":40.16706,"longitude":-74.49987}]}
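The response is a JSON object keyed by search term, each holding a list of candidate matches. To make that structure concrete, here is a small Python sketch that parses the exact response above and takes the first candidate for each term. The function name `first_matches` is hypothetical, and taking the first candidate is an assumption about which match is wanted, not documented API behavior.

```python
import json

# The example response from above, embedded so the sketch runs without a server.
SAMPLE = (
    '{"Hightstown":[{"name":"Hightstown","countryCode":"US","admin1Code":"NJ",'
    '"admin2Code":"021","latitude":40.26955,"longitude":-74.52321}],'
    '"New Jersey":[{"name":"New Jersey","countryCode":"US","admin1Code":"NJ",'
    '"admin2Code":"","latitude":40.16706,"longitude":-74.49987}]}'
)


def first_matches(response_text):
    """Map each searched term to its first candidate, or None if the list is empty."""
    results = json.loads(response_text)
    return {term: (hits[0] if hits else None) for term, hits in results.items()}


for term, hit in first_matches(SAMPLE).items():
    if hit is not None:
        print(term, hit["countryCode"], hit["latitude"], hit["longitude"])
```

Handling the empty-list case explicitly avoids an IndexError when a location string has no match.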
Example NiFi Flow
Example Schema
{ "type": "record", "name": "twitter", "fields": [ { "name": "msg", "type": "string" }, { "name": "unixtime", "type": "string" }, { "name": "friends_count", "type": "string" }, { "name": "sentiment", "type": "string" }, { "name": "geolongitude", "type": "string" }, { "name": "hashtags", "type": "string" }, { "name": "listed_count", "type": "string" }, { "name": "tweet_id", "type": "string" }, { "name": "user_name", "type": "string" }, { "name": "favourites_count", "type": "string" }, { "name": "source", "type": "string" }, { "name": "vadersentiment", "type": "string" }, { "name": "placename", "type": "string" }, { "name": "media_url", "type": "string" }, { "name": "sentiment2", "type": "string" }, { "name": "retweet_count", "type": "string" }, { "name": "user_mentions_name", "type": "string" }, { "name": "geo", "type": "string" }, { "name": "urls", "type": "string" }, { "name": "countryCode", "type": "string" }, { "name": "user_url", "type": "string" }, { "name": "place", "type": "string", "doc": "Type inferred from '\"\"'" }, { "name": "timestamp", "type": "string", "doc": "Type inferred from '\"1516754942404\"'" }, { "name": "geolatitude", "type": "string", "doc": "Type inferred from '\"39.76\"'" }, { "name": "coordinates", "type": "string", "doc": "Type inferred from '\"\"'" }, { "name": "handle", "type": "string", "doc": "Type inferred from '\"LeeWeiden\"'" }, { "name": "profile_image_url", "type": "string", "doc": "Type inferred from '\"http://xxx.xxx.com/profile_images/777401884629803009/dUOFoLnt_normal.jpg\"'" }, { "name": "time_zone", "type": "string", "doc": "Type inferred from '\"Eastern Time (US & Canada)\"'" }, { "name": "ext_media", "type": "string", "doc": "Type inferred from '\"[]\"'" }, { "name": "statuses_count", "type": "string", "doc": "Type inferred from '\"186127\"'" }, { "name": "followers_count", "type": "string", "doc": "Type inferred from '\"1461\"'" }, { "name": "location", "type": "string", "doc": "Type inferred from '\"United States\"'" 
}, { "name": "time", "type": "string", "doc": "Type inferred from '\"Wed Jan 24 00:49:02 +0000 2018\"'" }, { "name": "user_mentions", "type": "string", "doc": "Type inferred from '\"[]\"'" }, { "name": "user_description", "type": "string", "doc": "Type inferred from '\"Drivers, Entrepreneur, Family, Faith, Fun, Fitness, Technology, CRM, Social, Mobility & Customer Experience.\"'" } ] }
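Every field in this schema is typed as a string, so a quick sanity check on a record is that each named field is present and is a string. The sketch below does that check in plain Python against a four-field subset of the schema above. It is not an Avro validator, and the helper name `schema_problems` and the sample record are illustrative assumptions.

```python
import json

# A subset of the schema above, kept short for the example.
SCHEMA_SUBSET = json.loads("""
{ "type": "record", "name": "twitter", "fields": [
  { "name": "msg", "type": "string" },
  { "name": "countryCode", "type": "string" },
  { "name": "geolatitude", "type": "string" },
  { "name": "geolongitude", "type": "string" }
] }
""")


def schema_problems(record, schema):
    """List fields that are missing or are not strings where the schema says string."""
    problems = []
    for field in schema["fields"]:
        name = field["name"]
        if name not in record:
            problems.append(f"missing field: {name}")
        elif field["type"] == "string" and not isinstance(record[name], str):
            problems.append(f"not a string: {name}")
    return problems


# Illustrative record: geolongitude is deliberately left as a number.
record = {"msg": "example", "countryCode": "US",
          "geolatitude": "39.76", "geolongitude": -98.5}
print(schema_problems(record, SCHEMA_SUBSET))
```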
References
https://wiki.apache.org/tika/GeoTopicParser