Integrating lucene-geo-gazetteer For Geo Parsing with Apache NiFi


lucene-geo-gazetteer is a tool from the Apache Tika, Apache Lucene and Apache OpenNLP ecosystem that builds a fast index of geo data from the allCountries dataset. It then serves that index through a REST API that is easy to integrate into a NiFi flow.

I have connected it to a NiFi flow that enriches Twitter data with geo data.

Example NiFi Flow To Convert Twitter Locations Into Geo Information

64749-geoflow.png

Downloading the Countries Data and Building the Geo Indexes

64744-lucenegeobuild.png

Calling the Local Geo Server

64745-searchgeourl.png

Example JSON Data Returned

64746-geooutputjson.png

Let's pull out the fields we want after the split.

64747-geoevaluatejsonpath.png

Let's build a new JSON file with just the fields we want, including the new geo ones.

64748-geocodingattributes.png


Example JSON Processed

{"msg":"RT @pauljauregui: Cybersecurity Startups Struggle - https://t.co/wADHLyUEEB #CyberSecurity #AI #IoT #IIoT #IndustrialIoT #DataSecurity #Sec…","unixtime":"1516754942404","friends_count":"4293","sentiment":"NEGATIVE","geolongitude":"-98.5","hashtags":"[\"CyberSecurity\",\"AI\",\"IoT\",\"IIoT\",\"IndustrialIoT\",\"DataSecurity\"]","listed_count":"520","tweet_id":"955965632402485248","user_name":"Lee Weiden","favourites_count":"12454","source":"<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>","vadersentiment":"Compound -0.3182 Negative 0.161 Neutral 0.839 Positive 0.0 \n","placename":"United States","media_url":"[]","sentiment2":"Negative\n","retweet_count":"0","user_mentions_name":"[]","geo":"","urls":"[]","countryCode":"US","user_url":"","place":"","timestamp":"1516754942404","geolatitude":"39.76","coordinates":"","handle":"LeeWeiden","profile_image_url":"http://pbs.twimg.com/profile_images/777401884629803009/dUOFoLnt_normal.jpg","time_zone":"Eastern Time (US & Canada)","ext_media":"[]","statuses_count":"186127","followers_count":"1461","location":"United States","time":"Wed Jan 24 00:49:02 +0000 2018","user_mentions":"[]","user_description":"Drivers, Entrepreneur, Family, Faith, Fun, Fitness, Technology, CRM, Social, Mobility & Customer Experience."}
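For readers who want to see the enrichment step outside of NiFi, here is a minimal Python sketch of merging one gazetteer match into a tweet record. It is not the flow's actual implementation. The function name enrich_tweet and the shape of the match argument are assumptions for illustration; the output field names (placename, countryCode, geolatitude, geolongitude) and the choice to store every value as a string follow the processed record above.

```python
def enrich_tweet(tweet, match):
    """Return a copy of `tweet` with geo fields added from one gazetteer match.

    `match` is assumed to be a dict with name, countryCode, latitude and
    longitude keys, or None when the location could not be resolved.
    """
    enriched = dict(tweet)  # leave the caller's record untouched
    if match is None:
        # No match: keep the fields present but empty so the record shape is stable.
        for field in ("placename", "countryCode", "geolatitude", "geolongitude"):
            enriched[field] = ""
        return enriched
    # Values are stored as strings, as in the processed record.
    enriched["placename"] = str(match["name"])
    enriched["countryCode"] = str(match["countryCode"])
    enriched["geolatitude"] = str(match["latitude"])
    enriched["geolongitude"] = str(match["longitude"])
    return enriched


# Hypothetical inputs, trimmed to the fields relevant here.
tweet = {"handle": "LeeWeiden", "location": "United States"}
match = {"name": "United States", "countryCode": "US",
         "latitude": 39.76, "longitude": -98.5}
enriched = enrich_tweet(tweet, match)
```

Returning a copy rather than mutating the input keeps the original tweet available if the lookup has to be retried.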

Test The API

http://localhost:8765/api/search?s=Hightstown&s=New+Jersey
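The same request can be built from code. The sketch below is a hedged example using only the Python standard library. The base URL is taken from the call above, and the helper names build_search_url and search are assumptions for illustration. Each place name is sent as its own repeated s parameter, as in the URL above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL: host, port and path from the example call in this article.
BASE_URL = "http://localhost:8765/api/search"


def build_search_url(places, base_url=BASE_URL):
    """Build a search URL with one repeated `s` parameter per place name."""
    query = urlencode([("s", place) for place in places])
    return f"{base_url}?{query}"


def search(places, base_url=BASE_URL, timeout=10):
    """Call the gazetteer REST API and return the decoded JSON response.

    Requires the REST server to be running and reachable at `base_url`.
    """
    with urlopen(build_search_url(places, base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_search_url(["Hightstown", "New Jersey"])
```

With the server running, search(["Hightstown", "New Jersey"]) should return the parsed JSON for both names in a single request.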

Build the Index From All Countries Dataset

./src/main/bin/lucene-geo-gazetteer -i geoIndex -b allCountries.txt

Run the REST Server

./src/main/bin/lucene-geo-gazetteer -server

Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardContext setPath
WARNING: A context path must either be an empty string or start with a '/' and do not end with a '/'. The path [/] does not meet these criteria and has been changed to []
Starting Embedded Tomcat on port : 8765
Mar 20, 2018 12:33:35 PM org.apache.coyote.AbstractProtocol init
INFO: Initializing ProtocolHandler ["http-nio-8765"]
Mar 20, 2018 12:33:35 PM org.apache.tomcat.util.net.NioSelectorPool getSharedSelector
INFO: Using a shared selector for servlet write/read
Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardService startInternal
INFO: Starting service Tomcat
Mar 20, 2018 12:33:35 PM org.apache.catalina.core.StandardEngine startInternal
INFO: Starting Servlet Engine: Apache Tomcat/8.0.28
Mar 20, 2018 12:33:35 PM org.apache.cxf.transport.servlet.CXFNonSpringServlet loadBusNoConfig
INFO: Load the bus without application context
Mar 20, 2018 12:33:36 PM org.springframework.context.support.AbstractApplicationContext prepareRefresh
INFO: Refreshing org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9: display name [org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9]; startup date [Tue Mar 20 12:33:36 EDT 2018]; root of context hierarchy
Mar 20, 2018 12:33:36 PM org.apache.cxf.bus.spring.BusApplicationContext getConfigResources
INFO: No cxf.xml configuration file detected, relying on defaults.
Mar 20, 2018 12:33:36 PM org.springframework.context.support.AbstractApplicationContext obtainFreshBeanFactory
INFO: Bean factory for application context [org.apache.cxf.bus.spring.BusApplicationContext@293eb4d9]: org.springframework.beans.factory.support.DefaultListableBeanFactory@3d3d719b
Mar 20, 2018 12:33:36 PM org.springframework.beans.factory.support.DefaultListableBeanFactory preInstantiateSingletons
INFO: Pre-instantiating singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@3d3d719b: defining beans [cxf,org.apache.cxf.bus.spring.BusApplicationListener,org.apache.cxf.bus.spring.BusWiringBeanFactoryPostProcessor,org.apache.cxf.bus.spring.Jsr250BeanPostProcessor,org.apache.cxf.bus.spring.BusExtensionPostProcessor,org.apache.cxf.resource.ResourceManager,org.apache.cxf.configuration.Configurer,org.apache.cxf.binding.BindingFactoryManager,org.apache.cxf.transport.DestinationFactoryManager,org.apache.cxf.transport.ConduitInitiatorManager,org.apache.cxf.wsdl.WSDLManager,org.apache.cxf.phase.PhaseManager,org.apache.cxf.workqueue.WorkQueueManager,org.apache.cxf.buslifecycle.BusLifeCycleManager,org.apache.cxf.endpoint.ServerRegistry,org.apache.cxf.endpoint.ServerLifeCycleManager,org.apache.cxf.endpoint.ClientLifeCycleManager,org.apache.cxf.transports.http.QueryHandlerRegistry,org.apache.cxf.endpoint.EndpointResolverRegistry,org.apache.cxf.headers.HeaderManager,org.apache.cxf.catalog.OASISCatalogManager,org.apache.cxf.endpoint.ServiceContractResolverRegistry,org.apache.cxf.jaxrs.JAXRSBindingFactory]; root of factory hierarchy
Mar 20, 2018 12:33:36 PM org.apache.cxf.transport.servlet.AbstractCXFServlet replaceDestinationFactory
INFO: Replaced the http destination factory with servlet transport factory
Mar 20, 2018 12:33:36 PM edu.usc.ir.geo.gazetteer.api.SearchResource <init>
INFO: Initialising searcher from index /Volumes/seagate/opensourcecomputervision/lucene-geo-gazetteer/src/main/bin/../../../geoIndex 
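The server takes a moment to start, so a script that launches it before starting the flow may want to wait until the port accepts connections. The following is a small sketch of such a check and is not part of lucene-geo-gazetteer. The function name wait_for_port is an assumption, and the default port 8765 is the one reported in the log above.

```python
import socket
import time


def wait_for_port(host="localhost", port=8765, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the embedded Tomcat is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

An open port only shows that the server is listening, not that the index has finished loading, so a real search request is still the more reliable readiness check.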


Example Call

http://localhost:8765/api/search?s=Hightstown&s=New+Jersey


Example Results

{"Hightstown":[{"name":"Hightstown","countryCode":"US","admin1Code":"NJ","admin2Code":"021","latitude":40.26955,"longitude":-74.52321}],"New Jersey":[{"name":"New Jersey","countryCode":"US","admin1Code":"NJ","admin2Code":"","latitude":40.16706,"longitude":-74.49987}]}
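The response maps each search term to a list of matches. As a rough equivalent of pulling fields out with JSON paths in the flow, this Python sketch takes the first match for a term and renames its fields to the ones used in the enriched tweet. The helper name first_match is an assumption, and treating the first list entry as the best match is also an assumption, since the article does not state how results are ordered.

```python
import json

# The example response from this article, verbatim.
SAMPLE = (
    '{"Hightstown":[{"name":"Hightstown","countryCode":"US","admin1Code":"NJ",'
    '"admin2Code":"021","latitude":40.26955,"longitude":-74.52321}],'
    '"New Jersey":[{"name":"New Jersey","countryCode":"US","admin1Code":"NJ",'
    '"admin2Code":"","latitude":40.16706,"longitude":-74.49987}]}'
)


def first_match(results, place):
    """Return the geo fields of the first match for `place`, or None if absent."""
    matches = results.get(place) or []
    if not matches:
        return None
    top = matches[0]  # assumption: the first entry is the preferred match
    return {
        "placename": top["name"],
        "countryCode": top["countryCode"],
        "geolatitude": top["latitude"],
        "geolongitude": top["longitude"],
    }


results = json.loads(SAMPLE)
hightstown = first_match(results, "Hightstown")
```

A term with no entry in the response, or with an empty list, yields None rather than raising, which makes it easy to route unmatched locations separately.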

Example NiFi Flow

example-geo.xml

Example Schema

{
 "type": "record",
 "name": "twitter",
 "fields": [
  {
   "name": "msg",
   "type": "string"
  },
  {
   "name": "unixtime",
   "type": "string"
  },
  {
   "name": "friends_count",
   "type": "string"
  },
  {
   "name": "sentiment",
   "type": "string"
  },
  {
   "name": "geolongitude",
   "type": "string"
  },
  {
   "name": "hashtags",
   "type": "string"
  },
  {
   "name": "listed_count",
   "type": "string"
  },
  {
   "name": "tweet_id",
   "type": "string"
  },
  {
   "name": "user_name",
   "type": "string"
  },
  {
   "name": "favourites_count",
   "type": "string"
  },
  {
   "name": "source",
   "type": "string"
  },
  {
   "name": "vadersentiment",
   "type": "string"
  },
  {
   "name": "placename",
   "type": "string"
  },
  {
   "name": "media_url",
   "type": "string"
  },
  {
   "name": "sentiment2",
   "type": "string"
  },
  {
   "name": "retweet_count",
   "type": "string"
  },
  {
   "name": "user_mentions_name",
   "type": "string"
  },
  {
   "name": "geo",
   "type": "string"
  },
  {
   "name": "urls",
   "type": "string"
  },
  {
   "name": "countryCode",
   "type": "string"
  },
  {
   "name": "user_url",
   "type": "string"
  },
  {
   "name": "place",
   "type": "string",
   "doc": "Type inferred from '\"\"'"
  },
  {
   "name": "timestamp",
   "type": "string",
   "doc": "Type inferred from '\"1516754942404\"'"
  },
  {
   "name": "geolatitude",
   "type": "string",
   "doc": "Type inferred from '\"39.76\"'"
  },
  {
   "name": "coordinates",
   "type": "string",
   "doc": "Type inferred from '\"\"'"
  },
  {
   "name": "handle",
   "type": "string",
   "doc": "Type inferred from '\"LeeWeiden\"'"
  },
  {
   "name": "profile_image_url",
   "type": "string",
   "doc": "Type inferred from '\"http://pbs.twimg.com/profile_images/777401884629803009/dUOFoLnt_normal.jpg\"'"
  },
  {
   "name": "time_zone",
   "type": "string",
   "doc": "Type inferred from '\"Eastern Time (US & Canada)\"'"
  },
  {
   "name": "ext_media",
   "type": "string",
   "doc": "Type inferred from '\"[]\"'"
  },
  {
   "name": "statuses_count",
   "type": "string",
   "doc": "Type inferred from '\"186127\"'"
  },
  {
   "name": "followers_count",
   "type": "string",
   "doc": "Type inferred from '\"1461\"'"
  },
  {
   "name": "location",
   "type": "string",
   "doc": "Type inferred from '\"United States\"'"
  },
  {
   "name": "time",
   "type": "string",
   "doc": "Type inferred from '\"Wed Jan 24 00:49:02 +0000 2018\"'"
  },
  {
   "name": "user_mentions",
   "type": "string",
   "doc": "Type inferred from '\"[]\"'"
  },
  {
   "name": "user_description",
   "type": "string",
   "doc": "Type inferred from '\"Drivers, Entrepreneur, Family, Faith, Fun, Fitness, Technology, CRM, Social, Mobility & Customer Experience.\"'"
  }
 ]
}
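Every field in this schema is a plain string, so a quick sanity check of a record needs nothing beyond the standard json module. The sketch below is only a lightweight illustration, not Avro validation, which would need an Avro library. The function name check_record is an assumption, and the schema in the example is trimmed to three of the fields above to keep it short.

```python
import json


def check_record(schema, record):
    """Compare a record with a flat record schema.

    Returns (missing, non_string): field names absent from the record, and
    fields declared as "string" whose values are not Python strings.
    """
    missing = []
    non_string = []
    for field in schema["fields"]:
        name = field["name"]
        if name not in record:
            missing.append(name)
        elif field["type"] == "string" and not isinstance(record[name], str):
            non_string.append(name)
    return missing, non_string


# Trimmed copy of the schema above, limited to the geo fields.
schema = json.loads(
    '{"type":"record","name":"twitter","fields":['
    '{"name":"geolatitude","type":"string"},'
    '{"name":"geolongitude","type":"string"},'
    '{"name":"countryCode","type":"string"}]}'
)

# Hypothetical record with one missing field and one numeric value.
record = {"geolatitude": "39.76", "geolongitude": -98.5}
missing, non_string = check_record(schema, record)
```

Here countryCode is reported as missing and geolongitude as non-string, the kind of mismatch that appears when the numeric latitude and longitude from the gazetteer are not converted to strings before the record is written.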


References

https://wiki.apache.org/tika/GeoTopicParser

https://github.com/chrismattmann/lucene-geo-gazetteer

http://www.geonames.org/

Last update: 08-17-2019 08:21 AM