chrismattmann / lucene-geo-gazetteer
Uses Apache Lucene, OpenNLP and geonames and extracts locations from text and geocodes them.
License: Apache License 2.0
The fields "feature class", "feature code", "population", "country code", "admin1 code", and "admin2 code" can help us define the granularity of our locations. We can then use these fields to select the most relevant location for a string.
For example, Pasadena currently returns the coordinates 4.6964,-74.06446, which point to a Pasadena in Colombia, while the better-known locations are Pasadena, CA and Pasadena, TX.
This link contains all the codes: http://www.geonames.org/export/codes.html.
This link contains the schema and other details of the data set: http://download.geonames.org/export/dump/readme.txt
@chrismattmann Are there any fields you would add or drop?
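One way to read the proposal above is as a ranking rule over the candidate records for a name. The sketch below is a minimal illustration, not the project's implementation: the `Candidate` type, the `pick_most_relevant` helper, and the choice to prefer feature classes A and P and then the largest population are all assumptions made for this example.

```python
from dataclasses import dataclass

# Assumption: administrative areas (A) and populated places (P) are preferred
# over every other GeoNames feature class. This ordering is illustrative only.
PREFERRED_CLASSES = {"A", "P"}


@dataclass
class Candidate:
    """Hypothetical record holding the GeoNames fields discussed above."""
    name: str
    feature_class: str
    feature_code: str
    population: int
    country_code: str


def pick_most_relevant(candidates):
    """Return the candidate in a preferred class with the largest population."""
    return min(
        candidates,
        key=lambda c: (
            0 if c.feature_class in PREFERRED_CLASSES else 1,  # class first
            -c.population,  # then larger population wins
        ),
    )
```

With a rule like this, a well-populated Pasadena would outrank a same-named neighbourhood whose population field is 0, whatever order the index returns them in.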
The test suite must have diverse test cases covering continents, countries, states, cities, towns, etc.
Consider using https://travis-ci.org/
Once we have this suite, we can ensure that future modifications don't break the current classifiers.
Locations in allCountries.txt are listed under their official names. A few locations are better known worldwide by their popular names than by their official names.
Example -
Possible approach: use the country code list from geonames.org [0] and add a dummy document to the Lucene index for every PCLI.
[0] - http://download.geonames.org/export/dump/countryInfo.txt
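As a rough sketch of the first half of that idea, the snippet below reads countryInfo.txt and yields one alias record per country. It assumes the file is tab-separated, that comment lines start with `#`, and that columns 0, 4 and 16 hold the ISO code, the country name and the geonameid; check those positions against the file's own header before relying on them. The `country_aliases` name is hypothetical, and adding the records to the Lucene index is left out.

```python
def country_aliases(lines):
    """Yield (iso_code, country_name, geoname_id) tuples from countryInfo.txt lines."""
    for line in lines:
        # Skip blank lines and the commented header block.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Assumed layout: 0 = ISO code, 4 = country name, 16 = geonameid.
        if len(cols) < 17:
            continue  # malformed or truncated row
        yield cols[0], cols[4], cols[16]
```

Each tuple could then back a dummy document whose name field is the popular country name and whose geonameid links back to the official entry.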
The current JSON output format is problematic and not intuitive.
For example,
$ lucene-geo-gazetteer -s "Los Angeles" -c 5
search produces the following result:
[
  {
    "Los Angeles": [
      "Los Angeles County", "-118.26102", "34.19801", "US", "CA", "037",
      "City of Los Angeles", "-118.53995", "34.15649", "US", "CA", "037",
      "City Lands of Los Angeles", "-118.23035", "34.05807", "US", "CA", "037",
      "Los Angeles", "-79.3643", "-1.41917", "EC", "13", "1207",
      "Los Angeles", "-72.32774", "-37.40792", "CL", "06", "83"
    ]
  }
]
A more intuitive output would be:
[
  {
    "Los Angeles": [
      ["Los Angeles County", "-118.26102", "34.19801", "US", "CA", "037"],
      ["City of Los Angeles", "-118.53995", "34.15649", "US", "CA", "037"],
      ["City Lands of Los Angeles", "-118.23035", "34.05807", "US", "CA", "037"],
      ["Los Angeles", "-79.3643", "-1.41917", "EC", "13", "1207"],
      ["Los Angeles", "-72.32774", "-37.40792", "CL", "06", "83"]
    ]
  }
]
This leaves room for future improvements without breaking existing clients. Try adding a new field to the flat output and consider what happens to the array indexes that clients depend on.
Better yet:
[
  {
    "Los Angeles": [
      {
        "name": "Los Angeles County",
        "lon": -118.26102,
        "lat": 34.19801,
        "country": "US",
        "state": "CA",
        "stateid": "037"
      },
      {...}
    ]
  }
]
This is definitely better because it is easy for REST clients to consume.
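Until the output changes, a client can regroup the flat array itself. The sketch below assumes each location always occupies exactly six consecutive strings in the order shown in the current output (name, lon, lat, country, state, stateid). The `group_results` helper is hypothetical, and the key names are borrowed from the object layout proposed above.

```python
FIELDS = ("name", "lon", "lat", "country", "state", "stateid")


def group_results(flat):
    """Turn the flat six-values-per-location list into a list of dicts."""
    width = len(FIELDS)
    if len(flat) % width != 0:
        raise ValueError("flat result length is not a multiple of %d" % width)
    records = []
    for start in range(0, len(flat), width):
        record = dict(zip(FIELDS, flat[start:start + width]))
        # Coordinates arrive as strings; convert them to numbers.
        record["lon"] = float(record["lon"])
        record["lat"] = float(record["lat"])
        records.append(record)
    return records
```

The fixed width of six is exactly the fragility described above: if a seventh field is ever added to the flat output, this regrouping breaks, whereas keyed objects would keep working.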
@chrismattmann: How about returning the country code, state, and county along with the coordinates? This information could be useful when we analyse the data in context. It would allow us to group locations by country, state, and county.
[
{"Pasadena" : [
"Pasadena",
"-74.06446",
"4.6964",
"US",
"CA",
"37"
]}
]
Aim: for a given latitude and longitude, return the name of the city, state, and country.
Lucene API - https://lucene.apache.org/core/5_0_0/spatial/index.html
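To make the goal concrete, here is a brute-force sketch that does not use the Lucene spatial API at all: it scans a list of places and returns the one with the smallest great-circle (haversine) distance to the query point. The `reverse_geocode` helper and the dict keys are assumptions for illustration. A real implementation would use a spatial index rather than a linear scan.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to guard against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def reverse_geocode(lat, lon, places):
    """Return the place dict (keys: name, lat, lon, state, country) nearest to (lat, lon)."""
    return min(places, key=lambda p: haversine_km(lat, lon, p["lat"], p["lon"]))
```

The caller would read `name`, `state` and `country` from the returned record.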
This is perhaps not a bug, but I just want to share some examples of incorrect location matching:
$ ./src/main/bin/lucene-geo-gazetteer -s "russia" "china" "America"
[
{"russia" : [
"Republic of Belarus","28.0","53.0","BY","00",""
]},
{"china" : [
"Taiwan","121.0","24.0","TW","00",""
]},
{"America" : [
"South America","-57.65625","-14.60485","","",""
]}
]
Russia and China are recognized correctly if capitalized.
With the current setup we enforce JDK 7 for building the project. It will undoubtedly compile fine with JDK 8, but the idea behind enforcing 7 was to not depend on 8 entirely, as some API in JDK 8 will not compile with java
With Animal Sniffer we can ensure that the code does not use any API that is missing from Java 7, even when it is compiled with JDK 8.
Thoughts? @chrismattmann
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>animal-sniffer-maven-plugin</artifactId>
  <version>1.7</version>
  <executions>
    <execution>
      <id>signature-check</id>
      <phase>compile</phase>
      <goals>
        <goal>check</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <signature>
      <groupId>org.codehaus.mojo.signature</groupId>
      <artifactId>java17</artifactId>
      <version>1.0</version>
    </signature>
  </configuration>
</plugin>
One Possible implementation:
@chrismattmann
Can we take one extra parameter to define the number of results that lucene-geo-gazetteer should return?