stuartemiddleton / geoparsepy

geoparsepy is a Python geoparsing library that will extract and disambiguate locations from text. It uses a local OpenStreetMap database which allows very high and unlimited geoparsing throughput, unlike approaches that use a third-party geocoding service (e.g. Google Geocoding API). This repository holds Python examples showing how to use the PyPI library.

License: Other

Python 100.00%
natural-language-processing artificial-intelligence information-extraction geoparse location-extraction toponym-resolution nlp openstreetmap postgresql

geoparsepy's Issues

possible installation problem with data files encoding

There is a possible problem with either database dump files (encoding?) or installation instructions.
(sorry if it's better suited to troubleshooting questions than issues, I didn't find another way to communicate)

Attempting to import the database files with

psql -h localhost -U postgres -d openstreetmap -f global_cities.sql

leads to many syntax errors like

psql:global_cities.sql:81540: ERROR: syntax error at or near "章丘"
LINE 1: 章丘", "wikidata"=>"Q197392", "wikipedia"=>"en:Zhangqiu Di...

Installation/environment notes:

$ cat /etc/issue
Ubuntu 18.04.5 LTS \n \l
$ echo $LANG
en_US.UTF-8

Since the dump files were created with PostgreSQL 11.3, I installed that version using the following Dockerfile:

FROM postgres:11.3

RUN apt-get update \
    && apt-get install wget -y \
    && apt-get install postgresql-11-postgis-3 -y \
    && apt-get install postgresql-11-postgis-3-scripts -y \
    && apt-get install postgis -y

COPY ./db.sql /docker-entrypoint-initdb.d/

# docker run --name pg-docker -e POSTGRES_PASSWORD=docker -d -p 5432:5432 -v $HOME/docker/volumes/postgres:/var/lib/postgresql/data postres_posgis11:latest
#psql -h localhost -U postgres -d openstreetmap -f global_cities.sql
#psql -h localhost -U postgres -d openstreetmap -f uk_places.sql
#psql -h localhost -U postgres -d openstreetmap -f north_america_places.sql
#psql -h localhost -U postgres -d openstreetmap -f europe_places.sql

(base) olga@anyclt104:~/workspace/edison/geoparsy_test/external/db$ cat db.sql
CREATE DATABASE openstreetmap;
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
CREATE EXTENSION IF NOT EXISTS postgis_tiger_geocoder;
CREATE EXTENSION IF NOT EXISTS hstore;
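
A quick way to tell whether a dump file itself is at fault is to check it for a leading UTF-8 byte order mark and for lines that are not valid UTF-8 (a BOM is reported as one cause in another issue on this page). The sketch below is only a diagnostic aid and not part of geoparsepy; the function name inspect_dump_encoding is hypothetical, and it assumes the dumps are meant to be UTF-8.

```python
import codecs


def inspect_dump_encoding(path):
    """Return (has_bom, first_bad_line) for a file expected to be UTF-8.

    has_bom is True when the file starts with the UTF-8 byte order mark.
    first_bad_line is the 1-based number of the first line that does not
    decode as UTF-8, or None when every line decodes cleanly.
    """
    has_bom = False
    with open(path, "rb") as handle:
        # Splitting on newlines is safe: UTF-8 multi-byte sequences never
        # contain the byte 0x0A.
        for line_number, raw_line in enumerate(handle, start=1):
            if line_number == 1 and raw_line.startswith(codecs.BOM_UTF8):
                has_bom = True
                raw_line = raw_line[len(codecs.BOM_UTF8):]
            try:
                raw_line.decode("utf-8")
            except UnicodeDecodeError:
                return has_bom, line_number
    return has_bom, None
```

Calling inspect_dump_encoding("global_cities.sql") would report a line number that can be compared with the line numbers in the psql error messages.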

Running Geoparsepy with other languages

Currently the following languages are supported:

English, French, German, Italian, Portuguese, Russian, Ukrainian
All other languages will work but there will be no language specific token expansion available

I've followed the instructions and got geoparsepy working with the example.

I tried adding a sentence to your listText, u'Hola, vivo en Madrid España', but it's not finding anything. The location "Madrid España" should be easy to find, as it's a direct lookup.

Do you have any advice on how to approach handling other languages?

Geoparsing doesn't match some cities

I tried to geoparse some phrases, but not all of the cities are matched (for example, 'Sciacca' and 'Asciano').
Note that all of the cities are present in the database and all of the phrases are tokenized correctly.

EDIT: I noticed that if I manually whitelist the cities everything works fine, but why are they not matched by default?

Here is my code:

import soton_corenlppy
import geoparsepy
import logging

logger = logging.getLogger("geoparsepy")
logging.basicConfig(level=logging.INFO, format="%(message)s")
logger.info('Logging started')

geospatial_config = geoparsepy.geo_parse_lib.get_geoparse_config(
    lang_codes=['it', 'en'],
    logger=logger
)

location_ids = {}
focus_areas = ['global_cities', 'europe_places', 'north_america_places', 'uk_places']
for focus_area in focus_areas:
    location_ids[focus_area + '_admin'] = [-1, -1]
    location_ids[focus_area + '_poly'] = [-1, -1]
    location_ids[focus_area + '_line'] = [-1, -1]
    location_ids[focus_area + '_point'] = [-1, -1]


# Create a connection with the database
database_handler = soton_corenlppy.PostgresqlHandler.PostgresqlHandler(
    user='postgres',
    passw=' ',
    hostname='localhost',
    port=5432,
    database='openstreetmap'
)

# Load a set of previously preprocessed locations from database
cached_locations = geoparsepy.geo_preprocess_lib.cache_preprocessed_locations(
    database_handle=database_handler,
    location_ids=location_ids,
    schema='public',
    geospatial_config=geospatial_config
)
logger.info(f"Loaded {len(cached_locations)} position")

# Close connection with the database
database_handler.close()


# Compile an inverted index from a list of arbitrary data where one column is a phrase string
indexed_locations = geoparsepy.geo_parse_lib.calc_inverted_index(
    list_data=cached_locations,
    dict_geospatial_config=geospatial_config
)
logger.info(f"Indexed {len(indexed_locations.keys())} phrases")

# Create an index of osmid to row indexes in the cached_locations
osmid_lookup = geoparsepy.geo_parse_lib.calc_osmid_lookup(cached_locations=cached_locations)


listText = [
    u'hello New York, USA its Bill from Bassett calling',
    u'live on the BBC Victoria Derbyshire is visiting Derbyshire for an exclusive UK interview',
    u'Domani vado a Roma, nel Lazio',
    u'Io sono di Sciacca, in provincia di agrigento',
    u'Vengo dalla provincia di Agrigento, in Sicilia',
    u'Mi sdraio sul prato del mio vicino',
    u'Pavia e Ravenna sono belle città',
    u'Voglio andare a new york',
    u'Mi trovo a San Giuliano Terme',
    u'Io sono di Sciacca, in provincia di Agrigento',
    u'Martina vive a Nuoro ma vorrebbe andare ad Agrigento',
    u'Agrigento è la provincia che contiene il comune di Sciacca',
    u'Vicino san giuliano terme c\'è un comune che si chiama Asciano',
    u'La città di Sciacca si trova in provincia di Agrigento',
    u'Mi trovo a Sciacca'
]

listTokenSets = []
for text in listText:
    # Tokenize a text entry into unigram tokens; the text is cleaned before tokenization
    listToken = soton_corenlppy.common_parse_lib.unigram_tokenize_text(
        text=text,
        dict_common_config=geospatial_config
    )
    listTokenSets.append(listToken)


# Geoparse token sets using a set of cached locations
listMatchSet = geoparsepy.geo_parse_lib.geoparse_token_set(
    token_set=listTokenSets,
    dict_inverted_index=indexed_locations,
    dict_geospatial_config=geospatial_config
)

# Print the matched location
for i in range(len(listMatchSet)):
    logger.info(f"\nText: {listText[i]}")
    listMatch = listMatchSet[i]
    for tupleMatch in listMatch:
        logger.info(str(tupleMatch))

The output is the following:

C:\Users\calog\PycharmProjects\geoparsepy\venv\Scripts\python.exe C:/Users/calog/PycharmProjects/geoparsepy/main2.py
Logging started
loading stoplist from C:\Users\calog\PycharmProjects\geoparsepy\venv\lib\site-packages\geoparsepy\corpus-geo-stoplist-it.txt
loading stoplist from C:\Users\calog\PycharmProjects\geoparsepy\venv\lib\site-packages\geoparsepy\corpus-geo-stoplist-en.txt
loading whitelist from C:\Users\calog\PycharmProjects\geoparsepy\venv\lib\site-packages\geoparsepy\corpus-geo-whitelist.txt
loading blacklist from C:\Users\calog\PycharmProjects\geoparsepy\venv\lib\site-packages\geoparsepy\corpus-geo-blacklist.txt
loading building types from C:\Users\calog\PycharmProjects\geoparsepy\venv\lib\site-packages\geoparsepy\corpus-buildingtype-it.txt
loading location type corpus C:\Users\calog\PycharmProjects\geoparsepy\venv\lib\site-packages\geoparsepy\corpus-buildingtype-it.txt
- 0 unique titles
- 61 unique types
loading street types from C:\Users\calog\PycharmProjects\geoparsepy\venv\lib\site-packages\geoparsepy\corpus-streettype-it.txt
loading location type corpus C:\Users\calog\PycharmProjects\geoparsepy\venv\lib\site-packages\geoparsepy\corpus-streettype-it.txt
- 10 unique titles
- 14 unique types
loading admin types from C:\Users\calog\PycharmProjects\geoparsepy\venv\lib\site-packages\geoparsepy\corpus-admintype-it.txt
loading location type corpus C:\Users\calog\PycharmProjects\geoparsepy\venv\lib\site-packages\geoparsepy\corpus-admintype-it.txt
- 10 unique titles
- 0 unique types
loading building types from C:\Users\calog\PycharmProjects\geoparsepy\venv\lib\site-packages\geoparsepy\corpus-buildingtype-en.txt
loading location type corpus C:\Users\calog\PycharmProjects\geoparsepy\venv\lib\site-packages\geoparsepy\corpus-buildingtype-en.txt
- 3 unique titles
- 76 unique types
loading street types from C:\Users\calog\PycharmProjects\geoparsepy\venv\lib\site-packages\geoparsepy\corpus-streettype-en.txt
loading location type corpus C:\Users\calog\PycharmProjects\geoparsepy\venv\lib\site-packages\geoparsepy\corpus-streettype-en.txt
- 15 unique titles
- 32 unique types
loading admin types from C:\Users\calog\PycharmProjects\geoparsepy\venv\lib\site-packages\geoparsepy\corpus-admintype-en.txt
loading location type corpus C:\Users\calog\PycharmProjects\geoparsepy\venv\lib\site-packages\geoparsepy\corpus-admintype-en.txt
- 14 unique titles
- 0 unique types
loading gazeteer from C:\Users\calog\PycharmProjects\geoparsepy\venv\lib\site-packages\geoparsepy\gazeteer-en.txt
caching locations : {'global_cities_admin': [-1, -1], 'global_cities_poly': [-1, -1], 'global_cities_line': [-1, -1], 'global_cities_point': [-1, -1], 'europe_places_admin': [-1, -1], 'europe_places_poly': [-1, -1], 'europe_places_line': [-1, -1], 'europe_places_point': [-1, -1], 'north_america_places_admin': [-1, -1], 'north_america_places_poly': [-1, -1], 'north_america_places_line': [-1, -1], 'north_america_places_point': [-1, -1], 'uk_places_admin': [-1, -1], 'uk_places_poly': [-1, -1], 'uk_places_line': [-1, -1], 'uk_places_point': [-1, -1]}
Loaded 800820 position
Indexed 605884 phrases

Text: hello New York, USA its Bill from Bassett calling
(1, 2, {(61785451,), (-175905,), (151937435,), (316976734,), (2218262347,), (29457403,), (-61320,)}, ('new', 'york'))
(2, 2, {(153924230,), (151528825,), (158656063,), (20913294,), (151672942,), (153595296,), (153968758,), (316990182,), (151651405,), (-134353,), (-1425436,), (153473841,)}, ('york',))
(4, 4, {(-148838,)}, ('usa',))
(8, 8, {(253067120,), (151840681,), (151463868,)}, ('bassett',))

Text: live on the BBC Victoria Derbyshire is visiting Derbyshire for an exclusive UK interview
(4, 4, {(75538688,), (385402175,), (151521359,), (74701108,), (-5606595,), (462241727,), (151395812,), (460070685,), (447925715,), (277608416,), (-1828436,), (-407423,), (154301948,), (-2316741,), (435240340,), (-5606596,), (463188523,), (151336948,), (151476805,), (30189922,), (158651084,), (-2256643,), (-10307525,)}, ('victoria',))
(8, 8, {(-195384,)}, ('derbyshire',))
(12, 12, {(-62149,)}, ('uk',))

Text: Domani vado a Roma, nel Lazio
(1, 1, {(151686158,)}, ('vado',))
(3, 3, {(385056116,), (-41313,)}, ('roma',))
(6, 6, {(-40784,)}, ('lazio',))

Text: Io sono di Sciacca, in provincia di agrigento

Text: Vengo dalla provincia di Agrigento, in Sicilia
(7, 7, {(-39152,)}, ('sicilia',))

Text: Mi sdraio sul prato del mio vicino
(3, 3, {(-42619,)}, ('prato',))

Text: Pavia e Ravenna sono belle città
(0, 0, {(158289705,), (-43483,), (230101550,)}, ('pavia',))
(2, 2, {(154313500,), (151333458,), (151866924,), (154149873,), (-42889,)}, ('ravenna',))
(4, 4, {(154337430,)}, ('belle',))

Text: Voglio andare a new york
(3, 4, {(61785451,), (-175905,), (151937435,), (316976734,), (2218262347,), (29457403,), (-61320,)}, ('new', 'york'))
(4, 4, {(153924230,), (151528825,), (158656063,), (20913294,), (151672942,), (153595296,), (153968758,), (316990182,), (151651405,), (-134353,), (-1425436,), (153473841,)}, ('york',))

Text: Mi trovo a San Giuliano Terme
(1, 1, {(62515792,)}, ('trovo',))
(3, 4, {(4594763552,), (130871200,), (6986638289,), (6008076012,), (3653962105,), (1213463381,), (5318245098,), (2815922128,)}, ('san', 'giuliano'))
(3, 5, {(258512997,)}, ('san', 'giuliano', 'terme'))
(5, 5, {(27013444,), (-1837372,)}, ('terme',))

Text: Io sono di Sciacca, in provincia di Agrigento

Text: Martina vive a Nuoro ma vorrebbe andare ad Agrigento
(3, 3, {(-39979,)}, ('nuoro',))
(8, 8, {(-39151,)}, ('agrigento',))

Text: Agrigento è la provincia che contiene il comune di Sciacca
(0, 0, {(-39151,)}, ('agrigento',))

Text: Vicino san giuliano terme c'è un comune che si chiama Asciano
(1, 2, {(4594763552,), (130871200,), (6986638289,), (6008076012,), (3653962105,), (1213463381,), (5318245098,), (2815922128,)}, ('san', 'giuliano'))
(1, 3, {(258512997,)}, ('san', 'giuliano', 'terme'))
(3, 3, {(27013444,), (-1837372,)}, ('terme',))

Text: La città di Sciacca si trova in provincia di Agrigento

Text: Mi trovo a Sciacca
(1, 1, {(62515792,)}, ('trovo',))

Broken encoding for cyrillic rows

You write: "Download pre-processed UTF-8 encoded SQL table dumps from OSM image dated dec 2019"
However, the database that generated the dumps used the English_United States.1252 encoding.
Because of that, the SQL files contain invalid data :(
[screenshot: Screen Shot 2023-12-30 at 1 25 51 PM]

There are also some rows without localization:
[screenshot: Screen Shot 2023-12-30 at 1 45 32 PM]

Do I need to run osm2pgsql myself, or is there some other solution?

Database setup fails

Hi,

I have made several attempts to set up the database required for geoparsepy, but failed on several devices and environments. I tried on macOS, both locally and with Docker, and likewise on Windows. I am not 100% certain, but I think the encoding of the SQL files is somehow broken. I am also struggling to set up the collation from the SQL files.

Can anyone confirm that the dumps actually work and maybe give some advice?

Thanks in advance.

Missing data for column "tags"

Dear Stuart,

When I restore the table of uk_places using your provided "uk_places.sql" file, I encountered the error "missing data for column "tags" CONTEXT: COPY uk_places_point, line 1: "1 Aaron's Hill {718482994} {-151304,-62149,-58447,-57582} 0101000020E6100000C737CAB0402AE4BFC8A9E7EE..."". I checked the content of sql file and suspected that there are some empty values for the column tags, e.g., the first row, Aaron's Hill.

Is there a way to resolve this, e.g. by using '' to fill the empty values?

Thanks in advance.

error in reading postgresql database

I am trying to replicate the example as explained in the docs.

However, I am having trouble extracting all files from geoparsepy_preprocessed_tables.tar.gz (using 7-Zip on Windows): I can only extract global_cities.sql and europe_places.sql.

I then create the PostgreSQL database following the example and launch the Python example file. The script fails here:

cached_locations = geoparsepy.geo_preprocess_lib.cache_preprocessed_locations( databaseHandle, dictLocationIDs, 'public', dictGeospatialConfig )

This is the traceback:

logging started
loading stoplist from C:\Users\**\AppData\Local\Programs\Python\Python37\lib\site-packages\geoparsepy\corpus-geo-stoplist-en.txt
loading whitelist from C:\Users\**\AppData\Local\Programs\Python\Python37\lib\site-packages\geoparsepy\corpus-geo-whitelist.txt
loading blacklist from C:\Users\**\AppData\Local\Programs\Python\Python37\lib\site-packages\geoparsepy\corpus-geo-blacklist.txt
loading building types from C:\Users\**\AppData\Local\Programs\Python\Python37\lib\site-packages\geoparsepy\corpus-buildingtype-en.txt
loading location type corpus C:\Users\**\AppData\Local\Programs\Python\Python37\lib\site-packages\geoparsepy\corpus-buildingtype-en.txt
- 3 unique titles
- 76 unique types
loading street types from C:\Users\**\AppData\Local\Programs\Python\Python37\lib\site-packages\geoparsepy\corpus-streettype-en.txt
loading location type corpus C:\Users\**\AppData\Local\Programs\Python\Python37\lib\site-packages\geoparsepy\corpus-streettype-en.txt
- 15 unique titles
- 32 unique types
loading admin types from C:\Users\**\AppData\Local\Programs\Python\Python37\lib\site-packages\geoparsepy\corpus-admintype-en.txt
loading location type corpus C:\Users\**\AppData\Local\Programs\Python\Python37\lib\site-packages\geoparsepy\corpus-admintype-en.txt
- 14 unique titles
- 0 unique types
loading gazeteer from C:\Users\**\AppData\Local\Programs\Python\Python37\lib\site-packages\geoparsepy\gazeteer-en.txt
caching locations : {'europe_places_admin': [-1, -1], 'europe_places_poly': [-1, -1], 'europe_places_line': [-1, -1], 'europe_places_point': [-1, -1], 'global_cities_admin': [-1, -1], 'global_cities_poly': [-1, -1], 'global_cities_line': [-1, -1], 'global_cities_point': [-1, -1]}
Traceback (most recent call last):
  File "C:\Users\**\20200819_geoparsepy_example.py", line 29, in <module>
    cached_locations = geoparsepy.geo_preprocess_lib.cache_preprocessed_locations( databaseHandle, dictLocationIDs, 'public', dictGeospatialConfig )
  File "C:\Users\**\AppData\Local\Programs\Python\Python37\lib\site-packages\geoparsepy\geo_preprocess_lib.py", line 1649, in cache_preprocessed_locations
    listRows = database_handle.execute_sql_query_batch( listSQL, timeout_statement, timeout_overall )
  File "C:\Users\**\AppData\Local\Programs\Python\Python37\lib\site-packages\soton_corenlppy\PostgresqlHandler.py", line 348, in execute_sql_query_batch
    raise Exception( 'SQL query failed (timeout retrying) : ' + strLastError + ' : ' + tupleStatement[0] )
Exception: SQL query failed (timeout retrying) : ['42P01'] UndefinedTable('relation "public.europe_places_admin" does not exist\nLINE 1: ...gions,ST_AsText(geom),hstore_to_matrix(tags) FROM public.eur...\n                                                             ^\n') : SELECT concat('europe_places_admin_',loc_id),name,osm_id_set,admin_regions,ST_AsText(geom),hstore_to_matrix(tags) FROM public.europe_places_admin
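
The UndefinedTable error above means the query referenced a table that was never created, which would fit the incomplete extraction of the archive. A minimal sketch, assuming you first fetch the table names present in the schema (for example with SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'), is to request only the focus-area tables that actually exist and to report the rest. The helper name build_location_ids is hypothetical; the four table suffixes and the [-1, -1] range value are taken from the example code elsewhere in these issues.

```python
# Table suffixes used per focus area in the example code on this page.
TABLE_SUFFIXES = ("_admin", "_poly", "_line", "_point")


def build_location_ids(focus_areas, existing_tables):
    """Build the location_ids dict only for tables that exist.

    focus_areas: iterable of focus area names, e.g. 'global_cities'.
    existing_tables: iterable of table names found in the database schema.
    Returns (location_ids, missing), where missing lists the expected
    tables that were not found.
    """
    existing = set(existing_tables)
    location_ids = {}
    missing = []
    for focus_area in focus_areas:
        for suffix in TABLE_SUFFIXES:
            table = focus_area + suffix
            if table in existing:
                # [-1, -1] is the range value used in the example code.
                location_ids[table] = [-1, -1]
            else:
                missing.append(table)
    return location_ids, missing
```

A non-empty missing list is a hint that the corresponding .sql file was not fully extracted or imported, so it is worth re-extracting the archive before retrying.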

Minor issues getting the .sql files to load correctly

There are a couple of issues with the BOM and the username in the dump files. I was using a Docker image to host PostgreSQL, and after running the following to pre-process the .sql files, they loaded correctly.

Replace postgres with your username of choice.

fixed_dir=./data/fixed
mkdir -p $fixed_dir
sed '1s/^\xEF\xBB\xBF//' < ./data/global_cities.sql > ${fixed_dir}/global_cities.sql
sed '1s/^\xEF\xBB\xBF//' < ./data/uk_places.sql > ${fixed_dir}/uk_places.sql
sed '1s/^\xEF\xBB\xBF//' < ./data/north_america_places.sql > ${fixed_dir}/north_america_places.sql
sed '1s/^\xEF\xBB\xBF//' < ./data/europe_places.sql > ${fixed_dir}/europe_places.sql

sed 's/TO sem/TO postgres/' -i ${fixed_dir}/global_cities.sql
sed 's/TO sem/TO postgres/' -i ${fixed_dir}/uk_places.sql
sed 's/TO sem/TO postgres/' -i ${fixed_dir}/north_america_places.sql
sed 's/TO sem/TO postgres/' -i ${fixed_dir}/europe_places.sql
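
Where sed is not available (for example on Windows), the same two steps can be sketched in Python. This is an illustrative equivalent of the commands above rather than a tested installation procedure; the function name fix_dump is hypothetical. Like the sed commands, it replaces the first "TO sem" on every line, so a data row that happens to contain that text would be changed too.

```python
import codecs


def fix_dump(source, target, old_owner="sem", new_owner="postgres"):
    """Copy source to target, dropping a leading UTF-8 BOM and
    rewriting 'TO <old_owner>' as 'TO <new_owner>' (first match per line)."""
    old = b"TO " + old_owner.encode("utf-8")
    new = b"TO " + new_owner.encode("utf-8")
    # Work on bytes and stream line by line so large dumps are not
    # loaded into memory and no re-encoding takes place.
    with open(source, "rb") as src, open(target, "wb") as dst:
        for index, line in enumerate(src):
            if index == 0 and line.startswith(codecs.BOM_UTF8):
                line = line[len(codecs.BOM_UTF8):]
            dst.write(line.replace(old, new, 1))
```

For example, fix_dump("./data/uk_places.sql", "./data/fixed/uk_places.sql") would correspond to the two sed commands for uk_places.sql, provided the target directory already exists.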

Reducing false positives

This is not really an issue, but I am writing this to start a conversation and ask a few questions. First of all thanks a lot for making this public, it is a very useful tool.

Sometimes geoparsepy returns many false positives, which makes it difficult to filter out the locations that we are interested in.

For example, take the sentence "They are a ga machine", a random sentence with no meaning. "ga" is picked up as a possible location and matched to a number of candidate OSM IDs.

How can we filter out such false positives? I noticed that very often this happens with short words, of 2-3 letters. A very rough approach would be to filter out all short words that returned a match, but I am sure something more nuanced is possible.

Thanks again
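
As a starting point, the rough approach described above could look like the sketch below. It assumes each match is a tuple of (start index, end index, set of OSM ID tuples, phrase token tuple), which is the shape printed in the geoparse output of another issue on this page. The function name filter_short_matches, the length threshold, and the allowlist are all hypothetical choices for illustration, not geoparsepy features.

```python
def filter_short_matches(matches, min_length=4, keep=frozenset({"uk", "usa"})):
    """Drop single-token matches shorter than min_length characters.

    Multi-token matches such as ('new', 'york') are always kept, and short
    tokens listed in keep (e.g. country abbreviations) survive the filter.
    """
    kept = []
    for match in matches:
        phrase_tokens = match[3]
        if len(phrase_tokens) == 1:
            token = phrase_tokens[0]
            if len(token) < min_length and token not in keep:
                continue
        kept.append(match)
    return kept
```

The allowlist matters because some short tokens are genuine locations ("uk" and "usa" both appear as correct matches in the example output), so a pure length cut-off would trade false positives for false negatives.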
