
Introduction

A modular, open-source search engine for our world.

Pelias is a geocoder powered completely by open data, available freely to everyone.

Local Installation · Cloud Webservice · Documentation · Community Chat

What is Pelias?
Pelias is a search engine for places worldwide, powered by open data. It turns addresses and place names into geographic coordinates, and turns geographic coordinates into places and addresses. With Pelias, you’re able to turn your users’ place searches into actionable geodata and transform your geodata into real places.

We think open data, open source, and open strategy win over proprietary solutions at any part of the stack and we want to ensure the services we offer are in line with that vision. We believe that an open geocoder improves over the long-term only if the community can incorporate truly representative local knowledge.

Pelias Parser

A natural language classification engine for geocoding.

This library contains primitive 'building blocks' which can be composed together to produce a powerful and flexible natural language parser.

The project was designed and built to work with the Pelias geocoder, so it comes bundled with a parser called AddressParser, which can be included in other npm projects independently of Pelias.

It is also possible to modify the configuration of AddressParser, the dictionaries or the semantics. You can also easily create a completely new parser to suit your own domain.

AddressParser Example

30 w 26 st nyc 10010

(0.95) ➜ [
  { housenumber: '30' },
  { street: 'w 26 st' },
  { locality: 'nyc' },
  { postcode: '10010' }
]

Application Interfaces

You can access the library via three different interfaces:

  • all parts of the codebase are available in JavaScript via npm
  • on the command line via the node bin/cli.js script
  • through a web service via the node server/http.js script

The web service provides an interactive demo at the URL /parser/parse.

Quick Start

A quick and easy way to get started with the library is to use the command-line interface:

node bin/cli.js West 26th Street, New York, NYC, 10010

[CLI screenshot]


Architecture Description

Please refer to the CLI screenshot above for a visual reference.

Tokenization

Tokenization is the process of splitting text into individual words.

The splitting process used by the engine maintains token positions, so it's able to 'remember' where each character was in the original input text.
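The position-preserving split can be sketched in a few lines of JavaScript. This is an illustration of the idea only, not the library's actual implementation:

```javascript
// Illustrative sketch: split text into tokens while recording each
// token's character offsets in the original input.
function tokenize (text) {
  const tokens = []
  const pattern = /\S+/g
  let match
  while ((match = pattern.exec(text)) !== null) {
    tokens.push({
      body: match[0],
      start: match.index,                 // position of the first character
      end: match.index + match[0].length  // position after the last character
    })
  }
  return tokens
}

tokenize('30 w 26 st')
// → [ { body: '30', start: 0, end: 2 }, { body: 'w', start: 3, end: 4 }, ... ]
```

Because every downstream element carries these offsets, the engine can always map a classification back to the exact characters it came from.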

Tokenization is coloured blue on the command-line.

Span

The most primitive element is called a span; it is essentially a single string of text with some metadata attached.

The terms word, phrase and section (explained below) are all just ways of using a span.

Section Boundaries

Some parsers like libpostal ignore characters such as comma, tab, newline and quote.

While it's unrealistic to expect commas to always be present, it's very useful to record their positions when they are.

These boundary positions help to avoid parsing errors for queries such as Main St, East Village being parsed as Main St East in Village.

Once sections are established there is no 'bleeding' of information between sections, avoiding the issue above.
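The boundary split can be sketched as follows (illustrative only; the character class and object shape are assumptions, not the library's code):

```javascript
// Sketch: split input on boundary characters (comma, tab, newline, quote)
// into sections, keeping each section's character offsets so later stages
// never combine words across a boundary.
function sections (text) {
  const out = []
  const pattern = /[^,\t\n"]+/g
  let match
  while ((match = pattern.exec(text)) !== null) {
    const body = match[0].trim()
    if (!body.length) continue
    const start = match.index + match[0].indexOf(body)
    out.push({ body, start, end: start + body.length })
  }
  return out
}

sections('Main St, East Village')
// → [ { body: 'Main St', start: 0, end: 7 },
//     { body: 'East Village', start: 9, end: 21 } ]
```

With the comma recorded as a hard boundary, no phrase can be generated that mixes 'Main St' with 'East Village'.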

Word Splitting

Each section is then split into individual words; by default this simply treats whitespace as a word boundary.

As per the section, the original token positions are maintained.

Phrase Generation

Many terms, such as 'New York City', span multiple words; these multi-word tokens are called phrases.

In order to be able to classify phrase terms, permutations of adjacent words are generated.

Phrase generation is performed per-section, so it will not generate a phrase which contains words from more than one section.

Phrase generation is controlled by a configuration which specifies things like the minimum & maximum amount of words allowed in a phrase.
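The permutation step can be sketched as a sliding window over the words of one section. This sketch assumes a maximum phrase length (the real configuration also covers a minimum, here fixed at one word):

```javascript
// Sketch: generate all contiguous word windows ("phrases") within a
// single section, bounded by a configurable maximum phrase length.
function phrases (words, maxLength = 3) {
  const out = []
  for (let i = 0; i < words.length; i++) {
    for (let len = Math.min(maxLength, words.length - i); len >= 1; len--) {
      out.push(words.slice(i, i + len).join(' '))
    }
  }
  return out
}

phrases(['new', 'york', 'city'])
// → [ 'new york city', 'new york', 'new', 'york city', 'york', 'city' ]
```

Because the function is fed one section at a time, phrases never straddle a section boundary.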

Token Graph

A graph is used to associate word, phrase and section elements to each other.

The graph is free-form, so it's easy to add a new relationship between terms in the future, as required.

Graph Example:

// find the next word in this section
word.findOne('next')

// find all words in this phrase
phrase.findAll('child')

Classification

Classification is the process of establishing that a word or phrase represents a 'concept' (such as a street name).

Classification can be based on:

  • Dictionary matching (usually with normalization applied)
  • Pattern matching (such as regular expressions)
  • Composite matching (such as relative positioning)
  • External API calls (such as calling other services)
  • Other semantic matching techniques

Classification is coloured green and red on the command-line.
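A dictionary-based classifier can be sketched like this (illustrative only; the dictionary contents and return shape are assumptions — see the /classifier directory for the real implementations):

```javascript
// Sketch of a dictionary-based word classifier: a token is classified
// when its normalized body appears in the dictionary.
const streetSuffixes = new Set(['st', 'street', 'rd', 'road', 'ave', 'avenue'])

function classifyStreetSuffix (token) {
  // normalization: lowercase and strip a trailing full stop
  const norm = token.toLowerCase().replace(/\.$/, '')
  return streetSuffixes.has(norm)
    ? { classification: 'street_suffix', confidence: 1.0 }
    : null
}

classifyStreetSuffix('St.') // → { classification: 'street_suffix', confidence: 1.0 }
classifyStreetSuffix('nyc') // → null
```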

Classifier Types

The library comes with three generic classifiers which can be extended in order to create a new classifier:

  • WordClassifier
  • PhraseClassifier
  • SectionClassifier

Classifiers

The library comes bundled with a range of classifiers out of the box.

You can find them in the /classifier directory; dictionary-based classifiers usually store their data in the /resources directory.

Examples of some of the included classifiers:

// word classifiers
HouseNumberClassifier
PostcodeClassifier
StreetPrefixClassifier
StreetSuffixClassifier
CompoundStreetClassifier
DirectionalClassifier
OrdinalClassifier
StopWordClassifier

// phrase classifiers
IntersectionClassifier
PersonClassifier
GivenNameClassifier
SurnameClassifier
PersonalSuffixClassifier
PersonalTitleClassifier
ChainClassifier
PlaceClassifier
WhosOnFirstClassifier

Solvers

Solving is the final process, where solutions are generated based on all the classifications that have been made.

Each parse can contain multiple solutions; each solution is assigned a confidence score, and solutions are displayed sorted from highest score to lowest.

The core of this process is the ExclusiveCartesianSolver module.

This solver generates all the possible permutations of the different classifications while taking care to:

  • ensure the same span position is not used more than once
  • ensure that the same classification is not used more than once.

After the ExclusiveCartesianSolver has run there are additional solvers which can:

  • filter the solutions to remove inconsistencies
  • add new solutions to provide additional functionality (such as intersections)
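The exclusive-cartesian idea can be sketched as follows (an illustration of the two constraints, not the actual solver, which lives in solver/ExclusiveCartesianSolver.js):

```javascript
// Sketch: build candidate solutions by combining classifications so that
// no two members of a solution overlap in the input text, and no
// classification label is used twice within one solution.
function solve (classifications) {
  let solutions = [[]]
  for (const c of classifications) {
    const next = []
    for (const s of solutions) {
      next.push(s) // keep the solution without this classification
      const clashes = s.some(m =>
        m.label === c.label ||               // same classification used twice
        (c.start < m.end && m.start < c.end) // span positions overlap
      )
      if (!clashes) next.push(s.concat(c))
    }
    solutions = next
  }
  return solutions.filter(s => s.length > 0)
}

solve([
  { label: 'housenumber', body: '30', start: 0, end: 2 },
  { label: 'street', body: 'w 26 st', start: 3, end: 10 },
  { label: 'street', body: '26 st', start: 5, end: 10 }
])
// every returned solution uses each span, and each label, at most once
```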

Solution Masks

It is possible to produce a simple mask for any generated solution; this is useful for comparing the solution to the original text:

VVV VVVV NN SSSSSSS AAAAAA PPPPP
Foo Cafe 10 Main St London 10010 Earth      
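Painting such a mask is straightforward given the span offsets of a solution. A minimal sketch of the idea (not the library's actual code; the single-letter codes stand for classifications such as N = housenumber, S = street):

```javascript
// Sketch: paint a per-character mask over the input using the span
// positions of a solution; unclassified characters are left blank.
function mask (input, solution) {
  const chars = new Array(input.length).fill(' ')
  for (const { code, start, end } of solution) {
    for (let i = start; i < end; i++) {
      if (input[i] !== ' ') chars[i] = code
    }
  }
  return chars.join('')
}

mask('10 Main St', [
  { code: 'N', start: 0, end: 2 },
  { code: 'S', start: 3, end: 10 }
])
// → 'NN SSSS SS'
```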

Contributing

Please fork and pull request against upstream master on a feature branch. Pretty please; provide unit tests.

Unit tests

You can run the unit test suite using the command:

$ npm test

Continuous Integration

CI tests every release against all supported Node.js versions.

Versioning

We rely on semantic-release and Greenkeeper to maintain our module and dependency versions.


Contributors

blackmad · bradvogel · emacgillavry · janf01 · joxit · mansoor-sajjad · missinglink · orangejulius · pushkar-geospoc · taygun


Issues

Street number prefixes

I was doing some mucking around with street number prefixes.

For example:
Unit 12/345 Main St
Apt 12/345 Main St
Lot 12/345 Main St
U12/345 Main St

In most cases the prefix is simply ignored and classified as alpha, start_token.
This doesn't work for "lot", which gets detected as a place, or for "U12", which breaks parsing entirely (alphanumeric, start_token for the entire string 'U12/345'), losing both the unit number and the street number at the phrases step.

Is it worthwhile adding a classifier for these unit number prefixes so they can be detected explicitly?
"Unit" and "Lot" are very common in my datasets, but there are a few other alternatives which pop up from time to time.

Missing address results in autocomplete api

Describe the bug

The autocomplete api is not returning the expected address result

Steps to Reproduce

Autocomplete request:

https://pelias.github.io/compare/#/v1/autocomplete?layers=address%2Cstreet&focus.point.lat=41.201522&focus.point.lon=-8.6124324&text=rua+godinho+de+faria+1200

Returns
0) Rua Godinho de Faria, São Mamede de Infesta, PO, Portugal
1) Rua Godinho de Faria 255, São Mamede de Infesta, PO, Portugal
2) Rua Godinho de Faria (Antiga EN 14) 451, São Mamede de Infesta, PO, Portugal

But search returns the correct result (https://pelias.github.io/compare/#/v1/search?layers=address%2Cstreet&focus.point.lat=41.201522&focus.point.lon=-8.6124324&text=rua+godinho+de+faria+1200)

0) Rua Godinho de Faria 1200, São Mamede de Infesta, PO, Portugal

Expected behavior

Using same street with different house number works as expected
https://pelias.github.io/compare/#/v1/autocomplete?layers=address%2Cstreet&focus.point.lat=41.201522&focus.point.lon=-8.6124324&text=rua+godinho+de+faria+255

Result:
0) Rua Godinho de Faria 255, São Mamede de Infesta, PO, Portugal

Additional information

I noticed this behaviour with my own addresses from CSV imports as well: some addresses work very well and others do not.

Place classification should be paired with another token?

For the input mt victoria rd, wellington (street: Mount Victoria Rd) the 'Mt' prefix is being classified as a 'place'.
The second solution is actually correct here.

It probably doesn't make sense to have a 'place' classification that isn't paired with something else.
'Cafe' on its own doesn't really make sense, but 'Foo Cafe' or 'Cafe Foo' does.

node bin/cli.js 'mt victoria rd, wellington'

================================================================
TOKENIZATION (1ms)
----------------------------------------------------------------
INPUT                           ➜  mt victoria rd, wellington
SECTIONS                        ➜   mt victoria rd  0:14    wellington  15:26
S0 TOKENS                       ➜   mt  0:2   victoria  3:11   rd  12:14
S1 TOKENS                       ➜   wellington  16:26
S0 PHRASES                      ➜   mt victoria rd  0:14   mt victoria  0:11   mt  0:2   victoria rd  3:14   victoria  3:11   rd  12:14
S1 PHRASES                      ➜   wellington  16:26

================================================================
CLASSIFICATIONS (4ms)
----------------------------------------------------------------
WORDS
----------------------------------------------------------------
mt                              ➜   alpha  1.00   start_token  1.00   toponym  1.00   place  1.00
victoria                        ➜   alpha  1.00   toponym  1.00
rd                              ➜   alpha  1.00   street_suffix  1.00   road_type  1.00
wellington                      ➜   alpha  1.00   end_token  1.00

----------------------------------------------------------------
PHRASES
----------------------------------------------------------------
mt                              ➜   area  1.00   region  1.00   country  0.90
victoria                        ➜   given_name  1.00   surname  1.00
mt victoria                     ➜   place  0.70
victoria rd                     ➜   person  0.10   street  0.82
mt victoria rd                  ➜   street  0.84
wellington                      ➜   given_name  1.00   surname  1.00   area  1.00   locality  1.00

================================================================
SOLUTIONS (4ms)
----------------------------------------------------------------
(0.92) ➜ [
  { place: 'mt' },
  { street: 'victoria rd' },
  { locality: 'wellington' }
]

(0.91) ➜ [ { street: 'mt victoria rd' }, { locality: 'wellington' } ]

(0.77) ➜ [ { place: 'mt victoria' }, { locality: 'wellington' } ]

(0.09) ➜ [ { region: 'mt' } ]

german addresses parsed as street

There is a minor regression where complete German addresses can be parsed as a street, such as the example foostraße 10 below:

[screenshot: foostraße 10 parsed as a street]

It's not detected in the tests because it's not the top result, but we should still remove these matches because they don't make sense logically.

Parsing Czech Republic addresses

Hi team,
I have successfully installed pelias, but I have a problem with the autocomplete.
The query [street,city] /autocomplete/?text=Nerudova 20,Praha returns the correct result.

But the rotated query [city,street] /autocomplete/?text=Praha,Nerudova 20 does not return any result, and the pelias parser creates a bad query decomposition.

Is it possible to modify the configuration and get the same result as in the first case?

With search (/search?text=Praha,Nerudova 20, without autocomplete) the result is correct in both cases, but the parser used there is libpostal, not pelias/parser.

Thank you for the advice

Improve ExclusiveCartesianSolver

The ExclusiveCartesianSolver is the basis for all other solvers; however, it's not producing a true cartesian product.

Originally this module used a cartesian algorithm to generate all the combinations, but the function with the comment do not add a pair where the span intersects an existing pair was later added to enforce the span consistency.

The function actually does two things:

  • Generate an exclusive cartesian product. ie. all combinations of all classifications without using the same classification twice.
  • Enforce that no solutions contain spans which overlap. ie. the same characters are not used twice.

I had some trouble coming up with an algorithm to do both of these functions effectively.
I originally thought it would have to be a recursive algorithm, but managed to simplify it to a non-recursive algo in the current form.

I believe that the current implementation does not produce a pure cartesian product because it favours the first span produced and prevents new combinations from being generated using the span which conflicted.

https://github.com/pelias/parser/blob/master/solver/ExclusiveCartesianSolver.js

Better documentation of what parser is?

Hey team,

I'm reading through as much of the pelias docs as I can find. I followed a link to pelias/parser and after reading through it (as well as https://geocode.earth/blog/2019/improved-autocomplete-parsing-is-here ) I think there could be some changes to the README that would help make it more understandable. Questions I still have

  • what is the relationship to libpostal?
  • what parts of pelias use parser vs libpostal? why?
  • what is the precision/recall or similar performance of parser vs libpostal?
  • if parser is meant to be better for autocomplete, why doesn't "111 8th a" guess it's a street? is that on the roadmap?

I would try to update the docs myself but I'm unclear as to the answers.

Best,
David

Recognise UK postcodes with space in between

The regex works properly. However, further upstream the space in between breaks the postcode into two sections. The same happens for NL postcodes. Examples:

  • SW11 6NU
  • BA14 7LY
  • EC1N 2NS

Future management for WhosOnFirst dictionary

For now, only eng is supported. If we need to add support for other languages, I suggest using a single file per type containing a sorted list of elements (one for regions, one for localities, ...).

Pros

  • If we remove duplicates, this will save space (3MB for eng localities; this should be similar for other languages)
  • A lot of localities have the same name in many languages (e.g. Paris has the same naming in eng, fra, deu and dozens of other languages) -> saves space & loading time
  • If we sort elements now in eng, then when we later add a new language the git diff will be smaller and more controlled -> saves repository space when cloning

Cons

  • We can't remove or use one specific language via configuration (for example, use the parser in a fra-only or eng-only context)
  • Maybe we will exceed the GitHub quota -> this could be solved with one file per Latin letter, or per letter range (e.g. a-d.txt, e-h.txt, ...)

What do you think?

Austria testcase

Am Wassen 11, Zell an der Pram

(0.13) ➜ [ { locality: 'Wassen' } ]

expected:

"street": "am wassen",
"housenumber": "11",
"city": "zell an der pram"

handling of quoted text

Currently, quotes are considered boundary characters with the same semantic meaning as a comma or tab.

I think it would be better to consider quoted sections as 'literal', so that no permutations are generated for these sections of the text.

eg. something like 'A B C "D E F" G H' would produce permutations of:
[A, B, C], [A, B], [A], [B, C], [B], [C], [D, E, F], [G, H], [G], [H] (where the inner group produced no permutations)

This can probably be achieved by recording the leading and trailing boundary characters used to delimit each section; we can then check whether BOTH the leading and trailing character are from the 'quote' class, and if so, disable permutations for that group.
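The proposed check could be sketched as below. The 'leading'/'trailing' fields are hypothetical names for the recorded delimiters; the library's spans don't necessarily carry them this way:

```javascript
// Sketch of the proposal: a section is 'literal' (no permutations
// generated) only when BOTH of its delimiters are quote characters.
const QUOTES = new Set(['"', "'", '\u201C', '\u201D'])

function isLiteralSection (section) {
  return QUOTES.has(section.leading) && QUOTES.has(section.trailing)
}

isLiteralSection({ body: 'D E F', leading: '"', trailing: '"' }) // → true
isLiteralSection({ body: 'A B C', leading: '', trailing: ',' })  // → false
```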

UK house names

In the UK countryside, people name their houses!


These houses don't have house numbers, they use a name instead.

It seems like this is a very difficult problem to solve, but we could simply attempt to parse addresses like "Hollies Croft, Chester Rd, Kelsall, CV60RJ" by treating tokens from sections before the street (so commas are required) as a 'venue name' or 'place'.

Barcelona address

Comte Borrell 64/66 5º 1ª

This one building is labeled 64/66 (even numbers are adjacent on the same side of the street)

The final part represents floor 5 and unit 1

Very high parser.solve response time

Hey team!

I was using your awesome geocoding engine when I noticed something interesting.
Let me tell you more about it.


Here's what I did 😇

I was using pelias/api#1287 and one of my clients sent me a query with a very long text (over 2000 characters). The query blocked the API and I wanted to know why: it was the parser, which took 15 minutes to solve the query.
It was a bad client-side integration, but it almost killed our services, so we need to go further and fix this.

⚠️ This issue affects node v10 and below. With node v12+ the result takes 6-10 seconds ⚠️

Create a file named parser_killer and copy-paste this:

hp({"activites_fr":["Week-end Gourman",["Séjour à 2 à l\'Hôtel L\'Escargot - L\'Escale Gourmande à Ruffec (16)","Séjour bien-être à 2 à Gourmandine à Saint-Andiol (13)","Séjour bien-être à 2 au Logis Hôtel Au Canard Gourmand à Samatan (32)","Séjour bien-être à La Bastide Gourmande à La Colle-sur-Loup (06)","Séjour bien-être en duo à Gourmandine à Saint-Andiol (13)","Séjour bien-être en duo au Logis Hôtel Au Canard Gourmand à Samatan (32)","Séjour bien-être et gourmand au Viest Hotel à Vicenza (Italie)","Séjour bien-être et gourmand pour 2 à l\'Amrâth Hotel & Thermen Born-Sittard**** à Born (Pays-Bas)","Séjour bien-être et gourmand pour 2 à l\'Auberge Le Relais*** à Corbion-sur-Semois (Luxembourg)","Séjour bien-être et gourmand pour 2 à l\'Hotel de la Vallée à Petit-Fays (Namur)"]],"categories_fr":["Week-end Gourman",["Séjour Gourmand"]],"coffrets_fr":["Week-end Gourman",["Week-end insolite et gourmand"]]});

Now run the CLI with the parser_killer file passed twice:

node bin/cli.js $(cat parser_killer parser_killer)

Wait 15 minutes

The code hangs here:

parser/parser/Parser.js

Lines 41 to 42 in cd87ea4

// sort results by score desc
tokenizer.solution.sort(this.comparitor)

It happens just after the ExclusiveCartesianSolver with over 1.8M solutions....

The issue only affects node v10 and below, thanks to the stable sort in newer versions, but an address parse taking 6 seconds is still too slow.


Here's what I think could be improved 🏆

The input has more than 200 words

  • we could add a limit on the length of text to be parsed / parse only the first X words?
  • improve the ExclusiveCartesianSolver to produce at most 1-10k solutions?

Use of Set for dictionaries

Initially, I used a js object and hasOwnProperty to do the hashmap lookups and then later used Set and has().

It would be nice to standardize this. I'm just not familiar with the performance of Set vs. Object; if Set is faster or the same, then we should use it.

I think one benefit of Set is that Object can possibly have issues with numeric keys?
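A quick illustration of that numeric-key difference: Object property keys are coerced to strings, while Set membership preserves the value's type.

```javascript
// Object keys are always strings (or symbols)...
const obj = {}
obj[10010] = true
Object.keys(obj) // → [ '10010' ]  (the number became a string key)

// ...while Set compares members without string coercion.
const set = new Set([10010])
set.has(10010)   // → true
set.has('10010') // → false
```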

cc/ @Joxit thoughts?

support for comma delimited housenumber + street

I've seen a few cases internationally where users insert a comma between every component of the address, I'm not sure if this is done manually or when joining cells in a spreadsheet.

This is actually great for most tokens because it helps us to avoid parsing ambiguities.
The issue is when a comma is used between the housenumber and the street.

So the parser will fail for an address such as:

1, Foo St, Foo, Bar, 411027

but pass for one where the first comma is not present:

1 Foo St, Foo, Bar, 411027

The code responsible for this is the TokenDistanceFilter, which should be modified to ignore section boundaries when considering adjacency.

position penalties

We should consider applying position penalties, such as when the postcode comes directly before the street (which is very uncommon).

eg:

22024 main st, ca

(1.0) ➜ [ { postcode: '22024' },
  { street: 'main st' },
  { region: 'ca' } ]

(0.7) ➜ [ { housenumber: '22024' },
  { street: 'main st' },
  { region: 'ca' } ]

Write documentation

Write up better docs on how the different components work:

  • spans
  • graph
  • sections / phrases / words
  • writing a classifier
  • writing a solver
  • debugging
  • writing tests
  • solutions / masks

'Germany' detected as locality name

node bin/cli.js 'Genter Straße 16a, Munich, Germany'

(0.90) ➜ [
  { street: 'Genter Straße' },
  { housenumber: '16a' },
  { locality: 'Munich' },
  { country: 'Germany' }
]

(0.71) ➜ [
  { street: 'Genter Straße' },
  { housenumber: '16a' },
  { locality: 'Germany' }
]

Recognise 6-position postcodes in addresses in the Netherlands (with spaces)

Typically, postcodes in the Netherlands have 4 digits, followed by 0 or 1 space, followed by 2 letters, e.g. "7512EC" or "7512 EC". The letter pair should not be "SA", "SD" or "SS":

/^[1-9][0-9]{3} ?(?!sa|sd|ss)[a-z]{2}$/i;

This bug is to track the latter case, i.e. "7512 EC".
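The regex above can be exercised directly to confirm it accepts the spaced form, which suggests the bug is in the upstream tokenization rather than the pattern itself:

```javascript
// The pattern from the report, tested against spaced and unspaced forms.
const NL_POSTCODE = /^[1-9][0-9]{3} ?(?!sa|sd|ss)[a-z]{2}$/i

NL_POSTCODE.test('7512EC')  // → true
NL_POSTCODE.test('7512 EC') // → true  (space accepted by the regex)
NL_POSTCODE.test('7512 SS') // → false (SA/SD/SS are excluded)
```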

venue name parsing

related to pelias/api#1380
we're currently not doing a great job of parsing venue names where the place classification comes at the start of the input:

node bin/cli.js Café Pelias
(0.40) ➜ [ { place: 'Café' } ]

node bin/cli.js Café Pelias Geocoder
(0.22) ➜ [ { place: 'Café' } ]

node bin/cli.js Pelias Café
(0.90) ➜ [ { place: 'Pelias Café' } ]

node bin/cli.js Pelias Geocoder Café
(0.80) ➜ [ { place: 'Pelias Geocoder Café' } ]

examples: https://brandongaille.com/list-45-catchy-coffee-shop-house-names/

investigate ambiguous parsing of the -burg suffix in NL/DE

Today we are merging pelias/api#1565 which brings a bunch of pelias/parser changes into pelias/api.

As part of this process we did some wider acceptance test checks and diff'd them against the current baseline.

One change which was identified was this query (at partial completion "grolmanstrasse 51, charlottenburg") which identifies the Berlin borough charlottenburg as a street.

 grolmanstrasse 51, charlottenburg, berlin
-FFFFFFFFFFFFFFFF0000000000000000000000000
+FFFFFFFFFFFFFFFF0000000000000000FFFF0FFF0

This was likely introduced in the recent NL work #126.

I would like to see if we can find a better way of handling the ambiguities between German and Dutch for the -burg suffix.

Note: the correct solution is also being generated, but both score the same. This scoring is based on matched token length, so a robust fix would need to work equally well whether len(street) < len(borough), len(street) > len(borough), or len(street) == len(borough).

================================================================
SOLUTIONS (2ms)
----------------------------------------------------------------
(0.53) ➜ [ { housenumber: '51' }, { street: 'Charlottenburg' } ]

(0.53) ➜ [ { street: 'Grolmanstrasse' }, { housenumber: '51' } ]

Unit parsing

For the following addresses with unit numbers, the parser is not able to accurately detect the 'B' as a unit.

221 B Baker St
(0.99) ➜ [ { housenumber: '221' }, { street: 'B Baker St' } ]
221/B Baker St
(0.48) ➜ [ { street: 'Baker St' } ]

NL: recognise 'sngl' as abbr for 'singel' without full stop

Abbreviated street names referring to '-singel', e.g. Blekerssngl, Gouda or Herensngl, Haarlem are only recognised as street names when there is a full stop at the end: Blekerssngl., Gouda or Herensngl., Haarlem.

Suggested change: add singel|sngl to the concatenated_suffixes_separable.txt file, as 'singel' also works as a street name by itself, e.g. Singel, Amsterdam.

chore: update WOF dictionaries

at some point we may wish to update the WOF resources since the previous data was probably pretty old

node generate.js /data/wof/whosonfirst-data-admin-latest.db

performance testcases

This issue is to track some queries which cause the library to take a long time (>50ms)

  • 310 4848 Virginia Beach blvd Virginia Beach, va 23562, Virginia Beach, VA, USA
  • airport road n and golden gate pkwy off ramp e, naples, fl

can we avoid numeric street suffixes or score them lower?

The following example is a partially complete 'autocomplete-style' query where the street and housenumber have been entered but the postcode is only partially complete.

In this case, the correct result would be street: Eberswalder Straße, housenumber: 100, but for some reason the street name is being detected as 'Eberswalder Straße 100':

node bin/cli.js 'Eberswalder Straße 100 104'

================================================================
TOKENIZATION (1ms)
----------------------------------------------------------------
INPUT                           ➜  Eberswalder Straße 100 104
SECTIONS                        ➜   Eberswalder Straße 100 104  0:26
S0 TOKENS                       ➜   Eberswalder  0:11   Straße  12:18   100  19:22   104  23:26
S0 PHRASES                      ➜   Eberswalder Straße 100 104  0:26   Eberswalder Straße 100  0:22   Eberswalder Straße  0:18   Eberswalder  0:11   Straße 100 104  12:26   Straße 100  12:22   Straße  12:18   100 104  19:26   100  19:22   104  23:26

================================================================
CLASSIFICATIONS (2ms)
----------------------------------------------------------------
WORDS
----------------------------------------------------------------
Eberswalder                     ➜   alpha  1.00   start_token  1.00
Straße                          ➜   alpha  1.00   street_suffix  1.00
100                             ➜   numeric  1.00   housenumber  1.00
104                             ➜   numeric  1.00   end_token  1.00   housenumber  1.00

----------------------------------------------------------------
PHRASES
----------------------------------------------------------------
Eberswalder Straße              ➜   street  0.82
Eberswalder Straße 100          ➜   street  0.84

================================================================
SOLUTIONS (1ms)
----------------------------------------------------------------
(0.86) ➜ [ { street: 'Eberswalder Straße 100' }, { housenumber: '104' } ]

(0.74) ➜ [ { street: 'Eberswalder Straße' }, { housenumber: '100' } ]

Long street names with street prefixes don't seem to parse

Hey team,

I've been playing with some Brazilian queries and have noticed the parser seems to have issues with parsing long street queries with a street prefix.

It seems like it might be something about more than two tokens after the prefix word, but I can't figure out where in the source code that logic might live.

Rua Raul Leite Magalhães, 65, Tapiraí - SP, 18180-000, Brazil

  • no solution

Rua Raul Leite Magalhães

  • only parses "Rua Raul Leite" as the street

Rua Raul Leite, 65, Tapiraí - SP, 18180-000, Brazil

  • deleted one word from street, works perfectly

Raul Leite Magalhães st, 65, Tapiraí - SP, 18180-000, Brazil

  • added street suffix (I know, nonsense query) and it works perfectly

I noticed that in the source code, it says "Boulevard Charles de Gaulle" works, which it does, but I think it's getting lucky with "de" being a likely intersection connector word. "Boulevard Charles foobar Gaulle" fails in the same way, where only "Boulevard Charles foobar" is parsed as a street.

Intersection parsing too aggressive

The intersection parser seems to generate a few false positives.

Eg: Washington University in st louis.

We should reduce the impact of the false positive intersection parsing while still trying to retain the core functionality it provides.

Berlin testcase

A nice complex testcase from Berlin:

Onion Space, ExRotaprint, Gottschedstraße 4, Aufgang 4, 1. OG rechts, 13357 Berlin

Special handling of streets with no suffix

Some street names consist of a single word without a street suffix.
A well known example of this is Broadway_(Manhattan).

The parser currently doesn't parse addresses on these streets very well:

node bin/cli.js 24 broadway
...
(0.86) ➜ [ { street: '24 broadway' } ]

We are interpreting the input as a numeric street name (minus the ordinal suffix); the following, by contrast, is a correct parse:

node bin/cli.js 24 street
...
(0.86) ➜ [ { street: '24 street' } ]

Interestingly, we have Broadway listed as a street suffix, although I'm not familiar with anywhere in the world where this is common; the USPS doesn't list it as a common street suffix in the USA.

So removing that suffix may help resolve this issue to some degree.

Another similar street name I can think of is "Esplanade", which we may be able to handle similarly.

I think in the absence of a suffix it might still be difficult to classify these strings as streets, since they're just proper names with no surrounding context. If that's the case, we may need to keep a list of proper names which are common street names in their own right, such as "Broadway".

Some other similar cases to consider when testing this work:

see: https://onmilwaukee.com/articles/broadway

Compound streets + street suffix

It should not be possible to combine a compound street name (one already containing a suffix) with another street suffix.

Eg:

Foostraße rd

venue_classification: investigate change

Today we are merging pelias/api#1565 which brings a bunch of pelias/parser changes into pelias/api.

As part of this process we did some wider acceptance test checks and diff'd them against the current baseline.

One change which was identified was this query (at partial completion "San Simeon Drive Desert Hot Spr") which identifies the incomplete spr token as a street.

San Simeon Drive Desert Hot Springs CA 92240 {"focus.point.lat":33.96112,"focus.point.lon":-116.50168}
-FFFFF0000000000000000000000000000000000FFF00
+FFFFF0000000000000000000000000F00000000FFF00

Running a git bisect shows that this change was introduced in a65218d

A simple change to the en/street_types.txt file seems to resolve the issue, but it's unclear why this issue didn't exist previously.

diff --git a/resources/pelias/dictionaries/libpostal/en/street_types.txt b/resources/pelias/dictionaries/libpostal/en/street_types.txt
index 30ecf9d..9fbdcbe 100644
--- a/resources/pelias/dictionaries/libpostal/en/street_types.txt
+++ b/resources/pelias/dictionaries/libpostal/en/street_types.txt
@@ -14,3 +14,5 @@ beltway
 !broadway|bdwy|bway|bwy|brdway
 !esplanade|esp|espl
 market
+
+!spr

ZAC place type

I found a wrong solution for a Zone d'Aménagement Concerté; the street should be ZAC de la Tuilerie:

node bin/cli.js "ZAC de la Tuilerie, Villars-les-Dombes, France"

(0.76) ➜ [ { place: 'ZAC' },
  { street: 'de la' },
  { locality: 'Villars-les-Dombes' },
  { country: 'France' } ]

(0.75) ➜ [ { street: 'ZAC de la' },
  { locality: 'Villars-les-Dombes' },
  { country: 'France' } ]

(0.22) ➜ [ { region: 'ZAC' }, { country: 'France' } ]

remove concept of 'word'

This library has the concepts of word, phrase and section.

I'm not sure the word concept is required, as it can be represented as a single-token phrase. In fact, I think there is duplication between words and single-token phrases right now.

If possible, it would be nice to remove the concept of a word, which should help clean up the code.
