gopostal's Introduction

openvenues

Open information extraction project for indexing and normalizing real-world venue/POI information from across the Web. Can be used standalone to extract venues from individual websites, or on a full-fledged copy of the entire Internet using the Common Crawl.

Project layout

  • extract: the "easy way", extract structured (or at least semi-structured) address and geo data from HTML markup. Supports schema.org microdata, RDFa Lite, hCard, geotags, HTML5 <address> elements, OpenGraph, and extracting URL parameters from Google map embeds.
  • jobs: Amazon Elastic MapReduce jobs for extracting places from the Common Crawl (224TB or 3.6+ billion URLs available on S3 as of August 2014, new crawls published periodically).
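
As a rough illustration of the last item in the extract bullet (pulling URL parameters out of a map embed), here is a minimal sketch in Go, the language used by the code further down this page. It is not the project's implementation. The function name `latLonFromMapURL` is hypothetical, and the assumption that coordinates arrive as a `ll=lat,lon` query parameter is made for this example only.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"strings"
)

// latLonFromMapURL is a hypothetical helper. It reads a "lat,lon" pair from
// the "ll" query parameter of a map embed URL. Treating "ll" as the carrier
// of the coordinates is an assumption of this sketch.
func latLonFromMapURL(raw string) (float64, float64, bool) {
	u, err := url.Parse(raw)
	if err != nil {
		return 0, 0, false
	}
	parts := strings.Split(u.Query().Get("ll"), ",")
	if len(parts) != 2 {
		return 0, 0, false // parameter missing or not a pair
	}
	lat, errLat := strconv.ParseFloat(strings.TrimSpace(parts[0]), 64)
	lon, errLon := strconv.ParseFloat(strings.TrimSpace(parts[1]), 64)
	if errLat != nil || errLon != nil {
		return 0, 0, false
	}
	return lat, lon, true
}

func main() {
	// Example URL made up for illustration.
	lat, lon, ok := latLonFromMapURL("https://maps.google.com/maps?q=cafe&ll=40.5,-74.25&output=embed")
	fmt.Println(lat, lon, ok)
}
```

The boolean return lets a caller skip embeds that carry no usable coordinates instead of treating them as errors.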

Notes

BeautifulSoup vs. lxml

The first version of the Common Crawl extraction job used lxml, a fast C library based on libxml2, for parsing. However, running that parser over billions of badly encoded web pages revealed bugs in lxml/libxml2 related to reading from uninitialized memory at the C level (see https://bugs.launchpad.net/lxml/+bug/1240696), which eat up all of the system's memory and crash the box. The bug occurs non-deterministically, so it is hard to track down, but it will occur, on different documents, if the job runs for long enough. Until there is a fix, lxml is not usable for this project.

BeautifulSoup is a forgiving pure-Python, regex-based "parser" designed for working with "tag soup". It is up to 100x slower than lxml, so we currently use a high-recall (not necessarily high-precision) regex to filter out documents that definitely don't contain the keywords we're looking for before committing to a full parse. With this filter, the job still completes in a reasonable amount of time using 100 8-core machines.
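
The pre-filter idea described above can be sketched as follows. This is an illustration in Go rather than the job's actual code, and the keyword list in `candidateRe` is an assumption chosen to match the markup formats listed under "Project layout"; the real filter's keywords are not given here.

```go
package main

import (
	"fmt"
	"regexp"
)

// candidateRe is a hypothetical high-recall pattern: it matches any document
// that mentions one of the markup keywords, even if the match later turns out
// to be a false positive. The keyword list is an assumption of this sketch.
var candidateRe = regexp.MustCompile(
	`(?i)schema\.org|itemprop|vcard|<address|geo\.position|og:latitude|maps\.google`)

// mightContainVenue reports whether a document is worth a full (slow) parse.
func mightContainVenue(html []byte) bool {
	return candidateRe.Match(html)
}

func main() {
	docs := []string{
		`<div itemscope itemtype="http://schema.org/Place">Cafe</div>`,
		`<p>Nothing of interest here.</p>`,
	}
	for _, d := range docs {
		if !mightContainVenue([]byte(d)) {
			continue // cheap rejection, the expensive parser never runs
		}
		fmt.Println("would parse:", d)
	}
}
```

The filter only has to avoid false negatives. False positives cost one unnecessary full parse each, which is acceptable as long as most documents are rejected cheaply.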

Coming up next:

  • Address extraction (find postal addresses in text)
  • Deduping and normalization of venue names, addresses and locations

gopostal's Issues

[parser] ParsedComponent vs map[string]string

Hi there, I was curious why in parser this was done:

type ParsedComponent struct {
    Label string `json:"label"`
    Value string `json:"value"`
}

// ...
for i = 0; i < numComponents; i++ {
    parsedComponents[i] = ParsedComponent{
        Label: C.GoString(cLabelsPtr[i]),
        Value: C.GoString(cComponentsPtr[i]),
    }
}

This is pretty hard to use in a practical setting. I found that the only way I could really use this was to convert it to a map[string]string and access it that way.

Wouldn't it be more reasonable to do this?

type ParsedComponents map[string]string

// ...
for i = 0; i < numComponents; i++ {
    parsedComponents[C.GoString(cLabelsPtr[i])] = C.GoString(cComponentsPtr[i])
}
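
Until something like that exists in the binding, the conversion can be done on the caller's side. The following is a minimal sketch, not part of gopostal: `ParsedComponent` is redeclared locally so the example compiles without cgo, and `componentsToMap` is a hypothetical helper name. It assumes that when a label repeats, keeping the last value is acceptable.

```go
package main

import "fmt"

// ParsedComponent mirrors the struct quoted in the issue above. It is
// redeclared here only so the sketch is self-contained.
type ParsedComponent struct {
	Label string `json:"label"`
	Value string `json:"value"`
}

// componentsToMap is a hypothetical helper that collapses the slice into a
// map keyed by label. If a label appears more than once, the last value wins.
func componentsToMap(components []ParsedComponent) map[string]string {
	m := make(map[string]string, len(components))
	for _, c := range components {
		m[c.Label] = c.Value
	}
	return m
}

func main() {
	parsed := []ParsedComponent{
		{Label: "house_number", Value: "1234"},
		{Label: "road", Value: "sesame st"},
	}
	m := componentsToMap(parsed)
	fmt.Println(m["road"])
}
```

Note that the map form loses the original component order and silently drops repeated labels, which may be one reason to keep the slice as the primary return type.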

`%d string=` in string returned by expand.ExpandAddress

If I try to expand an address:

terms := expand.ExpandAddress("1234 Sesame St. New York, NY 12345")

I get the following strings returned:

[
  "%d string=12345 sesame saint new york ny 12345 to id 1234",
  "%d string=12345 sesame saint new york new york 12345 to id 1234",
  "%d string=12345 sesame street new york ny 12345 to id 1234",
  "%d string=12345 sesame street new york new york 12345 to id 1234",
  "%drive string=12345 sesame saint new york ny 12345 to id 1234",
  "%drive string=12345 sesame saint new york new york 12345 to id 1234",
  "%drive string=12345 sesame street new york ny 12345 to id 1234",
  "%drive string=12345 sesame street new york new york 12345 to id 1234"
]

I would expect it to return just the first four strings, without the %d string= prefix. What is the purpose of that prefix?

I'm using go18.1 and version v0.0.0-20171226154602-e0184512a45d of gopostal

Can not make it work.

I have installed the library on Windows and wanted to use this binding, but without any success.

I created a fresh Go module using Go 1.20 and copy-pasted the README example.
The IDE (VS Code with the Go extension) says:
could not import github.com/openvenues/gopostal/expand (no required module provides package "github.com/openvenues/gopostal/expand")

And while trying to run:
github.com/openvenues/gopostal/expand: build constraints exclude all Go files in ...\go\pkg\mod\github.com\openvenues\[email protected]\expand

cross-compile from darwin (osX) to linux (ubuntu)

Any help or hints appreciated.

I have tried many things so far on my Mac (OS X), trying to build for Ubuntu with GOOS=linux GOARCH=amd64 go build:

pkg-config --libs --cflags libpostal
-I/usr/local/include -L/usr/local/lib -lpostal
ls /usr/local/lib | grep libpostal
libpostal.1.dylib
libpostal.a
libpostal.dylib
libpostal.la

The common issue seems to appear when I run:

CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go build -o fmaplX main.go gives this:

ld: unknown option: --build-id=none
clang: error: linker command failed with exit code 1 (use -v to see invocation)

With CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o fmaplX main.go I get:

go build github.com/PartnerFusionInc/cable/fmap/vendor/github.com/openvenues/gopostal/expand: no buildable Go source files in /Users/jbowles/gowork/src/github.com/PartnerFusionInc/cable/fmap/vendor/github.com/openvenues/gopostal/expand
go build github.com/PartnerFusionInc/cable/fmap/vendor/github.com/openvenues/gopostal/parser: no buildable Go source files in /Users/jbowles/gowork/src/github.com/PartnerFusionInc/cable/fmap/vendor/github.com/openvenues/gopostal/parser

Here is my clang:

clang --version
Apple LLVM version 8.1.0 (clang-802.0.42)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

And my go env:

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/jbowles/gowork"
GORACE=""
GOROOT="/usr/local/Cellar/go/1.8.3/libexec"
GOTOOLDIR="/usr/local/Cellar/go/1.8.3/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/0_/f4vhytgs455crcjlzfc9dy8h0000gn/T/go-build099799126=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

Are Parsed Address Labels Unique?

Are the labels in all of the parsed address components unique for a single address? If so, it might be useful to create a type on []ParsedComponent that can convert to and from a map, and marshal and unmarshal JSON.
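
A possible shape for such a type is sketched below. It is an illustration only and not part of gopostal: the `Components` type name and its methods are hypothetical, and `ParsedComponent` is redeclared locally so the example compiles without cgo. Because the uniqueness question above is open, the sketch does not assume it and returns an error when a label repeats.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ParsedComponent mirrors the binding's struct; redeclared for this sketch.
type ParsedComponent struct {
	Label string `json:"label"`
	Value string `json:"value"`
}

// Components is a hypothetical named slice type carrying the conversions.
type Components []ParsedComponent

// ToMap converts to a label-keyed map and fails if a label repeats, so a
// uniqueness violation is reported instead of silently dropping a value.
func (c Components) ToMap() (map[string]string, error) {
	m := make(map[string]string, len(c))
	for _, pc := range c {
		if _, dup := m[pc.Label]; dup {
			return nil, fmt.Errorf("duplicate label %q", pc.Label)
		}
		m[pc.Label] = pc.Value
	}
	return m, nil
}

// MarshalJSON encodes the components as one JSON object keyed by label.
func (c Components) MarshalJSON() ([]byte, error) {
	m, err := c.ToMap()
	if err != nil {
		return nil, err
	}
	return json.Marshal(m)
}

// UnmarshalJSON decodes a label-keyed JSON object. The original component
// order is not recoverable from an object, so the resulting order is arbitrary.
func (c *Components) UnmarshalJSON(data []byte) error {
	var m map[string]string
	if err := json.Unmarshal(data, &m); err != nil {
		return err
	}
	out := make(Components, 0, len(m))
	for label, value := range m {
		out = append(out, ParsedComponent{Label: label, Value: value})
	}
	*c = out
	return nil
}

func main() {
	c := Components{{Label: "house_number", Value: "1111"}, {Label: "road", Value: "main street"}}
	b, err := json.Marshal(c)
	fmt.Println(string(b), err)
}
```

If labels turn out not to be unique, a `map[string][]string` would be the safer target, at the cost of a less convenient lookup.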

Libpostal throwing segmentation violations & crashing the entire app

We are currently using the gopostal library in a webserver application. Upon receiving an HTTP request from a client, the app will use gopostal to parse the address included in the client's request and return the parsed components. However, during our heavy load testing (and only during such testing), we are seeing the underlying libpostal library throwing segmentation violations, which in turn is crashing the entire application. Weirdly enough, the app works just fine during normal traffic.

Inconsistent parsing results for a US address

Apologies if this is the wrong medium for this question, but I'm at a wall. I'm getting inconsistent parsing results across my environments, which makes this difficult to debug.

For example, this address (a fake street address, but a real city, state, and ZIP code) parses incorrectly in my Docker instance (Debian), but parses correctly when I run it locally (M1 macOS).

1111 main street, Chapel Hill, North Carolina 27516

It seems to confuse the state North Carolina and appends North to the city value:

{
    "label": "house_number",
    "value": "1111"
},
{
    "label": "road",
    "value": "main street"
},
{
    "label": "city",
    "value": "chapel hill north"
},
{
    "label": "state",
    "value": "carolina"
},
{
    "label": "postcode",
    "value": "27516"
}

While in another instance,

{
    "label": "house_number",
    "value": "1111"
},
{
    "label": "road",
    "value": "main street"
},
{
    "label": "city",
    "value": "chapel hill"
},
{
    "label": "state",
    "value": "north carolina"
},
{
    "label": "postcode",
    "value": "27516"
}

Each environment is consistent with itself. I have not recompiled my local (correctly parsing) instance, but both the Docker instance and my local instance use the same forked version of libpostal when compiling and configuring.

I imagine this is an open-ended and hard-to-answer question, but I'm wondering whether this has been seen before. I would appreciate any insight into why the results differ and why the state is not recognized. Thanks in advance.
