
mordecai's People

Contributors

ahalterman, arfon, chilland, johnb30, riordan


mordecai's Issues

Allow country endpoint to return more than one filter country

Mordecai limits the search for place names to a single country per document, to reduce errors and to avoid the US bias in other systems. However, articles often reference locations in more than one country. Allow Mordecai to pick more than one country in certain cases. This will rely on having lots of high-quality labeled text data.

Test place name resolution accuracy

Before merging the v2 branch, we need to assess the accuracy of the geonames lookup call. This will tell us whether we need a feature type detection step.

Use information on all locations to pick country

When a sentence has multiple locations, use the high-confidence country resolutions to help country-resolve the low-confidence places. The three obstacles are:

  • training data (we need lots of sentences with all, not just some, locations resolved to the country level). Many sentences have locations from multiple countries, and we don't want to mess those up.
  • figuring out the right model (a second model that uses the high-confidence resolutions as features?)
  • adjusting the data structures in the code a bit

Potential memory problem of geoparse function (due to lru_cache usage)

Hi there,

We're using mordecai to geolocate ~200K tweets, and I noticed that during processing (after a few minutes) memory grows until it hits the limit and the machine starts swapping (16 GB). It looked like a memory leak, but after some debugging I found the culprit to be the lru_cache decorator used for Geoprocessor.query_geonames and Geoprocessor.query_geonames_country.

Indeed, I was able to mitigate this behavior by calling cache_clear() every N calls to Geoprocessor.geoparse (N=250 works fine):

tagger = Geoparser()
tweets = Tweet.get_tweets()  # iterate over 200K tweets
for i, t in enumerate(tweets):
    res = tagger.geoparse(t.full_text)
    # some processing....
    if not (i % 250):
        # this releases gigas of memory....
        tagger.query_geonames.cache_clear()
        tagger.query_geonames_country.cache_clear()

So it seems the reason is that lru_cache keeps the arguments and results of the last 1000 calls, and that ends up being too heavy in terms of memory (within a few minutes, processing eats 10 GB). Can you confirm this issue?
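A minimal, self-contained sketch of the alternative fix: rather than clearing the cache periodically, construct the lru_cache with a smaller maxsize so memory is bounded up front. The stub function below stands in for the real Geoprocessor.query_geonames, which I have not reproduced here.

```python
from functools import lru_cache

# Hypothetical stand-in for Geoprocessor.query_geonames: a cached
# function whose results are large objects. A smaller maxsize bounds
# memory without needing periodic cache_clear() calls.
@lru_cache(maxsize=250)
def query_geonames_stub(placename):
    # pretend this is an expensive Elasticsearch lookup
    return {"placename": placename, "hits": [placename.upper()] * 10}

for i in range(1000):
    query_geonames_stub("place_%d" % (i % 500))

# The cache never holds more than maxsize entries, so memory stays flat.
assert query_geonames_stub.cache_info().currsize <= 250
```

Whether 250 is the right bound for Mordecai's workload is an open question; the point is only that maxsize, not cache_clear(), is the knob that caps memory.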

Use volumes for data

Instead of using wget to download the word vectors and the MITIE models, use a volume to expose them to the container. It's unclear to me whether MITIE needs the models pulled down in order to build; if that's the case, a workaround could probably be found. We could provide a fetch script to download the data externally.

mordecai service error

I am trying to start the mordecai service with the docker command, but it throws the error below:
{u'es_host': 'elastic',
u'es_port': '9200',
u'mitie_directory': '/usr/src/MITIE/mitielib',
u'mitie_ner_model': '/usr/src/data/MITIE-models/english/ner_model.dat',
u'mordecai_port': 5000,
u'word2vec_model': '/usr/src/data/GoogleNews-vectors-negative300.bin.gz'}
Setting up MITIE.
Traceback (most recent call last):
File "app.py", line 24, in <module>
configs['mitie_ner_model'])
File "/usr/src/resources/utilities.py", line 126, in setup_mitie
ner_model = mitie.named_entity_extractor(mitie_ner_model)
File "/src/mitie/mitie/mitie.py", line 181, in __init__
raise Exception("Unable to load named entity extractor from " + filename)
Exception: Unable to load named entity extractor from /usr/src/data/MITIE-models/english/ner_model.dat


Can anybody suggest what the cause could be, and how to resolve it? Thank you.
Steps followed: I built mordecai as suggested in the docs and tried to start it with the image ID.

cannot import name lru_cache

import mordecai
Using TensorFlow backend.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/mordecai/__init__.py", line 1, in <module>
from .geoparse import Geoparser
File "/usr/local/lib/python2.7/dist-packages/mordecai/geoparse.py", line 7, in <module>
from functools import lru_cache
ImportError: cannot import name lru_cache

I am not sure, but the versions of these packages on my machine might be different from yours. Mine are:
backports.functools-lru-cache==1.4
functools32==3.2.3.post2
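The traceback above is a Python 2 environment: functools.lru_cache only exists on Python 3.2+, and on Python 2 it is provided by the backports package listed above. A hedged sketch of a compatible import (assuming the backports.functools_lru_cache package is installed on Python 2):

```python
# functools.lru_cache is available on Python 3.2+; on Python 2 the
# backports.functools_lru_cache package provides the same decorator.
try:
    from functools import lru_cache
except ImportError:  # Python 2 fallback
    from backports.functools_lru_cache import lru_cache

@lru_cache(maxsize=8)
def square(n):
    return n * n

assert square(3) == 9
```

Note that Mordecai v2 otherwise targets Python 3, so dropping Python 2 support entirely may be the simpler fix.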

using and customize elasticsearch geonames docker

Hi there,
I have problems in setting up an environment with docker compose.

First, the geonames index is not automatically created by elasticsearch, so I need to perform a curl PUT to create the index. Also, it seems that with docker compose and mapped volumes there are issues with file permissions, so I need to run some chmod/chown commands before launching the ES server.

To do all of the above, I need to customize/override the ES Docker entrypoint and command, but I'm having a hard time and hitting all kinds of issues... in the best case, the container simply exits with status code 0.

Also, I noticed that the docs refer to an older version of the Elasticsearch Docker image (5.5.2 vs. 6.2.2). The recent images are not in the community Docker repo; Elasticsearch now has its own Docker registry.

Is there anyone willing to share their docker-compose configuration?
Thanks in advance,
d

Keep getting error when trying pip install mordecai

I'm running Python 3.6, and when I run "pip install mordecai" I keep getting a list of errors: "Failed building wheel for ujson", "Failed building wheel for preshed", "Failed building wheel for cymem", etc.

I read here (https://stackoverflow.com/questions/43370851/failed-building-wheel-for-spacy) that I need to install spacy first, but when I pip install spacy I again get the same "Failed building wheel" errors.

What should I do?

Use geographic binary relation info/context

Lots of sentences with geographic information are structured like "X, a town 30 km south of Y", or "X, a neighborhood in Y". In both cases, we want to:

  • code X, not Y
  • but potentially use Y to help find X

Neither MITIE's binary relation detection nor Freebase have this. We could use parse info, but that would be tricky and require lots of labeled examples. Thoughts?

Switch to spaCy

Switch from MITIE to spaCy for NER and word embeddings. The advantages of spaCy include:

  • better multilingual support (including custom multilingual models)
  • built-in word embeddings
  • nicer API
  • easier to use grammar of the sentence for future work on linking verbs and locations

Switch dependencies to OEDA infrastructure

Remove dependencies to Caerus infrastructure. These include:

  • Geonames elasticsearch image hosted on Caerus Dockerhub
  • mitie-py wrapper from Caerus Github
  • word2vec model stored on Caerus S3

Connection error

I encountered a connection error in the updated Mordecai. I did not have this issue with the previous version. I would be thankful if you could help me solve this error.

mordecai_1 | WARNING:elasticsearch:GET http://localhost:9200/_search [status:N/A request:0.000s]
mordecai_1 | Traceback (most recent call last):
mordecai_1 |   File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_urllib3.py", line 74, in perform_request
mordecai_1 |     response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, **kw)
mordecai_1 |   File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 597, in urlopen
mordecai_1 |     _stacktrace=sys.exc_info()[2])
mordecai_1 |   File "/usr/local/lib/python2.7/dist-packages/urllib3/util/retry.py", line 222, in increment
mordecai_1 |     raise six.reraise(type(error), error, _stacktrace)
mordecai_1 |   File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 544, in urlopen
mordecai_1 |     body=body, headers=headers)
mordecai_1 |   File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 349, in _make_request
mordecai_1 |     conn.request(method, url, **httplib_request_kw)
mordecai_1 |   File "/usr/lib/python2.7/httplib.py", line 979, in request
mordecai_1 |     self._send_request(method, url, body, headers)
mordecai_1 |   File "/usr/lib/python2.7/httplib.py", line 1013, in _send_request
mordecai_1 |     self.endheaders(body)
mordecai_1 |   File "/usr/lib/python2.7/httplib.py", line 975, in endheaders
mordecai_1 |     self._send_output(message_body)
mordecai_1 |   File "/usr/lib/python2.7/httplib.py", line 835, in _send_output
mordecai_1 |     self.send(msg)
mordecai_1 |   File "/usr/lib/python2.7/httplib.py", line 797, in send
mordecai_1 |     self.connect()
mordecai_1 |   File "/usr/local/lib/python2.7/dist-packages/urllib3/connection.py", line 155, in connect
mordecai_1 |     conn = self._new_conn()
mordecai_1 |   File "/usr/local/lib/python2.7/dist-packages/urllib3/connection.py", line 134, in _new_conn
mordecai_1 |     (self.host, self.port), self.timeout, **extra_kw)
mordecai_1 |   File "/usr/local/lib/python2.7/dist-packages/urllib3/util/connection.py", line 88, in create_connection
mordecai_1 |     raise err
mordecai_1 | ProtocolError: ('Connection aborted.', error(111, 'Connection refused'))

IndexError in ranker function

The deterministic ranking function sometimes hits an IndexError:

/Users/ahalterman/MIT/Geolocation/mordecai/mordecai/geoparse.py in <listcomp>(.0)
    779         ranks = ranks[::-1]
    780         # sort the list of dicts according to ranks
--> 781         sorted_meta = [meta[r] for r in ranks]
    782         sorted_X = X[ranks]
    783         return (sorted_X, sorted_meta)

IndexError: list index out of range

This will reproduce:

geo.geoparse("In early 1938, the Prime Minister cut grants-in-aid to the provinces, effectively killing the relief project scheme. Premier Thomas Dufferin Pattullo closed the projects in April, claiming that British Columbia could not shoulder the burden alone. Unemployed men again flocked to Vancouver to protest government insensitivity and intransigence to their plight. The RCPU organized demonstrations and tin-canning (organized begging) in the city. Under the guidance of twenty-six-year-old Steve Brodie, the leader of the Youth Division who had cut his activist teeth during the 1935 relief camp strike, protesters occupied Hotel Georgia, the Vancouver Art Gallery (then located at 1145 West Georgia Street), and the main post office (now the Sinclair Centre).")
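A minimal sketch of the failure mode in the listcomp above: ranks comes from an argsort over a score array that can be longer than meta, so an index in ranks can exceed len(meta). Filtering to valid indices avoids the IndexError; whether that masking is the right fix for geoparse.py (as opposed to fixing the length mismatch upstream) is an assumption, and all names here are illustrative.

```python
import numpy as np

# Hypothetical reduction of the sort step in features_for_rank's caller:
# scores may be padded, so argsort can yield indices past len(meta).
def sort_by_rank(X, meta, scores):
    ranks = np.asarray(scores).argsort()[::-1]
    valid = [r for r in ranks if r < len(meta)]  # guard against padding
    sorted_meta = [meta[r] for r in valid]
    sorted_X = X[valid]
    return sorted_X, sorted_meta

X = np.array([[0.1], [0.9], [0.5]])
meta = [{"name": "a"}, {"name": "b"}]  # shorter than the score array
scores = [0.1, 0.9, 0.5]
sorted_X, sorted_meta = sort_by_rank(X, meta, scores)
assert [m["name"] for m in sorted_meta] == ["b", "a"]  # index 2 dropped
```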

Convert to Python package

Rather than being fundamentally a (Dockerized) REST service like version 1 was, version 2 will be an installable Python package, with a separate repo that packages and runs it as a service.

Error from feature code lookup function

This bug is very weird. This sentence runs fine:

geo.geoparse("Congress and in the legislatures of Alabama, California, Florida, and Missouri.")

...but fails when you replace "Missouri" with "Michigan":

geo.geoparse("Congress and in the legislatures of Alabama, California, Florida, and Michigan.")

It fails here with a KeyError.
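A KeyError on one state name but not another points at a lookup table missing an entry for some Geonames feature code. A minimal sketch of a defensive lookup (the table and function names here are hypothetical, not Mordecai's actual code):

```python
# Illustrative subset of a feature-code table; the real Geonames
# code set is much larger, which is how one code can be missing.
FEATURE_CLASSES = {"PPL": "city", "ADM1": "state"}

def lookup_feature(code):
    # .get() with a default keeps one unknown code from raising
    # a KeyError and failing the whole parse.
    return FEATURE_CLASSES.get(code, "unknown")

assert lookup_feature("ADM1") == "state"
assert lookup_feature("XYZ") == "unknown"
```

Logging the unknown code before falling back would also help pin down which entry ("Michigan" presumably resolves to it) is missing.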

Make tests work

With the new setup of CountryAPI() and PlacesAPI() and the reorganization of the directories, the tests aren't running correctly right now. Fix that, and make it clear how to run them from inside the container:

sudo docker exec -it d1fc3f40c4a6d05 bash
cd resources
py.test

Write event_geocode endpoint

Currently, Mordecai is a document geocoder, taking in text, extracting location names, and associating them with their geographic coordinates. As it gets incorporated into the Petrarch pipeline, it would be good to have an explicit endpoint for geocoding Petrarch-produced events, rather than whole documents. This would mostly be useful for geocoding events in sentences that have two or more locations (figure out how many that is).

  • take in text and event (and CoreNLP parse?)
  • use event metadata and sentence parse to extract the (one) correct location from a sentence

Improve country picking for underreported countries

/country's dependence on word2vec for picking the focus country of an article means that it performs poorly on countries that don't have much news coverage and thus don't have very distinct word embeddings. This problem has come up especially for West African states.

  • measure how bad the problem is
  • think about using doc2vec + a model for predicting countries
  • consider re-training word2vec and doc2vec using Wikipedia and news text

Add Spanish /country and /places

Part of the design of Mordecai is to accommodate non-English languages with a minimum of modification and training.

MITIE already has a pre-made Spanish NER model, so all we need for a Spanish-language Mordecai is Spanish word2vec.

  • /country
  • /places

Undefined variable `locations`

Line 209 attempts to return the JSON-formatted variable locations, but it is not defined within the scope of the function. I'm not sure what exactly the variable should hold when there is a ValueError.
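A sketch of the bug class: locations is only bound on the success path, so the error path returns an unbound name. Initializing it before the try block fixes the scope; all names below are hypothetical stand-ins, not the actual app.py code.

```python
import json

def places_endpoint(text, extract):
    locations = []  # define up front so it exists on every code path
    try:
        locations = extract(text)
    except ValueError:
        pass  # leave locations as the empty default
    return json.dumps(locations)

# Success path returns the extracted places...
assert places_endpoint("x", lambda t: ["Lagos"]) == '["Lagos"]'

# ...and the ValueError path now returns an empty list, not a NameError.
def failing_extract(t):
    raise ValueError("no entities")
assert places_endpoint("x", failing_extract) == "[]"
```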

Almost got this running...

Hi, thank you for all of your help, and my apologies for the string of questions, but I'm very new to programming. I get the following message when I run my request. Does the error message indicate anything I might be doing wrong?

Thank you again,

[screenshot of the error message, dated 2018-05-24]

Geonames index missing US states

For some reason, US states seem to be missing from the Geonames/Elasticsearch index: When looking for Ohio by geonames id,

curl -XGET localhost:9200/geonames/_search?q="geonameid":5165418

returns nothing (that search format works for other entries). But

cat allCountries.txt | grep 5165418

shows that it's in the allCountries.txt file. I'll try rebuilding the index...

Make it faster

Mordecai is too slow to use right now for a corpus of ~70 million documents. Based on my testing, it seems that the Elasticsearch queries are responsible for the great majority of time, with the model calls being a distant second. Explore ways to speed up these queries or to use threading to make many calls at once, since ES can handle hundreds of concurrent requests.
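One sketch of the threading idea, assuming the ES lookups are independent and I/O-bound: fan them out with a thread pool rather than issuing them serially. The stub below stands in for the real query_geonames call; pool size and names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the real Elasticsearch lookup, which spends most of
# its time waiting on the network and so parallelizes well.
def query_geonames_stub(placename):
    return {"placename": placename, "hits": []}

def query_many(placenames, max_workers=16):
    # map() preserves input order, so results line up with placenames.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(query_geonames_stub, placenames))

results = query_many(["Lagos", "Moscow", "Chechnya"])
assert [r["placename"] for r in results] == ["Lagos", "Moscow", "Chechnya"]
```

Since ES can handle hundreds of concurrent requests, batching via the _msearch API would be another option worth profiling against threads.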

Timing overview:

Geoparse:
322602.0  # total
135417.0  42%  # infer_country
 77966.0  24%  # query_geonames_country
  6787.0   2%  # features for rank
 77908.0  24%  # rank_model.predict

infer_country:
 81817.0  # total
 27146.0  33%  # nlp()
 47713.0  58%  # make_country_features
  6102.0   7%  # country_model.predict
   643.0   .7% # make_country_matrix


make_country_features
 396670.0      # total
   1575.0      # _feature_word_embedding
 361902.0  91% # query_geonames

Line-by-line:

Wrote profile results to time_mordecai.py.lprof
Timer unit: 1e-06 s

Total time: 0.43051 s
File: time_mordecai.py
Function: make_country_features at line 14

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    14                                           @profile
    15                                           def make_country_features(geo, require_maj = False):
    16         1      51117.0  51117.0     11.9      doc = nlp("Thousands of Nigerians from throughout the country were converging Thursday for a rally in Lagos to protest the rights violations under the recently imposed Sharia law by Islamic fundamentalists in the northern district of Borno.")
    17         1         26.0     26.0      0.0      if not hasattr(doc, "ents"):
    18                                                   doc = nlp(doc)
    19                                               # initialize the place to store finalized tasks
    20         1          2.0      2.0      0.0      task_list = []
    21                                           
    22                                               # get document vector
    23                                               #doc_vec = self._feature_word_embedding(text)['country_1']
    24                                           
    25                                               # get explicit counts of country names
    26         1        136.0    136.0      0.0      ct_mention, ctm_count1, ct_mention2, ctm_count2 = geo._feature_country_mentions(doc)
    27                                           
    28                                               # now iterate through the entities, skipping irrelevant ones and countries
    29         8         23.0      2.9      0.0      for ent in doc.ents:
    30         7         42.0      6.0      0.0          if not ent.text.strip():
    31                                                       continue
    32         7         15.0      2.1      0.0          if ent.label_ not in ["GPE","LOC","FAC"]:
    33         5          5.0      1.0      0.0              continue
    34                                                   # don't include country names (make a parameter)
    35         2         10.0      5.0      0.0          if ent.text.strip() in geo._skip_list:
    36                                                       continue
    37                                           
    38                                                   ## just for training purposes
    39                                                   #if ent.text.strip() in self._just_cts.keys():
    40                                                   #    continue
    41                                           
    42                                                   #skip_list.add(ent.text.strip())
    43         2          3.0      1.5      0.0          ent_label = ent.label_ # destroyed by trimming
    44         2        124.0     62.0      0.0          ent = geo.clean_entity(ent)
    45                                           
    46                                                   # vector for just the solo word
    47         2        879.0    439.5      0.2          vp = geo._feature_word_embedding(ent)
    48         2          5.0      2.5      0.0          try:
    49         2          4.0      2.0      0.0              word_vec = vp['country_1']
    50         2          6.0      3.0      0.0              wv_confid = float(vp['confid_a'])
    51                                                   except TypeError:
    52                                                       # no idea why this comes up
    53                                                       word_vec = ""
    54                                                       wv_confid = "0"
    55                                           
    56                                                   # look for explicit mentions of feature names
    57         2        115.0     57.5      0.0          class_mention, code_mention = geo._feature_location_type_mention(ent)
    58                                           
    59                                                   ##### ES-based features
    60         2          2.0      1.0      0.0          try:
    61         2     377075.0 188537.5     87.6              result = geo.query_geonames(ent.text)
    62                                                   except ConnectionTimeout:
    63                                                       result = ""
    64                                           
    65                                                   # build results-based features
    66         2        155.0     77.5      0.0          most_alt = geo._feature_most_alternative(result)
    67         2        117.0     58.5      0.0          most_common = geo._feature_most_common(result)
    68         2        132.0     66.0      0.0          most_pop = geo._feature_most_population(result)
    69         2         10.0      5.0      0.0          first_back, second_back = geo._feature_first_back(result)
    70                                           
    71         2          3.0      1.5      0.0          try:
    72         2          3.0      1.5      0.0              maj_vote = Counter([word_vec, most_alt,
    73         2          3.0      1.5      0.0                                  first_back, most_pop,
    74         2         48.0     24.0      0.0                                  ct_mention
    75                                                                           #doc_vec_sent, doc_vec
    76         2          3.0      1.5      0.0                                  ]).most_common()[0][0]
    77                                                   except Exception as e:
    78                                                       print("Problem taking majority vote: ", ent, e)
    79                                                       maj_vote = ""
    80                                           
    81                                           
    82         2          3.0      1.5      0.0          if not maj_vote:
    83                                                       maj_vote = ""
    84                                           
    85                                                   # We only want all this junk for the labeling task. We just want to straight to features
    86                                                   # and the model when in production.
    87                                           
    88         2          2.0      1.0      0.0          try:
    89         2         56.0     28.0      0.0              start = ent.start_char - ent.sent.start_char
    90         2         16.0      8.0      0.0              end = ent.end_char - ent.sent.start_char
    91         2          4.0      2.0      0.0              iso_label = maj_vote
    92         2          2.0      1.0      0.0              try:
    93         2          4.0      2.0      0.0                  text_label = geo._inv_cts[iso_label]
    94                                                       except KeyError:
    95                                                           text_label = ""
    96                                           
    97         2        274.0    137.0      0.1              task = {"text" : ent.sent.text,
    98         2          3.0      1.5      0.0                      "label" : text_label, # human-readable country name
    99         2          9.0      4.5      0.0                      "word" : ent.text,
   100                                                               "spans" : [{
   101         2          2.0      1.0      0.0                          "start" : start,
   102         2          3.0      1.5      0.0                          "end" : end,
   103                                                                   } # make sure to rename for Prodigy
   104                                                                       ],
   105                                                               "features" : {
   106         2          2.0      1.0      0.0                              "maj_vote" : iso_label,
   107         2          3.0      1.5      0.0                              "word_vec" : word_vec,
   108         2          2.0      1.0      0.0                              "first_back" : first_back,
   109                                                                       #"doc_vec" : doc_vec,
   110         2          2.0      1.0      0.0                              "most_alt" : most_alt,
   111         2          3.0      1.5      0.0                              "most_pop" : most_pop,
   112         2          2.0      1.0      0.0                              "ct_mention" : ct_mention,
   113         2         29.0     14.5      0.0                              "ctm_count1" : ctm_count1,
   114         2          4.0      2.0      0.0                              "ct_mention2" : ct_mention2,
   115         2          3.0      1.5      0.0                              "ctm_count2" : ctm_count2,
   116         2          3.0      1.5      0.0                              "wv_confid" : wv_confid,
   117         2          3.0      1.5      0.0                              "class_mention" : class_mention, # inferred geonames class from mentions
   118         2          7.0      3.5      0.0                              "code_mention" : code_mention,
   119                                                                       #"places_vec" : places_vec,
   120                                                                       #"doc_vec_sent" : doc_vec_sent
   121                                                                       } }
   122         2          5.0      2.5      0.0              task_list.append(task)
   123                                                   except Exception as e:
   124                                                       print(ent.text,)
   125                                                       print(e)
   126         1          1.0      1.0      0.0      return task_list # rename this var

Total time: 0.489834 s
File: time_mordecai.py
Function: infer_country at line 129

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   129                                           @profile
   130                                           def infer_country(geo):
   131         1      39311.0  39311.0      8.0      doc = nlp("At a public bathhouse in the Moqor district of Ghazni Province, they met the commander of a small pro-government militia. ")
   132         1         19.0     19.0      0.0      if not hasattr(doc, "ents"):
   133                                                   doc = nlp(doc)
   134         1     443189.0 443189.0     90.5      proced = geo.make_country_features(doc, require_maj=False)
   135         1          2.0      2.0      0.0      if not proced:
   136                                                   pass
   137                                                   # logging!
   138                                                   #print("Nothing came back from make_country_features")
   139         1          1.0      1.0      0.0      feat_list = []
   140                                               #proced = self.ent_list_to_matrix(proced)
   141                                           
   142         3          5.0      1.7      0.0      for loc in proced:
   143         2        537.0    268.5      0.1          feat = geo.make_country_matrix(loc)
   144                                                   #labels = loc['labels']
   145         2          4.0      2.0      0.0          feat_list.append(feat)
   146                                                   #try:
   147                                                   # for each potential country...
   148         5         13.0      2.6      0.0          for n, i in enumerate(feat_list):
   149         3          4.0      1.3      0.0              labels = i['labels']
   150         3          4.0      1.3      0.0              try:
   151         3       6602.0   2200.7      1.3                  prediction = geo.country_model.predict(i['matrix']).transpose()[0]
   152         3         34.0     11.3      0.0                  ranks = prediction.argsort()[::-1]
   153         3         72.0     24.0      0.0                  labels = np.asarray(labels)[ranks]
   154         3          9.0      3.0      0.0                  prediction = prediction[ranks]
   155                                                       except ValueError:
   156                                                           prediction = np.array([0])
   157                                                           labels = np.array([""])
   158                                           
   159         2         16.0      8.0      0.0          loc['country_predicted'] = labels[0]
   160         2          6.0      3.0      0.0          loc['country_conf'] = prediction[0]
   161         2          3.0      1.5      0.0          loc['all_countries'] = labels
   162         2          2.0      1.0      0.0          loc['all_confidence'] = prediction
   163                                           
   164         1          1.0      1.0      0.0      return proced

Total time: 1.55258 s
File: time_mordecai.py
Function: test_geoparse at line 168

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   168                                           @profile
   169                                           def test_geoparse(geo):
   170         1          6.0      6.0      0.0      text = "Police in Russia have continued to violate human rights, especially in the region of Chechnya and in the cities of Moscow and St. Petersburg."
   171         1      43017.0  43017.0      2.8      doc = nlp(text)
   172         1     928033.0 928033.0     59.8      proced = geo.infer_country(doc)
   173         5         10.0      2.0      0.0      for loc in proced:
   174         4        103.0     25.8      0.0          if loc['country_conf'] >= geo.country_threshold: # shrug
   175         4     475823.0 118955.8     30.6              res = geo.query_geonames_country(loc['word'], loc['country_predicted'])
   176                                                   elif loc['country_conf'] < geo.country_threshold:
   177                                                       res = ""
   178                                                       # if the confidence is too low, don't use the country info
   179         4         15.0      3.8      0.0          try:
   180         4          5.0      1.2      0.0              _ = res['hits']['hits']
   181                                                       # If there's no geonames result, what to do?
   182                                                       # For now, just continue.
   183                                                       # In the future, delete? Or add an empty "loc" field?
   184                                                   except (TypeError, KeyError):
   185                                                       continue
   186                                                   # Pick the best place
   187         4       5871.0   1467.8      0.4          X, meta = geo.features_for_rank(loc, res)
   188         4         15.0      3.8      0.0          if X.shape[1] == 0:
   189                                                       # This happens if there are no results...
   190                                                       continue
   191         4        890.0    222.5      0.1          all_tasks, sorted_meta, sorted_X = geo.format_for_prodigy(X, meta, loc['word'], return_feature_subset=True)
   192         4       1165.0    291.2      0.1          fl_pad = np.pad(sorted_X, ((0, 4 - sorted_X.shape[0]), (0, 0)), 'constant')
   193         4         32.0      8.0      0.0          fl_unwrap = fl_pad.flatten()
   194         4      97406.0  24351.5      6.3          prediction = geo.rank_model.predict(np.asmatrix(fl_unwrap))
   195         4        136.0     34.0      0.0          place_confidence = prediction.max()
   196         4         42.0     10.5      0.0          loc['geo'] = sorted_meta[prediction.argmax()]
   197         4         12.0      3.0      0.0          loc['place_confidence'] = place_confidence

Switch CI infrastructure to Travis

We were using CircleCI before, but it can't handle the Docker + volume setup for elasticsearch/geonames. Travis has no trouble with this, so we need to switch over.

API changes

I'm considering breaking changes to the API (as part of a potential large release).

  • Each document-level request would return the Mordecai version number, the last updated date for the Elasticsearch index, and the language used. The place names would then be values in a "results" field.
  • Each place name's data would also include the Geonames ID number, which would make it much easier to look up or match with evaluation sets.

Prevent cities being geocoded to things inside the city

Lagos -> Intercontinental Lagos
Mogadishu -> Mogadishu University
Vatican -> Vatican Museums
Mannheim -> Kraftwerk Mannheim
Noida -> Dreams Inn Greater Noida
Rajasthan -> Rajasthan Desert Safari
etc.

Could do this by increasing the boost on the ascii name field, by doing some edit distance thing, or by boosting inhabited places over other geographic features.
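One way to sketch the boosting approach: build a query body that up-weights exact ascii-name matches and inhabited places (Geonames feature class "P") over hotels, universities, and other features. The field names (`name`, `asciiname`, `alternativenames`, `feature_class`) are assumptions about the index mapping, and the boost values are illustrative, not tuned.

```python
def build_boosted_query(placename):
    # Hypothetical sketch: prefer exact ascii-name matches (^2) and
    # populated places (feature_class "P") over other feature types.
    # Field names assume the Geonames index schema; adjust to the
    # actual mapping before using.
    return {
        "query": {
            "bool": {
                "must": {"multi_match": {
                    "query": placename,
                    "fields": ["name", "asciiname^2", "alternativenames"],
                }},
                "should": [
                    # boost inhabited places so "Lagos" beats
                    # "Intercontinental Lagos"
                    {"term": {"feature_class": {"value": "P", "boost": 3.0}}},
                ],
            }
        }
    }
```

An edit-distance re-rank on the returned names would be a complementary fix, since "Lagos" is a strict substring of all the bad matches above.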

404 over http://localhost:9200/geonames

ANSWER THERE: #56 (comment)

Hey everyone,

I just installed mordecai. Everything ran smoothly until I needed to launch mordecai.

$ python3
>>> from mordecai import Geoparser
Using TensorFlow backend.
>>> geo = Geoparser()
GET http://localhost:9200/geonames/_count [status:404 request:0.011s]
Traceback (most recent call last):
File "/home/morty/.local/lib/python3.6/site-packages/mordecai/geoparse.py", line 56, in __init__
self.conn.count()
File "/home/morty/.local/lib/python3.6/site-packages/elasticsearch_dsl/search.py", line 587, in count
**self._params
File "/home/morty/.local/lib/python3.6/site-packages/elasticsearch/client/utils.py", line 73, in _wrapped
return func(*args, params=params, **kwargs)
File "/home/morty/.local/lib/python3.6/site-packages/elasticsearch/client/__init__.py", line 1123, in count
doc_type, '_count'), params=params, body=body)
File "/home/morty/.local/lib/python3.6/site-packages/elasticsearch/transport.py", line 312, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/home/morty/.local/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 128, in perform_request
self._raise_error(response.status, raw_data)
File "/home/morty/.local/lib/python3.6/site-packages/elasticsearch/connection/base.py", line 125, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.NotFoundError: TransportError(404, 'index_not_found_exception', 'no such index')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/morty/.local/lib/python3.6/site-packages/mordecai/geoparse.py", line 61, in init
"for instructions on setting up Geonames/Elasticsearch")
ConnectionError: [Errno Could not establish contact with Elasticsearch at localhost on port 9200. Are you sure it's running?
] Mordecai needs access to the Geonames/Elasticsearch gazetteer to function.: 'See https://github.com/openeventdata/mordecai#installation-and-requirements'

With a docker run I get:

e9227f1162e3 elasticsearch:5.5.2 "/docker-entrypoint.…" 4 minutes ago Up 4 minutes 127.0.0.1:9200->9200/tcp, 9300/tcp serene_khorana

and I can access http://localhost:9200 with my browser
When I try to go to http://localhost:9200/geonames I get a 404

Any lead to where/what to look at ?

Index Error

For some text such as the following text:

text= '''Santa Cruz is a first class municipality in the province of Davao del Sur, Philippines. It has a population of 81,093 people as of 2010. The Municipality of Santa Cruz is part of Metropolitan Davao. Santa Cruz is politically subdivided into 18 barangays. Of the 18 barangays, 7 are uplands, 9 are upland-lowland and coastal and 2 are lowland-coastal. Pista sa Kinaiyahan A yearly activity conducted every last week of April as a tribute to the Mother Nature through tree-growing, cleanup activities and Boulder Face challenge. Araw ng Santa Cruz It is celebrated every October 5 in commemoration of the legal creation of the municipality in 1884. Highlights include parades, field demonstrations, trade fairs, carnivals and traditional festivities. Sinabbadan Festival A festival of ethnic ritual and dances celebrated every September. Santa Cruz is accessible by land transportation vehicles plying the Davao-Digos City, Davao-Kidapawan City, Davao-Cotabato City, Davao-Koronadal City and Davao-Tacurong City routes passing through the town's single, 27 kilometres (17 mi) stretch of national highway that traverses its 11 barangays. From Davao City, the administrative center of Region XI, it is 38 kilometres (24 mi) away within a 45-minute ride, while it is 16 kilometres (9.9 mi) or about 15-minute ride from provincial capital city of Digos.
... '''

I got this error:

out = geo.geoparse(text)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/dist-packages/mordecai/geoparse.py", line 955, in geoparse
all_tasks, sorted_meta, sorted_X = self.format_for_prodigy(X, meta, loc['word'], return_feature_subset=True)
File "/usr/local/lib/python3.5/dist-packages/mordecai/geoparse.py", line 813, in format_for_prodigy
sorted_X, sorted_meta = self.ranker(X, meta)
File "/usr/local/lib/python3.5/dist-packages/mordecai/geoparse.py", line 781, in ranker
sorted_meta = [meta[r] for r in ranks]
File "/usr/local/lib/python3.5/dist-packages/mordecai/geoparse.py", line 781, in <listcomp>
sorted_meta = [meta[r] for r in ranks]
IndexError: list index out of range
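The traceback shows the list comprehension in `ranker` indexing `meta` with ranks that exceed its length. A defensive fix, sketched here with a hypothetical helper name (not Mordecai's actual code), is to drop any rank that does not index into the metadata list so a mismatch between the feature matrix and the metadata degrades gracefully instead of raising:

```python
def safe_sort_meta(meta, ranks):
    # Keep only ranks that actually index into `meta`; out-of-range
    # ranks (the cause of the IndexError above) are silently skipped.
    return [meta[r] for r in ranks if 0 <= r < len(meta)]

safe_sort_meta(["a", "b"], [1, 0, 2])  # -> ["b", "a"]; rank 2 is dropped
```

The underlying bug (why the ranks and metadata disagree in length for this text) would still deserve a real fix upstream.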

Not able to install Geonames gazetteer running in Elasticsearch

Hi, I typed docker pull elasticsearch:5.5.2 as per the instructions and got the following response:
5.5.2: Pulling from library/elasticsearch
Digest: sha256:3686a5757ed46c9dbcf00f6f71fce48ffc5413b193a80d1c46a21e7aad4c53ad
Status: Image is up to date for elasticsearch:5.5.2

But then when I type "wget https://s3.amazonaws.com/ahalterman-geo/geonames_index.tar.gz --output-file=wget_log.txt" I got "-bash: wget: command not found". Did I miss something?

Version 2 speedup

Version 2 is pretty slow. Explore speedups, including:

  • smarter/smaller ES queries
  • asynchronous ES queries?
  • only loading parts of the spacy model
  • other slow parts from profiling
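For the last point, the standard-library profiler is enough to find the hot spots. The sketch below profiles a stand-in function; in practice you would replace `parse_documents` with a call like `geo.batch_geoparse(docs)`:

```python
import cProfile
import io
import pstats

def parse_documents(docs):
    # Stand-in for geo.batch_geoparse(docs); replace with the real call.
    return [d.upper() for d in docs]

profiler = cProfile.Profile()
profiler.enable()
parse_documents(["some text"] * 1000)
profiler.disable()

# Print the five most expensive calls by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

The line-by-line numbers quoted in the "Use information on all locations to pick country" issue above came from line_profiler, which is a good follow-up once cProfile has identified the slow functions.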

bulk_geoparse returns no 'geo' keys

When I run batch_geoparse on a short list of strings, I get 0s for all country_conf values and no 'geo' keys/values.

from mordecai import Geoparser
geo = Geoparser()
ss=["I traveled from Oxford to Ottawa.","I visited New Orleans."]
geo.batch_geoparse(ss)

[[{'word': 'Oxford',
   'spans': [{'start': 0, 'end': 6}],
   'country_predicted': '',
   'country_conf': 0},
  {'word': 'Ottawa',
   'spans': [{'start': 0, 'end': 6}],
   'country_predicted': '',
   'country_conf': 0}],
 [{'word': 'New Orleans',
   'spans': [{'start': 0, 'end': 11}],
   'country_predicted': '',
   'country_conf': 0}]]

Likewise, I've been unable to finish running geoparse on a single string. I've tried running the example usage code (i.e. geo.geoparse("I traveled from Oxford to Ottawa.")), but it produces no output after >30 minutes.

Any thoughts about what might be causing this behavior?

FYI, I'm running this code on a Mac laptop with an i7 processor and 16GB of RAM. When installing mordecai and elasticsearch, I copied and pasted the directions in the README.md, and I'm running all code from within the same directory.

Use Elasticsearch aggregations to calculate features

Right now, Mordecai calculates features like the number of results per country or the result with the most alternative names from the first 50 results that come back from the search. This isn't ideal: it means pulling all 50 results back from Elasticsearch, which is inefficient, and it lets us consider only the top 50.

Instead, use ES's aggregations to calculate that stuff directly on the server. I have this working for the number of results per country, but calculating the number of alternative names will require modifying the index to include that information.

Most importantly, making these changes will require retraining the model, since the values of the features will be different (and hopefully better).
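For the per-country counts, the request body might look like the sketch below: a `terms` aggregation with `size: 0` so only the counts, not the hits, come back from the server. The field name `country_code3` is an assumption about the Geonames index mapping.

```python
def country_count_query(placename):
    # Sketch: count matching gazetteer entries per country on the ES
    # server instead of fetching 50 hits and counting in Python.
    return {
        "size": 0,  # we only want the aggregation, not the hits
        "query": {"match": {"name": placename}},
        "aggs": {
            "country_counts": {
                "terms": {"field": "country_code3"}  # assumed field name
            }
        },
    }
```

The counts would then arrive under `response["aggregations"]["country_counts"]["buckets"]`, one bucket per country code.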

pip installing may not install spaCy models

I just tried pip install mordecai on a new machine and it doesn't seem to install the spaCy language model:

OSError: Can't find model 'en_core_web_lg'

The model can be downloaded separately like this:

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz

The longer term solution is to figure out how to change setup.py to download it automatically. This line seems to not be working.

Redo documentation

When the re-write gets further along, the documentation will need to be re-written almost from scratch. I'd like it to be much shorter and more straightforward than the current docs, with the more complicated material moved to readthedocs.

Create Exception when no ES

If you start Mordecai without Geonames-ES running or it goes down later while running, it should issue a nice transparent exception.
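A minimal sketch of what this could look like, assuming nothing about Mordecai's internals: a dedicated exception class and a check that wraps whatever connectivity probe is used (here injected as a callable, so it works for both startup and mid-run failures). The names are illustrative, not an actual Mordecai API.

```python
class GazetteerUnavailableError(Exception):
    """Raised when the Geonames/Elasticsearch index cannot be reached."""

def require_gazetteer(ping):
    """`ping` is any zero-argument callable that raises on failure,
    e.g. a wrapper around the Elasticsearch count() call."""
    try:
        ping()
    except Exception as exc:
        raise GazetteerUnavailableError(
            "Mordecai needs the Geonames/Elasticsearch gazetteer to "
            "function. See the README for setup instructions."
        ) from exc
```

Chaining with `from exc` keeps the original Elasticsearch error in the traceback while putting the actionable message first.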

Improve v2 API

Update the v2 API before merging:

  • switch to Geoparser() to remove Geoparse vs. geoparse method
  • consistent naming for feature making functions
  • doc_to_guess --> infer_country, plus change result picking names
  • ?

Then,

  • Update tests
  • Update documentation

no module named mitie error

I'm getting an ImportError when I run app.py after running docker-compose up -d. Any idea how to fix this? I'm using a Mac.

"'nlp' is not defined"

I'm getting an issue when running a simple test script

from mordecai import Geoparser
geo = Geoparser()
geo.geoparse("From Ottowa to Beijing.")

This is the error (screenshot of a traceback ending in: NameError: name 'nlp' is not defined)

Spans for extracted placenames incorrectly start at 0

All spans returned by Mordecai begin at 0, which is incorrect. E.g.

geo.geoparse("""The state of North Rhine-Westphalia ranks first in population among 
German states for both Roman Catholics and Protestants.""")

{...
'spans': [{'end': 22, 'start': 0}],
...}
