openeventdata / mordecai
Full text geoparsing as a Python library
License: MIT License
Mordecai limits the search for place names to a single country per document, to reduce errors and to avoid the US bias in other systems. However, articles often reference locations in more than one country. Allow Mordecai to pick more than one country in certain cases. This will rely on having lots of high-quality labeled text data.
Before merging the v2 branch, need to assess the accuracy of the geonames lookup call. This will inform if we need a feature type detection step.
When a sentence has multiple locations, use the high-confidence country resolutions to help country-resolve the low-confidence places. The three obstacles are:
Hi there,
We're using mordecai to geolocalize ~200K tweets, and I noticed that during our processing (after a few minutes) memory grows until it hits the limit and the machine starts swapping (16GB). It looks like a memory leak, and after some debugging I found the culprit to be the lru_cache decorator used for Geoprocessor.query_geonames and Geoprocessor.query_geonames_country.
Indeed, I was able to mitigate this behavior and fix it by calling cache_clear() every N calls to Geoprocessor.geoparse (N=250 works fine).
tagger = Geoparser()
tweets = Tweet.get_tweets()  # iterate over 200K tweets
for i, t in enumerate(tweets):
    res = tagger.geoparse(t.full_text)
    # some processing...
    if not (i % 250):
        # this releases gigabytes of memory...
        tagger.query_geonames.cache_clear()
        tagger.query_geonames_country.cache_clear()
So it seems the reason is that lru_cache keeps the arguments and results of the last 1000 calls, and that ends up being too heavy in memory (within a few minutes, processing eats 10 GB). Can you confirm this issue?
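The workaround above amounts to periodically dropping the cache. A minimal self-contained illustration of the pattern, with a toy lookup function standing in for the expensive geonames query:

```python
from functools import lru_cache

@lru_cache(maxsize=1000)
def lookup(name):
    # stand-in for an expensive Elasticsearch/geonames query
    return name.upper()

for i, name in enumerate(["lagos", "accra", "cairo"] * 500):
    lookup(name)
    if not (i % 250):
        lookup.cache_clear()  # drop all cached arguments and results
```

`lookup.cache_info()` can be used between clears to watch `currsize` grow, which is a quick way to confirm the cache is what is holding the memory.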
Instead of using wget to download the word vectors and the MITIE models, use a volume to expose them to the container. It's unclear to me whether MITIE needs the models pulled down in order to build; if that's the case, a workaround could probably be found. We could provide a fetch script to download the data externally.
I am trying to start the mordecai service with the docker command, but it throws the error below.
{u'es_host': 'elastic',
u'es_port': '9200',
u'mitie_directory': '/usr/src/MITIE/mitielib',
u'mitie_ner_model': '/usr/src/data/MITIE-models/english/ner_model.dat',
u'mordecai_port': 5000,
u'word2vec_model': '/usr/src/data/GoogleNews-vectors-negative300.bin.gz'}
Setting up MITIE.
Traceback (most recent call last):
File "app.py", line 24, in <module>
configs['mitie_ner_model'])
File "/usr/src/resources/utilities.py", line 126, in setup_mitie
ner_model = mitie.named_entity_extractor(mitie_ner_model)
File "/src/mitie/mitie/mitie.py", line 181, in __init__
raise Exception("Unable to load named entity extractor from " + filename)
Exception: Unable to load named entity extractor from /usr/src/data/MITIE-models/english/ner_model.dat
Can anybody suggest what the cause could be, and how to resolve it? Thank you.
Steps followed:
I built mordecai as suggested in the docs, and tried to start it with the image ID.
To make Mordecai a better drop-in replacement for CLIFF in the phoenix_pipeline, also return the ADM1 info for any extracted place names.
import mordecai
Using TensorFlow backend.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/mordecai/__init__.py", line 1, in <module>
from .geoparse import Geoparser
File "/usr/local/lib/python2.7/dist-packages/mordecai/geoparse.py", line 7, in <module>
from functools import lru_cache
ImportError: cannot import name lru_cache
I am not sure, but the version of this package on my computer might be different from yours. Mine is:
backports.functools-lru-cache==1.4
functools32==3.2.3.post2
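For what it's worth, the usual fix for this (a sketch of the common pattern, not necessarily what mordecai should do, since it targets Python 3) is a guarded import that falls back to the backport listed above on Python 2:

```python
# functools.lru_cache only exists on Python 3.2+; on Python 2 the
# backports.functools_lru_cache package provides the same decorator.
try:
    from functools import lru_cache
except ImportError:
    from backports.functools_lru_cache import lru_cache

@lru_cache(maxsize=8)
def double(x):
    return 2 * x
```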
Hi there,
I have problems in setting up an environment with docker compose.
First, the geonames index is not automatically created by elasticsearch, so I need to perform a curl PUT to create the index. Also, with docker compose and mapped volumes there are issues with file permissions, so some chmod/chown commands are needed before launching the ES server.
To do all of the above, I need to customize/override the ES docker entrypoint and command, but I'm having a hard time and all kinds of issues... in the best case, the container simply exits with status code 0.
Also, I noticed that the docs refer to an older version of the Elasticsearch docker image (5.5.2 vs 6.2.2). The recent ones are not on the community docker repo; ES now has its own docker registry.
Is anyone willing to share their docker compose configuration?
Thanks in advance,
d
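For reference, a minimal compose sketch. The `elastic` hostname, the 5.5.2 tag, and ports 9200/5000 come from the configs quoted elsewhere on this page; the volume path and build context are assumptions to adjust for your setup:

```yaml
version: "2"
services:
  elastic:
    image: elasticsearch:5.5.2
    ports:
      - "127.0.0.1:9200:9200"
    volumes:
      # assumed path to the unpacked geonames index on the host
      - ./geonames_index:/usr/share/elasticsearch/data
  mordecai:
    build: .
    ports:
      - "5000:5000"
    depends_on:
      - elastic
```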
I'm running Python 3.6, and when I run pip install mordecai I keep getting a list of errors saying "Failed building wheel for ujson", "Failed building wheel for preshed", "Failed building wheel for cymem", etc.
I read here (https://stackoverflow.com/questions/43370851/failed-building-wheel-for-spacy) that I need to install spacy first, but when I pip install spacy I again get the same "Failed building wheel" errors for ujson, preshed, cymem, etc.
What should I do?
Lots of sentences with geographic information are structured like "X, a town 30 km south of Y", or "X, a neighborhood in Y". In both cases, we want to:
Neither MITIE's binary relation detection nor Freebase have this. We could use parse info, but that would be tricky and require lots of labeled examples. Thoughts?
Switch from MITIE to spaCy for NER and word embeddings. The advantages of spaCy include:
Remove dependencies to Caerus infrastructure. These include:
mitie-py wrapper from Caerus Github

I was confronted with a connection error in the updated Mordecai. I did not have this issue with the previous Mordecai. I will be thankful if you can help me solve this error.
mordecai_1 | WARNING:elasticsearch:GET http://localhost:9200/_search [status:N/A request:0.000s]
mordecai_1 | Traceback (most recent call last):
mordecai_1 | File "/usr/local/lib/python2.7/dist-packages/elasticsearch/connection/http_urllib3.py", line 74, in perform_request
mordecai_1 | response = self.pool.urlopen(method, url, body, retries=False, headers=self.headers, **kw)
mordecai_1 | File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 597, in urlopen
mordecai_1 | _stacktrace=sys.exc_info()[2])
mordecai_1 | File "/usr/local/lib/python2.7/dist-packages/urllib3/util/retry.py", line 222, in increment
mordecai_1 | raise six.reraise(type(error), error, _stacktrace)
mordecai_1 | File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 544, in urlopen
mordecai_1 | body=body, headers=headers)
mordecai_1 | File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 349, in _make_request
mordecai_1 | conn.request(method, url, **httplib_request_kw)
mordecai_1 | File "/usr/lib/python2.7/httplib.py", line 979, in request
mordecai_1 | self._send_request(method, url, body, headers)
mordecai_1 | File "/usr/lib/python2.7/httplib.py", line 1013, in _send_request
mordecai_1 | self.endheaders(body)
mordecai_1 | File "/usr/lib/python2.7/httplib.py", line 975, in endheaders
mordecai_1 | self._send_output(message_body)
mordecai_1 | File "/usr/lib/python2.7/httplib.py", line 835, in _send_output
mordecai_1 | self.send(msg)
mordecai_1 | File "/usr/lib/python2.7/httplib.py", line 797, in send
mordecai_1 | self.connect()
mordecai_1 | File "/usr/local/lib/python2.7/dist-packages/urllib3/connection.py", line 155, in connect
mordecai_1 | conn = self._new_conn()
mordecai_1 | File "/usr/local/lib/python2.7/dist-packages/urllib3/connection.py", line 134, in _new_conn
mordecai_1 | (self.host, self.port), self.timeout, **extra_kw)
mordecai_1 | File "/usr/local/lib/python2.7/dist-packages/urllib3/util/connection.py", line 88, in create_connection
mordecai_1 | raise err
mordecai_1 | ProtocolError: ('Connection aborted.', error(111, 'Connection refused'))
The deterministic ranking function sometimes hits an IndexError:
/Users/ahalterman/MIT/Geolocation/mordecai/mordecai/geoparse.py in <listcomp>(.0)
779 ranks = ranks[::-1]
780 # sort the list of dicts according to ranks
--> 781 sorted_meta = [meta[r] for r in ranks]
782 sorted_X = X[ranks]
783 return (sorted_X, sorted_meta)
IndexError: list index out of range
This will reproduce:
geo.geoparse("In early 1938, the Prime Minister cut grants-in-aid to the provinces, effectively killing the relief project scheme. Premier Thomas Dufferin Pattullo closed the projects in April, claiming that British Columbia could not shoulder the burden alone. Unemployed men again flocked to Vancouver to protest government insensitivity and intransigence to their plight. The RCPU organized demonstrations and tin-canning (organized begging) in the city. Under the guidance of twenty-six-year-old Steve Brodie, the leader of the Youth Division who had cut his activist teeth during the 1935 relief camp strike, protesters occupied Hotel Georgia, the Vancouver Art Gallery (then located at 1145 West Georgia Street), and the main post office (now the Sinclair Centre).")
Rather than being fundamentally a (Dockerized) REST service like version 1 was, version 2 will be an installable Python package, with a separate repo that packages and runs it as a service.
This bug is very weird. This sentence runs fine:
geo.geoparse("Congress and in the legislatures of Alabama, California, Florida, and Missouri.")
...but fails when you replace "Missouri" with "Michigan":
geo.geoparse("Congress and in the legislatures of Alabama, California, Florida, and Michigan.")
It fails here with a KeyError.
With the new setup of the CountryAPI() and PlacesAPI() and the reorganization of the directories the tests aren't running correctly right now. Fix that and make it clear how to run them from inside the container:
sudo docker exec -it d1fc3f40c4a6d05 bash
cd resources
py.test
Currently, Mordecai is a document geocoder, taking in text, extracting location names, and associating them with their geographic coordinates. As it gets incorporated into the Petrarch pipeline, it would be good to have an explicit endpoint for geocoding Petrarch-produced events, rather than whole documents. This would mostly be useful for geocoding events in sentences that have two or more locations (figure out how many that is).
/country's dependence on word2vec for picking the focus country of an article means that it performs poorly on countries that don't have much news coverage and thus don't have very distinct word embeddings. This problem has come up especially for West African states.
Part of the design of Mordecai is to accommodate non-English languages with a minimum of modification and training.
MITIE already has a pre-made Spanish NER model, so all we need for a Spanish-language Mordecai is Spanish word2vec.
Line 209 attempts to return the JSON-formatted variable locations, but it is not defined within the scope of the function. Not sure what exactly the variable holds when there is a ValueError.
Add required packages to setup.py so they get installed when using pip.
For some reason, US states seem to be missing from the Geonames/Elasticsearch index: when looking for Ohio by geonames id,
curl -XGET localhost:9200/geonames/_search?q="geonameid":5165418
returns nothing (that search format works for other entries). But
cat allCountries.txt | grep 5165418
shows that it's in the allCountries.txt file. I'll try rebuilding the index...
Mordecai is too slow to use right now for a corpus of ~70 million documents. Based on my testing, it seems that the Elasticsearch queries are responsible for the great majority of time, with the model calls being a distant second. Explore ways to speed up these queries or to use threading to make many calls at once, since ES can handle hundreds of concurrent requests.
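One way to explore the threading idea is a standard-library thread pool, which works well for I/O-bound calls like ES queries. A toy sketch (`query_geonames` here is a stand-in stub, not the real mordecai method):

```python
from concurrent.futures import ThreadPoolExecutor

def query_geonames(placename):
    # stand-in for the real Elasticsearch lookup; hypothetical return shape
    return {"place": placename, "hits": []}

placenames = ["Lagos", "Moscow", "Chechnya", "St. Petersburg"]

# map() preserves input order while running lookups concurrently
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(query_geonames, placenames))
```

Since the GIL is released during network I/O, threads should overlap the ES round-trips even without multiprocessing.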
Timing overview:
Geoparse:
322602.0 # total
135417.0 42% # infer_country
77966.0 24% # query_geonames_country
6787.0 2% # features for rank
77908.0 24% # rank_model.predict
infer_country:
81817.0 # total
27146.0 33% # nlp()
47713.0 58% # make_country_features
6102.0 7% # country_model.predict
643.0 .7% # make_country_matrix
make_country_features:
396670.0 # total
1575.0 # _feature_word_embedding
361902.0 91% # query_geonames
Line-by-line:
Wrote profile results to time_mordecai.py.lprof
Timer unit: 1e-06 s
Total time: 0.43051 s
File: time_mordecai.py
Function: make_country_features at line 14
Line # Hits Time Per Hit % Time Line Contents
==============================================================
14 @profile
15 def make_country_features(geo, require_maj = False):
16 1 51117.0 51117.0 11.9 doc = nlp("Thousands of Nigerians from throughout the country were converging Thursday for a rally in Lagos to protest the rights violations under the recently imposed Sharia law by Islamic fundamentalists in the northern district of Borno.")
17 1 26.0 26.0 0.0 if not hasattr(doc, "ents"):
18 doc = nlp(doc)
19 # initialize the place to store finalized tasks
20 1 2.0 2.0 0.0 task_list = []
21
22 # get document vector
23 #doc_vec = self._feature_word_embedding(text)['country_1']
24
25 # get explicit counts of country names
26 1 136.0 136.0 0.0 ct_mention, ctm_count1, ct_mention2, ctm_count2 = geo._feature_country_mentions(doc)
27
28 # now iterate through the entities, skipping irrelevant ones and countries
29 8 23.0 2.9 0.0 for ent in doc.ents:
30 7 42.0 6.0 0.0 if not ent.text.strip():
31 continue
32 7 15.0 2.1 0.0 if ent.label_ not in ["GPE","LOC","FAC"]:
33 5 5.0 1.0 0.0 continue
34 # don't include country names (make a parameter)
35 2 10.0 5.0 0.0 if ent.text.strip() in geo._skip_list:
36 continue
37
38 ## just for training purposes
39 #if ent.text.strip() in self._just_cts.keys():
40 # continue
41
42 #skip_list.add(ent.text.strip())
43 2 3.0 1.5 0.0 ent_label = ent.label_ # destroyed by trimming
44 2 124.0 62.0 0.0 ent = geo.clean_entity(ent)
45
46 # vector for just the solo word
47 2 879.0 439.5 0.2 vp = geo._feature_word_embedding(ent)
48 2 5.0 2.5 0.0 try:
49 2 4.0 2.0 0.0 word_vec = vp['country_1']
50 2 6.0 3.0 0.0 wv_confid = float(vp['confid_a'])
51 except TypeError:
52 # no idea why this comes up
53 word_vec = ""
54 wv_confid = "0"
55
56 # look for explicit mentions of feature names
57 2 115.0 57.5 0.0 class_mention, code_mention = geo._feature_location_type_mention(ent)
58
59 ##### ES-based features
60 2 2.0 1.0 0.0 try:
61 2 377075.0 188537.5 87.6 result = geo.query_geonames(ent.text)
62 except ConnectionTimeout:
63 result = ""
64
65 # build results-based features
66 2 155.0 77.5 0.0 most_alt = geo._feature_most_alternative(result)
67 2 117.0 58.5 0.0 most_common = geo._feature_most_common(result)
68 2 132.0 66.0 0.0 most_pop = geo._feature_most_population(result)
69 2 10.0 5.0 0.0 first_back, second_back = geo._feature_first_back(result)
70
71 2 3.0 1.5 0.0 try:
72 2 3.0 1.5 0.0 maj_vote = Counter([word_vec, most_alt,
73 2 3.0 1.5 0.0 first_back, most_pop,
74 2 48.0 24.0 0.0 ct_mention
75 #doc_vec_sent, doc_vec
76 2 3.0 1.5 0.0 ]).most_common()[0][0]
77 except Exception as e:
78 print("Problem taking majority vote: ", ent, e)
79 maj_vote = ""
80
81
82 2 3.0 1.5 0.0 if not maj_vote:
83 maj_vote = ""
84
85 # We only want all this junk for the labeling task. We just want to straight to features
86 # and the model when in production.
87
88 2 2.0 1.0 0.0 try:
89 2 56.0 28.0 0.0 start = ent.start_char - ent.sent.start_char
90 2 16.0 8.0 0.0 end = ent.end_char - ent.sent.start_char
91 2 4.0 2.0 0.0 iso_label = maj_vote
92 2 2.0 1.0 0.0 try:
93 2 4.0 2.0 0.0 text_label = geo._inv_cts[iso_label]
94 except KeyError:
95 text_label = ""
96
97 2 274.0 137.0 0.1 task = {"text" : ent.sent.text,
98 2 3.0 1.5 0.0 "label" : text_label, # human-readable country name
99 2 9.0 4.5 0.0 "word" : ent.text,
100 "spans" : [{
101 2 2.0 1.0 0.0 "start" : start,
102 2 3.0 1.5 0.0 "end" : end,
103 } # make sure to rename for Prodigy
104 ],
105 "features" : {
106 2 2.0 1.0 0.0 "maj_vote" : iso_label,
107 2 3.0 1.5 0.0 "word_vec" : word_vec,
108 2 2.0 1.0 0.0 "first_back" : first_back,
109 #"doc_vec" : doc_vec,
110 2 2.0 1.0 0.0 "most_alt" : most_alt,
111 2 3.0 1.5 0.0 "most_pop" : most_pop,
112 2 2.0 1.0 0.0 "ct_mention" : ct_mention,
113 2 29.0 14.5 0.0 "ctm_count1" : ctm_count1,
114 2 4.0 2.0 0.0 "ct_mention2" : ct_mention2,
115 2 3.0 1.5 0.0 "ctm_count2" : ctm_count2,
116 2 3.0 1.5 0.0 "wv_confid" : wv_confid,
117 2 3.0 1.5 0.0 "class_mention" : class_mention, # inferred geonames class from mentions
118 2 7.0 3.5 0.0 "code_mention" : code_mention,
119 #"places_vec" : places_vec,
120 #"doc_vec_sent" : doc_vec_sent
121 } }
122 2 5.0 2.5 0.0 task_list.append(task)
123 except Exception as e:
124 print(ent.text,)
125 print(e)
126 1 1.0 1.0 0.0 return task_list # rename this var
Total time: 0.489834 s
File: time_mordecai.py
Function: infer_country at line 129
Line # Hits Time Per Hit % Time Line Contents
==============================================================
129 @profile
130 def infer_country(geo):
131 1 39311.0 39311.0 8.0 doc = nlp("At a public bathhouse in the Moqor district of Ghazni Province, they met the commander of a small pro-government militia. ")
132 1 19.0 19.0 0.0 if not hasattr(doc, "ents"):
133 doc = nlp(doc)
134 1 443189.0 443189.0 90.5 proced = geo.make_country_features(doc, require_maj=False)
135 1 2.0 2.0 0.0 if not proced:
136 pass
137 # logging!
138 #print("Nothing came back from make_country_features")
139 1 1.0 1.0 0.0 feat_list = []
140 #proced = self.ent_list_to_matrix(proced)
141
142 3 5.0 1.7 0.0 for loc in proced:
143 2 537.0 268.5 0.1 feat = geo.make_country_matrix(loc)
144 #labels = loc['labels']
145 2 4.0 2.0 0.0 feat_list.append(feat)
146 #try:
147 # for each potential country...
148 5 13.0 2.6 0.0 for n, i in enumerate(feat_list):
149 3 4.0 1.3 0.0 labels = i['labels']
150 3 4.0 1.3 0.0 try:
151 3 6602.0 2200.7 1.3 prediction = geo.country_model.predict(i['matrix']).transpose()[0]
152 3 34.0 11.3 0.0 ranks = prediction.argsort()[::-1]
153 3 72.0 24.0 0.0 labels = np.asarray(labels)[ranks]
154 3 9.0 3.0 0.0 prediction = prediction[ranks]
155 except ValueError:
156 prediction = np.array([0])
157 labels = np.array([""])
158
159 2 16.0 8.0 0.0 loc['country_predicted'] = labels[0]
160 2 6.0 3.0 0.0 loc['country_conf'] = prediction[0]
161 2 3.0 1.5 0.0 loc['all_countries'] = labels
162 2 2.0 1.0 0.0 loc['all_confidence'] = prediction
163
164 1 1.0 1.0 0.0 return proced
Total time: 1.55258 s
File: time_mordecai.py
Function: test_geoparse at line 168
Line # Hits Time Per Hit % Time Line Contents
==============================================================
168 @profile
169 def test_geoparse(geo):
170 1 6.0 6.0 0.0 text = "Police in Russia have continued to violate human rights, especially in the region of Chechnya and in the cities of Moscow and St. Petersburg."
171 1 43017.0 43017.0 2.8 doc = nlp(text)
172 1 928033.0 928033.0 59.8 proced = geo.infer_country(doc)
173 5 10.0 2.0 0.0 for loc in proced:
174 4 103.0 25.8 0.0 if loc['country_conf'] >= geo.country_threshold: # shrug
175 4 475823.0 118955.8 30.6 res = geo.query_geonames_country(loc['word'], loc['country_predicted'])
176 elif loc['country_conf'] < geo.country_threshold:
177 res = ""
178 # if the confidence is too low, don't use the country info
179 4 15.0 3.8 0.0 try:
180 4 5.0 1.2 0.0 _ = res['hits']['hits']
181 # If there's no geonames result, what to do?
182 # For now, just continue.
183 # In the future, delete? Or add an empty "loc" field?
184 except (TypeError, KeyError):
185 continue
186 # Pick the best place
187 4 5871.0 1467.8 0.4 X, meta = geo.features_for_rank(loc, res)
188 4 15.0 3.8 0.0 if X.shape[1] == 0:
189 # This happens if there are no results...
190 continue
191 4 890.0 222.5 0.1 all_tasks, sorted_meta, sorted_X = geo.format_for_prodigy(X, meta, loc['word'], return_feature_subset=True)
192 4 1165.0 291.2 0.1 fl_pad = np.pad(sorted_X, ((0, 4 - sorted_X.shape[0]), (0, 0)), 'constant')
193 4 32.0 8.0 0.0 fl_unwrap = fl_pad.flatten()
194 4 97406.0 24351.5 6.3 prediction = geo.rank_model.predict(np.asmatrix(fl_unwrap))
195 4 136.0 34.0 0.0 place_confidence = prediction.max()
196 4 42.0 10.5 0.0 loc['geo'] = sorted_meta[prediction.argmax()]
197 4 12.0 3.0 0.0 loc['place_confidence'] = place_confidence
We were using CircleCI before, but it can't handle the Docker + volume setup for elasticsearch/geonames. Travis has no trouble with this, so need to switch over.
I'm considering breaking changes to the API (as part of a potential large release).
Lagos -> Intercontinental Lagos
Mogadishu -> Mogadishu University
Vatican -> Vatican Museums
Mannheim -> Kraftwerk Mannheim
Noida -> Dreams Inn Greater Noida
Rajasthan -> Rajasthan Desert Safari
etc.
Could do this by increasing the boost on the ascii name field, by doing some edit distance thing, or by boosting inhabited places over other geographic features.
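For illustration, the boosting ideas might translate into a query body like this (a sketch: the field names follow the Geonames schema, and the boost weights are placeholders to tune):

```python
# Hypothetical ES query body: boost exact ascii-name matches and
# populated places over other geographic features.
boosted_query = {
    "query": {
        "bool": {
            "should": [
                {"match": {"asciiname": {"query": "Lagos", "boost": 3.0}}},
                {"match": {"alternativenames": "Lagos"}},
                # feature class "P" = populated place in Geonames
                {"term": {"feature_class": {"value": "P", "boost": 2.0}}},
            ]
        }
    }
}
```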
Hey everyone,
I just installed mordecai. Everything ran smoothly until I needed to launch mordecai.
$ python3
>>> from mordecai import Geoparser
Using TensorFlow backend.
>>> geo = Geoparser()
GET http://localhost:9200/geonames/_count [status:404 request:0.011s]
Traceback (most recent call last):
File "/home/morty/.local/lib/python3.6/site-packages/mordecai/geoparse.py", line 56, in __init__
self.conn.count()
File "/home/morty/.local/lib/python3.6/site-packages/elasticsearch_dsl/search.py", line 587, in count
**self._params
File "/home/morty/.local/lib/python3.6/site-packages/elasticsearch/client/utils.py", line 73, in _wrapped
return func(*args, params=params, **kwargs)
File "/home/morty/.local/lib/python3.6/site-packages/elasticsearch/client/__init__.py", line 1123, in count
doc_type, '_count'), params=params, body=body)
File "/home/morty/.local/lib/python3.6/site-packages/elasticsearch/transport.py", line 312, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/home/morty/.local/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 128, in perform_request
self._raise_error(response.status, raw_data)
File "/home/morty/.local/lib/python3.6/site-packages/elasticsearch/connection/base.py", line 125, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.NotFoundError: TransportError(404, 'index_not_found_exception', 'no such index')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/morty/.local/lib/python3.6/site-packages/mordecai/geoparse.py", line 61, in __init__
"for instructions on setting up Geonames/Elasticsearch")
ConnectionError: [Errno Could not establish contact with Elasticsearch at localhost on port 9200. Are you sure it's running?
] Mordecai needs access to the Geonames/Elasticsearch gazetteer to function.: 'See https://github.com/openeventdata/mordecai#installation-and-requirements'
With a docker run I get:
e9227f1162e3 elasticsearch:5.5.2 "/docker-entrypoint.…" 4 minutes ago Up 4 minutes 127.0.0.1:9200->9200/tcp, 9300/tcp serene_khorana
and I can access http://localhost:9200 with my browser
When I try to go to http://localhost:9200/geonames I get a 404
Any leads on where/what to look at?
For some text such as the following text:
text= '''Santa Cruz is a first class municipality in the province of Davao del Sur, Philippines. It has a population of 81,093 people as of 2010. The Municipality of Santa Cruz is part of Metropolitan Davao. Santa Cruz is politically subdivided into 18 barangays. Of the 18 barangays, 7 are uplands, 9 are upland-lowland and coastal and 2 are lowland-coastal. Pista sa Kinaiyahan A yearly activity conducted every last week of April as a tribute to the Mother Nature through tree-growing, cleanup activities and Boulder Face challenge. Araw ng Santa Cruz It is celebrated every October 5 in commemoration of the legal creation of the municipality in 1884. Highlights include parades, field demonstrations, trade fairs, carnivals and traditional festivities. Sinabbadan Festival A festival of ethnic ritual and dances celebrated every September. Santa Cruz is accessible by land transportation vehicles plying the Davao-Digos City, Davao-Kidapawan City, Davao-Cotabato City, Davao-Koronadal City and Davao-Tacurong City routes passing through the town's single, 27 kilometres (17 mi) stretch of national highway that traverses its 11 barangays. From Davao City, the administrative center of Region XI, it is 38 kilometres (24 mi) away within a 45-minute ride, while it is 16 kilometres (9.9 mi) or about 15-minute ride from provincial capital city of Digos.
... '''
I got this error:
out = geo.geoparse(text)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/dist-packages/mordecai/geoparse.py", line 955, in geoparse
all_tasks, sorted_meta, sorted_X = self.format_for_prodigy(X, meta, loc['word'], return_feature_subset=True)
File "/usr/local/lib/python3.5/dist-packages/mordecai/geoparse.py", line 813, in format_for_prodigy
sorted_X, sorted_meta = self.ranker(X, meta)
File "/usr/local/lib/python3.5/dist-packages/mordecai/geoparse.py", line 781, in ranker
sorted_meta = [meta[r] for r in ranks]
File "/usr/local/lib/python3.5/dist-packages/mordecai/geoparse.py", line 781, in <listcomp>
sorted_meta = [meta[r] for r in ranks]
IndexError: list index out of range
Hello,
Is there any plan to make mordecai a more generalized solution that can use different NER libraries? For example, https://github.com/Hironsan/anago. Maybe some wrapper around the libraries could be used.
Could you also explain models used in Keras for parsing? (which features and labels are used, etc.)
Hi, I typed docker pull elasticsearch:5.5.2 as per the instructions and got the following response:
5.5.2: Pulling from library/elasticsearch
Digest: sha256:3686a5757ed46c9dbcf00f6f71fce48ffc5413b193a80d1c46a21e7aad4c53ad
Status: Image is up to date for elasticsearch:5.5.2
But then when I type "wget https://s3.amazonaws.com/ahalterman-geo/geonames_index.tar.gz --output-file=wget_log.txt" I got "-bash: wget: command not found". Did I miss something?
Version 2 is pretty slow. Explore speedups, including:
This exception class
https://github.com/openeventdata/mordecai/blob/master/mordecai/geoparse.py#L481
is not imported in the module.
This breaks exception handling: the reported error becomes a NameError because of the missing import.
When I run batch_geoparse on a short list of strings, I get 0s for all country_conf values and no 'geo' keys/values.
from mordecai import Geoparser
geo = Geoparser()
ss=["I traveled from Oxford to Ottawa.","I visited New Orleans."]
geo.batch_geoparse(ss)
[[{'word': 'Oxford',
'spans': [{'start': 0, 'end': 6}],
'country_predicted': '',
'country_conf': 0},
{'word': 'Ottawa',
'spans': [{'start': 0, 'end': 6}],
'country_predicted': '',
'country_conf': 0}],
[{'word': 'New Orleans',
'spans': [{'start': 0, 'end': 11}],
'country_predicted': '',
'country_conf': 0}]]
Likewise, I've been unable to finish running geoparse on a single string. I've tried running the example usage code (i.e. geo.geoparse("I traveled from Oxford to Ottawa.")), but it produces no output after >30 minutes.
Any thoughts about what might be causing this behavior?
FYI, I'm running this code on a Mac laptop with an i7 processor and 16GB of RAM. When installing mordecai and elasticsearch, I copied and pasted the directions in the README.md, and I'm running all code from within the same directory.
Right now, Mordecai calculates features like the number of results per country or the result with the most alternative names from the first 50 results that come back from the search. This isn't ideal, because it involves bringing all 50 results back, which is inefficient, and allows us to only look at the top 50.
Instead, use ES's aggregations to calculate that stuff directly on the server. I have this working for the number of results per country, but calculating the number of alternative names will require modifying the index to include that information.
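The per-country count mentioned above might look roughly like this as a server-side terms aggregation (a sketch: the field names are assumptions about the index mapping):

```python
# Hypothetical ES request body: count matching results per country on the
# server instead of pulling back 50 hits and counting client-side.
agg_body = {
    "size": 0,  # skip returning hits; we only want the aggregation
    "query": {"match": {"name": "Santa Cruz"}},
    "aggs": {
        "by_country": {"terms": {"field": "country_code3", "size": 20}}
    },
}
```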
Most importantly, making these changes will require retraining the model, since the values of the features will be different (and hopefully better).
I just tried pip install mordecai
on a new machine and it doesn't seem to install the spaCy language model:
OSError: Can't find model 'en_core_web_lg'
The model can be downloaded separately like this:
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.0.0/en_core_web_lg-2.0.0.tar.gz
The longer-term solution is to figure out how to change setup.py to download it automatically. This line seems not to be working.
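Until that works, a lightweight check at import time can at least produce a clearer error than the bare OSError (a sketch; `model_installed` is a hypothetical helper, not part of mordecai):

```python
import importlib.util

def model_installed(name="en_core_web_lg"):
    """Return True if the spaCy model package is importable."""
    return importlib.util.find_spec(name) is not None

if not model_installed():
    print("Missing spaCy model; run: python -m spacy download en_core_web_lg")
```

This relies on the fact that pip-installed spaCy models are ordinary importable packages.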
Hi,
I'm trying to perform geoparsing on several servers; all run mordecai fine except one (stack trace attached). The servers are identical, the only difference being that mongodb also runs on the server where mordecai won't run. I tried changing import orders but couldn't get far with solving it. Please let me know your thoughts on how I should proceed.
stacktrace.txt
Regards,
Aswin
Similar result as #62. When batch_geoparse is run, it returns empty country prediction information. However, if any geoparse call is run before batch_geoparse, then batch_geoparse behaves as expected.
When the re-write gets further along, the documentation will need to be re-written from almost scratch. I'd like it to be much shorter and straightforward than the current docs, with more complicated stuff moved to readthedocs.
If you start Mordecai without Geonames-ES running or it goes down later while running, it should issue a nice transparent exception.
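A fail-fast startup check along these lines could look like this (a sketch with assumed host/port defaults and a hypothetical helper name, not mordecai's actual code):

```python
import socket

def check_gazetteer(host="localhost", port=9200, timeout=2.0):
    """Raise a clear error if the Geonames/Elasticsearch index is unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        raise ConnectionError(
            "Mordecai needs the Geonames/Elasticsearch gazetteer at "
            "%s:%d, but nothing is listening there. Is the container running?"
            % (host, port)
        )
```

The same check could be re-run inside query methods so that a gazetteer that goes down mid-run produces the same transparent message instead of a raw urllib3 traceback.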
Update the v2 API before merging:
- Geoparser(): remove the Geoparse vs. geoparse method split
- doc_to_guess --> infer_country, plus change the result-picking names
Then, add documentation to the readme on Geoparser's configuration options.
Is the Dockerfile for the openeventdata/es-geonames image publicly available in a github repo?
I'm getting an ImportError when I run app.py after running docker-compose up -d. Any idea how to fix this? I'm using a Mac.
All spans returned by Mordecai begin at 0, which is incorrect. E.g.
geo.geoparse("""The state of North Rhine-Westphalia ranks first in population among
German states for both Roman Catholics and Protestants.""")
{...
'spans': [{'end': 22, 'start': 0}],
...}
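As a stopgap on the caller's side, the correct span can be recovered by locating the returned word in the input text (a workaround sketch, not the fix inside mordecai; it takes the first occurrence, which is ambiguous for repeated names):

```python
# Recompute a character span from the extracted word and the original text.
text = ("The state of North Rhine-Westphalia ranks first in population among "
        "German states for both Roman Catholics and Protestants.")
word = "North Rhine-Westphalia"
start = text.find(word)
span = {"start": start, "end": start + len(word)}
```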
Andy, does the new version of Mordecai work on different languages?