babelscape / crocodile
cRocoDiLe is a dataset extraction tool for Relation Extraction using Wikipedia and Wikidata, presented in REBEL (EMNLP 2021).
I tried to run the extract_lan.sh script, but I got an error that the file 'wikidata/wikidata-triples.csv' could not be found.
I'm not sure where this file is created, or whether it should be downloaded.
It would be great if you could point me in the right direction.
I have a question!
Do you also support the Korean language?
Sorry, I don't really understand the workflow of cRocoDiLe. How is the relation type of a triplet extracted from Wikidata once two named entities are caught in a sentence? By recognizing the predicate word?
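If I understand the pipeline correctly, the predicate is not recognized from the text at all: once both entities in a sentence are linked to Wikidata IDs, the pair is looked up in the pre-extracted triples table, and whatever property connects them becomes the relation label. A minimal sketch of that idea (the names and the in-memory dict are illustrative, not the actual crocodile code):

```python
# Hedged sketch: in the real pipeline the triples come from the Wikidata
# dump (e.g. a triples CSV), not from a hand-written dict like this one.
triples = {
    # (subject_qid, object_qid) -> Wikidata property id
    ("Q64", "Q183"): "P17",    # Berlin -> Germany : country
    ("Q937", "Q183"): "P27",   # Einstein -> Germany : country of citizenship
}

def relation_for(subj_qid: str, obj_qid: str):
    """Return the Wikidata property linking two entities, if any."""
    return triples.get((subj_qid, obj_qid))

print(relation_for("Q64", "Q183"))  # P17
```

So no predicate word is matched; the sentence is only used as evidence that the two linked entities co-occur, and the relation comes from the knowledge graph.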
Hi, I've tried to execute the extract_lan.sh script now, but it results in this error:
downloading wikipedia and wikidata dumps...
2022-01-25 16:29:34,692 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-page.sql.gz] to [data/de/dewiki-latest-page.sql.gz]
2022-01-25 16:30:36,991 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-page_props.sql.gz] to [data/de/dewiki-latest-page_props.sql.gz]
2022-01-25 16:30:55,206 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-redirect.sql.gz] to [data/de/dewiki-latest-redirect.sql.gz]
Create wikidata database
2022-01-25 16:31:00,687 - wikimapper.processor - INFO - Creating index for [dewiki-latest] in [data/de/index_dewiki-latest.db]
2022-01-25 16:31:00,697 - wikimapper.processor - INFO - Parsing pages dump
2022-01-25 16:31:51,875 - wikimapper.processor - INFO - Creating database index on 'wikipedia_title'
2022-01-25 16:31:55,962 - wikimapper.processor - INFO - Parsing page properties dump
2022-01-25 16:32:22,760 - wikimapper.processor - INFO - Parsing redirects dump
2022-01-25 16:32:45,246 - wikimapper.processor - INFO - Creating database index on 'wikidata_id'
Extract abstracts
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/crocodile/wikiextractor/wikiextractor/WikiExtractor.py", line 656, in <module>
main()
File "/content/crocodile/wikiextractor/wikiextractor/WikiExtractor.py", line 652, in main
args.compress, args.processes, args.escape_doc)
File "/content/crocodile/wikiextractor/wikiextractor/WikiExtractor.py", line 297, in process_dump
input = decode_open(input_file)
File "/content/crocodile/wikiextractor/wikiextractor/WikiExtractor.py", line 277, in decode_open
return bz2.open(filename, mode=mode, encoding=encoding)
File "/usr/lib/python3.7/bz2.py", line 318, in open
binary_file = BZ2File(filename, bz_mode, compresslevel=compresslevel)
File "/usr/lib/python3.7/bz2.py", line 92, in __init__
self._fp = _builtin_open(filename, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'data/de/dewiki-latest-pages-articles-multistream.xml.bz2'
As the output shows, wikimapper downloads 4 files into the data/de/ directory, but none of them is a bzipped XML file :(
Any ideas?
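For what it's worth, the missing file matches the naming scheme of the main pages-articles dump on dumps.wikimedia.org, which wikimapper does not fetch (it only downloads the page/page_props/redirect SQL dumps). A small helper to build the expected URL, assuming the standard dump layout:

```python
# Sketch: construct the URL of the pages-articles dump that the script
# expects under data/<lang>/. Assumes the standard dumps.wikimedia.org
# layout; download the result with wget or urllib.request.urlretrieve.
def dump_url(lang: str) -> str:
    name = f"{lang}wiki-latest-pages-articles-multistream.xml.bz2"
    return f"https://dumps.wikimedia.org/{lang}wiki/latest/{name}"

print(dump_url("de"))
# https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles-multistream.xml.bz2
```

Placing the downloaded file in data/de/ should let the "Extract abstracts" step find it.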
Hi! Firstly, thanks for publishing your research implementation!
I'm just missing the files in ./wikiextractor as described in the setup instructions:
For ./wikiextractor we use a submodule which is a fork of the original wikiextractor that implements wikimapper to extract the Wikidata entities.
Maybe the submodule is missing?
Hi, I've tried to execute the extract_lan.sh script, but it results in the same error.
I already added that line to the download.py script, but it doesn't work.
Maybe it needs to be added somewhere else?
Hello,
First of all, I want to thank you for sharing this script with the community.
I'm trying to regenerate the REBEL dataset.
By using python -m wikiextractor.wikiextractor.WikiExtractor data/$1/$1wiki-latest-pages-articles-multistream.xml.bz2 --links --language $1 --output text/$1 --templates data/$1/templates.txt, I get the page articles.
The Wikidata entities are described by their links (--links), but wikidata-triplets.py uses Wikidata IDs.
How did you turn the links into IDs?
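Not the author, but wikimapper (used earlier in the pipeline to build the index .db) can resolve a Wikipedia title to its Wikidata QID via WikiMapper(...).title_to_id(title). A sketch of mapping the link targets, where the dict stands in for the wikimapper index, and assuming links survive as [[Title]] wikilinks (adapt the regex if the fork emits HTML anchors instead):

```python
import re

# Stand-in for wikimapper's title->QID index. In the real pipeline you would
# use WikiMapper("data/de/index_dewiki-latest.db").title_to_id(title).
title_to_qid = {"Berlin": "Q64", "Deutschland": "Q183"}

def link_qids(text: str):
    """Extract [[Target]] / [[Target|anchor]] link targets and map to QIDs."""
    titles = re.findall(r"\[\[([^|\]]+)(?:\|[^\]]*)?\]\]", text)
    return [title_to_qid.get(t) for t in titles]

print(link_qids("[[Berlin]] liegt in [[Deutschland]]."))  # ['Q64', 'Q183']
```

Redirects are handled by the index too, since wikimapper parses the redirect dump when creating the database.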
I created the German corpus and noticed that dates were not extracted. For the English corpus it seems to work, but I am not sure why.
So I started debugging the German case:
The regex works fine and finds the dates. The problem seems to arise when calling dateparser.parse.
This function expects a date_formats variable, which is not passed. Since it is None, the following get_date_data() call throws an exception, because parse() or parse_with_formats() expects date_formats.
I think date_formats could be:
DEFAULT_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
date_formats = [
    DEFAULT_DATE_FORMAT,            # 2010-09-21 09:27:01 => SQLite + MySQL
    "%Y-%m-%dT%H:%M:%SZ",           # 2010-09-20T09:27:01Z => Bing
    "%a, %d %b %Y %H:%M:%S +0000",  # Fri, 21 Sep 2010 09:27:01 +0000 => Twitter
    "%a %b %d %H:%M:%S +0000 %Y",   # Fri Sep 21 09:21:01 +0000 2010 => Twitter
    "%Y-%m-%dT%H:%M:%S+0000",       # 2010-09-20T09:27:01+0000 => Facebook
    "%Y-%m-%d %H:%M",               # 2010-09-21 09:27
    "%Y-%m-%d",                     # 2010-09-21
    "%d/%m/%Y",                     # 21/09/2010
    "%d %B %Y",                     # 21 September 2010
    "%d %b %Y",                     # 21 Sep 2010
    "%B %d %Y",                     # September 21 2010
    "%B %d, %Y",                    # September 21 2010
]
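As a sanity check that formats like these would cover the dates, they can be tried first-match-wins with the stdlib. This is only a sketch of the mechanism; in the actual fix the list would be handed to dateparser via its date_formats argument rather than parsed manually:

```python
from datetime import datetime

# Illustrative subset of the candidate formats from the list above.
date_formats = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%d/%m/%Y", "%d %B %Y"]

def try_parse(value: str):
    """Return the first successful parse over the candidate formats,
    mirroring what a first-match pass over date_formats would do."""
    for fmt in date_formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None

print(try_parse("21 September 2010"))  # 2010-09-21 00:00:00
```

Note that %B/%b month names are locale-dependent in strptime, which might also explain why extraction behaves differently for German and English dates.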