
crocodile's People

Contributors

littlepea13


crocodile's Issues

No such file or directory: 'wikidata/wikidata-triples.csv'

I tried to run the extract_lan.sh script, but I got an error saying that the file 'wikidata/wikidata-triples.csv' could not be found.
I'm not sure where this file is created, or whether it should be downloaded.

It would be great if you could point me in the right direction.
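
For reference, here is a rough sketch of how I imagine such a triples file could be produced from the raw Wikidata JSON dump, assuming it is meant to hold (subject, property, object) rows of entity/property IDs. Both the dump path and the column layout are guesses on my part, not the actual pipeline:

# Hypothetical sketch: build wikidata/wikidata-triples.csv from the raw Wikidata
# JSON dump, assuming one (subject, property, object) row per item-to-item statement.
import bz2
import csv
import json

DUMP_PATH = "wikidata/latest-all.json.bz2"   # assumed location of the Wikidata dump
OUT_PATH = "wikidata/wikidata-triples.csv"   # the file extract_lan.sh complains about

with bz2.open(DUMP_PATH, "rt", encoding="utf-8") as dump, \
        open(OUT_PATH, "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    for line in dump:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue  # skip the enclosing JSON array brackets
        entity = json.loads(line)
        subject = entity["id"]
        for prop, claims in entity.get("claims", {}).items():
            for claim in claims:
                snak = claim.get("mainsnak", {})
                value = snak.get("datavalue", {}).get("value")
                # keep only statements whose object is another entity
                if isinstance(value, dict) and "id" in value:
                    writer.writerow([subject, prop, value["id"]])

If this is not how the script expects wikidata/wikidata-triples.csv to be built, it would still help to know which step is supposed to create it.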

Missing input file for wikiextractor in the "extract_lan.sh" script

Hi, I've tried to execute the extract_lan.sh script now, but it results in this error:

downloading wikipedia and wikidata dumps...
2022-01-25 16:29:34,692 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-page.sql.gz] to [data/de/dewiki-latest-page.sql.gz]
2022-01-25 16:30:36,991 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-page_props.sql.gz] to [data/de/dewiki-latest-page_props.sql.gz]
2022-01-25 16:30:55,206 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-redirect.sql.gz] to [data/de/dewiki-latest-redirect.sql.gz]
Create wikidata database
2022-01-25 16:31:00,687 - wikimapper.processor - INFO - Creating index for [dewiki-latest] in [data/de/index_dewiki-latest.db]
2022-01-25 16:31:00,697 - wikimapper.processor - INFO - Parsing pages dump
2022-01-25 16:31:51,875 - wikimapper.processor - INFO - Creating database index on 'wikipedia_title'
2022-01-25 16:31:55,962 - wikimapper.processor - INFO - Parsing page properties dump
2022-01-25 16:32:22,760 - wikimapper.processor - INFO - Parsing redirects dump
2022-01-25 16:32:45,246 - wikimapper.processor - INFO - Creating database index on 'wikidata_id'
Extract abstracts
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/crocodile/wikiextractor/wikiextractor/WikiExtractor.py", line 656, in <module>
    main()
  File "/content/crocodile/wikiextractor/wikiextractor/WikiExtractor.py", line 652, in main
    args.compress, args.processes, args.escape_doc)
  File "/content/crocodile/wikiextractor/wikiextractor/WikiExtractor.py", line 297, in process_dump
    input = decode_open(input_file)
  File "/content/crocodile/wikiextractor/wikiextractor/WikiExtractor.py", line 277, in decode_open
    return bz2.open(filename, mode=mode, encoding=encoding)
  File "/usr/lib/python3.7/bz2.py", line 318, in open
    binary_file = BZ2File(filename, bz_mode, compresslevel=compresslevel)
  File "/usr/lib/python3.7/bz2.py", line 92, in __init__
    self._fp = _builtin_open(filename, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'data/de/dewiki-latest-pages-articles-multistream.xml.bz2'

As the output shows, wikimapper downloads its SQL dumps into the data/de/ directory, but none of them is the compressed XML dump that WikiExtractor expects :(
Any ideas?
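
For anyone hitting the same error: the missing file is the pages-articles multistream dump, which wikimapper does not download. As a workaround I fetched it manually; a minimal Python sketch (the URL follows the usual Wikimedia dump naming scheme, so double-check it for your language):

# Workaround sketch: fetch the pages-articles dump that WikiExtractor expects.
# Assumes the standard Wikimedia dump layout; adjust "de" for other languages.
import os
import urllib.request

lang = "de"
name = f"{lang}wiki-latest-pages-articles-multistream.xml.bz2"
url = f"https://dumps.wikimedia.org/{lang}wiki/latest/{name}"
dest = os.path.join("data", lang, name)

os.makedirs(os.path.dirname(dest), exist_ok=True)
print(f"Downloading {url} to {dest} ...")
urllib.request.urlretrieve(url, dest)  # the file is several GB, so this takes a while

After that, WikiExtractor finds its input file, but I don't know whether the script was supposed to download it in another step.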

Submodule missing?

Hi! Firstly, thanks for publishing your research implementation!
I'm just missing the files in ./wikiextractor as described in the setup instructions:

For ./wikiextractor we use a submodule which is a fork of the original wikiextractor that implements wikimapper to extract the Wikidata entities.

Maybe the submodule is missing?

extract_lan.sh script

Hi, I've tried to execute the extract_lan.sh script, but it results in the same error.
I already added that line to the download.py script, but it doesn't work.
Maybe it needs to be added in another place?

wikidata-triplets.py - 0 entity found

Hello,

First of all, I want to thank you for sharing this script with the community.
I'm trying to regenerate the REBEL dataset.
By running python -m wikiextractor.wikiextractor.WikiExtractor data/$1/$1wiki-latest-pages-articles-multistream.xml.bz2 --links --language $1 --output text/$1 --templates data/$1/templates.txt, I get the page articles.
The Wikidata entities are described by their links (--links), but wikidata-triplets.py uses Wikidata IDs.

How did you turn the links into IDs?
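
In case it helps, my guess is that the mapping goes through the wikimapper index that extract_lan.sh builds (e.g. data/de/index_dewiki-latest.db). Here is a sketch of turning a link target into a Wikidata ID with that index; whether the original code does it exactly this way is an assumption on my part:

# Sketch: map a Wikipedia link target to its Wikidata ID using the
# wikimapper index created earlier by the script.
from wikimapper import WikiMapper

mapper = WikiMapper("data/de/index_dewiki-latest.db")

# Link targets extracted with --links are page titles; spaces become underscores.
title = "Albert Einstein".replace(" ", "_")
wikidata_id = mapper.title_to_id(title)  # e.g. "Q937", or None if the title is unknown
print(title, "->", wikidata_id)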

dateparser does not seem to work

I created the German corpus and noticed that dates were not extracted. For the English corpus it seems to work but I am not sure why.

So I started debugging for the German case:

The regex works fine and finds the dates. The problem seems to arise when calling dateparser.parse.

This function expects a date_formats argument, which is not passed. Since it is None, the subsequent get_date_data() call throws an exception because parse() or parse_with_formats() expects date_formats.

I think date_formats could be:

DEFAULT_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
date_formats = [
    DEFAULT_DATE_FORMAT,           # 2010-09-21 09:27:01  => SQLite + MySQL
    "%Y-%m-%dT%H:%M:%SZ",          # 2010-09-20T09:27:01Z => Bing
    "%a, %d %b %Y %H:%M:%S +0000", # Fri, 21 Sep 2010 09:27:01 +000 => Twitter
    "%a %b %d %H:%M:%S +0000 %Y",  # Fri Sep 21 09:21:01 +0000 2010 => Twitter
    "%Y-%m-%dT%H:%M:%S+0000",      # 2010-09-20T09:27:01+0000 => Facebook
    "%Y-%m-%d %H:%M",              # 2010-09-21 09:27
    "%Y-%m-%d",                    # 2010-09-21
    "%d/%m/%Y",                    # 21/09/2010
    "%d %B %Y",                    # 21 September 2010
    "%d %b %Y",                    # 21 Sep 2010
    "%B %d %Y",                    # September 21 2010
    "%B %d, %Y",                   # September 21, 2010
]
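
With a list like this, a possible fix is to pass the formats (and the language) explicitly when parsing. This is only a sketch of how I would call dateparser, not necessarily how the script should do it:

import dateparser

# Sketch: pass the explicit formats and the language when parsing;
# date_formats is the list defined above.
parsed = dateparser.parse(
    "2010-09-21",                # matches the "%Y-%m-%d" entry above
    date_formats=date_formats,
    languages=["de"],
)
print(parsed)  # datetime.datetime(2010, 9, 21, 0, 0)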
