
crocodile's People

Contributors

littlepea13


crocodile's Issues

No such file or directory: 'wikidata/wikidata-triples.csv'

I tried to run the extract_lan.sh script, but I got an error saying that the file 'wikidata/wikidata-triples.csv' could not be found.
I'm not sure where this file is created, or whether it should be downloaded.

It would be great if you could point me in the right direction.
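
For reference, here is a rough sketch of how I imagine such a triples file could be produced from the raw Wikidata JSON dump, assuming it is meant to hold (subject, property, object) rows of entity/property IDs. Both the dump path and the column layout are guesses on my part, not the actual pipeline:

# Hypothetical sketch: build wikidata/wikidata-triples.csv from the raw Wikidata
# JSON dump, assuming one (subject, property, object) row per item-to-item statement.
import bz2
import csv
import json

DUMP_PATH = "wikidata/latest-all.json.bz2"   # assumed location of the Wikidata dump
OUT_PATH = "wikidata/wikidata-triples.csv"   # the file extract_lan.sh complains about

with bz2.open(DUMP_PATH, "rt", encoding="utf-8") as dump, \
        open(OUT_PATH, "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    for line in dump:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue  # skip the enclosing JSON array brackets
        entity = json.loads(line)
        subject = entity["id"]
        for prop, claims in entity.get("claims", {}).items():
            for claim in claims:
                snak = claim.get("mainsnak", {})
                value = snak.get("datavalue", {}).get("value")
                # keep only statements whose object is another entity
                if isinstance(value, dict) and "id" in value:
                    writer.writerow([subject, prop, value["id"]])

If this is not how the script expects wikidata/wikidata-triples.csv to be built, it would still help to know which step is supposed to create it.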

Missing input file for wikiextractor in the "extract_lan.sh" script

Hi, I've tried to execute the extract_lan.sh script now, but it results in this error:

downloading wikipedia and wikidata dumps...
2022-01-25 16:29:34,692 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-page.sql.gz] to [data/de/dewiki-latest-page.sql.gz]
2022-01-25 16:30:36,991 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-page_props.sql.gz] to [data/de/dewiki-latest-page_props.sql.gz]
2022-01-25 16:30:55,206 - wikimapper.download - INFO - Downloading [https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-redirect.sql.gz] to [data/de/dewiki-latest-redirect.sql.gz]
Create wikidata database
2022-01-25 16:31:00,687 - wikimapper.processor - INFO - Creating index for [dewiki-latest] in [data/de/index_dewiki-latest.db]
2022-01-25 16:31:00,697 - wikimapper.processor - INFO - Parsing pages dump
2022-01-25 16:31:51,875 - wikimapper.processor - INFO - Creating database index on 'wikipedia_title'
2022-01-25 16:31:55,962 - wikimapper.processor - INFO - Parsing page properties dump
2022-01-25 16:32:22,760 - wikimapper.processor - INFO - Parsing redirects dump
2022-01-25 16:32:45,246 - wikimapper.processor - INFO - Creating database index on 'wikidata_id'
Extract abstracts
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/crocodile/wikiextractor/wikiextractor/WikiExtractor.py", line 656, in <module>
    main()
  File "/content/crocodile/wikiextractor/wikiextractor/WikiExtractor.py", line 652, in main
    args.compress, args.processes, args.escape_doc)
  File "/content/crocodile/wikiextractor/wikiextractor/WikiExtractor.py", line 297, in process_dump
    input = decode_open(input_file)
  File "/content/crocodile/wikiextractor/wikiextractor/WikiExtractor.py", line 277, in decode_open
    return bz2.open(filename, mode=mode, encoding=encoding)
  File "/usr/lib/python3.7/bz2.py", line 318, in open
    binary_file = BZ2File(filename, bz_mode, compresslevel=compresslevel)
  File "/usr/lib/python3.7/bz2.py", line 92, in __init__
    self._fp = _builtin_open(filename, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'data/de/dewiki-latest-pages-articles-multistream.xml.bz2'

As the output shows, wikimapper downloads its SQL dumps into the data/de/ directory, but none of them is the compressed XML dump that WikiExtractor expects :(
Any ideas?
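
For anyone hitting the same error: the missing file is the pages-articles multistream dump, which wikimapper does not download. As a workaround I fetched it manually; a minimal Python sketch (the URL follows the usual Wikimedia dump naming scheme, so double-check it for your language):

# Workaround sketch: fetch the pages-articles dump that WikiExtractor expects.
# Assumes the standard Wikimedia dump layout; adjust "de" for other languages.
import os
import urllib.request

lang = "de"
name = f"{lang}wiki-latest-pages-articles-multistream.xml.bz2"
url = f"https://dumps.wikimedia.org/{lang}wiki/latest/{name}"
dest = os.path.join("data", lang, name)

os.makedirs(os.path.dirname(dest), exist_ok=True)
print(f"Downloading {url} to {dest} ...")
urllib.request.urlretrieve(url, dest)  # the file is several GB, so this takes a while

After that, WikiExtractor finds its input file, but I don't know whether the script was supposed to download it in another step.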

Submodule missing?

Hi! Firstly, thanks for publishing your research implementation!
I'm just missing the files in ./wikiextractor as described in the setup instructions:

For ./wikiextractor we use a submodule which is a fork of the original wikiextractor that implements wikimapper to extract the Wikidata entities.

Maybe the submodule is missing?

extract_lan.sh script

Hi, I've tried to execute the extract_lan.sh script, but it results in the same error.
I already added that line to the download.py script, but it doesn't work.
Maybe it needs to be added in another place?

wikidata-triplets.py - 0 entity found

Hello,

First of all, I want to thank you for sharing this script with the community.
I'm trying to regenerate the REBEL dataset.
By running python -m wikiextractor.wikiextractor.WikiExtractor data/$1/$1wiki-latest-pages-articles-multistream.xml.bz2 --links --language $1 --output text/$1 --templates data/$1/templates.txt, I get the page articles.
The Wikidata entities are described by their links (--links), but wikidata-triplets.py uses Wikidata IDs.

How did you turn the links into IDs?
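
In case it helps, my guess is that the mapping goes through the wikimapper index that extract_lan.sh builds (e.g. data/de/index_dewiki-latest.db). Here is a sketch of turning a link target into a Wikidata ID with that index; whether the original code does it exactly this way is an assumption on my part:

# Sketch: map a Wikipedia link target to its Wikidata ID using the
# wikimapper index created earlier by the script.
from wikimapper import WikiMapper

mapper = WikiMapper("data/de/index_dewiki-latest.db")

# Link targets extracted with --links are page titles; spaces become underscores.
title = "Albert Einstein".replace(" ", "_")
wikidata_id = mapper.title_to_id(title)  # e.g. "Q937", or None if the title is unknown
print(title, "->", wikidata_id)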

dateparser does not seem to work

I created the German corpus and noticed that dates were not extracted. For the English corpus it seems to work but I am not sure why.

So I started debugging for the German case:

The regex works fine and finds the dates. The problem seems to arise when calling dateparser.parse.

This function expects a date_formats argument, which is not passed. Since it is None, the subsequent get_date_data() call throws an exception because parse() or parse_with_formats() expects date_formats.

I think date_formats could be:

DEFAULT_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
date_formats = [
    DEFAULT_DATE_FORMAT,           # 2010-09-21 09:27:01  => SQLite + MySQL
    "%Y-%m-%dT%H:%M:%SZ",          # 2010-09-20T09:27:01Z => Bing
    "%a, %d %b %Y %H:%M:%S +0000", # Fri, 21 Sep 2010 09:27:01 +000 => Twitter
    "%a %b %d %H:%M:%S +0000 %Y",  # Fri Sep 21 09:21:01 +0000 2010 => Twitter
    "%Y-%m-%dT%H:%M:%S+0000",      # 2010-09-20T09:27:01+0000 => Facebook
    "%Y-%m-%d %H:%M",              # 2010-09-21 09:27
    "%Y-%m-%d",                    # 2010-09-21
    "%d/%m/%Y",                    # 21/09/2010
    "%d %B %Y",                    # 21 September 2010
    "%d %b %Y",                    # 21 Sep 2010
    "%B %d %Y",                    # September 21 2010
    "%B %d, %Y",                   # September 21, 2010
]
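
With a list like this, a possible fix is to pass the formats (and the language) explicitly when parsing. This is only a sketch of how I would call dateparser, not necessarily how the script should do it:

import dateparser

# Sketch: pass the explicit formats and the language when parsing;
# date_formats is the list defined above.
parsed = dateparser.parse(
    "2010-09-21",                # matches the "%Y-%m-%d" entry above
    date_formats=date_formats,
    languages=["de"],
)
print(parsed)  # datetime.datetime(2010, 9, 21, 0, 0)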
