lil-lab / newsroom
Tools for downloading and analyzing summaries and evaluating summarization systems. https://summari.es/
License: Other
https://summari.es/files/thin.tar can't be reached. When I try from the browser the download fails at around 20 MB each time. Using wget the following error is received:
Connecting to summari.es:80... connected!
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://summari.es/files/thin.tar [following]
--13:16:19-- ftp://https:21/%2Fsummari.es/files/thin.tar
=> `thin.tar'
Connecting to https:21...
https: Host not found
Is there any mirror for the thin.tar file?
Thanks for your hard work and great dataset.
When is the estimated release date for full newsroom dataset?
I couldn't find a complete description of the JSON data fields of the dataset.
Did I miss it?
Most of the fields are self-explanatory, but, for example, I'm wondering what these are:
extbin
sumbin
textbin
Is it the extractiveness, summary length and text length ?
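In the meantime, you can check which fields your copy of the data actually contains with a short script (a sketch that assumes the released files are gzipped JSON-lines, one object per line; the helper name is my own):

```python
import gzip
import json

def field_names(path, limit=100):
    """Union of JSON keys over the first `limit` records of a .jsonl.gz file."""
    keys = set()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            keys.update(json.loads(line))
    return sorted(keys)
```

Running `field_names("thin/dev.jsonl.gz")` should list every key present, which at least confirms the exact spellings.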
Hi, thanks a lot for this great dataset. I have been looking forward to training models on it but I ran into two issues.
The first is related to #2 . Some of the WebArchive links redirect for some reason and the original article is lost. For example:
https://web.archive.org/web/2014013019id_/http://www.nytimes.com/2014/01/31/world/europe/ukraine-unrest.html
For this issue, the workaround from the comment in #2 (using exactness 3) works. However, it is problematic for anyone who does not set exactness to 3, since the newsroom-scrape
script doesn't detect the broken link, so the malformed article-summary pair ends up in the final dataset.
The second issue is that some summaries are empty. For example:
http://latimesblogs.latimes.com/technology/2011/08/how-to-opt-out-of-linkedin-sharing-your-data-with-third-parties.html
In this case, should the article be excluded from the dataset? I don't see how a meaningful summary can be extracted from this web page.
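Until that's resolved upstream, a defensive filter on the training side may help (a sketch, assuming gzipped JSON-lines with the documented "text" and "summary" fields):

```python
import gzip
import json

def clean_records(path):
    """Yield only records whose text and summary are both non-empty."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            text = (record.get("text") or "").strip()
            summary = (record.get("summary") or "").strip()
            if text and summary:
                yield record
```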
Hi,
whenever I launch newsroom-scrape I get the SyntaxError from the concurrent library:
Traceback (most recent call last):
File "/usr/local/bin/newsroom-scrape", line 11, in <module>
load_entry_point('newsroom', 'console_scripts', 'newsroom-scrape')()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2693, in load_entry_point
return ep.load()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2324, in load
return self.resolve()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2330, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/home/crow/src/newsroom/newsroom/__init__.py", line 2, in <module>
from . import build
File "/home/crow/src/newsroom/newsroom/build/__init__.py", line 1, in <module>
from .download import Downloader
File "/home/crow/src/newsroom/newsroom/build/download.py", line 2, in <module>
from concurrent.futures import ThreadPoolExecutor
File "/usr/local/lib/python3.6/dist-packages/concurrent/futures/__init__.py", line 8, in <module>
from concurrent.futures._base import (FIRST_COMPLETED,
File "/usr/local/lib/python3.6/dist-packages/concurrent/futures/_base.py", line 381
raise exception_type, self._exception, self._traceback
^
SyntaxError: invalid syntax
I think it's caused by a library written for Python 2 being used with Python 3. What can I do? I'm using Python 3.6.
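One likely culprit (an assumption, not confirmed by the maintainers): `raise exception_type, self._exception, self._traceback` is Python 2 syntax, and the failing file lives under `dist-packages/concurrent/futures/`, which suggests the PyPI `futures` backport is shadowing Python 3's standard-library `concurrent.futures`. Removing the backport may fix it:

```shell
# Uninstall the Python 2 `futures` backport if present (harmless otherwise),
# then confirm the standard-library module imports cleanly under Python 3:
pip3 uninstall -y futures || true
python3 -c "from concurrent.futures import ThreadPoolExecutor; print('ok')"
```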
Just as the title says. At the moment https://summari.es/download/ link is not working.
There are only ['url', 'archive', 'title', 'date', 'text', 'summary', 'density', 'coverage', 'compression', 'compression_bin', 'coverage_bin', 'density_bin'] in the summary objects, while ['subset', 'publication'], which are stated on your website, are missing. Without 'subset' we cannot compare our results to those on the leaderboard. Can you provide that information? Thanks.
After running the crawler a couple of times with a reduced number of workers and even exactness 4, I have around 100 articles still missing in both the dev and test sets (train is still running, currently missing about 2700). For example:
https://newoldage.blogs.nytimes.com/2009/12/03/ccrc-fees-prepare-to-be-bewildered/index.html
Others are on the original websites, but not found on the WebArchive
https://www.nytimes.com/2010/10/21/opinion/21kristof.html
https://web.archive.org/web/*/https://www.nytimes.com/2010/10/21/opinion/21kristof.html
(All webarchive saves redirect to errors)
Any ideas on what went wrong there? Could there be a fix to get them directly from the sites when available? I can post the full lists if that's helpful.
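For diagnosing which URLs the Wayback Machine actually has, the public availability API can help (a sketch; the endpoint is Archive.org's documented availability API, but the helper names here are my own, and this is a diagnostic aid rather than part of newsroom-scrape):

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"

def availability_api_url(url, timestamp=""):
    """Build the availability query for a page and optional timestamp prefix."""
    return API + "?" + urllib.parse.urlencode({"url": url, "timestamp": timestamp})

def wayback_snapshot(url, timestamp=""):
    """Return the closest snapshot URL, or None if nothing is archived."""
    with urllib.request.urlopen(availability_api_url(url, timestamp), timeout=30) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None
```

Looping `wayback_snapshot` over the missing URLs would separate "never archived" pages from pages the scraper merely failed to fetch.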
When I run the command newsroom-extract --archive dev.archive --dataset dev.data
to obtain the data file, the command line shows this message:
"Loading downloaded summaries:
Aborted!"
Then it stops. Can you help fix this? Thank you.
The only data in the datasets is with keys:
-archive
-date
-compression
-coverage
-density
-compression_bin
-coverage_bin
-density_bin
I cannot see the "summary" and "text" values of each object.
newsroom/newsroom/analyze/fragments.py, line 174 (commit 4551c77)
I downloaded and extracted the dataset using the script but the "publication" key is missing from the extracted train.data file.
I am getting this error
_pickle.PicklingError: Can't pickle <function Article.process at 0x7f5f2ab56d90>: attribute lookup process on newsroom.build.filter failed
when using the following command:
newsroom-extract --archive dev.archive --dataset dev.data
I am running it with Python 3.4.
When running the newsroom-scrape
command, I got this error:
ModuleNotFoundError: No module named 'lxml.etree'
Full stack trace:
Traceback (most recent call last):
File "/home/x/.venv/36/bin/newsroom-scrape", line 11, in <module>
load_entry_point('newsroom', 'console_scripts', 'newsroom-scrape')()
File "/home/x/.venv/36/lib/python3.6/site-packages/pkg_resources/__init__.py", line 489, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/home/x/.venv/36/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2793, in load_entry_point
return ep.load()
File "/home/x/.venv/36/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2411, in load
return self.resolve()
File "/home/x/.venv/36/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2417, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/home/x/.venv/36/src/newsroom/newsroom/__init__.py", line 2, in <module>
from . import build
File "/home/x/.venv/36/src/newsroom/newsroom/build/__init__.py", line 2, in <module>
from .filter import Article
File "/home/x/.venv/36/src/newsroom/newsroom/build/filter.py", line 5, in <module>
from readability import Document
File "/home/x/.venv/36/lib/python3.6/site-packages/readability/__init__.py", line 1, in <module>
from .readability import Document
File "/home/x/.venv/36/lib/python3.6/site-packages/readability/readability.py", line 8, in <module>
from lxml.etree import tostring
ModuleNotFoundError: No module named 'lxml.etree'
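`lxml.etree` is the compiled module inside the `lxml` package, which readability depends on; installing it should clear this error:

```shell
# lxml is a binary dependency; pip pulls a prebuilt wheel on most platforms.
pip3 install lxml
python3 -c "import lxml.etree; print('lxml OK')"
```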
ROUGE-L is sensitive to sentence tokenization.
The data format used by the newsroom-run -> newsroom-score -> newsroom-tables evaluation pipeline does not appear to keep track of sentence tokenization. When sentence tokenization is not provided to ROUGE-1.5.5, multi-sentence references and hypotheses are evaluated as one long sentence. As a result, ROUGE-L scores produced by this evaluation pipeline (1) may be lower than expected compared to more standard ROUGE evaluation that uses tokenized sentences, and (2) probably do not match the ROUGE-L scores in the Newsroom paper, which are computed using sentence tokenization.
I would recommend adding a notice to the README that the evaluation pipeline does not keep track of sentence tokenization, which may result in lower-than-expected ROUGE-L scores, and that the newsroom-run -> newsroom-score -> newsroom-tables evaluation pipeline should not be used for publishable evaluation.
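The effect is easy to demonstrate with a toy LCS computation (a simplified recall-style sketch, not ROUGE-1.5.5 itself; real ROUGE-L uses a union LCS and reports an F-measure): flattening a multi-sentence text into one long sequence can yield a much smaller LCS than matching sentence by sentence.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

# Two sentences with identical content but opposite order.
ref_sents = [["the", "cat", "sat"], ["the", "dog", "ran"]]
hyp_sents = [["the", "dog", "ran"], ["the", "cat", "sat"]]

# With sentence boundaries: each reference sentence matches its best
# hypothesis sentence.
sent_lcs = sum(max(lcs_len(r, h) for h in hyp_sents) for r in ref_sents)

# Without boundaries: each side becomes one long "sentence".
flat_lcs = lcs_len(sum(ref_sents, []), sum(hyp_sents, []))

print(sent_lcs, flat_lcs)  # the flattened LCS is much smaller
```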
newsroom-scrape --thin thin/dev.jsonl.gz --archive dev.archive
Traceback (most recent call last):
File "/usr/local/bin/newsroom-scrape", line 11, in <module>
load_entry_point('newsroom', 'console_scripts', 'newsroom-scrape')()
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 587, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2800, in load_entry_point
return ep.load()
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2431, in load
return self.resolve()
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2437, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/Users/aburkov/Downloads/src/newsroom/newsroom/__init__.py", line 2, in <module>
from . import build
File "/Users/aburkov/Downloads/src/newsroom/newsroom/build/__init__.py", line 1, in <module>
from .download import Downloader
File "/Users/aburkov/Downloads/src/newsroom/newsroom/build/download.py", line 55
yield from executor.map(self._thread, urls)
^
SyntaxError: invalid syntax
Can you help me out here, or is there any link i could directly download the dataset from? @yoavartzi @grusky
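Since the paths in the traceback point at `/usr/local/lib/python2.7/`, the entry point is running under Python 2, while `yield from` requires Python 3.3+. Checking the interpreter and reinstalling under Python 3 should help (a sketch; adjust the install command to your environment):

```shell
# `yield from` (download.py, line 55) is Python 3.3+ syntax; confirm a
# suitable interpreter is available before reinstalling the package:
python3 -c 'import sys; assert sys.version_info >= (3, 3), sys.version'
# Then reinstall against Python 3, e.g.:
# pip3 install -e git+https://github.com/lil-lab/newsroom.git#egg=newsroom
```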
Hi. I am getting this module error. Am I supposed to download something, or do I need a newsroom.py file? I can't seem to find any such file in the code. What am I missing?
Will the evaluation tools be made available anytime soon? If not, do you recommend a specific python wrapper for rouge? Thanks
Hi, I tried to build the Newsroom dataset from scratch. After I ran newsroom-scrape --thin thin/dev.jsonl.gz --archive dev.archive
, the dev.archive
file does not appear. It took me 10 days to run the command and there is nothing left, so I wonder whether something is wrong with scraping from scratch.
The final result I get is:
108800 pages need re-downloading later.
108810 pages need re-downloading later.
108820 pages need re-downloading later.
108830 pages need re-downloading later.
Rerun the script: 108837 pages failed to download.
- Try running with a lower --workers count (default = 16).
- Check which URLs are left with the --diff flag.
- Last resort: --exactness X to truncate dates to X digits.
(e.g., --exactness 4 will download the closest year.)
Downloading Summaries: 100%|██████████| 108837/108837 [249:09:03<00:00, 8.24s/it]
ub16hp@UB16HP:~/ub16_prj/newsroom$ newsroom-scrape --thin thin/dev.jsonl.gz --archive dev.archive
gzip: stdin: unexpected end of file
Loading previously downloaded summaries:
Traceback (most recent call last):
File "/usr/local/bin/newsroom-scrape", line 11, in <module>
load_entry_point('newsroom', 'console_scripts', 'newsroom-scrape')()
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/home/ub16hp/ub16_prj/newsroom/newsroom/build/scrape.py", line 124, in main
done = {ln["archive"] for ln in f}
File "/home/ub16hp/ub16_prj/newsroom/newsroom/build/scrape.py", line 124, in <setcomp>
done = {ln["archive"] for ln in f}
File "/home/ub16hp/ub16_prj/newsroom/newsroom/build/jsonl.py", line 264, in readlines
yield _json.loads(line)
ValueError: Unmatched ''"' when when decoding 'string'
ub16hp@UB16HP:~/ub16_prj/newsroom$
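The `gzip: stdin: unexpected end of file` line suggests the previous run left a truncated archive (assuming `dev.archive` is gzip-compressed, as that error implies). Checking its integrity before rerunning can confirm this:

```shell
# The exit status of `gzip -t` reveals whether the archive is intact; a
# truncated file typically has to be deleted before the scraper can resume.
gzip -t dev.archive && echo "archive intact" || echo "archive truncated or unreadable"
```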
Following advice from https://stackoverflow.com/questions/70663523/the-unauthenticated-git-protocol-on-port-9418-is-no-longer-supported:
Simply use
pip install -e git+https://github.com/clic-lab/newsroom.git#egg=newsroom
instead of
pip install -e git+git://github.com/clic-lab/newsroom.git#egg=newsroom
I submitted a request for the complete dataset from the website, but still haven't received any email.
If the process takes some time, you might want to send an acknowledgement email, something along the lines of "Your request is now being processed", so users can be sure they didn't give a wrong email address.
The 'read full article' links don't work in the 'explore' section.
https://twitter.com/jeffrschneider/status/991391692854046721?s=11
Dear @yoavartzi, first of all, I want to thank you for your great work.
I have two questions about the diversity analysis in your paper. I used your code to calculate coverage, density, and compression, and then used seaborn.kdeplot to visualize the results. But my result differs from Figure 4 in the paper; the coverage scores seem much lower. My questions are: