lil-lab / newsroom
Tools for downloading and analyzing summaries and evaluating summarization systems. https://summari.es/
License: Other
https://summari.es/files/thin.tar can't be reached. When I try from the browser the download fails at around 20 MB each time. Using wget the following error is received:
Connecting to summari.es:80... connected!
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://summari.es/files/thin.tar [following]
--13:16:19-- ftp://https:21/%2Fsummari.es/files/thin.tar
=> `thin.tar'
Connecting to https:21...
https: Host not found
Is there any mirror for the thin.tar file?
Thanks for your hard work and great dataset.
When is the estimated release date for full newsroom dataset?
I couldn't find a complete description of the JSON data fields of the dataset.
Did I miss it?
Most of the fields are self-explanatory, but, for example, I'm wondering what these are:
extbin
sumbin
textbin
Is it the extractiveness, summary length and text length ?
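In the meantime, you can check which fields your copy of the data actually contains with a short script (a sketch that assumes the released files are gzipped JSON-lines, one object per line; the helper name is my own):

```python
import gzip
import json

def field_names(path, limit=100):
    """Union of JSON keys over the first `limit` records of a .jsonl.gz file."""
    keys = set()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            keys.update(json.loads(line))
    return sorted(keys)
```

Running `field_names("thin/dev.jsonl.gz")` should list every key present, which at least confirms the exact spellings.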
Hi, thanks a lot for this great dataset. I have been looking forward to training models on it but I ran into two issues.
The first is related to #2 . Some of the WebArchive links redirect for some reason and the original article is lost. For example:
https://web.archive.org/web/2014013019id_/http://www.nytimes.com/2014/01/31/world/europe/ukraine-unrest.html
For this issue, the workaround from the comment in #2 (using exactness 3) works. However, it is problematic for anyone who does not set exactness to 3, since the newsroom-scrape
script doesn't detect the broken link, so the malformed article-summary pair ends up in the final dataset.
The second issue is that some summaries are empty. For example:
http://latimesblogs.latimes.com/technology/2011/08/how-to-opt-out-of-linkedin-sharing-your-data-with-third-parties.html
In this case, should the article be excluded from the dataset? I don't see how a meaningful summary can be extracted from this web page.
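Until that's resolved upstream, a defensive filter on the training side may help (a sketch, assuming gzipped JSON-lines with the documented "text" and "summary" fields):

```python
import gzip
import json

def clean_records(path):
    """Yield only records whose text and summary are both non-empty."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            text = (record.get("text") or "").strip()
            summary = (record.get("summary") or "").strip()
            if text and summary:
                yield record
```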
Hi,
whenever I launch newsroom-scrape I get the SyntaxError from the concurrent library:
Traceback (most recent call last):
File "/usr/local/bin/newsroom-scrape", line 11, in <module>
load_entry_point('newsroom', 'console_scripts', 'newsroom-scrape')()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2693, in load_entry_point
return ep.load()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2324, in load
return self.resolve()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2330, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/home/crow/src/newsroom/newsroom/__init__.py", line 2, in <module>
from . import build
File "/home/crow/src/newsroom/newsroom/build/__init__.py", line 1, in <module>
from .download import Downloader
File "/home/crow/src/newsroom/newsroom/build/download.py", line 2, in <module>
from concurrent.futures import ThreadPoolExecutor
File "/usr/local/lib/python3.6/dist-packages/concurrent/futures/__init__.py", line 8, in <module>
from concurrent.futures._base import (FIRST_COMPLETED,
File "/usr/local/lib/python3.6/dist-packages/concurrent/futures/_base.py", line 381
raise exception_type, self._exception, self._traceback
^
SyntaxError: invalid syntax
I think it's caused by a library written for Python 2 being used with Python 3. What can I do? I'm using Python 3.6.
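One likely culprit (an assumption, not confirmed by the maintainers): `raise exception_type, self._exception, self._traceback` is Python 2 syntax, and the failing file lives under `dist-packages/concurrent/futures/`, which suggests the PyPI `futures` backport is shadowing Python 3's standard-library `concurrent.futures`. Removing the backport may fix it:

```shell
# Uninstall the Python 2 `futures` backport if present (harmless otherwise),
# then confirm the standard-library module imports cleanly under Python 3:
pip3 uninstall -y futures || true
python3 -c "from concurrent.futures import ThreadPoolExecutor; print('ok')"
```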
Just as the title says. At the moment https://summari.es/download/ link is not working.
There are only ['url', 'archive', 'title', 'date', 'text', 'summary', 'density', 'coverage', 'compression', 'compression_bin', 'coverage_bin', 'density_bin'] in the summary objects, while ['subset', 'publication'], which are stated on your website, are missing. Without 'subset' we cannot compare our results to those on the leaderboard. Can you provide that information? Thanks.
After running the crawler a couple of times with a reduced number of workers and even exactness 4, I have around 100 articles still missing in both the dev and test sets (train is still running, currently missing about 2700). For example:
https://newoldage.blogs.nytimes.com/2009/12/03/ccrc-fees-prepare-to-be-bewildered/index.html
Others are on the original websites, but not found on the WebArchive
https://www.nytimes.com/2010/10/21/opinion/21kristof.html
https://web.archive.org/web/*/https://www.nytimes.com/2010/10/21/opinion/21kristof.html
(All webarchive saves redirect to errors)
Any ideas on what went wrong there? Could there be a fix to get them directly from the sites when available? I can post the full lists if that's helpful.
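For diagnosing which URLs the Wayback Machine actually has, the public availability API can help (a sketch; the endpoint is Archive.org's documented availability API, but the helper names here are my own, and this is a diagnostic aid rather than part of newsroom-scrape):

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"

def availability_api_url(url, timestamp=""):
    """Build the availability query for a page and optional timestamp prefix."""
    return API + "?" + urllib.parse.urlencode({"url": url, "timestamp": timestamp})

def wayback_snapshot(url, timestamp=""):
    """Return the closest snapshot URL, or None if nothing is archived."""
    with urllib.request.urlopen(availability_api_url(url, timestamp), timeout=30) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None
```

Looping `wayback_snapshot` over the missing URLs would separate "never archived" pages from pages the scraper merely failed to fetch.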
When I run the command newsroom-extract --archive dev.archive --dataset dev.data
to obtain the data file, the command line shows this message:
"Loading downloaded summaries:
Aborted!"
Then it stops. Can you help fix this? Thank you.
The only data in the datasets is with keys:
-archive
-date
-compression
-coverage
-density
-compression_bin
-coverage_bin
-density_bin
I cannot see the "summary" and "text" values of each object.
newsroom/newsroom/analyze/fragments.py, line 174 (commit 4551c77)
I downloaded and extracted the dataset using the script but the "publication" key is missing from the extracted train.data file.
I am getting this error
_pickle.PicklingError: Can't pickle <function Article.process at 0x7f5f2ab56d90>: attribute lookup process on newsroom.build.filter failed
when using the following command:
newsroom-extract --archive dev.archive --dataset dev.data
I am running it with Python 3.4.
When running the newsroom-scrape
command, I got this error:
ModuleNotFoundError: No module named 'lxml.etree'
Full stack trace:
Traceback (most recent call last):
File "/home/x/.venv/36/bin/newsroom-scrape", line 11, in <module>
load_entry_point('newsroom', 'console_scripts', 'newsroom-scrape')()
File "/home/x/.venv/36/lib/python3.6/site-packages/pkg_resources/__init__.py", line 489, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/home/x/.venv/36/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2793, in load_entry_point
return ep.load()
File "/home/x/.venv/36/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2411, in load
return self.resolve()
File "/home/x/.venv/36/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2417, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/home/x/.venv/36/src/newsroom/newsroom/__init__.py", line 2, in <module>
from . import build
File "/home/x/.venv/36/src/newsroom/newsroom/build/__init__.py", line 2, in <module>
from .filter import Article
File "/home/x/.venv/36/src/newsroom/newsroom/build/filter.py", line 5, in <module>
from readability import Document
File "/home/x/.venv/36/lib/python3.6/site-packages/readability/__init__.py", line 1, in <module>
from .readability import Document
File "/home/x/.venv/36/lib/python3.6/site-packages/readability/readability.py", line 8, in <module>
from lxml.etree import tostring
ModuleNotFoundError: No module named 'lxml.etree'
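`lxml.etree` is the compiled module inside the `lxml` package, which readability depends on; installing it should clear this error:

```shell
# lxml is a binary dependency; pip pulls a prebuilt wheel on most platforms.
pip3 install lxml
python3 -c "import lxml.etree; print('lxml OK')"
```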
ROUGE-L is sensitive to sentence tokenization.
The data format used by the newsroom-run -> newsroom-score -> newsroom-tables evaluation pipeline does not appear to keep track of sentence tokenization. When sentence tokenization is not provided to ROUGE-1.5.5, multi-sentence references and hypotheses are evaluated as one long sentence. As a result, ROUGE-L scores produced by this evaluation pipeline (1) may be lower than expected compared to more standard ROUGE evaluation that uses tokenized sentences, and (2) probably do not match the ROUGE-L scores in the Newsroom paper, which are computed using sentence tokenization.
I would recommend adding a notice to the README that the evaluation pipeline does not keep track of sentence tokenization, which may result in lower-than-expected ROUGE-L scores, and that the newsroom-run -> newsroom-score -> newsroom-tables evaluation pipeline should not be used for publishable evaluation.
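The effect is easy to demonstrate with a toy LCS computation (a simplified recall-style sketch, not ROUGE-1.5.5 itself; real ROUGE-L uses a union LCS and reports an F-measure): flattening a multi-sentence text into one long sequence can yield a much smaller LCS than matching sentence by sentence.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

# Two sentences with identical content but opposite order.
ref_sents = [["the", "cat", "sat"], ["the", "dog", "ran"]]
hyp_sents = [["the", "dog", "ran"], ["the", "cat", "sat"]]

# With sentence boundaries: each reference sentence matches its best
# hypothesis sentence.
sent_lcs = sum(max(lcs_len(r, h) for h in hyp_sents) for r in ref_sents)

# Without boundaries: each side becomes one long "sentence".
flat_lcs = lcs_len(sum(ref_sents, []), sum(hyp_sents, []))

print(sent_lcs, flat_lcs)  # the flattened LCS is much smaller
```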
newsroom-scrape --thin thin/dev.jsonl.gz --archive dev.archive
Traceback (most recent call last):
File "/usr/local/bin/newsroom-scrape", line 11, in <module>
load_entry_point('newsroom', 'console_scripts', 'newsroom-scrape')()
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 587, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2800, in load_entry_point
return ep.load()
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2431, in load
return self.resolve()
File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2437, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/Users/aburkov/Downloads/src/newsroom/newsroom/__init__.py", line 2, in <module>
from . import build
File "/Users/aburkov/Downloads/src/newsroom/newsroom/build/__init__.py", line 1, in <module>
from .download import Downloader
File "/Users/aburkov/Downloads/src/newsroom/newsroom/build/download.py", line 55
yield from executor.map(self._thread, urls)
^
SyntaxError: invalid syntax
Can you help me out here, or is there any link i could directly download the dataset from? @yoavartzi @grusky
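Since the paths in the traceback point at `/usr/local/lib/python2.7/`, the entry point is running under Python 2, while `yield from` requires Python 3.3+. Checking the interpreter and reinstalling under Python 3 should help (a sketch; adjust the install command to your environment):

```shell
# `yield from` (download.py, line 55) is Python 3.3+ syntax; confirm a
# suitable interpreter is available before reinstalling the package:
python3 -c 'import sys; assert sys.version_info >= (3, 3), sys.version'
# Then reinstall against Python 3, e.g.:
# pip3 install -e git+https://github.com/lil-lab/newsroom.git#egg=newsroom
```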
Hi. I am getting this module error. Am I supposed to download something, or do I need a newsroom.py file? I can't seem to find any such file in the code. What am I missing?
Will the evaluation tools be made available anytime soon? If not, do you recommend a specific python wrapper for rouge? Thanks
Hi, I tried to build the Newsroom dataset from scratch. After I ran newsroom-scrape --thin thin/dev.jsonl.gz --archive dev.archive
, the dev.archive
file does not appear. It took me 10 days to run the command and there is nothing left, so I wonder whether something is wrong with scraping from scratch.
The final result I get is:
108800 pages need re-downloading later.
108810 pages need re-downloading later.
108820 pages need re-downloading later.
108830 pages need re-downloading later.
Rerun the script: 108837 pages failed to download.
- Try running with a lower --workers count (default = 16).
- Check which URLs are left with the --diff flag.
- Last resort: --exactness X to truncate dates to X digits.
(e.g., --exactness 4 will download the closest year.)
Downloading Summaries: 100%|██████████| 108837/108837 [249:09:03<00:00, 8.24s/it]
ub16hp@UB16HP:~/ub16_prj/newsroom$ newsroom-scrape --thin thin/dev.jsonl.gz --archive dev.archive
gzip: stdin: unexpected end of file
Loading previously downloaded summaries:
Traceback (most recent call last):
File "/usr/local/bin/newsroom-scrape", line 11, in <module>
load_entry_point('newsroom', 'console_scripts', 'newsroom-scrape')()
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/home/ub16hp/ub16_prj/newsroom/newsroom/build/scrape.py", line 124, in main
done = {ln["archive"] for ln in f}
File "/home/ub16hp/ub16_prj/newsroom/newsroom/build/scrape.py", line 124, in <setcomp>
done = {ln["archive"] for ln in f}
File "/home/ub16hp/ub16_prj/newsroom/newsroom/build/jsonl.py", line 264, in readlines
yield _json.loads(line)
ValueError: Unmatched ''"' when when decoding 'string'
ub16hp@UB16HP:~/ub16_prj/newsroom$
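The `gzip: stdin: unexpected end of file` line suggests the previous run left a truncated archive (assuming `dev.archive` is gzip-compressed, as that error implies). Checking its integrity before rerunning can confirm this:

```shell
# The exit status of `gzip -t` reveals whether the archive is intact; a
# truncated file typically has to be deleted before the scraper can resume.
gzip -t dev.archive && echo "archive intact" || echo "archive truncated or unreadable"
```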
Following advice from https://stackoverflow.com/questions/70663523/the-unauthenticated-git-protocol-on-port-9418-is-no-longer-supported:
Simply use
pip install -e git+https://github.com/clic-lab/newsroom.git#egg=newsroom
instead of
pip install -e git+git://github.com/clic-lab/newsroom.git#egg=newsroom
I submitted a request for the complete dataset from the website, but still haven't received any email.
If the process takes some time, you might want to send an acknowledgement email, something along the lines of "Your request is now being processed", so users can be sure they didn't give a wrong email address.
The 'read full article' links don't work in the 'explore' section.
https://twitter.com/jeffrschneider/status/991391692854046721?s=11
Dear @yoavartzi, first of all, I want to thank you for your great work.
I have two questions about the diversity analysis in your paper. I used your code to calculate coverage, density, and compression, and then used seaborn.kdeplot to visualize the results. But my result differs from Figure 4 in the paper; the coverage scores seem much lower. My questions are: