
newsroom's People

Contributors

grusky, yoavartzi


newsroom's Issues

Memory keeps growing when scraping data

As I run the scraping code, my computer's memory usage keeps growing while summaries are downloaded, until it crashes because there is not enough memory. Is there a way to fix this?
Running on Windows 7 64-bit, Python 3.6.
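
The scraper's internal memory use aside, one way to keep memory bounded when later processing the downloaded archive is to stream it record by record rather than loading everything at once. A minimal sketch, assuming the .archive file is gzipped JSON lines (the filename is illustrative):

import gzip
import json

# Stream a gzipped JSON-lines archive so that only one record is held
# in memory at a time (the filename is illustrative).
def stream_archive(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

for entry in stream_archive("dev.archive"):
    pass  # process each entry here instead of collecting them in a list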

thin.tar download URL unreachable

https://summari.es/files/thin.tar can't be reached. When I try from the browser, the download fails at around 20 MB each time. Using wget, the following error is received:
Connecting to summari.es:80... connected!
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://summari.es/files/thin.tar [following]
--13:16:19-- ftp://https:21/%2Fsummari.es/files/thin.tar
=> `thin.tar'
Connecting to https:21...
https: Host not found

Is there any mirror for the thin.tar file?

Description of JSON fields needed

I couldn't find a complete description of the JSON data fields of the dataset. Did I miss it?

Most of the fields are self-explanatory, but I'm wondering, for example, what these are:

  • extbin
  • sumbin
  • textbin

Are they the extractiveness, summary length, and text length?
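
For reference, a quick way to check exactly which fields a given release contains, as a sketch assuming the standard .jsonl.gz distribution (the filename is illustrative):

import gzip
import json

# Print the field names of the first record in a Newsroom JSON-lines file.
with gzip.open("dev.jsonl.gz", "rt", encoding="utf-8") as f:
    first_record = json.loads(next(f))
print(sorted(first_record.keys()))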

Some articles have empty summaries, and the scrape script doesn't always detect broken links

Hi, thanks a lot for this great dataset. I have been looking forward to training models on it, but I ran into two issues.

The first is related to #2. Some of the Web Archive links redirect for some reason and the original article is lost. For example:
https://web.archive.org/web/2014013019id_/http://www.nytimes.com/2014/01/31/world/europe/ukraine-unrest.html

This can be solved as suggested in the comment on #2, i.e. by using exactness 3. However, it is problematic for anyone who does not set exactness to 3, since the newsroom-scrape script doesn't detect the broken link, and the malformed article-summary pair ends up in the final dataset.

The second issue is that some summaries are empty. For example:
http://latimesblogs.latimes.com/technology/2011/08/how-to-opt-out-of-linkedin-sharing-your-data-with-third-parties.html

In this case, should the article be excluded from the dataset? I don't see how a meaningful summary can be extracted from this web page.
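
Until such pairs are filtered upstream, one workaround is to drop them while loading the extracted dataset. A sketch, assuming the documented summary and text keys (the path is illustrative):

import gzip
import json

# Skip article-summary pairs whose summary or text is empty or
# whitespace-only, as happens with redirected or broken links.
def load_clean(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            summary = (entry.get("summary") or "").strip()
            text = (entry.get("text") or "").strip()
            if summary and text:
                yield entry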

Syntax error when newsroom-scrape imports the Python concurrent library

Hi,

Whenever I launch newsroom-scrape, I get a SyntaxError from the concurrent library:

Traceback (most recent call last):
  File "/usr/local/bin/newsroom-scrape", line 11, in <module>
    load_entry_point('newsroom', 'console_scripts', 'newsroom-scrape')()
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 480, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2693, in load_entry_point
    return ep.load()
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2324, in load
    return self.resolve()
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2330, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/crow/src/newsroom/newsroom/__init__.py", line 2, in <module>
    from . import build
  File "/home/crow/src/newsroom/newsroom/build/__init__.py", line 1, in <module>
    from .download import Downloader
  File "/home/crow/src/newsroom/newsroom/build/download.py", line 2, in <module>
    from concurrent.futures import ThreadPoolExecutor
  File "/usr/local/lib/python3.6/dist-packages/concurrent/futures/__init__.py", line 8, in <module>
    from concurrent.futures._base import (FIRST_COMPLETED,
  File "/usr/local/lib/python3.6/dist-packages/concurrent/futures/_base.py", line 381
    raise exception_type, self._exception, self._traceback
                        ^
SyntaxError: invalid syntax

I think it's due to the use of Python 3 with a library written for Python 2. What can I do? I'm using Python 3.
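
The dist-packages path in the traceback suggests the Python 2 futures backport from PyPI is shadowing the Python 3 standard library module; the raise exception_type, self._exception, self._traceback form is Python 2 syntax. Assuming that diagnosis, this check shows which package is actually being imported:

import concurrent

# On a clean Python 3 install this prints a standard-library path, e.g.
# /usr/lib/python3.6/concurrent/__init__.py; a dist-packages or
# site-packages path means the Python 2 backport is shadowing the
# stdlib, and `pip uninstall futures` usually resolves the SyntaxError.
print(concurrent.__file__)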

missing some values in summary objects

Summary objects only contain the keys ['url', 'archive', 'title', 'date', 'text', 'summary', 'density', 'coverage', 'compression', 'compression_bin', 'coverage_bin', 'density_bin']. The ['subset', 'publication'] fields stated on your website are missing. Without 'subset' we cannot compare our results to those on the leaderboard. Can you provide this information? Thanks.

Some of the URLs are missing from both the Web Archive and the original web sites

After running the crawler a couple of times with a reduced number of workers, and even exactness 4, I have around 100 articles still missing in both the dev and test sets (train is still running, currently missing about 2700). For example:
https://newoldage.blogs.nytimes.com/2009/12/03/ccrc-fees-prepare-to-be-bewildered/index.html

Others are on the original websites, but not found on the Web Archive:
https://www.nytimes.com/2010/10/21/opinion/21kristof.html
https://web.archive.org/web/*/https://www.nytimes.com/2010/10/21/opinion/21kristof.html
(All Web Archive saves redirect to errors.)

Any idea what went wrong there? Could there be a fix to fetch them directly from the original sites when available? I can post the full lists if that's helpful.

Problem when installing newsroom

Hi! I'm learning newsroom. When I used pip to install it via git+git, I hit this problem: failed building wheel for spacy and thinc. I don't understand this situation and have tried many times.
I would appreciate your help!

data extract doesn't work

When I run the command newsroom-extract --archive dev.archive --dataset dev.data
to obtain the data file, the command line shows this message:
"Loading downloaded summaries:
Aborted!"
Then it stops. Can you help fix this? Thanks.

Impaired dataset integrity

The only keys present in the datasets are:
- archive
- date
- compression
- coverage
- density
- compression_bin
- coverage_bin
- density_bin

I cannot see the "summary" and "text" values of each object.

data extraction error

I am getting this error

_pickle.PicklingError: Can't pickle <function Article.process at 0x7f5f2ab56d90>: attribute lookup process on newsroom.build.filter failed

when using the following command:

newsroom-extract --archive dev.archive --dataset dev.data

I am running it with Python 3.4.

No module named 'lxml.etree'

When running the newsroom-scrape command, I got this error:

ModuleNotFoundError: No module named 'lxml.etree'

Full stack trace:

Traceback (most recent call last):
  File "/home/x/.venv/36/bin/newsroom-scrape", line 11, in <module>
    load_entry_point('newsroom', 'console_scripts', 'newsroom-scrape')()
  File "/home/x/.venv/36/lib/python3.6/site-packages/pkg_resources/__init__.py", line 489, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/home/x/.venv/36/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2793, in load_entry_point    return ep.load()
  File "/home/x/.venv/36/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2411, in load
    return self.resolve()
  File "/home/x/.venv/36/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2417, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/x/.venv/36/src/newsroom/newsroom/__init__.py", line 2, in <module>
    from . import build
  File "/home/x/.venv/36/src/newsroom/newsroom/build/__init__.py", line 2, in <module>
    from .filter import Article
  File "/home/x/.venv/36/src/newsroom/newsroom/build/filter.py", line 5, in <module>
    from readability import Document
  File "/home/x/.venv/36/lib/python3.6/site-packages/readability/__init__.py", line 1, in <module>
    from .readability import Document
  File "/home/x/.venv/36/lib/python3.6/site-packages/readability/readability.py", line 8, in <module>
    from lxml.etree import tostring
ModuleNotFoundError: No module named 'lxml.etree'
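
lxml.etree is a compiled extension module shipped by the lxml package, so this error usually means lxml is missing from (or broken in) the active virtualenv. A quick diagnostic sketch:

import importlib.util

# None means lxml is not installed for this interpreter; installing it
# into the same virtualenv (pip install lxml) normally fixes the import.
print(importlib.util.find_spec("lxml"))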

Possible errors in ROUGE-L evaluation

ROUGE-L is sensitive to sentence tokenization.

The data format used by the newsroom-run -> newsroom-score -> newsroom-tables evaluation pipeline does not appear to keep track of sentence tokenization. When sentence tokenization is not provided to ROUGE-1.5.5, multi-sentence references and hypotheses are evaluated as one long sentence. As a result, ROUGE-L scores produced by this evaluation pipeline (1) may be lower than expected compared to more standard ROUGE evaluation that uses tokenized sentences, and (2) probably do not match the ROUGE-L scores in the Newsroom paper, which are computed using sentence tokenization.

I would recommend adding a notice to the README that the evaluation pipeline does not keep track of sentence tokenization, which may result in lower-than-expected ROUGE-L scores, and that the newsroom-run -> newsroom-score -> newsroom-tables evaluation pipeline should not be used for publishable evaluation.
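
For context, ROUGE-1.5.5 treats each input line as a single sentence, so sentence-aware ROUGE-L requires writing references and hypotheses one sentence per line before scoring. A sketch of that preprocessing, using NLTK's sentence tokenizer as an illustrative choice (not necessarily the paper's exact setup):

import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer models

# ROUGE-1.5.5 reads one sentence per line, so splitting multi-sentence
# summaries before scoring restores sentence-level ROUGE-L behavior.
def to_rouge_input(summary):
    return "\n".join(nltk.sent_tokenize(summary))

print(to_rouge_input("First sentence. Second sentence."))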

Error when running newsroom-scrape --thin thin/dev.jsonl.gz --archive dev.archive

newsroom-scrape --thin thin/dev.jsonl.gz --archive dev.archive
Traceback (most recent call last):
  File "/usr/local/bin/newsroom-scrape", line 11, in <module>
    load_entry_point('newsroom', 'console_scripts', 'newsroom-scrape')()
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 587, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2800, in load_entry_point
    return ep.load()
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2431, in load
    return self.resolve()
  File "/usr/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2437, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/Users/aburkov/Downloads/src/newsroom/newsroom/__init__.py", line 2, in <module>
    from . import build
  File "/Users/aburkov/Downloads/src/newsroom/newsroom/build/__init__.py", line 1, in <module>
    from .download import Downloader
  File "/Users/aburkov/Downloads/src/newsroom/newsroom/build/download.py", line 55
    yield from executor.map(self._thread, urls)
             ^
SyntaxError: invalid syntax
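
The paths in this traceback are under /usr/local/lib/python2.7, and yield from only exists on Python 3.3+, which explains the SyntaxError; installing and running the package under a Python 3 interpreter avoids it. A minimal guard, as a sketch:

import sys

# newsroom's download.py uses `yield from`, which requires Python >= 3.3;
# fail fast with a clear message under an older interpreter.
assert sys.version_info >= (3, 3), "newsroom needs Python 3.3 or newer"
print(sys.executable, sys.version.split()[0])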

ModuleNotFoundError: No module named 'newsroom'

Hi. I am getting this module error. Am I supposed to download it from somewhere, or do I need a newsroom.py file or something? I can't seem to find any such file in the code. What am I missing?

evaluation tools

Will the evaluation tools be made available anytime soon? If not, do you recommend a specific Python wrapper for ROUGE? Thanks.

failed to scrape

Hi, I tried to build the Newsroom dataset from scratch. After I ran newsroom-scrape --thin thin/dev.jsonl.gz --archive dev.archive, the dev.archive file did not appear. It took me 10 days to run the command and there is nothing left at the end. Is something wrong with scraping from scratch?

And the final result I get is:

108800 pages need re-downloading later.
108810 pages need re-downloading later.
108820 pages need re-downloading later.
108830 pages need re-downloading later.


Rerun the script: 108837 pages failed to download.
- Try running with a lower --workers count (default = 16).
- Check which URLs are left with the --diff flag.
- Last resort: --exactness X to truncate dates to X digits.
  (e.g., --exactness 4 will download the closest year.)

Downloading Summaries: 100%|██████████| 108837/108837 [249:09:03<00:00,  8.24s/it]

ValueError: Unmatched ''"' when when decoding 'string'

ub16hp@UB16HP:~/ub16_prj/newsroom$ newsroom-scrape --thin thin/dev.jsonl.gz --archive dev.archive

gzip: stdin: unexpected end of file
Loading previously downloaded summaries:
Traceback (most recent call last):
  File "/usr/local/bin/newsroom-scrape", line 11, in <module>
    load_entry_point('newsroom', 'console_scripts', 'newsroom-scrape')()
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/ub16hp/ub16_prj/newsroom/newsroom/build/scrape.py", line 124, in main
    done = {ln["archive"] for ln in f}
  File "/home/ub16hp/ub16_prj/newsroom/newsroom/build/scrape.py", line 124, in <setcomp>
    done = {ln["archive"] for ln in f}
  File "/home/ub16hp/ub16_prj/newsroom/newsroom/build/jsonl.py", line 264, in readlines
    yield _json.loads(line)
ValueError: Unmatched ''"' when when decoding 'string'
ub16hp@UB16HP:~/ub16_prj/newsroom$
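
The gzip: stdin: unexpected end of file line points to a truncated download, and the JSON error then comes from a partially written record. A diagnostic sketch, assuming the .archive file is gzipped JSON lines, to locate the first corrupt line (re-downloading is usually the fix):

import gzip
import json

# Report the first unparseable record in a gzipped JSON-lines archive;
# truncation near the end means the scrape was interrupted mid-write.
count = 0
with gzip.open("dev.archive", "rt", encoding="utf-8") as f:
    try:
        for line in f:
            count += 1
            json.loads(line)
        print("archive parsed cleanly:", count, "records")
    except (ValueError, EOFError) as error:
        print("problem after record", count, "->", error)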

Complete dataset download

I submitted a request for the complete dataset from the website, but still haven't received any mail.

If the process takes some time, you might want to send an acknowledgement mail, something along the lines of "Your request is now being processed", to be sure the user didn't give a wrong mail address.

Dataset Diversity Analysis

Dear Artzi @yoavartzi, first of all, I want to thank you for your great work.
I have two questions about the diversity analysis in your paper. I used your code to calculate the coverage, density, and compression, and then used seaborn.kdeplot to visualize the results. But I found my results differ from Figure 4 in your paper: the coverage scores seem much lower. My questions are:

  1. Did you divide the coverage scores by the maximum value, or conduct min-max normalization?
  2. Did you randomly sample, or use the entire dataset (training set), to calculate these three metrics?

Thank you again for your help.

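
For reference when comparing such plots, the paper defines the three statistics over the set F(A, S) of shared extractive fragments between article A and summary S; a sketch transcribing those definitions (fragment token lengths are assumed to be precomputed):

# Coverage is a fraction of summary tokens and already lies in [0, 1],
# so no max or min-max normalization should be needed; density is
# unbounded above.
def coverage(fragment_lengths, summary_length):
    # (1/|S|) * sum of |f| over shared fragments
    return sum(fragment_lengths) / summary_length

def density(fragment_lengths, summary_length):
    # (1/|S|) * sum of |f|^2 over shared fragments
    return sum(f * f for f in fragment_lengths) / summary_length

def compression(article_length, summary_length):
    # |A| / |S|
    return article_length / summary_length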
