news-crawl's Introduction

NEWS-CRAWL

Crawler for news based on StormCrawler. It produces WARC files to be stored as part of the Common Crawl. The data is hosted as an AWS Open Data Set - if you want to use the data and not the crawler software, please read the announcement of the news dataset.

Prerequisites

  • Java 8
  • Install Elasticsearch 7.5.0 (optionally also Kibana)
  • Install Apache Storm 1.2.4
  • Start Elasticsearch and Storm
  • Build ES indices by running bin/ES_IndexInit.sh

Crawler Seeds

The crawler relies on RSS/Atom feeds and news sitemaps to find links to news articles on news sites. A small collection of example seeds (feeds and sitemaps) is provided in ./seeds/. Adding support for news sites which do not provide a news feed or sitemap is an open issue, see #41.

Configuration

The default configuration should work out of the box. The only thing to do is to configure the user agent properties sent in the HTTP request header. Open the file conf/crawler-conf.yaml in an editor and fill in the values for http.agent.name and all further properties starting with the http.agent. prefix.
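
For illustration, the relevant part of conf/crawler-conf.yaml could look like the sketch below; all values are placeholders to be replaced with your own contact details.

http.agent.name: "ExampleNewsBot"
http.agent.version: "1.0"
http.agent.description: "experimental news crawl"
http.agent.url: "https://www.example.org/crawler-info.html"
http.agent.email: "crawler-admin@example.org"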

Run the crawl

Generate an uberjar:

mvn clean package

And run ...

storm jar target/crawler-1.18.1.jar org.commoncrawl.stormcrawler.news.CrawlTopology -conf $PWD/conf/es-conf.yaml -conf $PWD/conf/crawler-conf.yaml $PWD/seeds/ feeds.txt

This will launch the crawl topology. It will also "inject" all URLs found in the file ./seeds/feeds.txt in the status index. The URLs point to news feeds and sitemaps from which links to news articles are extracted and fetched. The topology will create WARC files in the directory specified in the configuration under the key warc.dir. This directory must be created beforehand.

Of course, it's also possible to add (or remove) seeds (feeds and sitemaps) using the Elasticsearch API. In this case, the topology can be run without the last two arguments.

Alternatively, the topology can be run from crawler.flux; please see the Storm Flux documentation. Make sure to adapt the Flux definition to your needs!

Monitor the crawl

When the topology is running, you can check that URLs have been injected and that news articles are being fetched at http://localhost:9200/status/_search?pretty. Or use StormCrawler's Kibana dashboards to monitor the crawling process; please follow the instructions to install the Kibana templates provided as part of StormCrawler's Elasticsearch module documentation.
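
If you prefer scripting over the browser, the following minimal Python sketch (assuming the requests package is installed, Elasticsearch listens on localhost:9200, and the status field is mapped as a keyword by ES_IndexInit.sh) counts the status index documents per status:

import requests

# count documents in the status index per status value (DISCOVERED, FETCHED, ...)
query = {"size": 0, "aggs": {"by_status": {"terms": {"field": "status"}}}}
resp = requests.post("http://localhost:9200/status/_search", json=query, timeout=10)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_status"]["buckets"]:
    print(bucket["doc_count"], bucket["key"])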

There is also a shell script bin/es_status to get aggregated counts from the status index, and to add, delete or force a re-fetch of URLs. E.g.,

$> bin/es_status aggregate_status
3824    DISCOVERED
34      FETCHED
5       REDIRECTION

Run Crawl from Docker Container

First, download Apache Storm 1.2.4 from the download page and place it in the directory downloads:

STORM_VERSION=1.2.4
mkdir downloads
wget -q -P downloads --timestamping https://downloads.apache.org/storm/apache-storm-$STORM_VERSION/apache-storm-$STORM_VERSION.tar.gz

Do not forget to build the uberjar (see above); it is included in the Docker image. Simply run:

mvn clean package

Then build the Docker image from the Dockerfile:

docker build -t newscrawler:1.18.1 .

To launch an interactive container:

docker run --net=host \
    -v $PWD/data/elasticsearch:/data/elasticsearch \
    -v $PWD/data/warc:/data/warc \
    --rm --name newscrawler -i -t newscrawler:1.18.1 /bin/bash

NOTE: don't forget to adapt the paths of the mounted volumes used to persist data on the host. Make sure to add the user agent configuration in conf/crawler-conf.yaml.

CAVEAT: Make sure that the Elasticsearch port 9200 is not already in use or mapped by a running ES instance. Otherwise Elasticsearch commands may affect the running instance!

The crawler is launched in the running container by the script

/home/ubuntu/news-crawler/bin/run-crawler.sh

After 1-2 minutes, if everything is up, you can connect to Elasticsearch on port 9200 or to Kibana on port 5601.

news-crawl's People

Contributors

jnioche, sebastian-nagel

news-crawl's Issues

mvn clean package fails on Mac on Apple M1 Pro chip

Hi friends,

I am using news-crawl for academic research but am unable to set it up on my Mac. I encountered a build error and would really appreciate some help!

Environment:
MacOS Monterey 12.2.1, Apple M1 Pro
javac 20.0.2
openjdk 20.0.2 2023-07-18 : OpenJDK Runtime Environment (build 20.0.2+9-78); OpenJDK 64-Bit Server VM (build 20.0.2+9-78, mixed mode, sharing)
Apache Maven 3.9.3 (21122926829f1ead511c958d89bd2f672198ae9f)
Apache Storm 2.4.0
elasticsearch 8.8.2
On bash shell
news-crawler commit 4194f9c

Steps to reproduce:

  1. I installed Elasticsearch and Apache Storm; started Elasticsearch; built the ES indices with bin/ES_IndexInit.sh
  2. Cloned news-crawl, changed into the directory and ran "mvn clean package"

Expected result

  1. Successful build

Actual result:

  1. Build fails (error shown in three screenshots attached to the original issue, taken 2023-07-22 at 4:42-4:43 PM)

I am new to Java development. Please let me know if I can provide additional context! Thanks

WARC file format fix: mask HTTP header fields Content-Encoding and Transfer-Encoding, adjust Content-Length

The CC-NEWS WARC files contain the literal values of the HTTP header fields Content-Encoding, Transfer-Encoding and Content-Length although the payload is stored unchunked and uncompressed.

  • the header fields Content-Encoding and Transfer-Encoding should be masked by a prefix (the CC-MAIN WARC files use X-Crawler-)
  • if the value of Content-Length is wrong because the Content-Encoding changed, the original header should be masked as well and the corrected value written to Content-Length

Thanks, @wumpus for detecting this!
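
A minimal Python sketch of the intended masking; it assumes the payload passed in is the decoded (unchunked, uncompressed) body as stored in the WARC record, and uses the X-Crawler- prefix mentioned above:

def mask_http_headers(headers, payload):
    """Mask encoding-related headers and record the true length of the stored payload.

    headers: list of (name, value) pairs from the original HTTP response
    payload: the payload bytes as stored in the WARC record (unchunked, uncompressed)
    """
    masked = []
    for name, value in headers:
        if name.lower() in ("content-encoding", "transfer-encoding", "content-length"):
            masked.append(("X-Crawler-" + name, value))   # keep the original value under a prefix
        else:
            masked.append((name, value))
    masked.append(("Content-Length", str(len(payload))))  # length of the payload actually stored
    return masked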

Error in build docker

Hello, I'm getting this error while building the Docker image:

ADD failed: stat /var/lib/docker/tmp/docker-builder779286778/target/crawler-1.17.jar: no such file or directory

PS: I have also tried the said path, but it's not working either.

Full support for sitemap extensions and namespaces

Of all sitemaps, only news sitemaps are accepted as a seed source. However,

  • information from extensions is not used (cf. news:publication_date #18)
  • any sitemap which declares the news namespace is considered to be a news sitemap. But there are also image sitemaps which declare the news namespace:
    • the image URLs should not be extracted as news seeds
    • possibly, items not marked as news should also be skipped
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<urlset xmlns="http://www.sitemadps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>...</loc>
    <lastmod>2018-03-18T21:44:50-07:00</lastmod>
    <image:image>
      <image:loc>...</image:loc>
      <image:title>...

Cf. crawler-commons/crawler-commons#162, crawler-commons/crawler-commons#174.

Adaptive fetch schedule for feeds

At present, the refetch schedule for seed feeds is globally 3 hours, which is a compromise between

  • trying not to miss links for large national newspapers (a 10 min. interval is required to catch every news link during rush hours, e.g., Monday morning)
  • not overloading small regional newspapers or news blogs which publish only a few news items per day

The schedule should adapt to the change frequency within a configurable minimum and maximum refetch interval (e.g., 10 min. to 2 weeks). Detection of unchanged feeds should not depend on a Last-Modified time sent with the server response.
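
A minimal sketch of one possible adaptive policy (an assumption, not the StormCrawler AdaptiveScheduler implementation): shorten the interval when the feed changed, back off when it did not, and clamp to the configured bounds. Change detection could, e.g., compare a digest of the listed URLs instead of relying on Last-Modified.

MIN_INTERVAL_MIN = 10                  # e.g. 10 minutes
MAX_INTERVAL_MIN = 2 * 7 * 24 * 60     # e.g. 2 weeks

def next_fetch_interval(current_interval_min, feed_changed, shrink=0.5, grow=1.5):
    """Return the next refetch interval (in minutes) for a feed or sitemap."""
    if feed_changed:
        interval = current_interval_min * shrink   # changed: check more often
    else:
        interval = current_interval_min * grow     # unchanged: back off
    return max(MIN_INTERVAL_MIN, min(MAX_INTERVAL_MIN, interval))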

AdaptiveScheduler not applied to RSS/Atom feeds

The fetchInterval in the metadata is not properly updated for RSS/Atom feeds. News sitemaps do not seem to be affected... (seen with the recent version based on StormCrawler 1.8 / Elasticsearch 6.0)

NewsSiteMapParserBolt: do not detect feeds as sitemaps

If a news feed declares the sitemaps namespace, it is erroneously detected as a sitemap and consequently processed as a sitemap (without being properly parsed) instead of as a feed. One example feed:

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.drudge.com/~d/styles/itemcontent.css"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sitemap="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:wordzilla="http://www.cadenhead.org/workbench/wordzilla/namespace" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

Should follow subsitemaps in sitemap index

The news crawler uses only news sitemaps as a "news feed" and ignores "ordinary" sitemaps, not following the URLs listed there. However, the crawler should follow sitemaps listed in a sitemap index and check whether any of them is a news sitemap.
E.g., while https://www.greenwichtime.com/sitemap_news.xml is not itself a news sitemap, it links to a bunch of news sitemaps:

<?xml version="1.0" encoding="UTF-8" ?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
	<loc>http://www.greenwichtime.com/sitemap/news/ap.xml</loc>
	<lastmod>2018-02-08T03:15:03Z</lastmod>
</sitemap>
...
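
A minimal Python sketch of the proposed behaviour (standard library only; robots.txt handling and politeness are omitted, and the cheap substring test for the news namespace is an assumption, a stricter check would parse each sub-sitemap):

import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
NEWS_NS = b"http://www.google.com/schemas/sitemap-news/0.9"

def news_sitemaps_from_index(index_url):
    """Yield sub-sitemap URLs of a sitemap index which declare the news namespace."""
    with urllib.request.urlopen(index_url) as resp:
        index = ET.fromstring(resp.read())
    for sm in index.iter(SITEMAP_NS + "sitemap"):
        loc = sm.findtext(SITEMAP_NS + "loc")
        if not loc:
            continue
        with urllib.request.urlopen(loc.strip()) as resp:
            content = resp.read()
        if NEWS_NS in content:   # cheap check; a stricter test would parse the sub-sitemap
            yield loc.strip()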

Reset WARC file size counter on time-based rotation

The WARC file rotation may happen unnecessarily often:

% ls -lh /data/warc/
-rw-r--r--  1 storm    storm 983M Sep 28 07:43 CC-NEWS-20160927074341-00000.warc.gz
-rw-r--r--  1 storm    storm  42M Sep 28 07:59 CC-NEWS-20160928074341-00001.warc.gz
-rw-r--r--  1 storm    storm 240M Sep 28 12:15 CC-NEWS-20160928075927-00002.warc.gz

The file with timestamp 07:59 should be part of the next WARC. This happens when the time-based limit applies first:

2016-09-28 07:43:41.636 o.a.s.h.b.r.FileSizeRotationPolicy [INFO] Rotating file based on time : started 1474962208980 interval 86400000
2016-09-28 07:43:41.636 o.a.s.h.b.AbstractHdfsBolt [INFO] Rotating output file...
2016-09-28 07:43:41.651 o.a.s.h.b.AbstractHdfsBolt [INFO] Performing 0 file rotation actions.
2016-09-28 07:43:41.651 o.a.s.h.b.AbstractHdfsBolt [INFO] File rotation took 15 ms.

which obviously did not properly reset the file size counter. 15 min. later, the file size limit (1 GB) is logged as hit (983 MB + 42 MB = 1025 MB ≈ 1 GB):

2016-09-28 07:59:27.221 o.a.s.h.b.r.FileSizeRotationPolicy [INFO] Rotating file based on size : currentBytesWritten 1073768981 maxBytes 1073741824
2016-09-28 07:59:27.221 o.a.s.h.b.AbstractHdfsBolt [INFO] Rotating output file...
2016-09-28 07:59:27.233 o.a.s.h.b.AbstractHdfsBolt [INFO] Performing 0 file rotation actions.
2016-09-28 07:59:27.233 o.a.s.h.b.AbstractHdfsBolt [INFO] File rotation took 12 ms.

unable to fetch data from Elasticsearch, no content is showing

I have used the following GitHub repository: https://github.com/commoncrawl/news-crawl
The author has used the following versions of the required libraries:

Install Elasticsearch 7.5.0
Install Apache Storm 1.2.3
stormcrawler 1.16
maven 3.6.2

I have followed the steps given in the README, but my localhost:9200 is not showing any hits.
The configuration command runs successfully, but it shows some FETCH_ERROR errors.
No content or URL is being shown.

Provide indexing outside AWS

For the time being, I do not have AWS credentials. This means I'm unable to determine the filenames for the crawl dumps. Could the directory structure be made available by methods other than aws ls?

I might be wrong, but I don't think aws ls can be run anonymously.

Run docker in a non-interactively way

Hi, I've been trying to run the Docker container non-interactively using the following command:

docker run -d \
   -p 127.0.0.1:9200:9200 -p 5601:5601 -p 8080:8080 \
   -v .../data/warc:/data/warc \
   -v .../data/elasticsearch:/data/elasticsearch \
   -t newscrawler:1.18 /home/ubuntu/news-crawler/bin/run-crawler.sh

the complete logs can be found here
When running it interactively I have no problem. Any idea what the problem is?

Use wikidata to complete seeds

Initially, the news crawler was seeded with URLs from news sites from DMOZ, see #8 for the procedure. DMOZ isn't updated anymore, but Wikidata could be a replacement to complete the seed list:

  • select all instances of newspaper (news media, or similar) having an official website:
    SELECT DISTINCT ?item ?itemLabel ?lang ?url
    WHERE
    { 
      ?item wdt:P31/wdt:P279* wd:Q11032.
      ?item wdt:P856 ?url.  # with official website
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,ru,fr,es,it,ja,zh,*" }
      OPTIONAL {
         ?item wdt:P407 ?language.
         ?language wdt:P220 ?lang.
       }
    }
    LIMIT 50
    (execute query on Wikidata query service)

News WARC files processing issue.

Basically I am trying to iterate over the records of a news WARC file to get the HTML content and process it. I am using the Python warc package.
Snippet to read the WARC file:
import warc

f = warc.open("CC-NEWS-20161001224340-00008.warc")
for record in f:
    if record['Content-Type'] == 'application/http; msgtype=response':
        payload = record.payload.read()
        headers, body = payload.split('\r\n\r\n', 1)
        if 'Content-Type: text/html' in headers:
            # do my processing with the HTML content (body)
            pass

But when I run this I am getting this error:

Traceback (most recent call last):
  File "warc_process.py", line 69, in <module>
    read_entire_warc("CC-NEWS-20160926211809-00000.warc")
  File "warc_process.py", line 54, in read_entire_warc
    for record in f:
  File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 393, in __iter__
    record = self.read_record()
  File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 364, in read_record
    self.finish_reading_current_record()
  File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 360, in finish_reading_current_record
    self.expect(self.current_payload.fileobj, "\r\n")
  File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 352, in expect
    raise IOError(message)
IOError: Expected '\r\n', found 'WARC/1.0\r\n'

Sample WARC files I am facing issues with:
CC-NEWS-20160926211809-00000.warc
CC-NEWS-20161001122244-00007.warc.gz
CC-NEWS-20161001224340-00008.warc.gz
CC-NEWS-20161002224346-00009.warc.gz
CC-NEWS-20161003130443-00010.warc.gz
CC-NEWS-20161004130444-00011.warc.gz
CC-NEWS-20161005130450-00012.warc.gz
CC-NEWS-20161005152607-00013.warc.gz
CC-NEWS-20161006152607-00014.warc.gz
CC-NEWS-20161006191324-00015.warc.gz
CC-NEWS-20161007191326-00016.warc.gz
CC-NEWS-20161008015559-00017.warc.gz
CC-NEWS-20161009015614-00018.warc.gz
CC-NEWS-20161010001731-00019.warc.gz
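
For readers hitting the same error: a minimal sketch using the warcio package (a different reader, not part of this repository) which is generally more tolerant of such files:

from warcio.archiveiterator import ArchiveIterator

with open("CC-NEWS-20161001224340-00008.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        content_type = record.http_headers.get_header("Content-Type") or ""
        if "text/html" in content_type:
            body = record.content_stream().read()
            # process the HTML body here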

NewsSiteMapParserBolt fails to parse valid XML sitemap

NewsSiteMapParserBolt fails to parse some valid XML sitemaps, e.g.,

2018-03-09 18:14:13.924 o.c.s.n.NewsSiteMapParserBolt Thread-30-sitemap-executor[10 11] [INFO] http://www.pjstar.com/section/google-news-sitemap detected as news sitemap based on content
2018-03-09 18:14:13.924 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [<?xml version="1.0" encoding="UTF-8"?>]
2018-03-09 18:14:13.924 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"]
2018-03-09 18:14:13.926 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [     xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [     xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [   <url>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [           <loc>http://www.pjstar.com/news/20180309/chosen-family-portrait-group-that-needed-each-other</loc>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [           <news:news>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [           <news:news>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [           <news:news>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [                   <news:publication>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [                           <news:name>Peoria Journal Star</news:name>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [                           <news:language>en</news:language>]
2018-03-09 18:14:13.928 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [                   </news:publication>]
2018-03-09 18:14:13.928 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [                   <news:publication_date>2018-03-09</news:publication_date>]

For this sitemap the server responds with Content-Type: text/html; charset=ISO-8859-1, which seems to prevent the content from even being tried as XML.

Extract publishing date

It would be great if you could additionally extract the date when an article was published. Currently, this requires parsing the web page and using tools such as newspaper3k to get that information. However, during the crawling process at least some webpages would offer this information, e.g. the time stamp within the RSS feed
<pubDate>Thu, 25 Dec 2014 02:10:00 +0900</pubDate>
or within the sitemap
<news:publication_date>2016-12-09T16:18:48Z</news:publication_date>
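
A minimal Python sketch (standard library) of parsing the two timestamp formats shown above, assuming the values have already been extracted from the feed or sitemap:

from datetime import datetime
from email.utils import parsedate_to_datetime

# RSS <pubDate> values use the RFC 822 date format
pub_date = parsedate_to_datetime("Thu, 25 Dec 2014 02:10:00 +0900")

# <news:publication_date> values are W3C/ISO 8601 timestamps
news_date = datetime.fromisoformat("2016-12-09T16:18:48Z".replace("Z", "+00:00"))

print(pub_date.isoformat(), news_date.isoformat())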

Check WARC files generated

Compare with what we get from https://webrecorder.io/
Read with https://github.com/ikreymer/webarchiveplayer
Try with the warcdump command

URL filter: exclude localhost and private addresses

The URL filters should reject localhost and private address spaces. The crawler may detect links pointing to a private network address, e.g.

2017-12-23 08:37:22.104 c.d.s.b.FetcherBolt FetcherThread #54 [ERROR] Exception while fetching http://localhost/wordpress/2017/.../
org.apache.http.conn.HttpHostConnectException: Connect to localhost:80 [localhost/127.0.0.1] failed: Connection refused (Connection refused)

This example looks more like an error on the remote page. But the crawler should never even try to access pages from localhost or a private network, to avoid leaking information into the WARC files. It could be, e.g., a link to the Storm web interface (http://localhost:8080/) exposing the cluster configuration.
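
A minimal sketch of such a URL filter (Python standard library; DNS resolution is skipped, so a public host name resolving to a private address would still pass):

import ipaddress
from urllib.parse import urlsplit

def is_private_target(url: str) -> bool:
    """Return True if the URL points at localhost or a private/reserved address."""
    host = urlsplit(url).hostname or ""
    if host == "localhost" or host.endswith(".localhost") or host.endswith(".local"):
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False    # a regular host name; DNS resolution would be needed to be sure
    return addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved

print(is_private_target("http://localhost/wordpress/2017/"))   # True
print(is_private_target("http://192.168.1.10/feed"))           # True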

Endless refetch of URLs due to changing domain names

The news crawler uses the domain name to manage fetch queues. The domain name is also used to route URLs to Elasticsearch shards. When a URL is re-fetched, the existing routing key isn't reused; instead, the domain name is extracted anew from the host name and used as the routing key. This makes the routing unstable, because the domain name extraction is based on the changing and continuously updated public suffix list. If the routing changes, the status record doesn't get updated; instead, a second record with the same ID is created. Because the nextFetchDate of the original record is still in the past and is never updated, the URL is scheduled for re-fetch again and again.

Two examples (the domain name in the updated version is the correct one):

  • domain la.lv (la.lv is not a public suffix, so veselam.la.lv is not the domain name)
% ./bin/es_status url http://veselam.la.lv/feed
{                                                                                                                                                             
  "took" : 85,                                                                                                                                                
  "timed_out" : false,                                                                                                                                                      
  "_shards" : {                                                                                                                                                             
    "total" : 10,
    "successful" : 10,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 16.336418,
    "hits" : [
      {
        "_index" : "status",
        "_type" : "status",
        "_id" : "0ffe9ec78013060e06b5ee955058ce2d42617af4a4e287660d33661797bacc05",
        "_score" : 16.336418,
        "_routing" : "veselam.la.lv",
        "_source" : {
          "url" : "http://veselam.la.lv/feed",
          "status" : "ERROR",
          "metadata" : {
            "error%2Ecause" : [
              "maxFetchErrors"
            ],
            "depth" : [
              "1"
            ],
            "fetch%2EstatusCode" : [
              "404"
            ],
            "hostname" : "veselam.la.lv"
          },
          "nextFetchDate" : "2019-02-16T13:11:28.794Z"
        }
      },
      {
        "_index" : "status",
        "_type" : "status",
        "_id" : "0ffe9ec78013060e06b5ee955058ce2d42617af4a4e287660d33661797bacc05",
        "_score" : 15.227517,
        "_routing" : "la.lv",
        "_source" : {
          "url" : "http://veselam.la.lv/feed",
          "status" : "FETCH_ERROR",
          "metadata" : {
            "error%2Ecause" : [
              "maxFetchErrors"
            ],
            "depth" : [
              "1"
            ],
            "fetch%2Eerror%2Ecount" : [
              "1"
            ],
            "fetch%2EstatusCode" : [
              "404"
            ],
            "hostname" : "la.lv"
          },
          "nextFetchDate" : "2019-02-18T11:39:33.000Z"
        }
      }
    ]
  }
}
  • domain sportmediaset.med - the top-level domain .med (also a public suffix) was only introduced in 2016.
% ./bin/es_status url http://www.sportmediaset.med
{
  "took" : 68,
  "timed_out" : false,
  "_shards" : {
    "total" : 10,
    "successful" : 10,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 16.015972,
    "hits" : [
      {
        "_index" : "status",
        "_type" : "status",
        "_id" : "313cb24e75fd9ab5f3e6bc4afbd66cb067de1d375c69fe49bfeabdf3df7f7372",
        "_score" : 16.015972,
        "_routing" : "www.sportmediaset.med",
        "_source" : {
          "url" : "http://www.sportmediaset.med",
          "status" : "ERROR",
          "metadata" : {
            "error%2Ecause" : [
              "maxFetchErrors"
            ],
            "depth" : [
              "2"
            ],
            "isSitemap" : [
              "false"
            ],
            "isSitemapNews" : [
              "false"
            ],
            "hostname" : "www.sportmediaset.med"
          },
          "nextFetchDate" : "2019-02-09T05:58:20.006Z"
        }
      },
      {
        "_index" : "status",
        "_type" : "status",
        "_id" : "313cb24e75fd9ab5f3e6bc4afbd66cb067de1d375c69fe49bfeabdf3df7f7372",
        "_score" : 14.726991,
        "_routing" : "sportmediaset.med",
        "_source" : {
          "url" : "http://www.sportmediaset.med",
          "status" : "FETCH_ERROR",
          "metadata" : {
            "error%2Ecause" : [
              "maxFetchErrors"
            ],
            "depth" : [
              "2"
            ],
            "isSitemap" : [
              "false"
            ],
            "isSitemapNews" : [
              "false"
            ],
            "fetch%2Eerror%2Ecount" : [
              "1"
            ],
            "hostname" : "sportmediaset.med"
          },
          "nextFetchDate" : "2019-02-18T11:43:49.000Z"
        }
      }
    ]
  }
}
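
Coming back to the cause: a minimal sketch of the idea behind a fix (an assumption, not the actual StormCrawler code) would be to reuse the routing key stored with the status record instead of re-deriving it from the public suffix list on every update:

def routing_key(status_record, extract_domain):
    """Return the Elasticsearch routing key to use when (re)writing a status record."""
    existing = status_record.get("_routing")
    if existing:
        return existing                                # reuse the key the record was stored under
    return extract_domain(status_record["url"])        # first write: derive it once from the URL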

Odd duplicate content behaviour on www.diariodeavila.es domain

There regularly appear to be thousands of duplicate articles from this domain, always with an identical initial path but ending in different slugs.

For example, I have noticed 2162 entries starting with https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863 that all appear to have identical content (the article title is El Senado de EEUU aprueba la legalidad del 'impeachment' for all of these urls). Here are some example urls:

 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/111m-para-crear-una-red-de-areas-de-descanso-para-camiones
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/120-contagiados-y-un-fallecido-balance-covid-de-la-jornada
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/124-castellanos-y-leoneses-han-recibido-ya-la-segunda-dosis
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/13-enmiendas-de-xav-por-14-millones-a-las-cuentas-regionales
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/14-positivos-en-el-cribado-de-la-zona-de-madrigal
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/160-contagios-mas-en-avila-en-un-dia-sin-fallecidos-covid
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/16-positivos-en-el-primer-dia-de-cribado-en-cebreros
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/170-nuevos-casos-covid-y-un-fallecido-en-el-hospital
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/172-positivos-covid-mas-y-casi-medio-centenar-de-ingresados
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/180-casos-y-un-fallecido-por-covid-balance-del-dia
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/184-detenidos-en-la-tercera-noche-de-disturbios-en-paises-bajos
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/200-incidencias-en-la-red-de-abastecimiento-por-el-temporal
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2020-deja-la-menor-cifra-de-empleados-publicos-del-decenio
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2020-dejo-la-menor-cifra-de-muertos-en-carreteras-abulenses
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2020-dejo-una-caida-minima-en-la-afiliacion-de-extranjeros
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2020-un-buen-ano-para-el-cerro-gallinero
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2145-vacunas-contra-la-covid-salen-de-avila
 https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2186-nuevos-casos-la-cifra-mas-alta-desde-noviembre

These were found in https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2021/02/CC-NEWS-20210210002910-00147.warc.gz

The problem appears to have started on the 1st Feb 2021 with the volume of pages from this site rising from ~50 per day to ~12000 per day.

produce WET files?

I'm not sure if this is the right place to ask this (feel free to direct me elsewhere),
but would it be possible to also produce WET files from this library?

Many downstream libraries of CC consume WET files (such as oscar-project/ungoliant)
And it would be useful if there were WET files available alongside WARC files.

Ensure that seeds are refetched from time to time even if failed or redirected

All manually collected seeds (feeds or future seed formats, e.g., news sitemaps) should be refetched from time to time (weekly or monthly) even if they are redirected or failed to fetch:

  • a redirect may change over time. This indeed happened within 2 months for 10 seeds out of 5000: 8 redirects now point to https instead of http, and two redirects have changed more significantly
  • any failure (failed to fetch, failed to parse, etc.) is potentially only a transient error

Ideally, the fetch schedule should be configurable for a combination of metadata and fetch status.

amazing dataset!

Hello there,

I just discovered your news-crawler and I think this is an amazing idea!

Sorry if this is a very simple question, but is it possible to somehow download slices of the news-crawling data (possibly based on a keyword/regex/domain) without resorting to amazon AWS?

The idea is that I have a very large cluster at my disposal already, so I would rather work with the raw data directly on my local cluster.

What do you think?
Thank you for your help

Automatic removal of ephemeral sitemaps

If a news site creates sitemaps with unique URLs on a daily basis (or even at shorter intervals), over time this leads to too many sitemaps being checked for updates, causing news articles to get stuck in queues jammed with sitemaps. The unique sitemap URLs can stem from the robots.txt or from a sitemap index. Typical URL/file patterns of ephemeral sitemaps include:

  • a timestamp in many variations:
    .../sitemap.xml?yyyy=2020&mm=02&dd=07
    .../sitemap-2017.xml?mm=12&dd=31
    .../sitemap-2019-04.xml
    .../sitemap?type=clanky-2019_9
    .../sitemap-201910.xml
    .../sitemap-news.xml?y=2018&m=03&d=19
    .../02-Sep-2019.xml
    .../articles_2019_06.xml
    .../sitemap_30-Nov-2019.xml
    .../sitemap_bydate.xml?startTime=2020-02-16T00:00:00&endTime=2020-02-22T23:59:59
    
  • a consecutive number, random ID, UUID, hash, etc.
    .../sitemap.xml?page=1424
    .../ymox96xuveov.xml
    /sitemaps/1151jawjodn3t.xml
    
  • or a combination of the above or with a news category:
    .../2019-05-13-0058/0817_8.xml
    .../world.xml?section_id=338&content_type=1&year=2017&month=9
    .../2019-12-22/sitemap.xml?page=1409
    

In the worst case, there can be 100k or even millions of sitemaps tracked for a domain, which requires manually blocking or cleaning up the list of sitemaps in order to be able to fetch news articles and follow the recent sitemaps.
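
A minimal Python sketch of a heuristic that flags likely-ephemeral sitemap URLs; the patterns are illustrations derived from the examples above, not what the crawler actually uses:

import re

EPHEMERAL_PATTERNS = [
    re.compile(r"(19|20)\d{2}[-_/]?(0[1-9]|1[0-2])"),          # year/month stamps in the path
    re.compile(r"\d{2}-[A-Z][a-z]{2}-(19|20)\d{2}"),           # 02-Sep-2019 style
    re.compile(r"[?&](y{1,4}|year|m{1,2}|month|d{1,2}|day|startTime)=", re.I),
    re.compile(r"[?&]page=\d{3,}"),                            # deep pagination
]

def looks_ephemeral(sitemap_url):
    """Heuristically flag sitemap URLs that embed dates, counters or pagination."""
    return any(p.search(sitemap_url) for p in EPHEMERAL_PATTERNS)

print(looks_ephemeral("https://example.com/sitemap-news.xml?y=2018&m=03&d=19"))   # True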

Do not use "http/2" protocol version in HTTP headers in WARC files

340 WARC files of the news crawl dataset, from 2020-09-12 until 2020-10-04, have been captured using HTTP/2 after a Java security upgrade which included ALPN and therefore allowed for HTTP/2. The crawler started to use HTTP/2 after an automatic restart.

The mentioned WARC files may cause WARC readers (e.g., jwarc) to fail while parsing the HTTP headers:

  • request
    GET /2020/09/12/business/brexit-no-deal-uk-economy/index.html HTTP/2
    ...
    
  • response
    HTTP/2 200 
    

To address the issue:

  • for now block usage of HTTP/2
  • test which WARC parsers fail
  • enable the WARC bolt to write failure-proof files when using HTTP/2 (cf. iipc/warc-specifications#15, iipc/warc-specifications#42)
  • push fixes to the WARC parser libs or rewrite the WARC files so that they're compatible

Affected files:

s3://commoncrawl/crawl-data/CC-NEWS/2020/09/CC-NEWS-20200912083952-00000.warc.gz
...
s3://commoncrawl/crawl-data/CC-NEWS/2020/10/CC-NEWS-20201004110027-00339.warc.gz

More than 80% of the records are captured using HTTP/2.

URLs with trailing white space continuously re-fetched

Certain URLs following a quite specific pattern are continuously re-fetched; here are the counts from a couple of hours of log files:

    669 http://www.tvl.be/,NLD,Leuven
    668 http://www.ltv.ly/,ARA,National
    667 http://www.rajdhani.com.np/,NEP,National
    665 http://ukrainian.voanews.com/,UKR,Nationwide
    665 http://www.ura-inform.com/,RUS,Nationwide
    662 http://lariviera.netweek.it/,ITA,Sanremo
    458 http://www.topix.com/world/burkina-faso/,ENG,Foreign

The status index entries for these URLs seem correct, e.g.:

      "_source" : {
        "url" : "http://www.topix.com/world/burkina-faso/,ENG,Foreign",
        "status" : "FETCHED",
        "metadata" : {
          "fetch%2EstatusCode" : [ "200" ]
        },
        "hostname" : "topix.com",
        "nextFetchDate" : "2027-01-31T21:18:49.092Z"
      }

or

      "_source" : {
        "url" : "http://www.ura-inform.com/,RUS,Nationwide",
        "status" : "REDIRECTION",
        "metadata" : {
          "_redirTo" : [ "http://ura-inform.com" ],
          "fetch%2EstatusCode" : [ "301" ]
        },
        "hostname" : "ura-inform.com",
        "nextFetchDate" : "2027-01-31T21:22:09.213Z"
      }

Consider archiving of news feeds and sitemaps

The news feeds and sitemaps can be useful in themselves - the feeds more than the sitemaps, because they include news titles and short snippets. It might make sense to also put them into the WARC files.

But first, it's important to understand what the storage footprint would be, as feeds/sitemaps are refetched multiple times per day.

Bootstrap topology to add feeds and sitemaps from news sites

DMOZ is a good source to get a large list of news sites for various languages and countries:

  1. define a set of English categories matching the topic
  2. extract the translations to other language category names / paths
  3. extract all URLs listed below all categories (English and other languages)

See get_dmoz_news_links.sh as a first draft; it extracts about 50,000 news site URLs for 50 languages.
The list of extracted URLs is then crawled (following redirects and possibly in-domain links up to a limited depth) and the content is mined for links to feeds and news sitemaps.

Of course, this approach could be adapted to other domains of interest and can be used to bootstrap a focused crawler.

News archive is not available since 2023-10-23 15:36:50

Since 2023-10-23 15:36:50 there have been no new news dataset WARC files listed in https://data.commoncrawl.org/crawl-data/CC-NEWS/2023/10/warc.paths.gz

curl -s -o - https://data.commoncrawl.org/crawl-data/CC-NEWS/2023/10/warc.paths.gz | gzip --decompress | tail -n 1
crawl-data/CC-NEWS/2023/10/CC-NEWS-20231023153650-02160.warc.gz

Could you please help? Did something bad happen, as last time, or did I miss an announcement?

Thanks in advance!

Crawl-delay in robots.txt should not shrink delay configured by fetcher.server.delay

The news crawler is configured to be polite, with a guaranteed fetch delay of a few seconds. However, some robots.txt rules define a crawl-delay below one second, which then overrides the configured delay. The crawler-commons robots.txt parser would allow a delay of even only 1 ms; in practice I've seen a crawl-delay of 200 ms. To keep control, the longer configured delay should take precedence.

Note: Yandex' robots.txt specs allow fraction numbers for crawl-delay. Examples: bin.ua, vladnews.ru, gov.uk.
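
A minimal sketch of the proposed rule (assuming both delays are available in seconds):

from typing import Optional

def effective_fetch_delay(configured_delay_s: float, robots_crawl_delay_s: Optional[float]) -> float:
    """Never let a robots.txt Crawl-delay shorten the configured politeness delay."""
    if robots_crawl_delay_s is None:
        return configured_delay_s
    return max(configured_delay_s, robots_crawl_delay_s)

print(effective_fetch_delay(5.0, 0.2))   # 5.0, the configured delay wins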

Avoid following advertisements in news feeds and sitemaps

See also this discussion on Common Crawl's user group.

Some news sites sell slots in their news feeds and sitemaps and place advertisements there. The crawler follows these links the same way it follows links to news articles. Because of the news sitemap auto-detection feature, thousands of "news" articles from the target site are then possibly crawled.

Potential ways to fight these ads:

  • block following cross-site links, i.e. implement a cross-submission validation
  • disable sitemap auto-detection (of course, this may cause sitemap seeds to be lost if the URL changes)
  • manually adjust URL filters

Improve feed parser robustness

As of today, 350 feeds fail to parse, most of them because the URL does not point to an RSS or Atom feed. However, 80-100 feeds fail with trivial errors which should not break a robust feed parser and which mostly do not affect the extraction of links:

  • (35 feeds) unknown entities &lsquo; or &uacute; etc.
2016-11-22 16:21:14.949 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://rakurs.rovno.ua/news.rss: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 282:
  The entity "lsquo" was referenced, but not declared.
2016-11-22 16:18:18.177 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.diariolaestrella.com/150/index.rss: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 17:
  The entity "uacute" was referenced, but not declared.
2016-11-22 16:19:35.721 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.iltalehti.fi/rss/rss.xml: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 66:
  The entity "euro" was referenced, but not declared.
  • (20 feeds) single ampersands
2016-11-22 16:18:07.643 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.northerniowan.com/feed/atom/: com.rometools.rome.io.ParsingFeedException: Invalid XML:
  Error on line 84: The entity name must immediately follow the '&' in the entity reference.
  • RSS extensions
2016-11-22 18:20:14.535 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.amurpravda.ru/rss/news.rss: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 20:
  The prefix "yandex" for element "yandex:full-text" is not bound.
  • leading newlines / white space / BOMs
2016-11-22 16:20:12.279 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.pixelmonsters.de/feed/gamenews.rss: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 2:
  The processing instruction target matching "[xX][mM][lL]" is not allowed.
  • NPEs (!)
2016-11-22 16:18:53.325 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://chestertontribune.com/rss.xml: java.lang.NullPointerException
2016-11-22 16:18:21.004 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://atv.at/atom.xml: java.lang.NullPointerException
2016-11-22 16:27:35.163 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing https://antarcticsun.usap.gov/resources/xml/antsun-continent.xml: java.lang.NullPointerException
  • encoding issues
2016-11-22 16:20:33.593 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://newamericamedia.org/atom.xml: com.rometools.rome.io.ParsingFeedException:
  Invalid XML: Error on line 534: Invalid byte 2 of 3-byte UTF-8 sequence.

This issue is used as an umbrella to track existing feed parser problems and address them step by step.
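
A minimal Python sketch of the kind of pre-cleaning that would let most of the trivially broken feeds above parse; it is an illustration only, not how FeedParserBolt works today, and the entity handling is deliberately simplistic:

import html
import re

XML_ENTITIES = {"amp", "lt", "gt", "quot", "apos"}

def preclean_feed(raw: bytes) -> bytes:
    """Best-effort cleanup of a broken feed before handing it to an XML parser."""
    text = raw.decode("utf-8", errors="replace")       # tolerate invalid byte sequences
    text = text.lstrip("\ufeff \r\n\t")                # strip BOM and leading whitespace
    # resolve named HTML entities (&lsquo; &uacute; &euro; ...) unknown to XML
    def repl(m):
        return m.group(0) if m.group(1) in XML_ENTITIES else html.unescape(m.group(0))
    text = re.sub(r"&([a-zA-Z][a-zA-Z0-9]*);", repl, text)
    # escape bare ampersands that are not part of an entity reference
    text = re.sub(r"&(?![a-zA-Z][a-zA-Z0-9]*;|#\d+;|#x[0-9a-fA-F]+;)", "&amp;", text)
    return text.encode("utf-8")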

News archive is not available since 06.06.2021

There seems to be only one file available for 2021-06-06 and nothing since then. Are there any changes related to the news dataset?

$ aws s3 ls --no-sign-request commoncrawl/crawl-data/CC-NEWS/2021/06/
2021-06-01 06:05:03 1072694208 CC-NEWS-20210601011537-00178.warc.gz
2021-06-01 08:05:03 1072700698 CC-NEWS-20210601032956-00179.warc.gz
...
2021-06-05 21:05:03 1072700332 CC-NEWS-20210605162324-00264.warc.gz
2021-06-05 22:05:03 1072724264 CC-NEWS-20210605180523-00265.warc.gz
2021-06-06 17:05:03 1072722205 CC-NEWS-20210605195038-00266.warc.gz

Check cross-submits for sitemaps

Sitemaps are automatically detected in the robots.txt but not checked for cross-submits. From time to time this leads to spam-like injections of URLs not matching the news genre. Recently, via one of its periodicals, a publishing company "injected" its entire publishing program, including landing pages for books and other media. This has also happened with real estate ads before.
Note that the sitemaps must follow the news sitemap format, which is a barrier for most cross-submits, but not always.

Allow to follow news sites not providing RSS/Atom feed or news sitemap

The news crawler (as of now) relies exclusively on RSS/Atom feeds and news sitemaps to find links to news articles. However, some news sites do not provide feeds or sitemaps. In order to follow these news sites, the crawler should be able to monitor HTML pages manually marked as seeds and extract links from them (see the sketch after the list below):

  • add a parser class to the topology which
    • exclusively parses URLs marked as verified HTML seeds (eg. by a metadata key isHtmlSeed)
    • extracts links from the HTML and sends them to the status index as DISCOVERED
    • (optionally) outlinks are filtered: same host or domain, configurable URL patterns stored in status index for the HTML seed
  • the (adaptive) scheduler must be configured to schedule the refetch of HTML seeds
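
A minimal Python sketch of the link extraction step (standard library only; the same-host restriction corresponds to the optional outlink filter above, everything else is an assumption):

from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class SeedLinkExtractor(HTMLParser):
    """Collect same-host outlinks of a page marked as an HTML seed (e.g. via isHtmlSeed)."""

    def __init__(self, base_url):
        super().__init__()
        self.base = base_url
        self.host = urlsplit(base_url).hostname
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.base, href)
        if urlsplit(url).hostname == self.host:   # same-host filter, as proposed above
            self.links.add(url)                   # these would be emitted as DISCOVERED

# usage: parser = SeedLinkExtractor(seed_url); parser.feed(html_text); print(parser.links)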

Large seed URL offer for CC News Dataset

Hi @jnioche @sebastian-nagel

I'm a researcher in multilingual natural language processing at the University of Pennsylvania. I have a dataset of about 45,000 news seed URLs and their corresponding locales and main language of publication.

I'd love to see these added to the Common Crawl News Dataset seed URLs to improve multilingual coverage.
We have full rights to release the info, and I'd be happy to help in any way I can. Please respond to this if you're interested.

-John
