commoncrawl / news-crawl
News crawling with StormCrawler - stores content as WARC
License: Apache License 2.0
Please tell me how large the dataset is. Thanks.
If a news feed uses the sitemaps namespace it is erroneously detected as a sitemap, which causes it to be processed as a sitemap (without being properly parsed) and not as a feed. One example feed:
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" media="screen" href="/~d/styles/rss2full.xsl"?><?xml-stylesheet type="text/css" media="screen" href="http://feeds.drudge.com/~d/styles/itemcontent.css"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sitemap="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:wordzilla="http://www.cadenhead.org/workbench/wordzilla/namespace" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">
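A minimal sketch, assuming detection could be based on the XML root element rather than on the mere presence of the sitemaps namespace (which the feed above declares without being a sitemap); this is a Python illustration, not the crawler's actual detection code:

import xml.etree.ElementTree as ET

def doc_kind(xml_bytes: bytes) -> str:
    # Parse the document and look at the root element, ignoring any
    # "{namespace}" prefix on the tag name.
    root = ET.fromstring(xml_bytes)
    tag = root.tag.split("}")[-1].lower()
    if tag in ("rss", "feed"):             # RSS 2.0 or Atom
        return "feed"
    if tag in ("urlset", "sitemapindex"):  # sitemap or sitemap index
        return "sitemap"
    return "unknown"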
There seems to be only one file available for 2021-06-06 and nothing since then. Are there any changes related to the news dataset?
$ aws s3 ls --no-sign-request commoncrawl/crawl-data/CC-NEWS/2021/06/
2021-06-01 06:05:03 1072694208 CC-NEWS-20210601011537-00178.warc.gz
2021-06-01 08:05:03 1072700698 CC-NEWS-20210601032956-00179.warc.gz
...
2021-06-05 21:05:03 1072700332 CC-NEWS-20210605162324-00264.warc.gz
2021-06-05 22:05:03 1072724264 CC-NEWS-20210605180523-00265.warc.gz
2021-06-06 17:05:03 1072722205 CC-NEWS-20210605195038-00266.warc.gz
The CC-NEWS WARC files contain the literal values of the HTTP header fields Content-Encoding, Transfer-Encoding and Content-Length although the payload is stored unchunked and uncompressed.
Content-Encoding and Transfer-Encoding should be masked by a prefix (the CC-MAIN WARC files use X-Crawler-).
Content-Length is wrong because of the change of the Content-Encoding; the original HTTP header should be masked and the correct value should be given in the Content-Length header.
Thanks, @wumpus for detecting this!
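A minimal sketch (an assumed helper, not the crawler's actual WARC writer) of what the masking could look like: prefix the affected headers with X-Crawler- and record the Content-Length of the stored, unchunked and uncompressed payload:

def mask_headers(headers: dict, payload: bytes) -> dict:
    masked = {}
    for name, value in headers.items():
        # These headers no longer describe the stored payload, so keep the
        # original values only under the X-Crawler- prefix.
        if name.lower() in ("content-encoding", "transfer-encoding", "content-length"):
            masked["X-Crawler-" + name] = value
        else:
            masked[name] = value
    # The correct length of the payload as it is actually stored.
    masked["Content-Length"] = str(len(payload))
    return masked

print(mask_headers(
    {"Content-Type": "text/html", "Content-Encoding": "gzip", "Content-Length": "1234"},
    b"<html>decoded payload</html>"))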
DMOZ is a good source to get a large list of news sites for various languages and countries:
See, as a first draft, get_dmoz_news_links.sh which extracts about 50,000 news site URLs for 50 languages.
The list of extracted URLs is then crawled (following redirects and possibly in-domain links up to a limited depth) and the content is mined for news feeds and sitemaps.
Of course, this approach could be adapted to other domains of interest and can be used to bootstrap a focused crawler.
The news crawler uses the domain name to manage fetch queues. The domain name is also used to route URLs to Elasticsearch shards. When a URL is re-fetched, the existing routing key isn't reused; instead the domain name is extracted anew from the host name and used as the routing key. This makes the routing unstable because the domain name extraction is based on the continuously updated public suffix list. If the routing changes, the status record doesn't get updated; instead a second record with the same key is created. Because the nextFetchDate of the original record is still in the past and is never updated, the URL is scheduled for re-fetch again and again.
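A minimal sketch of why the extracted domain name (and thus the routing key) can change when the public suffix list is updated, using the Python tldextract library for illustration (the crawler itself uses a different implementation); the extra_suffixes argument simulates a change to the list:

import tldextract

# With the stock public suffix list, "lv" is the suffix, so the
# registered domain of veselam.la.lv is la.lv.
psl_default = tldextract.TLDExtract(suffix_list_urls=())
print(psl_default("http://veselam.la.lv/feed").registered_domain)   # la.lv

# If "la.lv" were listed as a public suffix, the same host would map to a
# different registered domain and therefore to a different routing key.
psl_changed = tldextract.TLDExtract(suffix_list_urls=(), extra_suffixes=["la.lv"])
print(psl_changed("http://veselam.la.lv/feed").registered_domain)   # veselam.la.lv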
Two examples (the domain name in the updated version is the correct one):
la.lv (la.lv is not a public suffix, so veselam.la.lv is not the domain name):
% ./bin/es_status url http://veselam.la.lv/feed
{
"took" : 85,
"timed_out" : false,
"_shards" : {
"total" : 10,
"successful" : 10,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 16.336418,
"hits" : [
{
"_index" : "status",
"_type" : "status",
"_id" : "0ffe9ec78013060e06b5ee955058ce2d42617af4a4e287660d33661797bacc05",
"_score" : 16.336418,
"_routing" : "veselam.la.lv",
"_source" : {
"url" : "http://veselam.la.lv/feed",
"status" : "ERROR",
"metadata" : {
"error%2Ecause" : [
"maxFetchErrors"
],
"depth" : [
"1"
],
"fetch%2EstatusCode" : [
"404"
],
"hostname" : "veselam.la.lv"
},
"nextFetchDate" : "2019-02-16T13:11:28.794Z"
}
},
{
"_index" : "status",
"_type" : "status",
"_id" : "0ffe9ec78013060e06b5ee955058ce2d42617af4a4e287660d33661797bacc05",
"_score" : 15.227517,
"_routing" : "la.lv",
"_source" : {
"url" : "http://veselam.la.lv/feed",
"status" : "FETCH_ERROR",
"metadata" : {
"error%2Ecause" : [
"maxFetchErrors"
],
"depth" : [
"1"
],
"fetch%2Eerror%2Ecount" : [
"1"
],
"fetch%2EstatusCode" : [
"404"
],
"hostname" : "la.lv"
},
"nextFetchDate" : "2019-02-18T11:39:33.000Z"
}
}
]
}
}
sportmediaset.med (the top-level domain .med, also a public suffix, was introduced in 2016):
% ./bin/es_status url http://www.sportmediaset.med
{
"took" : 68,
"timed_out" : false,
"_shards" : {
"total" : 10,
"successful" : 10,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 16.015972,
"hits" : [
{
"_index" : "status",
"_type" : "status",
"_id" : "313cb24e75fd9ab5f3e6bc4afbd66cb067de1d375c69fe49bfeabdf3df7f7372",
"_score" : 16.015972,
"_routing" : "www.sportmediaset.med",
"_source" : {
"url" : "http://www.sportmediaset.med",
"status" : "ERROR",
"metadata" : {
"error%2Ecause" : [
"maxFetchErrors"
],
"depth" : [
"2"
],
"isSitemap" : [
"false"
],
"isSitemapNews" : [
"false"
],
"hostname" : "www.sportmediaset.med"
},
"nextFetchDate" : "2019-02-09T05:58:20.006Z"
}
},
{
"_index" : "status",
"_type" : "status",
"_id" : "313cb24e75fd9ab5f3e6bc4afbd66cb067de1d375c69fe49bfeabdf3df7f7372",
"_score" : 14.726991,
"_routing" : "sportmediaset.med",
"_source" : {
"url" : "http://www.sportmediaset.med",
"status" : "FETCH_ERROR",
"metadata" : {
"error%2Ecause" : [
"maxFetchErrors"
],
"depth" : [
"2"
],
"isSitemap" : [
"false"
],
"isSitemapNews" : [
"false"
],
"fetch%2Eerror%2Ecount" : [
"1"
],
"hostname" : "sportmediaset.med"
},
"nextFetchDate" : "2019-02-18T11:43:49.000Z"
}
}
]
}
}
Sitemaps are automatically detected in the robots.txt but not checked for cross-submits. From time to time this leads to spam-like injections of URLs that do not match the news genre. Recently, a publishing company "injected", via one of its periodicals, its entire publishing program including landing pages for books and other media. The same happened with real estate ads before.
Note that the sitemaps must follow the news sitemap format, which is a barrier against most cross-submits, but not always.
340 WARC files of the news crawl dataset, from 2020-09-12 until 2020-10-04, have been captured using HTTP/2 after a Java security upgrade which included ALPN and therefore allowed for HTTP/2. The crawler started to use HTTP/2 after an automatic restart.
The mentioned WARC files may cause WARC readers (e.g., jwarc) to fail while parsing the HTTP headers:
GET /2020/09/12/business/brexit-no-deal-uk-economy/index.html HTTP/2
...
HTTP/2 200
To address the issue:
Affected files:
s3://commoncrawl/crawl-data/CC-NEWS/2020/09/CC-NEWS-20200912083952-00000.warc.gz
...
s3://commoncrawl/crawl-data/CC-NEWS/2020/10/CC-NEWS-20201004110027-00339.warc.gz
More than 80% of the records are captured using HTTP/2.
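One possible workaround on the reader side, shown as a hedged Python sketch (not an official fix): rewrite the HTTP/2 request and status lines to HTTP/1.1 before handing the header block to a strict parser; only the first line of the HTTP block is touched:

import re

def normalize_http2_head(http_head: bytes) -> bytes:
    first, sep, rest = http_head.partition(b"\r\n")
    # "HTTP/2 200"       -> "HTTP/1.1 200"       (status line)
    first = re.sub(rb"^HTTP/2(?=\s|$)", b"HTTP/1.1", first)
    # "GET /path HTTP/2" -> "GET /path HTTP/1.1" (request line)
    first = re.sub(rb"HTTP/2$", b"HTTP/1.1", first)
    return first + sep + rest

print(normalize_http2_head(b"HTTP/2 200\r\nContent-Type: text/html\r\n\r\n"))
print(normalize_http2_head(b"GET /index.html HTTP/2\r\nHost: example.com\r\n\r\n"))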
The news crawler uses only news sitemaps as a "news feed" and ignores "ordinary" sitemaps, not following the URLs listed there. However, the crawler should follow sitemaps listed in a sitemap index and check whether one of them is a news sitemap.
E.g., while https://www.greenwichtime.com/sitemap_news.xml is not itself a news sitemap, it links to a bunch of news sitemaps:
<?xml version="1.0" encoding="UTF-8" ?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://www.greenwichtime.com/sitemap/news/ap.xml</loc>
<lastmod>2018-02-08T03:15:03Z</lastmod>
</sitemap>
...
The news crawler (as of now) relies exclusively on RSS/Atom feeds and news sitemaps to find links to news articles. However, some news sites do not provide feeds or sitemaps. In order to follow these news sites, the crawler should be able to monitor HTML pages manually marked as seeds (e.g., by a metadata flag isHtmlSeed) and extract links from them.
Compare with what we get from https://webrecorder.io/
Read with https://github.com/ikreymer/webarchiveplayer
Try with the warcdump command
It would be great if you could additionally extract the date when an article was published. Currently, this requires parsing the web page and using tools such as newspaper3k to get that information. However, during the crawling process at least some webpages would offer this information, e.g. the time stamp within the RSS feed
<pubDate>Thu, 25 Dec 2014 02:10:00 +0900</pubDate>
or within the sitemap
<news:publication_date>2016-12-09T16:18:48Z</news:publication_date>
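A minimal sketch of parsing the two timestamp formats shown above (RFC 822 dates from RSS <pubDate>, ISO 8601 dates from <news:publication_date>), using only the Python standard library; this is an illustration, not part of the crawler:

from datetime import datetime
from email.utils import parsedate_to_datetime

rss_pubdate = "Thu, 25 Dec 2014 02:10:00 +0900"
sitemap_date = "2016-12-09T16:18:48Z"

print(parsedate_to_datetime(rss_pubdate).isoformat())
# fromisoformat() before Python 3.11 does not accept a trailing "Z",
# so normalize it to an explicit UTC offset first.
print(datetime.fromisoformat(sitemap_date.replace("Z", "+00:00")).isoformat())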
From sitemaps only news sitemaps are accepted as seed source. However, news:publication_date (#18):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<urlset xmlns="http://www.sitemadps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
<url>
<loc>...</loc>
<lastmod>2018-03-18T21:44:50-07:00</lastmod>
<image:image>
<image:loc>...</image:loc>
<image:title>...
Cf. crawler-commons/crawler-commons#162, crawler-commons/crawler-commons#174.
Explore schema.org annotation NewsArticle from CC main crawls or WDC to complete the list of news sites/domains used to look for news feeds and sitemaps. The issue is not to find seed candidates but to select only real news sites.
The WARC files contain duplicate URLs because pages are fetched twice within a short time interval due to apache/incubator-stormcrawler#340.
The WARC standard recommends marking records that have been truncated because of limits on content size or fetch time with a WARC-Truncated field. Add this field and track the reason for the truncation.
I'm a researcher in multilingual natural language processing at the University of Pennsylvania. I have a dataset of about 45,000 news seed URLs and their corresponding locales and main language of publication.
I'd love to see these added to the Common Crawl News Dataset seed URLs to improve multilingual coverage.
We have full rights to release the info, and I'd be happy to help in any way I can. Please respond to this if you're interested.
-John
At present, the refetch schedule for seed feeds is globally 3 hours which is a compromise between
The schedule should adapt to the change frequency within a configurable min and max refetch interval (eg., 10 min. - 2 weeks). Detection of unchanged feeds should be independent of a last-modified time sent together with the server response.
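A minimal sketch of such an adaptive schedule (assumed helper names, not the crawler's actual scheduler): halve the interval when the feed content changed, double it when it did not, clamped to configurable bounds; change detection would compare content checksums rather than trust Last-Modified:

MIN_INTERVAL_MINUTES = 10                # e.g. 10 minutes
MAX_INTERVAL_MINUTES = 2 * 7 * 24 * 60   # e.g. 2 weeks

def next_interval(current_minutes: int, changed: bool) -> int:
    # Shorten the interval for frequently changing feeds, lengthen it for
    # static ones, but never leave the configured bounds.
    candidate = current_minutes // 2 if changed else current_minutes * 2
    return max(MIN_INTERVAL_MINUTES, min(MAX_INTERVAL_MINUTES, candidate))

print(next_interval(180, changed=True))    # 90
print(next_interval(180, changed=False))   # 360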
The WARC file rotation may happen unnecessarily often:
% ls -lh /data/warc/
-rw-r--r-- 1 storm storm 983M Sep 28 07:43 CC-NEWS-20160927074341-00000.warc.gz
-rw-r--r-- 1 storm storm 42M Sep 28 07:59 CC-NEWS-20160928074341-00001.warc.gz
-rw-r--r-- 1 storm storm 240M Sep 28 12:15 CC-NEWS-20160928075927-00002.warc.gz
The file with timestamp 07:59 should be part of the next WARC. This happens if first the time limit applies:
2016-09-28 07:43:41.636 o.a.s.h.b.r.FileSizeRotationPolicy [INFO] Rotating file based on time : started 1474962208980 interval 86400000
2016-09-28 07:43:41.636 o.a.s.h.b.AbstractHdfsBolt [INFO] Rotating output file...
2016-09-28 07:43:41.651 o.a.s.h.b.AbstractHdfsBolt [INFO] Performing 0 file rotation actions.
2016-09-28 07:43:41.651 o.a.s.h.b.AbstractHdfsBolt [INFO] File rotation took 15 ms.
which obviously did not properly reset the file size counter. 15 min. later the file size limit (1 GB) is logged as being hit (983 MB + 42 MB = 1025 MB ~ 1 GB):
2016-09-28 07:59:27.221 o.a.s.h.b.r.FileSizeRotationPolicy [INFO] Rotating file based on size : currentBytesWritten 1073768981 maxBytes 1073741824
2016-09-28 07:59:27.221 o.a.s.h.b.AbstractHdfsBolt [INFO] Rotating output file...
2016-09-28 07:59:27.233 o.a.s.h.b.AbstractHdfsBolt [INFO] Performing 0 file rotation actions.
2016-09-28 07:59:27.233 o.a.s.h.b.AbstractHdfsBolt [INFO] File rotation took 12 ms.
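A minimal sketch (assumed names, not the storm-hdfs/StormCrawler code) of a combined rotation policy that resets both the byte counter and the start time whenever either limit triggers a rotation, which would avoid the small left-over file shown above:

import time

class CombinedRotationPolicy:
    def __init__(self, max_bytes: int, max_seconds: int):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.reset()

    def reset(self) -> None:
        # Must be called after every rotation, regardless of which limit fired.
        self.bytes_written = 0
        self.started = time.monotonic()

    def mark(self, n_bytes: int) -> bool:
        # Record n_bytes written and report whether the file should rotate.
        self.bytes_written += n_bytes
        return (self.bytes_written >= self.max_bytes
                or time.monotonic() - self.started >= self.max_seconds)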
I'm not sure if this is the right place to ask (feel free to direct me elsewhere), but would it be possible to also produce WET files from this library?
Many downstream libraries of CC consume WET files (such as oscar-project/ungoliant), and it would be useful if there were WET files available alongside the WARC files.
Initially, the news crawler was seeded with URLs from news sites from DMOZ, see #8 for the procedure. DMOZ isn't updated anymore, but Wikidata could be a replacement to complete the seed list:
SELECT DISTINCT ?item ?itemLabel ?lang ?url
WHERE
{
?item wdt:P31/wdt:P279* wd:Q11032.
?item wdt:P856 ?url. # with official website
SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,ru,fr,es,it,ja,zh,*" }
OPTIONAL {
?item wdt:P407 ?language.
?language wdt:P220 ?lang.
}
}
LIMIT 50
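A small usage sketch (assuming the Python requests package) that runs the query above against the public Wikidata SPARQL endpoint and prints the news-site URLs:

import requests

query = """
SELECT DISTINCT ?item ?itemLabel ?lang ?url
WHERE
{
  ?item wdt:P31/wdt:P279* wd:Q11032.
  ?item wdt:P856 ?url. # with official website
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,ru,fr,es,it,ja,zh,*" }
  OPTIONAL {
    ?item wdt:P407 ?language.
    ?language wdt:P220 ?lang.
  }
}
LIMIT 50
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "news-seed-example/0.1"},
    timeout=60,
)
for row in resp.json()["results"]["bindings"]:
    print(row["url"]["value"], row.get("lang", {}).get("value", ""))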
Hello, I'm getting this error while building the Docker image:
ADD failed: stat /var/lib/docker/tmp/docker-builder779286778/target/crawler-1.17.jar: no such file or directory
PS: I have also tried the said path but it's also not working.
If the crawl topology died or was killed, the WARC file is not properly closed. This causes an error when decompressing the WARC file: gzip: CC-NEWS-20160926233041-00001.warc.gz: unexpected end of file.
I can obtain a listing for the Common Crawl main dataset by:
https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-09/wet.paths.gz
How can I do this with the Common Crawl News Dataset?
The WARC standard recommends compressing every record independently ("record-at-time"). This feature is required by wayback machines, including http://index.commoncrawl.org/.
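For illustration, a minimal sketch using the Python warcio library (not the crawler's Java WARC writer): with gzip=True, warcio compresses each record as its own gzip member, which is the record-at-time compression the standard recommends:

from warcio.warcwriter import WARCWriter

with open("example.warc.gz", "wb") as out:
    writer = WARCWriter(out, gzip=True)   # one gzip member per record
    info = writer.create_warcinfo_record("example.warc.gz", {"software": "example sketch"})
    writer.write_record(info)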
Since 2023-10-23 15:36:50 there have been no new news dataset WARC files listed in https://data.commoncrawl.org/crawl-data/CC-NEWS/2023/10/warc.paths.gz
curl -s -o - https://data.commoncrawl.org/crawl-data/CC-NEWS/2023/10/warc.paths.gz | gzip --decompress | tail -n 1
crawl-data/CC-NEWS/2023/10/CC-NEWS-20231023153650-02160.warc.gz
Could you please help? Did something bad happen, as last time, or did I miss an announcement?
Thanks in advance!
The news crawler is configured to be polite with a guaranteed fetch delay of a few seconds. However, some robots.txt rules define a crawl-delay below one second which then overrides the configured delay. The crawler-commons robots.txt parser would even allow a delay of only 1 ms; in practice I've seen a crawl-delay of 200 ms. To keep control, the longer configured delay should take precedence.
Note: Yandex' robots.txt specs allow fractional numbers for crawl-delay. Examples: bin.ua, vladnews.ru, gov.uk.
The fetchInterval in metadata is not properly updated for RSS/Atom feeds. News sitemaps do not seem to be affected... (seen with the recent version based on StormCrawler 1.8 / ElasticSearch 6.0)
For the time being, I do not have AWS credentials. This means I'm unable to determine the filenames for the crawl dumps. Could the directory structure be made available by methods other than aws ls?
I might be wrong, but I don't think aws ls can be run anonymously.
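Anonymous listing does work if the request is left unsigned (the aws CLI supports this via --no-sign-request, as shown further above); a hedged Python sketch using boto3 with an unsigned client:

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="commoncrawl",
                          Prefix="crawl-data/CC-NEWS/2021/06/")
# Note: list_objects_v2 returns at most 1000 keys per call; a real listing
# would paginate via the ContinuationToken.
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])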
Storm-crawler is now based on Storm 1.2.1 and Elasticsearch 6.1.1, news-crawl should also be upgraded.
As a cronjob; using s3cmd?
Hi, I've been trying to run the Docker container non-interactively using the following command
docker run -d \
-p 127.0.0.1:9200:9200 -p 5601:5601 -p 8080:8080 \
-v .../data/warc:/data/warc \
-v .../data/elasticsearch:/data/elasticsearch \
-t newscrawler:1.18 /home/ubuntu/news-crawler/bin/run-crawler.sh
the complete logs can be found here
When running it interactively I have no problem. Any idea what the problem is?
The URL filters should reject localhost and private address spaces. The crawler may detect links pointing to a private network address, e.g.
2017-12-23 08:37:22.104 c.d.s.b.FetcherBolt FetcherThread #54 [ERROR] Exception while fetching http://localhost/wordpress/2017/.../
org.apache.http.conn.HttpHostConnectException: Connect to localhost:80 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
This example looks more like an error on the remote page. But the crawler should never even try to access pages from localhost or a private network, to avoid leaking information into the WARC files. That could be, e.g., a link to the Storm web interface (http://localhost:8080/) exposing the cluster configuration.
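A minimal sketch of such a filter (plain Python for illustration, not the StormCrawler URL filter API): reject URLs whose host resolves to a loopback, private or link-local address:

import ipaddress
import socket
from urllib.parse import urlparse

def is_private_target(url: str) -> bool:
    host = urlparse(url).hostname
    if host is None or host == "localhost":
        return True
    try:
        # Literal IPs resolve to themselves; hostnames go through DNS.
        addr = ipaddress.ip_address(socket.gethostbyname(host))
    except (socket.gaierror, ValueError):
        return False   # unresolvable here; let other filters decide
    return addr.is_loopback or addr.is_private or addr.is_link_local

print(is_private_target("http://localhost/wordpress/2017/"))   # True
print(is_private_target("http://192.168.0.1/admin"))           # True
print(is_private_target("http://commoncrawl.org/"))            # False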
Basically I am trying to iterate over the records of a news WARC file to get the HTML content and process it. I am using the Python warc package.
Snippet to read the WARC file:
import warc

f = warc.open("CC-NEWS-20161001224340-00008.warc")
for record in f:
    if record['Content-Type'] == 'application/http; msgtype=response':
        payload = record.payload.read()
        headers, body = payload.split('\r\n\r\n', 1)
        if 'Content-Type: text/html' in headers:
            pass  # do my processing with the html content (body)
But when I run this I am getting this error:
Traceback (most recent call last):
warc_process.py", line 69, in
read_entire_warc("CC-NEWS-20160926211809-00000.warc")
File "warc_process.py", line 54, in read_entire_warc
for record in f:
File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 393, in iter
record = self.read_record()
File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 364, in read_record
self.finish_reading_current_record()
File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 360, in finish_reading_current_record
self.expect(self.current_payload.fileobj, "\r\n")
File "anaconda2/lib/python2.7/site-packages/warc/warc.py", line 352, in expect
raise IOError(message)
IOError: Expected '\r\n', found 'WARC/1.0\r\n'
Sample WARC files facing issues with:
CC-NEWS-20160926211809-00000.warc
CC-NEWS-20161001122244-00007.warc.gz
CC-NEWS-20161001224340-00008.warc.gz
CC-NEWS-20161002224346-00009.warc.gz
CC-NEWS-20161003130443-00010.warc.gz
CC-NEWS-20161004130444-00011.warc.gz
CC-NEWS-20161005130450-00012.warc.gz
CC-NEWS-20161005152607-00013.warc.gz
CC-NEWS-20161006152607-00014.warc.gz
CC-NEWS-20161006191324-00015.warc.gz
CC-NEWS-20161007191326-00016.warc.gz
CC-NEWS-20161008015559-00017.warc.gz
CC-NEWS-20161009015614-00018.warc.gz
CC-NEWS-20161010001731-00019.warc.gz
See also this discussion on Common Crawl's user group.
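For what it's worth, a hedged alternative sketch using the warcio library, which tends to be more tolerant of the header issues described above than the old warc package; this is a suggestion, not the approach from the original post:

from warcio.archiveiterator import ArchiveIterator

with open("CC-NEWS-20161001224340-00008.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            content_type = record.http_headers.get_header("Content-Type") or ""
            if "text/html" in content_type:
                body = record.content_stream().read()
                # process the HTML content (body) here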
Some news sites sell slots in their news feeds and sitemaps and put advertisements there. The crawler follows these links the same way as it follows links to news articles. Because of a news sitemap auto-detection feature, thousands of "news" articles
from the target site are then possibly crawled.
Potential ways to fight these ads:
Hello there,
I just discovered your news-crawler
and I think this is an amazing idea!
Sorry if this is a very simple question, but is it possible to somehow download slices of the news-crawl data (possibly filtered by a keyword/regex/domain) without resorting to Amazon AWS?
The idea is that I already have a very large cluster at my disposal, so I would rather work with the raw data directly on my local cluster.
What do you think?
Thank you for your help
Certain URLs following a quite specific pattern are continuously re-fetched; here are the counts from a couple of hours of log files:
669 http://www.tvl.be/,NLD,Leuven
668 http://www.ltv.ly/,ARA,National
667 http://www.rajdhani.com.np/,NEP,National
665 http://ukrainian.voanews.com/,UKR,Nationwide
665 http://www.ura-inform.com/,RUS,Nationwide
662 http://lariviera.netweek.it/,ITA,Sanremo
458 http://www.topix.com/world/burkina-faso/,ENG,Foreign
The status index entries for these URLs seem correct, e.g.:
"_source" : {
"url" : "http://www.topix.com/world/burkina-faso/,ENG,Foreign",
"status" : "FETCHED",
"metadata" : {
"fetch%2EstatusCode" : [ "200" ]
},
"hostname" : "topix.com",
"nextFetchDate" : "2027-01-31T21:18:49.092Z"
}
or
"_source" : {
"url" : "http://www.ura-inform.com/,RUS,Nationwide",
"status" : "REDIRECTION",
"metadata" : {
"_redirTo" : [ "http://ura-inform.com" ],
"fetch%2EstatusCode" : [ "301" ]
},
"hostname" : "ura-inform.com",
"nextFetchDate" : "2027-01-31T21:22:09.213Z"
}
The request records in the CC-NEWS WARC files lack the HTTP protocol version:
GET /path
instead of
GET /path HTTP/1.1
This makes some WARC parsers fail to process the WARC files, see https://groups.google.com/d/msg/common-crawl/hsb90GHq6to/Lv-9-nHAAQAJ.
NewsSiteMapParserBolt fails to parse some valid XML sitemaps, e.g.,
2018-03-09 18:14:13.924 o.c.s.n.NewsSiteMapParserBolt Thread-30-sitemap-executor[10 11] [INFO] http://www.pjstar.com/section/google-news-sitemap detected as news sitemap based on content
2018-03-09 18:14:13.924 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [<?xml version="1.0" encoding="UTF-8"?>]
2018-03-09 18:14:13.924 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"]
2018-03-09 18:14:13.926 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [ xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [ xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [ <url>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [ <loc>http://www.pjstar.com/news/20180309/chosen-family-portrait-group-that-needed-each-other</loc>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [ <news:news>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [ <news:news>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [ <news:news>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [ <news:publication>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [ <news:name>Peoria Journal Star</news:name>]
2018-03-09 18:14:13.927 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [ <news:language>en</news:language>]
2018-03-09 18:14:13.928 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [ </news:publication>]
2018-03-09 18:14:13.928 c.s.SiteMapParser Thread-30-sitemap-executor[10 11] [WARN] Bad url: [ <news:publication_date>2018-03-09</news:publication_date>]
For this sitemap the server responds with Content-Type: text/html; charset=ISO-8859-1, which apparently prevents it from even being tried as XML.
I have used the following GitHub repository: https://github.com/commoncrawl/news-crawl
It uses the following versions of the required libraries:
Elasticsearch 7.5.0
Apache Storm 1.2.3
StormCrawler 1.16
Maven 3.6.2
I have followed the steps given in the README but localhost:9200 is not showing any hits.
The configuration command runs successfully but it shows some FETCH_ERROR errors.
No content or URL is being shown.
If a news site creates sitemaps with unique URLs on a daily basis (or even at shorter intervals), over time this leads to too many sitemaps being checked for updates, causing news articles to get stuck in queues jammed with sitemaps. The unique sitemap URLs can stem from the robots.txt or from a sitemap index. Typical URL/file patterns of ephemeral sitemaps include dates, counters or random tokens, for example:
.../sitemap.xml?yyyy=2020&mm=02&dd=07
.../sitemap-2017.xml?mm=12&dd=31
.../sitemap-2019-04.xml
.../sitemap?type=clanky-2019_9
.../sitemap-201910.xml
.../sitemap-news.xml?y=2018&m=03&d=19
.../02-Sep-2019.xml
.../articles_2019_06.xml
.../sitemap_30-Nov-2019.xml
.../sitemap_bydate.xml?startTime=2020-02-16T00:00:00&endTime=2020-02-22T23:59:59
.../sitemap.xml?page=1424
.../ymox96xuveov.xml
/sitemaps/1151jawjodn3t.xml
.../2019-05-13-0058/0817_8.xml
.../world.xml?section_id=338&content_type=1&year=2017&month=9
.../2019-12-22/sitemap.xml?page=1409
In the worst case there can be 100k or even millions of sitemaps tracked for a single domain, which requires manually blocking or cleaning up the list of sitemaps in order to be able to fetch news articles and follow the recent sitemaps.
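A minimal sketch (assumed regular expressions, not the crawler's actual configuration) of flagging date-stamped or paginated sitemap URLs like the examples above, so they can be expired or blocked:

import re

EPHEMERAL_SITEMAP = re.compile(
    r"sitemap[^/]*("
    r"[?&](yyyy|y|year)=\d{4}"          # .../sitemap.xml?yyyy=2020&mm=02&dd=07
    r"|[-_]\d{4}([-_]?\d{2}){0,2}"      # .../sitemap-2019-04.xml, .../sitemap-201910.xml
    r"|[?&]page=\d+"                    # .../sitemap.xml?page=1424
    r")",
    re.IGNORECASE,
)

for url in ("https://example.com/sitemap.xml?yyyy=2020&mm=02&dd=07",
            "https://example.com/sitemap-2019-04.xml",
            "https://example.com/sitemap_news.xml"):
    print(url, bool(EPHEMERAL_SITEMAP.search(url)))   # True, True, False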
All manually collected seeds (feeds or future seed formats, e.g., news sitemaps) should be refetched from time to time (weekly or monthly) even if they are redirected or failed to fetch:
Ideally, the fetch schedule should be configurable for a combination of metadata and fetch status.
Hi friends,
I am using news-crawl for academic research but am unable to set it up on my Mac. I encountered a build error and would really appreciate some help!
Environment:
MacOS Monterey 12.2.1, Apple M1 Pro
javac 20.0.2
openjdk 20.0.2 2023-07-18 : OpenJDK Runtime Environment (build 20.0.2+9-78); OpenJDK 64-Bit Server VM (build 20.0.2+9-78, mixed mode, sharing)
Apache Maven 3.9.3 (21122926829f1ead511c958d89bd2f672198ae9f)
Apache Storm 2.4.0
elasticsearch 8.8.2
On bash shell
news-crawler commit 4194f9c
Steps to reproduce:
Expected result
Actual result:
I am new to Java development. Please let me know if I can provide additional context! Thanks
The news feeds and sitemaps can be useful by themselves, the feeds more than the sitemaps because they include news titles and short snippets. It might make sense to put them into the WARC files as well.
But first, it's important to understand what the storage footprint would be, as feeds and sitemaps are refetched multiple times per day.
The crawler should add the remote target IP address as the field "WARC-IP-Address" to CC-NEWS response records. Thanks, @wumpus for detecting this!
As of today, 350 feeds fail to parse, most of them because the URL does not point to an RSS or Atom feed. However, 80-100 feeds fail with trivial errors which should not break a robust feed parser and which mostly do not affect the extraction of links, e.g., undeclared entities such as ‘ (&lsquo;) or ú (&uacute;):
2016-11-22 16:21:14.949 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://rakurs.rovno.ua/news.rss: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 282:
The entity "lsquo" was referenced, but not declared.
2016-11-22 16:18:18.177 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.diariolaestrella.com/150/index.rss: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 17:
The entity "uacute" was referenced, but not declared.
2016-11-22 16:19:35.721 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.iltalehti.fi/rss/rss.xml: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 66:
The entity "euro" was referenced, but not declared.
2016-11-22 16:18:07.643 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.northerniowan.com/feed/atom/: com.rometools.rome.io.ParsingFeedException: Invalid XML:
Error on line 84: The entity name must immediately follow the '&' in the entity reference.
2016-11-22 18:20:14.535 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.amurpravda.ru/rss/news.rss: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 20:
The prefix "yandex" for element "yandex:full-text" is not bound.
2016-11-22 16:20:12.279 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://www.pixelmonsters.de/feed/gamenews.rss: com.rometools.rome.io.ParsingFeedException: Invalid XML: Error on line 2:
The processing instruction target matching "[xX][mM][lL]" is not allowed.
2016-11-22 16:18:53.325 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://chestertontribune.com/rss.xml: java.lang.NullPointerException
2016-11-22 16:18:21.004 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://atv.at/atom.xml: java.lang.NullPointerException
2016-11-22 16:27:35.163 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing https://antarcticsun.usap.gov/resources/xml/antsun-continent.xml: java.lang.NullPointerException
2016-11-22 16:20:33.593 c.d.s.b.FeedParserBolt [ERROR] Exception while parsing http://newamericamedia.org/atom.xml: com.rometools.rome.io.ParsingFeedException:
Invalid XML: Error on line 534: Invalid byte 2 of 3-byte UTF-8 sequence.
This issue is used as umbrella to track existing feed parser problems and address them step by step:
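As a point of comparison, a hedged sketch using the Python feedparser library (not the crawler's ROME-based parser): lenient parsers can still extract links from feeds with undeclared entities, and report such problems via a flag instead of failing outright:

import feedparser

d = feedparser.parse("http://rakurs.rovno.ua/news.rss")
if d.bozo:
    # The feed is malformed, but entries may still have been extracted.
    print("parsed despite error:", d.bozo_exception)
for entry in d.entries:
    print(entry.get("link"), entry.get("published"))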
There appear to regularly be thousands of duplicate articles from this domain, always with identical initial paths but ending with different slugs.
For example, I have noticed 2162 entries starting with https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863 that all appear to have identical content (the article title is "El Senado de EEUU aprueba la legalidad del 'impeachment'" for all of these URLs). Here are some example URLs:
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/111m-para-crear-una-red-de-areas-de-descanso-para-camiones
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/120-contagiados-y-un-fallecido-balance-covid-de-la-jornada
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/124-castellanos-y-leoneses-han-recibido-ya-la-segunda-dosis
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/13-enmiendas-de-xav-por-14-millones-a-las-cuentas-regionales
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/14-positivos-en-el-cribado-de-la-zona-de-madrigal
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/160-contagios-mas-en-avila-en-un-dia-sin-fallecidos-covid
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/16-positivos-en-el-primer-dia-de-cribado-en-cebreros
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/170-nuevos-casos-covid-y-un-fallecido-en-el-hospital
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/172-positivos-covid-mas-y-casi-medio-centenar-de-ingresados
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/180-casos-y-un-fallecido-por-covid-balance-del-dia
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/184-detenidos-en-la-tercera-noche-de-disturbios-en-paises-bajos
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/200-incidencias-en-la-red-de-abastecimiento-por-el-temporal
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2020-deja-la-menor-cifra-de-empleados-publicos-del-decenio
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2020-dejo-la-menor-cifra-de-muertos-en-carreteras-abulenses
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2020-dejo-una-caida-minima-en-la-afiliacion-de-extranjeros
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2020-un-buen-ano-para-el-cerro-gallinero
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2145-vacunas-contra-la-covid-salen-de-avila
https://www.diariodeavila.es/noticia/Z59BAEC38-ADF2-2475-07830E240D31D863/202101/2186-nuevos-casos-la-cifra-mas-alta-desde-noviembre
These were found in https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2021/02/CC-NEWS-20210210002910-00147.warc.gz
The problem appears to have started on the 1st Feb 2021 with the volume of pages from this site rising from ~50 per day to ~12000 per day.
Rotation policies should be combinable so that we rotate the file whenever we reach a given size and/or time since opening it.