
ukwa-heritrix's Introduction

UKWA Heritrix

This repository takes Heritrix3 and adds code and configuration specific to the UK Web Archive. It is used to build the Docker image that runs our crawls.

Local Development

If you are modifying the Java code and want to compile it and run the unit tests, you can use:

$ mvn clean install
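
If you only want to build without running the unit tests, the standard Maven flag can be used:

$ mvn -DskipTests clean install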

However, as the crawler is a multi-component system, you'll also want to run integration tests.

Continuous Integration Testing

All tags, pushes and pull-requests on the main ukwa-heritrix repository will run integration testing before pushing an updated Docker container image. See the workflows here.

However, it is recommended that you understand and run the integration tests locally first.

Local Integration Testing

The supplied Docker Compose file can be used for local testing. This looks quite complex because the system spins up many services, including ones that are only needed for testing:

  • The main crawler, and associated services:
    • ClamD for virus scanning,
    • WebRender API and Warcprox for browser-based crawler integration.
    • An OutbackCDX server for recording crawled URLs with timestamps and checksums for deduplication.
    • An Apache Kafka topic/log server, and its associated Zookeeper instance.
  • Two test websites (acid.matkelly.com and crawl-test-site.webarchive.org.uk) for running test crawls without touching the live web.
  • A local Wayback service for inspecting the results:
    • this talks to the crawler's CDX server,
    • and is assisted by a warc-server container that makes the crawled WARCs available.
  • A robot container that uses the Python Robot Framework to run some integration tests.

Docker Compose ensemble visualisation

IMPORTANT there is a .env file that docker-compose.yml uses to pick up shared variables. This includes the UID used to run the services, which should be overridden with whatever UID you develop under, e.g.

$ export CRAWL_UID=$(id -u)

There's a little helper script to do this, which you can run like this before running Docker operations:

$ source source-setup-crawl-uid.sh
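
The script's contents are not shown here, but given the above it presumably amounts to little more than the following sketch (the real script may do more):

# Hypothetical sketch of a CRAWL_UID helper - check the actual script in this repository
export CRAWL_UID=$(id -u)
echo "CRAWL_UID set to ${CRAWL_UID}"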

To run the tests locally, build the images:

$ docker-compose build

This builds the heritrix and robot images.

Note that the Compose file is set up to pass the HTTP_PROXY and HTTPS_PROXY environment variables through to the build environment, so as long as those are set, it should build behind a corporate web proxy. If you are not behind a proxy and these variables are not set, docker-compose will warn that they are unset, but the build should still work.

To run the integration tests:

$ docker-compose up
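
If you prefer to run the stack detached and just follow the output of the test runner, something like this should work (robot is the service name used by the Compose file described above):

$ docker-compose up -d
$ docker-compose logs -f robot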

Alternatively, to launch the crawler for manual testing, list the required services explicitly, e.g. (naming heritrix, warcprox and webrender ensures we see the logs from those three containers):

$ docker-compose up heritrix warcprox webrender

and use a secondary terminal to, for example, launch crawls (see Manual testing below). Note that ukwa-heritrix is configured to wait a few seconds before auto-launching the frequent crawl job.

After running tests, it's recommended to run:

$ docker-compose rm -f
$ mvn clean

This deletes all the crawl output and state files, thus ensuring that subsequent runs start from a clean slate.

Service Endpoints

Once running, these are the most useful services for experimenting with the crawler itself:

  • Heritrix: https://localhost:8443/ (username/password heritrix/heritrix). The main Heritrix crawler control interface.
  • Kafka UI: http://localhost:9000/. A browser UI that lets you look at the Kafka topics.
  • Crawl CDX: http://localhost:9090/. An instance of OutbackCDX used to record crawl outcomes for analysis and deduplication; it can be used to look up what happened to a URL during the crawl.
  • Wayback: http://localhost:8080/. An instance of OpenWayback that allows you to play back the pages that have been crawled; it uses the Crawl CDX to look up which WARCs hold the required URLs.
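
For example, to look up a URL in the Crawl CDX you can use OutbackCDX's OpenWayback-style query API; the collection name below (fc) is an assumption, so check the local configuration:

$ curl "http://localhost:9090/fc?url=http://crawl-test-site.webarchive.org.uk/"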

Note that the Heritrix REST API documentation contains some useful examples of how to interact with Heritrix using curl.
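
For example, a sketch based on the standard Heritrix 3 REST API pattern (the frequent job name and the heritrix/heritrix credentials are taken from this README; -k accepts the self-signed certificate):

$ curl -k -u heritrix:heritrix --anyauth --location \
       -H "Accept: application/xml" \
       https://localhost:8443/engine/job/frequent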

There are a lot of other services, but these are largely intended for checking or debugging:

  • Heritrix (JMX): localhost:9101. Java JMX service used to access internal state for monitoring the Kafka client. (DEPRECATED)
  • Heritrix (Prometheus): http://localhost:9119/. Crawler bean used to collect crawler metrics and publish them for Prometheus.
  • More TBA.

Manual testing

The separate crawl-streams utilities can be used to interact with the logs/streams that feed URLs into the crawl, and document the URLs found and processed by the crawler. To start crawling the two test sites, we use:

$ docker run --net host ukwa/crawl-streams submit -k localhost:9092 fc.tocrawl -S http://acid.matkelly.com/
$ docker run --net host ukwa/crawl-streams submit -k localhost:9092 fc.tocrawl -S http://crawl-test-site.webarchive.org.uk/

Note that the --net host part means the Docker container can talk to your development machine directly as localhost, which is the easiest way to reach your Kafka instance.

The other thing to note is the -S flag: this indicates that these URLs are seeds, which means that when the crawler picks them up, it will widen the scope of the crawl to include any URLs that are on those sites (strictly, those URLs that have the seed URL as a prefix when expressed in SURT form). Without the -S flag, submitted URLs will be ignored unless they are within the current crawler scope.
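
For illustration (a rough sketch only, as the exact prefix depends on Heritrix's SURT prefix settings), the SURT form reverses the host name components, so a seed maps to a prefix roughly like this:

# seed:                     http://crawl-test-site.webarchive.org.uk/
# SURT prefix (roughly):    http://(uk,org,webarchive,crawl-test-site,
# in scope:                 http://crawl-test-site.webarchive.org.uk/about/
# out of scope:             http://acid.matkelly.com/ (unless submitted as its own seed)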

Note, however, that some extra URLs may be discovered during processing that are necessary for in-scope URLs to work (e.g. images, CSS, JavaScript etc.). The crawler is configured to fetch these even if they are outside the main crawl scope, i.e. the crawl scope is intended to match the HTML pages that are of interest, and any further resources required by those pages will be fetched if the crawler determines they are needed.

Directly interacting with Kafka

It's also possible to interact directly with Kafka by installing and using the standard Kafka tools. This is not recommended at present, but these instructions are left here in case they are helpful:

cat testdata/seed.json | kafka-console-producer --broker-list kafka:9092 --topic fc.tocrawl
kafka-console-consumer --bootstrap-server kafka:9092 --topic fc.tocrawl --from-beginning
kafka-console-consumer --bootstrap-server kafka:9092 --topic fc.crawled --from-beginning
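
The exact message schema is whatever testdata/seed.json contains; purely as an illustration, a hand-rolled message might carry fields like those discussed later in this README (e.g. isSeed), but treat the field names below as assumptions rather than the real schema:

echo '{"url": "http://crawl-test-site.webarchive.org.uk/", "isSeed": true}' | kafka-console-producer --broker-list kafka:9092 --topic fc.tocrawl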

Automated testing

The robot container runs test crawls over the two test sites mentioned in the previous section. The actions and expected results are in the crawl-test-site.robot test specification.

Crawl Configuration

We use Heritrix3 Sheets as a configuration mechanism to allow the crawler behaviour to change based on URL SURT prefix.

Summary of Heritrix3 Modules

Modules for Heritrix 3.4.+

  • AnnotationMatchesListRegexDecideRule: DecideRule for checking against annotations.
  • AsynchronousMQExtractor: publishes messages to an external queue for processing.
  • ClamdScanner: scans content using an external ClamAV daemon.
  • CompressibilityDecideRule: REJECTs highly-compressible (and highly-incompressible) URIs.
  • ConsecutiveFailureDecideRule: REJECTs a URI if both it and its referrer's HTTP status codes are >= 400.
  • CountryCodeAnnotator: adds a country-code annotation to each URI where present.
  • ExternalGeoLookup: implementation of ExternalGeoLookupInterface for use with a ExternalGeoLocationDecideRule; uses MaxMind's GeoLite2 database.
  • ExtractorJson: extracts URIs from JSON-formatted data.
  • ExtractorPattern: extracts URIs based on regular expressions (written explicitly for one site; not widely used).
  • HashingCrawlMapper: intended as a simpler version of the HashCrawlMapper using the Hashing libraries.
  • IpAnnotator: annotates each URI with its IP address.
  • ViralContentProcessor: passes incoming URIs to ClamAV.
  • WARCViralWriterProcessor, XorInputStream: workarounds for force-writing of 'conversion' records based on XORed version of the original data.
  • RobotsTxtSitemapExtractor: Extracts and enqueues sitemap links from robots.txt files.
  • WrenderProcessor: Runs pages through a web-rendering web service rather than the usual H3 processing.

Release Process

We only need tagged builds, so

mvn release:clean release:prepare

is sufficient to tag a version and initiate a Docker container build. Note that the SCM/git tag should be of the form X.Y.Z.

Redis Notes

Some experimental code uses a Redis back end. In principle this should work with multiple Redis-compatible implementations, but there are subtleties around transactions, distribution, and syntax.

e.g. KvRocks is great, but does not support things like ZADD with the LT option. The LT option was only added in Redis 6.2, so it does not have wide support elsewhere. Consider using two operations instead.
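
For example, the same 'only lower the score' behaviour can be approximated with a read followed by a conditional write; a sketch using a small Lua script via redis-cli (the key, score and member names are purely illustrative):

redis-cli EVAL "
  local cur = redis.call('ZSCORE', KEYS[1], ARGV[2])
  if (not cur) or tonumber(ARGV[1]) < tonumber(cur) then
    return redis.call('ZADD', KEYS[1], ARGV[1], ARGV[2])
  end
  return 0
" 1 crawl:schedule 1650000000 example.org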

Changes

  • 2.7.11:
    • Based on Heritrix 3.4.0-20210621
    • ...
  • 2.7.0-BETA:
    • Update Heritrix3 to version based on BDB-JE 7.
    • Stop using addPersistentDataMapKey because it's been removed from H3.
  • 2.6.10:
    • Allow switch to Bloom filter unique URI approach based on environment variable.
    • Switch OutbackCDX client POST to using HttpClient.
    • Default to not checking system properties for the OutbackCDX HttpClient builder.
    • Ensure HTTPClient response entities get consumed.
    • Add Prometheus metrics for OutbackCDX client requests.
  • 2.6.9:
    • Ensure quota resets get propagated to pre-requisites and redirects, for #50.
  • 2.6.8:
    • Switch back to server quotas as the default.
  • 2.6.7:
    • Avoid collecting source stats as this is causing problems arising from #49.
  • 2.6.6:
    • Ensure SourceTag does not get set to null in RobotsTxtSitemapExtractor (#49)
    • Modify logger to allow some buffering rather than flushing every line.
  • 2.6.5:
    • Realised critical data fields like launchTimestamp were not marked to persist in the disk cache! This was causing crawls to fail under load, when items are swapped out before being swapped back in for the fetch.
  • 2.6.4:
    • Added optional refreshDepth field, which marks the number of hops for which the launchTimestamp will be marked as an inherited field.
  • 2.6.3:
    • Handle the case of emitting in-scope URIs with no source by using self as source.
  • 2.6.2:
    • Rely on Crawler Commons more so it handles site-maps of different formats.
    • Update to commons-io 2.4 across the whole project (to make sure Crawler Commons is happy).
    • Don't assume a Sitemap from robots.txt is definitely a sitemap as there could be redirects. See #44.
    • Handle case where incoming non-seed URIs were not getting logged correctly because the in-scope log builder assumes there is a source CrawlURI.
    • Allow up to 50,000 outlinks per CrawlURI, to cope with the case of large sitemaps.
  • 2.6.1:
    • Modify disposition processor so robots.txt cache does not get trashed when robots.txt get discovered outside of pre-requisites and ruled out-of-scope.
    • Update to OutbackCDX 0.5.1 requirement, taking out hack needed to cope with URLs with * in (see nla/outbackcdx#14)
  • 2.6.0:
    • Sitemap extraction and simple integration with re-crawl mechanism.
  • 2.5.2:
    • Revert to skipping whole disposition chain as robots.txt cache was getting trashed by -5000 responses.
  • 2.5.1:
    • Make number of WARC writers configurable (as 5 was a bottleneck under DC).
  • 2.5.0:
    • Simplify shutdown logic to avoid lock files lingering.
    • Allow over-quota queues to be retired and re-awakened instead of emitting all.
    • Skip to the disposition processor rather than skipping the whole chain, so the recrawl delay gets set correctly.
  • 2.4.15:
    • Fix Spring syntax problem.
  • 2.4.14:
    • Attempt checkpoint during Docker shutdown.
    • Allow WebRender maxTries to be configured.
  • 2.4.13:
    • Make WebRender timeouts configurable, and use WEBRENDER as the default prefix.
    • Emit in-scope URLs into a separate Kafka topic.
  • 2.4.12:
    • Only reset sheets if the sheets are being modified, to allow simple 'refresh' requests.
  • 2.4.11:
    • Copy the sheets so they 'stick' even if we change the targetSheet.
  • 2.4.10:
    • Ensure sheets are unchanged if unspecified.
  • 2.4.9:
    • Refactor and simplify RecentlySeen code and default to ignoring forceFetch. Can override obeyForceFetch, but the behaviour we want is to force the CrawlURI to be accepted into the frontier even if it's already there (so it can be re-prioritised). We don't want forceFetch to override RecentlySeen in that case.
  • 2.4.8:
    • Reset sheet definitions for recrawl delays, as launchTimestamp resolves the issue.
  • 2.4.7:
    • Update to Heritrix 3.4.0-20190418 (and so avoid caching DNS failures for ever)
    • Make startup support none or new-crawl as well as resume-latest.
  • 2.4.6:
    • Ensure critical tasks are handled before skipping the disposition chain.
    • Allow per-URL override of sheet-based launchTimestamp, consistently named.
  • 2.4.5:
    • Switch to generic per-launch sheet setup. i.e. 'target sheets' where any property can be set.
    • Create sheet for every target, set launchTimestamp that way.
    • Switch to hostMaxSuccessKb rather than serverMaxSuccessKb because clearing server quotas was brittle across HTTP/HTTPS.
  • 2.4.4:
    • Add ability to skip the disposition chain if it's a recently-seen URI and hence out of scope.
    • Make WebRender more patient.
    • Renamed Metrics Bean so it's clear it's about Prometheus.
  • 2.4.3:
    • Allow alert count to be monitored via Prometheus.
    • Log known issue with pre-0.5.1 OutbackCDX URL handling when there's an asterisk.
  • 2.4.2:
    • Revert to Heritrix 3.4.0-20190207 rather than trialing 3.4.0-SNAPSHOT.
  • 2.4.1:
    • Switch to clearing host quotas rather than server quotas.
    • Add DOI.org as a known URL shortener (i.e. always resolve via)
    • Restore URL shorteners list.
  • 2.4.0:
    • Allow use of launch timestamp to control re-crawls.
    • Process error outlinks (because WebRendered URIs come back as 'error code' -5002), and make WebRendered items clear.
    • Ensure critical pre-requisites are not blocked by the quotas, because quotas are only cleared when the seeds go through.
  • 2.3.6:
    • Revert to server quota, and avoid messing with frontier groups.
  • 2.3.5:
    • Use and clear host quotas instead, to avoid http/https problem.
  • 2.3.4:
    • Support NOT routing via Kafka properly.
    • Simplify sheet logic layout.
  • 2.3.3:
    • Make operation of quota reset clear.
    • Give H3 time to start up.
  • 2.3.2:
    • Use consistent envvar name for Web Render config.
    • Use newer webrender-api service in docker-compose file.
    • Slashes not allowed in WARC prefix for warcprox. Removed them.
    • Update to Java 8 on Travis
  • 2.3.1:
    • Store WebRender WARCs next to the normal ones -- Didn't work, see 2.3.2.
    • Update build to Java 8 in Maven.
    • Update pre-send scoping logic to match candidates processor.
    • Reducing logging of forgetting URLs.
  • 2.3.0:
    • Switching to a 'forgetful' URI filter, BdbUriUniqFilter plus ForgettingFrontierProcessor.
    • Switch to using an annotation for quota resets.
    • Now running with Puppeteer as an alternative renderer
    • Shift quota-reset into a processor.
    • Ensure sheets are applied if post-filtering.
    • Always get embeds.
  • 2.2.20:
    • Shorten recrawl periods somewhat, to attempt to avoid issues with crawls being skipped.
    • Ensure resetting quotas avoids any possible race condition.
  • 2.2.19:
    • Docker build relies on Maven to handle the H3 version etc.
    • Allow OSSRH Snapshots.
    • Assemble Heritrix from Maven.
    • Experiment with a snapshot build and only core repos.
  • 2.2.18:
    • Fix NPE in WebRender numTries check.
  • 2.2.17:
    • Update to 3.4.0-20190207 H3 build.
    • Make it possible to disable outgoing logs.
    • Shifting to 3.4 H3 release.
  • 2.2.16:
    • Use IA SNAPSHOT binary release.
    • Update WebRenderCount annotation properly.
  • 2.2.15:
    • Switch to using the term WebRender rather than Wrender.
    • Simplify web renderer to allow H3 to handle retries properly.
    • Add an explicit client id to the Kafka client.
    • Name Prometheus metric consistently.
  • 2.2.14:
    • Apply sheets to HTTPS and HTTP.
  • 2.2.13:
    • Try alternate quota reset logic.
    • Ensure Kafka offsets get committed properly.
    • Switch to keeping all checkpoints by default, to avoid messing with log files.
  • 2.2.12:
    • Ensure unique groupId per consumer, as required for manual partition in Kafka.
  • 2.2.11:
    • Add job-resume command.
    • Use latest ukwa/heritrix version.
    • Always resume rather than launching without a checkpoint.
    • Ensure we use the local SNAPSHOT build.
    • Store checkpoints in the state folder to make sure they get kept.
    • Avoid using disposition lock unnecessarily.
    • Add standard Heritrix crawl metrics to the Prometheus exporter bean.
    • Add roofinglines host to polite list following CAS-1113711-Y8R9
    • Use explicit flag to indicate that quotas should be reset.
  • 2.2.10:
    • Don't skip revisits when looking up in OCDX, as this allowed heavy re-crawling of popular URLs.
  • 2.2.9:
    • Add an 'excludes' watched file.
    • Use tally mechanism to reset quotas.
    • Use threadsafe queues branch of H3.
  • 2.2.8:
    • Attempt thread-safe quota resets.
  • 2.2.7:
    • Make OutbackCDX connection pool size configurable.
  • 2.2.6:
    • Modified quota reset logic.
  • 2.2.5:
    • Stop WebRender treating data URIs as a serious problem.
  • 2.2.4:
    • Format surt files correctly (use newline).
    • Disposition lock around stats reset.
  • 2.2.3:
    • Use watched file as per #17, log discarded URLs.
    • Fix over-prioritisation of inferred URLs.
    • Add Prometheus Metrics bean.
    • Allow separation of candidate URLs.
    • Update GeoIP and add Prometheus dependencies.
  • 2.2.2:
    • Add more detail to logging for virus scanner issues.
    • Lock the frontier while enqueuing to avoid NPE, see #16
  • 2.2.1:
    • Modifications to Kafka assignment handling and add SEEK_TO_BEGINNING control
  • 2.2.0:
    • Make state re-usable between launches.
    • Manually assign Kafka partitions.
  • 2.1.15:
    • Add experimental JMX -> Prometheus hook.
    • Allow max.poll.records overrides.
  • 2.1.14:
    • Avoid checking OutbackCDX for already-rejected URIs.
  • 2.1.13:
    • Remove monitor process that seems unstable on GlusterFS.
  • 2.1.12:
    • Tunable message handler thread pool size.
    • Using multithreaded message processing to ease OutbackCDX bottleneck.
  • 2.1.11:
    • Also avoid null host bad URIs.
  • 2.1.10:
    • Also avoid sending problematic URLs to OutbackCDX if present.
    • Discard unwanted and malformed URLs.
  • 2.1.9:
    • Avoid overwriting seekToBeginning before the seeking has been done.
  • 2.1.8:
    • Cleaner behaviour for stop/start of Kafka hook.
    • Add scripts to examine the H3 service.
  • 2.1.7:
    • Make GEOIP_LOOKUP_EVERY_URI an option.
    • Avoid passing all URIs through GeoIP check if it won't change the result
    • Tuneable Kafka behaviour.
  • 2.1.6:
    • Tidy up blockAll a bit.
  • 2.1.5:
    • Allow pause/resume to work as expected.
  • 2.1.4:
    • DNS results keyed consistently with hosts.
  • 2.1.3:
    • Scope decision recording made optional, and added more consistent naming.
  • 2.1.2:
    • Extended DecideRuleSequence to record the decisive rule.
  • 2.1.1:
    • Added ability to log discarded URLs, and fixed a serious bug in URL routing
  • 2.1.0:
    • Recently Seen functionality moved to a DecideRule, allowing us to use Heritrix's recheckScope feature to prevent recrawling of URLs that have been crawled since the original request was enqueued.
    • The OutbackCDXRecentlySeenDecideRule implementation also stores the last hash, so the OutbackCDXPersistLoadProcessor is no longer needed.
  • 2.0.0:
    • Switched to Recently Seen unique URI filter, backed by OutbackCDX.

ukwa-heritrix's Issues

Ensure partition offsets are being recorded properly

Having just paused and restarted a large crawl, the partition offsets have all been reset. The KAFKA_SEEK_TO_BEGINNING flag is set to false, so this should not have occurred (and even then, it should not have occurred on pausing/unpausing the crawl).

Need to verify that the offsets are being committed - as we are now manually managing assignment, maybe this needs to be handled differently.

Error sending messages to topic with Kafka

Kafka returns this error while sending messages to topic to run a crawl test as described in the documentation:

cat testdata/seed.json | $KAFKA/kafka-console-producer.sh --broker-list localhost:9092 --topic uris.tocrawl.fc

[2019-03-21 10:08:26,427] ERROR Error when sending message to topic uris.tocrawl.fc with key: null, value: 361 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback) org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for uris.tocrawl.fc-10: 1542 ms has passed since batch creation plus linger time

Would it be possible to have some clarification about how to set up a new crawl, and from which Docker container or environment these commands should be launched?

Thanks for your explanations.

Quieten down or resolve cookie warnings

We get a LOT of

2019-03-20 14:37:47.103 WARNING thread-374 org.apache.http.client.protocol.ResponseProcessCookies.processCookies() Invalid cookie header: "Set-Cookie: woocommerce_items_in_cart=0; expires=Wed, 20-Mar-2019 13:37:46
 GMT; Max-Age=-3600; path=/". Negative max-age attribute: -3600
2019-03-20 14:46:05.434 WARNING thread-374 org.apache.http.client.protocol.ResponseProcessCookies.processCookies() Invalid cookie header: "Set-Cookie: wpSGCacheBypass=0; expires=Wed, 20-Mar-2019 13:46:05 GMT; Max-
Age=-3600; path=/". Negative max-age attribute: -3600

They seem to turn up all over the place, but always with the same -1hr Max-Age, which makes me suspicious that this is an HTTP client problem. We should check our HTTP client configuration and/or quieten these errors down.
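
If they do turn out to be benign, one way to quieten them would be to raise the logging threshold for the offending HttpClient class; a sketch assuming Heritrix's java.util.logging configuration is used and that a logging.properties file is read at startup (the exact location is deployment-specific):

# add to the logging.properties used by the crawler
org.apache.http.client.protocol.ResponseProcessCookies.level = SEVERE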

Avoid attempting to parse clearly irrelevant URIs

The web-renderer processor is attempting to parse all extracted links/references, and this can throw errors like:

 Could not parse as UURI: data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL

There's no point parsing data: or indeed mailto: URIs - we should probably default to whitelisting http: and https: URIs.

Ensure we keep crawl logs files

To my surprise/dismay, it seems telling Heritrix to only keep the last checkpoint also means it deletes the previous checkpoint log files! This doesn't cause a problem when we're promptly syncing to HDFS, but is not desirable on other systems or in case we hit a bottleneck.

Thread contention in uk.bl.wap.modules.deciderules.CompressibilityDecideRule

Seeing 1/4 of all threads blocked waiting to get hold of a single java.util.zip.Deflater instance...

Blocked/Waiting On: java.util.zip.Deflater@11476000 which is owned by ToeThread #69: http://luterano.blogspot.co.uk/2006/09/el-salvadors-holocaust-hero.html(136)
    uk.bl.wap.modules.deciderules.CompressibilityDecideRule.evaluate(CompressibilityDecideRule.java:66)
    org.archive.modules.deciderules.PredicatedDecideRule.innerDecide(PredicatedDecideRule.java:47)
    org.archive.modules.deciderules.DecideRule.decisionFor(DecideRule.java:60)
    org.archive.modules.deciderules.DecideRuleSequence.innerDecide(DecideRuleSequence.java:113)
    org.archive.modules.deciderules.DecideRule.decisionFor(DecideRule.java:60)
    org.archive.crawler.framework.Scoper.isInScope(Scoper.java:107)
    org.archive.crawler.prefetch.CandidateScoper.innerProcessResult(CandidateScoper.java:40)
    org.archive.modules.Processor.process(Processor.java:142)
    org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    org.archive.crawler.postprocessor.CandidatesProcessor.runCandidateChain(CandidatesProcessor.java:176)
    org.archive.crawler.postprocessor.CandidatesProcessor.innerProcess(CandidatesProcessor.java:230)
    org.archive.modules.Processor.innerProcessResult(Processor.java:175)
    org.archive.modules.Processor.process(Processor.java:142)
    org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)

i.e. this compressibility rule needs to be made thread-safe, likely using a thread-local Deflater instance, as it seems the scope bean is shared across threads.

Ensure quotas are cleared properly

The quota reset logic was not sufficient. It cleared the quota for a specific host, but failed to do the same for 'aliases'. Specifically, if the seed uses the http scheme, but immediately redirects, the server quota for host:443 has not been reset, and this blocks the download of the robots.txt, which in turn prevents anything else being downloaded (silently discarded with a -61 status code - as we're enqueuing directly rather than via Kafka they won't be noted as discarded unless we modify the Candidates Chain).

Added logic in 2.3.5 to both clear the server quotas and switch to host quotas.

NOTE that this does not cope with other aliases, e.g. www./www#./no-www, so perhaps we need to add some more logic? One option is to allow resetQuotas to be inherited through pre-requisites or redirects from the seed. The simplest is perhaps just to report it clearly and use the aliases from W3ACT when we launch, which curators can then update as needed.

Problem unpausing after taking Kafka off line

We paused the crawler to re-configure Kafka, which required fully shutting down and restarting Kafka.

n.b. doing a stack rm complained about removing the Kafka network - we should have just stopped the Kafkas.

Once Kafka was ready, we restarted the crawlers, but they got stuck. There was some of this:

INFO: uk.bl.wap.crawler.postprocessor.KafkaKeyedCrawlLogFeed$StatsCallback onCompletion error count so far: 5629/698790000 (0.0%) [Thu Jul 11 15:38:47 GMT 2019]

And a lot of threads like this:

"ToeThread #999: http://smartroof.co.uk/2017/05/" #1087 prio=4 os_prio=0 tid=0x00007f8e5c9a2000 nid=0x49c in Object.wait() [0x00007f8b1d1d1000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.kafka.clients.Metadata.awaitUpdate(Metadata.java:177)
        - locked <0x0000000556ad3178> (a org.apache.kafka.clients.Metadata)
        at org.apache.kafka.clients.producer.KafkaProducer.waitOnMetadata(KafkaProducer.java:884)
        at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:770)
        at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:760)
        at uk.bl.wap.crawler.postprocessor.KafkaKeyedToCrawlFeed.sendToKafka(KafkaKeyedToCrawlFeed.java:176)
        at uk.bl.wap.crawler.postprocessor.KafkaKeyedToCrawlFeed.innerProcess(KafkaKeyedToCrawlFeed.java:304)
        at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.postprocessor.CandidatesProcessor.runCandidateChain(CandidatesProcessor.java:176)
        at org.archive.crawler.postprocessor.CandidatesProcessor.innerProcess(CandidatesProcessor.java:230)
        at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:152)

   Locked ownable synchronizers:
        - None

Going to attempt a full shut-down and restart. But really this should work.

The ToeThreads are not removed when pausing the crawler, so presumably something went wrong there when Kafka went away?

Pass DOM from WrenderProcessor along to the extractor(s)

Rather than skipping the rest of the chain, a successful WrenderProcessor should skip only FetchHTTP and let the rest of the chain run, especially the extractors so we have a chance of getting things we missed like srcset URLs.

Unfortunately, this likely means modifying or subclassing FetchHTTP itself so it only runs if there's no status code set (or otherwise infers it need not run).

It would also mean populating the CrawlURI properly

https://github.com/internetarchive/heritrix3/blob/05811705ed996122bea1f4e034c1ed5f7a07240f/modules/src/main/java/org/archive/modules/fetcher/FetchHTTP.java#L999-L1016

using the decoded renderedContent.

Odd 304 errors

We're seeing

fc_heritrix-worker.1.tewka74xob8a@crawler02    | SEVERE: org.archive.crawler.framework.ToeThread recoverableProblem Problem java.lang.NullPointerException occurred when trying to process 'https://www.dailymail.co.uk/reader-comments/p/comment/link/383687131' at step ABOUT_TO_BEGIN_PROCESSOR in
fc_heritrix-worker.1.tewka74xob8a@crawler02    |  [Mon Jan 21 12:26:49 GMT 2019]
fc_heritrix-worker.1.tewka74xob8a@crawler02    | java.lang.NullPointerException
fc_heritrix-worker.1.tewka74xob8a@crawler02    |        at org.archive.modules.recrawl.FetchHistoryProcessor.innerProcess(FetchHistoryProcessor.java:111)
fc_heritrix-worker.1.tewka74xob8a@crawler02    |        at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
fc_heritrix-worker.1.tewka74xob8a@crawler02    |        at org.archive.modules.Processor.process(Processor.java:142)
fc_heritrix-worker.1.tewka74xob8a@crawler02    |        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
fc_heritrix-worker.1.tewka74xob8a@crawler02    |        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)
fc_heritrix-worker.1.tewka74xob8a@crawler02    |

Which is weird, because we've never downloaded them (there's no fetch history), which means we should not see a 304 condition. From the CLI I got a 403 access denied for that URL (which works in the browser), so maybe this is a problem with the site behaviour.

It's possible that H3 is retrying downloads, and that the HTTP client is persisting some state that gets passed in subsequent requests, leading to a 304?

AFAIK H3 should not see HTTP 304 because it doesn't send If-Modified-Since or If-None-Match header.

Create new/updated wrender module to use webrender-puppeteer

Once ukwa/webrender-puppeteer#1 is complete, decide how to integrate with Heritrix and update/replace the WrenderProcessor module if necessary.

Either:

  • A new service that provides the same API and ensures the rendered versions get packaged up. (This does not require changes to Heritrix, but changes the deployment/service suite).
  • A new module that calls docker run ukwa/webrender-puppeteer directly rather than using a REST API.

The first is preferred, but that means we need to complete https://github.com/ukwa/ukwa-webrender-server/issues/1

Add viaHeritrix download option to WrenderProcessor

The current WrenderProcessor expects warcprox to be used to capture the rendered resources. Although the quality may suffer, it may be useful to add a mode that lets Heritrix3 (re)download the resources, to make (initial?) deployment simpler.

It would just be a case of enqueueing the other URLs in the request-response entries as E links, and handing processing along the chain rather than skipping the rest of the processors.

Should also gobble up all those delicious cookies...

    "pages": [
      {
        "cookies": [
          {
            "domain": ".bl.uk", 
            "expires": "Fri, 17 Nov 2017 20:38:34 GMT", 
            "expiry": 1510951114, 
            "httponly": false, 
            "name": "__qca", 
            "path": "/", 
            "secure": false, 
            "value": "P0-1147487100-1476995914446"
          }, 
        ...

ExtractorJson: NoSuchMethodError

The ExtractorJson module is throwing this error:

java.lang.NoSuchMethodError: org.archive.modules.extractor.Link.addRelativeToBase(Lorg/archive/modules/CrawlURI;ILjava/lang/String;Lorg/archive/modules/extractor/LinkContext;Lorg/archive/modules/extractor/Hop;)V
        at uk.bl.wap.modules.extractor.ExtractorJson.innerExtract(ExtractorJson.java:42)
        at org.archive.modules.extractor.ContentExtractor.extract(ContentExtractor.java:37)
        at org.archive.modules.extractor.Extractor.innerProcess(Extractor.java:101)
        at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)

Quite rightly as that method/signature doesn't exist. However, that's not what we're actually calling.

Change the actual crawl job name when starting up?

When running different crawls, it would be preferable that the job name actually said frequent-npld or dc2019 rather than always saying frequent. To do this, we could change the start mechanism to use frequent as a profile and launch the actual job under the requested crawl name.

Cope better with partially-failed Web Render events

Occasionally, due to edge cases or load peaks, the Heritrix engine can think the web rendering failed on the first pass, but in fact it just didn't complete cleanly or within the time-out.

Because it appeared to fail, H3 cannot extract the links to enqueue them. Instead, it defers then retries the download. This was intended to retry using FetchHTTP instead, but in these specific cases, the RecentlySeenDecideRule spots that the URL has been captured (because warcprox recorded it as such during the successful part of the download). It is therefore discarded (-5000) rather than re-crawled.

Unfortunately, this means that the outlinks are never captured. If no other process happens across those links, we don't get anything else from the site.

😞

It's not clear there's a huge amount we can do about this within Heritrix itself. We've coupled the processes together like this on purpose, to avoid sites getting crawled multiple times by multiple crawlers, so decoupling them isn't really an option.

One possibility would be to add/extend our warcprox modules so it can post links to a queue. But this involves putting quite a bit of logic into warcprox that doesn't really belong there.

It's somewhat related to #28, in that we want to make sure we extract links from the rendered DOM, which we can't get hold of very easily.

Another idea would be a post-crawl QA/checking process that scans what has happened to seeds, looks for outlinks at the same time, and posts them to Kafka for download.

I wonder whether it would make sense to crawl a 'pretend' website that makes it easier to pipe these things through Heritrix. e.g. we hit a seed https://www.bl.uk/, and after we deal with it, we enqueue a 'pretend' URL that gives us access to the onreadydom or failing that, the original response. e.g. http://internal.check.service/get?url=https://www.bl.uk. This would be enqueued and extracted as normal by Heritrix, using the onreadydom if that worked, or the normal response if that failed for some reason. Link extraction would proceed as normal. This would give Web Render link extraction a second chance, and also give it a possibility of picking up URLs we missed (srcset etc.).

The main drawback is this would 'pollute' the logs and WARCs with content that didn't really mean what the rest of it means. However, the WARC pollution could be limited by adding an annotation that would be configured to prevent the records being written. The entries in the crawl log are probably fine, and would act as an indicator of what had been done.

Reminder: if we mark WebRendered URLs as -5002, this prevented enqueueing, so we had to add processErrorOutlinks=true. If we marked WebRendered URLs with a success status code, this had a different problem: writing to WARCs, and was that it? In which case, we could block the writing instead? Ah, duplicate records sent to OutbackCDX - I think that was it.

NPE in StatisticsTracker because of `null` `SourceTag`

Just had a large crawl die horribly because of:

SEVERE: org.archive.crawler.framework.ToeThread run Fatal exception in ToeThread #989: dns:007bond.co.uk [Fri Jul 12 20:13:38 GMT 2019]
java.lang.NullPointerException
        at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
        at org.archive.crawler.reporting.StatisticsTracker.saveSourceStats(StatisticsTracker.java:767)
        at org.archive.crawler.reporting.StatisticsTracker.crawledURISuccessful(StatisticsTracker.java:760)
        at org.archive.crawler.reporting.StatisticsTracker.onApplicationEvent(StatisticsTracker.java:986)
        at org.springframework.context.event.SimpleApplicationEventMulticaster.multicastEvent(SimpleApplicationEventMulticaster.java:97)
        at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:303)
        at org.archive.crawler.frontier.WorkQueueFrontier.processFinish(WorkQueueFrontier.java:977)
        at org.archive.crawler.frontier.AbstractFrontier.finished(AbstractFrontier.java:576)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:187)

Which happened a lot and killed all the ToeThreads. This was because, in this code:

https://github.com/internetarchive/heritrix3/blob/aa705bef2eb8fbbb9aeb56995e713a7b6ba0ed00/engine/src/main/java/org/archive/crawler/reporting/StatisticsTracker.java#L759-L760

If the A_SOURCE_TAG key is set but its value is null, it tries to use the null later, and ConcurrentHashMap throws an NPE on a null key.

Looking at what happened, this appears to arise on DNS records of URLs discovered via the robots.txt. e.g. these events (that appeared to occur out of order, as the hosts are different).

2019-07-12T20:13:35.440Z     1         63 dns:007bond.co.uk LP http://007bond.co.uk/sitemap.xml text/dns #989 20190712201335399+38 sha1:ISZ7R2PFKOMBUTNRCOCJPFNKCYBM54HB - - {"warcFilename":"BL-NPLD-20190712194016457-10688-71~npld-dc-heritrix3-worker-1~8443.warc.gz","warcFileOffset":660940299,"scopeDecision":"ACCEPT by rule #14 PrerequisiteAcceptDecideRule","warcFileRecordLength":243}
2019-07-12T20:13:35.305Z   200        200 http://www.007bond.co.uk/robots.txt P http://www.007bond.co.uk/ text/plain #233 20190712201335014+34 sha1:OOJFFALMLEBE7RK6362FV6YQHHYRKHYJ - ip:77.104.133.250 {"contentSize":619,"warcFilename":"BL-NPLD-20190712195333571-10698-71~npld-dc-heritrix3-worker-1~8443.warc.gz","warcFileOffset":516684741,"scopeDecision":"ACCEPT by rule #14 PrerequisiteAcceptDecideRule","warcFileRecordLength":2429}

The problem appears to be that you must only copy over the SourceTag if it's not null. Hence, in the RobotsTxtSitemapExtractor, this was wrong:

            curiClone.setSourceTag(curi.getSourceTag());

but this should be fine:

            if (curi.getSourceTag() != null) {
                curiClone.setSourceTag(curi.getSourceTag());
            }

Add white-list/black-list support

The W3ACT definition includes the idea of regular-expressions for white-listing and black-listing additional URLs that are allowed into the crawl. The current launch mechanism does not provide a way to pass those in or register them.

Option 0, the very simplest approach, is to update the whitelist and blacklist text files based on W3ACT data prior to launch. This would work with the current crawler but not the newer scalable crawler, which is designed to operate pretty-much continuously.

Option 1 is to use the existing blacklist/whitelist logic (i.e. org.archive.modules.deciderules.MatchesListRegexDecideRule instances). When whitelist/blacklist requests come in, the handler updates the local beans. This means the lists are global to the whole crawl on that crawler.

Option 2 is the same, but somehow associates the lists with a seed/source or sheet SURT. This means the whitelists and blacklists can be made to operate only in the context of the URLs found via a particular seed. It is not clear whether this is a large advantage or not!

However, these last two options do not work with the new scaling method, as discovered-URLs are delivered to hosts based on the keys of the target URL, and so if the white/blacklists refer to different hosts than the seed, the right crawler probably won't know about the whitelist.

One alternative would be to separate the crawl streams for crawl instructions versus discovered URLs, but this would be complicated as a simple implementation would mean all crawlers fetched all seeds. i.e. if we use a shared crawl-job stream rather than putting everything in the distributed to-crawl stream. This would need a separate Kafka listener that set up the crawl configuration as instructed, but then just passed the URL on to the to-crawl stream. This is actually a quite reasonable set-up, but requires a fair amount of work.

A different approach would be to have a separate Scope Oracle, i.e. host some crawl configuration as a separate service and pull that in as needed. But that only really moves the problem to a separate component, and isn't much of an advantage unless it's part of a full de-coupling of the crawl modules, i.e. introducing a discovered-uris stream and running a separate process to scope the URLs and pass them on to the to-crawl stream.

In summary, a crawl-job launch mechanism is probably the best approach in the nearish-term. We could base it on Brozzler's job configuration, and every crawler instance would use it to set/update the crawler configuration. Because passing the URL on to the right instance without duplication is difficult, we could just make it a two-step launch. i.e. send the crawl-job configuration first, and then send in the to-crawl URL a little while after?

Longer term, the Scope Oracle is probably a better approach. It would update itself based on the latest job configurations from W3ACT/wherever, and could be plumbed into H3 as a REST service or as a separate discovered-uri stream consumer.

Links not being extracted from site maps

Unfortunately, links were never being extracted from site maps, as no XML extractor was in place. Adding an instance of ExtractorXML to the fetch chain should do it.

Reset caps when seeds appear

When seeds are injected into the crawl, they should also clear any capping of the crawls. This means resetting the counters and waking any retired queues (e.g. reconsiderRetiredQueues as shown here).

Prevent Kryo warnings

We're observing these warnings:

UNREGISTERED FOR KRYO class org.archive.util.Histotable in class org.archive.crawler.frontier.BdbWorkQueue
UNREGISTERED FOR KRYO class org.archive.crawler.frontier.precedence.HighestUriQueuePrecedencePolicy$HighestUriPrecedenceProvider in class org.archive.crawler.frontier.BdbWorkQueue
UNREGISTERED FOR KRYO class org.json.JSONObject in class org.archive.modules.CrawlURI
UNREGISTERED FOR KRYO class java.util.LinkedHashSet in class org.archive.modules.CrawlURI

These should be registered with AutoKryo, which I think has to be done in the main project.

Fix up report stats in processors

For the sitemap processor, we get this report:

Processor: uk.bl.wap.modules.extractor.RobotsTxtSitemapExtractor
  0 links from 4297 CrawlURIs

which is probably because the way we have to enqueue the links means they don't get auto-counted.

Similarly, we should report some worked/failed stats in the Wrenderer.

Complete testing of initial streaming crawler prototype

This system builds a prototype 'frequent' crawl job that re-routes discovered URIs via a Kafka log stream. As Kafka log partitions can be partitioned by key, this means we can orchestrate multiple Heritrix3 instances simply by starting more instances under the same group.id.

This appears to work well, but a larger-scale test is required before we proceed with refining how it works.

To perform a suitable test, we need

  • ensure the Kafka read-from-start is the default but that it is not done when resuming from a checkpoint.
  • auto-launch is not the default but can be specified by env vars
  • modules requiring other services/containers should be made optional
  • check that @GilHoggarth is able to deploy a test instance of the service ensemble (defined in https://github.com/ukwa/ukwa-ingest-services/tree/master/integration-test/ingest-ng-phase-2 which is currently being written) but without Docker. See also https://kafka.apache.org/quickstart
  • the current via-Kafka model to be basically working.
  • an ukwa-manage task that uses the crawl launch system we designed before, but which will send to Kafka instead of RabbitMQ.
  • check that we can use that to run many more jobs (at least the daily crawl stream)
  • check that resetting the caps to zero works.
  • check that we can track all to-crawl URIs and see them in the crawl log.
  • Add a ConsumerRebalanceListener and force the crawl to pause if the number of consumers changes. This is not really a good idea.

NPE in Kafka handling caused crawl to cease

Observed failure in Kafka hook likely due to delay in Kafka coming up...

2019-03-14 16:41:30.549 WARNING thread-498 org.apache.kafka.clients.NetworkClient.warn() [Producer clientId=producer-1] Error while fetching metadata with correlation id 1 : {fc.crawled=UNKNOWN_TOPIC_OR_PARTITION}
Exception in thread "Thread-15" java.lang.NullPointerException
at uk.bl.wap.crawler.frontier.KafkaUrlReceiver$KafkaConsumerRunner.run(KafkaUrlReceiver.java:350)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Verify that the continuous crawler is working

The continuous crawler has been running successfully for weeks, but we need to verify that it is doing a sufficiently good job to justify the switch-over.

Proposal is to generate crawl volume breakdowns per host across daily and weekly crawl streams, and compare them to make sure they are roughly equivalent.

  • Daily seed only running every other day, due to small delays. The recrawl periods should be shortened slightly, e.g. 23 hrs not 24hrs etc. but seed relaunch should use a narrow re-crawl window (10min?) to prevent the shorter recrawl period causing the schedule to drift (at the cost of occasionally double-crawling seeds).
  • Ensure DNS failures are not remembered forever. internetarchive/heritrix3#234

Scoping not being applied correctly.

The current design scopes prior to enqueueing the discovered URLs in the 'to-crawl' queue. This will not work as expected currently, as when running distributed, each H3 engine only has the scope configuration for the seeds passed to it.

The better plan is to enqueue all discovered URLs and let the receiver do the scoping. We could use a topic naming convention to manage these streams:

  • uris.requested (where crawl launch requests go)
  • uris.discovered (where all discovered URIs go)
  • uris.discarded (where out-of-scope or otherwise discarded URIs go)
  • uris.to.crawl (where in-scope URIs go, if we were to run the scoper as a separate process)

The receiver would subscribe to uris.requested and uris.discovered in the current design.

Additionally, it would be good to work out how to modify the candidate chain to redirect the out-of-scope URLs to a dedicated stream.
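
If this convention were adopted, the topics could be created up front; a sketch (partition and replication counts are illustrative, and older Kafka tooling needs --zookeeper rather than --bootstrap-server):

for topic in uris.requested uris.discovered uris.discarded uris.to.crawl; do
  kafka-topics --bootstrap-server kafka:9092 --create \
    --topic "$topic" --partitions 16 --replication-factor 1
done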

Allow separate Requested/Discovered/Accepted URI streams?

Should we allow the launch requests to be stored in a separate topic/log/stream from the log of discovered URLs?

To make things faster, when running a single crawler, we would directly enqueue all discovered URIs and log them to a stream that would only be used when we needed to rebuild the frontier. This would also help alleviate the problems we've seen when pausing/unpausing or restarting the crawler, where it rewinds to the start of the launch queue when it should not (but that's really a bug in the Kafka client, so we should really resolve that).

It does, however, complicate things if we still want to route discovered URIs via the stream rather than directly enqueuing them, e.g. for distributed crawling or for when using streams to send different requests to different crawl processes. In that case, we need to allow the receiver to listen to two streams, and re-enable the redirection via the stream, and put this all behind a single configuration option.

Extend URL Receiver to allow different event stores to be used

The KafkaUrlReceiver could be refactored to offer different storage options, e.g.

  • Kafka, which is extremely scalable but hard work to deploy (needs zookeepers, replication etc),
  • NATS Streaming which is a streaming store+API that persists to disk so should scale quite well. The Java API looks nice too.
  • NSQ looks potentially useful but it's not clear to me how to make sure the clients resume consumption cleanly, or how to rewind to go back and check the event stream.
  • Redis Streaming which is a widely-used stream store but one limited by RAM (c. 100MB/1e6 messages, so too small for domain crawls).
  • Log files, based on using Tailer. See also the source for Tailer.java and this related example. We'd need to implement the offset durability logic ourselves, hooked into the H3 checkpoint mechanism. Would also need to understand log rotation, and possibly implement it in the consumer so it can be synchronised with the checkpoints.

This would allow the same continuous crawling behaviour to be used without requiring Kafka, which would make it easier for others to experiment with our crawl set-up. But it would significantly increase the integration testing needed, would have no log compression, and we may not end up using it.

Blocked re-crawl of robots.txt causing failure cascade for host

See, for e.g.

2019-03-28T22:32:15.322Z -5000          - https://conservativehome.blogs.com/robots.txt LLLREELLELRLLLEPR http://conservativehome.blogs.com/robots.txt unknown #022 - - tid:87632:https://theconversation.com/brexit-an-escape-room-with-no-escape-109935/ - {"scopeDecision":"REJECT by rule #13 OutbackCDXRecentlySeenDecideRule"}

i.e. robots.txt getting blocked because we've seen it recently, but this leads to a cascade of -61 events.

ClamAV processor should timeout

We've seen an issue in production with hanging socket connections interfering with crawl ops.

[ToeThread #54: http://www.wymondhamandattleboroughmercury.co.uk/news/greening-wymondham-big-litter-pick-2018-1-5469488?action=login
 CrawlURI http://www.wymondhamandattleboroughmercury.co.uk/news/greening-wymondham-big-litter-pick-2018-1-5469488?action=login LLL http://www.wymondhamandattleboroughmercury.co.uk/news/greening-wymondham-big-litter-pick-2018-1-5469488    0 attempts
    in processor: viralContent
    ACTIVE for 7d21h23m42s277ms
    step: ABOUT_TO_BEGIN_PROCESSOR for 7d21h23m41s42ms
Java Thread State: RUNNABLE
Blocked/Waiting On: NONE
    java.net.SocketInputStream.socketRead0(Native Method)
    java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    java.net.SocketInputStream.read(SocketInputStream.java:171)
    java.net.SocketInputStream.read(SocketInputStream.java:141)
    java.net.SocketInputStream.read(SocketInputStream.java:127)
    uk.bl.wap.util.ClamdScanner.getResponse(ClamdScanner.java:136)
    uk.bl.wap.util.ClamdScanner.clamdSession(ClamdScanner.java:105)
    uk.bl.wap.util.ClamdScanner.clamdScan(ClamdScanner.java:51)
    uk.bl.wap.crawler.processor.ViralContentProcessor.innerProcess(ViralContentProcessor.java:88)
    org.archive.modules.Processor.innerProcessResult(Processor.java:175)
    org.archive.modules.Processor.process(Processor.java:142)
    org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
    org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)
]

We should check that the ViralContentProcessor will time out after some reasonable time (a few mins).

NullPointerExceptions killing ToeThreads

We're seeing really odd fatal errors, killing off ToeThreads in crawls:

SEVERE: org.archive.crawler.framework.ToeThread run Fatal exception in ToeThread #99: https://www.andersonslimited.co.uk/robots.txt [Wed Jul 25 09:08:01 GMT 2018]
java.lang.NullPointerException
        at org.archive.crawler.frontier.BdbMultipleWorkQueues.delete(BdbMultipleWorkQueues.java:484)
        at org.archive.crawler.frontier.BdbWorkQueue.deleteItem(BdbWorkQueue.java:88)
        at org.archive.crawler.frontier.WorkQueue.dequeue(WorkQueue.java:195)
        at org.archive.crawler.frontier.WorkQueueFrontier.processFinish(WorkQueueFrontier.java:948)
        at org.archive.crawler.frontier.AbstractFrontier.finished(AbstractFrontier.java:574)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:187)

Looking at the code, this shouldn't really be possible!

Going up the call tree, it appears that peekItem has become inconsistent, i.e. has been reset to null.

Note that NetArchive Suite have also seen this issue and patched it in this way.

Also observing

java.util.ConcurrentModificationException
        at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1211)
        at java.util.TreeMap$EntryIterator.next(TreeMap.java:1247)
        at java.util.TreeMap$EntryIterator.next(TreeMap.java:1242)
        at com.esotericsoftware.kryo.serialize.MapSerializer.writeObjectData(MapSerializer.java:90)
        at com.esotericsoftware.kryo.serialize.FieldSerializer.writeObjectData(FieldSerializer.java:161)
        at com.esotericsoftware.kryo.Kryo.writeObjectData(Kryo.java:453)
        at com.esotericsoftware.kryo.ObjectBuffer.writeObjectData(ObjectBuffer.java:262)
        at org.archive.bdb.KryoBinding.objectToEntry(KryoBinding.java:81)
        at com.sleepycat.collections.DataView.useValue(DataView.java:549)
        at com.sleepycat.collections.DataCursor.initForPut(DataCursor.java:817)
        at com.sleepycat.collections.DataCursor.put(DataCursor.java:751)
        at com.sleepycat.collections.StoredContainer.putKeyValue(StoredContainer.java:321)
        at com.sleepycat.collections.StoredMap.put(StoredMap.java:279)
        at org.archive.util.ObjectIdentityBdbManualCache$1.onRemoval(ObjectIdentityBdbManualCache.java:122)
        at com.google.common.cache.LocalCache.processPendingNotifications(LocalCache.java:1954)
        at com.google.common.cache.LocalCache$Segment.runUnlockedCleanup(LocalCache.java:3457)
        at com.google.common.cache.LocalCache$Segment.postWriteCleanup(LocalCache.java:3433)
        at com.google.common.cache.LocalCache$Segment.put(LocalCache.java:2888)
        at com.google.common.cache.LocalCache.put(LocalCache.java:4146)
        at org.archive.util.ObjectIdentityBdbManualCache.dirtyKey(ObjectIdentityBdbManualCache.java:379)
        at org.archive.crawler.frontier.WorkQueue.makeDirty(WorkQueue.java:690)
        at org.archive.crawler.frontier.AbstractFrontier.tally(AbstractFrontier.java:636)
        at org.archive.crawler.frontier.AbstractFrontier.doJournalAdded(AbstractFrontier.java:647)
        at org.archive.crawler.frontier.WorkQueueFrontier.sendToQueue(WorkQueueFrontier.java:410)
        at org.archive.crawler.frontier.WorkQueueFrontier.processScheduleAlways(WorkQueueFrontier.java:333)
        at org.archive.crawler.frontier.AbstractFrontier.receive(AbstractFrontier.java:554)
        at org.archive.crawler.util.SetBasedUriUniqFilter.add(SetBasedUriUniqFilter.java:85)
        at org.archive.crawler.frontier.WorkQueueFrontier.processScheduleIfUnique(WorkQueueFrontier.java:378)
        at org.archive.crawler.frontier.WorkQueueFrontier.schedule(WorkQueueFrontier.java:356)
        at org.archive.crawler.postprocessor.CandidatesProcessor.runCandidateChain(CandidatesProcessor.java:189)
        at uk.bl.wap.crawler.frontier.KafkaUrlReceiver$CrawlMessageHandler.run(KafkaUrlReceiver.java:468)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

So, what seems to be happening, I think, is that occasionally, between this statement and this one, the WorkQueue gets updated by a separate thread in a way that forces it to get written out to disk and then read back in again. As peekItem is transient, flushing it out to the disk and back drops the value and we're left with a null.

Record seed configuration updates as annotations

The fact that incoming launch events can also re-configure the crawl means we should really log a bit of that information somewhere so people can work out what's going on.

A lazy version would be to clone the whole launch message into the extra-info JSON blob, but this could include e.g. cookies and is really overkill. The crucial properties are (see the sketch after this list):

  • isSeed -- whether it's a seed
  • forceFetch? -- not clear, as this only forces the CrawlURI to be enqueued into the frontier (as a re-prioritisation method)
  • the list of sheets applied
  • the targetSheet spec? (this can go in the JSON blob)
  • launchTimestamp (already implemented)
  • refreshDepth
  • resetQuotas (already implemented)
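
As a rough illustration (a sketch only, not the existing code; all class, method and annotation names here are invented), a helper along these lines could stamp just those properties onto the CrawlURI as annotations so they end up in the crawl log:

```java
import java.util.List;

import org.archive.modules.CrawlURI;

/**
 * Hypothetical helper: record the crucial launch-message properties as
 * CrawlURI annotations, rather than cloning the whole launch message into
 * the extra-info JSON blob.
 */
public final class LaunchAnnotations {

    public static void record(CrawlURI curi, boolean isSeed, boolean forceFetch,
            List<String> sheets, String launchTimestamp, Integer refreshDepth,
            boolean resetQuotas) {
        if (isSeed) {
            curi.getAnnotations().add("isSeed");
        }
        if (forceFetch) {
            curi.getAnnotations().add("forceFetch");
        }
        if (sheets != null && !sheets.isEmpty()) {
            curi.getAnnotations().add("sheets:" + String.join(",", sheets));
        }
        if (launchTimestamp != null) {
            curi.getAnnotations().add("launchTimestamp:" + launchTimestamp);
        }
        if (refreshDepth != null) {
            curi.getAnnotations().add("refreshDepth:" + refreshDepth);
        }
        if (resetQuotas) {
            curi.getAnnotations().add("resetQuotas");
        }
    }
}
```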

Ensure quota resets work with server quotas

We switched to host quotas to make resetting easier, but as all DNS requests get allocated to a host label of dns:, this means we can run out of DNS quota!

So it would be better to switch back to server quotas, but this needs to address the earlier problems arising because the seed and/or its pre-requisites redirect to a different server. i.e. if we have a seed of http://example.org/, we may get a P hop to http://example.org/robots.txt and then a PR hop to https://example.org/robots.txt, which then gets blocked by the quota for example.org:443 long before the example.org:80 quota even gets reset.

An idea would be to propagate the resetQuotas annotation to prerequisites including via redirects. But the preconditions system works differently than the usual link extraction, skipping the rest of the fetch chain, so it'll have to be handled elsewhere.

The fullVia is set when the candidate chain is run, but are pre-requisites passed through the candidate chain? Ah, yes, getPrerequisiteUri is only called in CandidatesProcessor which calls runCandidateChain on it.

So, the simplest approach is to add a processor to the candidate chain that checks the via and propagates any resetQuotas annotation to pre-requisites or redirects.
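
A minimal sketch of such a processor, assuming the annotation is literally named resetQuotas (the class name is a placeholder, not the actual implementation):

```java
import org.archive.modules.CrawlURI;
import org.archive.modules.Processor;

/**
 * Hypothetical sketch: a candidate-chain processor that copies the
 * "resetQuotas" annotation from the via CrawlURI onto prerequisites and
 * redirects, so quota resets survive e.g. an http -> https robots.txt hop.
 */
public class ResetQuotasPropagator extends Processor {

    private static final String RESET_QUOTAS = "resetQuotas";

    @Override
    protected boolean shouldProcess(CrawlURI curi) {
        // Only consider candidates whose last hop is a prerequisite (P) or redirect (R):
        String hopPath = curi.getPathFromSeed();
        return hopPath != null && hopPath.matches(".*[PR]$");
    }

    @Override
    protected void innerProcess(CrawlURI curi) {
        CrawlURI via = curi.getFullVia();
        if (via != null && via.getAnnotations().contains(RESET_QUOTAS)) {
            curi.getAnnotations().add(RESET_QUOTAS);
        }
    }
}
```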

How to ensure sitemaps and multi-level sitemaps get refreshed?

We can currently recrawl seeds or whole hosts. This means that site maps need to be added manually as seeds, and even then, sub-site-maps won't get picked up.

Ideally, a refresh would refresh a couple of hops deep, or perhaps make site maps a special case where the launch date is always inherited?

Switch to a 'Scope Oracle' model

Currently, we need to replicate the current crawl scope(s) in multiple places. Managing and maintaining the scope becomes rather cumbersome under distributed crawling, and it would also be useful to be able to consult the scope from the access side as well as during crawling and as part of W3ACT.

In principle, we could have a distinct REST API service that held the current crawl scopes (NPLD and BY-PERMISSION). W3ACT would consult it, and changes in W3ACT would change it there. All the crawlers would consult it rather than have their own. A single replica service could be used for access/frontend services as needed.

Of course, this is quite a big bit of work, so recording it here, but putting it on the back-burner for a while. Need to settle in with the current new model first!

Re-crawl logic causes over-crawling of sites with many page-level Targets

The launch_ts approach works well, but when we have a large Target with multiple individual page-level Targets (e.g. the BBC News homepage versus individual articles), the current implementation tends to over-crawl. e.g. if a particular article is re-crawled, it sets a new launch_ts, and as this is inherited by discovered URLs, it causes the whole site to get re-crawled.

In practice, we probably only want to inherit the launch_ts for the homepage of each site.

  1. we could not inherit launch_ts and rely on the recrawl sheet frequency
  2. only inherit launch_ts for URLs with no path (this will not work quite as expected if the whole host does not have a suitable record)
  3. as 2., but use additional data from W3ACT to spot the highest-level URL on a site. This will still fail when we want to crawl a subsection of a site at a different frequency.

In truth, we want surt-prefix-scoped launch_ts values. i.e. rather than inherit the launch_ts value directly, add a new mechanism that gets configured when the URL comes in. This is like the sheets mechanism, but sheets are quite difficult to use for this, as you'd need to have a sheet for every SURT prefix, with the launch_ts set. It seems likely that this large number of sheets would not work reliably.

An alternative would be to create a new Processor that gets configured with the SURT-to-launch_ts mapping and applies the right one prior to scoping of candidates.
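
A rough sketch of that idea, assuming a plain SURT-prefix-to-timestamp map and a hypothetical "launch_ts" data key (the real key names and Spring wiring would differ):

```java
import java.util.TreeMap;

import org.archive.modules.CrawlURI;
import org.archive.modules.Processor;
import org.archive.util.SURT;

/**
 * Hypothetical sketch: apply a per-SURT-prefix launch_ts to each candidate
 * before scoping. The map, the "launch_ts" data key and the class name are
 * all illustrative assumptions.
 */
public class SurtLaunchTimestampApplier extends Processor {

    // SURT prefix -> launch timestamp (e.g. "20190119090000")
    private final TreeMap<String, String> launchTsBySurtPrefix = new TreeMap<>();

    public void putLaunchTimestamp(String surtPrefix, String launchTs) {
        launchTsBySurtPrefix.put(surtPrefix, launchTs);
    }

    @Override
    protected boolean shouldProcess(CrawlURI curi) {
        return !launchTsBySurtPrefix.isEmpty();
    }

    @Override
    protected void innerProcess(CrawlURI curi) {
        String surt = SURT.fromURI(curi.getUURI().toString());
        // Walk down the sorted keys from the floor entry; the first key that
        // is a prefix of this SURT is the longest matching prefix.
        for (String prefix = launchTsBySurtPrefix.floorKey(surt);
                prefix != null;
                prefix = launchTsBySurtPrefix.lowerKey(prefix)) {
            if (surt.startsWith(prefix)) {
                curi.getData().put("launch_ts", launchTsBySurtPrefix.get(prefix));
                return;
            }
        }
    }
}
```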

Always get prerequisites that are resolved via redirects?

Currently, the org.archive.modules.deciderules.PrerequisiteAcceptDecideRule only matches:

hop path matches ^.*P$

but should this really be...

hop path matches ^.*PR+$

I'm seeing this a lot with robots.txt resolution redirecting from http to https, but perhaps that's fine, as the robots.txt for an http: server is not the same as the robots.txt for the corresponding https: server.
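
For illustration, a custom PredicatedDecideRule could implement the broadened match; this sketch (class name hypothetical) uses ^.*PR*$ so that it covers both the existing plain-prerequisite case and the proposed prerequisite-plus-redirects case:

```java
import org.archive.modules.CrawlURI;
import org.archive.modules.deciderules.DecideResult;
import org.archive.modules.deciderules.PredicatedDecideRule;

/**
 * Hypothetical sketch: ACCEPT prerequisites even when they have been
 * resolved via one or more redirects, i.e. hop paths ending P, PR, PRR, ...
 */
public class PrerequisiteRedirectAcceptDecideRule extends PredicatedDecideRule {

    public PrerequisiteRedirectAcceptDecideRule() {
        setDecision(DecideResult.ACCEPT);
    }

    @Override
    protected boolean evaluate(CrawlURI curi) {
        String hopPath = curi.getPathFromSeed();
        // "^.*PR*$" covers both "^.*P$" and "^.*PR+$":
        return hopPath != null && hopPath.matches("^.*PR*$");
    }
}
```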

Support mildly malformed and compressed Sitemaps

Running in stage/pre-prod and seeing some sitemaps that we can't parse.

The GZip one will require content sniffing, which may be supported already by crawler-commons. The XML parser seems to be a bit overzealous here; maybe it can be made more forgiving?
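
The GZip part is essentially a magic-byte check; a minimal sketch in plain Java (crawler-commons may well already do this for us):

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

/**
 * Minimal sketch of gzip content-sniffing for sitemap payloads: peek at the
 * first two bytes and wrap the stream in a GZIPInputStream if they are the
 * gzip magic numbers (0x1f 0x8b).
 */
public final class SitemapStreams {

    public static InputStream maybeGunzip(InputStream in) throws IOException {
        BufferedInputStream bin = new BufferedInputStream(in);
        bin.mark(2);
        int b1 = bin.read();
        int b2 = bin.read();
        bin.reset();
        if (b1 == 0x1f && b2 == 0x8b) {
            return new GZIPInputStream(bin);
        }
        return bin;
    }
}
```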

Synchronise crawl scope across all Heritrix workers

We have experimented with leaving all discovered URLs in the tocrawl stream. This will probably work okay for the frequent crawls, but due to the large number of out-of-scope URLs and duplicates, the queue rapidly becomes far too large in the domain crawl.

We could attempt to just deduplicate the tocrawl stream, but this could conflict with the intentional recrawling of URLs. In the current model, the simplest approach is to apply the full set of scope rules before emitting the discovered URLs.

The problem here is that, due to the way work is distributed across the H3 instances, the crawl scopes are inconsistent across nodes. For example, if we launch a crawl and mark a URL as a seed, the instance that crawls that host will add the URL to widen its scope. The other nodes don't see the seed and so don't do this, which in turn means that if one of those other nodes discovers URLs on that host, it will erroneously discard them from the crawl.

In the very short term, for the domain crawl, the scope can be fixed for the duration of the crawl.

For dynamic crawls, to keep the scope consistent across the nodes, it would probably make most sense for the scope to be held outside the nodes, in a remote database. However, that's a fairly big leap from where we are right now, in terms of crawl life-cycle management and because it means adding yet another component to the system.

An alternative strategy would be to add a KafkaCrawlConfigReceiver, running on every worker, each reading the same single-partition crawl.config queue. When the current KafkaUrlReceiver picks up a seed, it could post a message to the crawl.config queue, then handle the seed as normal. The KafkaCrawlConfigReceiver instances would then pick up this message and grow the scope as required, without enqueueing the URL (i.e. by modifying the DecideRule, via an autowired connection).

This avoids adding any new external system, and ensures crawl launch is still a single action, but does not cope well when we want to remove a site from the crawl scope.
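
A rough sketch of that receiver idea, assuming each crawl.config message body is simply a SURT prefix to add to the scope (the real messages would presumably be JSON, and the scope is modelled here as a shared set rather than a DecideRule):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

/**
 * Sketch of the KafkaCrawlConfigReceiver idea (not real code): every worker
 * consumes the same crawl.config topic with its own group.id, so each worker
 * sees every message and widens its local scope without enqueueing the URL.
 */
public class KafkaCrawlConfigReceiverSketch implements Runnable {

    private final Set<String> scopeSurtPrefixes; // shared with the scope rule
    private final String bootstrapServers;
    private final String groupId;

    public KafkaCrawlConfigReceiverSketch(Set<String> scopeSurtPrefixes,
            String bootstrapServers, String groupId) {
        this.scopeSurtPrefixes = scopeSurtPrefixes;
        this.bootstrapServers = bootstrapServers;
        this.groupId = groupId; // must be unique per worker
    }

    @Override
    public void run() {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", groupId);
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("crawl.config"));
            while (!Thread.currentThread().isInterrupted()) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Assume each message is simply a SURT prefix to add to the scope:
                    scopeSurtPrefixes.add(record.value().trim());
                }
            }
        }
    }
}
```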

The simplest shared external component would be a WatchedSurtFile. This could be updated externally, or from the crawl, and could be re-read quickly. The main constraint is that it has to be held outside of the job folder, so it can be cross-mounted and made available for every node.

Having tested this, it seems to work fine - we can mount an alternative path to a SURT file and it gets reloaded. For the frequent crawl, we can also get a Luigi job to re-create this file periodically. This seems the simplest option, and should work well as a shared file distributed via GlusterFS.
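
For reference, the watched-file idea boils down to something like the following sketch (not the actual implementation): re-read the cross-mounted SURT file whenever its modification time changes, so all workers converge on the same scope.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch only: reload a shared SURT prefix file when its mtime changes.
 */
public class WatchedSurtFileSketch {

    private final Path surtFile;
    private final Set<String> surtPrefixes = ConcurrentHashMap.newKeySet();
    private volatile long lastModified = -1;

    public WatchedSurtFileSketch(Path surtFile) {
        this.surtFile = surtFile;
    }

    /** Call periodically, e.g. from a scheduled executor. */
    public void checkForReload() throws IOException {
        long mtime = Files.getLastModifiedTime(surtFile).toMillis();
        if (mtime != lastModified) {
            Set<String> fresh = ConcurrentHashMap.newKeySet();
            for (String line : Files.readAllLines(surtFile)) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                    fresh.add(trimmed);
                }
            }
            surtPrefixes.clear();
            surtPrefixes.addAll(fresh);
            lastModified = mtime;
        }
    }

    /** True if the given SURT starts with any of the configured prefixes. */
    public boolean inScope(String surt) {
        return surtPrefixes.stream().anyMatch(surt::startsWith);
    }
}
```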

Quota resets are not working because sheet association was broken for HTTPS

It seems the quota-clearing is not working. We see:

INFO: uk.bl.wap.crawler.frontier.KafkaUrlReceiver setSheetAssociations Setting sheets for https://(com,fourfourtwo,www,)/ to [recrawl-1day] [Sat Jan 19 09:00:43 GMT 2019]
INFO: uk.bl.wap.crawler.frontier.KafkaUrlReceiver$CrawlMessageFrontierScheduler run Adding seed to crawl: https://www.fourfourtwo.com/ [Sat Jan 19 09:00:43 GMT 2019]
INFO: uk.bl.wap.crawler.frontier.KafkaUrlReceiver$CrawlMessageFrontierScheduler resetQuotas Clearing down quota stats for https://www.fourfourtwo.com/ [Sat Jan 19 09:00:43 GMT 2019]

but then the seed is immediately blocked by the server quota:

2019-01-19T09:00:43.851Z -5003          - https://www.fourfourtwo.com/ - https://www.fourfourtwo.com/ unknown #262 - - tid:65838:https://www.fourfourtwo.com/ Q:serverMaxSuccessKb {}

Add "Scope+N Hops" scoping support

It would be handy to have a scoping mechanism that lets the crawl run N hops out from the scope, ignoring redirects etc.

So, imagine we modify the outgoing links in a post-processor module, using a scope+L annotation to track how far off the original scope we are. Ideally, we could:

  1. Remove (but remember) any scope+? annotation from the outlink.
  2. Run the outlink through the scope, and see if it would get accepted.
  3. If it would not get accepted, append the current hop to the scope+ annotation.
  4. If we have a useful scope annotation, e.g. scope+L, add it to the outlink.
  5. Handle the outlink as normal.

This makes it possible to track how far off scope we are, but it only works if we ALSO add a new decide rule that uses the scope+? annotation and ACCEPTS outlinks into the frontier if they are within a configurable range (sketched below).

If re-running the scope is problematic (especially when distributed crawling means separate crawl engines -- see #13 for a related example of this problem), we can use a simpler alternative: if the outlink URL is not on the same host as the Source (i.e. the seed), we add/append the scope+? annotation. This hardcodes the scoping as host + N hops, but is probably acceptable in practice.
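
A minimal sketch of that companion decide rule, assuming the annotation is the literal string "scope+" followed by one hop letter per off-scope hop (class and property names are invented):

```java
import org.archive.modules.CrawlURI;
import org.archive.modules.deciderules.DecideResult;
import org.archive.modules.deciderules.PredicatedDecideRule;

/**
 * Hypothetical sketch: ACCEPT URIs whose scope+ annotation shows they are
 * within a configurable number of hops off the original scope.
 */
public class ScopePlusHopsDecideRule extends PredicatedDecideRule {

    private int maxHopsOffScope = 1;

    public ScopePlusHopsDecideRule() {
        setDecision(DecideResult.ACCEPT);
    }

    public void setMaxHopsOffScope(int maxHopsOffScope) {
        this.maxHopsOffScope = maxHopsOffScope;
    }

    @Override
    protected boolean evaluate(CrawlURI curi) {
        for (String annotation : curi.getAnnotations()) {
            if (annotation.startsWith("scope+")) {
                // One hop letter was appended per hop taken since leaving scope:
                int hopsOffScope = annotation.length() - "scope+".length();
                return hopsOffScope <= maxHopsOffScope;
            }
        }
        return false; // no annotation: make no decision, leave it to other rules
    }
}
```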

WARCViralWriterProcessor and revisits?

Currently we're skipping writing revisit records with the WARCViralWriterProcessor (i.e. writeRevisitForIdenticalDigests is false).

We need to verify that this kludge can actually handle revisits correctly.
