digitaldwagon / crawly_server
Distributed web crawling made easy
A way to export all the URLs the database knows of in a nice format for sharing or uploading to archive.org
A website homepage to introduce people to the project, with simple instructions on how to download and run a client, and some basic statistics.
When an item is added to the queue or rejected, it would be nice to have a "reason" why that happened: for example, discovered/manually queued, or invalid protocol/bad URL/no DNS, etc.
The queue MongoDB collection should be split into two parts:
The primary queue that the tracker reads and issues items from
A secondary "bulk" queue that can be much larger (store more items) and act as a processing queue for the tracker to run more taxing filtering on (such as #2). A sketch of this split follows below.
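One possible shape for this split, assuming a Node.js tracker on the official `mongodb` driver; the collection names (`queue`, `bulk_queue`), the batch size, and the `passesExpensiveFilters` helper are illustrative, not the project's actual design:

```typescript
import { MongoClient } from 'mongodb';

const db = new MongoClient('mongodb://localhost:27017').db('crawly');
const primary = db.collection('queue');    // small, read hot by the tracker
const bulk = db.collection('bulk_queue');  // large holding pen for unfiltered URLs

// Periodically drain a batch from the bulk queue, run the expensive
// filters, and promote survivors into the primary queue.
async function promoteBatch(batchSize = 1000): Promise<void> {
  const batch = await bulk.find().limit(batchSize).toArray();
  if (batch.length === 0) return;

  const accepted = batch.filter((item) => passesExpensiveFilters(item.url));
  if (accepted.length > 0) await primary.insertMany(accepted);
  await bulk.deleteMany({ _id: { $in: batch.map((item) => item._id) } });
}

// Stand-in for the heavier checks (DNS, bad-domain lists, dedup, ...).
function passesExpensiveFilters(url: string): boolean {
  return typeof url === 'string' && url.length > 0;
}
```

Keeping the primary queue small means the tracker's hot read path never has to scan the bulk backlog.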
The domain-name extraction check should be improved. Right now, it claims the domain of a hostname like test.example.org.uk is "org.uk" when it should be "example.org.uk". The code could also use a general cleanup.
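The "org.uk" bug is the classic public-suffix problem: the registrable domain is one label past the longest matching public suffix, not a fixed two labels. A minimal sketch of that rule follows; the tiny hand-rolled suffix set is only for illustration, and a real fix should consult the full Public Suffix List:

```typescript
// Illustrative subset; real code should load the complete Public Suffix List.
const PUBLIC_SUFFIXES = new Set(['com', 'org', 'uk', 'org.uk', 'co.uk']);

function registrableDomain(hostname: string): string | null {
  const labels = hostname.toLowerCase().split('.');
  // Scan from the longest suffix candidate down; the first match is the
  // longest public suffix, and we keep exactly one extra label before it.
  for (let i = 0; i < labels.length - 1; i++) {
    const suffix = labels.slice(i + 1).join('.');
    if (PUBLIC_SUFFIXES.has(suffix)) return labels.slice(i).join('.');
  }
  // Fallback when no suffix matches: last two labels.
  return labels.length >= 2 ? labels.slice(-2).join('.') : null;
}

console.log(registrableDomain('test.example.org.uk')); // 'example.org.uk', not 'org.uk'
```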
Information required by search engines, such as the page title, description, etc., should all be crawled and stored.
Simple API methods to get URLs the tracker knows of for exporting or searching (for example, list every URL for a domain, or every finished URL with text, etc.)
When an item is claimed, its queuedBy and queuedAt information is lost, when it shouldn't be. When an item is submitted, the claimedBy and claimedAt information is also lost. All of this metadata should survive and be passed on (see the sketch below).
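One way to avoid losing the metadata is to claim items in place with a `$set` that only adds the claim fields, instead of moving or replacing the document. A sketch, assuming the `mongodb` Node driver (v6+, where `findOneAndUpdate` resolves to the document itself); collection and field names mirror the issue text:

```typescript
import { MongoClient } from 'mongodb';

const queue = new MongoClient('mongodb://localhost:27017')
  .db('crawly')
  .collection('queue');

async function claimItem(clientId: string) {
  // $set only adds claimedBy/claimedAt; it never replaces the document,
  // so queuedBy and queuedAt survive the claim. Submission should follow
  // the same pattern so the claim fields survive in turn.
  return queue.findOneAndUpdate(
    { claimedBy: { $exists: false } },
    { $set: { claimedBy: clientId, claimedAt: new Date() } },
    { returnDocument: 'after' }
  );
}
```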
Claims older than a set period (say 24h) should be automatically re-queued so they can be reclaimed.
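A periodic sweep can implement this. The sketch below assumes claims carry a `claimedAt` Date as in the previous example, and simply unsets the claim fields so the items become claimable again:

```typescript
import { MongoClient } from 'mongodb';

const queue = new MongoClient('mongodb://localhost:27017').db('crawly').collection('queue');
const CLAIM_TTL_MS = 24 * 60 * 60 * 1000; // 24h, per the suggestion above

async function requeueStaleClaims(): Promise<void> {
  const cutoff = new Date(Date.now() - CLAIM_TTL_MS);
  // Removing the claim fields returns the items to the open pool.
  const result = await queue.updateMany(
    { claimedAt: { $lt: cutoff } },
    { $unset: { claimedBy: '', claimedAt: '' } }
  );
  console.log(`re-queued ${result.modifiedCount} stale claims`);
}

setInterval(requeueStaleClaims, 60 * 60 * 1000); // hourly sweep
```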
The data for submitted items is blindly accepted, without even checking all content is there. It should be validated to make sure everything is present before being accepted.
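Validation could be as simple as a type guard run at the API boundary before anything touches the database. The required fields below are a guess at what a finished crawl item carries, so adjust them to the real schema:

```typescript
interface SubmittedItem {
  url: string;
  status: number;
  body: string;
  claimedBy: string;
}

function validateSubmission(data: unknown): data is SubmittedItem {
  if (typeof data !== 'object' || data === null) return false;
  const d = data as Record<string, unknown>;
  return (
    typeof d.url === 'string' && d.url.length > 0 &&
    typeof d.status === 'number' &&
    typeof d.body === 'string' &&
    typeof d.claimedBy === 'string'
  );
}

// At the API boundary: reject before touching the database.
// if (!validateSubmission(req.body)) return res.status(400).json({ error: 'incomplete item' });
```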
API route to unclaim items and return them to the queue (for example, if a client needs to stop early)
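An illustrative Express route for this; the path, request shape, and ownership check are assumptions rather than the project's actual API:

```typescript
import express from 'express';
import { MongoClient } from 'mongodb';

const app = express();
app.use(express.json());
const queue = new MongoClient('mongodb://localhost:27017').db('crawly').collection('queue');

app.post('/api/unclaim', async (req, res) => {
  const { url, clientId } = req.body ?? {};
  // Only the client currently holding the claim may release it.
  const result = await queue.updateOne(
    { url, claimedBy: clientId },
    { $unset: { claimedBy: '', claimedAt: '' } }
  );
  if (result.modifiedCount === 0) {
    return res.status(404).json({ error: 'no matching claim' });
  }
  res.json({ ok: true });
});

app.listen(3000);
```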
Documentation for the API
URLs sitting in the write queue are not deduplicated against each other. The deduplicator also doesn't catch URLs that have been written but are not yet in the read cache.
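One way to close both gaps is a "pending" set that shadows the write queue: entries are added before writing and only cleared once the URL lands in the read cache. A sketch with illustrative names:

```typescript
const writeQueue: string[] = [];
const pendingWrites = new Set<string>(); // written (or queued to write) but not yet cached

function enqueueForWrite(url: string, readCache: Set<string>): void {
  if (readCache.has(url)) return;     // already known
  if (pendingWrites.has(url)) return; // in the write queue or written-but-uncached
  pendingWrites.add(url);
  writeQueue.push(url);               // duplicates can no longer enter
}

async function flush(
  writeToDb: (urls: string[]) => Promise<void>,
  readCache: Set<string>
): Promise<void> {
  const batch = writeQueue.splice(0); // take everything, emptying the queue
  await writeToDb(batch);
  // Promote to the read cache, then drop the pending markers.
  for (const url of batch) {
    readCache.add(url);
    pendingWrites.delete(url);
  }
}
```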
Right now, valid URIs that are not http(s) (such as irc) are queued, when they should be rejected.
URLs such as https://localhost/ are also accepted, when they really shouldn't be.
We should also be able to filter "bad" domains (like spam websites, or websites that have asked nicely that we NOT crawl them). A combined sketch of these three URL filters follows below.
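A combined sketch covering scheme checks, localhost, and bad domains, using the WHATWG `URL` parser that ships with Node; the blocklist source is an assumption:

```typescript
const BLOCKED_DOMAINS = new Set(['spam.example']); // spam and opt-out domains

function shouldQueue(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a valid URI at all
  }
  // Reject valid-but-uncrawlable schemes such as irc:.
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return false;
  // Reject loopback and unqualified hosts like https://localhost/.
  const host = url.hostname;
  if (host === 'localhost' || host === '127.0.0.1' || !host.includes('.')) return false;
  // Reject blocklisted domains (spam, or sites that asked not to be crawled).
  if (BLOCKED_DOMAINS.has(host)) return false;
  return true;
}
```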
Project dashboard to view crawl firehoses (like submitted and discovered URLs) and a leaderboard.
It would be nice to have "priority" queues separated by a number, so that you could de-prioritize some URLs (such as page elements) and prioritize others (like forums or important-to-crawl websites).
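A numeric `priority` field plus a sorted claim query would get most of the way there. A sketch, assuming the claim flow from the earlier example:

```typescript
import { MongoClient } from 'mongodb';

const queue = new MongoClient('mongodb://localhost:27017').db('crawly').collection('queue');

async function claimHighestPriority(clientId: string) {
  // Sort unclaimed items by priority (higher = claimed first); an index on
  // { claimedBy: 1, priority: -1 } would keep this cheap.
  return queue.findOneAndUpdate(
    { claimedBy: { $exists: false } },
    { $set: { claimedBy: clientId, claimedAt: new Date() } },
    { sort: { priority: -1 }, returnDocument: 'after' }
  );
}

// e.g. page elements could queue at priority 0, ordinary pages at 5,
// and forums or important-to-crawl websites at 9.
```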
It would be nice to check whether the domains for URLs we have in the queue have valid DNS records, so we can fail them immediately if it's impossible to crawl them. Such items should be filtered out into the rejects collection.
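A sketch of that pre-check using Node's built-in resolver; the `rejects` collection name comes from the issue text, everything else is illustrative:

```typescript
import { promises as dns } from 'node:dns';
import { MongoClient, ObjectId } from 'mongodb';

const db = new MongoClient('mongodb://localhost:27017').db('crawly');

async function failIfNoDns(item: { _id: ObjectId; url: string }): Promise<void> {
  const host = new URL(item.url).hostname;
  try {
    await dns.lookup(host); // throws (e.g. ENOTFOUND) when the domain doesn't resolve
  } catch {
    await db.collection('rejects').insertOne({ ...item, reason: 'no DNS' });
    await db.collection('queue').deleteOne({ _id: item._id });
  }
}
```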
Route to fail items that can't be retrieved but aren't response 000 (for example, bad DNS or invalid protocol)