digitaldwagon / crawly_server
Distributed web crawling made easy
A way to export all the URLs the database knows of in a nice format for sharing or uploading to archive.org
A website homepage to introduce people to the project, with simple instructions on how to download and run a client, and some basic statistics.
When an item is added to the queue or rejected, it would be nice to have a "reason" why that happened: for example, discovered/manually queued, or invalid protocol/bad URL/no DNS, etc.
The queue MongoDB collection should be split into two parts:
The primary queue that the tracker reads and issues items from
A secondary "bulk" queue that can be much larger (store more items) and act as a processing queue for the tracker to run more taxing filtering on (such as #2). A sketch of this split follows below.
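One possible shape for this split, assuming a Node.js tracker on the official `mongodb` driver; the collection names (`queue`, `bulk_queue`), the batch size, and the `passesExpensiveFilters` helper are illustrative, not the project's actual design:

```typescript
import { MongoClient } from 'mongodb';

const db = new MongoClient('mongodb://localhost:27017').db('crawly');
const primary = db.collection('queue');    // small, read hot by the tracker
const bulk = db.collection('bulk_queue');  // large holding pen for unfiltered URLs

// Periodically drain a batch from the bulk queue, run the expensive
// filters, and promote survivors into the primary queue.
async function promoteBatch(batchSize = 1000): Promise<void> {
  const batch = await bulk.find().limit(batchSize).toArray();
  if (batch.length === 0) return;

  const accepted = batch.filter((item) => passesExpensiveFilters(item.url));
  if (accepted.length > 0) await primary.insertMany(accepted);
  await bulk.deleteMany({ _id: { $in: batch.map((item) => item._id) } });
}

// Stand-in for the heavier checks (DNS, bad-domain lists, dedup, ...).
function passesExpensiveFilters(url: string): boolean {
  return typeof url === 'string' && url.length > 0;
}
```

Keeping the primary queue small means the tracker's hot read path never has to scan the bulk backlog.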
The domain-name extraction check should be improved. Right now, it claims the domain of a hostname like test.example.org.uk is "org.uk" when it should be "example.org.uk". The code could also use a general cleanup.
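The "org.uk" bug is the classic public-suffix problem: the registrable domain is one label past the longest matching public suffix, not a fixed two labels. A minimal sketch of that rule follows; the tiny hand-rolled suffix set is only for illustration, and a real fix should consult the full Public Suffix List:

```typescript
// Illustrative subset; real code should load the complete Public Suffix List.
const PUBLIC_SUFFIXES = new Set(['com', 'org', 'uk', 'org.uk', 'co.uk']);

function registrableDomain(hostname: string): string | null {
  const labels = hostname.toLowerCase().split('.');
  // Scan from the longest suffix candidate down; the first match is the
  // longest public suffix, and we keep exactly one extra label before it.
  for (let i = 0; i < labels.length - 1; i++) {
    const suffix = labels.slice(i + 1).join('.');
    if (PUBLIC_SUFFIXES.has(suffix)) return labels.slice(i).join('.');
  }
  // Fallback when no suffix matches: last two labels.
  return labels.length >= 2 ? labels.slice(-2).join('.') : null;
}

console.log(registrableDomain('test.example.org.uk')); // 'example.org.uk', not 'org.uk'
```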
Information required by search engines, such as the page title, description, etc., should all be crawled and stored.
Simple API methods to get URLs the tracker knows of for exporting or searching (for example, list every URL for a domain, or every finished URL with text, etc.)
When an item is claimed, its queuedBy and queuedAt information is lost, when it shouldn't be. When an item is submitted, the claimedBy and claimedAt information is also lost. All of this metadata should survive and be passed on (see the sketch below).
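One way to avoid losing the metadata is to claim items in place with a `$set` that only adds the claim fields, instead of moving or replacing the document. A sketch, assuming the `mongodb` Node driver (v6+, where `findOneAndUpdate` resolves to the document itself); collection and field names mirror the issue text:

```typescript
import { MongoClient } from 'mongodb';

const queue = new MongoClient('mongodb://localhost:27017')
  .db('crawly')
  .collection('queue');

async function claimItem(clientId: string) {
  // $set only adds claimedBy/claimedAt; it never replaces the document,
  // so queuedBy and queuedAt survive the claim. Submission should follow
  // the same pattern so the claim fields survive in turn.
  return queue.findOneAndUpdate(
    { claimedBy: { $exists: false } },
    { $set: { claimedBy: clientId, claimedAt: new Date() } },
    { returnDocument: 'after' }
  );
}
```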
Claims older than a set period (say 24h) should be automatically re-queued so they can be reclaimed.
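A periodic sweep can implement this. The sketch below assumes claims carry a `claimedAt` Date as in the previous example, and simply unsets the claim fields so the items become claimable again:

```typescript
import { MongoClient } from 'mongodb';

const queue = new MongoClient('mongodb://localhost:27017').db('crawly').collection('queue');
const CLAIM_TTL_MS = 24 * 60 * 60 * 1000; // 24h, per the suggestion above

async function requeueStaleClaims(): Promise<void> {
  const cutoff = new Date(Date.now() - CLAIM_TTL_MS);
  // Removing the claim fields returns the items to the open pool.
  const result = await queue.updateMany(
    { claimedAt: { $lt: cutoff } },
    { $unset: { claimedBy: '', claimedAt: '' } }
  );
  console.log(`re-queued ${result.modifiedCount} stale claims`);
}

setInterval(requeueStaleClaims, 60 * 60 * 1000); // hourly sweep
```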
The data for submitted items is blindly accepted, without even checking all content is there. It should be validated to make sure everything is present before being accepted.
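Validation could be as simple as a type guard run at the API boundary before anything touches the database. The required fields below are a guess at what a finished crawl item carries, so adjust them to the real schema:

```typescript
interface SubmittedItem {
  url: string;
  status: number;
  body: string;
  claimedBy: string;
}

function validateSubmission(data: unknown): data is SubmittedItem {
  if (typeof data !== 'object' || data === null) return false;
  const d = data as Record<string, unknown>;
  return (
    typeof d.url === 'string' && d.url.length > 0 &&
    typeof d.status === 'number' &&
    typeof d.body === 'string' &&
    typeof d.claimedBy === 'string'
  );
}

// At the API boundary: reject before touching the database.
// if (!validateSubmission(req.body)) return res.status(400).json({ error: 'incomplete item' });
```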
API route to unclaim items and return them to the queue (for example, if a client needs to stop early)
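An illustrative Express route for this; the path, request shape, and ownership check are assumptions rather than the project's actual API:

```typescript
import express from 'express';
import { MongoClient } from 'mongodb';

const app = express();
app.use(express.json());
const queue = new MongoClient('mongodb://localhost:27017').db('crawly').collection('queue');

app.post('/api/unclaim', async (req, res) => {
  const { url, clientId } = req.body ?? {};
  // Only the client currently holding the claim may release it.
  const result = await queue.updateOne(
    { url, claimedBy: clientId },
    { $unset: { claimedBy: '', claimedAt: '' } }
  );
  if (result.modifiedCount === 0) {
    return res.status(404).json({ error: 'no matching claim' });
  }
  res.json({ ok: true });
});

app.listen(3000);
```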
Documentation for the API
URLs sitting in the write queue are not deduplicated against each other. The deduplicator also doesn't catch URLs that have been written but are not yet in the read cache.
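One way to close both gaps is a "pending" set that shadows the write queue: entries are added before writing and only cleared once the URL lands in the read cache. A sketch with illustrative names:

```typescript
const writeQueue: string[] = [];
const pendingWrites = new Set<string>(); // written (or queued to write) but not yet cached

function enqueueForWrite(url: string, readCache: Set<string>): void {
  if (readCache.has(url)) return;     // already known
  if (pendingWrites.has(url)) return; // in the write queue or written-but-uncached
  pendingWrites.add(url);
  writeQueue.push(url);               // duplicates can no longer enter
}

async function flush(
  writeToDb: (urls: string[]) => Promise<void>,
  readCache: Set<string>
): Promise<void> {
  const batch = writeQueue.splice(0); // take everything, emptying the queue
  await writeToDb(batch);
  // Promote to the read cache, then drop the pending markers.
  for (const url of batch) {
    readCache.add(url);
    pendingWrites.delete(url);
  }
}
```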
Right now, valid URIs that are not http(s) (such as irc) are queued, when they should be rejected.
URLs such as https://localhost/ are also accepted, when they really shouldn't be.
We should also be able to filter "bad" domains (like spam websites, or websites that have asked nicely that we NOT crawl them). A combined sketch of these three URL filters follows below.
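A combined sketch covering scheme checks, localhost, and bad domains, using the WHATWG `URL` parser that ships with Node; the blocklist source is an assumption:

```typescript
const BLOCKED_DOMAINS = new Set(['spam.example']); // spam and opt-out domains

function shouldQueue(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a valid URI at all
  }
  // Reject valid-but-uncrawlable schemes such as irc:.
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return false;
  // Reject loopback and unqualified hosts like https://localhost/.
  const host = url.hostname;
  if (host === 'localhost' || host === '127.0.0.1' || !host.includes('.')) return false;
  // Reject blocklisted domains (spam, or sites that asked not to be crawled).
  if (BLOCKED_DOMAINS.has(host)) return false;
  return true;
}
```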
Project dashboard to view crawl firehoses (like submitted and discovered URLs) and a leaderboard.
It would be nice to have "priority" queues separated by a number, so that you could de-prioritize some URLs (such as page elements) and prioritize others (like forums or important-to-crawl websites).
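A numeric `priority` field plus a sorted claim query would get most of the way there. A sketch, assuming the claim flow from the earlier example:

```typescript
import { MongoClient } from 'mongodb';

const queue = new MongoClient('mongodb://localhost:27017').db('crawly').collection('queue');

async function claimHighestPriority(clientId: string) {
  // Sort unclaimed items by priority (higher = claimed first); an index on
  // { claimedBy: 1, priority: -1 } would keep this cheap.
  return queue.findOneAndUpdate(
    { claimedBy: { $exists: false } },
    { $set: { claimedBy: clientId, claimedAt: new Date() } },
    { sort: { priority: -1 }, returnDocument: 'after' }
  );
}

// e.g. page elements could queue at priority 0, ordinary pages at 5,
// and forums or important-to-crawl websites at 9.
```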
It would be nice to check whether the domains for URLs we have in the queue have valid DNS records, so we can fail them immediately if it's impossible to crawl them. Such items should be filtered out into the rejects collection.
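A sketch of that pre-check using Node's built-in resolver; the `rejects` collection name comes from the issue text, everything else is illustrative:

```typescript
import { promises as dns } from 'node:dns';
import { MongoClient, ObjectId } from 'mongodb';

const db = new MongoClient('mongodb://localhost:27017').db('crawly');

async function failIfNoDns(item: { _id: ObjectId; url: string }): Promise<void> {
  const host = new URL(item.url).hostname;
  try {
    await dns.lookup(host); // throws (e.g. ENOTFOUND) when the domain doesn't resolve
  } catch {
    await db.collection('rejects').insertOne({ ...item, reason: 'no DNS' });
    await db.collection('queue').deleteOne({ _id: item._id });
  }
}
```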
Route to fail items that can't be retrieved but aren't response 000 (for example, bad DNS or invalid protocol)