
crawly_server's Issues

Global export function

A way to export all the URLs the database knows of in a nice format for sharing or uploading to archive.org

Homepage

Website homepage to introduce people to the project, some simple instructions on how to download and run a client, and some simple statistics.

Queue and reject reasons

When an item is added to the queue or rejected, it would be nice to record a "reason" for why that happened: for example, discovered or manually queued for queued items, and invalid protocol, bad URL, or no DNS for rejects.
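A minimal sketch of what such a record could look like. The reason labels and field names here are assumptions taken from the examples above, not the tracker's actual schema:

```python
from datetime import datetime, timezone

# Hypothetical reason labels, taken from the examples in this issue.
QUEUE_REASONS = {"discovered", "manually_queued"}
REJECT_REASONS = {"invalid_protocol", "bad_url", "no_dns"}

def make_queue_record(url: str, reason: str) -> dict:
    """Build a queue (or reject) document that records why it was added."""
    if reason not in QUEUE_REASONS | REJECT_REASONS:
        raise ValueError(f"unknown reason: {reason}")
    return {
        "url": url,
        "reason": reason,
        "addedAt": datetime.now(timezone.utc).isoformat(),
    }
```

Storing the reason alongside the item means reject statistics can later be broken down by cause.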

Bulk processing queue

The queue mongodb collection should be split into two parts:

The primary queue that the tracker reads and issues items from
A secondary "bulk" queue that can be much larger (store more items) and act as a processing queue on which the tracker can run more taxing filtering (such as #2).
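The two-queue split above could be sketched like this. It is an in-memory illustration only (the real queues would be MongoDB collections), and the class and method names are made up for the example:

```python
from collections import deque

class TwoTierQueue:
    """Sketch of a primary issue queue plus a larger bulk queue.

    Items land in the bulk queue first; a background pass runs the
    expensive filters and promotes survivors to the primary queue,
    which is the only one the tracker issues items from.
    """
    def __init__(self, filter_fn):
        self.bulk = deque()      # large, unfiltered backlog
        self.primary = deque()   # small, issue-ready queue
        self.filter_fn = filter_fn

    def submit(self, item):
        self.bulk.append(item)

    def process_bulk(self, batch_size=100):
        """Run the taxing filter over a batch and promote survivors."""
        for _ in range(min(batch_size, len(self.bulk))):
            item = self.bulk.popleft()
            if self.filter_fn(item):
                self.primary.append(item)

    def claim(self):
        return self.primary.popleft() if self.primary else None
```

Because filtering runs in batches off the hot path, the tracker's issue endpoint never pays the filtering cost itself.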

Improved domain name checking

The get-domain-name check should be improved. Right now, it claims the domain name of a site like test.example.org.uk is "org.uk" when it should be "example.org.uk". The code would also benefit from a general cleanup.
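The usual fix is longest-match lookup against the Public Suffix List, then keeping one extra label. A minimal sketch with a tiny, illustrative suffix set (a real implementation would use the full list, e.g. via a library such as tldextract):

```python
# Tiny, illustrative subset of the Public Suffix List.
PUBLIC_SUFFIXES = {"uk", "org.uk", "co.uk", "com", "org"}

def registrable_domain(hostname: str) -> str:
    """Return the registrable domain: longest public suffix plus one label."""
    labels = hostname.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in PUBLIC_SUFFIXES:
            if i == 0:
                return candidate  # the hostname itself is a public suffix
            return ".".join(labels[i - 1:])
    return hostname  # no known suffix; fall back to the whole name
```

With this, test.example.org.uk matches the suffix "org.uk" and correctly yields "example.org.uk" instead of "org.uk".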

Index get urls API methods

Simple API methods to get URLs the tracker knows of, for exporting or searching (for example, list every URL for a domain, or every finished URL with its text, etc.)

Pass queuedBy and queuedAt to claims and done

When an item is claimed, its queuedBy and queuedAt information is lost, when it shouldn't be. Likewise, when an item is submitted, its claimedBy and claimedAt information is lost. All of this information should survive each transition and be passed on.
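The fix amounts to merging new fields onto the existing document at each transition instead of building a fresh one. A sketch with assumed field names (the real documents live in MongoDB):

```python
def claim_item(queued: dict, worker: str, now: str) -> dict:
    """Move a queue document to claims without dropping queuedBy/queuedAt."""
    return {**queued, "claimedBy": worker, "claimedAt": now}

def finish_item(claimed: dict, data: dict, now: str) -> dict:
    """Move a claim to done, preserving both queue and claim provenance."""
    return {**claimed, "data": data, "doneAt": now}
```

Because each step spreads the previous document into the next, every done item still records who queued it, who claimed it, and when.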

Auto-unclaiming

Claims older than a set period (say 24h) should be automatically re-queued so they can be reclaimed.
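A sketch of the re-queueing pass, using the 24h cutoff suggested above. In the real tracker this would likely be a periodic MongoDB query on claimedAt rather than an in-memory scan:

```python
from datetime import datetime, timedelta, timezone

CLAIM_TTL = timedelta(hours=24)  # the cutoff suggested in this issue

def expire_stale_claims(claims, queue, now=None):
    """Return the still-fresh claims; move stale ones back onto the queue,
    stripping the claim fields so the item can be claimed again."""
    now = now or datetime.now(timezone.utc)
    fresh = []
    for claim in claims:
        if now - claim["claimedAt"] > CLAIM_TTL:
            requeued = {k: v for k, v in claim.items()
                        if k not in ("claimedBy", "claimedAt")}
            queue.append(requeued)
        else:
            fresh.append(claim)
    return fresh
```

Dropping claimedBy/claimedAt on re-queue keeps the stale worker from being blamed for the item's next attempt.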

Submitted item validation

The data for submitted items is blindly accepted, without even checking that all required content is present. Submissions should be validated to make sure everything is there before being accepted.
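A minimal shape for such a check. The required fields here are hypothetical, since the issue doesn't spell out the submission schema:

```python
# Hypothetical required fields for a submitted item.
REQUIRED_FIELDS = {"url", "status", "discoveredUrls"}

def validate_submission(item: dict) -> list:
    """Return a list of problems; an empty list means the item is acceptable."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS
                if f not in item]
    if "discoveredUrls" in item and not isinstance(item["discoveredUrls"], list):
        problems.append("discoveredUrls must be a list")
    return problems
```

Returning all problems at once (rather than failing on the first) gives clients a complete picture of what to fix.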

Unclaim items API route

API route to unclaim items and return them to the queue (for example, if a client needs to stop early).

Better item filtering

Right now, valid URIs that are not http(s) (such as irc) are queued when they should be rejected.
URLs such as https://localhost/ are also accepted when they really shouldn't be.
We should also be able to filter "bad" domains (like spam websites, or websites that have asked that we not crawl them).
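The three cases above could be handled by one filter that returns an accept/reject decision plus a reason. The blocklist contents and reason labels here are illustrative:

```python
from urllib.parse import urlparse

# Illustrative blocklist; the real one would live in the database.
BAD_DOMAINS = {"spam.example"}

def filter_url(url: str):
    """Return (accepted, reason) for a candidate URL."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False, "invalid_protocol"     # e.g. irc://
    host = (parsed.hostname or "").lower()
    if not host or host == "localhost" or "." not in host:
        return False, "bad_host"             # e.g. https://localhost/
    if host in BAD_DOMAINS:
        return False, "blocked_domain"       # spam / opted-out sites
    return True, None
```

The reason string also feeds naturally into the queue/reject reasons proposed in the earlier issue.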

Dashboard

Project dashboard to view crawl firehoses (like submitted and discovered URLs) and a leaderboard.

Queue priorities

It would be nice to have "priority" queues separated by a numeric level, so that some URLs (such as page elements) can be de-prioritized and others (like forums or important-to-crawl websites) prioritized.
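An in-memory sketch of numeric priority levels, assuming the convention that a lower number is issued first (the issue doesn't fix a direction). An insertion counter keeps ordering FIFO within a level:

```python
import heapq
import itertools

class PriorityQueue:
    """Numeric priority levels: lower number = issued first (assumed
    convention). FIFO within a level via an insertion counter."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, url: str, priority: int = 5):
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```

In MongoDB terms the equivalent would be a priority field on queue documents with the tracker issuing items sorted by it.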

Check for valid DNS on queue submission

It would be nice to check whether the domains of URLs in the queue have valid DNS records, so we can fail them immediately if crawling them is impossible. Such items should be filtered out to the rejects collection.
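The check itself is a single resolution attempt. A sketch with an injectable resolver, so the logic can be exercised (or mocked) without live DNS:

```python
import socket

def has_dns(hostname: str, resolver=socket.getaddrinfo) -> bool:
    """True if the hostname resolves to any address. The resolver is
    injectable so the check can be tested without network access."""
    try:
        resolver(hostname, None)
        return True
    except OSError:  # socket.gaierror subclasses OSError
        return False
```

Items failing this check would be moved to the rejects collection, ideally with a no-DNS reason attached as proposed in the reasons issue.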
