let4be / crusty Goto Github PK

View Code? Open in Web Editor NEW

83.0 2.0 3.0 2.97 MB

Broad Web Crawler

License: GNU General Public License v3.0

Dockerfile 2.37% Shell 3.14% Rust 94.48%

crusty's Introduction

crusty

fast, scalable && polite broad web crawler written in Rust

Introduction

Broad web crawling is an activity of going through practically boundless web by starting from a set of locations(urls) and following outgoing links. Usually it doesn't matter where you start from as long as it has outgoing links to external domains.

It presents unique set of challenges one must overcome to get a stable, scalable && polite system, Crusty is an attempt to tackle on some of those challenges to see what's out here while having fun with Rust ;)

This particular implementation could be used to quickly fetch a subset of all observable internet && discover most popular domains

The whole system is designed to be minimalistic in nature and does not bring any heavy artillery(like graph database or distributed file system). If needed, this can be connected externally. Also, internally some trade-offs have been made(see tradeoffs/limitations section).

Built on top of crusty-core which handles all low-level aspects of web crawling

Key features

Configurability && extensibility

see a typical config file with some explanations regarding available options. Crusty also supports a flexible profile system allowing to override any setting in default config file. Just set environment variable CRUSTY_PROFILE to the name of ./conf/profile-[*].yaml config with specialized settings
Blazing fast single node performance (~10 gbit/s on 48 core(96HT) c5.metal)
- crusty is written in Rust on top of green threads running on tokio, so it can achieve quite impressive single-node performance while staying 100% scalable
- we parse HTML using LoL HTML - very fast speed, works in tight memory boundaries and is easy to use
- running at light speed you will need to think twice what to do with such amount of data ;)
Stable performance and predictable resource consumption
- Crusty has small, stable and predictable memory footprint and is usually cpu/network bound. There is no GC pressure and no war over memory.
- built on top of buffered Flume channels - which helps to build system with predictable performance && scalability. Peak loads are getting buffered, continuous over-band loads lead to producer backoff.
Scalability - both vertically && horizontally
- per-node domain concurrency(N of parallel jobs) setting allows to saturate resources of given hardware
- each Crusty node is essentially an independent unit which we can run hundreds of in parallel(on different machines of course), the tricky part is job delegation and domain discovery which is solved by a high performance sharded queue-like structure built on top of redis.
- we leverage redis low-latency and use carefully picked up data structures along with careful memory management to achieve our goals
- all domains are sharded based on addr_key, where addr_key = First Lexicographically going IP && netmask, so it's possible to "compress" a given IP address and store all similar addresses under the same slot
- this is used to avoid bombing same IP(or subnet if so desired) by concurrent requests
- each domain belongs to some shard(crc32(addr_key) % number_of_shards), now each Crusty instance can read/update from a subset of all those shards while can insert to all of them(so-called domain discovery). Shards can be distributed across many redis instances if needed.
- with careful planning and minimal changes this system should scale from a single core instance to a 100+ instances(with total of tens of thousands of cores) which can easily consume your whole DC, or several if you are persistent... ;)
Smart Queue implemented on top of Redis Modules system allows to:
- ensure we check only one domain with same addr_key(no DDOS!)
- ensure high queue throughput
- ensure high availability(pre-sharded, if some segments become temporary unavailable system will work with others)
- ensure high scalability(pre-sharded, move shards to other machines if there's not enough CPU or more reliability desired). Though the fact is a single Smart Queue can easily handle 5+ Crusty running on top hardware(96 cores monsters with 25gbit channels), so I would be quite curious to see a use case where you need to move out redis queue shards to dedicated machines(except for reliability)
True politeness
- while we can crawl tens of thousands of domains in parallel - we should absolutely limit concurrency on per-domain level to avoid any stress to crawled sites, see default_crawler_settings.concurrency.
- each domain is first resolved and then mapped to addr_key, Redis Queue makes sure we -never- process several domains with the same addr_key thus effectively limiting concurrency on per IP/Subnet basis
- it's a good practice to introduce delays between visiting pages, see default_crawler_settings.delay.
- robots.txt filtering is fully supported(using Google's implementation ported to rust, no custom delays or sitemap yet)
- global IP filtering - you can easily restrict which IP blocks or maybe even countries(if you bring a good IP->Country mapper) you wish to visit(just hack on top of existing DNS resolver when configuring crusty-core)
Observability

Crusty uses tracing and stores multiple metrics in clickhouse that we can observe with grafana - giving a real-time insight in crawling performance

Getting started

one liner for debian/ubuntu to be used on clean, docker-less system

curl -fsSL https://raw.githubusercontent.com/let4be/crusty/master/infra/lazy.sh | sudo bash -s && cd crusty

Now be aware this one liner will do most of the job for you and ask you some useful questions along the way to help you configure your machine, BUT it requires ROOT access. You can do all of this manually, just study the script. I was just so bored with doing it over and over again in my tests I wrote the script...

alternatively follow instructions at

https://docs.docker.com/get-docker/

https://docs.docker.com/compose/install/

then clone this repository && configure your machine manually(study the script!) and don't forget /etc/sysctl.conf && configure crusty

play with it ( don't forget to change example.com to some valid domain with outgoing links not protected by robots.txt! )

CRUSTY_SEEDS=https://example.com docker-compose up -d --build

optionally set CRUSTY_PROFILE env variable to override some of default settings, for example

CRUSTY_PROFILE=c5.metal CRUSTY_SEEDS=https://example.com docker-compose up -d --build

see Crusty live at http://localhost:3000/d/crusty-dashboard/crusty?orgId=1&refresh=5s
to stop background run and erase crawling data(redis/clickhouse/grafana) docker-compose down -v

additionally

study config file and adapt to your needs, there are sensible defaults for a 100mbit channel, if you have more/less bandwidth/cpu you might need to adjust concurrency_profile
to stop background run and retain crawling data docker-compose down
to run && attach and see live logs from all containers (can abort with ctrl+c) CRUSTY_SEEDS=https://example.com docker-compose up
to see running containers docker ps(crusty-grafana, crusty-clickhouse, crusty-redis, crusty and optionally crusty-unbound)
to see logs: docker logs crusty

if you decide to build manually via cargo build, remember - release build is a lot faster(and default is debug)

In the real world usage scenario on high bandwidth channel docker might become a bit too expensive, so it might be a good idea either to run directly or at least in network_mode = host

External service dependencies

redis - smart queue(custom redis module) && top-k(custom redis module), both are using an excellent RedisBloom module
clickhouse - metrics
grafana - dashboard for metrics stored in clickhouse
unbound(optional) - run your own caching/recursive DNS resolver server(for heavy-duty setups you most likely have to)

just use docker-compose, it's the recommended way to play with Crusty

DB model is available in this sql

Grafana dashboard is exported as json model

Tradeoffs / limitations

Crusty not only writes top-k in redis but also attempts to bring aggregated data into clickhouse regularly, the whole thing implemented in a way that does not create issues with concurrency(SET NX, i.e. some running Crusty instance will pick aggregated data up) - it's perfectly fine to run multiple Crusty instances
top-k functionality isn't sharded, so there might be a point where it becomes a bottleneck(setting higher buffering on Crusty and optionally discarding all low profile hits might help though)
while Crusty is blazing fast you probably would like to do something with your 10 gbit/sec data stream, this is where the real struggle begins ;)
being minimalistic in design Crusty does not bring any heavy artillery(like a recommended Graph Database, or a recommended way to aggregate / persist collected data) - so bring your own

Development

make sure rustup is installed: https://rustup.rs/
make sure pre-commit is installed: https://pre-commit.com/
run ./go setup
run ./go check to run all pre-commit hooks and ensure everything is ready to go for git
run ./go release minor to release a next minor version for crates.io

Contributing

I'm open to discussions/contributions, - use github issues,

pull requests are welcomed

crusty's People

Contributors

Stargazers

Watchers

Forkers

isgasho busizshen chofoo

crusty's Issues

Implement a first approximation of PageRank for Domains

Right now this broad crawler is completely emtpy, I think it would be cool if we had something to show off ;)
A good candidate for such task could be page rank...

Now calculating URL page rank is a whole mega-task in it's own, proper implementation of which(that scales) could take months because of requirements on throughput, memory, speed, scalability.
Such system most likely needs a sophisticated URL -> ID mapping

On the other hand we could easily calculate Domain PageRank Ad hoc,

collect all outbound domains for a given Job
convert Job's Domain into a Second Level Domain (super-blog.tumblr.com -> tumblr.com)
convert all outbound domains into unique Second Level Domains as well
store all this in RedisGraph(this will work because there's only very limited N of second level domains and RedisGraph uses sparse matrixes)

https://oss.redislabs.com/redisgraph/
RedisGraph/RedisGraph#398

Depending on the underlying hardware results may vary. However, inserting a new relationship is done in O(1). RedisGraph is able to create over 1 million nodes under half a second and form 500K relations within 0.3 of a second.

RedisGraph has PageRank built-in

Rework metrics for internal channels

Those should be registred dynamically right on channel creation

Provide example configurations for various setups

This could better explain what the primary scaling points are,
config for aws's c5.metal would be quite different from t2.micro

Consider migrating to actix

It seems like it could significantly simplify the code

Investigate possible deadlock/hanging of clickhouse writer

Sometimes when testing on aws c5.metal I see that writers "hang" which leads to pile of unprocessed messages in buffers(particularly metrics_task as the heaviest clickhouse hitter)

Think about the best way to support ipv6

Make DNS resolver configurable

Evaluate docker overlay network performance influence on high volume setups

Could be cool to keep network inside docker overlay, but i'm concerned about performance especially on high end setups

Probably it should be default configuration anyway just to ensure everyone can try Crusty no matter open ports on the system...

Consider adding bind9 to docker-compose

High volume setups most likely will need local recursive dns server

Improve domain heuristics at domain_filter_map

It seems to capture way too much trash right now
Should dequeue handler update ddc to prevent same domain taking space in resolver's q?...

Investigate why crawling fails to start from some websites

Tried to start crawling from https://cnn.com and nothing ...

curl script for super-fast start

Need to convert existing ./infra/lazy.sh and ./infra/net.sh into one script that can be called right from curl
curl -fsSL https://raw.githubusercontent.com/let4be/crusty/master/infra/lazy.sh | bash -s
curl is readily available everywhere and the script could pull such stuff as git, bmon, htop, etc...

Error in redis dockerfile

Hello, I hope you are well. I've been trying to run the project locally on my Docker, but I always get an error in the redis dockerfile. I updated some things in the docker file, but the problem still persists:

FROM redis

# Update and install necessary packages
RUN apt-get update && apt-get -y install git build-essential cmake

# Create a symlink for python3 to ensure it is recognized

# Create app directory and clone RedisBloom
RUN mkdir /app && cd /app && \
    git clone https://github.com/RedisBloom/RedisBloom && \
    cd RedisBloom && \
    git submodule update --init --recursive && \
    ./sbin/setup \
    bash -l \
    make

# Copy configuration and modules
COPY redis.conf /usr/local/etc/redis/redis.conf
COPY --from=crusty_crusty:latest /usr/local/lib/libredis_queue.so /app
COPY --from=crusty_crusty:latest /usr/local/lib/libredis_calc.so /app

# Expose port
EXPOSE 6379/tcp
CMD [ "redis-server", "/usr/local/etc/redis/redis.conf", "--loadmodule /app/libredis_queue.so", "--loadmodule /app/libredis_calc.so", "--loadmodule /app/RedisBloom/redisbloom.so" ]

error that is happening:

2023-11-27 16:51:55 1:M 27 Nov 2023 19:51:55.351 * Module 'crusty.queue' loaded from /app/libredis_queue.so
2023-11-27 16:51:55 1:M 27 Nov 2023 19:51:55.351 * Module 'crusty.calc' loaded from /app/libredis_calc.so
2023-11-27 16:51:55 1:M 27 Nov 2023 19:51:55.351 # Module /app/RedisBloom/redisbloom.so failed to load: /app/RedisBloom/redisbloom.so: cannot open shared object file: No such file or directory
2023-11-27 16:51:55 1:M 27 Nov 2023 19:51:55.351 # Can't load module from /app/RedisBloom/redisbloom.so: server aborting

Make crawling rules configurable via config

Stuff like page limit, max depth, skip no follow links, etc, etc, etc...

Simple Job metrics

Duration
Termination status - ok/soft timeout/hard timeout

Lolhtml check if we can remove elements and if this saves some cpu cycles

Right now it does some "wasteful serialization"which we just throw away
Yet the lib is so damn fast it doesn't matter...

ideally we would like to completely disable HTML rewriting functionality, but I don't think it's currently possible, see cloudflare/lol-html#91

Investigate how we track errors and what is considered an error in grafana dashboard

Right now it does not make any sense whatsoever...
blocked by

let4be/crusty-core#9
let4be/crusty-core#8

Add some other important buffer len graphs to dashboard

domain resolver
metrics q

Glitchy buffers panel in grafana dashboard

It uses dynamic pulling of all available buffers - displays labels wrong and cannot calc max(outputs Trillions)
it's either a grafana bug or I did something wrong :\

Ensure proper stopping sequence on SIGTERM

This used to work properly

Review how we access DNS resolved addresses

Right now we do not precisely control which address hyper will use when connecting but we assume it's the first one(and concurrency restrictions being applied accordingly, which may backfire)

Check channel buffer sizes

Some clearly weren't selected properly...
In some places we send vectors which doesn't play nice with buffers(not what was implied)

Unify clickhouse writing operations

Concurrent writing to clickhouse

It's essential we scale this part as well...
Right now we write from a single green thread, though we use buffering and write in configurable chunks

On high volume of traffic the metrics writing part may backoff the whole system

Queue sharding support

It's partially here, but need to add

routing to proper shard based on addr_key
spawn a green thread for each owned shard(shard_min .. shard_max)
glue it all together and test

Migrate job management system to Redis

While current "queue-like system" on top of clickhouse worked quite well for testing it's no near as good as required for any serious high-volume use

Recently I did some testing on a beefy AWS hardware and fixed some internal bottlenecks(not yet merged) and in some testing scenarios where I could temporary alleviate the last left bottleneck - job distribution(writing new/updating completed/selecting), Crusty was capable of doing over 900MiB/sec - a whooping 7+gbit/sec! on 48 core(96 logical) c5.metal with a 25gbit/s port

New job queue should be solely redis-based using redis modules: https://redis.io/topics/modules-intro
rust has good enough library to allow writing redis module logic: https://github.com/RedisLabsModules/redismodule-rs

We will use pre-sharded queue(based on addr_key)

Atomic operations:

Enqueue jobs
Dequeue jobs
Finish jobs

using correct underlying data types(mostly sets and bloom filter for history) + batching and pipelining we can have solid throughput, low cpu usage per redis node, decent reliability and scalability
careful expiration could help to avoid memory overflow on redis node - we always discover domains faster than we can process them

Add WARC support

seems like https://github.com/jedireza/warc could help

Review config defaults

Consider completely removing config defaults from code where possible

We already include_str! default config right into our code and parse it - we can take most of defaults from there.

Config is split between Crusty and crusty-corethough, and the latter has no idea nor should it assume anything about configuration system. So we should keep config defaults in crusty-core

Implement faster HTML parsing

As soon as crusty-core fully supports custom html processing I'd like to experiment a bit and find a faster way to extract links(and probably some meta data) from HTML

we don't need anything complex when doing broad web crawling so it should be possible to speed this up a lot(right now we do full DOM parsing)

Extracting links/title/meta should be easy to do with a simple tokenizer, like in https://docs.rs/html5ever/0.25.1/html5ever/tokenizer/index.html

Domain discovery: replace ttl cache with lru

Concurrency auto-tuning

Figure a way to auto-tune domain concurrency(there is a ~perfect N based on CPU and network bandwidth available)
Will need some kind of graceful adaptive algo which will look at metrics(tx/rx, error rates) and determine the N

lazy.sh: allow branch option

Attaching a database

Hello, let4be

First of all I want to say it is really impressive what you have built, I am really amazed, so congratulations. Furthermore, I see that you wrote in the README file that one could attach a graph database to save the crawled data, but I can't quite understand how to do it and how would it fit in the dataflow, because I understand that crusty already saves the crawled data in some database.

I am interested in broad crawling, particularly with Rust, because I've been working on a peer2peer search engine, and thus I need a low-resource broad crawler. I have a (untidy) Python prototype which I would like to convert to Rust.

I would greatly appreciate if you could help me with this, so I could solve this problem for the search engine project.
Thank you very much in advance. Kind regards.

Improve robots.txt support

We probably should always try to download robots.txt before we access index page... and if it gets resolved with 4xx or 5xx codes we should act accordingly and follow google best practices https://developers.google.com/search/docs/advanced/robots/robots_txt

right now we download / && /robots.txt in parallel and external links from / most likely will be added to the Q(not internal though)