Akka web crawler

A simple web crawler built with Akka and Akka Streams. It uses the websocket protocol to provide better interactivity.

Considerations

Crawling a website can take a long time. Seeing a loader for many minutes without any feedback didn't seem like a good idea. Thus I considered that interactivity was important. Every page in a domain that is crawled is sent to the user immediately. When crawling is finished, the server sends a successful event.

The user is able to stop and request to crawl a new url. Each request will trigger a new worker (actor) that will crawl that specific url.

The Akka toolkit provides an abstraction to architect your program following the actor model. Actors are single threaded. They communicate with each other via message passing. One actor sends messages to another actor's mailbox and that target actor will only read the message when its current message is processed. In our case, it means that a worker will perform its crawling until successfully finished or until it encounters an error, but that computation cannot be interrupted by another actor's message. Once they finish their computation, they will cache their results in an in-memory database. We use Redis for that matter.

In case a user stops and starts crawling a domain that is already being crawled, the Akka supervisor will not create a new worker to run the same computation but will start listening to messages from the existing worker again.

Run it locally

You'll need to have Scala and SBT (Simple Build Tool) installed in order to run the project. It also needs to have a Redis server running locally, listening on port 6379 unless specified otherwise.

Once installed, run:

sbt run

Run tests

Run sbt test

It focuses on:

Interactivity: A user that requests a valid url gets its sitemap eventually while receiving each crawled page in the meantime.
Start/stop crawling: It makes sure that there is at most one worker sending messages to the client, that there are never two workers running the same computation at the same time, etc
Caching: It doesn't compute the same expensive computation twice

No mocking was done to query urls, so internet is required to pass these tests. A redis server is also required to be running locally.

acentelles / akka-web-crawler Goto Github PK

akka-web-crawler's Introduction

Akka web crawler

Considerations

Run it locally

Run tests

akka-web-crawler's People

Contributors

Stargazers

Watchers

Forkers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent