
bathyscaphe's Introduction

Bathyscaphe dark web crawler


Bathyscaphe is a fast, highly configurable, cloud-native dark web crawler written in Go.

How to start the crawler

To start the crawler, just execute the following command:

$ ./scripts/docker/start.sh

and wait for all containers to start.

Notes

  • You can start the crawler in detached mode by passing --detach to start.sh.
  • Ensure you have at least 3 GB of memory available, as the Elasticsearch container alone requires 2 GB.

How to initiate crawling

You can use the RabbitMQ dashboard available at localhost:15003 and publish a new JSON object to the crawlingQueue.

The object should look like this:

{
  "url": "https://facebookcorewwwi.onion"
}
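
If you prefer to script this instead of using the dashboard, here is a minimal Go sketch that publishes the same object. It assumes the broker's AMQP port is reachable at localhost:5672 with the default guest/guest credentials (adjust both to match your compose file) and uses the github.com/rabbitmq/amqp091-go client:

package main

import (
	"context"
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	// Assumed broker address and credentials; adjust to your deployment.
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatalf("channel: %v", err)
	}
	defer ch.Close()

	// Publish the JSON object shown above to crawlingQueue via the default
	// exchange, which routes messages by queue name.
	body := []byte(`{"url": "https://facebookcorewwwi.onion"}`)
	err = ch.PublishWithContext(context.Background(), "", "crawlingQueue", false, false,
		amqp.Publishing{ContentType: "application/json", Body: body})
	if err != nil {
		log.Fatalf("publish: %v", err)
	}
	log.Println("URL queued for crawling")
}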

How to speed up crawling

If you want to speed up the crawling, you can scale the crawler component to increase performance. This may be done by issuing the following command after the crawler is started:

$ ./scripts/docker/start.sh -d --scale crawler=5

This will set the number of crawler instances to 5.

How to view results

You can use the Kibana dashboard available at http://localhost:15004. You will need to create an index pattern named 'resources', and when it asks for the time field, choose 'time'.

How to hack the crawler

If you've made a change to one of the crawler components and wish to use the updated version when running start.sh, you just need to issue the following command:

$ goreleaser --snapshot --skip-publish --rm-dist

This will rebuild all images using your local changes. After that, just run start.sh again to have the updated version running.

Architecture

The architecture details are available here.

bathyscaphe's People

Contributors

creekorful, ffroztt, gaganbhat, smithalc


bathyscaphe's Issues

%!s(<nil>) response

I'm having that previous issue again with the new build:

scheduler_1 | time="2020-08-08T17:17:49Z" level=debug msg="Processing URL: https://www.facebookcorewwwi.onion"
api_1 | time="2020-08-08T17:17:49Z" level=debug msg="Successfully published URL: https://www.facebookcorewwwi.onion"
api_1 | time="2020-08-08T17:17:49Z" level=error msg="Error getting response: %!s(<nil>)"
scheduler_1 | time="2020-08-08T17:17:49Z" level=error msg="Error while searching URL: %!s(<nil>)"
scheduler_1 | time="2020-08-08T17:17:49Z" level=error msg="Received status code: 500"
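
For context: %!s(<nil>) is what Go's fmt package prints when a nil value is formatted with the %s verb, so whatever error or response object reaches the log statement here is nil. A one-line reproduction:

package main

import "fmt"

func main() {
	var err error // nil
	fmt.Printf("Error getting response: %s\n", err) // prints: Error getting response: %!s(<nil>)
}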

Build Error invalid argument "creekorful/"

Hi, our Docker version is 19.03.12. We tried to build trandoshan using the build.sh script, but it gives the following error:

invalid argument "creekorful/" for "-t, --tag" flag: invalid reference format See 'docker build --help'.

Where can we find the documentation on this project? Thanks

test on a list of onion websites

Hi,

Hope you are all well!

Just a quick suggestion: you could bulk-test trandoshan on the list of onion websites available at https://github.com/onionltd/oniontree

I am curious how to bulk-add them, and how many will fail because of a captcha challenge.

Thanks in advance for your insights and inputs on that.

Cheers,
X

Elasticsearch Crashing with Code 127

I'm running this on a Google Cloud Platform compute instance with 8GB RAM and 2 cores.

When I open the Kibana dashboard and create a canvas with a data table of the crawled content from the resources* index, it appears to lag for a brief moment and later gives me 401 unauthorized errors.

In the console, I see that docker_elasticsearch_1 exited with code 127

Memory usage at the time of crash doesn't seem to be high either, with around 2/8GB RAM being used.

scheduler_1      | time="2020-09-07T03:53:12Z" level=info msg="Successfully initialized tdsh-scheduler. Waiting for URLs"
torproxy_1       | WARNING: no logs are available with the 'none' log driver
// After opening kibana dashboard and waiting about 20 seconds
docker_elasticsearch_1 exited with code 127

Add switches to docker-compose allowing detaching

Add the -t and -i switches to the docker command to allow detaching. Right now I have to restart the whole project if I want to attach/detach.

Alternatively, you could assign unique detach keys:

--detach-keys "ctrl-a,a"

Sorry, I was trying to add an "enhancement" label but I don't think I can.

Create dashboard application

The idea is to have a simple JS (Angular? React? Vue?) application that talks to the API to get insights from the crawler.

  • Resource page to view / search resources using input
  • Page to submit URL to crawl
  • ?

If anyone has suggestions, feel free to comment on this issue!

Duplicate URLs in ElasticSearch DB

Hi there,

I've been playing with this Tor crawler for some time and generally it works pretty well. However, I have a problem with duplicate URLs. It has been running for 4 days and has recorded over 4000 hits, but the count of unique URLs is only around 1000.

I noticed that there is a query method in the scheduler that asks the Elasticsearch DB whether a found URL already exists:

// Base64-encode the normalized URL and ask the API whether it is already indexed.
b64URI := base64.URLEncoding.EncodeToString([]byte(normalizedURL.String()))
apiURL := fmt.Sprintf("%s/v1/resources?url=%s", apiURI, b64URI)

var urls []proto.ResourceDto
r, err := httpClient.JSONGet(apiURL, &urls)
...
// Only treat the URL as new if the API returned no existing resource.
if len(urls) == 0 {
...

I've copied this method to the crawler and persister as well, to check todo URLs and resource URLs. However, it still only gets around 1000 unique URLs out of over 4000 hits.

Does anyone have any idea of how to fix this problem? Any hint would be greatly appreciated.
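
One possible direction (a sketch, not the project's actual fix): because several scaled crawlers can each pass the len(urls) == 0 check before any of them persists the URL, a check-then-insert will always race. Writing with a deterministic document ID and Elasticsearch's op_type=create makes the deduplication atomic, since the second insert fails with 409 Conflict. All names and values below are hypothetical:

package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Hypothetical values for the sketch.
	esURI := "http://elasticsearch:9200"
	normalizedURL := "https://facebookcorewwwi.onion/"
	resourceJSON := []byte(`{"url": "https://facebookcorewwwi.onion/", "time": "2020-08-08T17:17:49Z"}`)

	// Derive the document ID from the normalized URL, as the scheduler already
	// does for its query, and refuse the write if that ID already exists.
	docID := base64.URLEncoding.EncodeToString([]byte(normalizedURL))
	esURL := fmt.Sprintf("%s/resources/_doc/%s?op_type=create", esURI, docID)

	req, err := http.NewRequest(http.MethodPut, esURL, bytes.NewReader(resourceJSON))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusConflict {
		log.Println("URL already indexed, skipping") // duplicate rejected atomically
	}
}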

Feeder to API unmarshaling error

I can't get the new feeder to work; am I doing it wrong?

./cmd/feeder/feeder --api-uri http://localhost:15005 --url https://www.facebookcorewwwi.onion
INFO[0000] Starting trandoshan-feeder v0.0.1
INFO[0000] URL https://www.facebookcorewwwi.onion successfully sent to the crawler
api_1            | time="2020-08-08T03:15:21Z" level=error msg="Error while un-marshaling url: invalid character 'h' looking for beginning of value"

Maybe it's missing some JSON encoding? I'm not sure; I tried passing JSON-encoded values too, but it didn't like them any better than a raw URL.
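
The log suggests the API's JSON decoder is hitting the raw URL itself: the 'h' it complains about is the first character of "https://...". If the endpoint expects a JSON-encoded string body (an assumption, though the error is consistent with it), the feeder would need to marshal the URL first, roughly like this:

package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	// JSON-encode the bare URL so the decoder sees a valid JSON value
	// ("https://..." with quotes) instead of a raw string starting with 'h'.
	payload, err := json.Marshal("https://www.facebookcorewwwi.onion")
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.Post("http://localhost:15005/v1/urls", "application/json", bytes.NewReader(payload))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("status:", resp.Status)
}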

create release.sh script

  • Take tag as parameter (without v prefix)
  • Create commit (format: Release v$tag)
  • Show diff for the maintainer to confirm the changes
  • Create signed commit (format: Release v$tag)
  • Call build.sh with $tag as parameter
  • Call build.sh with no parameter (latest build)
  • Display a message telling the maintainer to check the details and run: git push && git push --tags && ./push.sh $tag && ./push.sh latest

Create cache of 'ignored' resources

  • We can put 'down' domains in it so we won't spend time trying to crawl them again
  • We can manually ignore domains by adding them to the list (see the sketch below)
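
A minimal in-process sketch of what such a cache could look like, with a TTL for 'down' domains and permanent entries for manual ignores (all names are hypothetical, and a shared store would be needed once crawlers are scaled):

package main

import (
	"sync"
	"time"
)

// ignoreCache remembers domains that should not be crawled, either forever
// (manual ignore) or until a TTL expires (domain currently down).
type ignoreCache struct {
	mu      sync.Mutex
	entries map[string]time.Time // zero time = ignore forever
}

func newIgnoreCache() *ignoreCache {
	return &ignoreCache{entries: map[string]time.Time{}}
}

// IgnoreFor marks a domain as ignored until now+ttl; ttl == 0 ignores it forever.
func (c *ignoreCache) IgnoreFor(domain string, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	var until time.Time
	if ttl > 0 {
		until = time.Now().Add(ttl)
	}
	c.entries[domain] = until
}

// Ignored reports whether a domain is currently ignored, expiring stale entries.
func (c *ignoreCache) Ignored(domain string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	until, ok := c.entries[domain]
	if !ok {
		return false
	}
	if !until.IsZero() && time.Now().After(until) {
		delete(c.entries, domain)
		return false
	}
	return true
}

func main() {
	c := newIgnoreCache()
	c.IgnoreFor("down.onion", time.Hour) // down domain: retry after an hour
	c.IgnoreFor("spam.onion", 0)         // manual ignore: never crawl
	_ = c.Ignored("down.onion")          // true for the next hour
}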

Error while submitting new URL through the API

Hi, I'm quite interested in this crawler but I got an error when I tried to start it. So I just added

feeder:
    image: trandoshan.io/feeder:latest
    command: --log-level debug --api-uri http://localhost:15005 --url http://torlinkbgs6aabns.onion/

to the docker-compose.yml and executed ./scripts/start.sh. But the feeder didn't work properly and returned the following message:

feeder_1 | time="2020-08-26T11:43:50Z" level=error msg="Unable to publish URL: Post \"http://localhost:15005/v1/urls\": dial tcp 127.0.0.1:15005: connect: connection refused"
I did some search online but failed to solve this problem. Could anyone give me some hints? Thank you!
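
For anyone hitting the same error: inside the compose network, localhost refers to the feeder container itself, not the host machine, so port 15005 is refused. The startup logs further down this page show the other services reaching the API at http://api:8080, so pointing the feeder at the service name is the likely fix:

feeder:
    image: trandoshan.io/feeder:latest
    command: --log-level debug --api-uri http://api:8080 --url http://torlinkbgs6aabns.onion/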

Switch to new architecture

The current architecture of Trandoshan is not flexible: messages can be read by only one consumer (to prevent duplicates), etc...

It could be interesting to switch to an event-driven architecture: each process pushes its own events through queues (no consumer uniqueness), and anyone who cares about a message just needs to subscribe and do what they want with it.

This of course introduces a problem: the same message would be consumed by every replica of a duplicated process (e.g. the crawler, which is generally scaled). To prevent this we will need a UNIQUE crawler process reading from the queue and pushing each message to a private queue that the other crawler processes subscribe to (i.e. forwarding the message); see the sketch below.

I don't know if the implementation makes sense at this time, but the general idea seems pretty good.
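
A rough Go sketch of the forwarding idea using the NATS client (github.com/nats-io/nats.go; the subject and queue-group names are made up). Note that NATS queue groups already deliver each message to exactly one member of a group, which might make the dedicated forwarder unnecessary:

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222") // server name as in the logs elsewhere on this page
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// The UNIQUE forwarder: reads the public event stream and re-publishes
	// each message to a private work subject (in practice this would be its
	// own process, separate from the workers below).
	if _, err := nc.Subscribe("urls.found", func(m *nats.Msg) {
		if err := nc.Publish("crawler.todo", m.Data); err != nil {
			log.Printf("forward: %v", err)
		}
	}); err != nil {
		log.Fatal(err)
	}

	// Each scaled crawler instance joins the same queue group, so every
	// forwarded message is handled by exactly one of them.
	if _, err := nc.QueueSubscribe("crawler.todo", "crawlers", func(m *nats.Msg) {
		log.Printf("crawling %s", m.Data)
	}); err != nil {
		log.Fatal(err)
	}

	select {} // keep the process alive
}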

Better ACL for API

Use combination of verb + path

e.g.

  • GET /v1/resources
  • POST /v1/resources
  • POST /v1/urls

etc...


  • Create an endpoint to generate users? That will consume rights, etc.
  • Rights will be stored in the JWT token (see the sketch below)
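
A minimal sketch of the verb + path check as net/http middleware; the rights table mirrors the examples above, while the claim handling and all names are hypothetical:

package main

import (
	"log"
	"net/http"
)

// requiredRight maps "VERB /path" to the right a caller must hold.
var requiredRight = map[string]string{
	"GET /v1/resources":  "resources:read",
	"POST /v1/resources": "resources:write",
	"POST /v1/urls":      "urls:write",
}

// hasRight is a stub: a real implementation would verify the JWT from the
// Authorization header and look for the right in a claims field.
func hasRight(r *http.Request, right string) bool {
	return false // deny everything in this sketch
}

// acl enforces the verb + path rules before delegating to the real handler.
func acl(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if right, ok := requiredRight[r.Method+" "+r.URL.Path]; ok && !hasRight(r, right) {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/resources", func(w http.ResponseWriter, r *http.Request) {})
	log.Fatal(http.ListenAndServe(":8080", acl(mux)))
}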

Kibana server is not ready yet

I tried the new project but can't get past "Kibana server is not ready yet". I used the packaged build and start scripts. Are there additional steps or an installation guide somewhere?

Edit: Everything appeared to start OK, here's my output:
Starting deployments_nats_1 ... done
Starting deployments_elasticsearch_1 ... done
Starting deployments_torproxy_1 ... done
Starting deployments_scheduler_1 ... done
Starting deployments_crawler_1 ... done
Starting deployments_kibana_1 ... done
Starting deployments_api_1 ... done
Starting deployments_persister_1 ... done
Attaching to deployments_torproxy_1, deployments_nats_1, deployments_elasticsearch_1, deployments_scheduler_1, deployments_crawler_1, deployments_api_1, deployments_kibana_1, deployments_persister_1
torproxy_1 | WARNING: no logs are available with the 'none' log driver
nats_1 | WARNING: no logs are available with the 'none' log driver
elasticsearch_1 | WARNING: no logs are available with the 'none' log driver
scheduler_1 | time="2020-08-05T21:39:31Z" level=info msg="Starting trandoshan-scheduler v0.0.1"
scheduler_1 | time="2020-08-05T21:39:31Z" level=debug msg="Using NATS server at: nats"
scheduler_1 | time="2020-08-05T21:39:31Z" level=debug msg="Using API server at: http://api:8080"
scheduler_1 | time="2020-08-05T21:39:31Z" level=info msg="Successfully initialized trandoshan-scheduler. Waiting for URLs"
crawler_1 | time="2020-08-05T21:39:32Z" level=info msg="Starting trandoshan-crawler v0.0.1"
crawler_1 | time="2020-08-05T21:39:32Z" level=debug msg="Using NATS server at: nats"
crawler_1 | time="2020-08-05T21:39:32Z" level=debug msg="Using TOR proxy at: torproxy:9050"
crawler_1 | time="2020-08-05T21:39:32Z" level=info msg="Successfully initialized trandoshan-crawler. Waiting for URLs"
api_1 | {"time":"2020-08-05T21:39:33.269084605Z","level":"INFO","prefix":"echo","file":"api.go","line":"73","message":"Starting trandoshan-api v0.0.1"}
api_1 | {"time":"2020-08-05T21:39:33.269182929Z","level":"DEBUG","prefix":"echo","file":"api.go","line":"75","message":"Using elasticsearch server at: http://elasticsearch:9200"}
api_1 | {"time":"2020-08-05T21:39:33.295324468Z","level":"INFO","prefix":"echo","file":"api.go","line":"88","message":"Successfully initialized trandoshan-api. Waiting for requests"}
api_1 | ⇨ http server started on [::]:8080
kibana_1 | WARNING: no logs are available with the 'none' log driver
persister_1 | time="2020-08-05T21:39:34Z" level=info msg="Starting trandoshan-persister v0.0.1"
persister_1 | time="2020-08-05T21:39:34Z" level=debug msg="Using NATS server at: nats"
persister_1 | time="2020-08-05T21:39:34Z" level=debug msg="Using API server at: http://api:8080"
persister_1 | time="2020-08-05T21:39:34Z" level=info msg="Successfully initialized trandoshan-persister. Waiting for resources"
deployments_elasticsearch_1 exited with code 1

Improve build step

Since we copy the project root directory when building the Dockerfiles, a change in one process forces a rebuild of all the others.

Request to Elasticsearch failed: {"error":{}}

So I've had all containers running overnight without exiting, and there is certainly a lot of activity, but something doesn't seem quite right between Kibana and Elasticsearch. Kibana is only showing me 8 entries and giving this error:

Request to Elasticsearch failed: {"error":{}}

Error: Request to Elasticsearch failed: {"error":{}}
    at http://x.x.x.x:15004/bundles/commons.bundle.js:3:4900279
    at Function._module.service.Promise.try (http://x.x.x.x:15004/bundles/commons.bundle.js:3:2504083)
    at http://x.x.x.x:15004/bundles/commons.bundle.js:3:2503457
    at Array.map (<anonymous>)
    at Function._module.service.Promise.map (http://x.x.x.x:15004/bundles/commons.bundle.js:3:2503414)
    at callResponseHandlers (http://x.x.x.x:15004/bundles/commons.bundle.js:3:4898793)
    at http://x.x.x.x:15004/bundles/commons.bundle.js:3:4881154
    at processQueue (http://x.x.x.x:15004/built_assets/dlls/vendors.bundle.dll.js:435:204190)
    at http://x.x.x.x:15004/built_assets/dlls/vendors.bundle.dll.js:435:205154
    at Scope.$digest (http://x.x.x.x:15004/built_assets/dlls/vendors.bundle.dll.js:435:215159)

Allow posting a URL through the API

I think it's a good idea to keep the queue from being used by too many processes.
The API should be the single point of entry for the whole system, except where performance (async) justifies otherwise.

Therefore, I think we should add another endpoint to the API to allow posting a URL.
The API will simply put the URL in the corresponding queue.

Once that's done, we should refactor the feeder process to use this endpoint instead of the queue.

At the same time, we should allow the API to be reached from outside the Docker network. A rough sketch of such an endpoint follows.
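
This sketch uses net/http and the NATS client for illustration (the logs elsewhere on this page suggest the real API uses echo, and the todoUrls subject name is made up):

package main

import (
	"encoding/json"
	"log"
	"net/http"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// POST /v1/urls: decode a JSON-encoded URL and drop it on the todo queue.
	http.HandleFunc("/v1/urls", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		var url string
		if err := json.NewDecoder(r.Body).Decode(&url); err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		if err := nc.Publish("todoUrls", []byte(url)); err != nil {
			http.Error(w, "queue unavailable", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusCreated)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}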
