
bathyscaphe's Introduction

Bathyscaphe dark web crawler


Bathyscaphe is a fast, highly configurable, cloud-native dark web crawler written in Go.

How to start the crawler

To start the crawler, just execute the following command:

$ ./scripts/docker/start.sh

and wait for all containers to start.

Notes

  • You can start the crawler in detached mode by passing --detach to start.sh.
  • Ensure you have at least 3 GB of memory available, as the Elasticsearch container alone requires 2 GB.

How to initiate crawling

You can use the RabbitMQ dashboard available at localhost:15003 and publish a new JSON object to the crawlingQueue.

The object should look like this:

{
  "url": "https://facebookcorewwwi.onion"
}
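
If you prefer to script this instead of using the dashboard, here is a minimal Go sketch that publishes the same object. It assumes the broker's AMQP port is reachable at localhost:5672 with the default guest/guest credentials (adjust both to match your compose file) and uses the github.com/rabbitmq/amqp091-go client:

package main

import (
	"context"
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	// Assumed broker address and credentials; adjust to your deployment.
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatalf("channel: %v", err)
	}
	defer ch.Close()

	// Publish the JSON object shown above to crawlingQueue via the default
	// exchange, which routes messages by queue name.
	body := []byte(`{"url": "https://facebookcorewwwi.onion"}`)
	err = ch.PublishWithContext(context.Background(), "", "crawlingQueue", false, false,
		amqp.Publishing{ContentType: "application/json", Body: body})
	if err != nil {
		log.Fatalf("publish: %v", err)
	}
	log.Println("URL queued for crawling")
}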

How to speed up crawling

If you want to speed up the crawling, you can scale the crawler component to increase performance. This may be done by issuing the following command after the crawler is started:

$ ./scripts/docker/start.sh -d --scale crawler=5

This will set the number of crawler instances to 5.

How to view results

You can use the Kibana dashboard available at http://localhost:15004. You will need to create an index pattern named 'resources', and when it asks for the time field, choose 'time'.

How to hack the crawler

If you've made a change to one of the crawler components and wish to use the updated version when running start.sh, you just need to issue the following command:

$ goreleaser --snapshot --skip-publish --rm-dist

This will rebuild all images using your local changes. After that, just run start.sh again to have the updated version running.

Architecture

The architecture details are available here.

bathyscaphe's People

Contributors

creekorful, ffroztt, gaganbhat, smithalc


bathyscaphe's Issues

%!s(<nil>) response

I'm having that previous issue again with the new build:

scheduler_1 | time="2020-08-08T17:17:49Z" level=debug msg="Processing URL: https://www.facebookcorewwwi.onion"
api_1 | time="2020-08-08T17:17:49Z" level=debug msg="Successfully published URL: https://www.facebookcorewwwi.onion"
api_1 | time="2020-08-08T17:17:49Z" level=error msg="Error getting response: %!s(<nil>)"
scheduler_1 | time="2020-08-08T17:17:49Z" level=error msg="Error while searching URL: %!s(<nil>)"
scheduler_1 | time="2020-08-08T17:17:49Z" level=error msg="Received status code: 500"
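
For context: %!s(<nil>) is what Go's fmt package prints when a nil value is formatted with the %s verb, so whatever error or response object reaches the log statement here is nil. A one-line reproduction:

package main

import "fmt"

func main() {
	var err error // nil
	fmt.Printf("Error getting response: %s\n", err) // prints: Error getting response: %!s(<nil>)
}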

Build Error invalid argument "creekorful/"

Hi, our Docker version is 19.03.12. We tried to build trandoshan using the build.sh script, but it gives the following error:

invalid argument "creekorful/" for "-t, --tag" flag: invalid reference format See 'docker build --help'.

Where can we find the documentation on this project? Thanks

test on a list of onion websites

Hi,

Hope you are all well!

Just a quick suggestion: you could bulk-test trandoshan on the list of onion websites available at https://github.com/onionltd/oniontree

I am curious how to bulk-add them, and how many will fail because of a captcha challenge.

Thanks in advance for your insights and inputs on that.

Cheers,
X

Elasticsearch Crashing with Code 127

I'm running this on a Google Cloud Platform compute instance with 8GB RAM and 2 cores.

When I open the Kibana dashboard and create a canvas with a data table of the crawled content from the resources* index, it appears to lag for a brief moment and later gives me 401 unauthorized errors.

In the console, I see that docker_elasticsearch_1 exited with code 127

Memory usage at the time of crash doesn't seem to be high either, with around 2/8GB RAM being used.

scheduler_1      | time="2020-09-07T03:53:12Z" level=info msg="Successfully initialized tdsh-scheduler. Waiting for URLs"
torproxy_1       | WARNING: no logs are available with the 'none' log driver
// After opening kibana dashboard and waiting about 20 seconds
docker_elasticsearch_1 exited with code 127

Add switches to docker-compose allowing detaching

Add the -t and -i switches to the docker command to allow detaching. Right now I have to restart the whole project if I want to attach/detach.

Alternatively, you could assign unique detach keys:

--detach-keys "ctrl-a,a"

Sorry, I was trying to add an "enhancement" label but I don't think I can.

Create dashboard application

The idea is to have a simple JS (Angular? React? Vue?) application that talks to the API to get insights from the crawler.

  • Resource page to view / search resources using input
  • Page to submit URL to crawl
  • ?

If anyone has suggestions, feel free to comment on this issue!

Duplicate URLs in ElasticSearch DB

Hi there,

I've been playing with this Tor crawler for some time and generally it works pretty well. However, I have a problem with duplicate URLs. It has been running for 4 days and has recorded over 4000 hits, but the count of unique URLs is only around 1000.

I noticed that there is a query method in the scheduler that asks the Elasticsearch DB whether a found URL already exists:

// Base64-encode the normalized URL and ask the API whether it is already indexed.
b64URI := base64.URLEncoding.EncodeToString([]byte(normalizedURL.String()))
apiURL := fmt.Sprintf("%s/v1/resources?url=%s", apiURI, b64URI)

var urls []proto.ResourceDto
r, err := httpClient.JSONGet(apiURL, &urls)
...
// Only treat the URL as new if the API returned no existing resource.
if len(urls) == 0 {
...

I've copied this method to the crawler and persister as well, to check todo URLs and resource URLs. However, it still only gets around 1000 unique URLs out of over 4000 hits.

Does anyone have any idea of how to fix this problem? Any hint would be greatly appreciated.
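
One possible direction (a sketch, not the project's actual fix): because several scaled crawlers can each pass the len(urls) == 0 check before any of them persists the URL, a check-then-insert will always race. Writing with a deterministic document ID and Elasticsearch's op_type=create makes the deduplication atomic, since the second insert fails with 409 Conflict. All names and values below are hypothetical:

package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Hypothetical values for the sketch.
	esURI := "http://elasticsearch:9200"
	normalizedURL := "https://facebookcorewwwi.onion/"
	resourceJSON := []byte(`{"url": "https://facebookcorewwwi.onion/", "time": "2020-08-08T17:17:49Z"}`)

	// Derive the document ID from the normalized URL, as the scheduler already
	// does for its query, and refuse the write if that ID already exists.
	docID := base64.URLEncoding.EncodeToString([]byte(normalizedURL))
	esURL := fmt.Sprintf("%s/resources/_doc/%s?op_type=create", esURI, docID)

	req, err := http.NewRequest(http.MethodPut, esURL, bytes.NewReader(resourceJSON))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusConflict {
		log.Println("URL already indexed, skipping") // duplicate rejected atomically
	}
}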

Feeder to API unmarshaling error

I can't get the new feeder to work; am I doing it wrong?

./cmd/feeder/feeder --api-uri http://localhost:15005 --url https://www.facebookcorewwwi.onion
INFO[0000] Starting trandoshan-feeder v0.0.1
INFO[0000] URL https://www.facebookcorewwwi.onion successfully sent to the crawler
api_1            | time="2020-08-08T03:15:21Z" level=error msg="Error while un-marshaling url: invalid character 'h' looking for beginning of value"

Maybe it's missing some JSON encoding? I'm not sure; I tried passing JSON-encoded values too, but it didn't like them any better than a raw URL.
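
The log suggests the API's JSON decoder is hitting the raw URL itself: the 'h' it complains about is the first character of "https://...". If the endpoint expects a JSON-encoded string body (an assumption, though the error is consistent with it), the feeder would need to marshal the URL first, roughly like this:

package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	// JSON-encode the bare URL so the decoder sees a valid JSON value
	// ("https://..." with quotes) instead of a raw string starting with 'h'.
	payload, err := json.Marshal("https://www.facebookcorewwwi.onion")
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.Post("http://localhost:15005/v1/urls", "application/json", bytes.NewReader(payload))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("status:", resp.Status)
}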

create release.sh script

  • Take tag as parameter (without v prefix)
  • Create commit (format: Release v$tag)
  • Show diff for the maintainer to confirm the changes
  • Create signed commit (format: Release v$tag)
  • Call build.sh with $tag as parameter
  • Call build.sh with no parameter (latest build)
  • Display a message telling the maintainer to check the details and run: git push && git push --tags && ./push.sh $tag && ./push.sh latest

Create cache of 'ignored' resources

  • We can put 'down' domains in it so we won't spend time trying to crawl them again
  • We can manually ignore domains by adding them to the list (see the sketch below)
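
A minimal in-process sketch of what such a cache could look like, with a TTL for 'down' domains and permanent entries for manual ignores (all names are hypothetical, and a shared store would be needed once crawlers are scaled):

package main

import (
	"sync"
	"time"
)

// ignoreCache remembers domains that should not be crawled, either forever
// (manual ignore) or until a TTL expires (domain currently down).
type ignoreCache struct {
	mu      sync.Mutex
	entries map[string]time.Time // zero time = ignore forever
}

func newIgnoreCache() *ignoreCache {
	return &ignoreCache{entries: map[string]time.Time{}}
}

// IgnoreFor marks a domain as ignored until now+ttl; ttl == 0 ignores it forever.
func (c *ignoreCache) IgnoreFor(domain string, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	var until time.Time
	if ttl > 0 {
		until = time.Now().Add(ttl)
	}
	c.entries[domain] = until
}

// Ignored reports whether a domain is currently ignored, expiring stale entries.
func (c *ignoreCache) Ignored(domain string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	until, ok := c.entries[domain]
	if !ok {
		return false
	}
	if !until.IsZero() && time.Now().After(until) {
		delete(c.entries, domain)
		return false
	}
	return true
}

func main() {
	c := newIgnoreCache()
	c.IgnoreFor("down.onion", time.Hour) // down domain: retry after an hour
	c.IgnoreFor("spam.onion", 0)         // manual ignore: never crawl
	_ = c.Ignored("down.onion")          // true for the next hour
}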

Error while submitting new URL through the API

Hi, I'm quite interested in this crawler but I got an error when I tried to start it. So I just added

feeder:
    image: trandoshan.io/feeder:latest
    command: --log-level debug --api-uri http://localhost:15005 --url http://torlinkbgs6aabns.onion/

to the docker-compose.yml and executed ./scripts/start.sh. But the feeder didn't work properly and returned the following message:

feeder_1 | time="2020-08-26T11:43:50Z" level=error msg="Unable to publish URL: Post \"http://localhost:15005/v1/urls\": dial tcp 127.0.0.1:15005: connect: connection refused"
I did some search online but failed to solve this problem. Could anyone give me some hints? Thank you!
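
For anyone hitting the same error: inside the compose network, localhost refers to the feeder container itself, not the host machine, so port 15005 is refused. The startup logs further down this page show the other services reaching the API at http://api:8080, so pointing the feeder at the service name is the likely fix:

feeder:
    image: trandoshan.io/feeder:latest
    command: --log-level debug --api-uri http://api:8080 --url http://torlinkbgs6aabns.onion/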

Switch to new architecture

The current architecture of Trandoshan is not flexible: messages can be read by only one consumer (to prevent duplicates), etc...

It could be interesting to switch to an event-driven architecture: each process pushes its own events through queues (no consumer uniqueness), and anyone who cares about a message just needs to subscribe and do what they want with it.

This of course introduces a problem: the same message would be consumed by every replica of a duplicated process (e.g. the crawler, which is generally scaled). To prevent this we will need a UNIQUE crawler process reading from the queue and pushing each message to a private queue that the other crawler processes subscribe to (i.e. forwarding the message); see the sketch below.

I don't know if the implementation makes sense at this time, but the general idea seems pretty good.
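
A rough Go sketch of the forwarding idea using the NATS client (github.com/nats-io/nats.go; the subject and queue-group names are made up). Note that NATS queue groups already deliver each message to exactly one member of a group, which might make the dedicated forwarder unnecessary:

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222") // server name as in the logs elsewhere on this page
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// The UNIQUE forwarder: reads the public event stream and re-publishes
	// each message to a private work subject (in practice this would be its
	// own process, separate from the workers below).
	if _, err := nc.Subscribe("urls.found", func(m *nats.Msg) {
		if err := nc.Publish("crawler.todo", m.Data); err != nil {
			log.Printf("forward: %v", err)
		}
	}); err != nil {
		log.Fatal(err)
	}

	// Each scaled crawler instance joins the same queue group, so every
	// forwarded message is handled by exactly one of them.
	if _, err := nc.QueueSubscribe("crawler.todo", "crawlers", func(m *nats.Msg) {
		log.Printf("crawling %s", m.Data)
	}); err != nil {
		log.Fatal(err)
	}

	select {} // keep the process alive
}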

Better ACL for API

Use combination of verb + path

e.g.

  • GET /v1/resources
  • POST /v1/resources
  • POST /v1/urls

etc...


  • Create an endpoint to generate users? That will consume rights, etc.
  • Rights will be stored in the JWT token (see the sketch below)
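
A minimal sketch of the verb + path check as net/http middleware; the rights table mirrors the examples above, while the claim handling and all names are hypothetical:

package main

import (
	"log"
	"net/http"
)

// requiredRight maps "VERB /path" to the right a caller must hold.
var requiredRight = map[string]string{
	"GET /v1/resources":  "resources:read",
	"POST /v1/resources": "resources:write",
	"POST /v1/urls":      "urls:write",
}

// hasRight is a stub: a real implementation would verify the JWT from the
// Authorization header and look for the right in a claims field.
func hasRight(r *http.Request, right string) bool {
	return false // deny everything in this sketch
}

// acl enforces the verb + path rules before delegating to the real handler.
func acl(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if right, ok := requiredRight[r.Method+" "+r.URL.Path]; ok && !hasRight(r, right) {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/resources", func(w http.ResponseWriter, r *http.Request) {})
	log.Fatal(http.ListenAndServe(":8080", acl(mux)))
}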

Kibana server is not ready yet

I tried the new project but can't get past "Kibana server is not ready yet". I used the packaged build and start scripts. Are there additional steps or an installation guide somewhere?

Edit: Everything appeared to start OK, here's my output:
Starting deployments_nats_1 ... done
Starting deployments_elasticsearch_1 ... done
Starting deployments_torproxy_1 ... done
Starting deployments_scheduler_1 ... done
Starting deployments_crawler_1 ... done
Starting deployments_kibana_1 ... done
Starting deployments_api_1 ... done
Starting deployments_persister_1 ... done
Attaching to deployments_torproxy_1, deployments_nats_1, deployments_elasticsearch_1, deployments_scheduler_1, deployments_crawler_1, deployments_api_1, deployments_kibana_1, deployments_persister_1
torproxy_1 | WARNING: no logs are available with the 'none' log driver
nats_1 | WARNING: no logs are available with the 'none' log driver
elasticsearch_1 | WARNING: no logs are available with the 'none' log driver
scheduler_1 | time="2020-08-05T21:39:31Z" level=info msg="Starting trandoshan-scheduler v0.0.1"
scheduler_1 | time="2020-08-05T21:39:31Z" level=debug msg="Using NATS server at: nats"
scheduler_1 | time="2020-08-05T21:39:31Z" level=debug msg="Using API server at: http://api:8080"
scheduler_1 | time="2020-08-05T21:39:31Z" level=info msg="Successfully initialized trandoshan-scheduler. Waiting for URLs"
crawler_1 | time="2020-08-05T21:39:32Z" level=info msg="Starting trandoshan-crawler v0.0.1"
crawler_1 | time="2020-08-05T21:39:32Z" level=debug msg="Using NATS server at: nats"
crawler_1 | time="2020-08-05T21:39:32Z" level=debug msg="Using TOR proxy at: torproxy:9050"
crawler_1 | time="2020-08-05T21:39:32Z" level=info msg="Successfully initialized trandoshan-crawler. Waiting for URLs"
api_1 | {"time":"2020-08-05T21:39:33.269084605Z","level":"INFO","prefix":"echo","file":"api.go","line":"73","message":"Starting trandoshan-api v0.0.1"}
api_1 | {"time":"2020-08-05T21:39:33.269182929Z","level":"DEBUG","prefix":"echo","file":"api.go","line":"75","message":"Using elasticsearch server at: http://elasticsearch:9200"}
api_1 | {"time":"2020-08-05T21:39:33.295324468Z","level":"INFO","prefix":"echo","file":"api.go","line":"88","message":"Successfully initialized trandoshan-api. Waiting for requests"}
api_1 | ⇨ http server started on [::]:8080
kibana_1 | WARNING: no logs are available with the 'none' log driver
persister_1 | time="2020-08-05T21:39:34Z" level=info msg="Starting trandoshan-persister v0.0.1"
persister_1 | time="2020-08-05T21:39:34Z" level=debug msg="Using NATS server at: nats"
persister_1 | time="2020-08-05T21:39:34Z" level=debug msg="Using API server at: http://api:8080"
persister_1 | time="2020-08-05T21:39:34Z" level=info msg="Successfully initialized trandoshan-persister. Waiting for resources"
deployments_elasticsearch_1 exited with code 1

Improve build step

Since we copy the project root directory when building the Dockerfiles, a change in one process forces a rebuild of all the others.

Request to Elasticsearch failed: {"error":{}}

So I've had all containers running overnight without exiting, and there is certainly a lot of activity, but something doesn't seem quite right between Kibana and Elasticsearch. Kibana is only showing me 8 entries and giving this error:

Request to Elasticsearch failed: {"error":{}}

Error: Request to Elasticsearch failed: {"error":{}}
    at http://x.x.x.x:15004/bundles/commons.bundle.js:3:4900279
    at Function._module.service.Promise.try (http://x.x.x.x:15004/bundles/commons.bundle.js:3:2504083)
    at http://x.x.x.x:15004/bundles/commons.bundle.js:3:2503457
    at Array.map (<anonymous>)
    at Function._module.service.Promise.map (http://x.x.x.x:15004/bundles/commons.bundle.js:3:2503414)
    at callResponseHandlers (http://x.x.x.x:15004/bundles/commons.bundle.js:3:4898793)
    at http://x.x.x.x:15004/bundles/commons.bundle.js:3:4881154
    at processQueue (http://x.x.x.x:15004/built_assets/dlls/vendors.bundle.dll.js:435:204190)
    at http://x.x.x.x:15004/built_assets/dlls/vendors.bundle.dll.js:435:205154
    at Scope.$digest (http://x.x.x.x:15004/built_assets/dlls/vendors.bundle.dll.js:435:215159)

Allow posting a URL through the API

I think it's a good idea to keep the queue from being used by too many processes.
The API should be the single point of entry for the whole system, except where performance (async) justifies otherwise.

Therefore, I think we should add another endpoint to the API to allow posting a URL.
The API will simply put the URL in the corresponding queue.

Once that's done, we should refactor the feeder process to use this endpoint instead of the queue.

At the same time, we should allow the API to be reached from outside the Docker network. A rough sketch of such an endpoint follows.
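
This sketch uses net/http and the NATS client for illustration (the logs elsewhere on this page suggest the real API uses echo, and the todoUrls subject name is made up):

package main

import (
	"encoding/json"
	"log"
	"net/http"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// POST /v1/urls: decode a JSON-encoded URL and drop it on the todo queue.
	http.HandleFunc("/v1/urls", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		var url string
		if err := json.NewDecoder(r.Body).Decode(&url); err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		if err := nc.Publish("todoUrls", []byte(url)); err != nil {
			http.Error(w, "queue unavailable", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusCreated)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}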
