OD-Database Crawler 🕷

Usage

With Config File (if config.yml found in working dir)
- Download default config
- Set server.url and server.token
- Start with ./od-database-crawler server --config <file>
With Flags or env
- Flags and environment variables override the config file, if one exists
- Run with --help for a list of flags
- Every flag is available as an environment variable: --output.crawl_stats ➡️ OD_OUTPUT_CRAWL_STATS
- Start with ./od-database-crawler server <flags>
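
The flag-to-environment naming can be sketched as a small shell helper. This is only an illustration: `flag_to_env` is a hypothetical function, not part of the crawler, and the rule it applies (prefix `OD_`, uppercase, dots and hyphens become underscores) is inferred from the flag/variable pairs listed in this README.

```shell
#!/bin/sh
# Hypothetical helper: derive the environment variable name for a flag.
# Assumed rule (inferred from the documented pairs): strip a leading "--",
# replace "." and "-" with "_", uppercase, and prefix with "OD_".
flag_to_env() {
    name=${1#--}
    printf 'OD_%s\n' "$(printf '%s' "$name" | tr '.-' '__' | tr '[:lower:]' '[:upper:]')"
}

flag_to_env --server.url        # prints OD_SERVER_URL
flag_to_env --crawl.user-agent  # prints OD_CRAWL_USER_AGENT
```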

With Docker

docker run \
    -e OD_SERVER_URL=xxx \
    -e OD_SERVER_TOKEN=xxx \
    terorie/od-database-crawler

Here are the most important config flags. For finer control, take a look at /config.yml.

Flag/Environment	Description	Example
`server.url` `OD_SERVER_URL`	OD-DB Server URL	`https://od-db.mine.the-eye.eu/api`
`server.token` `OD_SERVER_TOKEN`	OD-DB Server Access Token	Ask Hexa TM
`server.recheck` `OD_SERVER_RECHECK`	Job Fetching Interval	`3s`
`output.crawl_stats` `OD_OUTPUT_CRAWL_STATS`	Crawl Stats Logging Interval (0 = disabled)	`500ms`
`output.resource_stats` `OD_OUTPUT_RESOURCE_STATS`	Resource Stats Logging Interval (0 = disabled)	`8s`
`output.log` `OD_OUTPUT_LOG`	Log File (none = disabled)	`crawler.log`
`crawl.tasks` `OD_CRAWL_TASKS`	Max number of sites to crawl concurrently	`500`
`crawl.connections` `OD_CRAWL_CONNECTIONS`	HTTP connections per site	`1`
`crawl.retries` `OD_CRAWL_RETRIES`	Number of retries after a temporary failure (e.g. `HTTP 429` or timeouts)	`5`
`crawl.dial_timeout` `OD_CRAWL_DIAL_TIMEOUT`	TCP Connect timeout	`5s`
`crawl.timeout` `OD_CRAWL_TIMEOUT`	HTTP request timeout	`20s`
`crawl.user-agent` `OD_CRAWL_USER_AGENT`	HTTP Crawler User-Agent	`googlebot/1.2.3`
`crawl.job_buffer` `OD_CRAWL_JOB_BUFFER`	Number of URLs to keep in memory/cache, per job. The rest is offloaded to disk. Decrease this value if the crawler uses too much RAM. (0 = Disable Cache, -1 = Only use Cache)	`5000`
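
As a rough sketch, a run configured purely through the environment might look like the following. The values are the example values from the table, `xxx` is a placeholder for a real token, and the binary is assumed to sit in the working directory as in the usage notes; adjust all of these to your setup.

```shell
#!/bin/sh
# Example values taken from the table; the token is a placeholder.
export OD_SERVER_URL="https://od-db.mine.the-eye.eu/api"
export OD_SERVER_TOKEN="xxx"
export OD_SERVER_RECHECK="3s"
export OD_OUTPUT_LOG="crawler.log"
export OD_CRAWL_TASKS="500"
export OD_CRAWL_CONNECTIONS="1"
# Lower this value if the crawler uses too much RAM.
export OD_CRAWL_JOB_BUFFER="5000"

# Start the crawler only if the binary is present in the working directory.
if [ -x ./od-database-crawler ]; then
    ./od-database-crawler server
fi
```

The same variables can be passed to the Docker image with additional `-e` options.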