Arakneed

A general-purpose concurrent crawler for any directed graph. It is designed to be easy to use.

It is meant as a convenient way to organize your own crawler code, rather than as a full spider library or framework.

Why this when there's Scrapy, etc.?

Because those tools are not meant to crawl the pictures on my laptop, and I want to crawl them like a spider.

Arakneed can be used to traverse any directed graph: a directory on your computer to collect pictures, all of someone's GitHub Gists, or anything else that looks like a directed graph.

It can also be used to crawl a website like a traditional spider.
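
To make the "directory as a directed graph" idea concrete, here is a minimal standard-library sketch that does not use Arakneed at all. Each directory is a vertex whose outgoing edges are its entries, and each picture is a leaf. The names collect_pictures and PICTURE_SUFFIXES are made up for this illustration.

```python
from collections import deque
from pathlib import Path

# Hypothetical set of suffixes treated as pictures in this sketch.
PICTURE_SUFFIXES = {'.jpg', '.png', '.svg'}


def collect_pictures(root: Path) -> list[Path]:
    """Breadth-first traversal of a directory tree, returning picture files."""
    pictures = []
    queue = deque([root])
    while queue:
        vertex = queue.popleft()
        if vertex.is_symlink():
            continue  # skip symlinks so a link back to a parent cannot loop
        if vertex.is_dir():
            # A directory's entries are its outgoing edges.
            queue.extend(sorted(vertex.iterdir()))
        elif vertex.suffix.lower() in PICTURE_SUFFIXES:
            pictures.append(vertex)
    return pictures
```

Calling collect_pictures(Path.home() / 'Pictures') would return the matching files under that directory, assuming it exists and is readable.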

Usage

Install as dependency:

pip install -U arakneed

Or clone the repo and work in place in the file crawler.py:

git clone https://github.com/arakneed/crawler.git

How does it work?

Any vertex spotted by the spider will be scheduled as a task. The only thing you need to do is to define how to handle the tasks.

import asyncio
from pathlib import Path
import re

import aiohttp
from arakneed import Crawler, Task

# Matches <img ... src="..."> tags and captures the src value.
IMG_SRC = re.compile(r'<img.+?src="(.+?)".*?>')
IMAGE_SUFFIXES = ('.jpg', '.png', '.svg')


async def resolver(task: Task, response: aiohttp.ClientResponse):
    # A page yields one image task per matching <img> tag.
    if task.type == 'page':
        html = await response.text()
        return [
            Task('image', match[1])
            for match in IMG_SRC.finditer(html)
            if match[1].endswith(IMAGE_SUFFIXES)
        ]

    # An image is a leaf: save it under ~/Downloads/gh-images.
    if task.type == 'image':
        image_path = Path('~/Downloads/gh-images', task.key.split('/')[-1]).expanduser()
        # Create the target directory (and any missing parents) if needed.
        image_path.parent.mkdir(parents=True, exist_ok=True)
        image_path.write_bytes(await response.content.read())


asyncio.run(Crawler().run(Task('page', 'https://github.com'), resolver))

This code downloads every .jpg, .png, and .svg image referenced by an img tag on the GitHub home page. It should give a sense of what the business code looks like.
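
The scheduling idea (every discovered vertex becomes a task, and a resolver decides what to do with it) can be illustrated without any network access. The following is only a sketch of the pattern, not Arakneed's actual implementation: crawl, GRAPH, and this local Task tuple are hypothetical stand-ins, and the resolver walks an in-memory graph instead of making HTTP requests.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, NamedTuple, Optional


class Task(NamedTuple):
    # Stand-in for a crawler task: a type tag plus a key identifying the vertex.
    type: str
    key: str


Resolver = Callable[[Task], Awaitable[Optional[Iterable[Task]]]]


async def crawl(start: Task, resolver: Resolver, workers: int = 4) -> list[Task]:
    """Resolve `start` and every task discovered from it, each at most once."""
    queue: asyncio.Queue = asyncio.Queue()
    seen = {start}
    queue.put_nowait(start)

    async def worker() -> None:
        while True:
            task = await queue.get()
            try:
                # The resolver returns newly discovered tasks, or None for a leaf.
                for child in await resolver(task) or ():
                    if child not in seen:
                        seen.add(child)
                        queue.put_nowait(child)
            finally:
                queue.task_done()

    pool = [asyncio.create_task(worker()) for _ in range(workers)]
    await queue.join()  # returns once every scheduled task has been resolved
    for w in pool:
        w.cancel()
    await asyncio.gather(*pool, return_exceptions=True)
    return sorted(seen)


# Hypothetical in-memory graph: each page links to other pages and images.
GRAPH = {
    'index': ['about', 'logo.png'],
    'about': ['index', 'team.jpg'],
}


async def resolver(task: Task):
    if task.type == 'page':
        await asyncio.sleep(0)  # placeholder for real I/O
        return [
            Task('image' if '.' in key else 'page', key)
            for key in GRAPH.get(task.key, [])
        ]
    return None  # images are leaves


resolved = asyncio.run(crawl(Task('page', 'index'), resolver))
```

The two pages link to each other, so the seen set is what keeps the sketch from scheduling them forever.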

Examples

  • Crawl a website
  • Collect pictures on a local drive
  • Abstract Syntax Tree analyzer

MISC

  • Be careful with cycles in the directed graph if you are customizing the scheduler or spider. For every vertex, the framework recursively checks whether all of its corresponding vertices are resolved, which is how it knows when it can take a rest :)
  • This paradigm is not distributed. You can get a glimpse of distribution through the Redis-based check of resolved vertices, but a task is locked as soon as it is resolved, so you cannot resolve one task on several machines simultaneously.
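
To show why cycles matter for a recursive "is everything resolved?" check, here is one way such a check could be made cycle-safe. This is an illustrative sketch, not necessarily how Arakneed does it; fully_resolved and its parameters are hypothetical names.

```python
from typing import Dict, List, Optional, Set


def fully_resolved(
    vertex: str,
    edges: Dict[str, List[str]],
    resolved: Set[str],
    visited: Optional[Set[str]] = None,
) -> bool:
    """Return True if `vertex` and everything reachable from it is resolved."""
    visited = set() if visited is None else visited
    if vertex in visited:
        # Already checked, or still being checked further up the call stack.
        # Without this guard, a cycle would recurse forever.
        return True
    if vertex not in resolved:
        return False
    visited.add(vertex)
    return all(
        fully_resolved(child, edges, resolved, visited)
        for child in edges.get(vertex, [])
    )


# Hypothetical graph with a cycle between 'a' and 'b'.
EDGES = {'a': ['b'], 'b': ['a', 'c'], 'c': []}
```

With these edges, fully_resolved('a', EDGES, {'a', 'b'}) is False because 'c' is still pending, and it becomes True once 'c' is added to the resolved set.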

Development

A branch called dev is recommended for common development.

Useful commands:

  • install poetry
    curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
    source $HOME/.poetry/env
  • install dependencies
    poetry install
  • run tests
    poetry run pytest
  • lint
    poetry run flake8 --max-line-length=120 --statistics
