Blue Fish

A crawler that syncs remote articles and downloads them to local storage.

BlueFish crawls not only the article content but also the images it includes, and saves each article as a Markdown file for local search. With the index file, it can sync against the remote data and fetch only the new articles.
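A minimal sketch of how an index file can drive incremental sync. The one-ID-per-line index format and the helper names `load_index`, `find_new_articles`, and `update_index` are assumptions for illustration; the actual `.idx` format used by BlueFish may differ.

```python
from pathlib import Path


def load_index(index_path: Path) -> set[str]:
    """Return the article IDs recorded in the index (assumed format: one ID per line)."""
    if not index_path.exists():
        return set()
    lines = index_path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def find_new_articles(remote_ids: list[str], index_path: Path) -> list[str]:
    """Return remote IDs missing from the local index, keeping the remote order."""
    known = load_index(index_path)
    return [article_id for article_id in remote_ids if article_id not in known]


def update_index(index_path: Path, new_ids: list[str]) -> None:
    """Append newly downloaded IDs to the index file, creating it if needed."""
    index_path.parent.mkdir(parents=True, exist_ok=True)
    with index_path.open("a", encoding="utf-8") as fh:
        for article_id in new_ids:
            fh.write(article_id + "\n")
```

On a first run the index is empty, so every remote article counts as new; later runs download only the IDs that the index does not yet contain.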

usage: BlueFish [-h] [-v] [-f] [-p PATH] [--pull PULL] [--proxy PROXY]

BlueFish - A simple tool for sync and download with crawlers

options:
  -h, --help            show this help message and exit
  -v, --version         Print the version of BlueFish and remote sources list
  -f, --force           Force to pull all of remote data
  -p PATH, --path PATH  Set save path
  --pull PULL           Pull which the remote data, default is all
  --proxy PROXY         Set the proxy for BlueFish
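
The options above could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the help text, not the project's actual code; the `"all"` default value and the `split_sources` helper are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the options listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="BlueFish",
        description="BlueFish - A simple tool for sync and download with crawlers",
    )
    parser.add_argument("-v", "--version", action="store_true",
                        help="Print the version of BlueFish and remote sources list")
    parser.add_argument("-f", "--force", action="store_true",
                        help="Force to pull all of remote data")
    parser.add_argument("-p", "--path", help="Set save path")
    # Assumption: the literal string "all" stands for every registered source.
    parser.add_argument("--pull", default="all",
                        help="Pull which the remote data, default is all")
    parser.add_argument("--proxy", help="Set the proxy for BlueFish")
    return parser


def split_sources(pull: str) -> list[str]:
    """Split a comma-separated --pull value such as "xz,tttang" into names."""
    return [name.strip() for name in pull.split(",") if name.strip()]
```

With this sketch, `--pull xz,tttang` yields the list `["xz", "tttang"]`, which can then be looked up in a source registry.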

Supported Websites

  • tttang.com
  • xz.aliyun.com
  • WeChat (weixin) platform
  • custom websites

Usage

Requires Python >= 3.10.

pip install -r requirements.txt
python bluefish.py --help

# The first run pulls the remote data you are interested in.
# Run the same command again later to sync with the remote data.
python bluefish.py --pull xz,tttang --force --proxy socks5://username:password@host:1080 --path ../

The names of the folders under dist end with the date on which the articles were fetched.

tree data -L 2

data
|-- dist
|   |-- tttang-2023-11-15
|   `-- xz-2023-11-15
`-- index
    |-- tttang.idx
    `-- xz.idx

The index files are generated automatically; please do not modify them.
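
The layout shown in the tree can be expressed as two small path helpers. This is an illustrative sketch only; the function names `dist_folder` and `index_file` are hypothetical and not taken from the project.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def dist_folder(base: Path, source: str, day: Optional[date] = None) -> Path:
    """Return <base>/dist/<source>-<YYYY-MM-DD>, using today's date by default."""
    day = day or date.today()
    return base / "dist" / f"{source}-{day.isoformat()}"


def index_file(base: Path, source: str) -> Path:
    """Return <base>/index/<source>.idx."""
    return base / "index" / f"{source}.idx"
```

For example, `dist_folder(Path("data"), "tttang", date(2023, 11, 15))` gives `data/dist/tttang-2023-11-15`, matching the tree output.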

Note

Running on Linux with unrestricted internet access is recommended :)

Speed

Use asyncio and aiohttp to speed up the crawler.

However, it is fast enough that the website may ban us; use a proxy to avoid that. There is also an unsolved problem: a "Response payload is not completed" error is raised while reading a response. It occurs when many requests are sent to the same domain.

So I set semaphore = asyncio.Semaphore(3) to try to avoid it.
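
A minimal sketch of limiting concurrency with `asyncio.Semaphore(3)`. To stay self-contained it replaces the real aiohttp request with `asyncio.sleep`; the `fetch` and `crawl` names and the `stats` counter are assumptions made for illustration, not the project's code.

```python
import asyncio

MAX_CONCURRENCY = 3  # same limit as the Semaphore(3) mentioned above


async def fetch(url: str, semaphore: asyncio.Semaphore, stats: dict) -> str:
    """Simulate one download while holding a semaphore slot."""
    async with semaphore:  # at most MAX_CONCURRENCY coroutines run this block at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for the real HTTP request
        stats["active"] -= 1
        return f"body of {url}"


async def crawl(urls: list[str]) -> tuple[list[str], int]:
    """Fetch all URLs and report the peak number of simultaneous requests."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    stats = {"active": 0, "peak": 0}
    bodies = await asyncio.gather(*(fetch(url, semaphore, stats) for url in urls))
    return list(bodies), stats["peak"]


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(10)]
    bodies, peak = asyncio.run(crawl(urls))
    print(len(bodies), "pages fetched, peak concurrency:", peak)
```

All tasks are created up front, but the semaphore lets only three of them be in flight at any moment, which trades some speed for fewer requests hitting the same domain at once.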

Test: crawling tttang.com (1580 articles) took 877.40 s (about 15 min, roughly 0.56 s per article).
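
The per-article figure follows from the two totals reported in the test; a quick check of the arithmetic:

```python
articles = 1580
elapsed_seconds = 877.40

per_article = elapsed_seconds / articles  # seconds per article
minutes = elapsed_seconds / 60            # total run time in minutes

print(f"{per_article:.2f} s per article")  # 0.56 s per article
print(f"{minutes:.1f} min in total")       # 14.6 min in total
```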

Add your own source:

  1. Add sync script:
class XZSync(BaseSync):
    def __init__(self):
        super().__init__(baseurl="https://xz.aliyun.com", index_name="xz")

    def parse_page(self, text):
        ...

    def get_fully_storage(self):
        ...

    def get_remote_storage(self, last_idx=None):
        ...

    def get_total_page(self) -> int:
        ...
  2. Add download script:
class XZCrawler(BaseCrawler):
    def __init__(self, name="xz"):
        ...

    async def parse(self, text: str):
        ...
  3. Add to bluefish.py:
sources = {
    "xz": XZSync,
    ...
}
  4. Enjoy it
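
A sketch of how a `sources` registry like the one in step 3 can be combined with the `--pull` option. The `BaseSync` class here is a stand-in stub so the example runs on its own; `TTTangSync`, its base URL scheme, and the `select_sources` helper are hypothetical names, and treating `"all"` as "every registered source" is an assumption.

```python
class BaseSync:
    """Stand-in for the project's BaseSync; the real class lives in the repository."""

    def __init__(self, baseurl: str, index_name: str):
        self.baseurl = baseurl
        self.index_name = index_name


class XZSync(BaseSync):
    def __init__(self):
        super().__init__(baseurl="https://xz.aliyun.com", index_name="xz")


class TTTangSync(BaseSync):  # hypothetical class name for the tttang.com source
    def __init__(self):
        super().__init__(baseurl="https://tttang.com", index_name="tttang")


# Registry mapping the names accepted by --pull to their sync classes.
sources = {
    "xz": XZSync,
    "tttang": TTTangSync,
}


def select_sources(pull: str) -> list[BaseSync]:
    """Instantiate the sync classes named in a comma-separated --pull value."""
    if pull == "all":
        names = list(sources)
    else:
        names = [name.strip() for name in pull.split(",") if name.strip()]
    unknown = [name for name in names if name not in sources]
    if unknown:
        raise ValueError(f"unknown source(s): {', '.join(unknown)}")
    return [sources[name]() for name in names]
```

Registering a new site then only requires adding one entry to `sources`; unknown names fail early with a clear error instead of being silently skipped.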

Contributors

dianzhh, silenteag
