
archive-chan's Introduction

archive-chan

Downloads threads on 4chan and saves the images/videos

This program downloads entire threads, preserving the structure of the discussion along with any videos, GIFs, or images that were posted. Each thread is saved as an HTML file in a simplified layout similar to 4chan's.

Requirements

  • Python3
  • BeautifulSoup
  • Flask
  • requests

To install requirements

pip3 install -r requirements.txt

Usage

Archive 4chan threads or catalogs

To archive one or more threads, pass archiver.py either a thread URL or a text file containing one thread URL per line. A number of flags can be set in addition to this.

This is the help output

python3 archiver.py -h
usage: archiver.py [-h] [-p] [-r RETRIES] [--posts POSTS] [-v] Thread


positional arguments:
  Thread                Enter the link or txt file of links to the 4chan thread

optional arguments:
  -h, --help            show this help message and exit
  -p, --preserve_files  Save images and video files locally
  -r RETRIES, --retries RETRIES
                        Set total number of retries if a download fails
  --posts POSTS         Number of posts to download
  -v, --verbose         Print more information on each post
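
The help output above maps naturally onto an argparse parser. The following is a minimal sketch reconstructed only from that help text, not the project's actual code: the function name `build_parser`, the argument types, and the default values are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the help output (types and defaults are assumed)."""
    parser = argparse.ArgumentParser(prog="archiver.py")
    parser.add_argument(
        "Thread",
        help="Enter the link or txt file of links to the 4chan thread",
    )
    parser.add_argument(
        "-p", "--preserve_files", action="store_true",
        help="Save images and video files locally",
    )
    # default=1 is an assumption; the help text does not state a default.
    parser.add_argument(
        "-r", "--retries", type=int, default=1,
        help="Set total number of retries if a download fails",
    )
    # default=None is assumed here to mean "download every post".
    parser.add_argument(
        "--posts", type=int, default=None,
        help="Number of posts to download",
    )
    parser.add_argument(
        "-v", "--verbose", action="store_true",
        help="Print more information on each post",
    )
    return parser
```

Calling `build_parser().parse_args(["g", "-p", "-r", "3"])` yields a namespace whose `Thread` attribute holds the positional value.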

Here is an example that downloads every post in a thread and saves all the media uploaded.

python3 archiver.py http://boards.4channel.org/p/thread/3434289/ect-edit-challenge-thread -p
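
Since the positional argument can be either a URL or a text file of URLs, the two cases have to be told apart somewhere. One way to do that is sketched below; `load_thread_urls` is a hypothetical helper, and treating any existing file path as a URL list is an assumption about how archiver.py behaves.

```python
from pathlib import Path
from typing import List


def load_thread_urls(target: str) -> List[str]:
    """Return thread URLs from a single URL or from a text file of URLs.

    Hypothetical helper: if `target` names an existing file, each non-empty
    line is treated as one URL; otherwise `target` itself is the URL.
    """
    path = Path(target)
    if path.is_file():
        lines = path.read_text(encoding="utf-8").splitlines()
        # Skip blank lines and strip surrounding whitespace.
        return [line.strip() for line in lines if line.strip()]
    return [target]
```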

To archive all the threads on a board, pass in the board name as the positional argument. A number of flags can be set in addition to this.

Here is an example that downloads every active thread on /g/.

python3 archiver.py g -v
Downloading thread: 51971506
Downloading post: p51971506 posted on 12/20/15(Sun)20:03:52
Downloading reply: p67501950 replied on 09/07/18(Fri)19:58:36
Downloading thread: 70621338
Downloading post: p70621338 posted on 04/19/19(Fri)23:03:23
Downloading reply: p70621345 replied on 04/19/19(Fri)23:04:13
Downloading reply: p70621391 replied on 04/19/19(Fri)23:10:35
Downloading reply: p70621407 replied on 04/19/19(Fri)23:12:27
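
Archiving a whole board first requires a list of its active threads. The sketch below assumes the list comes from a board index shaped like 4chan's public threads.json endpoint (a list of pages, each holding a `threads` list whose entries carry a `no` field). Whether archiver.py actually uses that endpoint is an assumption, `thread_ids_from_index` is a hypothetical name, and the `last_modified` values in the sample are made up.

```python
import json
from typing import List


def thread_ids_from_index(pages: List[dict]) -> List[int]:
    """Flatten a threads.json-style index into a list of thread numbers."""
    return [
        thread["no"]
        for page in pages
        for thread in page.get("threads", [])
    ]


# Illustrative sample only; the thread numbers come from the log above.
SAMPLE_INDEX = json.loads(
    '[{"page": 1, "threads": [{"no": 51971506, "last_modified": 0}]},'
    ' {"page": 2, "threads": [{"no": 70621338, "last_modified": 0}]}]'
)
```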

archive-chan's People

Contributors

peskypotato


archive-chan's Issues

Small bug (DB): board ID is not a static value

Every time you run archiver.py with the --use_db flag, the board ID changes to a different value, which causes many problems.
I'd suggest adding a board NAME column to the Threads table as a quick workaround.
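
A related way to keep the ID stable is to key boards on a unique name and look the row up instead of creating a new one on every run. This is a different technique from the column workaround suggested above, sketched with the standard sqlite3 module. The table name `boards`, its columns, and both function names are hypothetical, since the project's real schema is not shown here.

```python
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical boards table keyed on a unique board name."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS boards ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
    )


def get_board_id(conn: sqlite3.Connection, name: str) -> int:
    """Return the same ID for a board name on every call (get-or-create)."""
    # INSERT OR IGNORE is a no-op when the name already exists.
    conn.execute("INSERT OR IGNORE INTO boards (name) VALUES (?)", (name,))
    row = conn.execute(
        "SELECT id FROM boards WHERE name = ?", (name,)
    ).fetchone()
    return row[0]
```

With this shape, repeated runs against the same database file resolve a board such as `g` to the same row.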

Broken html saving

line 39, in get_data
html_file.write(rendered)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 59002: ordinal not in range(128)
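
An error like this usually appears when a text file is opened without an explicit encoding and the platform default turns out to be ASCII. The code around line 39 is not shown, so that diagnosis is an assumption. A minimal sketch of the usual remedy, with `write_html` as a hypothetical helper name:

```python
def write_html(path: str, rendered: str) -> None:
    """Write rendered HTML as UTF-8 regardless of the platform default."""
    # Passing encoding explicitly avoids falling back to an ASCII locale.
    with open(path, "w", encoding="utf-8") as html_file:
        html_file.write(rendered)
```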

Keep checking threads until they're either archived or 404

From what I could see, archive-chan currently only downloads snapshots of the threads instead of "watching" them for new posts until completion.
I'm thinking we could add a --watch-threads flag or something like that.
I would gladly implement this. Your archiver is the most complete I've found so far.
I would just like to discuss this with you as I'm not sure how to do this yet.
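
To make the idea concrete, here is a rough sketch of what such a watch loop could look like. Everything in it is an assumption rather than existing archive-chan code: `watch_thread` is a hypothetical name, `fetch` is an injected callable returning a status code and the parsed thread JSON, and the loop assumes the JSON follows the 4chan API shape, where posts carry a `no` field and the first post gains `"archived": 1` once the thread is archived.

```python
import time
from typing import Callable, List, Optional, Tuple

FetchResult = Tuple[int, Optional[dict]]


def watch_thread(
    fetch: Callable[[], FetchResult],
    on_update: Callable[[List[dict]], None],
    interval: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a thread until it is archived or returns 404.

    Returns "archived" or "404" to say why watching stopped.
    """
    last_seen = 0  # highest post number already handed to on_update
    while True:
        status, data = fetch()
        if status == 404:
            return "404"
        if status == 200 and data:
            posts = data.get("posts", [])
            new_posts = [post for post in posts if post["no"] > last_seen]
            if new_posts:
                on_update(new_posts)
                last_seen = max(post["no"] for post in new_posts)
            if posts and posts[0].get("archived") == 1:
                return "archived"
        # Any other status is treated as transient; try again later.
        sleep(interval)
```

Injecting `fetch` and `sleep` keeps the loop testable without network access; a real flag would wrap an HTTP call and a sensible polling interval.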

archive-chan hangs after a while when downloading whole boards

I noticed this yesterday. After a while (say half an hour), archive-chan just hangs for some reason.
I thought this was something to do with the requests.get() call with no timeout, so I replaced every call with a custom safe_get() function I managed to throw together after skimming some requests tutorials.
However, it still hung even using my function.

So maybe it's something else? I'm not the best at debugging, though.
What I ran was python archiver.py pol -p -r 3 -v --use_db.

Doubling the number of processes in the Pool also had no noticeable effect.

I checked the ./threads/ folder and most of the *.html files had not been written yet, so when I hit Ctrl+C, I lost all of the text it probably had in memory.
Maybe we should dump it all before exiting when catching a KeyboardInterrupt, so it isn't lost.
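
The dump-on-interrupt idea could look roughly like the sketch below. It is single-process and ignores the Pool entirely, and `archive_all`, `render`, and `save` are hypothetical names standing in for whatever archive-chan actually uses.

```python
from typing import Callable, Dict, Iterable


def archive_all(
    thread_ids: Iterable[int],
    render: Callable[[int], str],
    save: Callable[[int, str], None],
) -> bool:
    """Render threads, then save them; flush what exists if interrupted.

    Returns True when the run was cut short by Ctrl+C.
    """
    pending: Dict[int, str] = {}
    interrupted = False
    try:
        for thread_id in thread_ids:
            pending[thread_id] = render(thread_id)
    except KeyboardInterrupt:
        interrupted = True
    finally:
        # Write everything rendered so far, whether or not we were interrupted.
        for thread_id, html in pending.items():
            save(thread_id, html)
    return interrupted
```

Saving each thread as soon as it is rendered would shrink the window for data loss even further, at the cost of more frequent disk writes.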

This is the function I wrote to test the timeout theory:

from typing import Optional

from requests import PreparedRequest, Response, Session
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


class TimeoutHTTPAdapter(HTTPAdapter):
    """HTTPAdapter that applies a default timeout when the caller gives none."""

    def __init__(self, *args, timeout: Optional[int] = None, **kwargs):
        # Fall back to 10 seconds when no timeout is supplied.
        self.timeout = 10 if timeout is None else timeout
        super().__init__(*args, **kwargs)

    def send(self, request: PreparedRequest, **kwargs) -> Response:
        # Only fill in the default; an explicit per-request timeout is kept.
        if kwargs.get("timeout") is None:
            kwargs["timeout"] = self.timeout
        return super().send(request, **kwargs)


def safe_get(url: str, max_retries: int = 3, timeout: int = 10) -> Response:
    # Retry with backoff when the server answers with one of these codes.
    retry_strategy = Retry(
        total=max_retries,
        backoff_factor=1,
        status_forcelist=[413, 429, 500, 502, 503, 504],
    )
    adapter = TimeoutHTTPAdapter(timeout=timeout, max_retries=retry_strategy)

    # Close the session and its pooled connections once the request is done.
    with Session() as session:
        session.mount("https://", adapter)
        session.mount("http://", adapter)
        return session.get(url)

bs4 error

I get this error when trying to download a thread:

Traceback (most recent call last):
  File "C:\Users\user\Downloads\archive-chan\archiver.py", line 8, in <module>
    from extractors.extractor import Extractor
  File "C:\Users\user\Downloads\archive-chan\extractors\extractor.py", line 1, in <module>
    from bs4 import BeautifulSoup as soup
ModuleNotFoundError: No module named 'bs4'

I've already run "pip install beautifulsoup4" and "pip3 install -r requirements.txt", but it still throws the same error.
Any help, please?
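
A common cause of this symptom is that pip installed the package for a different Python interpreter than the one running archiver.py; whether that is what happened here is an assumption. The small check below, with `module_report` as a hypothetical helper, shows which interpreter is running and whether it can see a given module.

```python
import importlib.util
import sys


def module_report(name: str) -> dict:
    """Report the running interpreter and whether it can import `name`."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "module": name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    print(module_report("bs4"))
```

If `found` is false, installing with the same interpreter that runs the script (for example `python -m pip install beautifulsoup4`) is typically the fix.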

New bug: jinja2.exceptions.UndefinedError: 'reply' is undefined

Found it when downloading this thread:
boards.4chan.org/pol/thread/293116201/
with:
python archiver.py https://boards.4chan.org/pol/thread/293116201/ --preserve_files

I don't know how front-ends work, but it looks like a missing flag? I don't know how to debug this.
I've been dealing with every bug I've found, but I need help on this one.

Traceback (most recent call last):
  File "archiver.py", line 137, in <module>
    main()
  File "archiver.py", line 132, in main
    feeder(url)
  File "archiver.py", line 114, in feeder
    archive(url)
  File "archiver.py", line 86, in archive
    extractor.extract(thread, params)
  File "./archive-chan/extractors/fourchan_api.py", line 18, in extract
    self.get_data(thread, params)
  File "./archive-chan/extractors/fourchan_api.py", line 39, in get_data
    rendered = render_template('thread.html', thread=thread, op=op_info, replies=replies)
  File "./envs/p37/flask/templating.py", line 140, in render_template
    ctx.app,
  File "./envs/p37/flask/templating.py", line 120, in _render
    rv = template.render(context)
  File "./envs/p37/jinja2/environment.py", line 1090, in render
    self.environment.handle_exception()
  File "./envs/p37/jinja2/environment.py", line 832, in handle_exception
    reraise(*rewrite_traceback_stack(source=source))
  File "./envs/p37/jinja2/_compat.py", line 28, in reraise
    raise value.with_traceback(tb)
  File "./archive-chan/./assets/templates/thread.html", line 60, in top-level template code
    <img src="../../assets/image/country/troll/{{ op.troll_country|lower }}.gif" alt="{{ op.troll_country }}" title="{{ reply.country_name }}" class="countryFlag">
  File "./envs/p37/jinja2/environment.py", line 471, in getattr
    return getattr(obj, attribute)
jinja2.exceptions.UndefinedError: 'reply' is undefined
