Giter VIP home page Giter VIP logo

tumblr-crawler's Introduction

tumblr-crawler

This is a Python script that you can easily download all the photos and videos from your favorite tumblr blogs.

中文版教程请移步这里

How to Discuss

  • Feel free to join our Slack, where you can ask questions and help answer them on Slack.
  • Also you can open new issue on Github

Prerequisite

For Programmers and Developers

You know how to install Python and pip. Then pip install requests xmltodict

or

$ git clone https://github.com/dixudx/tumblr-crawler.git
$ cd tumblr-crawler
$ pip install -r requirements.txt

For non-programmers

Configuration and Downloading

There are 2 ways to specify the sites you want to download, either by creating a sites.txt file or specifying in the command line parameter.

Use sites.txt

Find a text editor and open the file sites.txt, add the sites you want to download into the file, separated by comma/space/tab/CR, no .tumblr.com suffixes. For example, if you want to download vogue.tumblr.com and gucci.tumblr.com, compose the file like this:

vogue,gucci
vogue2, gucci2

And then save the file, and run python tumblr-photo-video-ripper.py in your terminal or just double click the file which will be automatically run by Python.

Use the command line parameter (only for OS experts)

If you are familiar with command lines in Windows or Unix systems, you may run the script with a parameter to specify the sites:

python tumblr-photo-video-ripper.py site1,site2

The site names should be separated with comma, no space and no .tumblr.com suffixes needed.

How the files get downloaded and stored

The photos/videos will be saved to the folders named with the tumblr blog. You will find them in the current path/directory.

This script will not re-download the photos or videos if they have already been downloaded. So it will do no harm by running this script several times. In the meanwhile, you can find back the missing photos or videos.

Use Proxies (Optional)

You may want to use proxies when downloading. Please refer to ./proxies_sample1.json and ./proxies_sample2.json. And save your own proxies to ./proxies.json in json format. You can validate the content by visiting http://jsonlint.com/.

If ./proxies.json is an empty file, no proxies will be used during downloading.

If you are using Shadowsocks with global mode, your ./proxies.json can be,

{
    "http": "socks5://127.0.0.1:1080",
    "https": "socks5://127.0.0.1:1080"
}

And now you can enjoy your downloads.

More customizations for Programmers Only

# Setting timeout
TIMEOUT = 10

# Retry times
RETRY = 5

# Medium Index Number that Starts from
START = 0

# Numbers of photos/videos per page
MEDIA_NUM = 50

# Numbers of downloading threads concurrently
THREADS = 10

You can set TIMEOUT to another value, e.g. 50, according to your network quality.

And this script will retry downloading the images or videos several times (default value is 5).

You can also only download photos or videos by commenting

def download_media(self, site):
    # only download photos
    self.download_photos(site)
    #self.download_videos(site)

or

def download_media(self, site):
    # only download videos
    #self.download_photos(site)
    self.download_videos(site)

tumblr-crawler's People

Contributors

b0unt9 avatar dixudx avatar gwpl avatar haoliang-quan avatar joshuakwan avatar karlicoss avatar kyeolee89 avatar railwaycat avatar tainakadrums avatar whitelok avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.