Data-Con-Scrape

These are the companion materials for my Python web scraping workshop from Boston Data-Con 2014, an excellent conference put on by the good people of the Boston Data Community.

The session was advertised as: "A tutorial on python for web scraping, covering BeautifulSoup, and when and how to use Selenium for dynamic pages and comment loading."

What actually happened was: a lively workshop with great questions, interactive bloopers, and very very little Selenium. Thank you to everyone who stayed to the very end to scrape some data with me. Here are the materials and some extras. ~Enjoy

(video of the session is posted here)

It's like trespassing and organizing your desk. @laurieskelly explains joy of web scraping. #bdc14 #python Scrape http://t.co/CRfkv5sGgk

-- Mike Combs (@mike3d0g) September 14, 2014

Contents:

For all notebooks:

  • Links here in the README will take you to the published/non-interactive version of the notebooks.
  • Clone this repo and run .ipynb files using ipython notebook to play with them interactively.
  • Data-Con-metacritic.ipynb: Scraping the site metacritic.com to get information about new releases and produce links to visit and get detailed info on each one

  • Data-Con-metacritic2.ipynb: Digging into details on the movie links created in the first metacritic scraper

  • Data-Con_Selenium.ipynb: Tried to find more cool examples for Selenium, ended up finding another workaround and getting totally distracted making a scraper for datatau

Web Scraping Tips

  • If possible for your data collecting project, use an API instead of scraping. It is kinder to the nice people creating the data that you are collecting, more resistant to breaking, and usually more efficient to code. Scraping is "for" cases when APIs are not provided.

  • If possible for your web scraping project, avoid using Selenium. It is more complex to develop scraper code using Selenium, and slower to run. If you can get what you need without Selenium, it is usually better.

  • When you're poking through a website to scrape it, it's a great idea to open the page in Incognito Mode so that your active sessions, plug-ins, etc., do not make the content that you see differ systematically from the content "seen" by Requests.

  • Websites change. When they do, scrapers typically break. There are ways to write your selectors or build your scraping logic to be robust to minor changes, but broken scrapers are part of the game. You can't go around them, you can't go under them, so to live through them:

    1. Make your code noisy. Include tests and checks that can detect changes, and notify yourself when something changes.
    2. Save raw html. "Space is cheap," as they say, so saving raw html can allow you to retroactively patch the holes in your longitudinal scraping data after you have adjusted to a change in page format.
    3. For this reason, ugly sites make great scraping targets. If a page looks like it hasn't been updated since 1998, you might infer that it is less likely to be re-styled and re-structured every 3-6 months.
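Here is a minimal sketch of the "noisy code" and "save raw html" tips, in Python 3. It is not from the workshop notebooks: the function names, the `raw_html` folder, and the idea of counting elements by CSS class are all assumptions made for illustration. It also uses the standard library's `HTMLParser` instead of BeautifulSoup, only so that it runs without extra installs.

```python
import hashlib
import logging
from datetime import datetime, timezone
from html.parser import HTMLParser
from pathlib import Path

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scraper")


class ClassCounter(HTMLParser):
    """Count start tags that carry a given CSS class (a stand-in for a real selector)."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def save_raw_html(html, url, out_dir="raw_html"):
    """Write the untouched page to disk so it can be re-parsed later."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:10]
    path = folder / f"{stamp}_{digest}.html"
    path.write_text(html, encoding="utf-8")
    return path


def check_page(html, css_class, min_expected=1):
    """Return False and log a warning if too few matching elements are found."""
    counter = ClassCounter(css_class)
    counter.feed(html)
    if counter.count < min_expected:
        log.warning(
            "found %d element(s) with class %r, expected at least %d; "
            "did the page layout change?",
            counter.count, css_class, min_expected,
        )
        return False
    return True
```

In a scraping loop you might call `save_raw_html(page_text, url)` before parsing anything, then `check_page(page_text, "product_title")` (a hypothetical class name) and skip or flag the page when it returns `False`.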

Random Tips

  • urlparse.urljoin() (urllib.parse.urljoin() in Python 3) is a handy way to stick parts of a url together without messing it up and having too many or too few slashes up in there. module docs
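A quick sketch of how `urljoin` behaves, written for Python 3, where the function lives in `urllib.parse` rather than `urlparse`. The paths below are made up for illustration and are not taken from the notebooks.

```python
from urllib.parse import urljoin

base = "http://www.metacritic.com/browse/movies/"

# A link starting with "/" replaces the whole path of the base url.
print(urljoin(base, "/movie/some-title"))
# http://www.metacritic.com/movie/some-title

# A relative link is resolved against the base's "directory".
print(urljoin("http://example.com/movies/", "details"))
# http://example.com/movies/details

# Without the trailing slash, the last path segment gets replaced.
print(urljoin("http://example.com/movies", "details"))
# http://example.com/details

# A link that is already a full url comes back unchanged.
print(urljoin(base, "http://example.org/other"))
# http://example.org/other
```

The trailing-slash case is the one that tends to bite, so it is worth checking what your base url actually looks like before joining scraped links onto it.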

Resources

What did I forget?

remind me on twitter or make a pull request : )
