
first-web-scraper's Introduction

First Web Scraper

A step-by-step guide to writing a web scraper with Python.

Contributing to the documentation

After cloning the repository, the Sphinx documentation can be edited in the docs directory and published to ReadTheDocs by pushing changes to the master branch.

First install the requirements.

pipenv install

Fire up the test server, which will automatically update to show changes made to the reStructuredText files in the docs directory.

make docs

Open http://localhost:8000 in your browser and start making changes.

first-web-scraper's People

Contributors

artisdom, cjdd3b, dependabot[bot], jackiekazil, lozadaomr, mattwynn1, michaelaharvey, palewire, schwanksta, sisiwei, stucka


first-web-scraper's Issues

soup syntax error

Guys,

Whenever I use Beautiful Soup I get this syntax error. I'm a very new user of Python and I just don't understand it.
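The report doesn't include a traceback, so this is only a guess: a common cause of a SyntaxError when following older scraping tutorials is running Python 2 print statements under Python 3.

```python
# Python 2 syntax from older tutorials raises a SyntaxError on Python 3:
#     print table
# Python 3 requires print to be called as a function:
table = "<table>...</table>"  # hypothetical value for illustration
print(table)
```

Posting the full traceback would make the actual cause clear.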

Alas! Boone County has changed their website.

I'll try to carve out some time to go through the tutorial later today and update things, but now it's a Java site -- STILL WITH A TABLE! Just wanted to ping everyone and make sure you all knew.

I have a former student going through the tutorial on her own and sending me questions as she goes, so I can probably take up the updating.

BeautifulSoup and differences for Python 3.x

Thank you for the tutorial! I did it with Python 3, so there were a few differences. There has also been a change to BeautifulSoup that affected my install, at least.

Here's what I found:

  • I hit a problem installing Beautiful Soup, which was solved by installing 'beautifulsoup4' instead. The biggest difference I noticed was the import, as noted below.
  • Instead of having to remove &nbsp;, a different character showed up.
  • The csv module works differently in Python 3, but it wasn't hard to find the solution.

Below is the final code with notes on the key differences I encountered. Thanks again!

import csv
import requests

# Installing 'beautifulsoup' failed on my Mac. The Beautiful Soup page
# recommends installing 'beautifulsoup4' instead, which worked. When
# importing, use bs4 as shown below.

from bs4 import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'resultsTable'})

list_of_rows = []
for row in table.findAll('tr')[1:]:
    list_of_cells = []
    for cell in row.findAll('td'):
        # &nbsp; wasn't a problem on this page, but '\xa0' was. Simple
        # enough to swap the two, but I wonder -- how would text.replace
        # work for more than one problem character?
        text = cell.text.replace('\xa0', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

# Here's the key difference for Python 3.x: open the file in text mode
# with newline=''. Found this with a quick search on Stack Overflow.

filename = 'inmates.csv'
with open(filename, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["Last", "First", "Middle", "Gender", "Race", "Age", "City", "State"])
    writer.writerows(list_of_rows)
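On the question raised in the comment above — stripping more than one problem character — one approach (a sketch, not part of the tutorial) is to chain replacements, or to list the characters in a single regex character class:

```python
import re

def clean_cell(text):
    # Remove the HTML entity first, then every character listed in the
    # regex class in one pass: the non-breaking space (\xa0) and, as an
    # example of extending the class, the zero-width space (\u200b).
    text = text.replace('&nbsp;', '')
    return re.sub(r'[\xa0\u200b]', '', text)

cleaned = clean_cell('COLUMBIA\xa0')
```

Adding another troublesome character is then a one-character change to the class.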

beautifulsoup4 instead of beautifulsoup?

This is not an issue per se, merely a suggestion to switch to beautifulsoup4, since only critical bugs get attention in the case of beautifulsoup3.

showmeboone.com is frequently inaccessible.

I would suggest a different website/URL for demonstrating web scraping. I can probably write one by this weekend, hosted on one of my servers, allowing web scraping without breaking any laws or terms of use.
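In the meantime, a page you control can be served locally with nothing but the standard library. This is a minimal sketch (the file name and port are arbitrary): write a static HTML table to disk and serve the current directory with http.server, giving a scraping target with no terms-of-use concerns.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler
import threading

# A tiny stand-in for the jail-roster table the tutorial scrapes.
PAGE = """<table class="resultsTable">
<tr><th>Last</th><th>First</th></tr>
<tr><td>DOE</td><td>JANE</td></tr>
</table>"""

with open("practice.html", "w") as f:
    f.write(PAGE)

# Serve the current directory in a background thread.
server = HTTPServer(("127.0.0.1", 8013), SimpleHTTPRequestHandler)
thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()
# The table is now reachable at http://127.0.0.1:8013/practice.html
```

Pointing the tutorial's requests.get() at that local URL exercises the same scraping steps without touching a third-party site.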

Header fails in the last step.

I'm not sure what I'm missing. Everything is working except the last step, the header. Any suggestions?

import csv
import requests
from BeautifulSoup import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('tbody', attrs={'class': 'stripe'})

list_of_rows = []
for row in table.findAll('tr')[1:]:
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

outfile = open("./inmates.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Last", "First", "Middle", "Gender", "Race", "Age", "City", "State"])
writer.writerows(list_of_rows)

Here is the output in text. Same results in excel.

Details,ADAM,OMER,SIRAJ,M,B,29,COLUMBIA,MO
Details,ALEXANDER,BENJAMIN,FRANKLIN,M,B,22,COLUMBIA,MO
Details,AMAN,JUSTIN,TYLER,M,B,36,COLUMBIA,MO
Details,ANDREWS,JOSEPH,DAMON,M,W,39,COLUMBIA,MO
Details,ARTEAGA,BRAYAN,OSIRIS-CACHO,M,H,25,JEFFERSON CITY,MO
Details,AUSTIN,KAY,CEE,F,W,33,KANSAS CITY,MO
Details,AVALOS-AVALOS,JOSE,,M,H,18,ST.ANN,MO
Details,BARRETT,NICHOLE,RAYSHEA,F,B,31,COLUMBIA,MO
Details,BENNETT,EMAS,CASSELL,M,B,38,KANSAS CITY,MO
Details,BENNETT,VONTHILLA,MARIE,F,B,52,COLUMBIA,MO
Details,BERARDI,DEBORAH,EVA,F,W,55,COLUMBIA,MO
Details,BERETS,HARRISON,COLE,M,W,21,COLUMBIA,MO
Details,BERGESCH,CHRISTOPHER,JAMES,M,W,25,FULTON,MO
Details,BEUER,CHRISTOPHER,TODD,M,W,35,COLUMBIA,MO
Details,BLAIR,RYAN,WADE,M,W,25,MEXICO,MO
Details,BLAND,RANDY,LAMONT,M,B,22,SPRINGFIELD,MO
Details,BODINE,EDITH,JOYCE,F,W,41,MOBERLY,MO
Details,BOGART,COURTNEY,LEE,M,W,39,COLUMBIA,MO
Details,BONAPARTE,NATHANIEL,LEROY,M,B,38,COLUMBIA,MO
Details,BOWERS,ARRINGTON,LEE,M,B,50,COLUMBIA,MO
Details,BRADLEY,JAMIE,MICHELLE,F,W,32,COLUMBIA,MO
Details,BROOKINS,OSCAR,MENTER,M,B,28,COLUMBIA,MO
Details,BROOKINS,QUANTRELL,TRAVEION,M,B,26,COLUMBIA,MO
Details,BROWN,JARRELL,DORAN,M,B,19,KANSAS CITY,MO
Details,BUCHANAN,CHARLES,MARLEON,M,B,25,COLUMBIA,MO
Details,BURKS,JUSTIN,JAY,M,W,41,FULTON,MO
Details,BURTON,RICK,DONNELL,M,B,53,ST LOUIS,MO
Details,CANNELL,MAC,GARNETT,M,W,32,COLUMBIA,MO
Details,CARRASQUILLO-MARTINEZ,ANGEL,LUIS,M,H,35,JACKSON,MO
Details,CARTER,DARIAN,MAURICE,M,B,24,COLUMBIA,MO
Details,CARTER,DEMARCO,RAYDELL,M,B,21,COLUMBIA,MO
Details,CARTER,KORJANAE,FORSTEIN,F,B,23,COLUMBIA,MO
Details,CEFERINO,APAEZ,ESTELA,M,H,29,COLUMBIA,MO
Details,CHASE,JULIAN,FULLER,M,B,23,COLUMBIA,MO
Details,CLARKSON,RODNEY,ALEXANDER,M,B,25,COLUMBIA,MO
Details,CLAYTON,DEWAYNE,EDWARD,M,W,51,COLUMBIA,MO
Details,COATES,JOSHUA,EDWARD,M,W,35,ROCHEPORT,MO
Details,COLLINS,DENNIS,CARL,M,B,32,COLUMBIA,MO
Details,CONTRERAS,JOEY,,M,W,29,MARSHALL,MO
Details,CRUM,CHRISTOPHER,KHAN,M,B,37,COLUMBIA,MO
Details,CRUSBY-DANIELS,RAYMONDRE,MACK,M,B,19,FULTON,MO
Details,CUNNINGHAM,MYLONYO,TRAMIRE,M,B,24,COLUMBIA,MO
Details,DALTON,NEIL,LANCE,M,W,29,COLUMBIA,MO
Details,DAVIS,KAYLA,REBECCA,F,W,27,COLUMBIA,MO
Details,DEACON,DUSTIN,ODALE,M,W,24,COLUMBIA,MO
Details,DELVON,ANTONIO,KEVON,M,B,24,COLUMBIA,MO
Details,DENNY,DEANDRE,LAVELLE,M,B,20,COLUMBIA,MO
Details,DIAZ,MARIBEL,MARIE,F,H,18,COLUMBIA,MO
Details,DORE,ROBERT,LEE,M,W,29,COLUMBIA,MO
Details,DOWNS,WILLIAM,CHARLES,M,W,42,JEFFERSON CITY,MO
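Each output row above has nine fields — a leading "Details" link cell plus the eight data columns — while the header row lists only eight names, so the columns shift by one. A possible fix (a sketch, assuming the first cell is always that link) is to drop the first cell before writing, or to add a ninth header:

```python
import csv
import io

# A row shaped like the output above: the 'Details' link cell first,
# then the eight data fields the header names.
row = ["Details", "ADAM", "OMER", "SIRAJ", "M", "B", "29", "COLUMBIA", "MO"]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Last", "First", "Middle", "Gender", "Race", "Age", "City", "State"])
writer.writerow(row[1:])  # skip the link cell so the headers line up
```

Equivalently, slicing with row.findAll('td')[1:] in the scraping loop would drop the link cell at the source.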

errors in web scraping

c:\ProgramData\Anaconda3\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
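That message is a warning, not an error — the script still runs. Passing the parser name explicitly, as the warning suggests, silences it and pins the same behaviour on every machine. A minimal sketch using the standard library's html.parser:

```python
from bs4 import BeautifulSoup

html = "<table class='resultsTable'><tr><td>DOE</td></tr></table>"
# Naming the parser explicitly avoids the UserWarning and keeps the
# same parser everywhere, whether or not lxml is installed.
soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td").text
```

"lxml" can be named instead if identical lxml-based parsing is wanted everywhere it runs.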
