
first-web-scraper's Introduction

First Web Scraper

A step-by-step guide to writing a web scraper with Python.

Contributing to the documentation

After cloning the repository, the Sphinx documentation can be edited in the docs directory and published to ReadTheDocs by pushing changes to the master branch.

First install the requirements.

pipenv install

Fire up the test server, which will automatically update to show changes made to the reStructuredText files in the docs directory.

make docs

Open http://localhost:8000 in your browser and start making changes.

first-web-scraper's People

Contributors

artisdom, cjdd3b, dependabot[bot], jackiekazil, lozadaomr, mattwynn1, michaelaharvey, palewire, schwanksta, sisiwei, stucka


first-web-scraper's Issues

soup syntax error

Guys,

Whenever I use Beautiful Soup I get this syntax error. I'm a very new user of Python and I just don't understand it.
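The report doesn't include a traceback, so this is only a guess: a common cause of a SyntaxError when following older scraping tutorials is running Python 2 print statements under Python 3.

```python
# Python 2 syntax from older tutorials raises a SyntaxError on Python 3:
#     print table
# Python 3 requires print to be called as a function:
table = "<table>...</table>"  # hypothetical value for illustration
print(table)
```

Posting the full traceback would make the actual cause clear.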

Alas! Boone County has changed their website.

I'll try to carve out some time to go through the tutorial later today and update things, but now it's a Java site -- STILL WITH A TABLE! Just wanted to ping everyone and make sure you all knew.

I have a former student going through the tutorial on her own and sending me questions as she goes, so I can probably take up the updating.

BeautifulSoup and differences for Python 3.x

Thank you for the tutorial! I did it with Python 3, so there were a few differences. There has also been a change to BeautifulSoup that affected my install, at least.

Here's what I found:

  • I hit a problem installing Beautiful Soup, which was solved by installing 'beautifulsoup4' instead. The biggest difference I noticed was the import, as noted below.
  • Instead of having to remove &nbsp;, a different character showed up.
  • The csv module works differently in Python 3, but it wasn't hard to find the solution.

Below is the final code with notes on the key differences I encountered. Thanks again!

import csv
import requests

# Installing 'beautifulsoup' failed on my Mac. The Beautiful Soup page
# recommends installing 'beautifulsoup4' instead, which worked. When
# importing, use bs4 as shown below.

from bs4 import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'resultsTable'})

list_of_rows = []
for row in table.findAll('tr')[1:]:
    list_of_cells = []
    for cell in row.findAll('td'):
        # &nbsp; wasn't a problem on this page, but '\xa0' was. Simple
        # enough to swap the two, but I wonder -- how would text.replace
        # work for more than one problem character?
        text = cell.text.replace('\xa0', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

# Here's the key difference for Python 3.x: open the file in text mode
# with newline=''. Found this with a quick search on Stack Overflow.

filename = 'inmates.csv'
with open(filename, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["Last", "First", "Middle", "Gender", "Race", "Age", "City", "State"])
    writer.writerows(list_of_rows)
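On the question raised in the comment above — stripping more than one problem character — one approach (a sketch, not part of the tutorial) is to chain replacements, or to list the characters in a single regex character class:

```python
import re

def clean_cell(text):
    # Remove the HTML entity first, then every character listed in the
    # regex class in one pass: the non-breaking space (\xa0) and, as an
    # example of extending the class, the zero-width space (\u200b).
    text = text.replace('&nbsp;', '')
    return re.sub(r'[\xa0\u200b]', '', text)

cleaned = clean_cell('COLUMBIA\xa0')
```

Adding another troublesome character is then a one-character change to the class.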

beautifulsoup4 instead of beautifulsoup?

This is not an issue per se, merely a suggestion to switch to beautifulsoup4, since only critical bugs get attention in the case of beautifulsoup3.

showmeboone.com is frequently inaccessible.

I would suggest a different website/URL for demonstrating web scraping. I can probably write one by this weekend, hosted on one of my servers, allowing web scraping without breaking any laws or terms of use.
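In the meantime, a page you control can be served locally with nothing but the standard library. This is a minimal sketch (the file name and port are arbitrary): write a static HTML table to disk and serve the current directory with http.server, giving a scraping target with no terms-of-use concerns.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler
import threading

# A tiny stand-in for the jail-roster table the tutorial scrapes.
PAGE = """<table class="resultsTable">
<tr><th>Last</th><th>First</th></tr>
<tr><td>DOE</td><td>JANE</td></tr>
</table>"""

with open("practice.html", "w") as f:
    f.write(PAGE)

# Serve the current directory in a background thread.
server = HTTPServer(("127.0.0.1", 8013), SimpleHTTPRequestHandler)
thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()
# The table is now reachable at http://127.0.0.1:8013/practice.html
```

Pointing the tutorial's requests.get() at that local URL exercises the same scraping steps without touching a third-party site.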

Header fails in the last step.

I'm not sure what I'm missing. Everything is working except the last step, the header. Any suggestions?

import csv
import requests
from BeautifulSoup import BeautifulSoup

url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('tbody', attrs={'class': 'stripe'})

list_of_rows = []
for row in table.findAll('tr')[1:]:
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

outfile = open("./inmates.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Last", "First", "Middle", "Gender", "Race", "Age", "City", "State"])
writer.writerows(list_of_rows)

Here is the output in text. Same results in excel.

Details,ADAM,OMER,SIRAJ,M,B,29,COLUMBIA,MO
Details,ALEXANDER,BENJAMIN,FRANKLIN,M,B,22,COLUMBIA,MO
Details,AMAN,JUSTIN,TYLER,M,B,36,COLUMBIA,MO
Details,ANDREWS,JOSEPH,DAMON,M,W,39,COLUMBIA,MO
Details,ARTEAGA,BRAYAN,OSIRIS-CACHO,M,H,25,JEFFERSON CITY,MO
Details,AUSTIN,KAY,CEE,F,W,33,KANSAS CITY,MO
Details,AVALOS-AVALOS,JOSE,,M,H,18,ST.ANN,MO
Details,BARRETT,NICHOLE,RAYSHEA,F,B,31,COLUMBIA,MO
Details,BENNETT,EMAS,CASSELL,M,B,38,KANSAS CITY,MO
Details,BENNETT,VONTHILLA,MARIE,F,B,52,COLUMBIA,MO
Details,BERARDI,DEBORAH,EVA,F,W,55,COLUMBIA,MO
Details,BERETS,HARRISON,COLE,M,W,21,COLUMBIA,MO
Details,BERGESCH,CHRISTOPHER,JAMES,M,W,25,FULTON,MO
Details,BEUER,CHRISTOPHER,TODD,M,W,35,COLUMBIA,MO
Details,BLAIR,RYAN,WADE,M,W,25,MEXICO,MO
Details,BLAND,RANDY,LAMONT,M,B,22,SPRINGFIELD,MO
Details,BODINE,EDITH,JOYCE,F,W,41,MOBERLY,MO
Details,BOGART,COURTNEY,LEE,M,W,39,COLUMBIA,MO
Details,BONAPARTE,NATHANIEL,LEROY,M,B,38,COLUMBIA,MO
Details,BOWERS,ARRINGTON,LEE,M,B,50,COLUMBIA,MO
Details,BRADLEY,JAMIE,MICHELLE,F,W,32,COLUMBIA,MO
Details,BROOKINS,OSCAR,MENTER,M,B,28,COLUMBIA,MO
Details,BROOKINS,QUANTRELL,TRAVEION,M,B,26,COLUMBIA,MO
Details,BROWN,JARRELL,DORAN,M,B,19,KANSAS CITY,MO
Details,BUCHANAN,CHARLES,MARLEON,M,B,25,COLUMBIA,MO
Details,BURKS,JUSTIN,JAY,M,W,41,FULTON,MO
Details,BURTON,RICK,DONNELL,M,B,53,ST LOUIS,MO
Details,CANNELL,MAC,GARNETT,M,W,32,COLUMBIA,MO
Details,CARRASQUILLO-MARTINEZ,ANGEL,LUIS,M,H,35,JACKSON,MO
Details,CARTER,DARIAN,MAURICE,M,B,24,COLUMBIA,MO
Details,CARTER,DEMARCO,RAYDELL,M,B,21,COLUMBIA,MO
Details,CARTER,KORJANAE,FORSTEIN,F,B,23,COLUMBIA,MO
Details,CEFERINO,APAEZ,ESTELA,M,H,29,COLUMBIA,MO
Details,CHASE,JULIAN,FULLER,M,B,23,COLUMBIA,MO
Details,CLARKSON,RODNEY,ALEXANDER,M,B,25,COLUMBIA,MO
Details,CLAYTON,DEWAYNE,EDWARD,M,W,51,COLUMBIA,MO
Details,COATES,JOSHUA,EDWARD,M,W,35,ROCHEPORT,MO
Details,COLLINS,DENNIS,CARL,M,B,32,COLUMBIA,MO
Details,CONTRERAS,JOEY,,M,W,29,MARSHALL,MO
Details,CRUM,CHRISTOPHER,KHAN,M,B,37,COLUMBIA,MO
Details,CRUSBY-DANIELS,RAYMONDRE,MACK,M,B,19,FULTON,MO
Details,CUNNINGHAM,MYLONYO,TRAMIRE,M,B,24,COLUMBIA,MO
Details,DALTON,NEIL,LANCE,M,W,29,COLUMBIA,MO
Details,DAVIS,KAYLA,REBECCA,F,W,27,COLUMBIA,MO
Details,DEACON,DUSTIN,ODALE,M,W,24,COLUMBIA,MO
Details,DELVON,ANTONIO,KEVON,M,B,24,COLUMBIA,MO
Details,DENNY,DEANDRE,LAVELLE,M,B,20,COLUMBIA,MO
Details,DIAZ,MARIBEL,MARIE,F,H,18,COLUMBIA,MO
Details,DORE,ROBERT,LEE,M,W,29,COLUMBIA,MO
Details,DOWNS,WILLIAM,CHARLES,M,W,42,JEFFERSON CITY,MO
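Each output row above has nine fields — a leading "Details" link cell plus the eight data columns — while the header row lists only eight names, so the columns shift by one. A possible fix (a sketch, assuming the first cell is always that link) is to drop the first cell before writing, or to add a ninth header:

```python
import csv
import io

# A row shaped like the output above: the 'Details' link cell first,
# then the eight data fields the header names.
row = ["Details", "ADAM", "OMER", "SIRAJ", "M", "B", "29", "COLUMBIA", "MO"]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Last", "First", "Middle", "Gender", "Race", "Age", "City", "State"])
writer.writerow(row[1:])  # skip the link cell so the headers line up
```

Equivalently, slicing with row.findAll('td')[1:] in the scraping loop would drop the link cell at the source.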

errors in web scraping

c:\ProgramData\Anaconda3\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
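That message is a warning, not an error — the script still runs. Passing the parser name explicitly, as the warning suggests, silences it and pins the same behaviour on every machine. A minimal sketch using the standard library's html.parser:

```python
from bs4 import BeautifulSoup

html = "<table class='resultsTable'><tr><td>DOE</td></tr></table>"
# Naming the parser explicitly avoids the UserWarning and keeps the
# same parser everywhere, whether or not lxml is installed.
soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td").text
```

"lxml" can be named instead if identical lxml-based parsing is wanted everywhere it runs.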
