How to Read HTML Tables With Pandas

How to Read HTML Tables With Pandas
Install Pandas
Web scraping with Pandas
Saving Data to CSV

Pandas library is made for handling data with columns and rows. Often, you would need to scrape HTML tables from web pages.

This guide demonstrates how to read HTML tables with pandas with a few simple steps.

Install Pandas

To install pandas, we recommend that you use Anaconda. Alternatively, you can install pandas without Anaconda using PIP. You can also install Jupyter Notebook with PIP as follows:

pip install pandas
pip install notebook

Web scraping with Pandas

Importing Pandas

import pandas as pd

Reading Tables from Webpage

Use the read_html function to parse tables from a webpage. This function returns a list of DataFrames

url = 'https://en.wikipedia.org/wiki/List_of_wealthiest_Americans_by_net_worth'
dfs = pd.read_html(url)
df = dfs[0]

You can use a different parser, such as BeautifulSopup by setting flavor='bs4'

dfs = pd.read_html(url, flavor='bs4')

Preview Results

df.head()

Parsing Dates

In this example, the date contains other info that needs to be cleaned up:

df['Date of birth(age)'] = df['Date of birth(age)'].str.replace(r'\(.*\)', '', regex=True)

Next, convert this obj datatype to a datetime64 datatype as follows:

df['Date of birth(age)'] = pd.to_datetime(df['Date of birth(age)'])

Locating Specific Tables

You can use the match parameter to find only the tables that contain the desired text.

url = 'https://en.wikipedia.org/wiki/The_World%27s_Billionaires'
dfs = pd.read_html(url, flavor='bs4', match='Source\(s\) of wealth')

Scraping a Specific Column

read_html will return the entire table in a data frame. To get a specific column, use pandas filtering as follows:

df[['Name']]

Skipping a Row

See the following example:

url = 'https://en.wikipedia.org/wiki/Billionaire'
dfs = pd.read_html(url, flavor='bs4',match='known billionaires')

Usually, if you want to skip rows, you can use the skiprows parameter:

dfs = pd.read_html(url, skiprows=1)

In this case, we will have to remove one header row as follows:

df.droplevel(0,axis=1)

Saving Data to CSV

Use the to_csv method of the data frame object:

df.to_csv('file_name.csv',index=False)

oxylabs / pandas-read-html-2 Goto Github PK

pandas-read-html-2's Introduction

How to Read HTML Tables With Pandas

Install Pandas

Web scraping with Pandas

Importing Pandas

Reading Tables from Webpage

Preview Results

Parsing Dates

Locating Specific Tables

Scraping a Specific Column

Skipping a Row

Saving Data to CSV

pandas-read-html-2's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

oxylabs / pandas-read-html-2 Goto Github PK

pandas-read-html-2's Introduction

How to Read HTML Tables With Pandas

​Install Pandas

Web scraping with Pandas

Importing Pandas​

Reading Tables from Webpage

Preview Results

Parsing Dates

Locating Specific Table​s

Scraping a Specific Column

Skipping a Row

Saving Data to CSV

pandas-read-html-2's People

Contributors

Watchers

Recommend Projects

Recommend Topics

Recommend Org

Install Pandas

Importing Pandas

Locating Specific Tables