Scraping Top Repositories for Topics at GitHub

Introduction

This is my first web scraping project. I wanted to know about all the top repositories for each topic on GitHub. But it seemed like a lot of hassle to manually go through each topic page on GitHub and search for it's top repositories. Also to remember which repository blongs to which topic was a lot of work for me. So I tried to automate the whlole process using the mighty Python and web scraping to do the work for me.

Outline

First we scrape through https://github.com/topics
Then we get a list of all topics on the first page. For each topic, we fetch topic title, topic description and topic url
Later we get a list of all the repositories on each topic page. For each repository, we fetch repository name, owner name, stars and repository url
We create a CSV file of all the topics info collected
At last we create a CSV file of all repositories info collected for each topic like the following :-

Repository Name,Userame,Stars,URL

three.js,mrdoob,72500,https://github.com/mrdoob/three.js

libgdx,libgdx,18600,https://github.com/libgdx/libgdx

Tools used

I used

Python as the programming language due to its wide range of libraries (modules).
requests module to download the webpages.
BeautifulSoup module to scrape through the HTML codes of webpages.
Pandas module to create dataframes of collected data
os module to create folder and change path of directory
shutil module to delete directory
IPython.display module to import and display images
time module to sleep process

References

I generally don't remember all codes and neither do I try to. Before solving any problem, I first create a rough outline of what I want to do and how I will solve it. Then I look for the syntax online using Goolge. I really want to thank saome websites which helped me with the codes and syntax :-

Stack Overflow
Geeks For Geeks
MDN Web Docs
W3Schools
There are multiple videos on YouTube
Also if you are a beginner and want to learn python from it's roots, I suggest you check the free course PY4E by Chuck Severance.

wwwediai / web_scraping_bs4 Goto Github PK

web_scraping_bs4's Introduction

Scraping Top Repositories for Topics at GitHub

Introduction

Outline

Tools used

References

web_scraping_bs4's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent