Scraping Top Repositories for Topics at GitHub

Introduction

This is my first web scraping project. I wanted to know the top repositories for each topic on GitHub, but manually going through each topic page and searching for its top repositories seemed like a lot of hassle. Remembering which repository belongs to which topic was also a lot of work. So I automated the whole process with Python and web scraping.

Outline

  • First, we scrape https://github.com/topics.
  • Then we get a list of all topics on the first page. For each topic, we fetch the topic title, topic description and topic URL.
  • Next, we get a list of all the repositories on each topic page. For each repository, we fetch the repository name, owner name, stars and repository URL.
  • We create a CSV file with the info collected for all topics.
  • Finally, for each topic we create a CSV file with the info collected for its repositories, in the following format:

Repository Name,Username,Stars,URL
three.js,mrdoob,72500,https://github.com/mrdoob/three.js
libgdx,libgdx,18600,https://github.com/libgdx/libgdx
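The sample rows above can be reproduced with a small sketch. It makes two assumptions that are not stated in this README: that star counts appear on the page as abbreviated labels such as "72.5k" and need converting to integers, and that the rows are written with the standard-library csv module (used here only to keep the example dependency-free; the project itself uses Pandas). The function names parse_star_count and repos_to_csv are hypothetical.

```python
import csv
import io


def parse_star_count(text):
    """Convert a star label such as '72.5k' or '1,234' to an integer.

    Assumption: labels use a trailing 'k' for thousands.
    """
    text = text.strip().lower().replace(",", "")
    if text.endswith("k"):
        return int(round(float(text[:-1]) * 1000))
    return int(text)


def repos_to_csv(repos):
    """Serialize (name, username, stars_label, url) tuples to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Repository Name", "Username", "Stars", "URL"])
    for name, username, stars_label, url in repos:
        writer.writerow([name, username, parse_star_count(stars_label), url])
    return buffer.getvalue()


# Hypothetical input labels chosen to match the sample rows in this README.
sample = [
    ("three.js", "mrdoob", "72.5k", "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", "18.6k", "https://github.com/libgdx/libgdx"),
]
print(repos_to_csv(sample))
```

Storing stars as integers rather than labels keeps the column sortable once the CSV is loaded into a dataframe.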

Tools used

I used

  1. Python as the programming language, due to its wide range of libraries (modules).
  2. requests module to download the webpages.
  3. BeautifulSoup module to parse the HTML of the webpages.
  4. Pandas module to create dataframes of the collected data.
  5. os module to create folders and change the working directory.
  6. shutil module to delete directories.
  7. IPython.display module to import and display images.
  8. time module to pause the process.
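As a rough illustration of how the os, shutil and time modules from the list above might fit together, here is a minimal sketch. The folder name topic_data, the topic titles, the helper names and the delay value are all hypothetical, and the file contents are placeholders rather than scraped data. It also builds paths with os.path.join instead of changing the working directory, which is a choice made for this sketch.

```python
import os
import shutil
import tempfile
import time


def reset_output_dir(path):
    """Delete the folder at `path` if it exists, then recreate it empty."""
    if os.path.isdir(path):
        shutil.rmtree(path)
    os.makedirs(path)
    return path


def save_topics(topic_rows, output_dir, delay_seconds=1.0):
    """Write one CSV file per (title, csv_text) pair, pausing between topics."""
    written = []
    for index, (title, csv_text) in enumerate(topic_rows):
        if index > 0:
            # Stand-in for the pause a scraper would take between page downloads.
            time.sleep(delay_seconds)
        file_path = os.path.join(output_dir, title + ".csv")
        with open(file_path, "w", encoding="utf-8") as handle:
            handle.write(csv_text)
        written.append(file_path)
    return written


# Work inside a temporary folder so the sketch leaves nothing behind.
base_dir = tempfile.mkdtemp()
output_dir = reset_output_dir(os.path.join(base_dir, "topic_data"))
header = "Repository Name,Username,Stars,URL\n"
paths = save_topics([("3D", header), ("Ajax", header)], output_dir, delay_seconds=0.1)
print(sorted(os.listdir(output_dir)))
shutil.rmtree(base_dir)
```

Resetting the output folder before each run avoids mixing files from an earlier scrape with fresh ones.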

References

I generally don't remember all the syntax, and I don't try to. Before solving any problem, I first create a rough outline of what I want to do and how I will solve it. Then I look up the syntax online using Google. I really want to thank the websites that helped me with the code and syntax.
