delelinus / scrape-and-analyze-ycombinator

My first project while transitioning into data engineering: collecting, wrangling, cleaning, and analysing/visualizing information about the companies listed on https://ycombinator.com/companies.

scrape-and-analyze-ycombinator's Introduction

Author

Ayanwoye Gideon Ayandele - [email protected]

Table of Contents

  1. Project Description.
  2. Web Scraping (ycombinator_scraper.ipynb/ycombinator_scraper.py)
  3. Data Wrangling and Exploration (EDA_ycombinator.ipynb)
  4. Analysis Summary
  5. Details of Charts
  6. References

March, 2022 - Scraping, Cleaning and Analyzing Companies Information as listed on Ycombinator

The motivation for this project was to complete a basic end-to-end data engineering project: collecting/scraping, wrangling, cleaning, and analysing/visualizing information about the companies listed on https://ycombinator.com/companies.

The project's main objectives were:

  1. Perform web scraping
  2. Do data wrangling (gathering, assessing and cleaning) on the crawled data.
  3. Store, analyze, and visualize the wrangled data.
  4. Reporting on:
    • data wrangling efforts.
    • data analysis and visualizations

The project was divided into two parts:

  1. Web Scraping (ycombinator_scraper.ipynb/ycombinator_scraper.py)
  2. Data Wrangling and Exploration (EDA_ycombinator.ipynb)

Web Scraping (ycombinator_scraper.ipynb/ycombinator_scraper.py)

The dependencies and third party libraries for the scraper include:

  • Selenium
  • BeautifulSoup
  • requests
  • numpy
  • pandas

  1. Using the Selenium library (the listing page is dynamic), I scraped the following data for all 1000 companies listed on https://ycombinator.com/companies:
    • The listed company name
    • The company's ycombinator page URL
    • The company location
    • The company's short description (Description head)

  2. I then visited each scraped company's ycombinator page URL using the requests library, since those pages are static, and collected further information as it appears for each company: the company's description, year founded, team size, company website URL, social media URLs, and management details.

Screenshot (217) Screenshot (218)

  3. At the end, I created a CSV file with the following columns, shown here with the Airbnb row as an example:
    • Company_Name: Airbnb
    • Company_Page_URL: https://www.ycombinator.com/companies/airbnb
    • Company_Location: San Francisco, CA, US,
    • Description_Head: Book accommodations around the world.
    • Website: http://airbnb.com
    • Description: Founded in August of 2008 and based in San Fra...
    • Founded: 2008
    • Team_Size: 5000
    • Linkedin_Profile: https://www.linkedin.com/company/airbnb/
    • Twitter_Profile: https://twitter.com/Airbnb
    • Facebook_Profile: https://www.facebook.com/airbnb/
    • Crunchbase_Profile: https://www.crunchbase.com/organization/airbnb
    • Active_Founder1: Nathan Blecharczyk\nNone\nhttps://twitter.com/...
    • Active_Founder2: Brian Chesky\nNone\nhttps://twitter.com/bchesky\n
    • Active_Founder3: Joe Gebbia\nNone\nhttps://twitter.com/jgebbia\n,
  4. The scraper runs for approximately 1.5 minutes with multithreading and approximately 7 minutes without it.
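
The gap between the multithreaded and single-threaded run times is what one would expect from I/O-bound work, where most of the time is spent waiting on HTTP responses. Below is a minimal sketch of that pattern using the standard library's `concurrent.futures`. The helper names `scrape_company` and `scrape_all`, the `html_length` field, the worker count, and the injected `fetch` callable are all illustrative assumptions and are not taken from the project's scraper.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def scrape_company(url: str, fetch: Callable[[str], str]) -> Dict[str, object]:
    """Fetch one company page and return a small record (hypothetical fields)."""
    html = fetch(url)
    return {"Company_Page_URL": url, "html_length": len(html)}


def scrape_all(
    urls: List[str],
    fetch: Callable[[str], str],
    max_workers: int = 16,  # assumed value, tune for the target site
) -> List[Dict[str, object]]:
    """Scrape many pages concurrently; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, even if the
        # underlying requests finish out of order.
        return list(pool.map(lambda u: scrape_company(u, fetch), urls))


if __name__ == "__main__":
    # Stand-in fetcher so the sketch runs offline. A real run might pass
    # something like `lambda u: requests.get(u, timeout=10).text` instead.
    fake_fetch = lambda u: "<html>" + u + "</html>"
    pages = ["https://www.ycombinator.com/companies/airbnb"]
    print(scrape_all(pages, fake_fetch))
```

Passing the fetcher in as an argument keeps the threading logic separate from the HTTP and parsing code, so it can be exercised without touching the network.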

Data Wrangling and Exploration (EDA_ycombinator.ipynb)

The dependencies and third party libraries for the EDA include:

  • numpy
  • pandas
  • matplotlib
  • seaborn

The data assessment and cleaning can be summarized as follows:

  • Three company names (Nash, Atlas and Streak) each appeared twice, but the paired records differed in their other characteristics, so the duplicates were left as they were.
  • Missing data were represented with NaN and were neither imputed nor removed, since they correspond to characteristics that do not apply to the particular company.
  • A new variable, Country_Of_Origin, was extracted from the Company_Location column, and another, Number_Of_Founders, was derived from Active_Founder1 through Active_Founder6.
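
The two derived variables could be computed roughly as follows. This is a sketch under assumptions, not the notebook's actual code: it assumes the country is the last non-empty comma-separated part of Company_Location (as in "San Francisco, CA, US,") and that a founder column counts when it holds a non-empty string. The function names are hypothetical.

```python
from typing import Mapping, Optional

# Column names as described above: Active_Founder1 ... Active_Founder6.
FOUNDER_COLUMNS = [f"Active_Founder{i}" for i in range(1, 7)]


def country_of_origin(location: object) -> Optional[str]:
    """Return the last non-empty comma-separated part of a location string."""
    if not isinstance(location, str):
        return None  # missing values (e.g. NaN floats) stay missing
    parts = [p.strip() for p in location.split(",") if p.strip()]
    return parts[-1] if parts else None


def number_of_founders(row: Mapping[str, object]) -> int:
    """Count the founder columns that hold a non-empty string."""
    count = 0
    for col in FOUNDER_COLUMNS:
        value = row.get(col)
        if isinstance(value, str) and value.strip():
            count += 1
    return count
```

With pandas, these helpers could be applied as `df["Company_Location"].map(country_of_origin)` and `df.apply(number_of_founders, axis=1)`.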

Analysis Summary

Using both univariate and bivariate analysis:

  • The most represented country is the USA, with 654 of the 1000 companies, followed by India, Canada, UK, Nigeria and Indonesia.

  • The more recently a company was founded, the more likely it is to be funded/listed by ycombinator.

  • The team size distribution is highly right-skewed, with a tail so long that the plot was very difficult to read. I resorted to bins of width 100 and set the plot's x-axis limit to 3000. Most team sizes are between 2 and 4.

  • The most common number of founders is 2, followed by 1 and 3.

  • There is no interesting relationship between country of origin and team size, number of founders, or year founded.

  • There is a weak, negative linear correlation between Number_Of_Founder and team size.
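
The binning used for the team size plot can be illustrated without any plotting library. The sketch below counts values per bin of width 100 and drops anything above 3000, mirroring the x-axis limit. The function name and the sample values are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def bin_team_sizes(
    sizes: Iterable[float], bin_width: int = 100, x_limit: int = 3000
) -> Dict[int, int]:
    """Count team sizes per bin, keyed by each bin's lower edge.

    Values above `x_limit` are left out, mirroring an x-axis limit on a plot.
    """
    counts: Counter = Counter()
    for size in sizes:
        # `size != size` is True only for NaN, which is skipped.
        if size != size or size < 0 or size > x_limit:
            continue
        counts[int(size // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    sample = [2, 4, 99, 100, 250, 5000, float("nan")]  # made-up values
    print(bin_team_sizes(sample))
```

In matplotlib, a comparable plot could be drawn with `plt.hist(sizes, bins=range(0, 3100, 100))` followed by `plt.xlim(0, 3000)`. Note that with a bin width of 100, the most common team sizes (2 to 4) all fall into the first bin.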

Details of Charts

  • Most represented country (Country_Of_Origin) on ycombinator: The most represented country is the USA, with 654 of the 1000 companies, followed by India, Canada, UK, Nigeria and Indonesia: Screenshot (209)

  • The distribution of the year founded: The more recently a company was founded, the more likely it is to be funded/listed by ycombinator: Screenshot (210)

  • The distribution of the team size: The team size distribution is highly right-skewed, with a tail so long that the plot was very difficult to read. I resorted to bins of width 100 and set the plot's x-axis limit to 3000. Most team sizes are between 2 and 4: Screenshot (211)

  • The distribution of Number_Of_Founder: The most common number of founders is 2, followed by 1 and 3: Screenshot (212)

There is no interesting relationship between country of origin and team size, number of founders, or year founded. There is also a weak, negative linear correlation between Number_Of_Founder and team size.

  • The relationship between Country_Of_Origin and year founded: Screenshot (213)

  • The relationship between Country_Of_Origin and team size: Screenshot (214)

  • The relationship between Country_Of_Origin and Number_Of_Founder: Screenshot (215)

  • The relationship between Number_Of_Founder and team size: Screenshot (216)
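
A "weak, negative linear correlation" is usually read off the Pearson correlation coefficient: a value below zero but close to it. The sketch below computes the coefficient from its definition. It is an illustration only; the function name is hypothetical and the project does not state which method was used.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

With pandas, calling `Series.corr` on the founder-count column with the team-size column as its argument computes the same coefficient, since Pearson is its default method.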

References
