

group_project_1_messy_data


Task: clean the data and summarise your findings in a 'one pager'

Here's your challenge for your first group project!

The deadline for finishing is Monday at noon. I will give you class time to work on this project. You should submit your one pager via the student portal AND deliver a short group presentation to your classmates.

You will be working with a data set hosted on Kaggle, scraped from the web, about US data science hires in 2018 (i.e. pre-covid!). The author wanted to look at some specific questions:

Who gets hired? What kind of talent do employers want when they are hiring a data scientist?

Which location has the most opportunities?

What skills, tools, degrees or majors do employers want the most for data scientists?

I think you can do more with this data set to summarise the insights and the process of data wrangling. The data is not easy to work with at the moment. Your main challenge will be to use Python to clean, wrangle and generally reshape the data to make it more straightforward to analyse. To visualise what you find, you can either export the data to a CSV and chart it in Excel, or explore the capabilities of Python to plot the data.
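
As a minimal sketch of the export route, assuming pandas is installed: the snippet below counts postings per city and writes the summary as CSV text ready to chart in Excel. The column names `position` and `city` and the sample rows are made up for illustration; the real data set may use different names.

```python
import io

import pandas as pd

# Hypothetical cleaned data; the real column names may differ.
jobs = pd.DataFrame({
    "position": ["Data Scientist", "Data Analyst", "Data Scientist"],
    "city": ["New York", "Boston", "New York"],
})

# Count postings per city, most common first.
by_city = (
    jobs["city"]
    .value_counts()
    .rename_axis("city")
    .reset_index(name="postings")
)

# Write the summary as CSV. A StringIO buffer stands in for a file here;
# in practice you would pass a path such as "postings_by_city.csv".
buffer = io.StringIO()
by_city.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

The resulting file has one row per city, which is usually the shape Excel needs for a bar chart.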

You will work in a group (2-3 students) on this project. As we are remote, this is an opportunity to get to know each other while applying your recently acquired skills to messy data. This is your first group project, so be reasonable in your expectations of what can be achieved in the timeframe while working with new people!

The insights you find can be documented simply with screenshots of your data frames or downloaded images of charts, but I would like to see these accompanied by some simple annotation/text summarising both what you found AND how easy it was to get to. What we want from each group is a one pager, suitable for an infographic or blog page, describing what you learnt from the data and what its gaps or limitations are.

For inspiration on what sort of insights you might look into, you can see the web scraper's blog here: https://nycdatascience.com/blog/student-works/who-gets-hired-an-outlook-of-the-u-s-data-scientist-job-market-in-2018/

Some ideas for working successfully remotely with a group:

  • set up a co-working Zoom / Slack session

  • have an 'installation party': get started with the data all together, and bring your own drinks and snacks

  • some of the group could try working primarily with Python/pandas, others could try with Excel, and then compare what you find

  • split the task among you: maybe some of you are better at presentations, others at pandas or plotting

  • share a digital whiteboard to brainstorm ideas

  • agree on a shared communication method, e.g. Telegram or Slack, or co-work in a Zoom breakout room

Here's the data we will be working with:

Kaggle data source

HINT: You will first need to download the data as CSV file(s).
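
Once the files are downloaded, a rough sketch of loading and stacking them with pandas could look like this. The two in-memory "files" and their columns (`position`, `company`, `location`) are placeholders for illustration, not the actual Kaggle files; with real downloads you would pass the file paths to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-ins for two downloaded CSV files (hypothetical contents).
file_a = io.StringIO(
    'position,company,location\n'
    'Data Scientist,Acme,"New York, NY"\n'
)
file_b = io.StringIO(
    'position,company,location\n'
    'Data Analyst,Globex,"Boston, MA"\n'
)

# Read each file into its own data frame.
frames = [pd.read_csv(f) for f in (file_a, file_b)]

# Stack the frames row-wise and renumber the index from 0.
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```

This only works cleanly when the files share the same column names, so it is worth checking `frame.columns` for each file before concatenating.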

Expected steps and outcome:

  • You can use the ALL data set you see in the Kaggle link, or practise combining the separate files into one data frame

  • employ string functions or REGEX (e.g. LIKE-style pattern matching, IF/ELSE logic) to extract common values from strings of different lengths, e.g. the job description

  • insights by any combination of job profile, company, location city, area of the country

  • create new columns as needed to enhance the data source: for example employ Boolean T/F logic to indicate which roles are closest to big financial or software centres in the US

  • decide how to handle NULLs in the data: fill in values where that is logical; otherwise ignore them or clean them out

  • any other data cleaning or wrangling tasks you find useful.

  • 'one pager' summary, including insights, commentary, a review of how easy the data was to work with, and any limitations you found in the data set. This can be a PDF, slide, Word doc, etc., and it can be as beautiful or as simple as you like. You will be sharing this with your classmates, and the teaching team will provide feedback on your submissions. As you effectively have ONLY one page to make your case, you might start by identifying multiple trends and then scale back to focus on just one or two important ones. The main focus of the exercise is on working with messy data, so if you don't find any great data insights, feel free to take screenshots of your cleaning procedures and talk about them. One member of the group should host this one pager on Git / Google Drive / similar and submit the URL.

  • a short class presentation (aim for 5 minutes) involving all members of your group to talk through your method and findings.
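
The wrangling steps listed above (regex extraction, Boolean columns, NULL handling) might be sketched as follows. This is only an illustration under assumptions: the column names, the sample rows, the `HUB_STATES` set and the choice to fill missing review counts with 0 are all made up here, and your group should decide what fits the real data.

```python
import pandas as pd

# Hypothetical sample rows; the real data set may use other column names.
jobs = pd.DataFrame({
    "position": ["Data Scientist", "Data Analyst", None],
    "location": ["New York, NY 10001", "Austin, TX", None],
    "description": [
        "Requires Python and SQL. PhD preferred.",
        "Excel and sql skills; Bachelor's degree.",
        None,
    ],
    "reviews": [120.0, None, None],
})

# NULL handling, part 1: drop rows where every column is missing.
jobs = jobs.dropna(how="all").copy()

# Regex extraction: split "City, ST 12345" into city and two-letter state.
loc_parts = jobs["location"].str.extract(r"^([^,]+),\s*([A-Z]{2})")
jobs["city"] = loc_parts[0]
jobs["state"] = loc_parts[1]

# Regex flags: does the description mention a skill (case-insensitive)?
jobs["mentions_python"] = jobs["description"].str.contains(
    r"\bpython\b", case=False, regex=True, na=False
)
jobs["mentions_sql"] = jobs["description"].str.contains(
    r"\bsql\b", case=False, regex=True, na=False
)

# Boolean column: is the role in a state we treat as a finance/software hub?
# This set is an assumption for illustration only.
HUB_STATES = {"NY", "CA", "WA"}
jobs["near_hub"] = jobs["state"].isin(HUB_STATES)

# NULL handling, part 2: assume a missing review count means no reviews.
jobs["reviews"] = jobs["reviews"].fillna(0)

print(jobs[["city", "state", "mentions_python", "near_hub", "reviews"]])
```

Whether filling with 0 is sensible depends on what a missing value means in the source; noting that decision is exactly the kind of limitation worth mentioning in the one pager.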

--- Any questions? Reach out to the LT or TAs.

Contributors: tonyhathuc
