group_project_1_messy_data

Task: clean the data and summarise your findings in a 'one pager'

Here's your challenge for your first group project!

The deadline for finishing is Monday at noon. I will give you class time to work on this project. You should submit your one pager via the student portal AND deliver a short group presentation to your classmates.

You will be working with a data set hosted on Kaggle that has been scraped for you from the web, covering US data science hires in 2018 (i.e. pre-COVID!). The author wanted to look at some specific questions:

Who gets hired? What kind of talent do employers want when they are hiring a data scientist?

Which location has the most opportunities?

What skills, tools, degrees or majors do employers want the most for data scientists?

I think you can do more with this data set to summarise the insights and the process of data wrangling. The data is not easy to work with at the moment. Your main challenge will be to use Python to clean, wrangle and generally reshape the data to make it more straightforward to analyse. To visualise what you find, you can either export the data to a CSV and chart it in Excel, or explore the capabilities of Python to plot it.
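As a rough illustration of the export route, here is a minimal sketch that summarises a cleaned data frame and writes the result out as CSV text for charting in Excel. The `cleaned` frame and its `state` column are made up for the example; they are assumptions, not the real Kaggle columns.

```python
import io

import pandas as pd

# Made-up stand-in for your cleaned data; the real column names will differ.
cleaned = pd.DataFrame({"state": ["NY", "NY", "NY", "CA", "CA", "TX"]})

# Count postings per state, largest first.
postings_per_state = (
    cleaned.groupby("state")
    .size()
    .sort_values(ascending=False)
    .rename("postings")
    .reset_index()
)

# Write to an in-memory buffer here so the example leaves no files behind.
# In practice you would pass a file name instead, e.g.
# postings_per_state.to_csv("postings_per_state.csv", index=False)
buffer = io.StringIO()
postings_per_state.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# To stay in Python instead (requires matplotlib to be installed):
# postings_per_state.plot.bar(x="state", y="postings")
```

Either route starts from the same small summary table, so you can try both and compare.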

You will be in a group (2-3 students) to work on this project. As we are remote, this is an opportunity to get to know each other while applying your recently acquired skills in working with messy data. This is your first group project, so be reasonable in your expectations of what can be achieved in the timeframe and while working with new people!

The insights you find can be documented simply with screenshots of your data frames or downloaded images of charts, but I would like to see these accompanied by some simple annotation/text summarising both what you found AND how easy it was to get there. What we want from each group is a one pager, suitable for an infographic or blog page, describing what you learnt from the data and what its gaps or limitations are.

For inspiration on what sort of insights you might look into, you can see the web scraper's blog here : https://nycdatascience.com/blog/student-works/who-gets-hired-an-outlook-of-the-u-s-data-scientist-job-market-in-2018/

Some ideas for working successfully remotely with a group:

set up a co-working Zoom / Slack session
have an 'installation party' - getting started with the data all together, bring your own drinks and snacks
some of the group could try working primarily with Python/pandas, others can try with Excel - and compare what you find
split the task among you - maybe some of you are better at presentations, others at pandas or plotting
share a digital whiteboard to brainstorm ideas
agree on a shared communication method, e.g. Telegram / Slack, or co-work in a Zoom breakout room

Here's the data we will be working with:

Kaggle data source

HINT: You will first need to download the data as CSV file(s)
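Once the files are downloaded, loading them could look something like the sketch below. It reads every CSV in a folder and stacks them into one data frame. The helper name `load_csv_folder`, the file names and the columns are all hypothetical; the demo writes two tiny made-up files to a temporary folder so it runs without the real download.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_folder(folder):
    """Read every .csv in `folder` and stack the rows into one DataFrame."""
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        frame = pd.read_csv(path)
        frame["source_file"] = path.name  # remember which file each row came from
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Demo with two tiny made-up files standing in for the real downloads.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "city_a.csv").write_text(
        'position,location\nData Scientist,"New York, NY"\n'
    )
    Path(tmp, "city_b.csv").write_text(
        'position,location\nData Analyst,"Austin, TX"\n'
    )
    jobs = load_csv_folder(tmp)

print(jobs)
```

Keeping a `source_file` column is optional, but it makes it easier to trace odd rows back to the file they came from.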

Expected steps and outcome:

use the ALL data set you see in the Kaggle link, or practise combining the separate files into one data frame
employ string functions or regex (e.g. LIKE-style pattern matching, IF/ELSE logic) to extract common values from strings of different lengths, e.g. the job description
find insights by any combination of job profile, company, city, or area of the country
create new columns as needed to enhance the data source: for example, employ Boolean T/F logic to indicate which roles are closest to big financial or software centres in the US
make a decision about handling NULLs in the data - fill in values where logical; otherwise ignore them or remove them
any other data cleaning or wrangling tasks you find useful.
'one pager' summary - including insights, commentary, a review of how easy the data was to work with, and any limitations you found in the data set. This can be a PDF, a slide, a Word doc, etc., and can be as beautiful or as simple as you like. You will be sharing this with your classmates and the teaching team will provide feedback on your submissions. As you effectively have ONLY one page to make your case, you might start by identifying multiple trends and then scale back to focus on just one or two important ones. The main focus of the exercise is on working with messy data, so if you don't find any great data insights, you should feel free to take screenshots of your cleaning procedures and talk about them. One member of the group should host this one pager on Git / Google Drive / similar and submit the URL.
a short class presentation (aim for 5 minutes) involving all members of your group to talk through your method and findings.
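
To give a feel for the wrangling steps in the list above, here is a minimal pandas sketch covering NULL handling, regex extraction and a Boolean flag column. Everything in it is illustrative: the rows are invented, the column names (`position`, `location`, `description`) are assumptions rather than the real Kaggle schema, and `HUB_CITIES` is a hypothetical shortlist you would replace with your own definition of a financial or software centre.

```python
import pandas as pd

# Invented rows; column names are assumptions, not the real Kaggle schema.
jobs = pd.DataFrame({
    "position": ["Data Scientist", "Senior Data Analyst", None],
    "location": ["New York, NY 10001", "Austin, TX", "San Francisco, CA"],
    "description": [
        "Python and SQL required. PhD preferred.",
        "Excel, SQL, Tableau.",
        None,
    ],
})

# NULL handling: fill with a placeholder where that is a sensible choice.
jobs["position"] = jobs["position"].fillna("Unknown")
jobs["description"] = jobs["description"].fillna("")

# Regex extraction: pull city and two-letter state out of strings of
# different lengths (some locations carry a zip code, some do not).
parts = jobs["location"].str.extract(r"^(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})")
jobs = jobs.join(parts)

# Flag whether each description mentions a given skill (whole word, any case).
for skill in ["Python", "SQL"]:
    jobs[f"wants_{skill.lower()}"] = jobs["description"].str.contains(
        rf"\b{skill}\b", case=False, regex=True
    )

# Boolean T/F column: is the role in one of our chosen hub cities?
HUB_CITIES = {"New York", "San Francisco"}  # hypothetical shortlist
jobs["near_hub"] = jobs["city"].isin(HUB_CITIES)

print(jobs[["position", "city", "state", "wants_python", "wants_sql", "near_hub"]])
```

Filling a missing description with an empty string is only one option; dropping those rows, or keeping the NULLs and excluding them from specific counts, may suit your analysis better. Whichever you choose, note it in your one pager.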

Any questions? Reach out to the LT or TAs.

