In this challenge, we are going to grab a live data set from the City of Chicago, parse through the columns, clean the data as it comes in, and insert it into the database. After that, we'll query the database and find some important data points.
Visit the data portal and download the CSV. There should be a pretty significant number of rows!
Create a database on your computer called chicago_salaries
using the PSQL command. Using Psycopg, we want to create an employees
table with the following columns:
- first_name
- last_name
- job_title
- full_or_part_time
- department
- annual_salary. Note: If an employee is an hourly employee, calculate their annual salary by this equation:
hours_per_week * hourly_salary * 50
(50 weeks a year)
In this release, read the CSV with Python, iterate over each row, clean the data, and insert the cleaned record into the database. Note the calculation for the annual salary we mentioned above.
Keep an eye out for two things:
- We want first name and last name split into the database but the City of Chicago just has one name field. Is there an easy way to parse out the first and last name using what we learned so far?
- CSV's read numbers as strings
Now that we have a database full of employees and their salaries, let's query our database:
- Find the employee being paid the most
- Find the employee being paid the least
- Find the department with the highest average salary
- Find the department with the lowest average salary
- Find the average salary difference between full time and part time workers
- Find the most common first name
- Find the most common last name
- If there are people with the same name, find what their job titles, departments, and annual salaries are