ubc-mds / simpute-py

Python package for simple data imputation

License: MIT License

Python 60.35% Jupyter Notebook 39.65%

simpute-py's Introduction

What does it do?

Have you ever had a time when your missing data was holding you back? Well then this package is for you!

Our Python package for simple data imputation lets you quickly impute missing data (numeric, categorical, date/time, or boolean values), including in large datasets.

All you have to do is follow these four simple steps:

Import the package and the data you wish to impute
Select the function and method for imputation (this will depend on the data type - read the usage section below for more details)
Hit run
Save your newly imputed dataset

Our package will help simplify all your imputation needs so your data is ready when you need it!

Contributors & Maintainers

Installation

$ pip install simpute_py

Usage and Examples

We have four main functions dealing with each data type:

num_imputer: This function fills in the empty values of a numeric column with values derived from your selected imputation method. The method options are knn (values generated with KNN), mean, median and mode.
cat_imputer: This function fills in the empty values of a categorical column with the most frequent (mode) category.
bol_imputer: This function fills in the empty values of a boolean column with the most frequent (mode) boolean value.
date_imputer: This function fills in the empty values of a date column with the median point of the range of dates in that column.
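
To make the descriptions above concrete, here is a minimal sketch of the same ideas in plain pandas. It is not the package's implementation: the helper names `fill_numeric`, `fill_mode` and `fill_date_midpoint` are hypothetical, the KNN method is left out, and the date rule is read here as the midpoint between the earliest and latest date, which is an assumption.

```python
import pandas as pd


def fill_numeric(series: pd.Series, method: str = "mean") -> pd.Series:
    """Fill missing numeric values with the column mean, median, or mode."""
    if method == "mean":
        value = series.mean()
    elif method == "median":
        value = series.median()
    elif method == "mode":
        value = series.mode().iloc[0]  # first mode if there are ties
    else:
        raise ValueError(f"unsupported method: {method}")
    return series.fillna(value)


def fill_mode(series: pd.Series) -> pd.Series:
    """Fill missing categorical values with the most frequent value."""
    return series.fillna(series.mode().iloc[0])


def fill_date_midpoint(series: pd.Series) -> pd.Series:
    """Fill missing dates with the midpoint between the earliest and latest date."""
    dates = pd.to_datetime(series)
    midpoint = dates.min() + (dates.max() - dates.min()) / 2
    return dates.fillna(midpoint)


# Small made-up columns, one per data type.
deaths = fill_numeric(pd.Series([1.0, None, 3.0]), method="mean")
country = fill_mode(pd.Series(["USA", "USA", None, "Canada"]))
dates = fill_date_midpoint(pd.Series(["2020-01-01", None, "2020-01-03"]))
```

Each helper returns a new Series and leaves its input unchanged, so the original column is still available for comparison.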

To get started, first import our imputation functions:

from simpute_py.bol_imputer import bol_imputer #For imputing on boolean columns
from simpute_py.cat_imputer import cat_imputer #For imputing on categorical columns
from simpute_py.date_imputer import date_imputer #For imputing on date columns
from simpute_py.num_imputer import num_imputer #For imputing on numerical columns

To run the functions, simply enter the following:

import pandas as pd

# Load test data from the tests directory
test_df = pd.read_csv('tests/tesla_deaths_mini.csv')

# Impute each column with the function that matches its data type
test_df = bol_imputer(test_df, "Driver")
test_df = cat_imputer(test_df, "Country")
test_df = date_imputer(test_df, "Date")
test_df = num_imputer(test_df, "Deaths")

print(test_df)
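
After imputation you will usually want to confirm that no missing values remain and then save the result (step 4 above). The following is a minimal sketch using only pandas; `imputed_df` is a hypothetical stand-in for the data frame returned by the imputer functions, and the output file name in the comment is an assumption.

```python
import io

import pandas as pd

# Hypothetical stand-in for an imputed data frame.
imputed_df = pd.DataFrame({"Country": ["USA", "Canada"], "Deaths": [1, 2]})

# Count missing values per column; every count should be zero after imputation.
missing_counts = imputed_df.isna().sum()
print(missing_counts)

# Save the result. An in-memory buffer is used here so the sketch has no side
# effects; pass a file path instead, e.g.
# imputed_df.to_csv("imputed.csv", index=False), to write to disk.
buffer = io.StringIO()
imputed_df.to_csv(buffer, index=False)
```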

Place in the Python Ecosystem

Currently, there are many other ways to impute a dataset using functions built into Python and its libraries, but this package gathers them neatly into one place and simplifies the process. Other packages, such as AutoImpute and MIDASpy, are also available. However, our package aims to provide functionality not offered by either of them and to serve a more general audience.

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

License

simpute_py was created by Lisa Sequeira, Renee Kwon, Fujie Sun, and Ken Wang. It is licensed under the terms of the MIT license.

Credits

simpute_py was created with cookiecutter and the py-pkgs-cookiecutter template.


simpute-py's Issues

Update the name of the Test function for Boolean_imputer.

Milestone 1 Tasks

Assigned tasks for this milestone

Py cat_imputer function

create `CONTRIBUTOR.md`, `CONTRIBUTING.md`, `CONDUCT.md`

Fix Build Error

We need all green checks in the GitHub Actions build.

Milestone 3 Feedback

Congratulations on finishing milestone 3! We can see you put a lot of work into this project, nice work!
Below we list some specific feedback you can use to improve your project.

We provide tick boxes for you to use in the future as you address these concerns to improve the final grade of your project.
If anything is unclear, please feel free to ask questions in this issue thread.

R package

1. Write test cases and code iteratively

rubric={accuracy:20,quality:10,mechanics:10}

Good Job!

Python package checklist

1. GitHub actions workflow for continuous integration

rubric={mechanics:10}
Good Job!

2. GitHub actions workflow for continuous deployment

rubric={mechanics:10}
Good Job!

3. Documentation

rubric={reasoning:10}

The documentation built with Read the Docs is not available (no link on GitHub or in the README.md file) - reasoning -7
The documentation lacks a demonstration - reasoning -3

Comments: I am not able to find a link to Read the Docs. Although the README contains a usage section describing each function, it does not contain example code to demonstrate usage of these functions.

Specific expectations for this milestone

rubric={mechanics:10}
Good Job!

Submission instructions

rubric={mechanics:10}
Good Job!

Python project CI/CD

Make sure the "badge" appears

create README

create project and task board

Py bol_imputer function

KNN achievement

This is gonna be the "library functions" that other higher level functions will invoke.

Dummy Data Selection for Testing

For Milestone 2 we will need a dummy data file to use for all testing.

all_imputer function

Milestone 2 Feedback

Congratulations on finishing milestone 2! Nice work on the Python tests and on setting up your R repo! Below we list some specific feedback you can use to improve your project. We provide tick boxes for you to use in the future as you address these concerns to improve the final grade of your project. If anything is unclear, please feel free to ask questions in this issue thread.

Group 11

There isn't a unit test function for each specified function, named after the function being tested (e.g., if the function is named foo, then the unit test function is named test_foo). For example: test_bool_simpute_py.py for bol_imputer.py - mechanics -5
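
The naming convention described above can be sketched as follows. The `bol_imputer` defined here is a hypothetical stand-in so that the example is self-contained; in the real package the test would import the function instead, and the data is made up.

```python
import pandas as pd


def bol_imputer(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Hypothetical stand-in: fill missing booleans with the most frequent value."""
    out = df.copy()
    out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


def test_bol_imputer():
    """Named test_<function> so pytest collects it and the target is obvious."""
    # The nullable "boolean" dtype keeps True/False/missing in one typed column.
    df = pd.DataFrame({"Driver": pd.array([True, None, True, False], dtype="boolean")})
    result = bol_imputer(df, "Driver")
    assert result["Driver"].isna().sum() == 0
    assert bool(result["Driver"].iloc[1]) is True  # True is the most frequent value
```

Placed in a file such as tests/test_bol_imputer.py, the function is discovered automatically when pytest runs.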

Some functions aren't passing the unit tests. - accuracy -1

Ex.

There isn't a paragraph describing where your package fits into the R ecosystem. Are there any other R packages that have the same or similar functionality? Provide links to any that do. If none exist, then clearly state this as well. - mechanics -2

The functions are developed on the same branch, or the branches do not have meaningful names. - mechanics -2

Althrun is consistently committing to GitHub less than the other team members in both the R and Python repos. Remember to try your best to commit equal amounts. In conclusion, this group did a great job on Milestone 2! Looking forward to seeing how this project develops.

P.S. This milestone is out of 70.

January 19 Meeting Minutes

Today we discussed that users will need to specify which columns they would like treated as categorical variables.

Otherwise, there is no way to tell free text apart from categorical data.

We will add a new argument to all our functions. Columns that are not boolean, numerical, or user-specified categorical data will be left out of the kNN model used for imputation.
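
The column rule discussed above could look roughly like this. This is only a sketch under assumptions: the helper name `select_model_columns` and the argument name `categorical_cols` are hypothetical, and the data is made up.

```python
import pandas as pd


def select_model_columns(df: pd.DataFrame, categorical_cols: list[str]) -> list[str]:
    """Return columns usable by the model: boolean, numeric, or user-specified categorical."""
    typed = df.select_dtypes(include=["number", "bool"]).columns
    keep = set(typed) | set(categorical_cols)
    # Preserve the original column order and ignore names not present in df.
    return [col for col in df.columns if col in keep]


df = pd.DataFrame(
    {
        "Deaths": [1, 2],
        "Driver": [True, False],
        "Country": ["USA", "Canada"],
        "Description": ["free text", "more free text"],
    }
)
model_cols = select_model_columns(df, categorical_cols=["Country"])
```

Here the free-text Description column is dropped because it is neither numeric, boolean, nor listed by the user as categorical.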

Tasks

Lisa:

update readme with necessary arguments
update docstrings in our script file

Thank you,

Milestone 1 Feedback

Congratulations on finishing milestone 1! We can see you put a lot of work into this project, nice work!
Below we list some specific feedback you can use to improve your project.
We provide tick boxes for you to use in the future as you address these concerns to improve the final grade of your project.
If anything is unclear, please feel free to ask questions in this issue thread.

1. Teamwork contract (10 points)

Well done!

2. Create project structure for the python project (40 points)

Well done!

3. Function specifications (20 points)

You did not lose any points based on the rubric but please note that we expect the docstrings of the 4 functions (and the functions) to be different. It is not possible to recognize the difference among the functions by reading the docstrings.

4. Manage issues (10 points)

Well done!

5. Specific expectations for this milestone are: (10 points)

There are no branches created or the branches have names that are not meaningful/descriptive of the work being done on that branch - mechanics -2

Comments: it is stated in the milestone 1 description that each team member will

create a branch
work on the function you are responsible for in this branch

6. Submission instructions (10 points)

Well done!

create repo and project structure

Team Work Contract

Hello @LisaSeq @kenuiuc @Althrun-sun

I've drafted a team contract here - please comment here if you have any suggestions/changes you'd like to make or if you approve!

Thanks,

Renee

January 17 Lab - Meeting Minutes

We went over milestone 2 guidelines and discussed our next steps.

Discussions:

Separating script file using separate branches
Writing unit tests for each function
Finding a dummy data file for test functions (Renee)
Usage of project board and assigning of tasks

We will meet again on Thursday to finalize our week's tasks.

Py date_imputer function

License Choice

We are using the same MIT License as the R package counterpart.
The license choice discussions can be found here in the R package repo.
