simpute-py's Introduction

What does it do?

Have you ever had a time when your missing data was holding you back? Well then this package is for you!

Our Python package for simple data imputation lets you quickly fill in missing values (numeric, categorical, date/time or boolean), even in large datasets.

All you have to do is follow these four simple steps:

  1. Import the package and the data you wish to impute
  2. Select the function and method for imputation (this will depend on the data type - read the usage section below for more details)
  3. Hit run
  4. Save your newly imputed dataset

Our package will help simplify all your imputation needs so your data is ready when you need it!

Contributors & Maintainers

Installation

$ pip install simpute_py

Usage and Examples

We have four main functions, one for each data type:

  • num_imputer: Fills in the empty values of a numeric column with values derived from your selected imputation method. The available methods are knn (values estimated with k-nearest neighbours), mean, median and mode.
  • cat_imputer: Fills in the empty values of a categorical column with the most frequent (mode) category.
  • bol_imputer: Fills in the empty values of a boolean column with the most frequent (mode) boolean value.
  • date_imputer: Fills in the empty values of a date column with the median point of the range of dates in that column.
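
The strategies listed above can be sketched in plain Python. This is an illustration of the ideas only, not the package's implementation: `fill_missing` and `fill_missing_dates` are hypothetical helpers, missing values are represented as `None`, and "median point of the range" is assumed to mean the midpoint between the earliest and latest date.

```python
from collections import Counter
from datetime import date
from statistics import mean, median


def fill_missing(values, method="mean"):
    """Replace None entries with a statistic of the observed entries."""
    observed = [v for v in values if v is not None]
    if method == "mean":
        fill = mean(observed)
    elif method == "median":
        fill = median(observed)
    elif method == "mode":
        # Most frequent value; ties go to the value seen first.
        fill = Counter(observed).most_common(1)[0][0]
    else:
        raise ValueError(f"unsupported method: {method}")
    return [fill if v is None else v for v in values]


def fill_missing_dates(dates):
    """Replace None entries with the midpoint of the observed date range.

    Assumption: the midpoint between the earliest and latest date.
    """
    observed = [d for d in dates if d is not None]
    earliest, latest = min(observed), max(observed)
    midpoint = earliest + (latest - earliest) / 2
    return [midpoint if d is None else d for d in dates]


# Numeric column: median of the observed values
print(fill_missing([1, None, 2, 10], method="median"))
# Categorical or boolean column: most frequent value
print(fill_missing(["CA", None, "US", "CA"], method="mode"))
# Date column: midpoint of the observed range
print(fill_missing_dates([date(2023, 1, 1), None, date(2023, 1, 11)]))
```

The knn method is left out of this sketch because it depends on the other columns of the dataset rather than on a single column.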

To get started, first import our imputation functions:

from simpute_py.bol_imputer import bol_imputer #For imputing on boolean columns
from simpute_py.cat_imputer import cat_imputer #For imputing on categorical columns
from simpute_py.date_imputer import date_imputer #For imputing on date columns
from simpute_py.num_imputer import num_imputer #For imputing on numerical columns

To run the functions, simply enter the following:

import pandas as pd

#Load test data (path is relative to the project's root directory)
test_df = pd.read_csv('tests/tesla_deaths_mini.csv')

#Test functions
test_df = bol_imputer(test_df, "Driver")
test_df = cat_imputer(test_df, "Country")
test_df = date_imputer(test_df, "Date")
test_df = num_imputer(test_df, "Deaths")

print(test_df)

Place in the Python Ecosystem

There are currently many other ways to impute a dataset using functions built into the Python ecosystem, but this package gathers them neatly in one place and simplifies the process. Other packages, such as AutoImpute and MIDASpy, are also available. However, our package aims to provide functionality that neither of them offers and to serve a more general audience.

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

License

simpute_py was created by Lisa Sequeira, Renee Kwon, Fujie Sun, and Ken Wang. It is licensed under the terms of the MIT license.

Credits

simpute_py was created with cookiecutter and the py-pkgs-cookiecutter template.

simpute-py's People

Contributors

kenuiuc, renee-kwon, lisaseq, althrun-sun

simpute-py's Issues

Milestone 3 Feedback

Congratulations on finishing milestone 3! We can see you put a lot of work into this project, nice work!
Below we list some specific feedback you can use to improve your project.

We provide tick boxes for you to use in the future as you address these concerns to improve the final grade of your project.
If anything is unclear, please feel free to ask questions in this issue thread.

R package

1. Write test cases and code iteratively

rubric={accuracy:20,quality:10,mechanics:10}

Good Job!

Python package checklist

1. GitHub actions workflow for continuous integration

rubric={mechanics:10}
Good Job!

2. GitHub actions workflow for continuous deployment

rubric={mechanics:10}
Good Job!

3. Documentation

rubric={reasoning:10}

  • The documentation built with ReadtheDocs is not available (no link on GitHub or in the README.md file) - -7 reasoning
  • The documentation lacks a demonstration - -3 reasoning

Comments: I am not able to find a link to the ReadtheDocs page. Although the README contains a usage section describing each function, it does not contain example code demonstrating how to use these functions.

Specific expectations for this milestone

rubric={mechanics:10}
Good Job!

Submission instructions

rubric={mechanics:10}
Good Job!

KNN achievement

These will be the "library functions" that other, higher-level functions invoke.

Milestone 2 Feedback

Congratulations on finishing milestone 2! Nice work on the Tests on Python and setting up your R repo! 
Below we list some specific feedback you can use to improve your project.
We provide tick boxes for you to use in the future as you address these concerns to improve the final grade of your project.
If anything is unclear, please feel free to ask questions in this issue thread.

Group 11

There isn't a unit test function for each specified function, named after the function being tested (e.g., if the function is named foo then the unit test function is named test_foo). For example: test_bool_simpute_py.py for bol_imputer.py - mechanics -5
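
The naming convention described above can be sketched as follows. `bol_imputer_stub` is a hypothetical stand-in written for this example (it is not the package's `bol_imputer`), and the test name simply prefixes the function name with `test_`, which is also the pattern pytest uses to discover tests.

```python
def bol_imputer_stub(values):
    """Hypothetical stand-in: fill None with the most frequent boolean."""
    observed = [v for v in values if v is not None]
    fill = max(set(observed), key=observed.count)
    return [fill if v is None else v for v in values]


def test_bol_imputer_stub():
    """Named test_<function>, so the function under test is obvious."""
    result = bol_imputer_stub([True, None, True, False])
    assert result == [True, True, True, False]
    assert None not in result


# pytest would collect the test by name; it can also be called directly.
test_bol_imputer_stub()
```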

Some functions aren't passing the unit tests. - accuracy -1

Ex.

Screenshot 2023-01-23 at 1 56 29 PM

There isn't a paragraph describing where your package fits into the R ecosystem. Are there any other R packages that have the same or similar functionality? Provide links to any that do. If none exist, then clearly state this as well. - mechanics -2

The functions are developed on the same branch, or the branches do not have meaningful names. - mechanics -2

Althrun has consistently committed less to GitHub than the other team members in both the R and Python repos. Remember to try your best to commit equal amounts. In conclusion, this group did a great job for Milestone 2! Looking forward to seeing how this project develops.

P.S. This milestone is out of 70.

January 19 Meeting Minutes

Today we discussed that we will need users to specify which column they would like considered as a categorical variable.

There is no way to decipher what is text and what is categorical data otherwise.

We will add a new argument to all our functions. Columns that do not hold boolean, numerical, or (user-specified) categorical data will not be included in the kNN model used for imputation.
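
The column rule discussed above can be sketched in a few lines. This is only an illustration of the decision, not code from the package: `select_knn_columns`, the type labels, and the "Description" column are hypothetical.

```python
def select_knn_columns(column_types, categorical_columns=()):
    """Return the columns that would feed the kNN model.

    column_types maps a column name to "bool", "number", "text" or "date".
    A text column counts as categorical only when the user names it.
    """
    selected = []
    for name, kind in column_types.items():
        if kind in ("bool", "number"):
            selected.append(name)
        elif kind == "text" and name in categorical_columns:
            selected.append(name)
    return selected


# Hypothetical column layout; "Description" is free text the user did not
# mark as categorical, so it is left out of the model.
column_types = {
    "Deaths": "number",
    "Driver": "bool",
    "Country": "text",
    "Description": "text",
    "Date": "date",
}
print(select_knn_columns(column_types, categorical_columns=["Country"]))
```

Asking the user to name the categorical columns avoids guessing whether a text column holds categories or free text.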

Tasks

Lisa:

  • update readme with necessary arguments
  • update docstrings in our script file

Thank you,

Milestone 1 Feedback

Congratulations on finishing milestone 1! We can see you put a lot of work into this project, nice work!
Below we list some specific feedback you can use to improve your project.
We provide tick boxes for you to use in the future as you address these concerns to improve the final grade of your project.
If anything is unclear, please feel free to ask questions in this issue thread.


1. Teamwork contract (10 points)

Well done!

2. Create project structure for the python project (40 points)

Well done!

3. Function specifications (20 points)

You did not lose any points based on the rubric but please note that we expect the docstrings of the 4 functions (and the functions) to be different. It is not possible to recognize the difference among the functions by reading the docstrings.

4. Manage issues (10 points)

Well done!

5. Specific expectations for this milestone are: (10 points)

  • There are no branches created or the branches have names that are not meaningful/descriptive of the work being done on that branch - mechanics -2

Comments: it is stated in the milestone 1 description that each team member will

  • create a branch

  • work on the function you are responsible for in this branch

6. Submission instructions (10 points)

Well done!

January 17 Lab - Meeting Minutes

We went over milestone 2 guidelines and discussed our next steps.

Discussions:

  • Separating script file using separate branches
  • Writing unit tests for each function
  • Finding a dummy data file for test functions (Renee)
  • Usage of project board and assigning of tasks

We will meet again on Thursday to finalize our week's tasks.

License Choice

  • We are using the same MIT License as the R package counterpart.
  • The license choice discussions can be found here in the R package repo.
