simpler_eda

Overview

Exploratory data analysis (EDA) is an important step in any data analysis. However, carrying out EDA with the Altair package takes a fair amount of code, and it assumes working knowledge of grammar-of-graphics syntax and of which functions suit categorical versus numerical variables. The simpler_eda package addresses this by providing functions that produce categorical, numerical, and correlation plots with a single line of code. The plots can also be customized to the user's needs (theme, title, font, size, etc.), so users can spend more time analyzing the data set and less time configuring Altair plot settings.

Installation

$ pip install -i https://test.pypi.org/simple/ simpler_eda

Functions

This package contains three functions, each of which accepts a pandas DataFrame for EDA. The functions can be used with a data set that has numerical and categorical features. Each function has its own required and optional arguments to configure the properties of the plot.

  1. corr_map: Plots a correlation map from a data frame and a list of numerical features. Optional arguments control the color scheme, plot width and height, and plot title.

  2. numerical_eda: Takes a data frame and two numeric columns and produces either a scatter or a line plot to visualize the relationship between the two features. Users can optionally change the defaults for plot type, color, title, text size, and color scheme, and toggle a log transformation for the x and y axes.

  3. categorical_eda: Takes a data frame and one categorical feature and produces a histogram that visualizes the distribution of the feature. Users can instead plot the density of the feature by setting plot_type. The function also offers customization of color, plot title, font size, color scheme, plot size, opacity level, and facet factor.

How the simpler_eda package fits into the Python ecosystem

The simpler_eda package builds on the existing functions in the Altair library. Altair already includes many useful functions for visualizing the relationship between numerical and categorical features using grammar-of-graphics syntax, but writing them out is often cumbersome and error-prone. simpler_eda adds convenience by letting users perform EDA with a single line of code. A number of packages in the Python ecosystem already provide similar functionality, such as pandas-profiling, DataPrep, SweetViz, and HoloViews. However, most of them are not easily customizable. simpler_eda offers flexibility in plot type, color scheme, and plot title.

Dependencies

Dependencies

Usage

Correlation Plot

from vega_datasets import data

from simpler_eda.corr_map import corr_map

# Load the cars data set and plot correlations among four numerical features.
df = data.cars()
corr_map(df, ["Horsepower", "Displacement", "Cylinders", "Acceleration"])

Numerical Plot

from vega_datasets import data

from simpler_eda.numerical_eda import numerical_eda

# Scatter plot of two numerical features, colored by the Origin column.
numerical_eda(data.cars(),
              xval="Horsepower",
              yval="Acceleration",
              plot_type="scatter",
              color="Origin",
              title="Horsepower vs Acceleration",
              font_size=10)

Categorical Plot

from vega_datasets import data

from simpler_eda.categorical_eda import categorical_eda

# Histogram of Horsepower, colored by the Origin column.
cars = data.cars()
categorical_eda(data=cars,
                xval="Horsepower",
                color="Origin",
                title="Histogram of Horsepower in Different Origins",
                plot_height=200,
                plot_width=400,
                color_scheme="tableau10")

Documentation

The official documentation is hosted on Read the Docs: https://simpler_eda.readthedocs.io/en/latest/

Contributors

Development Lead

Contributor Name     GitHub Username
Cheuk (Chuck) Ho     ChuckHo777
Deepak Sidhu         deepaksidhu
Nicholas Wu          nichowu

We welcome and recognize all contributions. See the Contributing document for the contribution guide.

Credits

This package was created with Cookiecutter and the UBC-MDS/cookiecutter-ubc-mds project template, modified from the pyOpenSci/cookiecutter-pyopensci project template and the audreyr/cookiecutter-pypackage.

simpler_eda's Issues

Meeting Minutes and To-Do as discussed by the group

Meeting Date: March 10, 2021

Finished the continuous integration for the Python package:

  • Updated the plotting functions and tests to pass flake8 checks
  • Added code coverage to the build step
  • Successfully ran build.yml

Next Meeting: March 11, 2021
Time: 7:30 pm

To-do:

  • Deployment for the Python package
  • ReadTheDocs for function documentation
  • Talk about the R package in the meeting tomorrow

Meeting Minutes and To-Dos - March 3

Brief meeting minutes.
Meeting Minutes - March 3

  • We will use the cars data set as dummy data, since our functions are for plotting.
  • Need to review whether certain optional arguments of the 3 functions make sense; we may need to reduce them.
  • Come up with the unit tests and function code for the 3 functions in Python.

Python

  • Develop the unit tests (3-5 edge cases)
  • Come up with the function code

R

  • Creating the R project structure and adapting to the project - Chuck
  • Come up with the docstrings for the 3 functions

Next Meeting Time: March 4 7:30 (Zoom/Slack call)

To-do

  • team work contract
  • add contributors to README.md
  • Edit CONDUCT.md
  • edit CONTRIBUTING.md to reflect your strategy
  • edit README.md
  • function specification

Meeting Minutes

  • Came up with team contract
  • Decided on the project topic: Simpler EDA
    i. Correlation maps (input: features to be correlated; output: correlation maps) - Chuck
    ii. Exploratory plots for numerical features (input: x, y, color/fill, title, size of text, color; output: Altair/matplotlib objects, e.g. scatter plots) - Deepak
    iii. Exploratory plots for categorical features (input: x, y, color/fill, title, size of text, color; output: Altair/matplotlib objects, e.g. histograms or bar charts) - Nick
  • Creating the project structure with cookiecutter and poetry - Nick
  • Updating the documents: contributor.md, conduct.md, contributing.md - Chuck
  • Manage project task via GitHub Project and Issues
  • Don't need to fork the repo, but branch out for each task

Next Meeting: February 25, 2021 at 7 p.m.

Meeting Minutes (duplicate)

Brief meeting minutes.
Meeting Minutes - Feb 24

Came up with team contract
Link to team contract on google doc

Decided on the project topic:
Simpler EDA

  1. Correlation maps (input: features to be correlated; output: correlation maps) - Chuck

  2. Exploratory plots for numerical features (input: x, y, color/fill, title, size of text, color; output: Altair/matplotlib objects, e.g. scatter plots) - Deepak

  3. Exploratory plots for categorical features (input: x, y, color/fill, title, size of text, color; output: Altair/matplotlib objects, e.g. histograms or bar charts) - Nick

  • Creating the project structure with cookiecutter and poetry - Nick (done)

  • Updating the documents: contributor.md, conduct.md, contributing.md - Chuck

Manage project task via GitHub Project and Issues

Don’t need to fork the repo, but branch out for each task

GitHub Actions

  • Update and run build.yml
  • Update and run deploy.yml with added secrets

Milestone 3 Submission Information

Information regarding Milestone 3 submission:
Need to submit on canvas

  • The URL for each of your public projects’ repositories (complete functions in R and Python, with CI and deployment)
  • The URL to a release for each of your projects’ repositories (simpler_eda (Python) and RSimplerEda (R))

Milestone 2 Submission Information

Information regarding Milestone 2 submission:
Need to submit on canvas

  • The URL for each of your public projects’ repositories (complete functions in Python, and the R package structure)
  • The URL to a release for each of your projects’ repositories (simpler_eda (Python) and RSimplerEda (R))

Milestone 1 Submission information

Information regarding Milestone 1 submission:
Need to submit on canvas

  • The URL of your public project’s repo
  • The URL to a release on your project repo named v0.1.0
  • A link to your team work document that is accessible to the teaching team

Submission Deadline: February 27, 2021 by 11:59 P.M
