daguito81 / ie_koala

simple dataframe implementation for Adv. Python class

Python 100.00%

ie_koala's Introduction

ie_pandas

MBD 2018 - Adv.Python - Group K

Dagoberto Romer
Meredith Hayward
Andrea Salvati
Moritz Steinbrecher

ie_pandas is a Python library for working with DataFrames. It supports creating a DataFrame object from a dictionary of data. Different columns can hold different data types, but each column holds exactly one data type.

Features

The current functionality of the library is:

Import data from a dictionary of NumPy arrays or a dictionary of lists (int, float, bool and string are supported)
String-based bracket indexing, Ex: df['sales'], which returns the respective column as a NumPy array
String and integer based indexing using the df.loc() method, Ex: df.loc(2, 3), where the arguments are (column, row)
Aggregation functions (sum, min, max, mean, median and std are supported)
Modifying and adding columns by assigning NumPy arrays to indexed columns, Ex: df['int'] = np.array(...)
Column renaming and column dropping, both as copies and in place
Printing the data in table format with print(df), or by typing df in an interactive shell or notebook

The data is also stored as a dictionary inside the object, so we can access it through the df attribute and apply any operation that works on a dictionary, including passing it to another data library such as pandas.

Installation

User installation:

To use the library, follow these steps:

git clone https://github.com/daguito81/ie_koala.git
cd ie_koala
pip install .

ie_pandas requires numpy, which is installed automatically when you install through pip.

Note that the repository folder is called ie_koala, while the package is imported as ie_pandas; this is just to avoid confusion between similarly packaged libraries.
For usage inside a Python script, we need to import:

import ie_pandas as pd
or
from ie_pandas import DataFrame

For the sake of this documentation we will import the DataFrame class from the library.

Example:

import numpy as np
from ie_pandas import DataFrame

my_dict = {
    'int': [1, 2, 3],
    'float': [1.1, 2.2, 3.3],
    'str': ['one', 'two', 'three'],
    'bool': [True, False, True],
}

# Then we can create the DataFrame with
df = DataFrame(my_dict)

With the DataFrame created, we can aggregate its columns with different functions:

df.sum()  # To add all the elements of every numeric column
df.min()  # To find the minimum value of every numeric column
df.max()  # To find the maximum value of every numeric column
df.mean()  # To find the mean of every numeric column
df.median()  # To find the median of every numeric column
df.std()  # To find the standard deviation of every numeric column

We can also index the columns by strings using bracket indexing, or by integers using df.loc().
Examples:

df['int']  # returns array([1, 2, 3])
df['float']  # returns array([1.1, 2.2, 3.3])
# We can also index by integer using the df.loc() notation
df.loc(0)  # Returns array([1, 2, 3])
df.loc(0, 1)  # Returns array(2) which is a scalar
# Note that df.loc() notation is (Column, Row)
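
To make the (column, row) order concrete, here is a minimal standalone sketch of that lookup over a plain dictionary of lists. The helper name `loc` and its internals are assumptions for illustration only; this is not the library's implementation, which works on NumPy arrays.

```python
# Hypothetical stand-in for df.loc(); plain lists replace NumPy arrays.
def loc(data, column, row=None):
    """Look up a column by name or position, then optionally a row."""
    if isinstance(column, str):
        values = data[column]
    else:
        # Dicts keep insertion order, so a position maps to a column name.
        values = data[list(data)[column]]
    return values if row is None else values[row]


my_dict = {'int': [1, 2, 3], 'str': ['one', 'two', 'three']}
first_column = loc(my_dict, 0)      # whole first column
single_value = loc(my_dict, 0, 1)   # column 0, row 1
by_name = loc(my_dict, 'str', 2)    # column 'str', row 2
```

The column comes first and the row second, matching the note above.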

We can also rename the columns:

df.rename('int', 'integer', inplace=True)  # To change the current instance
df2 = df.rename('int', 'integer', inplace=False)  # To create a copy instead

Or drop a column:

df.drop('int', inplace=True)  # To drop it from the current instance
df2 = df.drop('int', inplace=False)  # To create a copy instead

We can also replace a column by bracket indexing and assigning the new column:

df['int'] = np.array([10, 11, 12])
# Note that the new column needs to be passed as a numpy array
# Note that the length needs to be the same as the length of the rest of the dataframe

We can also add a new column by simply assigning it to a new column name:

df['newcol'] = np.array(['New', 'Data', 'Here'])
# Note that the same restrictions as updating a column apply

Visualize

To visualize the DataFrame, we provide 2 methods:

To view it as console output from a script, use print(df).
To see it as a table in an interactive shell such as the Python shell or a notebook, simply type the name of the DataFrame, df.

Developer Instructions

To contribute to the ie_pandas library, every fix or feature needs to be tested and fully pep8 compliant. We use flake8 for style checks and unit tests to test the code in the library.

Installation

To install the library for development purposes, run:

git clone https://github.com/daguito81/ie_koala.git
cd ie_koala
pip install -e .[dev]

This installs the library in editable mode, so changes to the code take effect without reinstalling the package after every change.
The [dev] extra includes the additional packages needed for testing.
If you have any trouble running the tests, you can install them manually with:

pip install pytest
pip install pytest-cov
pip install pytest-flake8

Testing

All new tests should be included in the tests folder.

To check that the tests pass, run in the ie_koala directory:
pytest --cov=ie_pandas tests/
This runs all the tests and reports their code coverage. We aim for 90%+ code coverage in our tests.

To test for style errors, run:
pytest --flake8
This shows all errors that don't conform to flake8, which includes pep8.

To run only the tests, run in the ie_koala folder:
pytest

If there was an error installing the test extras, install them in the same environment as ie_pandas before running the tests:
pip install pytest-cov
pip install pytest-flake8

ie_koala's Issues

We need to create tests for edge cases and handle errors

We have to create at least tests that pass the column and index parameters to the DataFrame constructor, to make sure they override the defaults.

Change README to reflect new print function

Pay attention to mandatory API

I love the "koala" name 🙂 however, part of the grading process (this is not mentioned in the original assignment) will involve running my automated tests in your code. Therefore, even if you did everything correctly, my tests will expect this line to work:

from ie_pandas import DataFrame

rather than the current one from ie_koala import Dataframe (see also capitalization). I'm sorry to limit your creativity in this silly way, but it will make my life much easier. The name of the project can still be ie_koala of course.

Other things like df.get_row(index), df["col_name"] and df.sum() should also be respected.

To be clear: failure to comply exactly with this API won't be a critical factor in your grade, but again it will make the process more annoying to me, I will be more grumpy, and I might subtract some small points.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
RFC 2119.

https://www.ietf.org/rfc/rfc2119.txt 😉

Need to set the setter so we can rewrite dataframes

As of now, once the DataFrame is created, we can't change the data in it because there is no setter defined.

print function broken if numpy array passed to constructor

if we pass in a dictionary of arrays instead of lists, the printing function breaks

method len is not implemented

Change print function so that it can be tested

Our print function relies on printing to the console in a loop. I doubt that's the 'right' way to do it. It would be better to create a string and then print the string, instead of calling the print function once per row.

This would also allow us to create a test for the df.frame property.
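
A minimal sketch of the "build a string, then print it" idea, assuming the data is a dictionary of equal-length columns. The function name `render_table` and the column layout are hypothetical, not the project's actual print code.

```python
# Hypothetical renderer: builds the whole table as one string.
def render_table(data):
    """Return the columns of `data` as an aligned text table."""
    names = list(data)
    n_rows = len(data[names[0]]) if names else 0
    # Header row first, then one row of stringified values per index.
    rows = [names] + [[str(data[n][i]) for n in names] for i in range(n_rows)]
    widths = [max(len(row[c]) for row in rows) for c in range(len(names))]
    lines = [
        "  ".join(cell.rjust(w) for cell, w in zip(row, widths))
        for row in rows
    ]
    return "\n".join(lines)


table = render_table({'int': [1, 2, 3], 'str': ['one', 'two', 'three']})
print(table)  # a single print call for the whole table
```

Because the function returns a string, a test can compare it against an expected value without capturing console output.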

put the data type of each column when it prints it

Create indexing function

we need to create an index method that takes either a string or an integer and then indexes based on the name or the position of the column.

Create aggregation methods

as per requirement:

Methods .sum(), .median(), .min() and .max() that, ignoring the non-numerical columns, return a list of values corresponding to applying the function to each numerical column
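
One way this requirement could be sketched, using plain lists and the standard library instead of NumPy. The helper names `is_numeric` and `aggregate` are assumptions for illustration, and so is the choice to treat bool columns as non-numeric.

```python
import statistics


# Hypothetical helper; bool is excluded here by assumption, since
# Python otherwise treats True/False as integers.
def is_numeric(values):
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in values
    )


def aggregate(data, func):
    """Apply `func` to each numeric column and return the results as a list."""
    return [func(col) for col in data.values() if is_numeric(col)]


my_dict = {
    'int': [1, 2, 3],
    'float': [1.5, 2.5, 3.5],
    'str': ['one', 'two', 'three'],
}
sums = aggregate(my_dict, sum)                   # [6, 7.5]
medians = aggregate(my_dict, statistics.median)  # [2, 2.5]
```

The 'str' column is skipped, so each result has one value per numeric column, in column order.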

Checks for length of arrays

We need to make sure that when constructing a dataframe, all arrays (columns) are of the same length. So

ko.Dataframe(data={
    'a': [1, 2, 3],
    'b': [1, 2],
})

Gives an error
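
A minimal sketch of such a check, written as a standalone function over a dictionary of columns. The name `check_lengths` and the choice of ValueError are assumptions; the issue does not say which exception the constructor should raise.

```python
# Hypothetical validation step; the exception type is an assumption.
def check_lengths(data):
    """Raise ValueError unless every column has the same length."""
    lengths = {name: len(values) for name, values in data.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns differ in length: {lengths}")


check_lengths({'a': [1, 2, 3], 'b': [4, 5, 6]})  # passes silently

try:
    check_lengths({'a': [1, 2, 3], 'b': [1, 2]})
    rejected = False
except ValueError:
    rejected = True
```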

Create Tests for setter, aggregation and get_row

We need to create test cases for changing columns, adding columns, all aggregations and horizontal indexing

Getting different results if columns change and then df.frame vs print(df)

Currently we're remaking the dataframe as self._frame before printing it. We should use self.df which is generated at init so we can keep things synchronized. In this case if I change df.columns, doing df.frame shows the new columns because it recreates the df as self._frame. But if I do print(df) it uses self.df which has not been updated. We need to unify this.

This is a test issue

This is just a test issue

force all data inside the DataFrame into np.arrays

Right now, we can construct the DataFrame from a dictionary of lists or NumPy arrays, but the constructed DataFrame retains that choice.
This could lead to unwanted behaviour down the line, when code has to deal with either lists or NumPy arrays, especially in the aggregation functions.

We should create a loop in the init that goes through every column and makes it into a numpy array updating self.df

Deleting Columns

After creating a DataFrame, we don't have a method to delete a specific column

if you pass a dictionary of things like strings and integers, the DataFrame will be created but break

We need to start filling out the docstring of the Class

We need to start adding the instructions and details and limitations of our class in the Docstring so a user can use it easily

Create a way to visualize the dataframe

need to have a way to show print(df) or a specific method to visualize the DataFrame

Line 51 seems redundant

Line 51 has data = creation of dictionary same as self.df.

This was probably some legacy code we need to take care of.

indexing test has docstring stating it's RED

index parameter needs to be reworked to allow np int numbers

Before, we were creating DataFrames from dictionaries of integers. But converting an np.array to a list yields np.int32 elements, and the check in our code looks for int values, so it gives an error. We can change the condition to include NumPy integers as well.
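
One possible way to accept both kinds of integer without importing NumPy in the check itself is `operator.index`, which accepts any object implementing `__index__` (NumPy integer scalars do). This is a sketch of an alternative, not the project's fix: `as_position` and `FakeNumpyInt` are hypothetical names, and the latter only stands in for a NumPy scalar so the example runs without NumPy.

```python
import operator


def as_position(index):
    """Return `index` as a plain int, or raise TypeError."""
    # operator.index accepts any object that implements __index__,
    # which covers int and NumPy integer scalars but not float or str.
    return operator.index(index)


class FakeNumpyInt:
    """Stand-in for a NumPy integer scalar, for illustration only."""

    def __init__(self, value):
        self.value = value

    def __index__(self):
        return self.value


plain = as_position(2)
wrapped = as_position(FakeNumpyInt(2))
```

Note that bool values also pass this check, since bool is a subclass of int.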

Method .get_row(index) that returns a list of values corresponding to the row
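
A minimal sketch of what this method could look like as a standalone function over a dictionary of columns. The free-function form and the plain-list data are assumptions for illustration; the library's own method would work on its internal NumPy arrays.

```python
# Hypothetical free-function version of df.get_row(index).
def get_row(data, index):
    """Return the values at position `index` from every column, in column order."""
    return [values[index] for values in data.values()]


my_dict = {
    'int': [1, 2, 3],
    'float': [1.1, 2.2, 3.3],
    'str': ['one', 'two', 'three'],
    'bool': [True, False, True],
}
row = get_row(my_dict, 1)  # [2, 2.2, 'two', False]
```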
