
ie_koala's Introduction

ie_pandas

MBD 2018 - Adv.Python - Group K

Dagoberto Romer
Meredith Hayward
Andrea Salvati
Moritz Steinbrecher


ie_pandas is a Python library for working with DataFrames. It supports creating a dataframe object from a dictionary of data. Different columns can hold different data types, but each column can hold only one data type.


Features

The current functionality of the library is:

  1. Import data from a dictionary of Numpy arrays or a dictionary of lists (int, float, bool and strings are supported)
  2. String-based bracket indexing Ex: df['sales'], which returns the respective column as a Numpy array
  3. String and integer based indexing using the df.loc() method Ex: df.loc(2, 3)
  4. Aggregation functions (sum, min, max, mean, median and std are supported)
  5. Modifying and adding columns by assigning Numpy arrays to indexed columns Ex: df['int'] = np.array(...)
  6. Column renaming and column dropping, both as copies and in place
  7. Printing the data in table format with print(df), or by typing df in an interactive shell or notebook

The data is also stored as a dictionary inside the object, so we can access it through the df attribute and perform any operation on it that is possible on a dictionary, including passing it into another data library like pandas.

Installation

User installation:

To use the library, you must do these steps:

  1. git clone https://github.com/daguito81/ie_koala.git
  2. cd ie_koala
  3. pip install .

ie_pandas has numpy as a requirement, which will be installed automatically when installing through pip.

Take note that the repository folder is called ie_koala while the package is imported as ie_pandas; this is just to avoid confusion between similarly packaged libraries.
For usage inside a Python script, we need to import:

  • import ie_pandas as pd
    or
  • from ie_pandas import DataFrame

For the sake of this documentation we will import the DataFrame class from the library.

Example:

import numpy as np
from ie_pandas import DataFrame

my_dict = {
    'int': [1, 2, 3],
    'float': [1.1, 2.2, 3.3],
    'str': ['one', 'two', 'three'],
    'bool': [True, False, True],
}

# Then we can create the DataFrame with
df = DataFrame(my_dict)

With the DataFrame created, we can aggregate its columns with different functions:

df.sum()  # To add all the elements of every numeric column
df.min()  # To find the minimum value of every numeric column
df.max()  # To find the maximum value of every numeric column
df.mean()  # To find the mean of every numeric column
df.median()  # To find the median of every numeric column
df.std()  # To find the standard deviation of every numeric column

We can also index the columns by name using bracket indexing, or by integer position using df.loc().
Examples:

df['int']  # returns array([1, 2, 3])
df['float']  # returns array([1.1, 2.2, 3.3])
# We can also index by integer using the df.loc() notation
df.loc(0)  # Returns array([1, 2, 3])
df.loc(0, 1)  # Returns array(2) which is a scalar
# Note that df.loc() notation is (Column, Row)

We can also rename the columns:

df.rename('int', 'integer', inplace=True)  # To change the current instance
df2 = df.rename('int', 'integer', inplace=False)  # To create a copy instead

Or drop a column:

df.drop('int', inplace=True)  # To drop it from the current instance
df2 = df.drop('int', inplace=False)  # To create a copy instead

We can also replace a column by bracket indexing and assigning the new column:

df['int'] = np.array([10, 11, 12])
# Note that the new column needs to be passed as a numpy array
# Note that the length needs to be the same as the length of the rest of the dataframe

We can also add a new column by simply assigning to a new column name:

df['newcol'] = np.array(['New', 'Data', 'Here'])
# Note that the same restrictions as for updating a column apply

Visualize

To visualize the dataframe, we have provided 2 methods:

  1. To view it in the console output of a script, use print(df)
  2. To see it as a table in an interactive shell such as the Python shell or a notebook, simply type the name of the dataframe, df

Developer Instructions

To contribute to and develop the ie_pandas library, any issue or feature needs to be tested and be 100% pep8 compliant. We use flake8 for style checks and unit tests to test the code in the library.

Installation

To install the library for development purposes, do:

  1. git clone https://github.com/daguito81/ie_koala.git
  2. cd ie_koala
  3. pip install -e .[dev]

This will allow the library to pick up changes to the code dynamically, without having to reinstall the package after every change.
The [dev] extra includes the additional packages needed for testing.
In case of any trouble running the tests, you can install them manually with:

pip install pytest
pip install pytest-cov
pip install pytest-flake8

Testing

All new tests should be included in the tests folder.

To check that the tests pass, run in the ie_koala directory:
pytest --cov=ie_pandas tests/
This will run all the tests and report their code coverage. We aim to have 90%+ code coverage in our tests.

To test for style errors we can run:
pytest --flake8
This will show all the errors that don't conform with flake8, which includes pep8.

To run only the tests, use in the ie_koala folder:
pytest

In case there is an error installing the test dependencies, install them in the same environment as ie_pandas before running the tests:
pip install pytest-cov
pip install pytest-flake8

ie_koala's People

Contributors

andreasalvati, daguito81, dithmere, mosteinbrecher


ie_koala's Issues

Pay attention to mandatory API

I love the "koala" name 🙂 however, part of the grading process (this is not mentioned in the original assignment) will involve running my automated tests in your code. Therefore, even if you did everything correctly, my tests will expect this line to work:

from ie_pandas import DataFrame

rather than the current one from ie_koala import Dataframe (see also capitalization). I'm sorry to limit your creativity in this silly way, but it will make my life much easier. The name of the project can still be ie_koala of course.

Other things like df.get_row(index), df["col_name"] and df.sum() should also be respected.

To be clear: failure to comply exactly with this API won't be a critical factor in your grade, but again it will make the process more annoying to me, I will be more grumpy, and I might subtract some small points.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
RFC 2119.

https://www.ietf.org/rfc/rfc2119.txt 😉

Change print function so that it can be tested

Our print function relies on printing to the console in a loop. I doubt that's the 'right' way to do it. It would be better to create a string and then print the string, instead of calling the print function once per row.

This would also allow us to create a test for the df.frame @property
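
A minimal sketch of the string-building approach described above. The function name format_table and the two-space column separator are assumptions made for illustration, not the library's actual code; the point is only that the table becomes a single string that a test can compare against.

```python
def format_table(data):
    """Build the whole table as one string instead of printing row by row.

    `data` is a plain dict of equal-length columns (hypothetical stand-in
    for the dictionary the DataFrame holds).
    """
    names = list(data)
    n_rows = len(data[names[0]]) if names else 0
    # Each column is as wide as its widest cell or its header.
    widths = [
        max([len(str(name))] + [len(str(value)) for value in data[name]])
        for name in names
    ]
    lines = ["  ".join(str(n).ljust(w) for n, w in zip(names, widths))]
    for i in range(n_rows):
        cells = (str(data[n][i]).ljust(w) for n, w in zip(names, widths))
        lines.append("  ".join(cells))
    return "\n".join(lines)


table = format_table({"int": [1, 2], "str": ["one", "two"]})
print(table)  # a single print call for the whole table
```

A __str__ or __repr__ method could return such a string, which makes the output testable without capturing the console.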

Create indexing function

We need to create an index method that takes either a string or an integer and then indexes based on the name or the position of the column.
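
One way such a lookup could work, sketched on a plain dictionary of columns. The helper name get_column is hypothetical, and the sketch relies on dictionaries preserving insertion order (Python 3.7+).

```python
def get_column(data, key):
    """Return a column by name (str) or by position (int)."""
    if isinstance(key, str):
        return data[key]
    if isinstance(key, int):
        # list(data) gives the column names in insertion order.
        return data[list(data)[key]]
    raise TypeError("key must be a column name or an integer position")


columns = {"int": [1, 2, 3], "str": ["one", "two", "three"]}
by_name = get_column(columns, "str")
by_position = get_column(columns, 0)
```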

Create aggregation methods

as per requirement:

  • Methods .sum(), .median(), .min() and .max() that, ignoring the non-numerical columns, return a list of values corresponding to applying the function to each numerical column
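
The requirement above could be sketched as follows. The helper name aggregate is hypothetical, and treating bool columns as non-numerical is an assumption of this sketch, not something the requirement states.

```python
import statistics


def is_numeric_column(values):
    # Assumption: bools are excluded even though bool subclasses int.
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )


def aggregate(data, func):
    """Apply func to every numerical column and return the results as a list."""
    return [func(values) for values in data.values() if is_numeric_column(values)]


data = {
    "int": [1, 2, 3],
    "float": [1.1, 2.2, 3.3],
    "str": ["one", "two", "three"],
    "bool": [True, False, True],
}
maxima = aggregate(data, max)
medians = aggregate(data, statistics.median)
```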

Checks for length of arrays

We need to make sure that when constructing a dataframe, all arrays (columns) are of the same length. So

ko.Dataframe(data={
    'a': [1, 2, 3],
    'b': [1, 2]
})

should give an error
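
A small sketch of such a check, written as a standalone function on a plain dictionary. The name check_equal_lengths and the choice of ValueError are assumptions for illustration.

```python
def check_equal_lengths(data):
    """Raise ValueError unless every column has the same length."""
    lengths = {name: len(values) for name, values in data.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"All columns must have the same length, got {lengths}")


check_equal_lengths({"a": [1, 2, 3], "b": [4, 5, 6]})  # passes silently

try:
    check_equal_lengths({"a": [1, 2, 3], "b": [1, 2]})
except ValueError as error:
    message = str(error)
```

Calling a check like this at the start of the constructor would reject the mismatched example before any data is stored.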

Getting different results if columns change and then df.frame vs print(df)

Currently we're remaking the dataframe as self._frame before printing it. We should use self.df which is generated at init so we can keep things synchronized. In this case if I change df.columns, doing df.frame shows the new columns because it recreates the df as self._frame. But if I do print(df) it uses self.df which has not been updated. We need to unify this.

force all data inside the DataFrame into np.arrays

Right now, we have the ability to construct the dataframe from a dictionary of lists or numpy arrays. But the constructed dataframe retains that choice.
This could turn into unwanted behaviour down the line when dealing with a mix of lists and numpy arrays, especially in the aggregation functions.

We should create a loop in the init that goes through every column and makes it into a numpy array updating self.df
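
The conversion loop could be as short as a dictionary comprehension. This is a sketch on a plain dictionary; the name to_arrays is hypothetical. Note that np.asarray does not copy an input that is already an array.

```python
import numpy as np


def to_arrays(data):
    """Return a new dict in which every column is a numpy array."""
    return {name: np.asarray(values) for name, values in data.items()}


mixed = {"a": [1, 2, 3], "b": np.array([1.5, 2.5, 3.5])}
columns = to_arrays(mixed)
```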

Deleting Columns

After creating a Dataframe, we don't have a method to delete a specific column
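
A possible shape for such a method, sketched on a plain dictionary. The function name drop_column and its return value are assumptions; the inplace flag mirrors the one used in the README examples. The copy is shallow, so the remaining columns are shared with the original.

```python
def drop_column(data, name, inplace=False):
    """Remove one column, either from `data` itself or from a shallow copy."""
    target = data if inplace else dict(data)
    del target[name]  # raises KeyError if the column does not exist
    return target


original = {"int": [1, 2, 3], "str": ["one", "two", "three"]}
copy_without_int = drop_column(original, "int")
```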

Line 51 seems redundant

Line 51 has data = creation of dictionary same as self.df.

This was probably some legacy code we need to take care of.

index parameter needs to be reworked to allow np int numbers

Before, we were creating from dictionaries of integers. But creating from np.array columns returns np.int32 elements, and the check in our code is looking for int values, so it gives an error. We can change the condition to include numpy integers as well
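
A sketch of the widened condition. The helper name is_integer_index is hypothetical; np.integer is the abstract base of all numpy integer types (np.int32, np.int64 and so on), so one isinstance call covers both cases.

```python
import numpy as np


def is_integer_index(value):
    """Accept built-in ints as well as any numpy integer type."""
    return isinstance(value, (int, np.integer))


from_array = np.array([0, 1, 2])[1]  # a numpy integer, not a built-in int
accepted = is_integer_index(from_array)
```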

bug on .loc function

Tried doing df.loc(0,1) and I get

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\users\dagoberto\google drive\masters\ie mbd\term 3\adv python\ie_koala\src\ie_koala_init_.py", line 108, in loc
    return np.array(self.df[list(self.df)[col][row]])
KeyError: 't'
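
One reading of this traceback is that [row] is applied to the column name rather than to the column, so a single character of the name ends up being used as a dictionary key. The sketch below reproduces that on a plain dictionary with made-up data; it is an interpretation of the error, not a confirmed diagnosis.

```python
data = {"int": [1, 2, 3], "str": ["one", "two", "three"]}
col, row = 0, 1

# Bracket placement as in the traceback: [row] indexes the name string.
wrong_key = list(data)[col][row]  # second character of the name 'int'

# Closing the column lookup first, then indexing the row.
right_value = data[list(data)[col]][row]
```

Here wrong_key is a one-character string that is not a column name, which is the kind of key that would raise a KeyError like the one above.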

Issue with the column datatypes

Whenever we create a dataframe the column attribute becomes an np.array of numpy string and shows up to 5 characters. We need to make these an array or list of strings
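
The 5-character limit is consistent with numpy's fixed-width string dtype: an array of strings gets a width equal to its longest element, and longer values assigned later are silently truncated. A sketch with made-up column names, assuming this is the cause:

```python
import numpy as np

names = np.array(["int", "float", "str", "bool"])
width_dtype = names.dtype  # fixed width of 5 characters, set by 'float'

names[0] = "integer"       # longer than 5 characters
truncated = str(names[0])  # silently cut to the array's width

# A plain list of Python strings has no width limit.
names_list = ["int", "float", "str", "bool"]
names_list[0] = "integer"
```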

Renaming Columns

We need a method that we can pass a list of column names and it renames them
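
A sketch of list-based renaming on a plain dictionary. The function name rename_columns is hypothetical; the new names are matched to the existing columns by position.

```python
def rename_columns(data, new_names):
    """Return a new dict with the same columns under new names, by position."""
    if len(new_names) != len(data):
        raise ValueError("expected exactly one new name per column")
    return dict(zip(new_names, data.values()))


old = {"int": [1, 2, 3], "str": ["one", "two", "three"]}
renamed = rename_columns(old, ["integer", "text"])
```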

create .get_row(index) method

We need to implement a method that returns a list as the requirement

  • Method .get_row(index) that returns a list of values corresponding to the row
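
On a dictionary of columns, a row is the value at the same position in every column. A minimal sketch as a standalone function (the requirement asks for a method; the data here is illustrative):

```python
def get_row(data, index):
    """Return the values at position `index` across all columns, as a list."""
    return [values[index] for values in data.values()]


data = {
    "int": [1, 2, 3],
    "float": [1.1, 2.2, 3.3],
    "str": ["one", "two", "three"],
    "bool": [True, False, True],
}
second_row = get_row(data, 1)
```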
