pyjanitor-devs / pyjanitor

Clean APIs for data cleaning. Python implementation of the R package janitor.

Home Page: https://pyjanitor-devs.github.io/pyjanitor

License: MIT License

Topics: pandas, dataframe, data, cleaning-data, data-engineering, pydata, hacktoberfest

pyjanitor's Introduction

pyjanitor

pyjanitor is a Python implementation of the R package janitor, and provides a clean API for cleaning data.

Quick start

  • Installation: conda install -c conda-forge pyjanitor. More installation options are described in the Installation section below.
  • Check out the collection of general functions.

Why janitor?

Originally a port of the R package, pyjanitor has evolved from a set of convenient data cleaning routines into an experiment with the method chaining paradigm.

Data preprocessing usually consists of a series of steps that transform raw data into an understandable/usable format. This series of steps needs to be run in a certain sequence to achieve success. We take a base data file as the starting point and perform actions on it, such as removing null/empty rows, replacing them with other values, adding/renaming/removing columns of data, filtering rows, and more. More formally, these steps, along with their relationships and dependencies, are commonly referred to as a Directed Acyclic Graph (DAG).

The pandas API has been invaluable for the Python data science ecosystem, and implements method chaining for a subset of methods as part of the API. For example, resetting indexes (.reset_index()), dropping null values (.dropna()), and more, are accomplished via the appropriate pd.DataFrame method calls.

Inspired by the ease-of-use and expressiveness of the dplyr package of the R statistical language ecosystem, we have evolved pyjanitor into a language for expressing the data processing DAG for pandas users.

To accomplish this, actions that would otherwise require imperative-style statements can be replaced with method chains that let one read off the logical order of actions taken. Let us see the annotated example below. First off, here is a textual description of a data cleaning pathway:

  1. Create a DataFrame.
  2. Delete one column.
  3. Drop rows with empty values in two particular columns.
  4. Rename another two columns.
  5. Add a new column.

Let's import some libraries and begin with some sample data for this example:

# Libraries
import numpy as np
import pandas as pd
import janitor

# Sample Data curated for this example
company_sales = {
    'SalesMonth': ['Jan', 'Feb', 'Mar', 'April'],
    'Company1': [150.0, 200.0, 300.0, 400.0],
    'Company2': [180.0, 250.0, np.nan, 500.0],
    'Company3': [400.0, 500.0, 600.0, 675.0]
}

In pandas code, most users might type something like this:

# The Pandas Way

# 1. Create a pandas DataFrame from the company_sales dictionary
df = pd.DataFrame.from_dict(company_sales)

# 2. Delete a column from the DataFrame. Say 'Company1'
del df['Company1']

# 3. Drop rows that have empty values in columns 'Company2' and 'Company3'
df = df.dropna(subset=['Company2', 'Company3'])

# 4. Rename 'Company2' to 'Amazon' and 'Company3' to 'Facebook'
df = df.rename(
    {
        'Company2': 'Amazon',
        'Company3': 'Facebook',
    },
    axis=1,
)

# 5. Let's add some data for another company. Say 'Google'
df['Google'] = [450.0, 550.0, 800.0]

# Output looks like this:
# Out[15]:
#   SalesMonth  Amazon  Facebook  Google
# 0        Jan   180.0     400.0   450.0
# 1        Feb   250.0     500.0   550.0
# 3      April   500.0     675.0   800.0

Slightly more advanced users might take advantage of the functional API:

df = (
    pd.DataFrame(company_sales)
    .drop(columns="Company1")
    .dropna(subset=["Company2", "Company3"])
    .rename(columns={"Company2": "Amazon", "Company3": "Facebook"})
    .assign(Google=[450.0, 550.0, 800.0])
)

# The output is the same as before, and looks like this:
# Out[15]:
#   SalesMonth  Amazon  Facebook  Google
# 0        Jan   180.0     400.0   450.0
# 1        Feb   250.0     500.0   550.0
# 3      April   500.0     675.0   800.0

With pyjanitor, we enable method chaining with method names that are explicit verbs describing the action taken.

df = (
    pd.DataFrame.from_dict(company_sales)
    .remove_columns(["Company1"])
    .dropna(subset=["Company2", "Company3"])
    .rename_column("Company2", "Amazon")
    .rename_column("Company3", "Facebook")
    .add_column("Google", [450.0, 550.0, 800.0])
)

# Output looks like this:
# Out[15]:
#   SalesMonth  Amazon  Facebook  Google
# 0        Jan   180.0     400.0   450.0
# 1        Feb   250.0     500.0   550.0
# 3      April   500.0     675.0   800.0

As such, pyjanitor's etymology has a two-fold relationship to "cleanliness". Firstly, it's about extending pandas with convenient data cleaning routines. Secondly, it's about providing a cleaner, method-chaining, verb-based API for common pandas routines.

Installation

pyjanitor is currently installable from PyPI:

pip install pyjanitor

pyjanitor can also be installed via the conda package manager:

conda install pyjanitor -c conda-forge

pyjanitor can also be installed with the pipenv environment manager. This requires enabling prerelease dependencies:

pipenv install --pre pyjanitor

pyjanitor requires Python 3.6+.

Functionality

Current functionality includes (a short chained example follows this list):

  • Cleaning column names (multi-indexes are possible!)
  • Removing empty rows and columns
  • Identifying duplicate entries
  • Encoding columns as categorical
  • Splitting your data into features and targets (for machine learning)
  • Adding, removing, and renaming columns
  • Coalescing multiple columns into a single column
  • Converting dates (from Matlab, Excel, or Unix formats) to Python datetime format
  • Expanding a single column that has delimited, categorical values into dummy-encoded variables
  • Concatenating and deconcatenating columns, based on a delimiter
  • Syntactic sugar for filtering the dataframe based on queries on a column
  • Experimental submodules for finance, biology, chemistry, engineering, and pyspark
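
As a taste of how these compose, here is a small chained sketch using three of the functions above (the sample data is invented, and exact keyword names may vary between pyjanitor versions):

import numpy as np
import pandas as pd
import janitor  # registers the methods below on pd.DataFrame

raw = pd.DataFrame({
    "Sales Month": ["Jan", "Feb", "Mar"],
    "Revenue": [100.0, np.nan, 300.0],
})

df = (
    raw
    .clean_names()                        # "Sales Month" -> "sales_month"
    .remove_empty()                       # drop rows/columns that are entirely empty
    .encode_categorical(["sales_month"])  # convert to a categorical dtype
)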

API

The idea behind the API is two-fold:

  • Copy the R package function names, but enable Pythonic use with method chaining or pandas piping.
  • Add other utility functions that make it easy to do data cleaning/preprocessing in pandas.

Continuing with the company_sales dataframe previously used:

import pandas as pd
import numpy as np
company_sales = {
    'SalesMonth': ['Jan', 'Feb', 'Mar', 'April'],
    'Company1': [150.0, 200.0, 300.0, 400.0],
    'Company2': [180.0, 250.0, np.nan, 500.0],
    'Company3': [400.0, 500.0, 600.0, 675.0]
}

There are three ways to use the API. The first, and most strongly recommended, is to use pyjanitor's functions as if they were native to pandas.

import janitor  # upon import, functions are registered as part of pandas.

# This cleans the column names as well as removes any duplicate rows
df = pd.DataFrame.from_dict(company_sales).clean_names().remove_empty()

The second is the functional API.

from janitor import clean_names, remove_empty

df = pd.DataFrame.from_dict(company_sales)
df = clean_names(df)
df = remove_empty(df)

The final way is to use the pipe() method:

from janitor import clean_names, remove_empty
df = (
    pd.DataFrame.from_dict(company_sales)
    .pipe(clean_names)
    .pipe(remove_empty)
)

Contributing

Follow the development guide for a full description of the process of contributing to pyjanitor.

Adding new functionality

Keeping in mind the etymology of pyjanitor, contributing a new function is not difficult at all.

Define a function

First off, you will need to define the function that expresses the data processing/cleaning routine, such that it accepts a dataframe as the first argument, and returns a modified dataframe:

import pandas_flavor as pf

@pf.register_dataframe_method
def my_data_cleaning_function(df, arg1, arg2, ...):
    # Put data processing function here.
    return df

We use pandas_flavor to register the function natively on a pandas.DataFrame.
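
For instance, a fully worked registration might look like the following (drop_constant_columns is an entirely hypothetical example function, used here only for illustration):

import pandas as pd
import pandas_flavor as pf

@pf.register_dataframe_method
def drop_constant_columns(df):
    # Hypothetical example: drop columns that contain a single unique value.
    return df.loc[:, df.nunique() > 1]

df = pd.DataFrame({"a": [1, 2], "b": [0, 0]})
df = df.drop_constant_columns()  # now chains like any native pandas method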

Add a test case

Secondly, we ask that you contribute a test case, to ensure that the function works as intended. Follow the contribution docs for further details.
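
A minimal test case might look like this sketch (using pytest, and assuming clean_names' default behaviour of lowercasing and replacing spaces with underscores):

import pandas as pd
import janitor  # noqa: F401  (importing registers the methods)

def test_clean_names():
    df = pd.DataFrame({"A Column": [1], "Another Column": [2]}).clean_names()
    assert list(df.columns) == ["a_column", "another_column"]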

Feature requests

If you have a feature request, please post it as an issue on the GitHub repository issue tracker. Even better, put in a PR for it! We are more than happy to guide you through the codebase so that you can make a contribution.

Because pyjanitor is currently maintained by volunteers and has no fiscal support, any feature requests will be prioritized according to what maintainers encounter as a need in our day-to-day jobs. Please temper expectations accordingly.

API Policy

pyjanitor only extends or aliases the pandas API (and other dataframe APIs), but will never fix or replace them.

Undesirable pandas behaviour should be reported upstream in the pandas issue tracker. We explicitly do not fix the pandas API. If at some point the pandas devs decide to take something from pyjanitor and internalize it as part of the official pandas API, then we will deprecate it from pyjanitor, while acknowledging the original contributors' contribution as part of the official deprecation record.

Credits

Test data for the chemistry submodule can be found at Predictive Toxicology.

pyjanitor's People

Contributors

bhallay, catherinedevlin, dave-frazzetto, dependabot[bot], dsouzadaniel, ericmjl, gahjelle, hectorm14, hectormz, jcvall, jk3587, kessj, kurtispinkney, loganthomas, napsterinblue, nvamsikrishna05, ocefpaf, pre-commit-ci[bot], rahosbach, rajat-181, ricky-lim, samukweku, shandou, sorenfrohlich, szuckerman, thatlittleboy, vperrollaz, zbarry, zeroto521, zjpoh


pyjanitor's Issues

Fix default pandas methods

Is it possible to "fix" pandas methods using pyjanitor? For e.g. I will like to validate the parameters for read_excel or read_csv. I have raised an issue but there is no progress.

pandas-dev/pandas#22189

If janitor accepts a PR to override the default behavior of the read methods, that would be great.

Add filter_date function

This issue is to track creating a filter_date function, or whatever else you might want to call it.

The main idea is that even though there are already filter functions, date filtering is a bit more complex. For example, when filtering strings you can just write a string, but for dates it usually needs to be a date or datetime object, and not all of them play together nicely.

More important than the above is that when filtering a range, the lines can get really long, like so:

from datetime import date

start_date = date(2017, 1, 1)
end_date = date(2018, 1, 1)
df = df[(df.date_column >= start_date) & (df.date_column <= end_date)]

I think something like the following is nicer:

df = df.filter_date('date_column', start='2017-01-01', end='2018-01-01')

It could also include arguments for year, years, etc.
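
A rough sketch of how such a function might work (illustrative only, not pyjanitor's actual implementation; the names and defaults are assumptions):

import pandas as pd
import pandas_flavor as pf

@pf.register_dataframe_method
def filter_date(df, column_name, start=None, end=None):
    # Coerce the column to datetime, then keep rows in the closed interval.
    dates = pd.to_datetime(df[column_name])
    mask = pd.Series(True, index=df.index)
    if start is not None:
        mask &= dates >= pd.to_datetime(start)
    if end is not None:
        mask &= dates <= pd.to_datetime(end)
    return df[mask]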

Reminder of add_columns implementation

For myself, mostly:

Chaining implementation:

import pandas_flavor as pf

@pf.register_dataframe_method
def add_columns(df, **kwargs):
    # TODO sanity checks to add:
    #   - raise if a column already exists
    #   - raise if v is not a scalar but is a different length from the dataframe
    for k, v in kwargs.items():
        df = df.add_column(k, v)
    return df

Example usage for copying repeating rows from a DataFrame into another:

Suppose df1.columns is {'var1', 'var2', 'var3'}:

column_order = ['var1', 'var2']

df2.add_columns(**{
    col: vals
    for col, vals in zip(column_order, df1[column_order].iloc[0]) 
})

example.py broken

import pandas as pd
import janitor as jn

df = pd.read_excel("dirty_data.xlsx")

df = (
    jn.DataFrame(df)
    .clean_names()
    .remove_empty()
    .rename_column("%_allocated", "percent_allocated")
    .rename_column("full_time?", "full_time")
    .coalesce(["certification", "certification.1"], "certification")
    .encode_categorical(["subject", "employee_status", "full_time"])
    .convert_excel_date("hire_date")
)

print(df)
Traceback (most recent call last):
  File "/home/zachary/projects/pyjanitor/examples/example.py", line 7, in <module>
    jn.DataFrame(df)
AttributeError: module 'janitor' has no attribute 'DataFrame'

DataFrame is not in the janitor namespace. Also, df is already a DataFrame, so I'm not sure what the intent of doing this would be, anyway.

Note that if I replace jn.DataFrame(df) with simply df, I get:

Traceback (most recent call last):
  File "/home/zachary/miniconda3/envs/hv/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'certification.1'

clean_names breaks for MultiIndex Columns

Calling clean_names on MultiIndex columns raises an "expected str, not tuple" TypeError.

If you instead used
df.rename(columns=lambda x: x.lower().replace('_', ''))
this would work for both standard and MultiIndex DataFrames.

Can do a PR if required.

Feature thought: Add generic imputation

Hi Eric,

Just a thought: I find myself doing a lot of hand imputation. It would be nice if you could add imputation as a chainable function.

I'm under the gun and can't submit a PR, but I think this would be a great feature.
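
For the record, a chainable imputation function could be sketched along these lines (the names and signature are assumptions, not pyjanitor's actual API):

import pandas_flavor as pf

@pf.register_dataframe_method
def impute(df, column_name, value=None, statistic=None):
    # Fill missing values with a fixed value, or with a column statistic
    # such as "mean" or "median".
    if statistic is not None:
        value = getattr(df[column_name], statistic)()
    return df.assign(**{column_name: df[column_name].fillna(value)})

Usage would then be, e.g., df.impute('Company2', statistic='mean').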

Skip tests for optional dependencies if they are not installed

I just tried to run the test suite and had some failures because I don't have some of the Chemistry packages installed. I realized that those packages are only available through conda.

I don't use conda and am finding out that it's not so easy to just "install" conda packages into a previously created virtual environment (I mean, I'll figure it out eventually, but this is just at first glance).

In any event, I think it raises an interesting question: Does this create issues for people who don't use conda? Meaning, if I try to use something from a submodule, but because I don't have a dependency installed, like rdkit, it'll tell me to remedy with a conda install that's not going to work?

I'm not really sure what the answer is, but I could see either putting functions that rely on conda dependencies in their own package for conda, or just keeping it how it is and let people deal with it (assuming that people using these modules most likely already have conda installed).

Just wanted to throw this out there before the submodules get bigger.
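
On the test-suite side, at least, pytest can already skip tests whose imports are unavailable, e.g.:

# test_chemistry.py
import pytest

# Skips every test in this module when rdkit is not installed.
rdkit = pytest.importorskip("rdkit")

def test_chemistry_function():
    # (placeholder test body)
    pass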

add_column sanity checking broken for strings

There's a sanity check I put in there: if the supplied value is a sequence, it makes sure that the length is the same as the number of rows in the DataFrame, as long as fill_remaining is False. This is done by checking for the existence of __len__. I need to add logic to exclude str objects from this check. Fixing now.
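
The fix boils down to treating strings as scalars even though they define __len__, roughly like this (a sketch, not the exact patch):

def _check_column_length(value, nrows, fill_remaining=False):
    # Strings have __len__ but should be broadcast like scalars.
    if hasattr(value, "__len__") and not isinstance(value, str):
        if not fill_remaining and len(value) != nrows:
            raise ValueError(
                "Length of `value` does not match the number of rows."
            )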

Switching tests to Hypothesis

With Hypothesis providing a pandas dataframe generator, I'm wondering if we could generate better battle-tested code by using Hypothesis for property-based testing, instead of our current hacky way of generating tests based on simple dataframes.

One of Hypothesis' best traits is its ability to find edge cases that I myself was unable to find. I think this is worth a shot. Will tag this issue as appropriate.
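
For illustration, a property-based test using Hypothesis' pandas extra might look like this sketch (idempotence of clean_names is one plausible property to check):

from hypothesis import given
from hypothesis.extra.pandas import column, data_frames
import janitor  # noqa: F401

@given(df=data_frames(columns=[column("A Column", dtype=float)]))
def test_clean_names_is_idempotent(df):
    # Cleaning names twice should be the same as cleaning them once.
    assert df.clean_names().clean_names().equals(df.clean_names())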

Case Sensitive True/False kwarg on clean_names

Using pyjanitor again today, I realised I didn't want to change capitals to lowercase. I think a case-sensitive kwarg (default on) would be good.

Happy to do a PR for this if you think it is a good idea. One question on this: what should the kwarg be called?

Some ideas:
lower, remove_upper, drop_case, case_sensitive,...

Cleaning key-value stores?

A thought came to my mind: how do we clean key-value stores? Do we have to implicitly assume that there are regularly repeating key-value pairs? Or do we define data types and provide easy commands to clean them?

One low-hanging fruit is possibly changing all key names to lowercase + underscore-separated.
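
That low-hanging fruit is nearly a one-liner for a flat dictionary (a sketch):

def clean_keys(d):
    # Lowercase keys and replace spaces with underscores.
    return {str(k).lower().replace(" ", "_"): v for k, v in d.items()}

clean_keys({"First Name": "Ada"})  # -> {'first_name': 'Ada'}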

Administrative todos

  1. Setup Travis CI for continuous integration.
  2. Setup pyup.io for dependency management/pinning.

Warning shows up with adding new attributes

@shantanuo this appears to be related to your earlier PR. When I run tests, this shows up consistently.

  /Users/maer3/github/software/pyjanitor/janitor/functions.py:143: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
    df.original_columns = original_column_names

From your reading, do you know of an appropriate way to attach a new dataframe attribute without invoking this warning?

great idea!

please make sure to make it compatible with Pandas dataframes...
thanks!

[ENH] Naming conventions

I wanted to discuss naming conventions for the various functions and arguments, for consistency. expand_column has a parameter column, whereas add_column has col_name.

Also, is _column necessary in each function? Would it be ok to just have an add() or transform() method? In general I'm more on the side of more verbose function names, but just wanted to throw the idea out there.

Similarly, when following the format of limit_column_characters, functions like change_type should probably be named change_column_type.

I'm not married to any of this (except the function arguments, those should be consistent) but wanted to get peoples' ideas.

Welcome page documentation code example problem - remove_column() doesn't exist

In:

df = (
    pd.DataFrame(...)
    .remove_column('column1')
    .dropna(subset=['column2', 'column3'])
    .rename_column('column2', 'unicorns')
    .rename_column('column3', 'dragons')
    .add_column('newcolumn', ['iterable', 'of', 'items'])
)

remove_column() is not [any longer?] a function. I guess this should be remove_columns() instead. Note that the latter apparently only takes list arguments, rather than just a string in the case where you want to remove a single column. It might be useful to support both types of input.

transform_column proposed improvement

transform_column(df, col_name: str, function) could be augmented to include a destination column. An example use case: you want to perform a log10 transformation, as in the docs example, yet also preserve the non-transformed data for some other purpose.
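
A sketch of the augmented signature (dest_column_name is an assumed parameter name here, not necessarily what pyjanitor would adopt):

import numpy as np
import pandas as pd
import pandas_flavor as pf

@pf.register_dataframe_method
def transform_column(df, column_name, function, dest_column_name=None):
    # Write the transformed values to a new column when a destination is
    # given; otherwise overwrite the source column, as before.
    dest = dest_column_name or column_name
    return df.assign(**{dest: df[column_name].apply(function)})

df = pd.DataFrame({"value": [1.0, 10.0, 100.0]})
df = df.transform_column("value", np.log10, dest_column_name="value_log10")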

Documenting how tests are organized

Heads-up, @zbarry and @szuckerman.

I have refactored the test suite a little bit (actually, quite a lot). The changes hopefully make the test suite easier to develop against and enable newer contributions. I will be writing a section on how to write tests as part of the CONTRIBUTING.rst file.

The key changes here are:

  • Each function gets tested in a test_<function_name>.py file. This structure allows us to create multiple tests for each function, without cluttering up one big test_functions.py file.
  • Testing utilities, such as fixtures and hypothesis strategies, are now part of the main library. This enables them to be imported into the tests instead. Side note: We do not need to have tests for the testing utilities.

Big thanks to both of you for your contributions thus far; I really appreciate the work that has gone in 😄.

Remove Trailing (and Leading) underscores from clean_names

Pandas handles this relatively well, but it would be good as a kwarg for jn.clean_names().

The default (False or None) could leave underscores in place; True or 'both' could remove both leading and trailing underscores; and 'leading' or 'trailing' could remove each individually.
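
Whatever the kwarg ends up being called, the stripping itself is simple; e.g. for both ends (a sketch):

import pandas as pd

df = pd.DataFrame({"_leading_": [1], "trailing_": [2]})
df = df.rename(columns=lambda c: c.strip("_"))
# df.columns is now ['leading', 'trailing']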

documentation clarification (registering new methods)

Hi! Decided to have a bit of fun with pandas-flavor which I came across via pyjanitor*. I noticed that the docs here for contributors refer to register_dataframe_function, but it looks like the correct name is register_dataframe_method.

Not exactly earth-shattering stuff, but it didn't seem like it had been mentioned in another issue so thought I'd just flag it!

* -- just for a little Christmas break project, on which point, compliments of the season!

Examples needed

As per title! Having more than one example can be helpful for getting other users to use the package.

Returning functions in place

After seeing issue #67, I was curious what people think about adding this capability to all functions.

Some of them, like df.limit_column_characters(), already return in place. I don't think it will be hard to extend to others.

Pandas 0.23.0 Release

I noticed the pyjanitor-dev conda environment is running on pandas 0.22.x. Pandas has just done a new release, 0.23.0. I have no idea if this is significant, or if you have plans to keep up to date with the most recent pandas, but I thought it worth mentioning :)

limit column characters

Is there any way to limit the length of column names? For example, this_is_very_long_column_needs_truncated should be truncated to the first 10 or 20 characters. This may lead to duplicate column names; those should be suffixed with numbers, like this_is_very_long_1 and this_is_very_long_2.
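
A standalone sketch of the truncate-then-deduplicate behaviour described above (the exact numbering scheme is an assumption):

from collections import Counter
import pandas as pd

def limit_column_characters(df, length, separator="_"):
    # Truncate column names, then number repeats: name, name_1, name_2, ...
    seen = Counter()
    new_columns = []
    for col in df.columns:
        truncated = str(col)[:length]
        count = seen[truncated]
        seen[truncated] += 1
        new_columns.append(
            truncated if count == 0 else f"{truncated}{separator}{count}"
        )
    df.columns = new_columns
    return df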

Feature enhancement thought: reorder_columns()

Example usage:

df = (
    pd.read_csv('blah.csv')  # containing ['col1', 'col2']
    .add_column('col3', 12345)
    .reorder_columns(['col3'])  
)

Columns not specified retain their order and follow after the specified columns.

Preserve original names

While renaming the dataframe, I need to preserve the original names. For example:

santandar_data = pd.read_csv(r"train.csv", nrows=40000)
santandar_data.shape

santandar_data.original_names = santandar_data.columns
ndf = santandar_data
ndf.original_names

Index(['ID', 'var3', 'var15', 'imp_ent_var16_ult1', 'imp_op_var39_comer_ult1',
       'imp_op_var39_comer_ult3', 'imp_op_var40_comer_ult1',
       'imp_op_var40_comer_ult3', 'imp_op_var40_efect_ult1',
       'imp_op_var40_efect_ult3',
       ...
       'saldo_medio_var33_hace2', 'saldo_medio_var33_hace3',
       'saldo_medio_var33_ult1', 'saldo_medio_var33_ult3',
       'saldo_medio_var44_hace2', 'saldo_medio_var44_hace3',
       'saldo_medio_var44_ult1', 'saldo_medio_var44_ult3', 'var38', 'TARGET'],
      dtype='object', length=371)

The ndf dataframe object has an original_names attribute that works correctly. But when I use the clean_names function, I do not get this functionality.

df=santandar_data.clean_names(case_type="upper", remove_special=True).limit_column_characters(3)
df.original_names

AttributeError: 'DataFrame' object has no attribute 'original_names'

Remaining functions from R version

The following is a list of functions missing from pyjanitor that are implemented in the R version. I think the aggregation and adornment functions can be put in their own submodules later.

To be implemented

Main Functions

Aggregation

Adornment

  • adorn_crosstab.R
  • adorn_ns.R
  • adorn_pct_formatting.R
  • adorn_percentages.R
  • adorn_rounding.R
  • adorn_title.R
  • adorn_totals.R

Won't be implemented

  • round_half_up.R: probably no need to implement this. The main reason it exists is that R's round(2.5) returns 2 (round half to even), whereas round_half_up(2.5) == 3. Note that Python 3's built-in round(2.5) is also 2, for the same reason.
  • make_clean_names.R: helper function for clean_names.R.
  • get_level_groups.R: helper function for top_levels.R.
  • crosstab.R: deprecated in favor of tabyl.

Install fails with Python 3.7

Due to the pinned sklearn version, installation fails on 3.7. Need to change the sklearn dependency in requirements.txt from == to >=.

Warning about deprecation when not using those methods

With Python 3.6.6 and pyjanitor 0.5.0 installed using pipenv on Windows 10, I get the following UserWarning:

In [1]: import janitor
C:\Users\<USERNAME>\.virtualenvs\<REPO_NAME>\Lib\site-packages\janitor\dataframe.py:24: 
UserWarning: Janitor's subclassed DataFrame and Series will be deprecated before
the 1.0 release. Instead of importing the Janitor DataFrame, please instead
`import janitor`, and use the functions directly attached to native pandas
dataframe.
  warnings.warn(msg)

This is when using pyjanitor in the recommended way. Shouldn't the warning only appear on import of the deprecated DataFrame functions? I even get the warning when importing only clean_names:

In [1]: from janitor import clean_names
C:\Users\<USERNAME>\.virtualenvs\<REPO_NAME>\Lib\site-packages\janitor\dataframe.py:24: 
UserWarning: Janitor's subclassed DataFrame and Series will be deprecated before
the 1.0 release. Instead of importing the Janitor DataFrame, please instead
`import janitor`, and use the functions directly attached to native pandas
dataframe.
  warnings.warn(msg)

Additional Namespaces

From a previous pull request, the issue of namespaces arose. I wanted to open this issue to discuss possible new namespaces for the module.

It appears that the R version is having this issue as well.

There was discussion of a finance submodule, which sounds good, but I don't work in finance and would be unfamiliar with many necessary items to be included.

I think a summary submodule, or something like that, would be good for adding tabyl or other summary statistics.

Thoughts?

Inconsistencies in original-dataframe mutation

Original note, re: reorder_columns: it does not mutate the original DataFrame. I'm thinking about modifying it to do so, to be consistent with everything else I implemented.

Edit:

In working on the Jupyter notebook example walkthrough for pyjanitor, I'm noticing some inconsistencies regarding whether the original DataFrame is changed after an operation in the provided example. My notes:

  • .clean_names() does not mutate
  • .remove_empty() does
  • .rename_column() does not
  • .coalesce() does not
  • .encode_categorical() does
  • .convert_excel_date() does

What do we think about this?

[ENH] Add read_folderfiles() function

I find myself loading a lot of files from a folder. I would use the glob library for this, but it is a lot to write out. For example, I would write:

path ="C:/Finance/Month End/2018/CSV Imports YTD"
files_xls = glob.glob(path + "/*.csv")
df = pd.DataFrame()

for f in files_xls:
data1 = pd.read_csv(f,skiprows=0,low_memory=False,encoding="cp1252")
data1['File_Name'] = (f)
#data2['File_Name'] = (f)
data1.append(data1,ignore_index=True)
df = df.append(data1,ignore_index=True)

Something like this would be easier:

read_folderfiles( path = "", extension ="", encoding = "", add_filenames = True)

Thoughts?
I can try to create this if you like.
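
A sketch of the proposed function, under the assumption that it wraps glob and pd.concat:

import glob
import pandas as pd

def read_folderfiles(path, extension="csv", encoding=None, add_filenames=True):
    # Read every matching file under `path` into one DataFrame, optionally
    # tagging each row with its source file name.
    frames = []
    for f in glob.glob(f"{path}/*.{extension}"):
        frame = pd.read_csv(f, encoding=encoding)
        if add_filenames:
            frame["File_Name"] = f
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)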

Take ideas/code from Agate?

Speaking as a newcomer to Pandas who finds its syntax confusing, pyjanitor is a breath of fresh air. There's a similar, older project which may provide extra inspiration, maybe even code: agate by @onyxfish (who also created the fantastic Quartz guide to bad data). However, he now seems to have moved on to other things, and development has been considerably slower for the past couple of years.

Agate is well worth a look: the documentation is extensive and well-written, and it has a few features which pyjanitor doesn't (yet). Unfortunately much of the code may be hard to port as it relies on its own table implementation rather than using Pandas DataFrame or similar. Even so, it's worth checking out.

Feature enhancement: collapsing MultiIndex columns

For example, after aggregation with multiple functions.

For a df with columns ['group', 'category', 'value']:

stats_df = (
    df.groupby(['group', 'category'])
    .agg(['mean', 'median'])
    .reset_index()
)

produces stats_df with a MultiIndex .columns attribute, where {'mean', 'median'} are second-level column names under value. It would be nice if .columns were just a flat Index instead, for some applications.

Now, to flatten the MultiIndex into a flat Index by concatenating the different levels with an underscore:

# stats_df.columns.values is:
# array([('group', ''), ('category', ''), ('value', 'mean'), ('value', 'median')], dtype=object)

stats_df.columns = [
    '_'.join(tup) if tup[1] != '' else tup[0]
    for tup in stats_df.columns.values
]

# stats_df.columns is now ['group', 'category', 'value_mean', 'value_median']
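
Wrapped up as a chainable verb, this might look like the following sketch (collapse_levels is a proposed name here, not an existing pyjanitor function):

import pandas_flavor as pf

@pf.register_dataframe_method
def collapse_levels(df, sep="_"):
    # Flatten MultiIndex columns by joining the non-empty levels with `sep`.
    df = df.copy()
    df.columns = [
        sep.join(str(level) for level in tup if str(level) != "")
        if isinstance(tup, tuple)
        else tup
        for tup in df.columns
    ]
    return df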

proposal: import errors raised for submodules

In order to keep pyjanitor lightweight, I would like to propose that the submodule dependencies not be installed with the main module.

In order for this to work out in a user-friendly fashion, I think we will need to provide some try/except imports. For example, in the biology submodule:

try:
    from Bio import Seq
except ImportError:
    print('You need to install `biopython`: \n\n    conda install -c conda-forge biopython')

Question about chaining

Are the successive chaining operations happening in place, or is the entire dataframe copied every time?

new function proposal: find and replace

A colleague proposed this function: within a column that houses strings, provide a find_replace() function that finds a substring in each cell and replaces it with another substring.

Note to self:

  • Figure out if this is already possible with pandas.
  • Decide if implementing this provides a more fluent function API.
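
On the first point: pandas covers this with Series.str.replace; a chainable wrapper would mostly add the explicit verb (a sketch):

import pandas as pd
import pandas_flavor as pf

@pf.register_dataframe_method
def find_replace(df, column_name, find, replace):
    # Substring find-and-replace within a string column, chainable.
    return df.assign(
        **{column_name: df[column_name].str.replace(find, replace, regex=False)}
    )

df = pd.DataFrame({"city": ["New York City", "Kansas City"]})
df = df.find_replace("city", "City", "Metro")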
