deppen8 / pandas-vet
A plugin for Flake8 that checks pandas code
Home Page: https://pandas-vet.readthedocs.io
License: MIT License
pandas-vet
is available from PyPI (https://pypi.org/project/pandas-vet/) and conda-forge (https://github.com/conda-forge/pandas-vet-feedstock). I would love to automate the process that uploads a pandas-vet release to both of these places.
This will help avoid embarrassing situations like #78
Is there a good reason to prefer .loc over .at when you need to get a single value?
We use .at over .loc in our codebase when we want to signal to other developers that we intend to get a single value, as opposed to a Series or DataFrame.
The warning seems to assume the developer picked .at for speed, while their intention is more likely to have picked it for correctness and clarity.
Check for the import pandas as pd pattern. If only import pandas is found, give error message:
Use import pandas as pd convention
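A minimal sketch of how such a check might be written with the standard ast module (the function and message names here are illustrative, not the actual pandas-vet internals):

```python
import ast

PD001 = "PD001 pandas should always be imported as 'import pandas as pd'"

def check_pandas_import(source: str) -> list:
    """Return (lineno, message) tuples for un-aliased pandas imports.

    Note: `from pandas import ...` is an ast.ImportFrom node and is
    not covered by this sketch.
    """
    errors = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name == "pandas" and alias.asname != "pd":
                    errors.append((node.lineno, PD001))
    return errors
```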
The contributing guide needs an item to explain installing in "develop" mode.
Need to add:
This feature request is one of the topics in Minimally Sufficient Pandas, namely:
https://deppen8.github.io/pandas-bw/general/Python-funcs-vs-pandas.png
In order to implement this, we will need to be able to validate types, e.g. that the argument inside sum, min, max, or abs is a data frame.
Solving this false positive should be easier than solving all of them at the same time (#81), because in pandas .values is a property, not a method.
There are existing flake8 lint warnings in the source and tests
(.venv) > flake8 {pandas_vet,tests}
pandas_vet/__init__.py:116:80: E501 line too long (83 > 79 characters)
pandas_vet/__init__.py:134:80: E501 line too long (94 > 79 characters)
pandas_vet/__init__.py:137:80: E501 line too long (83 > 79 characters)
pandas_vet/__init__.py:155:80: E501 line too long (105 > 79 characters)
pandas_vet/__init__.py:158:80: E501 line too long (83 > 79 characters)
pandas_vet/__init__.py:174:80: E501 line too long (107 > 79 characters)
pandas_vet/__init__.py:177:80: E501 line too long (83 > 79 characters)
pandas_vet/__init__.py:232:80: E501 line too long (85 > 79 characters)
pandas_vet/__init__.py:365:80: E501 line too long (84 > 79 characters)
pandas_vet/__init__.py:369:80: E501 line too long (82 > 79 characters)
pandas_vet/__init__.py:373:80: E501 line too long (84 > 79 characters)
pandas_vet/__init__.py:383:80: E501 line too long (83 > 79 characters)
pandas_vet/__init__.py:386:80: E501 line too long (85 > 79 characters)
pandas_vet/__init__.py:389:80: E501 line too long (102 > 79 characters)
pandas_vet/__init__.py:392:80: E501 line too long (93 > 79 characters)
pandas_vet/__init__.py:395:80: E501 line too long (90 > 79 characters)
pandas_vet/__init__.py:398:80: E501 line too long (81 > 79 characters)
pandas_vet/__init__.py:401:80: E501 line too long (107 > 79 characters)
tests/test_PD015.py:56:80: E501 line too long (86 > 79 characters)
The simple fix here is to just update this code. But it may also be a good idea to add a flake8 lint check in CI, and to consider whether the library should adopt black auto-formatting and how that might influence a flake8 config file.
Is your feature request related to a problem? Please describe.
I want to make it easier for new contributors to get going.
Describe the solution you'd like
Create a Dockerfile that installs the necessary files and dependencies for developing and testing the code. Modern IDEs like VSCode and PyCharm allow you to connect to a running Docker container which makes it a breeze to work on a container.
Check for deprecated .ix. Give error message:
'.ix' is deprecated. Use '.loc' or '.iloc' instead.
We should document both the general disabling/ignoring of checks as well as the per-line disabling provided by flake8 (# noqa).
This is a spin-off from #81
This is a dedicated issue for the big discussion in #74
The problem is that many of our checks rely on the type of the object being a pandas object. This is a fundamental issue with static linting in Python because the AST doesn't know what type a thing is. This leads to false positives for things like re.sub() or dict.values().
I am open to suggestions on how to get around this, but it will likely be a big job. Some kind of integration with mypy
or some other way to leverage type annotations might be a way to fix this, at least for folks who use those type annotations. What exactly that looks like is unclear to me, so please let me know if you have any ideas.
For now, the undesirable workaround is to turn off checks that are particularly bothersome.
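As a purely illustrative interim idea (not something pandas-vet implements), a heuristic could restrict a check to receivers whose names look pandas-ish, trading false positives for false negatives; the allow-list here is hypothetical:

```python
import ast

DF_LIKE_NAMES = {"df", "frame", "series"}  # illustrative allow-list

def receiver_looks_like_pandas(call: ast.Call) -> bool:
    """Heuristic: only treat x.values()/x.sub() etc. as pandas calls
    when the receiver is a bare name that matches a pandas-ish naming
    pattern. This suppresses hits on re.sub(), tar.add(), and similar."""
    func = call.func
    if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
        name = func.value.id.lower()
        return name in DF_LIKE_NAMES or name.endswith("_df")
    return False
```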
Is your feature request related to a problem? Please describe.
pd.concat([df for df in my_dfs])
works, but could also be written as
pd.concat(df for df in my_dfs)
thanks to PEP 289.
Describe the solution you'd like
Flag cases when a generator expression could be used instead of a list comprehension
Describe alternatives you've considered
Giving a warning from within pandas - problem is, at runtime, one does not know whether a function has been passed a list written as a list comprehension or if it was a preformed list. So this is likely better-suited to a linter
Additional context
The trickiest part would probably be to identify all the cases when this can be done - that might mean having to parse the codebase and look for public functions where an argument is annotated with Iterable
. I'm happy to raise a PR if you'd like this
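A rough sketch of how the trigger side might be detected with the standard ast module (the function name is hypothetical, and a real check would also need the allow-list of callees discussed above):

```python
import ast

def flag_listcomp_args(source: str) -> list:
    """Return line numbers where a call receives a list comprehension
    as a positional argument, which could often be a generator
    expression instead (PEP 289). Only safe to suggest when the callee
    iterates the argument exactly once."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            for arg in node.args:
                if isinstance(arg, ast.ListComp):
                    hits.append(arg.lineno)
    return hits
```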
Describe the bug
First of all, thanks; this repository looks awesome. What I expect to happen is that pandas-vet should run when flake8 is invoked by pre-commit, but it doesn't.
To Reproduce
Steps to reproduce the behavior:
pip install pre-commit==2.1.1
pip install pandas-vet==0.2.2
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v2.4.0
  hooks:
  - id: flake8
pre-commit install
Expected behavior
I would like it to run the same way when executed by the pre-commit hook.
Check for use of the pd.merge function. Preferred is the .merge method. Even the pandas docs use .merge in the documentation of pd.merge! See flashcard.
Give error message:
Use '.merge' method instead of 'pd.merge' function. They have equivalent functionality.
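For example, the two spellings produce identical results:

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2], "a": ["x", "y"]})
right = pd.DataFrame({"key": [1, 2], "b": [3.0, 4.0]})

# Flagged: module-level function
via_function = pd.merge(left, right, on="key")
# Preferred: DataFrame method with equivalent functionality
via_method = left.merge(right, on="key")
```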
Check for the .stack method. See flashcard. Give error message:
Prefer '.melt' to '.stack'. '.melt' allows direct column renaming and avoids a MultiIndex
Creating this issue to propose and discuss additions that should be in the next release. Anything that seems like a winner will get added to the Milestone and we can work from there.
Check for pd.read_table function calls. See flashcard here. Give error message:
'pd.read_table' is deprecated. Use 'pd.read_csv' for all delimited files.
Describe the bug
raised exception
To Reproduce
Steps to reproduce the behavior:
Have the following code in a file
def some_function(dataFrame, in_place=False):
    return dataFrame.drop([], inplace=in_place)
Expected behavior
Allow flake8
to report violations and not throw exceptions.
Additional context
bash-5.1# cat /usr/lib/python3.9/site-packages/pandas_vet/version.py
__version__ = "0.2.2"
This is running on a docker container based on alpine:3.14.1. Same results obtained on a mac.
Things work if we do not provide a variable:
def some_function(dataFrame, in_place=False):
    return dataFrame.drop([], inplace=False)
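A defensive version of the keyword inspection could guard the attribute access with an isinstance check so that a variable argument is simply skipped instead of crashing (a sketch only; has_inplace_true is a hypothetical helper, not the actual pandas-vet code):

```python
import ast

def has_inplace_true(call: ast.Call) -> bool:
    """True only when the call passes a literal inplace=True.

    Guarding with isinstance(kw.value, ast.Constant) avoids raising
    when the argument is a variable like inplace=in_place."""
    for kw in call.keywords:
        if kw.arg == "inplace":
            return isinstance(kw.value, ast.Constant) and kw.value.value is True
    return False
```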
Describe the bug
For PD011, should the suggestion be to_numpy() instead of to_array() (given the suggestion in the warnings of https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.values.html)? Perhaps I am wrong on this, but I just wanted to bring it up or at least get clarification on the potential difference.
Try to respond to as many of the following as possible
Generally describe the pandas
behavior that the linter should check for and why that is a problem. Links to resources, recommendations, docs appreciated
The linter should check for nunique being compared to 1. The detected pattern is less performant because it does not leverage short-circuiting when multiple unique values are found, and simply continues counting.
def setup(n):
    return pd.Series(list(range(n)))

def setup(n):
    return pd.Series([1] * (n - 1) + [2])
Suggest specific syntax or pattern(s) that should trigger the linter (e.g., .iat)
df.column.nunique() == 1
df.column.nunique() != 1
df.column.nunique(dropna=True) == 1
df.column.nunique(dropna=True) != 1
df.column.nunique(dropna=False) == 1
df.column.nunique(dropna=False) != 1
Suggest specific syntax or pattern(s) that the linter should allow (e.g., .iloc)
Note that the solution is simple when there are no NaN values:
(series.values[0] == series.values).all()
And needs some additional logic when NaN/NA values are present.
For dropna=True
v = series.values
v = remove_na_arraylike(v)
if v.shape[0] == 0:
    return False
(v[0] == v).all()
For dropna=False
v = s.values
if v.shape[0] == 0:
    return False
(v[0] == v).all() or not pd.notna(v).any()
if included in pandas:
series.is_constant()
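Pulling the fragments above together, a self-contained helper might look like this (a sketch that uses pd.notna in place of the internal remove_na_arraylike, and .to_numpy() rather than .values; is_constant is a hypothetical name):

```python
import pandas as pd

def is_constant(series: pd.Series, dropna: bool = True) -> bool:
    """Short-circuit-friendly constancy check, sketching the pattern
    the linter would recommend over `.nunique() == 1`."""
    values = series.to_numpy()
    if dropna:
        values = values[pd.notna(values)]
        if values.shape[0] == 0:
            return False
        return bool((values[0] == values).all())
    if values.shape[0] == 0:
        return False
    # With dropna=False, an all-NaN series counts as constant
    return bool((values[0] == values).all() or not pd.notna(values).any())
```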
Suggest a specific error message that the linter should display (e.g., "Use '.iloc' instead of '.iat'. If speed is important, use numpy indexing")
Consider checking equality to the first element instead of '.nunique() == 1' when testing for a constant column.
Are you willing to try to implement this check?
Developing the check functions for each linter error requires exploration of the AST nodes to determine the corresponding attributes to be compared against the valid code patterns. This presents a barrier to quickly implementing new linter checks that could be reduced if there were a tool that returns the appropriate AST node attributes for a specified pattern.
The envisioned solution might utilize the following form:
attributes = ast_explore(code_signature)
where code_signature
is a string representing the code pattern of interest, and attributes
is a string (or list of strings) representing the composed attributes for the corresponding AST node. This string could then be appended to the AST node in the check
functions:
if node.<attribute_string> == <test_condition>:
or
for attribute in node.<attribute_string>:
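A bare-bones starting point for such a tool could simply parse the pattern and dump the node, leaving the attribute-path extraction for later (a sketch only; ast_explore is the envisioned name, not an existing function):

```python
import ast

def ast_explore(code_signature: str) -> str:
    """Return a readable dump of the AST for a code pattern, as a
    first step toward mapping a pattern to node attributes."""
    tree = ast.parse(code_signature, mode="eval")
    return ast.dump(tree.body)
```

For example, ast_explore("df.isnull()") reveals the Call/Attribute/Name nesting that a check function would need to traverse.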
Installing pandas-vet with pip or conda on Windows or WSL results in flake8 --version
returning
3.7.9 () CPython 3.7.3 on Windows
3.7.9 () CPython 3.8.2 on Linux
Expected behavior
pip install flake8 pandas-vet==0.2.2
flake8 --version
3.7.9 (flake8-pandas-vet: 0.2.1, mccabe: 0.6.1, pycodestyle: 2.5.0, pyflakes: 2.1.1) CPython 3.7.3 on Windows
To Reproduce
Steps to reproduce the behavior:
pip install flake8 pandas-vet
or pip install flake8 pandas-vet==0.2.2
flake8 --version
Additional Context
Running flake8 -v --version
results in:
flake8.plugins.manager MainProcess 65 INFO Loading entry-points for "flake8.extension".
flake8.plugins.manager MainProcess 73 INFO Loading entry-points for "flake8.report".
flake8.plugins.manager MainProcess 80 INFO Loading plugin "F" from entry-point.
flake8.plugins.manager MainProcess 118 INFO Loading plugin "pycodestyle.ambiguous_identifier" from entry-point.
flake8.plugins.manager MainProcess 123 INFO Loading plugin "pycodestyle.bare_except" from entry-point.
flake8.plugins.manager MainProcess 123 INFO Loading plugin "pycodestyle.blank_lines" from entry-point.
flake8.plugins.manager MainProcess 123 INFO Loading plugin "pycodestyle.break_after_binary_operator" from entry-point.
flake8.plugins.manager MainProcess 123 INFO Loading plugin "pycodestyle.break_before_binary_operator" from entry-point.
flake8.plugins.manager MainProcess 123 INFO Loading plugin "pycodestyle.comparison_negative" from entry-point.
flake8.plugins.manager MainProcess 124 INFO Loading plugin "pycodestyle.comparison_to_singleton" from entry-point.
flake8.plugins.manager MainProcess 124 INFO Loading plugin "pycodestyle.comparison_type" from entry-point.
flake8.plugins.manager MainProcess 124 INFO Loading plugin "pycodestyle.compound_statements" from entry-point.
flake8.plugins.manager MainProcess 124 INFO Loading plugin "pycodestyle.continued_indentation" from entry-point.
flake8.plugins.manager MainProcess 124 INFO Loading plugin "pycodestyle.explicit_line_join" from entry-point.
flake8.plugins.manager MainProcess 124 INFO Loading plugin "pycodestyle.extraneous_whitespace" from entry-point.
flake8.plugins.manager MainProcess 125 INFO Loading plugin "pycodestyle.imports_on_separate_lines" from entry-point.
flake8.plugins.manager MainProcess 125 INFO Loading plugin "pycodestyle.indentation" from entry-point.
flake8.plugins.manager MainProcess 125 INFO Loading plugin "pycodestyle.maximum_doc_length" from entry-point.
flake8.plugins.manager MainProcess 125 INFO Loading plugin "pycodestyle.maximum_line_length" from entry-point.
flake8.plugins.manager MainProcess 125 INFO Loading plugin "pycodestyle.missing_whitespace" from entry-point.
flake8.plugins.manager MainProcess 126 INFO Loading plugin "pycodestyle.missing_whitespace_after_import_keyword" from entry-point.
flake8.plugins.manager MainProcess 126 INFO Loading plugin "pycodestyle.missing_whitespace_around_operator" from entry-point.
flake8.plugins.manager MainProcess 126 INFO Loading plugin "pycodestyle.module_imports_on_top_of_file" from entry-point.
flake8.plugins.manager MainProcess 126 INFO Loading plugin "pycodestyle.python_3000_async_await_keywords" from entry-point.
flake8.plugins.manager MainProcess 126 INFO Loading plugin "pycodestyle.python_3000_backticks" from entry-point.
flake8.plugins.manager MainProcess 127 INFO Loading plugin "pycodestyle.python_3000_has_key" from entry-point.
flake8.plugins.manager MainProcess 127 INFO Loading plugin "pycodestyle.python_3000_invalid_escape_sequence" from entry-point.
flake8.plugins.manager MainProcess 127 INFO Loading plugin "pycodestyle.python_3000_not_equal" from entry-point.
flake8.plugins.manager MainProcess 127 INFO Loading plugin "pycodestyle.python_3000_raise_comma" from entry-point.
flake8.plugins.manager MainProcess 127 INFO Loading plugin "pycodestyle.tabs_obsolete" from entry-point.
flake8.plugins.manager MainProcess 128 INFO Loading plugin "pycodestyle.tabs_or_spaces" from entry-point.
flake8.plugins.manager MainProcess 128 INFO Loading plugin "pycodestyle.trailing_blank_lines" from entry-point.
flake8.plugins.manager MainProcess 128 INFO Loading plugin "pycodestyle.trailing_whitespace" from entry-point.
flake8.plugins.manager MainProcess 128 INFO Loading plugin "pycodestyle.whitespace_around_comma" from entry-point.
flake8.plugins.manager MainProcess 129 INFO Loading plugin "pycodestyle.whitespace_around_keywords" from entry-point.
flake8.plugins.manager MainProcess 129 INFO Loading plugin "pycodestyle.whitespace_around_named_parameter_equals" from entry-point.
flake8.plugins.manager MainProcess 129 INFO Loading plugin "pycodestyle.whitespace_around_operator" from entry-point.
flake8.plugins.manager MainProcess 129 INFO Loading plugin "pycodestyle.whitespace_before_comment" from entry-point.
flake8.plugins.manager MainProcess 129 INFO Loading plugin "pycodestyle.whitespace_before_parameters" from entry-point.
flake8.plugins.manager MainProcess 130 INFO Loading plugin "C90" from entry-point.
flake8.plugins.manager MainProcess 131 INFO Loading plugin "PD" from entry-point.
flake8.plugins.manager MainProcess 147 INFO Loading plugin "default" from entry-point.
flake8.plugins.manager MainProcess 151 INFO Loading plugin "pylint" from entry-point.
flake8.plugins.manager MainProcess 151 INFO Loading plugin "quiet-filename" from entry-point.
flake8.plugins.manager MainProcess 151 INFO Loading plugin "quiet-nothing" from entry-point.
Describe the bug
False positive: PD005 for the regex sub method
To Reproduce
PD005 for this code:
return re.sub(r"\s+", " ", variable).strip()
Expected behavior
no warning
As discussed in #64, it would be good to have some checks that can be implemented but are "off" by default. These would be the most opinionated checks that would be a bit too strict to be activated out-of-the-box.
Check for .at and .iat indexing methods. These should probably be separate checks for consistency with other cases. See flashcard. Give error messages:
Use '.loc' instead of '.at'. If speed is important, use numpy.
Use '.iloc' instead of '.iat'. If speed is important, use numpy.
Describe the bug
PD005 is triggered when any module has a method named something like .add()
To Reproduce
I noticed this while using the tarfile
module. The tarfile
module has a method named .add()
Here is the snippet. Flake8 triggered PD005 on both of the tar.add()
calls.
tar = tarfile.open(tar_filename, "w")
tar.add(other_filename)  # <-- triggers PD005
for filename in filenames:
    tar.add(filename)  # <-- triggers PD005
tar.close()
Expected behavior
I expect this to only be triggered when a method like .add()
is called on a pandas
object.
Additional context
I expect that this applies to the comparison operators (PD006) as well. Any fix for PD005 should also be made for PD006.
First off, thank you for developing this plugin!
Is your feature request related to a problem? Please describe.
read_table is no longer deprecated in favour of read_csv following a discussion. As far as I can tell from reading the references in the given motivation, this means there is no longer a good reason to recommend read_csv over read_table per se.
Describe the solution you'd like
The rule should either be categorised as opinionated to require an opt-in (like PD901), removed, or changed to only emit a warning if read_table is called with sep="," (which is what ruff has just done).
It would be cool to have examples of real pandas scripts to feed to the tests. This might help us track down bugs. For example, I am anticipating we might catch false positives in some places where pandas overlaps numpy.
Check for use of the parameter inplace=True. If found, give error message:
inplace operations do not always behave as you might expect
@simchuck is adding some good notes to the wiki about tools to explore and understand the AST. We should add an item "0" in the "How to add a check to the linter" section of README.md
that links folks to the wiki for tips on working with the AST.
Is your feature request related to a problem? Please describe.
I really like this project, and I want to contribute, but there are some roadblocks in the way.
Getting the project setup on my local machine had some issues:
This is not meant to be one big complaint. This project is really useful to me, so thanks for building it.
When flake8 runs check_for_ix, check_for_at, or check_for_iat, it raises AttributeError: 'Name' object has no attribute 'attr'.
This seems to indicate that node.value returns a Name object and that the node.value.attr access should be changed. Not sure if we can follow the pattern for check_for_isnull, etc., because .ix[] and .at[] are accessed via subscripting rather than method calls.
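A guarded predicate along these lines would avoid the crash (a sketch; uses_ix is a hypothetical helper name):

```python
import ast

def uses_ix(node: ast.Subscript) -> bool:
    """True for `<expr>.ix[...]`; returns False instead of raising
    AttributeError when the subscripted value is a bare Name, as in
    `x[0]`, where node.value has no .attr."""
    value = node.value
    return isinstance(value, ast.Attribute) and value.attr == "ix"
```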
https://deppen8.github.io/pandas-bw/general/values-array-to_numpy.png
Check for calls to .values. If found, error message:
Use .to_numpy() if you need a numpy ndarray, or .array (returns a PandasArray)
nox can do much of what TravisCI is doing for us, only it can do it locally and, in the case of black, update the local version that can then be pushed to the repo.
Description/Steps to Reproduce
For the below file, pandas-vet returns the error tmp.py:2:1: PD005 Use arithmetic operator instead of method. It looks like the rule is intended for pandas.DataFrame.sub but is being applied to re.sub:
import re
re.sub('', '', '')
Need to add:
Basically, everything that is currently in the README.md should be moved to the docs. The README.md should have minimal info and a basic example.
Check for .pivot and .unstack methods. See flashcard. Give error message:
'.pivot' and '.unstack' functionality can be achieved with just '.pivot_table'
Every class, function, etc. should have a docstring that at least describes functionality, but many are currently lacking.
Describe the bug
Cannot use the --annoy
flag with flake8
To Reproduce
Steps to reproduce the behavior:
Create a tmp.py file, install pandas-vet, and run flake8 with the --annoy flag. This fails with the error: flake8: error: no such option: --annoy
tmp.py:
import pandas
that = 1
this = that
print(this)
df = 1
Terminal:
(py368) C:\Users\king.kyle\>flake8 tmp.py
tmp.py:1:1: F401 'pandas' imported but unused
tmp.py:1:1: D100 Missing docstring in public module
tmp.py:5:1: T001 print found.
(py368) C:\Users\king.kyle\>python -m pip install pandas-vet
Collecting pandas-vet
Using cached https://files.pythonhosted.org/packages/21/53/d031fd623fde85f554c73d87c431ad4cf5d929d89c1cd728ab5e4d145a52/pandas_vet-0.2.1-py3-none-any.whl
Installing collected packages: pandas-vet
Successfully installed pandas-vet-0.2.1
(py368) C:\Users\king.kyle\>flake8 tmp.py
tmp.py:1:1: F401 'pandas' imported but unused
tmp.py:1:1: D100 Missing docstring in public module
tmp.py:1:1: PD001 pandas should always be imported as 'import pandas as pd'
tmp.py:5:1: T001 print found.
(py368) C:\Users\king.kyle\>flake8 tmp.py --annoy
Usage: flake8 [options] file file ...
flake8: error: no such option: --annoy
(py368) C:\Users\king.kyle\>
Expected behavior
Expected the error PD901 'df' is a bad variable name. Be kinder to your future self.
Environment:
(py368) C:\Users\king.kyle\Developer\Packages\common_dev>python
Python 3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
(py368) C:\Users\king.kyle\Developer\Packages\common_dev>flake8 --version
3.7.9 (flake8-blind-except: 0.1.1, flake8-bugbear: 19.8.0, flake8-docstrings: 1.5.0, pydocstyle: 4.0.1, flake8-mock: 0.3, flake8-pandas-vet: 0.2.1, flake8-print: 3.1.4, flake8-tuple: 0.4.1, flake8_builtins: 1.4.1, flake8_commas: 2.0.0, flake8_deprecated: 1.2, flake8_isort: 2.3, flake8_quotes: 2.1.1, logging-format: 0.6.0, mccabe: 0.6.1, naming: 0.8.2, pycodestyle: 2.5.0, pyflakes: 2.1.1) CPython 3.6.8 on Windows
I would like to expand our docs beyond just the README file. A simple Read-the-Docs site or GitHub Page would be good. It should include pages for
With #88 we have tests, flake8 and black running on every PR and head. We should get some of those fancy banners in the readme to reflect that these are passing.
Take a look at how black does these for inspiration:
https://github.com/psf/black/blame/ce14fa8b497bae2b50ec48b3bd7022573a59cdb1/README.md#L5-L14
For help with #24, we need to better understand the range of groupby
uses. We should collect some real-world uses of groupby
to better understand whether a linting pattern will raise too many false positives.
I have recently started adopting JupyterBook for documenting projects and I love it. It is great for adding the things that I would like to add to pandas-vet
like tutorials and mixing Markdown and reST Sphinx docs.
It also has nice pre-defined GitHub Actions for building the docs and putting them in a branch for GitHub Pages.
Not even sure if this can be implemented, but maybe with a clever regular expression, it could. See the flashcard for some more details.
All of our tests are currently in test_PD001.py. We need to either change the name of that file to something more general or break out our tests into separate files corresponding to the error codes. I prefer the latter. So for each error code, we should have a different test_PD<code>.py file.
The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.
We should implement checks for all of the text-based arithmetic and comparison operators. If found, recommend using the operator itself. Something like:
Use <operator> instead of <method>

| use | check for |
| --- | --- |
| + | .add |
| - | .sub and .subtract |
| * | .mul and .multiply |
| / | .div, .divide and .truediv |
| ** | .pow |
| // | .floordiv |
| % | .mod |
| > | .gt |
| < | .lt |
| >= | .ge |
| <= | .le |
| == | .eq |
| != | .ne |
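For illustration, each text method in the table is interchangeable with its operator:

```python
import pandas as pd

s = pd.Series([4, 9, 16])

via_methods = s.sub(1).mul(2).floordiv(3)   # flagged spellings
via_operators = (s - 1) * 2 // 3            # recommended spellings
```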
Check for .isnull and .notnull methods. Check them separately.
If .isnull is found, give error message:
Use .isna instead of .isnull
If .notnull is found, give error message:
Use .notna instead of .notnull
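For reference, the aliases are interchangeable; the check would simply nudge toward the newer names:

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])

# .isnull/.notnull are aliases of .isna/.notna; prefer the latter
na_mask = s.isna()
ok_mask = s.notna()
```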