Comments (26)
Hey @simonwongwong hope all is well!
I'll take a closer look at this sometime this week. Thanks for bringing this up and opening up the issue.
from datacompy.
@jborchma Just want to circle back on this. Thoughts on just checking if both are empty and throwing an exception? This might be something that is never encountered (comparing two empty dataframes).
So technically two empty dataframes should be equal. Maybe we could return True and a log message?
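For what it's worth, pandas itself already treats two default-constructed empty dataframes as equal, so an explicit both-empty check is cheap (a minimal sketch, plain pandas, no datacompy involved):

```python
import pandas as pd

# Two freshly constructed empty dataframes: pandas considers them equal.
df1, df2 = pd.DataFrame(), pd.DataFrame()
print(df1.equals(df2))  # True

# The "both empty" guard discussed above is a trivial length check.
print(len(df1) == 0 and len(df2) == 0)  # True
```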
I'm aligned with that. I'll try and do a quick PR here.
@jborchma So it seems like @simonwongwong is comparing arrays here, so obviously empty arrays make sense. But the following also doesn't work:

```python
df1 = pd.DataFrame({"some_col": [np.array([1, 2]) for _ in range(10)], "id": [i for i in range(10)]})
df2 = pd.DataFrame({"some_col": [np.array([1, 2]) for _ in range(10)], "id": [i for i in range(10)]})
```

Mainly due to the fact that `ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()` is raised: 0 elements is an issue, 1 element works fine, and then more than 1 element is also an issue.

I guess this isn't a simple change, but I'll look into it. I guess we never really had a use case where the field was an array.
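The failure can be reproduced without datacompy at all; comparing the object columns directly raises the same error:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([np.array([1, 2]) for _ in range(3)])
s2 = pd.Series([np.array([1, 2]) for _ in range(3)])

# Each cell comparison yields a multi-element boolean array, so pandas cannot
# reduce it to a single True/False and raises the ambiguity error.
try:
    s1 == s2
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)
```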
@simonwongwong Are you comparing a lot of np.arrays? (Could you have arrays of > 1 length?) I'd like to think about the use case a bit more if you have thoughts.
My use case was reading CSV files with empty arrays using pandas -- pandas will read the arrays as NumPy arrays, and two empty NumPy arrays cannot be equal.
Makes sense. I think the issue boils down to how pandas internalises the dtype for an array: it will be `O` (object). But so is a string, or any other item which isn't a base dtype that pandas supports. We could check the first item of that column and see if it is a `np.ndarray`, but that seems really ugly to me. I'm open to other thoughts or suggestions. For your use case, will it always be empty?
In my case it wasn't always empty.
Another option could be to convert them to Python lists instead of `np.ndarray`.
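A quick sketch of the list-conversion idea: `list == list` returns a single bool, so pandas can compare the cells element-wise without the ambiguity error (this is an illustration, not datacompy code):

```python
import numpy as np
import pandas as pd

s1 = pd.Series([np.array([]), np.array([1, 2])])
s2 = pd.Series([np.array([]), np.array([1, 2])])

# Converting ndarray cells to plain lists sidesteps the ambiguity:
# comparing two lists yields one bool, and empty lists compare equal.
l1 = s1.apply(lambda v: v.tolist() if isinstance(v, np.ndarray) else v)
l2 = s2.apply(lambda v: v.tolist() if isinstance(v, np.ndarray) else v)
print((l1 == l2).all())  # True
```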
@theianrobertson Thoughts on this issue? Dataframes with numpy arrays in columns.
So what Simon really wants is elementwise comparison of the arrays, right?
I guess we would want to use something like the numpy function to compare arrays.
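One candidate, for example, is `np.array_equal`, which collapses the comparison to a single bool and treats two empty arrays as equal:

```python
import numpy as np

# np.array_equal returns one bool per pair of arrays, so there is no
# ambiguous truth value, and two empty arrays compare as equal.
print(np.array_equal(np.array([]), np.array([])))          # True
print(np.array_equal(np.array([1, 2]), np.array([1, 2])))  # True
print(np.array_equal(np.array([1, 2]), np.array([1, 3])))  # False
```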
Yeah, on non-empty arrays it'll do a normal element-wise comparison, but empty `np.array`s will never be equal.
> Yeah, on non-empty arrays it'll do a normal element-wise comparison, but empty `np.array`s will never be equal.

If you look at my above example, I'm not sure datacompy will automatically work. It will complain and suggest `any` or `all`.
@jborchma Any further thoughts on this? I think the main issue is where, and if, you draw the line on what to compare.
@simonwongwong This was a while back now. I'm going to close this issue, but feel free to reopen if it seems like something we need to rehash. Trying to organize our backlog and work through some of these older issues as needed.
Hi everyone! This is a feature we're missing and I'm happy to spend some time implementing a solution (and also coming up with a proposal how to move forward, if you want).
Hey @jonashaag yes please. Would love contributions and thoughts from others. Happy to have you take this on. Appreciate your willingness to help out. 🚀
Had a look into the implementation -- the actual column comparison code (`columns_equal`) seems rather inflexible, built specifically for use cases at Capital One. Here are two ideas for how to deal with the NumPy array issue:
A) Add new fixed logic for NumPy arrays: try to detect NumPy array columns by looking at the actual series values, and use `.all()` for NumPy arrays.
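A rough sketch of what Option A could look like; the helper names below (`looks_like_array_column`, `arrays_equal`) are hypothetical and not part of datacompy:

```python
import numpy as np
import pandas as pd

def looks_like_array_column(col: pd.Series) -> bool:
    """Detect array columns by inspecting actual values (first non-null cell only)."""
    non_null = col.dropna()
    return len(non_null) > 0 and isinstance(non_null.iloc[0], np.ndarray)

def arrays_equal(col_1: pd.Series, col_2: pd.Series) -> pd.Series:
    """Cell-wise equality for array columns, using np.array_equal per cell
    so empty and multi-element arrays both reduce to a single bool."""
    return pd.Series(
        [np.array_equal(a, b) for a, b in zip(col_1, col_2)],
        index=col_1.index,
    )

col_a = pd.Series([np.array([]), np.array([1, 2])])
col_b = pd.Series([np.array([]), np.array([1, 2])])
if looks_like_array_column(col_a):
    print(arrays_equal(col_a, col_b).all())  # True
```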
B) Add a new system for custom declaration of "comparators", i.e. give the user more flexibility to configure how columns are compared. We would ship a default configuration that mimics the current behaviour, and users would be free to change the configuration to their liking. This could be as simple as giving a list of comparators that are tried in order until one of them "understands" the data, i.e. the user could pass something like:

```python
columns_equal(..., comparators=[
    FloatComparator(rtol=1e-3),
    StringComparator(case_sensitive=False),
    ArrayComparator(aggregate="all"),  # calls .all()
])
```

Or it could be an explicit list of comparators for each column, or something similar.
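To make the comparator idea concrete, here is a minimal sketch of such a chain. Every class and function name here is hypothetical; none of this exists in datacompy:

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd

@dataclass
class ArrayComparator:
    aggregate: str = "all"  # "all" or "any"

    def understands(self, a, b) -> bool:
        return isinstance(a, np.ndarray) and isinstance(b, np.ndarray)

    def compare(self, a, b) -> bool:
        # np.array_equal sidesteps the ambiguous-truth-value problem;
        # aggregate="any" falls back to element-wise any() on equal shapes.
        if self.aggregate == "all":
            return np.array_equal(a, b)
        return a.shape == b.shape and bool((a == b).any())

class DefaultComparator:
    """Catch-all comparator mimicking plain == behaviour."""
    def understands(self, a, b) -> bool:
        return True

    def compare(self, a, b) -> bool:
        return bool(a == b)

def columns_equal_sketch(col_1, col_2, comparators):
    """Try each comparator in order until one understands the cell pair."""
    def cell_equal(a, b):
        for comp in comparators:
            if comp.understands(a, b):
                return comp.compare(a, b)
        return False
    return pd.Series([cell_equal(a, b) for a, b in zip(col_1, col_2)], index=col_1.index)

col_1 = pd.Series([np.array([]), np.array([1, 2]), "x"])
col_2 = pd.Series([np.array([]), np.array([1, 2]), "x"])
print(columns_equal_sketch(col_1, col_2, [ArrayComparator(), DefaultComparator()]).all())  # True
```

The design choice here is that each comparator both detects and compares, so adding support for a new cell type (decimals, nested lists, ...) means adding one class rather than touching the core comparison loop.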
@jonashaag I'll take a look at this on Monday; I've been on vacation all week. Thanks for your help with this. I do think datacompy could be ready for a major refactor, to be honest, especially aligning the Spark and pandas APIs.
Reopening this issue. I like the idea of option B, @jonashaag, but that seems like a bit of a refactor and something I've been thinking about with the package. I'd like to revisit it and see if there are opportunities to make it more flexible and also to have it play nicer with Spark/pandas all in one spot. I was thinking maybe Koalas might be a good option here. Option A would be the quickest and, it seems, would solve this direct issue immediately.
Thoughts? @jonashaag @jborchma @elzzhu @theianrobertson?
I have little experience with Spark and I'm not sure if I'll be able to invest the learning time right now.
That is perfectly fine; that is something I can lean into.