Comments (7)
Hey @PABNY, would you be able to provide an example of the column names you are using? Typically, when we create data frames, the best practice is to use _
in place of spaces. I'd like to look into this a bit more and have a discussion about it. Thanks for raising the issue!
from datacompy.
Hi Faisal
The issue is in the data itself: the values in the join columns contain whitespace, not the column names.
Like
JOIN_COL_1='B' in DF1
JOIN_COL_1='B ' in DF2
So should the section below be enhanced, or a new section added, to strip the join columns as well?
I guess the code below only strips the non-join columns in the two data frames:
try:
    if ignore_spaces:
        if col_1.dtype.kind == "O":
            col_1 = col_1.str.strip()
        if col_2.dtype.kind == "O":
            col_2 = col_2.str.strip()
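To illustrate the point being made here, a minimal sketch of what a value comparison with an ignore_spaces switch might look like. The function name values_match is hypothetical and is not datacompy's API; it only mirrors the excerpt above. The stripping happens when two already-matched columns are compared, so it would not help if the rows were never matched on the join key in the first place.

```python
import pandas as pd


def values_match(col_1, col_2, ignore_spaces=False):
    """Hypothetical helper: element-wise equality of two Series.

    When ignore_spaces is True, leading/trailing whitespace is stripped
    from string-valued columns before comparing.
    """
    if ignore_spaces:
        if pd.api.types.is_string_dtype(col_1):
            col_1 = col_1.str.strip()
        if pd.api.types.is_string_dtype(col_2):
            col_2 = col_2.str.strip()
    return col_1 == col_2


left = pd.Series(["B", "C"])
right = pd.Series(["B ", "C"])

strict = values_match(left, right).tolist()
lenient = values_match(left, right, ignore_spaces=True).tolist()
print(strict, lenient)
```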
For now, as a workaround, after reading the data into a DF we do the following before passing the DF to datacompy's compare function:
df = df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
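A narrower variant of this workaround, as a sketch: strip only the join columns instead of every string column. The helper name strip_join_columns is made up for illustration, and a plain pandas merge stands in for the comparison so the example has no dependency on datacompy.

```python
import pandas as pd


def strip_join_columns(df, join_columns):
    """Hypothetical helper: return a copy of df with whitespace stripped
    from the string-valued join columns only."""
    if isinstance(join_columns, str):
        join_columns = [join_columns]
    out = df.copy()  # leave the caller's frame untouched
    for col in join_columns:
        if pd.api.types.is_string_dtype(out[col]):
            out[col] = out[col].str.strip()
    return out


df1 = pd.DataFrame({"some_col": list(range(10)), "id": [str(i) for i in range(10)]})
df2 = pd.DataFrame({"some_col": list(range(10)), "id": [str(i) + " " for i in range(10)]})

# Inner merge on the raw keys: "0" never equals "0 ", so nothing matches.
raw_matches = len(df1.merge(df2, on="id"))

# Same merge after stripping the join column on both sides.
clean_matches = len(
    strip_join_columns(df1, "id").merge(strip_join_columns(df2, "id"), on="id")
)
print(raw_matches, clean_matches)
```

The cleaned frames could then be passed to the comparison in place of the originals; the other columns are left exactly as they were read.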
from datacompy.
Ahh I see. Sorry I misunderstood the original comment there. OK, let me take a look in a bit. Just a bit busy ATM, so will get back to you shortly.
from datacompy.
@jborchma Thoughts on stripping the join column values before processing? I'm ok with it but wanted to get your thoughts. Any downsides from your side?
from datacompy.
Example:
import pandas as pd
import datacompy
df1 = pd.DataFrame({"some_col": [i for i in range(10)], "id": [str(i) for i in range(10)]})
df2 = pd.DataFrame({"some_col": [i for i in range(10)], "id": [str(i)+ " " for i in range(10)]})
pdcompare = datacompy.Compare(df1, df2, join_columns="id")
print(pdcompare.report())
Which yields:
DataComPy Comparison
--------------------
DataFrame Summary
-----------------
  DataFrame  Columns  Rows
0       df1        2    10
1       df2        2    10
Column Summary
--------------
Number of columns in common: 2
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0
Row Summary
-----------
Matched on: id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 0
Number of rows in df1 but not in df2: 10
Number of rows in df2 but not in df1: 10
Number of rows with some compared columns unequal: 0
Number of rows with all compared columns equal: 0
Column Comparison
-----------------
Number of columns compared with some values unequal: 0
Number of columns compared with all values equal: 2
Total number of values which compare unequal: 0
Sample Rows Only in df1 (First 10 Columns)
------------------------------------------
   some_col id
5       5.0  5
6       6.0  6
3       3.0  3
4       4.0  4
1       1.0  1
2       2.0  2
9       9.0  9
0       0.0  0
7       7.0  7
8       8.0  8
Sample Rows Only in df2 (First 10 Columns)
------------------------------------------
    some_col id
19       9.0  9
10       0.0  0
14       4.0  4
13       3.0  3
12       2.0  2
11       1.0  1
18       8.0  8
17       7.0  7
15       5.0  5
16       6.0  6
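The row counts in this report can be reproduced with plain pandas, assuming the comparison behaves like an outer join on exact key equality (an assumption for this sketch, not a statement about datacompy's internals). In the sketch, the unmatched rows are padded with NaN, which turns the integer column into floats, similar to the 5.0-style values in the sample rows above.

```python
import pandas as pd

df1 = pd.DataFrame({"some_col": list(range(10)), "id": [str(i) for i in range(10)]})
df2 = pd.DataFrame({"some_col": list(range(10)), "id": [str(i) + " " for i in range(10)]})

# Outer join on the raw key; the _merge column records where each row came from.
outer = df1.merge(df2, on="id", how="outer", suffixes=("_df1", "_df2"), indicator=True)

in_both = int((outer["_merge"] == "both").sum())
only_df1 = int((outer["_merge"] == "left_only").sum())
only_df2 = int((outer["_merge"] == "right_only").sum())
print(in_both, only_df1, only_df2)

# The df1 value column now holds NaN for the df2-only rows, so it is float.
print(outer["some_col_df1"].dtype)
```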
from datacompy.
I don't really have any issues with that. Sounds like a nice feature to have.
from datacompy.