Comments (5)
Hi @eapframework, thank you for raising this issue. Could you provide an example of the behavior you're seeing and the behavior you expect/want instead?
We're actually looking into a similar issue. Here's an example unit test using the Completeness analyzer.
For the DataFrame:
def getDfCompleteAndInCompleteColumns(sparkSession: SparkSession): DataFrame = {
  import sparkSession.implicits._
  Seq(
    ("1", "a", "f"),
    ("2", "b", "d"),
    ("3", "a", null),
    ("4", "a", "f"),
    ("5", "b", null),
    ("6", "a", "f")
  ).toDF("item", "att1", "att2")
}
This test checks the row-level results:
"return row-level results for columns filtered" in withSparkSession { session =>
  val data = getDfCompleteAndInCompleteColumns(session)
  val completenessAtt2 = Completeness("att2", Option("att1 = \"a\""))
  val state = completenessAtt2.computeStateFrom(data)
  val metric: DoubleMetric with FullColumn = completenessAtt2.computeMetricFrom(state)
  data.withColumn("new", metric.fullColumn.get).collect().map(_.getAs[Boolean]("new")) shouldBe
    Seq(true, false, false, true, false, true)
}
Using the verification suite on a similar test:
+----+----+----+-----+
|item|att1|att2|rule1|
+----+----+----+-----+
|   1|   a|   f| true|
|   2|   b|   d|false|
|   3|   a|null|false|
|   4|   a|   f| true|
|   5|   b|null|false|
|   6|   a|   f| true|
+----+----+----+-----+
Here we can see that rows that are EITHER filtered out (rows 2 and 5, where att1 is not "a") or fail the check (row 3, where att2 is null) are marked as false.
Would you expect rows 2 and 5 to show true/None in this case?
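The rule1 column in the table above can be reproduced with a small model in plain Scala, with no Spark dependency. This is only a sketch of the observed behavior, not deequ's implementation: the `Row` case class and the `rowLevelCompleteness` helper are hypothetical names, and the filter att1 = "a" is hard-coded as a Scala predicate.

```scala
object RowLevelSketch {
  // Hypothetical plain-Scala stand-in for a row of the test DataFrame.
  // None plays the role of a SQL NULL in att2.
  final case class Row(item: String, att1: String, att2: Option[String])

  val rows: Seq[Row] = Seq(
    Row("1", "a", Some("f")),
    Row("2", "b", Some("d")),
    Row("3", "a", None),
    Row("4", "a", Some("f")),
    Row("5", "b", None),
    Row("6", "a", Some("f"))
  )

  // Observed behavior: a row is true only when it matches the filter
  // AND att2 is non-null, so filtered-out rows come back false as well.
  def rowLevelCompleteness(row: Row): Boolean =
    row.att1 == "a" && row.att2.isDefined

  val observed: Seq[Boolean] = rows.map(rowLevelCompleteness)

  def main(args: Array[String]): Unit =
    println(observed.mkString(", "))
}
```

Because the filter and the check are combined with a plain AND, the result cannot distinguish "did not match the filter" from "matched the filter and failed the check", which is the ambiguity the question above is about.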
from deequ.
Thanks for your feedback @eapframework,
We're working through different use cases for different users and I'm planning a PR for this soon. We're planning on providing a configuration so users can set filtered rules as Null or True - so setting this configuration to True should meet your use-case. I'll tag you on the PR once we have that out as well.
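As a rough illustration of what such a setting could mean for row-level results, here is a plain-Scala sketch. The names `FilteredRowPolicy`, `AsTrue`, `AsNull` and `outcome` are hypothetical and chosen for this example only; they are not taken from the PR or from deequ's API. `None` stands in for a SQL NULL.

```scala
object FilteredRowSketch {
  // Hypothetical setting: what a filtered-out row should report.
  sealed trait FilteredRowPolicy
  case object AsTrue extends FilteredRowPolicy
  case object AsNull extends FilteredRowPolicy

  // Rows that match the filter report the check result;
  // rows that do not match report whatever the policy says.
  def outcome(
      matchesFilter: Boolean,
      passesCheck: Boolean,
      policy: FilteredRowPolicy): Option[Boolean] =
    if (matchesFilter) Some(passesCheck)
    else policy match {
      case AsTrue => Some(true)
      case AsNull => None
    }

  // (matchesFilter, passesCheck) for items 1 to 6 of the Completeness example:
  // filter att1 = "a", check att2 is non-null.
  val rows: Seq[(Boolean, Boolean)] = Seq(
    (true, true), (false, true), (true, false),
    (true, true), (false, false), (true, true)
  )

  val asTrue: Seq[Option[Boolean]] = rows.map { case (m, p) => outcome(m, p, AsTrue) }
  val asNull: Seq[Option[Boolean]] = rows.map { case (m, p) => outcome(m, p, AsNull) }
}
```

Under the `AsTrue` policy only item 3 (matches the filter, att2 is null) ends up false; under `AsNull` items 2 and 5 carry no verdict at all.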
Hi @eycho-am, thanks for merging the PR to address the use-case. Sorry for the delayed response.
It is working fine except when values in the column are null.
For example:
- Rule: containsCreditCardNumber("credit_no", _ == 1.0).where("indicator == 'b'")
  Values:
  credit_no         indicator
  4012888888881881  null
  For the above values, the status is marked as false, but true is expected.
- Rule: hasPattern("account_no", "[0-9]{7}".r).where("indicator == 'b'")
  Values:
  account_no  indicator
  1288888     null
  For the above values, the status is marked as false, but true is expected.
- Rule: hasCompleteness("num_hist", _ >= 0.8).where("indicator == 'b'")
  Values:
  num_hist  indicator
  18        null
  For the above values, the status is marked as blank, but true is expected.
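One possible explanation, offered as an assumption rather than a confirmed diagnosis: in SQL, comparing NULL with a value yields NULL, not false, so `indicator == 'b'` is neither true nor false for these rows. If the row-level logic treats a row as filtered out only when the predicate is definitely false, a NULL predicate falls through. The plain-Scala sketch below models this with `Option[Boolean]`, where `None` stands in for NULL; all helper names are hypothetical and none of this is deequ code.

```scala
object NullFilterSketch {
  // SQL-style equality: comparing with NULL (None) yields NULL (None).
  def sqlEquals(value: Option[String], literal: String): Option[Boolean] =
    value.map(_ == literal)

  // Naive rule: a row counts as filtered out only when the
  // predicate is definitely false.
  def filteredOutNaive(predicate: Option[Boolean]): Boolean =
    predicate.contains(false)

  // Null-safe rule: anything other than a definite true
  // counts as filtered out.
  def filteredOutNullSafe(predicate: Option[Boolean]): Boolean =
    !predicate.contains(true)

  val nullIndicator: Option[Boolean] = sqlEquals(None, "b")          // NULL
  val otherIndicator: Option[Boolean] = sqlEquals(Some("a"), "b")    // false
  val matchingIndicator: Option[Boolean] = sqlEquals(Some("b"), "b") // true

  def main(args: Array[String]): Unit = {
    println(filteredOutNaive(nullIndicator))    // NULL is not treated as filtered out
    println(filteredOutNullSafe(nullIndicator)) // NULL is treated as filtered out
  }
}
```

If this is what is happening, writing the filter so that it can never evaluate to NULL, for example as `coalesce(indicator == 'b', false)` in Spark SQL, may work around the problem on the user side; that is untested against deequ here.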
Thanks for your response. This is the same issue I am facing.
I expect rows 2 and 5 to show true, because those records were filtered out and did not fail the check.
Expected result:
+----+----+----+-----+
|item|att1|att2|rule1|
+----+----+----+-----+
|   1|   a|   f| true|
|   2|   b|   d| true|
|   3|   a|null|false|
|   4|   a|   f| true|
|   5|   b|null| true|
|   6|   a|   f| true|
+----+----+----+-----+
Hi @eapframework, we've merged PR #532, which addresses this issue for the Uniqueness and Completeness analyzers, and we have another PR open for the other analyzers: #535
Please let us know if you have any feedback on these PRs and add comments or open a PR if this doesn't quite meet your use-case.