Comments (5)
The issue seems to have nothing to do with the selection predicate itself. Rather, the FINALDATA lens seems to have trouble when called on to classify rows that do not appear in the final result set. Concretely, MISSING_VALUE runs data-harvesting queries of the form:
PROJECT[ROWID <= JOIN_ROWIDS(__LHS_ROWID, __RHS_ROWID), ID <= PRODUCT_ID, CATEGORY <= PRODUCT_CATEGORY, NAME <= PRODUCT_NAME, RATING <= {{ TYPEDRATINGS1_1[__LHS_ROWID] }}, PID <= {{ TYPEDRATINGS1_0[__LHS_ROWID] }}, REVIEW_CT <= {{ TYPEDRATINGS1_2[__LHS_ROWID] }}, BRAND <= PRODUCT_BRAND, __MIMIR_CONDITION <= ( ({{ TYPEDRATINGS1_0[__LHS_ROWID] }}=PRODUCT_ID) AND (JOIN_ROWIDS(__LHS_ROWID, __RHS_ROWID)='2.1') ) ](
JOIN(
PROJECT[RATINGS1_PID <= RATINGS1_PID, RATINGS1_RATING <= RATINGS1_RATING, RATINGS1_REVIEW_CT <= RATINGS1_REVIEW_CT, __LHS_ROWID <= ROWID](
RATINGS1(RATINGS1_PID:string, RATINGS1_RATING:string, RATINGS1_REVIEW_CT:string // ROWID:rowid)
),
PROJECT[PRODUCT_ID <= PRODUCT_ID, PRODUCT_NAME <= PRODUCT_NAME, PRODUCT_BRAND <= PRODUCT_BRAND, PRODUCT_CATEGORY <= PRODUCT_CATEGORY, __RHS_ROWID <= ROWID](
PRODUCT(PRODUCT_ID:string, PRODUCT_NAME:string, PRODUCT_BRAND:string, PRODUCT_CATEGORY:string // ROWID:rowid)
)
)
)
Note the condition:
( ({{ TYPEDRATINGS1_0[__LHS_ROWID] }}=PRODUCT_ID) AND (JOIN_ROWIDS(__LHS_ROWID, __RHS_ROWID)='2.1') )
2.1 is the rowid of the row that the MISSING_VALUE lens is being asked to classify: the join of the 2nd row of ratings1 and the 1st row of product. Looking at the data, these two rows do not join, so ({{ TYPEDRATINGS1_0[__LHS_ROWID] }}=PRODUCT_ID) is false.
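To make the shape of that condition concrete, here is a minimal Python sketch. It assumes JOIN_ROWIDS simply composes the two rowids as '<lhs>.<rhs>', which is inferred from the '2.1' example and not confirmed by Mimir's source. The function names and the sample key values are hypothetical; the thread does not list the actual table contents.

```python
# Hypothetical stand-in: Mimir's real JOIN_ROWIDS implementation is not shown
# in this thread. The '<lhs>.<rhs>' format is an assumption based on '2.1'.
def join_rowids(lhs_rowid: str, rhs_rowid: str) -> str:
    return f"{lhs_rowid}.{rhs_rowid}"


def mimir_condition(pid: str, product_id: str,
                    lhs_rowid: str, rhs_rowid: str, target: str) -> bool:
    # Mirrors ((PID = PRODUCT_ID) AND (JOIN_ROWIDS(lhs, rhs) = target)).
    return pid == product_id and join_rowids(lhs_rowid, rhs_rowid) == target


# Made-up key values, chosen only to show the two cases.
# Rowid matches '2.1' but the join keys differ, so the row is filtered out.
no_join = mimir_condition("P2", "P1", "2", "1", "2.1")
# Rowid matches and the join keys agree, so the row would be kept.
joins = mimir_condition("P1", "P1", "2", "1", "2.1")

print(no_join, joins)
```

The first case corresponds to the situation described here: the rowid filter selects the pair, but the join predicate rejects it, so the harvesting query returns no row for the lens to classify.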
What seems to be happening is that rating>4 is triggering some sort of premature evaluation of classify() for a row that is straight up not in the result set.
For reference, here's the full query:
--- Optimized Query ---
PROJECT[NAME <= PRODUCT_NAME, BRAND <= PRODUCT_BRAND, CATEGORY <= PRODUCT_CATEGORY, REVIEW_CT <= {{ TYPEDRATINGS1_2[__LHS_ROWID] }}, PID <= {{ TYPEDRATINGS1_0[__LHS_ROWID] }}, ID <= PRODUCT_ID, RATING <= CASE WHEN {{ TYPEDRATINGS1_1[__LHS_ROWID] }} IS NULL THEN {{ FINALDATA_3[JOIN_ROWIDS(__LHS_ROWID, __RHS_ROWID)] }} ELSE {{ TYPEDRATINGS1_1[__LHS_ROWID] }} END, __MIMIR_CONDITION <= ( ({{ TYPEDRATINGS1_0[__LHS_ROWID] }}=PRODUCT_ID) AND (CASE WHEN {{ TYPEDRATINGS1_1[__LHS_ROWID] }} IS NULL THEN {{ FINALDATA_3[JOIN_ROWIDS(__LHS_ROWID, __RHS_ROWID)] }} ELSE {{ TYPEDRATINGS1_1[__LHS_ROWID] }} END>4) ) ](
JOIN(
PROJECT[RATINGS1_PID <= RATINGS1_PID, RATINGS1_RATING <= RATINGS1_RATING, RATINGS1_REVIEW_CT <= RATINGS1_REVIEW_CT, __LHS_ROWID <= ROWID](
RATINGS1(RATINGS1_PID:string, RATINGS1_RATING:string, RATINGS1_REVIEW_CT:string // ROWID:rowid)
),
PROJECT[PRODUCT_ID <= PRODUCT_ID, PRODUCT_NAME <= PRODUCT_NAME, PRODUCT_BRAND <= PRODUCT_BRAND, PRODUCT_CATEGORY <= PRODUCT_CATEGORY, __RHS_ROWID <= ROWID](
PRODUCT(PRODUCT_ID:string, PRODUCT_NAME:string, PRODUCT_BRAND:string, PRODUCT_CATEGORY:string // ROWID:rowid)
)
)
)
I think this is also fixed by commit fe0ae23.
I'd like to test things a bit more before closing the issue outright, since I still don't have an idea why this got broken in the first place. Do you know why the fix fixed things?
The missing value lens was breaking for multiple columns because every missing value model created by the lens was using the same iterator to get its results. So with a combination of multiple missing value models and no-op models, only one of the missing value models was getting the actual data, since as of now there is no reset() in the iterator interface.
Now each model gets its own iterator. I think this is why this issue is being resolved.
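The failure mode described above can be reproduced in miniature. The sketch below is Python, not Mimir's actual code: the Model class, the column names passed to it, and the sample rows are all hypothetical. It only illustrates why a shared, non-resettable iterator starves every consumer after the first.

```python
class Model:
    """Hypothetical model that trains on whatever rows its iterator yields."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows  # an iterator over training rows
        self.seen = []

    def train(self):
        # Drains the iterator; an exhausted iterator yields nothing.
        self.seen = list(self.rows)
        return len(self.seen)


# Made-up training data for illustration only.
data = [("r1", 4.5), ("r2", None), ("r3", 3.0)]

# Broken pattern: both models share a single iterator with no reset().
shared = iter(data)
first = Model("RATING", shared)
second = Model("REVIEW_CT", shared)
shared_counts = (first.train(), second.train())

# Fixed pattern: each model gets its own iterator over the same data.
third = Model("RATING", iter(data))
fourth = Model("REVIEW_CT", iter(data))
own_counts = (third.train(), fourth.train())

print(shared_counts)  # (3, 0)
print(own_counts)     # (3, 3)
```

With the shared iterator, the second model sees zero rows because the first one already consumed them. Giving each model a fresh iterator avoids the need for a reset() method at the cost of scanning the data once per model.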