Ok... so this was a case of reality (and me) being stupider than I'd originally expected.
The specific error was observed on the NYC Cause of Death test files (now in the repo under `/test/NYC_CoD`). These two files come from the NYS open data portal and include data for 2008-2014 and 2008-2016 respectively. When given to the shape detector:
- The shape detector inconsistently reports no nulls in the `DEATHS` column (which contains `.` characters for some lines) in the first dataset:
  - For the second dataset, it instead reports a small percentage of nulls in the same column.
  - Two other columns, `DEATH_RATE` and `AGE_ADJUSTED_DEATH_RATE`, also contain `.` characters that are reported properly as nulls.
- The shape detector does not report the `M`/`F` -> `M`/`F`/`Male`/`Female` or the `White Non-Hispanic`/`Black Non-Hispanic` -> `Non-Hispanic White`/`Non-Hispanic Black` errors in the 2nd dataset.
🤦♂️
These errors result from a combination of several issues that should now be resolved (as of 8a37630).
CAST Inconsistency
Spark's CAST operation evaluates `CAST('.' AS bigint)` to `0`, while Mimir's evaluates it to `NULL`. Curiously, `CAST('.' AS double)` evaluates to `NULL` in Spark as well...
The inconsistency between Mimir and Spark needs to be fixed (#364), but it was particularly pronounced because evaluation could be shared between Mimir and Spark. Due to an old optimization, Mimir would take over evaluation of the final stages of a query, since these usually had Scala UDFs and SQLite had rather poor performance due to repeated crossing of the JVM boundary.
I simplified the compiler pipeline, removing this unnecessary optimization. The behavior of the Null-ish facets should now be more stable; in particular, users shouldn't see query results that differ based on query complexity.
Legitimate Data Errors
So... it turns out that in the 2015/2016 data dump, NYC added blank cells as missing values. These are universally interpreted as `NULL` by Mimir and Spark, so the 2nd dataset actually had a mix of `'.'`s and `''`s in the `DEATHS` column, and actually did legitimately have nulls where the first dataset did not.
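The way the two effects combine can be reproduced with a toy model. The sketch below is not Mimir or Spark code: `strict_cast` and `lenient_cast` are hypothetical stand-ins that only mimic the behavior reported above (`'.'` becoming `NULL` versus `0`), and the sample columns are made up for illustration.

```python
from typing import Callable, List, Optional


def strict_cast(value: str) -> Optional[int]:
    """Hypothetical stand-in for the Mimir behavior described above:
    anything that is not an integer literal becomes NULL (None)."""
    try:
        return int(value)
    except ValueError:
        return None


def lenient_cast(value: str) -> Optional[int]:
    """Hypothetical stand-in for the reported Spark bigint behavior:
    '.' becomes 0; everything else is handled like strict_cast."""
    if value == ".":
        return 0
    return strict_cast(value)


def null_fraction(column: List[str], cast: Callable[[str], Optional[int]]) -> float:
    """Fraction of cells that are NULL after applying the given cast."""
    if not column:
        return 0.0
    return sum(cast(cell) is None for cell in column) / len(column)


# Made-up sample data: the first dump marks missing values only with '.',
# the second dump mixes '.' with blank cells.
first_dump = ["12", ".", "7", ".", "31"]
second_dump = ["12", ".", "7", "", "31"]

for name, column in [("first", first_dump), ("second", second_dump)]:
    print(name,
          "lenient:", null_fraction(column, lenient_cast),
          "strict:", null_fraction(column, strict_cast))
```

Under the lenient cast the first column shows no nulls at all while the second shows a small fraction, which matches the symptom described at the top; under the strict cast both columns report the same fraction.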
Senility
I could have sworn that there were domain facets already implemented. I could have also sworn we had an `oxfordComma` method in `StringUtils`. Apparently, I was wrong on both counts... at least until now. There are two new facet types: `DrawnFromRange` locks in the min/max values of a sequential-typed column, and `DrawnFromDomain` locks in the set of distinct values of any string-typed column with fewer than 20 distinct values. With these, the shape detector now correctly handles the two test data files.
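As a rough illustration of what these two facets check, here is a minimal Python sketch. It is not the Mimir implementation (which lives in Scala): every function name below is hypothetical, the column name `sex` is an assumption, and only the "fewer than 20 distinct values" threshold comes from the description above.

```python
from typing import Iterable, List, Optional, Set, Tuple


def oxford_comma(items: List[str], conjunction: str = "and") -> str:
    """Join items as 'a', 'a and b', or 'a, b, and c'."""
    if len(items) <= 1:
        return "".join(items)
    if len(items) == 2:
        return f"{items[0]} {conjunction} {items[1]}"
    return ", ".join(items[:-1]) + f", {conjunction} {items[-1]}"


def learn_domain(values: Iterable[str], max_distinct: int = 20) -> Optional[Set[str]]:
    """Lock in the distinct values of a string column, or return None
    when the column has max_distinct or more distinct values."""
    distinct = set(values)
    return distinct if len(distinct) < max_distinct else None


def learn_range(values: Iterable[float]) -> Tuple[float, float]:
    """Lock in the (min, max) of a non-empty sequential-typed column."""
    materialized = list(values)
    return (min(materialized), max(materialized))


def domain_violations(domain: Set[str], values: Iterable[str]) -> List[str]:
    """Values that fall outside a previously learned domain, sorted."""
    return sorted(set(values) - domain)


def range_violations(bounds: Tuple[float, float], values: Iterable[float]) -> List[float]:
    """Values that fall outside a previously learned (min, max) range."""
    low, high = bounds
    return [v for v in values if v < low or v > high]


# Learn facets on the first dump, then test the second dump against them.
sex_domain = learn_domain(["M", "F", "F", "M"])
unexpected = domain_violations(sex_domain, ["M", "F", "Male", "Female"])
message = f"sex has unexpected values: {oxford_comma(unexpected)}"
print(message)

year_range = learn_range([2008, 2011, 2014])
print(range_violations(year_range, [2008, 2015, 2016]))
```

Learning the facet on the first file and checking the second against it is what surfaces the `M`/`F` versus `Male`/`Female` drift; the range check flags the extra years in the same way.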