
okennedy commented on July 23, 2024

Ok... so this was a case of reality (and me) being stupider than I'd originally expected.

The specific error being seen was based on the NYC Cause of Death test files (now in the repo under /test/NYC_CoD). These two files come from the NYS open data portal and include data for 2008-2014 and 2008-2016 respectively. When given to the shape detector:

  • The shape detector inconsistently reports no nulls in the DEATHS column of the first dataset, even though that column contains . characters on some lines:
    • For the second dataset, it instead reports a small percentage of nulls in the same column.
    • Two other columns, DEATH_RATE and AGE_ADJUSTED_DEATH_RATE, also contain . characters, and these are reported properly as nulls.
  • The shape detector does not report the M/F -> M/F/Male/Female or the White Non-Hispanic/Black Non-Hispanic -> Non-Hispanic White/Non-Hispanic Black errors in the 2nd dataset.

🤦‍♂️

These errors result from a combination of several issues that should now be resolved (as of 8a37630).

CAST Inconsistency

Spark's CAST operation evaluates CAST('.' AS bigint) to 0, while Mimir's evaluates it to NULL. Curiously, CAST('.' AS double) evaluates to NULL in Spark as well...
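To make the effect on null detection concrete, here is a minimal Python sketch. It is not Spark's or Mimir's actual cast code: `strict_cast_int`, `lenient_cast_int`, and `null_fraction` are hypothetical helpers, and the sample column is made up. It only mimics the two behaviors described above (unparseable text becoming NULL versus 0).

```python
from typing import Callable, List, Optional


def strict_cast_int(text: Optional[str]) -> Optional[int]:
    """Hypothetical strict cast: unparseable text becomes None (NULL)."""
    if text is None:
        return None
    try:
        return int(text.strip())
    except ValueError:
        return None


def lenient_cast_int(text: Optional[str]) -> Optional[int]:
    """Hypothetical lenient cast: unparseable text becomes 0 instead of None."""
    if text is None:
        return None
    parsed = strict_cast_int(text)
    return 0 if parsed is None else parsed


def null_fraction(values: List[str], cast: Callable) -> float:
    """Fraction of values that are NULL after applying the given cast."""
    results = [cast(v) for v in values]
    return sum(r is None for r in results) / len(results)


# Made-up sample of a DEATHS-like column using '.' as a missing-value marker.
deaths = ["12", ".", "7", ".", "30"]

strict_nulls = null_fraction(deaths, strict_cast_int)
lenient_nulls = null_fraction(deaths, lenient_cast_int)
```

With the strict cast, two of the five sample values count as nulls; with the lenient cast, none do. The same column therefore looks clean or dirty depending on which engine ran the cast.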

The inconsistency between Mimir and Spark needs to be fixed (#364), but it was particularly pronounced because evaluation could be shared between Mimir and Spark. Under an old optimization, Mimir would take over evaluation of the final stages of a query, since these usually had Scala UDFs and SQLite had rather poor performance due to repeated crossing of the JVM boundary.

I simplified the compiler pipeline by removing this unnecessary optimization. The behavior of the Null-ish facets should now be more stable; in particular, users shouldn't see query results that differ based on query complexity.

Legitimate Data Errors

So... it turns out that in the 2015/2016 data dump, NYC added blank cells as missing values. These are universally interpreted as NULL by Mimir and Spark, so the 2nd dataset actually had a mix of '.'s and ''s in the DEATHS column, and actually did legitimately have nulls where the first dataset did not.
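A quick way to spot this kind of mix is to tally candidate missing-value markers per file. The sketch below is only illustrative: `marker_counts` is a hypothetical helper, and the two sample columns are invented stand-ins for the DEATHS column, not values from the real files.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def marker_counts(column: Iterable[str],
                  markers: Tuple[str, ...] = (".", "")) -> Dict[str, int]:
    """Count how often each candidate missing-value marker appears in a column."""
    counts = Counter(cell.strip() for cell in column)
    return {marker: counts.get(marker, 0) for marker in markers}


# Invented samples: the first uses only '.', the second mixes '.' and blanks.
deaths_first = ["12", ".", "7", "."]
deaths_second = ["12", ".", "", "7", ""]

first_counts = marker_counts(deaths_first)
second_counts = marker_counts(deaths_second)
```

In this sample, only the second column has any blank cells, which is the pattern that made its nulls legitimate.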

Senility

I could have sworn that there were domain facets already implemented. I could have also sworn we had an oxfordComma method in StringUtils. Apparently, I was wrong on both counts... at least until now. There are two new facet types: DrawnFromRange locks in the min/max values of a sequential-typed column, and DrawnFromDomain locks in the set of distinct values of any string-typed column with fewer than 20 distinct values. This now correctly handles the two test data files.
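As a rough illustration of the idea, and not the code that went into Mimir, the two facet types could be sketched in Python as below. The class names and the 20-value threshold come from the description above. Everything else is an assumption made for the example: the `violations` and `detect_*` functions, the column names SEX and YEAR, and the sample values.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Sequence

MAX_DOMAIN_SIZE = 20  # domains are only locked in below this many distinct values


@dataclass(frozen=True)
class DrawnFromDomain:
    column: str
    domain: FrozenSet[str]

    def violations(self, values: Sequence[str]) -> List[str]:
        """Distinct values that fall outside the locked-in domain."""
        return sorted(set(values) - self.domain)


@dataclass(frozen=True)
class DrawnFromRange:
    column: str
    low: float
    high: float

    def violations(self, values: Sequence[float]) -> List[float]:
        """Values that fall outside the locked-in [low, high] range."""
        return [v for v in values if v < self.low or v > self.high]


def detect_domain(column: str, values: Sequence[str]) -> Optional[DrawnFromDomain]:
    """Lock in the distinct values, but only for small domains."""
    distinct = frozenset(values)
    if len(distinct) < MAX_DOMAIN_SIZE:
        return DrawnFromDomain(column, distinct)
    return None


def detect_range(column: str, values: Sequence[float]) -> DrawnFromRange:
    """Lock in the observed min/max (assumes a non-empty, non-null column)."""
    return DrawnFromRange(column, min(values), max(values))


# Hypothetical columns: facets are learned on one file, then checked on another.
sex_facet = detect_domain("SEX", ["M", "F", "M", "F"])
year_facet = detect_range("YEAR", [2008, 2011, 2014])

new_sex_values = sex_facet.violations(["M", "F", "Male", "Female"])
new_years = year_facet.violations([2008, 2015, 2016])
```

Checked against the later file, the domain facet flags Male and Female as values outside the original M/F domain, and the range facet flags the years beyond the original maximum.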
