gate's People

Contributors

kldarek, shreyashankar

Forkers

kldarek

gate's Issues

Utils to drill down into embedding drifts

When embeddings drift, it would be useful to drill down into examples of embeddings that have drifted. This involves:

  1. Clustering embeddings across partitions
  2. Picking examples from each cluster
  3. Adding a DriftResult method that returns these examples
  4. (reach goal) Creating a dashboard that allows users to visualize the drift + examples
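
Steps 1 and 2 could be prototyped roughly as follows. This is a minimal sketch, not GATE's implementation: the function names `kmeans` and `cluster_examples`, the plain k-means with farthest-point initialization, and the dict-of-partitions input format are all assumptions made for illustration.

```python
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next seed: the point farthest from all centroids chosen so far.
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))

    def assign():
        # Label each point with the index of its nearest centroid.
        return [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                for p in points]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign()


def cluster_examples(partitions, k=2, per_cluster=1):
    """partitions: hypothetical dict mapping partition name -> list of vectors.

    Clusters all embeddings jointly, then reports per cluster how many points
    each partition contributes and which points lie closest to the centroid.
    """
    tagged = [(name, i, vec)
              for name, vecs in partitions.items()
              for i, vec in enumerate(vecs)]
    centroids, labels = kmeans([vec for _, _, vec in tagged], k)
    report = []
    for c in range(k):
        members = sorted(
            (t for t, lab in zip(tagged, labels) if lab == c),
            key=lambda t: math.dist(t[2], centroids[c]),
        )
        counts = {}
        for name, _, _ in members:
            counts[name] = counts.get(name, 0) + 1
        examples = [(name, idx) for name, idx, _ in members[:per_cluster]]
        report.append({"cluster": c, "counts": counts, "examples": examples})
    return report
```

A cluster whose counts are dominated by the newest partition is a candidate for "what drifted", and its examples are the rows to show the user. A method on DriftResult (step 3) could return a report of this shape.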

Partition summary for embeddings

Interesting approach to drift detection! Is the partition summary for embeddings the same as the one described at https://dm4ml.github.io/gate/how-it-works/ (listed below), or does it take other factors into account?

  • coverage: The fraction of the column that has non-null values.
  • mean: The mean of the column.
  • p50: The median of the column.
  • num_unique_values: The number of unique values in the column.
  • occurrence_ratio: The count of the most frequent value divided by the total count.
  • p95: The 95th percentile of the column.
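
For reference, the statistics listed above could be computed for a single scalar column roughly like this. It is only a sketch: the function names are hypothetical, the percentiles use linear interpolation, and "total count" is assumed to mean the number of non-null values. GATE may define these differently, and whether embeddings get the same treatment is exactly what this issue asks.

```python
import math
from collections import Counter


def percentile(sorted_vals, q):
    """Linear-interpolation percentile (q in [0, 100]) of a sorted list."""
    if not sorted_vals:
        return None
    pos = (len(sorted_vals) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def summarize_column(values):
    """Summary statistics for one numeric column; None marks a null."""
    present = [v for v in values if v is not None]
    ordered = sorted(present)
    counts = Counter(present)
    return {
        "coverage": len(present) / len(values) if values else None,
        "mean": sum(present) / len(present) if present else None,
        "p50": percentile(ordered, 50),
        "num_unique_values": len(counts),
        # Assumption: the denominator is the non-null count.
        "occurrence_ratio": (counts.most_common(1)[0][1] / len(present)
                             if present else None),
        "p95": percentile(ordered, 95),
    }


if __name__ == "__main__":
    print(summarize_column([1, 2, 2, None, 5]))
```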

Crash of modal due to pyarrow incompatibility

I am trying to run the code for the Data Validation in Production ML Pipelines course, and I run into the following problem both on my local machine and on the modal remote. I think the combination of the latest version of Python and pyarrow causes this problem.

  running build_ext
      creating /tmp/pip-install-y2x7mnua/pyarrow_11f173a8029a4b4aafed72e11e381502/build/temp.linux-x86_64-cpython-312
      -- Running cmake for PyArrow
      cmake -DCMAKE_INSTALL_PREFIX=/tmp/pip-install-y2x7mnua/pyarrow_11f173a8029a4b4aafed72e11e381502/build/lib.linux-x86_64-cpython-312/pyarrow -DPYTHON_EXECUTABLE=/usr/local/bin/python -DPython3_EXECUTABLE=/usr/local/bin/python -DPYARROW_CXXFLAGS= -DPYARROW_BUILD_CUDA=off -DPYARROW_BUILD_SUBSTRAIT=off -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PARQUET_ENCRYPTION=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_GCS=off -DPYARROW_BUILD_S3=off -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off -DPYARROW_BUNDLE_CYTHON_CPP=off -DPYARROW_BUNDLE_PLASMA_EXECUTABLE=on -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release /tmp/pip-install-y2x7mnua/pyarrow_11f173a8029a4b4aafed72e11e381502
      error: command 'cmake' failed: No such file or directory
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.

Pinning pyarrow to a fixed version might solve the problem: during installation I noticed that it required a version >= 11.0.0 and < 12.0.0.
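
The log shows pip building pyarrow from source under cpython-312 and failing because cmake is missing, which suggests that no prebuilt wheel was found for that interpreter. A quick check along these lines can confirm whether the interpreter is the likely cause. It is a sketch only: the constant and function name are hypothetical, and the cut-off at Python 3.11 is an assumption about which interpreters pyarrow 11.x ships wheels for.

```python
import sys

# Assumption: newest CPython minor version with prebuilt pyarrow 11.x wheels.
LAST_PY_WITH_WHEELS = (3, 11)


def likely_source_build(py_version=None):
    """Return True if pip would probably have to compile pyarrow 11.x."""
    major_minor = tuple((py_version or sys.version_info)[:2])
    return major_minor > LAST_PY_WITH_WHEELS


if __name__ == "__main__":
    if likely_source_build():
        print("This interpreter is newer than the assumed wheel range; "
              "try an older Python or a newer pyarrow.")
    else:
        print("A prebuilt pyarrow 11.x wheel is probably available.")
```

If the check fires, the two options are to run the course code on an older interpreter or to relax the pyarrow pin so that a release with wheels for the newer Python can be installed.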

Revamp embeddings clustering

Currently, a fixed number of embedding clusters is identified per partition. We want to:

  1. Have the number of clusters be dynamic (use GATE's PCA method to determine the number of clusters)
  2. Come up with one-sentence summaries for each cluster, for interpretability. We can probably use an LLM for this.
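
One possible reading of item 1, sketched under stated assumptions: take the per-component variances (eigenvalues) from a PCA of the embeddings and use the smallest number of components that explains a chosen share of the total variance as the cluster count. This is a common heuristic, not necessarily what GATE's PCA method does. The function name and the 0.9 default threshold are hypothetical.

```python
def clusters_from_variances(variances, threshold=0.9):
    """Smallest k whose top-k PCA variances reach `threshold` of the total.

    variances: per-component variances (eigenvalues), in any order.
    """
    if not variances:
        raise ValueError("need at least one component variance")
    ordered = sorted(variances, reverse=True)
    target = threshold * sum(ordered)
    running = 0.0
    for k, var in enumerate(ordered, start=1):
        running += var
        if running >= target:
            return k
    return len(ordered)


if __name__ == "__main__":
    # Hypothetical eigenvalues for a four-component PCA.
    print(clusters_from_variances([5.0, 3.0, 1.0, 1.0], threshold=0.75))
```

With this rule the cluster count grows when variance is spread over many directions and shrinks when a few directions dominate, so each partition can get a different number of clusters.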
