postgresml / postgresml

The GPU-powered AI application database. Get your app to market faster using the simplicity of SQL and the latest NLP, ML + LLM models.

Home Page: https://postgresml.org

License: MIT License


postgresml's Issues

Does xgboost work on the GPU?

My test:

SELECT * FROM pgml.train(
project_name => 'totalvalor',
task => 'classification',
relation_name => 'vtotalvalor',
y_column_name => 'id_inferencia',
algorithm => 'xgboost',
hyperparams => '{"tree_method" : "gpu_hist"}'
);

Results:

INFO: Snapshotting table "vtotal", this may take a little while...
INFO: Validating relation: vtotal
INFO: Validating relation: vtotal
INFO: Snapshot of table "vtotal" created and saved in "pgml"."snapshot_1"
INFO: Dataset { num_features: 6, num_labels: 1, num_distinct_labels: 3, num_rows: 1947801, num_train_rows: 1460851, num_test_rows: 486950 }
INFO: Training Model { id: 1, algorithm: xgboost, runtime: rust }
INFO: Hyperparameter searches: 1, cross validation folds: 1
INFO: Hyperparams: {
"tree_method": "gpu_hist"
}
ERROR: called Result::unwrap() on an Err value: XGBError { desc: "[07:58:02] /app/target/release/build/xgboost-sys-8117c1510d0b1933/out/xgboost/src/gbm/../common/common.h:239: XGBoost version not compiled with GPU support.\nStack trace:\n [bt] (0) /usr/lib/postgresql/12/lib/pgml.so(+0x1af9c68) [0x7fbe2f48ac68]\n [bt] (1) /usr/lib/postgresql/12/lib/pgml.so(+0x1af9d01) [0x7fbe2f48ad01]\n [bt] (2) /usr/lib/postgresql/12/lib/pgml.so(+0x1af9dbd) [0x7fbe2f48adbd]\n [bt] (3) /usr/lib/postgresql/12/lib/pgml.so(+0x1b10460) [0x7fbe2f4a1460]\n [bt] (4) /usr/lib/postgresql/12/lib/pgml.so(+0x19c665c) [0x7fbe2f35765c]\n [bt] (5) /usr/lib/postgresql/12/lib/pgml.so(+0x19b4868) [0x7fbe2f345868]\n [bt] (6) /usr/lib/postgresql/12/lib/pgml.so(+0x19460ad) [0x7fbe2f2d70ad]\n [bt] (7) /usr/lib/postgresql/12/lib/pgml.so(+0x2e594c) [0x7fbe2dc7694c]\n [bt] (8) /usr/lib/postgresql/12/lib/pgml.so(+0x150101) [0x7fbe2dae1101]\n\n" }
CONTEXT: src/bindings/xgboost.rs:211:43
SQL state: XX000

... But Python xgboost works perfectly with the GPU on another test model ...

postgresql-pgml-12 version 2.0.2

Thanks!

unable to locate postgresml-13

When I run apt-get update && apt-get install -y postgresql-pgml-13, I get the error E: Unable to locate package postgresql-pgml-13

Could you please check this?

Friendly error for HuggingFace models that don't exist

select pgml.embed('intfloat/e5_large', 'this is a test');

INFO:  Cache miss for "intfloat/e5_large", loading transformer, please wait
ERROR:  called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'huggingface_hub.utils._errors.RepositoryNotFoundError'>, value: RepositoryNotFoundError('401 Client Error. (Request ID: Root=1-6453dc33-0113b3d85bfa81fa2711c0f2)\n\nRepository Not Found for url: https://huggingface.co/api/models/intfloat/e5_large.\nPlease make sure you specified the correct `repo_id` and `repo_type`.\nIf you are trying to access a private or gated repo, make sure you are authenticated.\nInvalid username or password.'), traceback: Some(<traceback object at 0x7f72f59c4ec0>) }

What actually happened here is the transformer name is with a dash, not an underscore. We should return a friendlier error that tells the user what actually happened.
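As a sketch of what a friendlier error could look like, a hypothetical helper that compares the requested name against a list of known model names (e.g. the local cache; pgml may not have such a list available) and suggests a near miss before surfacing the raw RepositoryNotFoundError:

```python
import difflib

def suggest_model_name(requested, known_models):
    """Build a friendlier 'model not found' message, suggesting close matches.

    `known_models` is a hypothetical list of model names to compare against;
    dashes vs. underscores produce a high similarity ratio, so typos like
    'intfloat/e5_large' match 'intfloat/e5-large'.
    """
    matches = difflib.get_close_matches(requested, known_models, n=1, cutoff=0.8)
    if matches:
        return (f"Model '{requested}' was not found on Hugging Face. "
                f"Did you mean '{matches[0]}'?")
    return (f"Model '{requested}' was not found on Hugging Face. "
            "Check the repo id for typos (dashes vs. underscores are a common mix-up).")

print(suggest_model_name("intfloat/e5_large", ["intfloat/e5-large", "bert-base-uncased"]))
```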

Python version support for 3.6

Hi,

Curious whether Python 3.7 is a hard requirement, or if this could run on an earlier 3.x version. In particular, I'm curious about Python 3.6, as we already run it and would love to give this a try.

Iris models always predict the same class

Following the iris classification example, you end up with a model that classifies every flower as Iris-virginica.

pgml_development=# SELECT pgml.predict('Iris Classifier', ARRAY[sepal_length, sepal_width, petal_length, petal_width]) AS prediction, count(target) FROM pgml.iris GROUP BY prediction;
 prediction | count
------------+-------
          2 |   150
(1 row)

This is the case for at least the following algorithms:

  • linear
  • ridge
  • xgboost
  • random_forest

Dashboard snapshot analysis is not static

The snapshot samples, correlations, and data size are from the live table, not a point-in-time snapshot from model training. This also manifests as an error if you view a snapshot in the dashboard after the original table has been dropped.

ERROR: Service 'postgres' failed to build: Unknown flag: chown

Hi,
After cloning the repo, I get an error during docker-compose:

cd postgresml && docker-compose up

Building postgres
Step 1/14 : FROM debian:bullseye-slim
---> c9cb6c086ef7
Step 2/14 : MAINTAINER [email protected]
---> Using cache
---> 3664ba4873e1
Step 3/14 : RUN apt-get update
---> Using cache
---> d923784e5c89
Step 4/14 : ARG DEBIAN_FRONTEND=noninteractive
---> Using cache
---> daef8f796ade
Step 5/14 : ENV TZ Etc/UTC
---> Using cache
---> 3af4c9b887c7
Step 6/14 : RUN apt-get install -y postgresql-plpython3-13 python3 python3-pip postgresql-13 tzdata sudo cmake
---> Using cache
---> 2e573258b766
Step 7/14 : RUN pip3 install xgboost sklearn diptest
---> Using cache
---> 3f93c7617c69
Step 8/14 : COPY --chown=postgres:postgres . /app
ERROR: Service 'postgres' failed to build: Unknown flag: chown

My Docker version on CentOS:
docker version
Client:
Version: 1.13.1
API version: 1.26
Package version: docker-1.13.1-209.git7d71120.el7.centos.x86_64
Go version: go1.10.3
Git commit: 7d71120/1.13.1
Built: Wed Mar 2 15:25:43 2022
OS/Arch: linux/amd64

Server:
Version: 1.13.1
API version: 1.26 (minimum version 1.12)
Package version: docker-1.13.1-209.git7d71120.el7.centos.x86_64
Go version: go1.10.3
Git commit: 7d71120/1.13.1
Built: Wed Mar 2 15:25:43 2022
OS/Arch: linux/amd64
Experimental: false

signal 11: Segmentation fault

I've pulled a fresh version of postgresml from GitHub instead of using the Ubuntu package (which, from what I saw, still doesn't have the preprocessing features), and managed to compile it and replace the .so library with the new one. But now it segfaults even on the basic example:

2023-02-03 16:34:32.815 UTC [125591] LOG: database system is ready to accept connections
2023-02-03 16:34:44.698 UTC [125591] LOG: server process (PID 125611) was terminated by signal 11: Segmentation fault
2023-02-03 16:34:44.698 UTC [125591] DETAIL: Failed process was running: SELECT * FROM pgml.train(
project_name => 'Breast Cancer Detection',
task => 'classification',
relation_name => 'pgml.breast_cancer',
y_column_name => 'malignant'
)

Ubuntu 22.04, PostgreSQL 14

Unable to use Categorical columns as target or input fields for classification and regression

For classification, if the target is categorical, the model does not train and throws an assertion error; it expects the categories to be numbers such as 0, 1, 2, etc.
For regression, the model trains with categorical input fields, but prediction does not accept categorical values.
How can categorical columns with string values be used as input or target?
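Until string categories are handled natively, one workaround is to encode them to integers before training and decode predictions afterwards. A minimal sketch in pure Python (the function names here are hypothetical; the same mapping could live in a SQL CASE expression or a lookup table):

```python
def make_label_encoder(values):
    """Map each distinct string category to an integer code 0, 1, 2, ..."""
    categories = sorted(set(values))
    to_code = {cat: i for i, cat in enumerate(categories)}
    to_cat = {i: cat for cat, i in to_code.items()}
    return to_code, to_cat

targets = ["setosa", "versicolor", "virginica", "setosa"]
to_code, to_cat = make_label_encoder(targets)

# Encode string labels to the integers the trainer expects.
encoded = [to_code[t] for t in targets]
print(encoded)    # [0, 1, 2, 0]

# Decode a prediction back to its category name.
print(to_cat[2])  # virginica
```

The same encoder must be applied consistently at training and prediction time, otherwise the integer codes will not line up.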

pgml.embed does not work

I use the hosted version of PGML.

select pgml.version() returns 2.1.1.

As a test, I have a database

create table european_anthems (
    country varchar(20),
    anthem_first_line varchar(255)
);

I have inserted 20 rows into this table and can query them as normal.

I now want to create embedding vectors for anthem_first_line, and according to the manual this can be done using:

SELECT pgml.embed('distilbert-base-uncased', anthem_first_line)
FROM european_anthems;

However, this gives the error message:

[42883] ERROR: function pgml.embed(unknown, character varying) does not exist Hint: No function matches the given name and argument types. You might need to add explicit type casts. Position: 8

Has the method been changed? Do I need to enable something extra on the hosted version of PGML to make this work?

Can you please update the documentation accordingly?

docker-compose up fails with 11 errors

Tested on master and v2.2.0.

$ docker-compose up
Sending build context to Docker daemon  339.4kB
Step 1/4 : FROM rust:1
 ---> d4572ea67e7e
Step 2/4 : COPY . /app
 ---> Using cache
 ---> d97b737197ef
Step 3/4 : WORKDIR /app
 ---> Using cache
 ---> 07ba01bbe2ce
Step 4/4 : RUN cargo build
 ---> Running in 9707e2dc17f0
    Updating crates.io index
 Downloading crates ...
  Downloaded adler v1.0.2
.
.
.
Compiling pgml-dashboard v2.2.0 (/app)
error: failed to find data for query SELECT * FROM pgml.notebooks WHERE id = $1
  --> src/models.rs:89:13
   |
89 |             sqlx::query_as!(Notebook, "SELECT * FROM pgml.notebooks WHERE id = $1", id,)
   |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query_as` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query INSERT INTO pgml.notebooks (name) VALUES ($1) RETURNING *
   --> src/models.rs:96:12
    |
96  |           Ok(sqlx::query_as!(
    |  ____________^
97  | |             Notebook,
98  | |             "INSERT INTO pgml.notebooks (name) VALUES ($1) RETURNING *",
99  | |             name,
100 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query_as` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query SELECT * FROM pgml.notebooks
   --> src/models.rs:106:12
    |
106 |         Ok(sqlx::query_as!(Notebook, "SELECT * FROM pgml.notebooks")
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query_as` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query SELECT * FROM pgml.notebook_cells
                       WHERE notebook_id = $1
                       AND deleted_at IS NULL
                   ORDER BY cell_number
   --> src/models.rs:112:12
    |
112 |           Ok(sqlx::query_as!(
    |  ____________^
113 | |             Cell,
114 | |             "SELECT * FROM pgml.notebook_cells
115 | |                 WHERE notebook_id = $1
...   |
118 | |             self.id,
119 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query_as` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query UPDATE pgml.notebook_cells
                       SET
                       execution_time = NULL,
                       rendering = NULL
                   WHERE notebook_id = $1
                   AND cell_type = $2
   --> src/models.rs:125:17
    |
125 |           let _ = sqlx::query!(
    |  _________________^
126 | |             "UPDATE pgml.notebook_cells
127 | |                 SET
128 | |                 execution_time = NULL,
...   |
133 | |             CellType::Sql as i32,
134 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query
                   WITH
                       lock AS (
                           SELECT * FROM pgml.notebooks WHERE id = $1 FOR UPDATE
                       ),
                       max_cell AS (
                           SELECT COALESCE(MAX(cell_number), 0) AS cell_number
                           FROM pgml.notebook_cells
                           WHERE notebook_id = $1
                           AND deleted_at IS NULL
                       )
                   INSERT INTO pgml.notebook_cells
                       (notebook_id, cell_type, contents, cell_number, version)
                   VALUES
                       ($1, $2, $3, (SELECT cell_number + 1 FROM max_cell), 1)
                   RETURNING id,
                           notebook_id,
                           cell_type,
                           contents,
                           rendering,
                           execution_time,
                           cell_number,
                           version,
                           deleted_at
   --> src/models.rs:187:12
    |
187 |           Ok(sqlx::query_as!(
    |  ____________^
188 | |             Cell,
189 | |             "
190 | |             WITH
...   |
215 | |             contents,
216 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query_as` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query SELECT
                           id,
                           notebook_id,
                           cell_type,
                           contents,
                           rendering,
                           execution_time,
                           cell_number,
                           version,
                           deleted_at
                       FROM pgml.notebook_cells
                       WHERE id = $1

   --> src/models.rs:222:12
    |
222 |           Ok(sqlx::query_as!(
    |  ____________^
223 | |             Cell,
224 | |             "SELECT
225 | |                     id,
...   |
237 | |             id,
238 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query_as` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query UPDATE pgml.notebook_cells
                   SET
                       cell_type = $1,
                       contents = $2,
                       version = version + 1
                   WHERE id = $3
   --> src/models.rs:252:17
    |
252 |           let _ = sqlx::query!(
    |  _________________^
253 | |             "UPDATE pgml.notebook_cells
254 | |             SET
255 | |                 cell_type = $1,
...   |
261 | |             self.id,
262 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query UPDATE pgml.notebook_cells
                   SET deleted_at = NOW()
                   WHERE id = $1
                   RETURNING id,
                           notebook_id,
                           cell_type,
                           contents,
                           rendering,
                           execution_time,
                           cell_number,
                           version,
                           deleted_at
   --> src/models.rs:270:12
    |
270 |           Ok(sqlx::query_as!(
    |  ____________^
271 | |             Cell,
272 | |             "UPDATE pgml.notebook_cells
273 | |             SET deleted_at = NOW()
...   |
284 | |             self.id
285 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query_as` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query UPDATE pgml.notebook_cells SET rendering = $1 WHERE id = $2
   --> src/models.rs:339:9
    |
339 | /         sqlx::query!(
340 | |             "UPDATE pgml.notebook_cells SET rendering = $1 WHERE id = $2",
341 | |             rendering,
342 | |             self.id
343 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query` (in Nightly builds, run with -Z macro-backtrace for more info)

error: failed to find data for query INSERT INTO pgml.uploaded_files (id, created_at) VALUES (DEFAULT, DEFAULT)
                       RETURNING id, created_at
   --> src/models.rs:798:12
    |
798 |           Ok(sqlx::query_as!(
    |  ____________^
799 | |             UploadedFile,
800 | |             "INSERT INTO pgml.uploaded_files (id, created_at) VALUES (DEFAULT, DEFAULT)
801 | |                 RETURNING id, created_at"
802 | |         )
    | |_________^
    |
    = note: this error originates in the macro `$crate::sqlx_macros::expand_query` which comes from the expansion of the macro `sqlx::query_as` (in Nightly builds, run with -Z macro-backtrace for more info)

error: could not compile `pgml-dashboard` due to 11 previous errors
warning: build failed, waiting for other jobs to finish...
1 error occurred:
	* Status: The command '/bin/sh -c cargo build' returned a non-zero code: 101, Code: 101

failed to get `linfa` as a dependency of package `pgml v2.1.2 (/app)`

Hi,

I am trying to build a Docker image for just pgml-extension. I'm getting an error at 'RUN cargo pgx package'; I'm building this on Postgres 13 and Debian 10.

Step 30/35 : RUN cargo pgx package
 ---> Running in 48fc7645f156
Error:
   0: couldn't get cargo metadata
   1: `cargo metadata` exited with an error:     Updating crates.io index
          Updating git repository `https://github.com/postgresml/lightgbm-rs`
          Updating git submodule `https://github.com/microsoft/LightGBM/`
          Updating git submodule `https://gitlab.com/libeigen/eigen.git`
          Updating git submodule `https://github.com/lemire/fast_double_parser.git`
          Updating git submodule `https://github.com/abseil/abseil-cpp.git`
          Updating git submodule `https://github.com/google/double-conversion.git`
          Updating git submodule `https://github.com/fmtlib/fmt.git`
          Updating git submodule `https://github.com/boostorg/compute`
      error: failed to get `linfa` as a dependency of package `pgml v2.1.2 (/app)`

      Caused by:
        failed to load source for dependency `linfa`

      Caused by:
        Unable to update /app/deps/linfa

      Caused by:
        failed to read `/app/deps/linfa/Cargo.toml`

      Caused by:
        No such file or directory (os error 2)


Location:
   /var/lib/postgresql/.cargo/registry/src/github.com-1ecc6299db9ec823/cargo-pgx-0.6.0/src/metadata.rs:23

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ SPANTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

   0: cargo_pgx::command::package::execute
      at /var/lib/postgresql/.cargo/registry/src/github.com-1ecc6299db9ec823/cargo-pgx-0.6.0/src/command/package.rs:50

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

What is happening?

SELECT 'id_inferencia', pgml.predict( 2, ARRAY [ 1.84,0.05565,133,133,16739,1.38,0.006346 ]) AS prediccion;
->
 ?column? (text) | prediccion (real)
 id_inferencia   | 1

SELECT 'id_inferencia', pgml.predict( 2, ARRAY[ 'var0', 'var1', 'var2', 'var3', 'var4', 'var5', 'var6' ]) AS prediccion
FROM pgml.vwdata;

ERROR: function pgml.predict(integer, text[]) does not exist
LINE 1: SELECT id_inferencia, pgml.predict( 2, ARRAY[ 'var0', 'var1'...
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
SQL state: 42883
Character: 23

I'm a bit lost ...

Thanks!

Unable to add new project

While trying to add a new project, we get a 403 error. Please see the screenshot below.

(screenshot)

We have also changed the allowed hosts entry in the following files under the pgml_dashboard directory: .env.TEMPLATE, .env.docker, and settings.py. Please let us know whether these changes are possible.

Support using materialized views

Materialized views are recorded in a separate metadata table and not in information_schema.tables, so trying to use one in pgml.train() produces a "table not found" error.

A quick workaround is to create a regular view around the materialized view:

CREATE VIEW my_view AS SELECT * FROM my_mat_view;

Streaming responses from LLMs

It'd be nice to have an API over a server-side cursor that returns individual tokens as rows from the model, to stream responses back to the end user.
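A rough sketch of what the consumer side of such an API could look like: rows of tokens are fetched in batches (as FETCH on a server-side cursor would return them) and forwarded to the user as they arrive. The database round-trip is stubbed out with a plain generator, and all names here are hypothetical:

```python
def fetch_token_rows(batch_size=2):
    """Stand-in for `FETCH <n> FROM cursor`: yields batches of token rows.

    In the real feature these tokens would come from the model via a
    hypothetical set-returning function consumed through a cursor.
    """
    tokens = ["The", " quick", " brown", " fox"]
    for i in range(0, len(tokens), batch_size):
        yield tokens[i:i + batch_size]

def stream_response():
    """Forward each token to the end user as soon as its batch arrives."""
    for batch in fetch_token_rows():
        for token in batch:
            yield token

print("".join(stream_response()))  # The quick brown fox
```

The point of the cursor-based shape is that the client controls the pace: each FETCH pulls only the tokens generated so far, rather than blocking until the whole completion is done.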

Train/Inference between unevenly matched replicas

Hello,

I'm Ilya from The Stone Cross Foundation of Ukraine, a data-driven humanitarian volunteer cooperative tasked with solving some of the issues charities in Ukraine are facing during these tumultuous times. We discovered this project via HN a couple of weeks back and have been evaluating it ever since. We're using a high-availability variant of TimescaleDB, another Postgres extension, which provides time-series and some advanced capabilities: so-called hypertables and hyperfunctions. This is where we keep all our data, including but not limited to geospatial data, analytics, and everything related to the day-to-day operations of our charity, the pleas and the volunteers we're working with.

For us, the primary area of interest is the following: how does postgresml store the trained models, and can these models be streaming-replicated to the other replicas in order to perform inference afterwards? Our database instances are not particularly beefy on the CPU side, and it wouldn't be cost-effective to make them so, but we would really love to perform some regressions on our data in the following way:

  1. Provision a training replica on a highly-performant VM from the latest backup.
  2. (Inquiry: Does pgml's backend implementation in Python support GPU devices?)
  3. Give it time to get up to speed with the latest changes, WAL, etc.
  4. Construct a materialised corpus of our data to be used in training.
  5. pgml.train()
  6. Have other replicas pick it up and perform inference on it.
  7. The training replica can be terminated until the next time we would want to re-train the model.

I'm wondering whether this approach is even possible, or if it's too far-out?

This could potentially significantly simplify our data analysis pipeline. Instead of having to worry about ingest, provisioning of the dedicated machine learning services, their inference APIs exposure, and many other things required to do basic machine learning, we could just play around with our Postgres-derived replicas and perform these scheduled training sessions without having to do any of that!

In our case this could translate to relief of actual human suffering.

Best regards,
Ilya

When training a classification model, I receive an error

Data is a two-column view:

  1. vector - Integer[]
  2. result - Integer

Running the following:

SELECT * FROM pgml.train(
  'commits:category:build',
  'classification',
  'commits_build',
  'result'
);

This results in the following error:

Query 1 ERROR: ERROR:  ValueError: y should be a 1d array, got an array of shape (325, 2) instead.
CONTEXT:  Traceback (most recent call last):
  PL/Python function "train", line 4, in <module>
    status = train(
  PL/Python function "train", line 839, in train
  PL/Python function "train", line 718, in fit
  PL/Python function "train", line 568, in roc_auc_score
  PL/Python function "train", line 74, in _average_binary_score
  PL/Python function "train", line 342, in _binary_roc_auc_score
  PL/Python function "train", line 977, in roc_curve
  PL/Python function "train", line 741, in _binary_clf_curve
  PL/Python function "train", line 1151, in column_or_1d
PL/Python function "train"
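The scorer in the traceback (roc_auc_score via column_or_1d) wants the target as a flat 1-d sequence, and a shape of (325, 2) suggests it was handed two values per row, perhaps a column-vector or an extra column pulled into the target. A minimal sketch, in plain Python, of the kind of flattening that would satisfy it, while refusing to silently flatten a genuinely two-column target (function name is hypothetical):

```python
def ravel_single_column(y):
    """Flatten a column-vector target [[0], [1], ...] to [0, 1, ...].

    Raises ValueError if a row has more than one value, since silently
    flattening a two-column target would train on the wrong labels.
    """
    flat = []
    for row in y:
        if len(row) != 1:
            raise ValueError(f"expected 1 value per row, got {len(row)}")
        flat.append(row[0])
    return flat

print(ravel_single_column([[0], [1], [1], [0]]))  # [0, 1, 1, 0]
```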

"could not find native static library dmlc" error on Centos8

Hi,

I am trying to install PostgresML from source on CentOS 8 and am having a problem at the cargo pgx package step.
I already installed Python xgboost and the other packages, but I am getting the "error: could not find native static library dmlc" error.
Can you please help me with this issue?

$ rustc --version
rustc 1.68.2 (9eb3afe9e 2023-03-27)
$ pip3 install xgboost lightgbm scikit-learn
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: xgboost in /usr/local/lib64/python3.9/site-packages (1.7.5)
Requirement already satisfied: lightgbm in /usr/local/lib/python3.9/site-packages (3.3.5)
Requirement already satisfied: scikit-learn in /usr/local/lib64/python3.9/site-packages (1.2.2)
Requirement already satisfied: numpy in /usr/local/lib64/python3.9/site-packages (from xgboost) (1.24.2)
Requirement already satisfied: scipy in /usr/local/lib64/python3.9/site-packages (from xgboost) (1.10.1)
Requirement already satisfied: wheel in /usr/local/lib/python3.9/site-packages (from lightgbm) (0.40.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.9/site-packages (from scikit-learn) (3.1.0)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.9/site-packages (from scikit-learn) (1.2.0)
$    

Error:

$ cargo pgx package
       Using PgConfig("pg15") and `pg_config` from /usr/pgsql-15/bin/pg_config
    Building extension with features python pg15
     Running command "cargo" "build" "--release" "--features" "python pg15" "--no-default-features" "--message-format=json-render-diagnostics"
   Compiling pyo3-build-config v0.17.3
   Compiling xgboost-sys v0.2.0 (https://github.com/postgresml/rust-xgboost.git?branch=master#8f3a8fb7)
   Compiling xgboost v0.2.0 (https://github.com/postgresml/rust-xgboost.git?branch=master#8f3a8fb7)
error: could not find native static library `dmlc`, perhaps an -L flag is missing?

error: could not compile `xgboost-sys` due to previous error
warning: build failed, waiting for other jobs to finish...
$ 

Verbose output:

   ...
   Compiling typetag v0.2.7
   Compiling linfa-linear v0.6.0 (/var/lib/pgsql/postgresml/pgml-extension/deps/linfa/algorithms/linfa-linear)
   Compiling rmp-serde v1.1.1
   Compiling blas-src v0.8.0
   Compiling csv v1.2.1
   Compiling linfa-svm v0.6.0 (/var/lib/pgsql/postgresml/pgml-extension/deps/linfa/algorithms/linfa-svm)
   Compiling linfa-logistic v0.6.0 (/var/lib/pgsql/postgresml/pgml-extension/deps/linfa/algorithms/linfa-logistic)
   Compiling pgx v0.7.4
   Compiling lightgbm v0.2.3 (https://github.com/postgresml/lightgbm-rs?branch=main#ab547686)
   Compiling xgboost v0.2.0 (https://github.com/postgresml/rust-xgboost.git?branch=master#8f3a8fb7)
error: could not find native static library `dmlc`, perhaps an -L flag is missing?

error: could not compile `xgboost-sys` due to previous error
warning: build failed, waiting for other jobs to finish...
$

version 2.2.0 gives pgml.predict is not unique

The new version, installed from the apt repo, gives errors such as:
error returned from database: function pgml.predict(unknown, smallint[]) is not unique

Caused by:
function pgml.predict(unknown, smallint[]) is not unique

even for most of the examples in the notebooks section.

Test scripts missing?

docker-compose up results in:

...
postgres_1   | psql:tests/test.sql:17: error:  target | prediction 
postgres_1   | --------+------------
postgres_1   |       0 |          0
postgres_1   |       1 |          1
postgres_1   |       2 |          2
postgres_1   |       3 |          3
postgres_1   |       4 |          4
postgres_1   |       5 |          5
postgres_1   |       6 |          6
postgres_1   |       7 |          7
postgres_1   |       8 |          8
postgres_1   |       9 |          9
postgres_1   | (10 rows)
postgres_1   | 
postgres_1   | Time: 2.516 ms
postgres_1   | tests/joint_regression.sql: No such file or directory
postgresml_postgres_1 exited with code 3

XGBoost & LightGBM on Apple M1 & M2

XGBoost and LightGBM won't compile, or segfault, on Apple M1 (and M2) aarch64. The issue comes from OpenMP, which appears to break and requires special workarounds that seem to be unreliable.

XGBoost issue: dmlc/xgboost#7039

Broken link ( https://postgresml.org/projects/ ) in Tutorial 1: ⏱️ Real Time Fraud Detection

Hi, I'm following the tutorials, and they are super helpful.

I don't know if this is exactly the right place to report this.

On Tutorial 1: ⏱️ Real Time Fraud Detection, step 7, there is a broken link.

We'll organize our work on this task under the project name "Breast Cancer Detection", which you can now see it in your list of projects.

And

You can pop over to the projects tab for a visualization

Edit: the broken links appear throughout multiple tutorials, and I realise now that the links should point to https://postgresml.org/dashboard/projects

Functions using `blas` cause a segfault (SIGSEV)

After working with @levkk and @montanalow to install PostgresML (as of master: 63ebce3) on my linux box, I discovered that functions such as pgml.cosine_similarity and pgml.norm_l1 cause Postgres to segfault.

As an example:

[v15.1][5126] pgml=# select pgml.norm_l1(ARRAY[1,2,3]::real[]);
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The connection to the server was lost. Attempting reset: Failed.
Time: 188.973 ms
[v][] ?!> 
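For comparison while debugging, a dependency-free reference of what pgml.norm_l1 and pgml.cosine_similarity are expected to compute. These are just the textbook definitions in pure Python, not pgml's BLAS-backed implementation:

```python
import math

def norm_l1(v):
    """L1 norm: the sum of absolute values of the vector's components."""
    return sum(abs(x) for x in v)

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(norm_l1([1, 2, 3]))                 # 6
print(cosine_similarity([1, 0], [1, 0]))  # 1.0
```

So the crashing query above should simply have returned 6, which points the finger at the sdot_/BLAS call in the backtrace rather than the math.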

Postgres logs leading up to a crash against pgml.cosine_similarity() are:

/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/utils/logging.py:65: RuntimeWarning: Error deriving logger module name, using <None>. Exception: <module '' from '/home/pg/15/data'> is a built-in module
  warnings.warn(
No sentence-transformers model found with name /home/zombodb/.cache/torch/sentence_transformers/intfloat_e5-large. Creating a new one with MEAN pooling.
2023-05-04 18:35:48.950 UTC [20973] LOG:  server process (PID 21218) was terminated by signal 11: Segmentation fault
2023-05-04 18:35:48.950 UTC [20973] DETAIL:  Failed process was running: select *, pgml.cosine_similarity(embed, pgml.embed('intfloat/e5-large', 'meetings with beer or wine and cheese')) from embeddings_e5large_100k limit 10;
2023-05-04 18:35:48.950 UTC [20973] LOG:  terminating any other active server processes
2023-05-04 18:35:48.953 UTC [20973] LOG:  all server processes terminated; reinitializing
2023-05-04 18:35:48.979 UTC [20973] FATAL:  Can't attach, lock is not in an empty state: PgLwLockInner
2023-05-04 18:35:48.980 UTC [20973] LOG:  database system is shut down

The backtrace from a --debug build of pgml is:

Thread 1 "postgres" received signal SIGSEGV, Segmentation fault.
0x00007ff52bc65d76 in sdot_ () from /home/pg/15/lib/postgresql/pgml.so
(gdb) bt
#0  0x00007ff52bc65d76 in sdot_ () from /home/pg/15/lib/postgresql/pgml.so
#1  0x00007ff52b8b363a in blas::sdot (n=1024, x=..., incx=1, y=..., incy=1)
    at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/blas-0.22.0/src/lib.rs:109
#2  0x00007ff52b7c6aa6 in pgml::vectors::cosine_similarity_s (vector=..., other=...) at src/vectors.rs:304
#3  0x00007ff52b7c6d9a in pgml::vectors::cosine_similarity_s_wrapper::cosine_similarity_s_wrapper_inner (_fcinfo=0x55fa5656e560) at src/vectors.rs:302
#4  0x00007ff52b4ae1c1 in pgml::vectors::cosine_similarity_s_wrapper::{closure#0} () at src/vectors.rs:302
#5  0x00007ff52b6edb8c in std::panicking::try::do_call<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (
    data=0x7ffe798f2828) at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:483
#6  0x00007ff52b6f0f6b in __rust_try.llvm.11079318101650794703 () from /home/pg/15/lib/postgresql/pgml.so
#7  0x00007ff52b6ea049 in std::panicking::try<pgrx_pg_sys::submodules::datum::Datum, pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}> (f=...)
    at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:447
#8  0x00007ff52b75a0f6 in std::panic::catch_unwind<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...)
    at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panic.rs:137
#9  0x00007ff52b765983 in pgrx_pg_sys::submodules::panic::run_guarded<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...) at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/pgrx-pg-sys-0.8.3/src/submodules/panic.rs:403
#10 0x00007ff52b77111c in pgrx_pg_sys::submodules::panic::pgrx_extern_c_guard<pgml::vectors::cosine_similarity_s_wrapper::{closure_env#0}, pgrx_pg_sys::submodules::datum::Datum> (f=...) at /home/zombodb/.cargo/registry/src/github.com-1ecc6299db9ec823/pgrx-pg-sys-0.8.3/src/submodules/panic.rs:380
#11 0x00007ff52b7c6c9d in pgml::vectors::cosine_similarity_s_wrapper (_fcinfo=0x55fa5656e560) at src/vectors.rs:302
#12 0x000055fa54ce4b43 in ExecInterpExpr ()
#13 0x000055fa54cf15a2 in ExecScan ()
#14 0x000055fa54d0c368 in ExecLimit ()
#15 0x000055fa54ce88a2 in standard_ExecutorRun ()

My box is a (humblebrag):

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen Threadripper 3970X 32-Core Processor
    CPU family:          23
    Model:               49
    Thread(s) per core:  2
    Core(s) per socket:  32
    Socket(s):           1
    Stepping:            0
    Frequency boost:     enabled
    CPU max MHz:         3700.0000
    CPU min MHz:         2200.0000
    BogoMIPS:            7386.30
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb 
                         rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 mo
                         vbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt t
                         ce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 s
                         mep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_loc
                         al clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter p
                         fthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization features: 
  Virtualization:        AMD-V
Caches (sum of all):     
  L1d:                   1 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    16 MiB (32 instances)
  L3:                    128 MiB (8 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-63
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Vulnerable
  Spec store bypass:     Vulnerable
  Spectre v1:            Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
  Spectre v2:            Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

With an nvidia RTX 4080:

  nvidia-debugdump -l
Found 1 NVIDIA devices
   Device ID:              0
   Device name:            NVIDIA GeForce RTX 4080   (*PrimaryCard)
   GPU internal ID:        GPU-b772ddf7-d413-e1bb-d1e1-8e7022c59343

Lev helped me discover that by commenting out this line,

println!("cargo:rustc-link-lib=static=openblas");

everything works:

[v15.1][8595] pgml=# select pgml.norm_l1(ARRAY[1,2,3]::real[]);
 norm_l1 
---------
       6
(1 row)

Time: 0.620 ms

This crash seems to be isolated to blas as I created 100k embeddings with pgml.embed() in a mere 7m 50s, using 4 parallel workers, even. So that part is good.

I had a thought that rebooting the computer might help since I had just stressed the GPU making all those embeddings, but naw, that didn't change anything.

One theory: since pgml links to so many libraries (directly and indirectly), maybe there's some kind of symbol resolution problem and the wrong symbols are being called. Just a theory.

@thomcc might be able to offer some help with this if it's some kind of linking problem? Offering up his services as PostgresML's success is pgrx's success!
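For anyone triaging this, the value pgml.cosine_similarity should produce is easy to sanity-check outside the extension with a few lines of NumPy (a standalone sketch, not pgml code; the function name here is my own):

```python
import numpy as np

def cosine_similarity(a, b):
    # Same formula the extension computes: dot(a, b) / (|a| * |b|)
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 0, 0], [1, 0, 0]))  # identical vectors -> 1.0
```

Comparing this against the extension's output on a small vector pair would confirm whether only the BLAS-linked path is broken.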

Input data structure and recursion error

First of all, I would like to preface that I have absolutely no experience, so my apologies in advance for the naivete.

I am seeing an error when attempting to train a model:

Query 1 ERROR: ERROR:  RecursionError: maximum recursion depth exceeded in comparison
CONTEXT:  Traceback (most recent call last):
  PL/Python function "train", line 4, in <module>
    status = train(
  PL/Python function "train", line 794, in train
  PL/Python function "train", line 650, in fit
  PL/Python function "train", line 363, in data
  PL/Python function "train", line 32, in flatten
  PL/Python function "train", line 33, in flatten
  PL/Python function "train", line 33, in flatten
  PL/Python function "train", line 33, in flatten
  PL/Python function "train", line 33, in flatten
  PL/Python function "train", line 33, in flatten
  PL/Python function "train", line 33, in flatten

Data

My database view returns two columns

"{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...}",1

Here vector is an integer[] column containing 1200 values and result is an integer.

I am seeing the error when I call the following:

SELECT * FROM pgml.train(
  'Project - category:build',
  'classification',
  'category_build',
  'result'
)

Am I structuring my data incorrectly? Can I use an integer array to provide the training data?

Thank you very much for your patience and assistance!
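For context, a recursive flatten like the one in the traceback can blow past Python's default recursion limit (around 1000 frames) on long or deeply nested input; an iterative version sidesteps that entirely. A sketch (this flatten is an illustration, not pgml's actual code):

```python
def flatten(values):
    # Iterative flatten using an explicit stack of iterators,
    # so no RecursionError regardless of nesting depth.
    out, stack = [], [iter(values)]
    while stack:
        for item in stack[-1]:
            if isinstance(item, (list, tuple)):
                stack.append(iter(item))
                break
            out.append(item)
        else:
            stack.pop()
    return out

print(flatten([[1, 2], [3, [4, 5]]]))  # -> [1, 2, 3, 4, 5]
```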

Handling categorical variables

SELECT * FROM pgml.train(
'pk20102022',
'classification',
'viewdata',
'inferencia',
'xgboost');

INFO: Snapshotting table "viewdata", this may take a little while...
ERROR: unhandled type: text for inferencia
CONTEXT: src/orm/snapshot.rs:288:25
SQL state: XX000

'inferencia' can be: reliable, neutral, or possible.

Any ideas?

TNX !!
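One workaround, until text labels are handled natively, is to map the labels to integers in a view and train on that. A sketch (the feature column names are placeholders; only viewdata and inferencia come from the issue):

```sql
CREATE VIEW viewdata_encoded AS
SELECT
    feature_1, feature_2,   -- replace with the actual feature columns
    CASE inferencia
        WHEN 'reliable' THEN 0
        WHEN 'neutral'  THEN 1
        WHEN 'possible' THEN 2
    END AS inferencia
FROM viewdata;

SELECT * FROM pgml.train(
    'pk20102022',
    'classification',
    'viewdata_encoded',
    'inferencia',
    'xgboost'
);
```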

consider a client connection as an alternative to being an extension

Being a Python-linking extension introduces some interesting problems, such as Python versions changing with OS upgrades.

Perhaps allowing operation as a client -- with a suggestion to run it adjacent to the database instance, via a container -- would improve adoption.

[BUG] Unable to see snapshots generated from views

I created a view on some data, such as:

CREATE VIEW iris_view AS SELECT * FROM pgml.iris ORDER BY random() LIMIT 100;

and then trained a model with the following command:

SELECT * FROM pgml.train('Iris Classifier', 'classification', 'iris_view', 'target');

When trying to access the UI /snapshots, I got a ProgrammingError:

relation "iris_view" does not exist
LINE 1: SELECT pg_size_pretty(pg_total_relation_size('iris_view'))

(but when I execute the size query in psql it doesn't throw an error; it just returns nothing.) I have attached a screenshot of the error.


How can I set filters on the dataset table used for training?

Since we use SQL tables to fetch training data: is there support for training on only the last two days of data and ignoring older rows?
One way I can imagine doing this is with a SQL view, but that needs extra lines; built-in support in the train call would be great.
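For reference, the view approach is only a few lines. A sketch (table, column, and project names are made up):

```sql
CREATE VIEW recent_training_data AS
SELECT *
FROM my_training_table                           -- hypothetical table
WHERE created_at >= now() - interval '2 days';   -- hypothetical timestamp column

SELECT * FROM pgml.train(
    'My Project',
    'classification',
    'recent_training_data',
    'label'                                      -- hypothetical label column
);
```

Since pgml snapshots the relation at train time, each pgml.train call re-evaluates the view, so retraining naturally picks up only the freshest rows.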

docker-compose up fails - Dockerfile.local scikit-learn

docker-compose up fails: line 14 of pgml-extension's Dockerfile.local runs pip install sklearn, and that package name is deprecated. It should be changed to:
RUN pip3 install xgboost scikit-learn diptest torch lightgbm transformers datasets sentencepiece sacremoses sacrebleu rouge

Add authentication support to the dashboard

Since the application ships without authentication, we tried deploying it behind Nginx and Traefik with basic auth. With basic auth enabled, the Next button does nothing when creating a new project; with basic auth disabled in both Nginx and Traefik, the Next button works. Attaching the console errors for reference.

  1. Failed to load resource: the server responded with a status of 403 () new-project.js:83
  2. Uncaught (in promise) TypeError: Cannot read properties of undefined (reading 'length') at new-project.js:83:26


It would be great if you could advise on how we can secure postgresml with an authentication service.

Models view in dashboard throwing error

When visiting the following URL http://localhost:8000/models/81 I am receiving the following error. This is happening on all models show pages.

KeyError at /models/81
KeyError at 'image_p50'

Request URL: http://localhost:8000/models/81
Django version: 4.0.4
Exception type: KeyError
Exception value: 'image_p50'
Exception location: /app/app/views/models.py, line 35, in <dictcomp>
Python executable: /usr/local/bin/python3
Python version: 3.10.4
Python path: ['/app', '/usr/local/lib/python310.zip', '/usr/local/lib/python3.10', '/usr/local/lib/python3.10/lib-dynload', '/usr/local/lib/python3.10/site-packages', '/app/..']

Switch to unix line endings?

GitHub recommends always using \n as the newline character in git-managed repos.

Windows line endings make it difficult to collaborate with folks using unix, and result in the docker-compose entrypoint scripts being parsed incorrectly on linux:

postgresml-admin-1     | /app/docker/entrypoint.sh: line 8: $'\r': command not found
postgresml-admin-1     | /app/docker/entrypoint.sh: line 23: syntax error: unexpected end of file

This should be an easy fix. I haven't created a PR, because it would involve changing a large number of files...
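For what it's worth, the standard fix is a .gitattributes rule plus a one-time renormalize. Sketched below in a throwaway repo (in postgresml you would run the same steps at the real repo root):

```shell
# Demo in a throwaway repo; run the same steps at the real repo root.
git init -q lf-demo && cd lf-demo
git config user.email demo@example.com
git config user.name demo

# Tell git to normalize all text files to LF in the repository
printf '* text=auto eol=lf\n' > .gitattributes

# Stage the rule, re-run the clean filter over tracked files, commit
git add .gitattributes
git add --renormalize .
git commit -qm "Normalize line endings to LF"
```

After that, a single commit touches every affected file once and the problem stays fixed for all contributors.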

refit_final_model_on_all_data param for cross validation

We already have k-fold cross validation (although it's not well documented; it's the folds argument to train). We could add another param, refit_final_model_on_all_data, defaulting to true, since if you are already cross validating, one more training run is only incrementally more expensive. I'm open to a more concise name for that param.
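For reference, based on the description above, k-fold usage today would look something like this (a sketch: relation, column, and project names are made up, and the exact parameter spelling should be checked against the docs):

```sql
SELECT * FROM pgml.train(
    'My Project',          -- hypothetical project name
    'regression',
    'my_table',            -- hypothetical relation
    'target',              -- hypothetical label column
    algorithm => 'xgboost',
    folds => 5             -- 5-fold cross validation
);
```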

Pull Request Preview Environments for increasing maintainer productivity

I would like to make life easier for PostgresML maintainers by implementing Uffizzi preview environments.
Disclaimer: I work on Uffizzi.

Uffizzi is an open source full-stack previews engine, and our platform is available completely free for PostgresML (and all open source projects). This will provide maintainers with preview environments of their PRs in the cloud, allowing them to iterate faster and reduce time to merge.

Uffizzi is purpose-built for the task of previewing PRs and it integrates with your workflow to deploy preview environments in the background without any manual steps for maintainers or contributors.

TODO:

  • Initial PoC

.env.docker overrides settings of .env

Hi,

Thank you for the project and your effort.

Issue: I was experimenting with deploying PostgresML on a workstation and accessing the dashboard from another machine. I bumped into the issue where I couldn't override DJANGO_ALLOWED_HOSTS. I tried to set environment variables through docker-compose.yml (see example below) and through .env file, but according to Django's debug/error page, DJANGO_ALLOWED_HOSTS (ALLOWED_HOSTS after parsing) was unchanged.

  dashboard:
    depends_on:
      - postgres
    build:
      context: ./pgml-dashboard/
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    environment:
      - DJANGO_ALLOWED_HOSTS
      - DJANGO_CSRF_TRUSTED_ORIGINS
    command:
      - python3
      - manage.py
      - runserver
      - 0.0.0.0:8000

Workaround: After digging into the code, I noticed that the file ./pgml-dashboard/docker/.env.docker overrides everything I define in environment (in docker-compose.yml) and in .env at the root of ./pgml-dashboard. Commenting out DJANGO_ALLOWED_HOSTS there solves the issue.

I see two solutions to the problem:

  1. Remove the ./pgml-dashboard/docker/.env.docker file and introduce a "global" dotenv at the root of the project. This way, environment variables would be passed from .env or the compose file into the containers. In docker-compose you can define fallback values for variables that aren't set anywhere else.
  2. Add documentation to README describing where to change variables for the dashboard.
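As an illustration of option 1, compose supports host-environment passthrough with fallback defaults (a sketch, not the current setup; the default values here are made up):

```yaml
# docker-compose.yml (fragment): values come from the shell or a root .env,
# with a fallback default if unset.
services:
  dashboard:
    environment:
      - DJANGO_ALLOWED_HOSTS=${DJANGO_ALLOWED_HOSTS:-localhost}
      - DJANGO_CSRF_TRUSTED_ORIGINS=${DJANGO_CSRF_TRUSTED_ORIGINS:-http://localhost:8000}
```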

Kind regards,
Gregor

Unable to create new project

I encountered a "TypeError at /api/tables/columns/" error while setting up a new project. I've attached a screenshot of the error; kindly have a look.


I connected the pgml dashboard to my local Postgres database, and I updated the allowed-hosts entries in the .env and settings.py files under the pgml-dashboard directory.

Kindly let me know the best way to handle this error.
