ytsaurus / ytsaurus

YTsaurus is a scalable and fault-tolerant open-source big data platform.

Home Page: https://ytsaurus.tech

License: Other

Makefile 0.31% Python 16.54% C++ 45.42% CMake 0.47% Go 1.51% NASL 0.05% Shell 0.05% C 27.82% Jinja 0.01% Cython 0.75% Scala 5.12% Dockerfile 0.01% Meson 0.01% Fortran 0.01% Assembly 1.06% Smarty 0.01% Java 0.87% Roff 0.01% POV-Ray SDL 0.01% Lua 0.01%
big-data clickhouse distributed-database lakehouse olap-database spark sql ytsaurus

ytsaurus's Introduction



YTsaurus

Website | Documentation | YouTube

YTsaurus is a distributed storage and processing platform for big data with support for the MapReduce model, a distributed file system, and a NoSQL key-value database.

You can read the post about YTsaurus or watch the video:

video about YTsaurus

Advantages of the platform

Multitenant ecosystem

  • A set of interrelated subsystems: MapReduce, an SQL query engine, a job scheduler, and a key-value store for OLTP workloads.
  • Support for a large number of users, which eliminates the need for multiple installations and streamlines hardware usage.

Reliability and stability

  • No single point of failure
  • Automated replication between servers
  • Updates with no loss of computing progress

Scalability

  • Up to 1 million CPU cores and thousands of GPUs
  • Exabytes of data on different media: HDD, SSD, NVMe, RAM
  • Tens of thousands of nodes
  • Automated server up- and down-scaling

Rich functionality

  • Expansive MapReduce module
  • Distributed ACID transactions
  • A variety of SDKs and APIs
  • Secure isolation for compute resources and storage
  • User-friendly and easy-to-use UI

CHYT powered by ClickHouse®

  • A well-known SQL dialect and familiar functionality
  • Fast analytic queries
  • Integration with popular BI solutions via JDBC and ODBC

SPYT powered by Apache Spark

  • A set of popular tools for writing ETL processes
  • Launch and support for multiple mini SPYT clusters
  • Easy migration for ready-made solutions

Getting Started

Try a YTsaurus cluster using Kubernetes or try our online demo.

How to Build from Source Code

How to Contribute

We are glad to welcome new contributors!

  1. Please read the contributor's guide and the styleguide.
  2. We can accept your work into YTsaurus after you have signed the contributor's license agreement (CLA).
  3. Please don't forget to add a note to your pull request stating that you agree to the terms of the CLA.

ytsaurus's People

Contributors

abelousova, akashin, aleksandrazh, alexeylukyanchikov, alexvsalexvsalex, capone212, dim-an, dimaskovas, don-dron, eshcherbin, georgthegreat, gritukan, ifsmirnov, k-pogorelov, kontakter, kvk1920, maxim-babenko, pankdm, pavelostyakov, psushin, raundn, renadeen, resetius, robot-piglet, sandello, savnadya, savrus, slon, vysotskylev, zlobober


ytsaurus's Issues

Evaluate button in YQL UI

There is currently no way to check whether a query will compile other than by running it. This makes the development process unnecessarily complex and risky.

Could you please add a separate button that compiles the query (checks that the locks are OK, etc.) without starting it?

[Feature] Store yql query id to all output tables

It would be helpful to have connections between artefacts (tables) and the processes that created them.
YQL queries have ids, which could be used to make such a connection.
They could be saved in the table's attributes, as in the sketch below.
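
A minimal sketch of what setting such an attribute by hand could look like with the Python wrapper; the attribute name _yql_query_id, the path, and the query id are placeholders of mine, not an existing convention:

import yt.wrapper as yt

# Hypothetical convention: record the id of the YQL query that produced a table
# as a custom attribute on the output table.
query_id = "abc123de-00000000-11111111-22222222"  # placeholder query id
yt.set("//home/project/output_table/@_yql_query_id", query_id)

# Later, the attribute can be read back to trace the table to its query.
print(yt.get("//home/project/output_table/@_yql_query_id"))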

Support building core/other libraries separately

E.g.:

  • async & multithreading
  • http clients/servers
  • rpc stack
  • disk primitives (incl. io_uring)
  • profiling
  • tracing
  • logging subsystem

Make them separate cmake targets? Or extract them into a separate repository and make it a submodule?

"Conflicting Solomon tags" alert on newly setup cluster

A fresh cluster setup shows a "Conflicting Solomon tags" alert on a tablet node.
This happens regardless of whether k8s or minikube is used, with any image (latest, latest-yql, unstable-0.0.1), and with any number of tablet nodes (1 or 3).

Steps to reproduce:

Alert
Conflicting Solomon tags[1]
Attributes
{
    "cellar_type": "tablet",
    "datetime": "2023-05-07T16:45:10.498439Z",
    "fid": 18446446897001216000,
    "host": "tnd-0.tablet-nodes.ytsaurus.svc.cluster.local",
    "pid": 1,
    "tags": [
        "default",
        "sys"
    ],
    "tid": 9238167990267787000
}

YQL fails hard with 1000+ tables in range

When attempting to run a query over 1000+ tables, the query fails with an error:
Operation has failed to initialize"; ... "inner_errors"=[{"code"=1;"message"="Too many input tables: maximum allowed 1000, actual 2347

Expected behavior:
the tables are pre-merged into anonymous tables in groups of 1000 and then fed into the query (a manual version of this is sketched below). The documentation suggests that this should be the behavior (with loss of the TablePath() attribute):
https://ytsaurus.tech/docs/en/yql/builtins/basic#tablepath
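
Until such pre-merging happens automatically, a manual version along those lines might look roughly like this (a sketch using the Python wrapper; the //home/logs directory and //tmp destinations are placeholders):

import yt.wrapper as yt

# Sketch of a manual workaround: merge the inputs in groups of at most 1000
# tables, then point the query at the pre-merged tables instead of the full range.
tables = ["//home/logs/" + name for name in yt.list("//home/logs")]
premerged = []
for i in range(0, len(tables), 1000):
    dst = "//tmp/premerged_{}".format(i // 1000)
    yt.run_merge(tables[i:i + 1000], dst)
    premerged.append(dst)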

Add riscv64 architecture support

The riscv64 arch is gaining popularity, so it would be nice to have support for it.
Once ARM support (#4, which is obviously more important) is added, most of the multiarch infrastructure should be ready, and adding another arch will be simpler.

I was able to build clickhouse for riscv64: ClickHouse/ClickHouse#40141

Looks like it will require clang 16.

Documentation: lack of important documentation in multiple areas

After reading the documentation, I've noticed that several important areas are missing:

  • Is there any kind of multi-cluster support? E.g. cross-cluster replication, stand-by clusters.
  • Supported hardware architecture/OS combinations. From the Install page it's not clear which hardware architectures (like x86-64, ARM, etc.) are supported on which operating systems. If there are special requirements for instruction sets (like SSE or AVX), document them too. This information should be available in the official documentation, not only in https://github.com/ytsaurus/ytsaurus/blob/main/BUILD.md
  • Are there any recommendations regarding setup in cloud environments (like AWS/Azure/GCP/Yandex Cloud)? E.g. reference architectures, recommended hardware (e.g. recommended AWS EC2 machine type, disks, etc.), maybe even ready-to-use Terraform scripts? What about reference deployment architectures for on-premise installations?
  • How can I install a highly available (HA) cluster? Are there any restrictions/recommendations regarding network latency between nodes? Recommendations for clusters spanning multiple availability zones would also be useful.
  • How to upgrade/downgrade YTsaurus? Does it support zero-downtime upgrades and downgrades? What about backward/forward compatibility between releases - what is the current policy?
  • How to backup and restore YTsaurus? Are there any built-in integrity checks for the backup?
  • How to monitor YTsaurus? Does it support any kind of integrated monitoring (like Prometheus endpoint, statsd integration, etc.)? If yes, how to configure it, and which metrics are supported?
  • It would be great if you could publish and maintain a public roadmap for the product.
  • Is there any built-in benchmark utility like the one in YDB (https://ydb.tech/en/docs/development/load-actors-overview)? It would be useful for benchmarking, choosing the proper cluster size, and performing PGO optimizations.
  • Publicly available benchmarks (like https://benchmark.clickhouse.com/) would also be nice to have.
  • Do you perform Jepsen-like tests? :)

I think this list could be transformed into a documentation epic and resolved step by step.

Thanks in advance!

Failed YQL query to cluster with '-' in the name

A YQL query from the UI to a cluster with '-' in its name fails:

Error
Query 81058aa1-376b84e7-5dec3d89-c73ee972 failed[1]
YQL embedded call failed[1]
yql/library/embedded/yql_embedded.cpp:429: Failed to parse SQL: -memory-:<main>: Error: Parse Sql

    -memory-:<main>:3:5: Error: Unexpected token 'cloud' : cannot match to any predicted input...

    	from cloud-yt1.`//home/demo/chyt/uk_price_paid`;
	    ^
    -memory-:<main>:3:10: Error: Unexpected token '-' : cannot match to any predicted input...

    	from cloud-yt1.`//home/demo/chyt/uk_price_paid`;
	         ^

If this is the expected behavior and such naming is forbidden, please describe it in the documentation.

YQL queries not reading from a table fail in the UI

Both queries are OK according to the documentation but produce an error in the UI. Particularly problematic is the case where a query writing from one table to another executes successfully (the table is modified), but the UI result is still "Error".
(Screenshots attached: 2023-07-03 113840, 2023-07-03 095914)

Styleguide links broken/missing

I was unable to find a general style guide, and the references found in util/README.md point to y-t.ru and are thus inaccessible.

Is it possible to have a style guide included in the repo?

Query Tracker Fails to Update Query State Due to uint32 Range Issue

The Query Tracker is experiencing a problem where it fails to correctly update the state of a query. The underlying issue appears to be a "Value is out of range to fit into uint32" error.

Details of the error message:

2023-07-29 18:12:40,637961 E QueryHandler Failed to write query state, backing off (QueryId: b9f81cb4-697f258f-a2ff28f2-4131ab67, Engine: Yql)\nValue is out of range to fit into uint32\n origin qt-0.query-trackers.yt.svc.cluster.local (pid 1, thread Control, fid fffef47e368c12c8)\n datetime 2023-07-29T18:12:40.637648Z\n Control fffef47e368c12c8 2a276c92-c52337af-5a72971c-c2600800

Steps to Reproduce:

Run the following YQL query:

use main;

select
    received_at,
    Yson::YPath(`json`, "/data/E") as E
from (
    select * from range(`//home/table1`)
    union all
    select * from range(`//home/table2`)
)
limit 10

Expected Result: The Query Tracker should be able to accurately update the state of each query.

Actual Result: The Query Tracker fails to update the state of a query due to a Value is out of range to fit into uint32 error.

Please look into this issue at your earliest convenience and let us know if you need any further information to aid your investigation. Thank you.

Missing repositories to build SPYT/Invalid file path

Can't build ytsaurus-spyt because specific files are missing from the referenced repositories:
https://repo1.maven.org/maven2/tech/ytsaurus/ - no spark directory at all
https://s01.oss.sonatype.org/content/repositories/snapshots/tech/ytsaurus/spark/ - no files for v.1.69 (e.g. https://s01.oss.sonatype.org/content/repositories/snapshots/tech/ytsaurus/spark/spark-sql_2.12/1.69.0/spark-sql_2.12-1.69.0.pom)

Also, sbt uses invalid paths to find spark-catalyst: ${fork.version} is not resolved to the real value:

[error] (file-system / update) sbt.librarymanagement.ResolveException: Error downloading tech.ytsaurus.spark:spark-catalyst_2.12:${fork.version}
[error]   Not found
[error]   Not found
[error]   not found: /home/oiartemeva/.ivy2/localtech.ytsaurus.spark/spark-catalyst_2.12/${fork.version}/ivys/ivy.xml
[error]   not found: /home/oiartemeva/.m2/repository/tech/ytsaurus/spark/spark-catalyst_2.12/${fork.version}/spark-catalyst_2.12-${fork.version}.pom
[error]   not found: https://repo1.maven.org/maven2/tech/ytsaurus/spark/spark-catalyst_2.12/${fork.version}/spark-catalyst_2.12-${fork.version}.pom
[error]   not found: https://s01.oss.sonatype.org/content/repositories/snapshots/tech/ytsaurus/spark/spark-catalyst_2.12/${fork.version}/spark-catalyst_2.12-${fork.version}.pom

Typos in user_job_memory_digest_* parameters

In

* `user_job_memory-digest_default_value`: Initial assumption for selecting the memory reserve (the default value is 0.5).
* `user_job_memory-digest_lower_bound`: The limit below which the reserve must not fall (the default value is 0.05). We do not recommend changing the default value.
* `memory_reserve_factor`: The alias for the `user_job_memory-digest_lower_bound` and `user_job_memory-digest_default_value` options concurrently. Using this option is not recommended.

and in the Russian version of the same page (translated):

* `user_job_memory-digest_default_value` - the initial assumption for selecting the memory reserve (default 0.5).
* `user_job_memory-digest_lower_bound` - the bound below which the reserve must not fall (default 0.05). Changing the default value is not recommended.
* `memory_reserve_factor` - an alias for the `user_job_memory-digest_lower_bound` and `user_job_memory-digest_default_value` options simultaneously. Using this option is not recommended.

the parameters user_job_memory_digest_default_value and user_job_memory_digest_lower_bound are written with a typo: `-` instead of `_`. I stumbled across this issue in my work, so I would like to fix it.
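
For reference, here is a sketch of passing the correctly spelled options through an operation spec with the Python wrapper; the placement under the mapper section and the table paths are my assumptions, not taken from the documentation quoted above:

import yt.wrapper as yt

# Sketch: correctly spelled options (underscores, not hyphens) in an operation spec.
yt.run_map(
    "cat",
    "//tmp/input",
    "//tmp/output",
    spec={"mapper": {
        "user_job_memory_digest_default_value": 0.5,
        "user_job_memory_digest_lower_bound": 0.05,
    }},
)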

Suggestion for .clang-format and .clang-tidy Configuration Files

Hello,

I noticed the repository lacks .clang-format and .clang-tidy configuration files, which are crucial for maintaining a consistent code style and quality.

I suggest:

  1. Adding a .clang-format file at the project's root for a uniform coding style.
  2. Implementing a .clang-tidy file for automated code checks and to uphold code quality.

These additions would increase code readability, simplify the contribution process, and ensure high-quality code.

Please consider this proposal.

Add ARM support

Hi.

ARM is becoming a more and more popular architecture for servers (AWS with Graviton, GCP with Ampere (AFAIK), other cloud providers) and even for local development (Mac M1 and some of the newest Lenovo laptops).

It would be great to see ARM support in the project.

Old SPYT version in jupyter-tutorial docker image

Please update the ytsaurus/jupyter-tutorial docker image: change the bundled ytsaurus-spyt version from 1.67.0 to 1.69.0.
Only spyt v1.69 is available for pip installation. The update would allow interacting with the same cluster both from the training jupyter notebook and from the CLI/API.

Support tags with digit as first symbol in BooleanFormula

Sometimes we need to launch jobs on particular nodes of our YT cluster in order to test new YT versions, infrastructure changes, and so on.
Using --spec with a scheduling_tag_filter by node is a good fit for this kind of exercise.
A major part of our YT infrastructure uses domain names starting with a digit, so we have 1.exec.node.fqdn, 2.exec.node.fqdn, and so on in //sys/cluster_nodes/<node>/@tags.
Unfortunately, the ytsaurus scheduling_tag_filter doesn't support such tags, failing with:

Error while parsing boolean formula:
2.exec.node.fqdn:9012
 ^

I've tested the filter with different values and recorded the results in the table below:

filter: --spec '{"scheduling_tag_filter"="a.1.exec.node.fqdn:9012"}'
error:  none

filter: --spec '{"scheduling_tag_filter"="exec.node.fqdn:9012"}'
error:  none

filter: --spec '{"scheduling_tag_filter"="1.exec.node.fqdn:9012"}'
error:  Error while parsing boolean formula:
        2.exec.node.fqdn:9012
         ^
        Unexpected character

filter: --spec '{"scheduling_tag_filter"="127.0.0.1"}'
error:  Error while parsing boolean formula:
        127.0.0.1:9012
             ^
        Unexpected character

If exec-node assigns these values automatically, should such tags be supported in YTsaurus?

[Feature] Add labels for storage classes and chunk distribution rules based on them

YT has a "media" storage class abstraction, which allows setting a storage class in Cypress.
The proposed feature is about extending the storage mechanism to support more advanced chunk distribution scenarios.

Idea

  1. Add an additional abstraction, a "label", which can be associated with each "media" on the node level and represents more details about it, like redundancy or performance.
  2. Add the possibility to set a chunk distribution policy based on labels.
    For example: we have 3 nodes for blobs, each of which has a single "ssd_blobs" media. Also, we have a table /my/mytable with replication_factor=3.
    On the first node we can set the label "type=redundant", on the other 2 - "type=ephemeral".
    After that, we can set a distribution policy for the table /my/mytable which ensures we always have one replica on "type=redundant" media and all others on "type=ephemeral" media.
  3. Optional: make read priority depend on labels.

Motivation
In clouds you can have expensive but redundant and relatively slow network volumes, and fast and cheap but non-redundant local SSDs. Enabling chunk distribution between them can speed up operations and at the same time reduce the cost of ownership: you can make sure you have at least one copy on the attached storage (which has its own underlying replication) even if all nodes go down and are deleted, while storing the bulk of the data on local SSDs, which are fast and relatively cheap for day-to-day operations. As long as everything is fine, the two replicas on local (ephemeral) SSDs can even be used to replace or restore the redundant copy on a network disk.

Error based on deployment

I use a single server for deployment. Here are the steps I used.
Server information:
CentOS 7.9
RAM: 16 GB
CPU: 8 cores
Free disk space: 200 GB

git clone https://github.com/ytsaurus/ytsaurus.git
cd ytsaurus/yt/docker/local
./run_local_cluster.sh

The containers start successfully, but opening the web page at ip:8001 produces an error.

The following is the error message

Error
Request failed with status code 500
Error

Attributes

{
"message": "{"message":"timeout of 5000ms exceeded","code":0}"
}

Error
Oops! something went wrong. If the problem persists please report it via Bug Reporter.
{"message":"timeout of 5000ms exceeded","code":0}

What's more, the yt.backend container exits inexplicably a while after the cluster is started. What is the reason for this?

Please advise how to solve this; I look forward to your reply.

Images latest-yql and unstable-0.0.1 missing SPYT python package

The K8s job yt-spyt-init-job-spyt-environment fails with the following images, because /opt/conda/lib/python3.7/site-packages/spyt is missing:

  • ytsaurus/ytsaurus:unstable-0.0.1
  • ytsaurus/ytsaurus:latest-yql

Steps to reproduce:

  1. helm pull oci://docker.io/ytsaurus/ytop-chart --version 0.1.7 --untar
  2. kubectl create namespace ytsaurus
  3. kubectl create secret generic ytadminsec --from-literal=login=admin --from-literal=password=somepass --from-literal=token=somepass -n ytsaurus
  4. helm install ytsaurus ytop-chart/ -n ytsaurus
  5. wget https://raw.githubusercontent.com/ytsaurus/yt-k8s-operator/main/config/samples/cluster_v1_demo.yaml
  6. sed -i s#ytsaurus/ytsaurus:unstable-0.0.1#ytsaurus/ytsaurus:latest-yql# cluster_v1_demo.yaml
  7. kubectl apply -f cluster_v1_demo.yaml -n ytsaurus

Container log:

$ kubectl logs -n ytsaurus yt-spyt-init-job-spyt-environment-x2g2g
++ export YT_DRIVER_CONFIG_PATH=/config/client.yson
++ YT_DRIVER_CONFIG_PATH=/config/client.yson
++ /usr/bin/yt create document //home/spark/conf/global --ignore-existing -r
1-365-101a5-5c56f46a
++ /usr/bin/yt set //home/spark/conf/global '{
                "latest_spark_cluster_version" = "1.67.0";
                "operation_spec" = {
                        "job_cpu_monitor" = {
                                "enable_cpu_reclaim" = "false"
                        }
                };
                "python_cluster_paths" = {
                        "3.7" = "/opt/conda/bin/python3.7";
                };
                "layer_paths" = [
                ];
                "worker_num_limit" = 1000;
                "environment" = {
                        "ARROW_ENABLE_UNSAFE_MEMORY_ACCESS" = "true";
                        "YT_ALLOW_HTTP_REQUESTS_TO_YT_FROM_JOB" = "1";
                        "JAVA_HOME" = "/usr/bin/java";
                        "ARROW_ENABLE_NULL_CHECK_FOR_GET" = "false";
                        "IS_SPARK_CLUSTER" = "true";
                };
                "spark_conf" = {
                        "spark.yt.log.enabled" = "false";
                        "spark.hadoop.yt.proxyRole" = "spark";
                        "spark.datasource.yt.recursiveFileLookup" = "true";
                };
        }'
++ /usr/bin/yt create file //home/spark/conf/releases/1.67.0/metrics.properties --ignore-existing -r
1-379-10190-4e84fde2
++ /usr/bin/yt set //home/spark/conf/releases/1.67.0/metrics.properties/@replication_factor 1
++ cat /usr/bin/metrics.properties
++ /usr/bin/yt write-file //home/spark/conf/releases/1.67.0/metrics.properties
++ /usr/bin/yt create file //home/spark/conf/releases/1.67.0/solomon-agent.template.conf --ignore-existing
1-a25-10190-a8fd9d7a
++ /usr/bin/yt set //home/spark/conf/releases/1.67.0/solomon-agent.template.conf/@replication_factor 1
++ cat /usr/bin/solomon-agent.template.conf
++ /usr/bin/yt upload //home/spark/conf/releases/1.67.0/solomon-agent.template.conf
++ /usr/bin/yt create file //home/spark/conf/releases/1.67.0/solomon-service-master.template.conf --ignore-existing
1-a87-10190-d40882a9
++ /usr/bin/yt set //home/spark/conf/releases/1.67.0/solomon-service-master.template.conf/@replication_factor 1
++ cat /usr/bin/solomon-service-master.template.conf
++ /usr/bin/yt upload //home/spark/conf/releases/1.67.0/solomon-service-master.template.conf
++ /usr/bin/yt create file //home/spark/conf/releases/1.67.0/solomon-service-worker.template.conf --ignore-existing
1-add-10190-6522b17f
++ /usr/bin/yt set //home/spark/conf/releases/1.67.0/solomon-service-worker.template.conf/@replication_factor 1
++ cat /usr/bin/solomon-service-worker.template.conf
++ /usr/bin/yt upload //home/spark/conf/releases/1.67.0/solomon-service-worker.template.conf
++ /usr/bin/yt create document //home/spark/conf/releases/1.67.0/spark-launch-conf --ignore-existing
1-b2f-101a5-409bb92b
++ /usr/bin/yt set //home/spark/conf/releases/1.67.0/spark-launch-conf '{
                "layer_paths" = [
                ];
                "spark_yt_base_path" = "//home/spark/bin/releases/1.67.0";
                "file_paths" = [
                        "//home/spark/bin/releases/1.67.0/spark.tgz";
                        "//home/spark/bin/releases/1.67.0/spark-yt-launcher.jar";
                        "//home/spark/conf/releases/1.67.0/metrics.properties";
                        "//home/spark/conf/releases/1.67.0/solomon-agent.template.conf";
                "//home/spark/conf/releases/1.67.0/solomon-service-master.template.conf";
                "//home/spark/conf/releases/1.67.0/solomon-service-worker.template.conf";
                ];
                "spark_conf" = {
                        "spark.yt.jarCaching" = "true";
                };
                "environment" = {};
                "enablers" = {
                        "enable_byop" = %false;
                        "enable_arrow" = %true;
                        "enable_mtn" = %true;
                };
        }'
++ /usr/bin/yt create file //home/spark/spark/releases/3.2.2-fork-1.67.0/spark.tgz --ignore-existing -r
1-b51-10190-9f99324f
++ /usr/bin/yt set //home/spark/spark/releases/3.2.2-fork-1.67.0/spark.tgz/@replication_factor 1
++ cat /usr/bin/spark.tgz
++ /usr/bin/yt write-file //home/spark/spark/releases/3.2.2-fork-1.67.0/spark.tgz
++ /usr/bin/yt create file //home/spark/spyt/releases/1.67.3/spark-yt-data-source.jar --ignore-existing -r
1-c24-10190-ee3169f9
++ /usr/bin/yt set //home/spark/spyt/releases/1.67.3/spark-yt-data-source.jar/@replication_factor 1
++ cat /usr/bin/spark-yt-data-source.jar
++ /usr/bin/yt upload //home/spark/spyt/releases/1.67.3/spark-yt-data-source.jar
++ /usr/bin/yt create file //home/spark/spyt/releases/1.67.3/spyt.zip --ignore-existing
1-c78-10190-d2631a4f
++ /usr/bin/yt set //home/spark/spyt/releases/1.67.3/spyt.zip/@replication_factor 1
++ cat /usr/bin/spyt.zip
++ /usr/bin/yt upload //home/spark/spyt/releases/1.67.3/spyt.zip
++ /usr/bin/yt create map_node //home/spark/bin/releases/1.67.0 --ignore-existing -r
1-cc9-1012f-22cba307
++ /usr/bin/yt create file //home/spark/bin/releases/1.67.0/spark-yt-launcher.jar --ignore-existing -r
1-cd9-10190-3ef58cda
++ /usr/bin/yt set //home/spark/bin/releases/1.67.0/spark-yt-launcher.jar/@replication_factor 1
++ cat /usr/bin/spark-yt-launcher.jar
++ /usr/bin/yt upload //home/spark/bin/releases/1.67.0/spark-yt-launcher.jar
++ /usr/bin/yt link //home/spark/spark/releases/3.2.2-fork-1.67.0/spark.tgz //home/spark/bin/releases/1.67.0/spark.tgz -f
++ /usr/bin/yt link //home //sys/spark
++ /usr/bin/yt set //home/spark/conf/global/ytserver_proxy_path '"//sys/spark"'
++ sed -i 's/\"27099\"/\"27099\"\n    environment[\"SPARK_YT_SOLOMON_ENABLED\"] = \"false\"/' /opt/conda/lib/python3.7/site-packages/spyt/standalone.py
sed: can't read /opt/conda/lib/python3.7/site-packages/spyt/standalone.py: No such file or directory

Additional info:

$ docker run -it ytsaurus/ytsaurus:latest ls -l /opt/conda/lib/python3.7/site-packages/spyt/standalone.py
-rw-r--r-- 1 root users 33867 Mar 19 18:58 /opt/conda/lib/python3.7/site-packages/spyt/standalone.py

$ docker run -it ytsaurus/ytsaurus:latest-yql ls -l /opt/conda/lib/python3.7/site-packages/spyt/standalone.py
ls: cannot access '/opt/conda/lib/python3.7/site-packages/spyt/standalone.py': No such file or directory

$ docker run -it ytsaurus/ytsaurus:unstable-0.0.1 ls -l /opt/conda/lib/python3.7/site-packages/spyt/standalone.py
ls: cannot access '/opt/conda/lib/python3.7/site-packages/spyt/standalone.py': No such file or directory

Spark array of decimals does not work

Code to reproduce

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.types import ArrayType, DecimalType, LongType, StructType, StructField, StringType
import pyspark.sql.functions as F
from pathlib import Path
import os
import spyt

spark = spyt.connect(discovery_path='//home/spyt-discovery-external', dynamic_allocation=False,
                     num_executors=1, cores_per_executor=1, executor_memory_per_core="500M",
                     driver_memory="500M")
spark.conf.set("spark.yt.read.parsingTypeV3.enabled", "true")
spark.conf.set("spark.yt.write.writingTypeV3.enabled", "true")

data = [
    (["1.234", "5.678", "9.1011"],),
    (["12.13", "14.15", "16.17"],),
    (["21.22", "23.24", "25.26"],),
]

schema = StructType([
    StructField("array_of_strings", ArrayType(StringType()), True),
])

df = spark.createDataFrame(data, schema)

df = df.withColumn("array_of_decimals", F.col("array_of_strings").cast(ArrayType(DecimalType(38, 18))))
df.write.option("write_type_v3", "true").yt('//tmp/kek3')

throws an error ending with:
Caused by: scala.MatchError: DecimalType(38,18) (of class org.apache.spark.sql.types.DecimalType) at tech.ytsaurus.spyt.serializers.YsonRowConverter$.serializeValue(YsonRowConverter.scala:235)
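
Until arrays of DecimalType are handled by the serializer, one possible interim workaround (only a sketch, not a confirmed fix, with a placeholder destination path) is to write the column as an array of strings and defer the decimal cast to read time:

# Sketch: write the original string array instead of the decimal array; the
# destination //tmp/kek3_strings is a placeholder. Untested against the bug above.
df.select("array_of_strings").write.option("write_type_v3", "true").yt("//tmp/kek3_strings")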

bug(cli): ValueError when running yt run-job-shell command

When attempting to run a job shell with the yt command, I encountered an error.

The error occurred during the execution of the following command:

$ yt run-job-shell ec771307-d3b685d-10384-7

This command resulted in a ValueError: kwargs can't be used if request is an HTTPRequest object traceback, as shown here:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/yt/wrapper/cli_helpers.py", line 75, in run_main
    main_func()
  File "/usr/local/lib/python3.8/dist-packages/yt/cli/yt_binary.py", line 2703, in main_func
    args.func(**func_args)
  File "/usr/local/lib/python3.8/dist-packages/yt/wrapper/job_commands.py", line 66, in run_job_shell
    JobShell(job_id, shell_name=shell_name, interactive=True, timeout=timeout, client=client).run(command=command)
  File "/usr/local/lib/python3.8/dist-packages/yt/wrapper/job_shell.py", line 328, in run
    self._spawn_shell(command=command)
  File "/usr/local/lib/python3.8/dist-packages/yt/wrapper/job_shell.py", line 258, in _spawn_shell
    self.make_request("spawn", callback=self._on_spawn_response, term=os.environ["TERM"], command=command)
  File "/usr/local/lib/python3.8/dist-packages/yt/wrapper/job_shell.py", line 162, in make_request
    self.client.fetch(req, callback=callback)
  File "/usr/local/lib/python3.8/dist-packages/tornado/httpclient.py", line 289, in fetch
    raise ValueError(
ValueError: kwargs can't be used if request is an HTTPRequest object

This issue occurred while running YT wrapper version 0.13.1, confirmed with the command:

$ yt --version

Which returned:

Version: YT wrapper 0.13.1

Single quotes prevent variable substitution in shell commands in BUILD.md

When trying to execute the following shell commands from the BUILD.md file on Ubuntu, I encountered an error because the variable substitution did not occur due to the use of single quotes:

curl -s https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add
curl -s https://apt.kitware.com/keys/kitware-archive-latest.asc | gpg --dearmor - | sudo tee /usr/share/keyrings/kitware-archive-keyring.gpg
echo 'deb http://apt.llvm.org/$(lsb_release -cs)/ llvm-toolchain-$(lsb_release -cs)-12 main' | sudo tee /etc/apt/sources.list.d/llvm.list
echo 'deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $(lsb_release -cs) main' | sudo tee /etc/apt/sources.list.d/kitware.list

As you can see, the commands use single quotes to enclose the string arguments that contain variables. This prevents the shell from expanding the variables, causing the commands to fail.

To fix the issue, the single quotes should be replaced with double quotes, like so:

echo "deb http://apt.llvm.org/$(lsb_release -cs)/ llvm-toolchain-$(lsb_release -cs)-12 main" | sudo tee /etc/apt/sources.list.d/llvm.list
echo "deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/kitware.list

This will allow the shell to expand the variables correctly.

I suggest updating the BUILD.md file to reflect this change and avoid confusion for other users who may encounter the same issue.

bug(cli): Client is missing credentials on yt issue-token admin

I'm trying to issue a new token for admin using a password. The password is correct and I'm able to authenticate via the web UI.

$ kubectl -n yt port-forward services/http-proxies-lb 3080:80
$ env | grep -i yt
PWD=/src/faster/gh-archive-yt
YT_CONFIG_PATCHES={proxy={enable_proxy_discovery=%false}}
YT_PROXY=localhost:3080
$ yt issue-token admin --password MY-PASSWORD
Your request 3225e24b-47ca434-8a664506-153bb101 has failed to authenticate at hp-0.http-proxies.yt.svc.cluster.local. Make sure that you have provided an OAuth token with the request. In case you do not have a valid token, please refer to  for obtaining one. If the error persists and system keeps rejecting your token, please kindly submit a request to https://ytsaurus.tech/#contact
    Client is missing credentials
***** Details:
Your request 3225e24b-47ca434-8a664506-153bb101 has failed to authenticate at hp-0.http-proxies.yt.svc.cluster.local. Make sure that you have provided an OAuth token with the request. In case you do not have a valid token, please refer to  for obtaining one. If the error persists and system keeps rejecting your token, please kindly submit a request to https://ytsaurus.tech/#contact    
    origin          work on 2023-05-24T16:11:18.716412Z
Received HTTP response with error    
    origin          work on 2023-05-24T16:11:18.716200Z    
    url             http://localhost:3080/api/v4/issue_token    
    request_headers {
                      "User-Agent": "Python wrapper 0.13.1",
                      "Accept-Encoding": "gzip, identity",
                      "X-Started-By": "{\"pid\"=460512;\"user\"=\"ernado\";}",
                      "X-YT-Header-Format": "<format=text>yson",
                      "X-YT-Parameters": "{\"suppress_transaction_coordinator_sync\"=%false;\"user\"=\"admin\";\"password_sha256\"=\"****\";}",
                      "X-YT-Correlation-Id": "8d0304c1-dc905c81-71eaa288-771c5da7"
                    }    
    response_headers {
                      "Content-Length": "320",
                      "X-YT-Response-Message": "Client is missing credentials",
                      "X-YT-Response-Code": "111",
                      "X-YT-Trace-Id": "4fe58282-7c8bb156-50da2bca-fdade6a5",
                      "X-YT-Error": "{\"code\":111,\"message\":\"Client is missing credentials\",\"attributes\":{\"host\":\"hp-0.http-proxies.yt.svc.cluster.local\",\"pid\":1,\"tid\":14507871149296041952,\"thread\":\"Poller:1\",\"fid\":18446443602852345016,\"datetime\":\"2023-05-24T16:11:18.713049Z\",\"trace_id\":\"4fe58282-7c8bb156-50da2bca-fdade6a5\",\"span_id\":14898918599301604335}}",
                      "X-YT-Request-Id": "3225e24b-47ca434-8a664506-153bb101",
                      "Content-Type": "application/json",
                      "Cache-Control": "no-store",
                      "X-YT-Proxy": "hp-0.http-proxies.yt.svc.cluster.local"
                    }    
    params          {
                      "suppress_transaction_coordinator_sync": false,
                      "user": "admin",
                      "password_sha256": "***"
                    }    
    transparent     True
Client is missing credentials    
    code            111    
    origin          hp-0.http-proxies.yt.svc.cluster.local on 2023-05-24T16:11:18.713049Z (pid 1, tid c9564ff1bcedbfe0, fid fffeeeb92d4a74b8)    
    thread          Poller:1    
    trace_id        4fe58282-7c8bb156-50da2bca-fdade6a5    
    span_id         14898918599301604335

Add Rust SDK support

In addition to the already existing SDKs, it would be nice to see official Rust support.

Add support for 3rd party Identity Providers

Centralized identity providers (like Active Directory, OpenLDAP, etc.) are often used in IT landscapes to manage access control to systems.

As far as I can see, YTsaurus currently supports only built-in user management, which brings additional administration overhead from managing the user set in at least two places (AD and YTsaurus itself).

Allow to disable proxy discovery in go client without magic

For now, the only way to disable proxy discovery in the Go client is to have digits in your DNS record: https://github.com/ytsaurus/ytsaurus/blob/main/yt/go/yt/config.go#L261-L263

func NormalizeProxyURL(proxy string, tvmOnly bool, tvmOnlyPort int) ClusterURL {
	// ...
	if strings.ContainsAny(proxy, "0123456789") {
		url.DisableDiscovery = true
	}
	// ...
}

I had to patch /etc/hosts to get the Go MapReduce client working.

Also, the current behavior is very non-obvious. If your DNS records happen to contain digits, everything works for you through the balancer, even if you wanted to work with the proxies directly.

UI couldn’t be reached

I installed ytsaurus following the instructions:
cd yt/docker/local
./run_local_cluster.sh

Then I go to http://localhost:8001/ in Chrome and get the following error:
UI couldn't be reached
Are there any tricks to fix it?

connectors/import.py does not work for local files

My guess: the path passed to spark.read inside the script is expected to be a Cypress path (a possible workaround is sketched after the stack trace below).

For example, for parquet, the error is:
23/07/26 18:13:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/07/26 18:13:04 WARN FileSystem: Cannot load filesystem
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.web.WebHdfsFileSystem could not be instantiated
at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:582)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:804)
at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:722)
at java.base/java.util.ServiceLoader$3.next(ServiceLoader.java:1395)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2631)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2650)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.spark.util.DependencyUtils$.resolveGlobPath(DependencyUtils.scala:317)
at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2(DependencyUtils.scala:273)
at org.apache.spark.util.DependencyUtils$.$anonfun$resolveGlobPaths$2$adapted(DependencyUtils.scala:271)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.util.DependencyUtils$.resolveGlobPaths(DependencyUtils.scala:271)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$9(SparkSubmit.scala:368)
at scala.Option.map(Option.scala:230)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:368)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:931)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1078)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1087)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/ObjectMapper
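
If that guess is right, one possible interim workaround (a sketch with placeholder local and Cypress paths) is to upload the local file into Cypress first and point the import script at the Cypress path:

import yt.wrapper as yt

# Sketch: upload a local parquet file into Cypress, then pass the Cypress path
# (//tmp/import/data.parquet) to connectors/import.py instead of the local path.
yt.create("file", "//tmp/import/data.parquet", recursive=True, ignore_existing=True)
with open("data.parquet", "rb") as f:
    yt.write_file("//tmp/import/data.parquet", f)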

Clang 16 build error on Fedora 38

During a build of YTsaurus on the latest main branch (commit bfd3ff23c327b385e199682d2179d2d95b4a3c91) I get the following error during the CMake phase (cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=../ytsaurus/clang.toolchain ../ytsaurus):

...
/home/zamazan4ik/.conan/data/bison/3.5.3/_/_/build/8d34a439bc60645553a8166e604aef7532be0f25/src/lib/obstack.c:351:31: error: incompatible function pointer types initializing 'void (*)(void) __attribute__((noreturn))' with an expression of type 'void (void)' [-Wincompatible-function-pointer-types]
__attribute_noreturn__ void (*obstack_alloc_failed_handler) (void)
                              ^
1 error generated.
make[2]: *** [Makefile:6033: lib/libbison_a-obstack.o] Error 1
make[2]: *** Waiting for unfinished jobs....
  CC       src/bison-ielr.o
make[2]: Leaving directory '/home/zamazan4ik/.conan/data/bison/3.5.3/_/_/build/8d34a439bc60645553a8166e604aef7532be0f25/build-release'
make[1]: *** [Makefile:8342: install-recursive] Error 1
make[1]: Leaving directory '/home/zamazan4ik/.conan/data/bison/3.5.3/_/_/build/8d34a439bc60645553a8166e604aef7532be0f25/build-release'
make: *** [Makefile:8798: install] Error 2
bison/3.5.3: 
bison/3.5.3: ERROR: Package '8d34a439bc60645553a8166e604aef7532be0f25' build failed
bison/3.5.3: WARN: Build folder /home/zamazan4ik/.conan/data/bison/3.5.3/_/_/build/8d34a439bc60645553a8166e604aef7532be0f25/build-release
ERROR: bison/3.5.3: Error in build() method, line 139
        autotools.install()
        ConanException: Error 2 while executing make install DESTDIR=/home/zamazan4ik/.conan/data/bison/3.5.3/_/_/package/8d34a439bc60645553a8166e604aef7532be0f25 -j24
CMake Error at cmake/conan.cmake:668 (message):
  Conan install failed='1'
Call Stack (most recent call first):
  CMakeLists.txt:43 (conan_cmake_install)

As a compiler, I use Clang 16 (clang version 16.0.6 (Fedora 16.0.6-2.fc38)).

YQL code generation failing with unexpected errors

The request

use ytdemo;
pragma yt.DefaultCluster ='ytdemo';


$multiply_member = ($st, $m, $x) -> {
    return $st.$m*$x
};

$struct_multiply = ($struct, $x) -> {
    $members = ["a", "b"];
    $multiplier = EvaluateCode(LambdaCode(($st) -> {
        return FuncCode(
            "AsStruct",
            ListMap(
                $members,
                ($m) -> (
                    ListCode(
                        AtomCode($m),
                        FuncCode(
                            "Apply",
                            QuoteCode($multiply_member),
                            $st,
                            ReprCode($m),
                            ReprCode($x)
                        )
                    )
                )
            )
        )
    }));
    return $multiplier($struct);
};

insert into `//home/ibushmarinov/testmult`
with truncate
select * from (
    select 
    -- /RenameMembers(
        $struct_multiply(
            TableRow() , 2)
    --     ,
    --      ($x) -> ($x || "_2")),
    -- RenameMembers($struct_multiply(TableRow(), 3), ($x) -> ($x || "_3"))
    from (Select 1 as a, 2 as b)
) Flatten columns;
commit;
select * from `//home/ibushmarinov/testmult`;

fails with the following cryptic error:

YQL embedded call failed[1]
yt/yql/plugin/plugin.cpp:171: Failed to run: -memory-:<main>: Fatal: Expression evaluation

    -memory-:<main>:11:19: Fatal: At function: EvaluateCode
    	    $multiplier = EvaluateCode(LambdaCode(($st) -> {
	                  ^
        -memory-:<main>: Fatal: yql/providers/yt/lib/config_clusters/config_clusters.cpp:96: TYtGatewayConfig: No default cluster
        
[1]

This seems to be a bug on the server side and not in the code itself. What does "default cluster" even mean in this context?

bug(sdk/go): http client generates invalid header values

Sending a select_rows command with a placeholder that has a binary value leads to a net/http: invalid header field value for "X-Yt-Parameters" error.

It seems the YSON text encoder does not escape some characters that HTTP considers control characters.

Reproducer
package internal_test

import (
	"context"
	"fmt"
	"testing"

	"go.uber.org/zap/zaptest"
	ytzap "go.ytsaurus.tech/library/go/core/log/zap"
	"go.ytsaurus.tech/yt/go/yson"
	"go.ytsaurus.tech/yt/go/yt"
	"go.ytsaurus.tech/yt/go/yt/ythttp"
)

// placeholderSet is a simple wrapper to encode placeholder values.
//
// See https://github.com/ytsaurus/ytsaurus/blob/bbaedcf016db5e599563bb3f3b5ba87b047b4faa/yt/yt/library/query/base/query_preparer.cpp#L2461.
type placeholderSet struct {
	values []any
}

func (p placeholderSet) MarshalYSON(w *yson.Writer) error {
	w.BeginMap()
	for i, v := range p.values {
		w.MapKeyString(fmt.Sprintf("placeholder%d", i))
		w.Any(v)
	}
	w.EndMap()
	return nil
}

func TestPlaceholders(t *testing.T) {
	ctx := context.Background()

	log := zaptest.NewLogger(t)
	yc, err := ythttp.NewClient(&yt.Config{
		Logger: &ytzap.Logger{L: log},
	})
	if err != nil {
		t.Fatal(err)
	}

	var (
		id = [16]byte{
			0x1f, 0x1f, 0x1f, 0x1f, // <- 0x1f is a control character (unit separator)
			0x1f, 0x1f, 0x1f, 0x1f,
			0x1f, 0x1f, 0x1f, 0x1f,
			0x1f, 0x1f, 0x1f, 0x1f,
		}
		// Table schema does not matter, since query does not actually reach the proxy.
		query        = "* from [//table] where uuid = {placeholder0}"
		placeholders = placeholderSet{
			values: []any{id[:]},
		}
	)

	r, err := yc.SelectRows(ctx, query, &yt.SelectRowsOptions{
		PlaceholderValues: placeholders,
	})
	if err != nil {
		t.Fatal(err)
	}
	defer r.Close()
}
Logs
2023-06-01T21:28:18.836+0300      DEBUG   request started {"call_id": "f4d2888d-b94c52bc-7541d52f-5e7febcf", "method": "select_rows", "query": "* from [//table] where uuid = {placeholder0}"}
2023-06-01T21:28:18.836+0300      DEBUG   sending HTTP request    {"call_id": "f4d2888d-b94c52bc-7541d52f-5e7febcf", "proxy": "localhost:8000"}
2023-06-01T21:28:18.836+0300      WARN    retrying heavy read request     {"call_id": "f4d2888d-b94c52bc-7541d52f-5e7febcf", "backoff": "2.012342203s", "error": "Get \"http://localhost:8000/api/v4/select_rows\": net/http: invalid header field value for \"X-Yt-Parameters\""}

[Feature] Implement S3 Proxy for YTsaurus

Is your feature request related to a problem? Please describe.

Currently, YTsaurus does not provide an option to access data using the Amazon S3 protocol. This would be a valuable addition to improve compatibility with various tools and services that are industry standards.

Describe the solution you'd like

I would like to propose implementing an S3 Proxy for YTsaurus, which would enable users to access data stored in YTsaurus using the Amazon S3 protocol. This would make YTsaurus more versatile and user-friendly, as it would integrate seamlessly with a wide range of applications and services that rely on the S3 protocol.

Describe alternatives you've considered

An alternative would be to create custom adapters for each application or service that requires access to data stored in YTsaurus. However, this would be less efficient, as it would require more development effort and would not leverage the advantages of the widely-used S3 protocol.

Add Type Annotations to yt.wrapper.YtClient Class

I'm extensively using the YtClient class in my codebase, especially in the following scenarios:

  • When mocking yt_client with a preconfigured environment.
  • For scripts that are interacting with two different clusters.

However, I've noticed that the current type of the imported YtClient is Any, which in turn disables any code suggestions or autocompletion. This greatly reduces developer productivity and could potentially lead to type-related issues.

It would be a significant improvement to the development experience if type annotations were added to the YtClient class. This would assist with auto-completion and early error detection, leading to higher-quality code.

Looking forward to hearing from you, and thank you for your consideration!
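
As an illustration, a stub along these lines would restore autocompletion; the signatures below are simplified sketches of mine, not the actual ones:

# client.pyi (sketch) -- simplified, hypothetical signatures for illustration only
from typing import Any, Iterable, Optional

class YtClient:
    def __init__(self, proxy: Optional[str] = None, token: Optional[str] = None,
                 config: Optional[dict] = None) -> None: ...
    def get(self, path: str, **kwargs: Any) -> Any: ...
    def set(self, path: str, value: Any, **kwargs: Any) -> None: ...
    def list(self, path: str, **kwargs: Any) -> list: ...
    def read_table(self, table: str, **kwargs: Any) -> Iterable: ...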

Provide the ability to set data_center and rack attributes for exec/tablet/data nodes via the config file

We have plans for multiple ytsaurus clusters with different replication schemes: 3dc, single-dc-3nodes, cross-cluster.

For such a setup we've created a simple bootstrap program which inspects the given environment and prepares an appropriate config for the given node: tablet, data, or exec node.

Currently, after a node starts, it is necessary to set attributes in Cypress - //sys/cluster_nodes/<node>/@data_center and //sys/cluster_nodes/<node>/@rack - in order to use rack-aware or datacenter-aware policies (see the sketch below).
This obliges putting a cluster token into each node container, as well as Python and its environment, or writing another tool to monitor the cluster for incoming nodes and re-pin them to the proper rack and datacenter.

It would be useful to be able to provide the necessary node-placement attributes in the config file as an alternative to the Cypress leaf.
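
For context, the manual step that such a config option would replace looks roughly like this with the Python wrapper (proxy and node address are placeholders):

import yt.wrapper as yt

client = yt.YtClient(proxy="cluster.example.com")  # placeholder proxy address
node = "1.exec.node.fqdn:9012"  # placeholder node address
# The attributes that currently have to be set in Cypress after the node starts:
client.set("//sys/cluster_nodes/{}/@rack".format(node), "rack-a1")
client.set("//sys/cluster_nodes/{}/@data_center".format(node), "dc1")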
