ydb-platform / nbs Goto Github PK

Network Block Store

License: Apache License 2.0

CMake 0.01% Python 12.63% C 28.70% C++ 55.47% Go 1.78% Assembly 0.74% Cython 0.41% Roff 0.01% POV-Ray SDL 0.01% Lua 0.01% NASL 0.04% Pawn 0.01% SWIG 0.01% CSS 0.02% HTML 0.02% JavaScript 0.12% Shell 0.04% LLVM 0.01% GAP 0.01% SourcePawn 0.01%

nbs's Introduction

Network Block Store

Network Block Device implementation over YDB BlobStorage or over our own storage nodes. Offers reliable thin-provisioned block devices which support snapshots.

Block storage overview diagram

Quickstart

Follow the instructions here to generate workspace and install the necessary plugins.

Follow the instructions here to build and run NBS on your machine and to attach an NBS-based disk via NBD. NBS-based disks can be attached via vhost-user-blk as well.

Follow the instructions here to install clang-format for formatting the code.

Additional information about features of our Github Actions (labels, test results and so on)

Documentation

The docs can be found here. We are in the process of writing them atm. The overall repository structure can be found here.

How to Deploy

TODO

nbs's People

Stargazers

Watchers

Forkers

dpoluyanov blockspacer librarian dcherednik t33m manykey jkuradobery linecode 4justme4 enjection aikuchin alexvru shadchin bsv798

nbs's Issues

[NBS] Ability to save block checksums in TBlobMeta to check them upon ReadBlob

In case of data corruption bugs we need to be able to track the layer which causes the issue. This issue is about recording block checksums after WriteBlob in TBlobMeta (

nbs/cloud/blockstore/libs/storage/protos/part.proto

Line 148 in dd8b1e2

message TBlobMeta

) to check that the data read via ReadBlob is the same data that was written upon WriteBlob. Since it implies storing a lot of metadata it will cause a performance hit => we can't enable this logic by default, need to be able to enable it via features config.

[NBS] Implement CSI driver for NBS

DoD:

using kubectl you can run pod with container with attached NBS-volume as a nbd device.
using kubectl you can run pod with container with QEMU with attached NBS-volume via vhost.
using kubectl you can create NBS-volumes from images

Stopped to upload volume stats to some of the ydb tables

It turned out that ydb TValue type variable passed to BulkUpsert method looses it's value, so any copy of this TValue looses it's value to.
Submitted issue to ydb c++ sdk.
ydb-platform/ydb#1659

So far the only way to avoid this problem is to send distinct TValue to BulkUpsert

[NBS] Increase NRD/Mirrored migration throughput by proper throttling policy and proper migration process parallelism

Right now migration throughput throttling throttles throughput by the following formula: Max(4MiB, MaxMigrationBandwidth / SourceAgentShare), where SourceAgentShare == NumberOfVolumeDevicesOnAgent / TotalNumberOfDevicesOnAgent, where the denominator is not calculated properly and is just set to a constant 15. First of all, this constant should not be a constant. Second, this kind of throttling would've been fine if migration ran in parallel for all agents on which the volume depends. Right now we run migration sequentially - from the first device to the last - which makes our throttling policy too pessimistic and increases total data migration time.

[NBS] Don't block writes if one of the replicas of a STORAGE_MEDIA_SSD_MIRROR{2,3} disk is temporarily unavailable

Right now every write waits until all replicas return a response. If one of the replicas is unavailable for 30+ seconds, it will be automatically replaced by DiskRegistry, but during these 30+ seconds all write requests will be frozen for the client. We can implement the following logic:

Implement "soft timeout" (default = 5s) at the TMirrorRequestActor level - if a write request to one of the replicas hangs for this duration, report this problem to TMirrorPartitionActor which will notify the volume (TVolumeActor), TMirrorRequestActor should then respond with E_REJECTED and die
The volume records this fact in its local database by setting the IncompleteWriteMode flag, notifies TMirrorPartitionActor, TMirrorRequestActor transitions to the incomplete write mode
For all new write requests TMirrorPartitionActor initializes TMirrorRequestActor in the incomplete write mode

In the incomplete write mode TMirrorRequestActor stops waiting for the lagging replica after the aforementioned soft timeout and reports these timeouts to TMirrorPartitionActor, which keeps a queue of these unfinished requests and resends these requests to the lagging replica. If the queue becomes too big (say, the amount of data becomes > 256MiB), TMirrorPartitionActor notifies the volume (and the volume handles this notification with a restart). If the queue becomes empty, TMirrorPartitionActor disables the incomplete write mode and notifies the volume. The volume resets the IncompleteWriteMode flag.

Upon volume restart or upon MirrorRequestActor restart the volume starts the Resync process if the IncompleteWriteMode flag was true. When the Resync process finishes the IncompleteWriteMode flag should be set to false.

If we implement this logic, the most common problem - unavailability of a single blockstore-disk-agent - will freeze clients' requests by only 5 seconds instead of 30 seconds.

The logic can be implemented in a series of PRs:

Implement the IncompleteWriteMode flag and the corresponding TMirrorPartitionActor<->TVolumeActor notifications + uts
Implement the incomplete write mode in TMirrorPartitionActor and TMirrorRequestActor + uts
Implement the soft timeout logic + uts
Write loadtest testcases

docs: unable to run example

Trying to follow https://github.com/ydb-platform/nbs/tree/main/example.

Encountered following problems:

0. Binary names

Probably the diskagentd is the new name for blockstore-disk-agent.

1. Local binaries

Example expects all binaries being available locally.
I've created symlinks as a fix, like that:

ln -s /home/ernado/nbswork/build/cloud/blockstore/apps/disk_agent/diskagentd ./diskagentd

2. can't open "nbs/nbs-log.txt" with mode RdOnly|Seq

3-start_nbs.sh prints following after start:

2023-07-16-18-02-42 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/blockstore/libs/daemon/common/bootstrap.cpp:198: NBS server version: 1689516977.main
main bootstrap start: (TFileError) (No such file or directory) /home/ernado/nbswork/nbs/util/system/file.cpp:857: can't open "nbs/nbs-sys.txt" with mode RdOnly|Seq (0x00000028)
2023-07-16-18-02-42 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/blockstore/libs/daemon/common/bootstrap.cpp:828: Stopped Scheduler
2023-07-16-18-02-42 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/blockstore/libs/daemon/common/bootstrap.cpp:831: Stopped BackgroundThreadPoo

For some reason, files are opened as readonly, but probably meant to be created by nbsd:

$NBSD \
    --domain             Root \
    --node-broker        localhost:$GRPC_PORT \
    --ic-port            $IC_PORT \
    --mon-port           $MON_PORT \
    --server-port        $SERVER_PORT \
    --data-server-port   $DATA_SERVER_PORT \
    --secure-server-port $SECURE_SERVER_PORT \
    --discovery-file     nbs/nbs-discovery.txt \
    --domains-file       nbs/nbs-domains.txt \
    --ic-file            nbs/nbs-ic.txt \
    --log-file           nbs/nbs-log.txt \
    --sys-file           nbs/nbs-sys.txt \
    --server-file        nbs/nbs-server.txt \
    --storage-file       nbs/nbs-storage.txt \
    --naming-file        nbs/nbs-names.txt \
    --diag-file          nbs/nbs-diag.txt \
    --auth-file          nbs/nbs-auth.txt \
    --dr-proxy-file      nbs/nbs-dr-proxy.txt \
    --service            kikimr \
    --load-configs-from-cms \
    --profile-file       logs/profile-log.bin \

I've created all these files via touch as a workaround.

3. can't open "nbs/nbs-log.txt" with mode RdOnly|Seq

2023-07-16-18-04-34 :BLOCKSTORE_SERVER ERROR: /home/ernado/nbswork/nbs/cloud/blockstore/libs/storage/core/manually_preempted_volumes.cpp:156: Failed to load manually preempted volumes: Failed to read preempted volumes list with error: (TFileError) (No such file or directory) /home/ernado/nbswork/nbs/util/system/file.cpp:857: can't open "/var/log/nbs-server/nbs-preempted-volumes.json" with mode RdOnly (0x00000008)

The /var/log/nbs-server/nbs-preempted-volumes.json file is required and the local path for it is not provided in config.

I've also created it manually.

4. requirement systemConfig.HasScheduler() failed

VERIFY failed (2023-07-16T18:12:14.181403+0300): 
  /home/ernado/nbswork/nbs/ydb/core/driver_lib/run/kikimr_services_initializers.cpp:594
  InitializeServices(): requirement systemConfig.HasScheduler() failed

Can't proceed or find workaround :(

Full log:

2023-07-16-18-13-01 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/blockstore/libs/daemon/common/bootstrap.cpp:198: NBS server version: 1689516977.main
2023-07-16-18-13-01 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/blockstore/libs/daemon/ydb/bootstrap.cpp:262: Configs initialized
2023-07-16-18-13-01 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/storage/core/libs/kikimr/node.cpp:97: Trying to register dynamic node at "localhost:9001"
2023-07-16-18-13-01 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/storage/core/libs/kikimr/node.cpp:124: Registered dynamic node at "localhost:9001" with address "127.0.1.1"
2023-07-16-18-13-01 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/blockstore/libs/daemon/ydb/bootstrap.cpp:273: CMS configs initialized
2023-07-16-18-13-01 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/blockstore/libs/daemon/ydb/bootstrap.cpp:292: ClientPercentiles initialized
2023-07-16-18-13-01 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/blockstore/libs/daemon/ydb/bootstrap.cpp:307: StatsAggregator initialized
2023-07-16-18-13-01 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/blockstore/libs/daemon/ydb/bootstrap.cpp:333: StatsUploader initialized
2023-07-16-18-13-01 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/blockstore/libs/daemon/ydb/bootstrap.cpp:389: DiscoveryService initialized
2023-07-16-18-13-01 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/blockstore/libs/daemon/ydb/bootstrap.cpp:396: TraceSerializer initialized
2023-07-16-18-13-01 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/blockstore/libs/daemon/ydb/bootstrap.cpp:442: Allocator initialized
2023-07-16-18-13-01 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/blockstore/libs/daemon/ydb/bootstrap.cpp:446: ProfileLog initialized
2023-07-16-18-13-01 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/blockstore/libs/daemon/ydb/bootstrap.cpp:464: DigestGenerator initialized
2023-07-16-18-13-01 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/blockstore/libs/daemon/ydb/bootstrap.cpp:470: LogbrokerService initialized
2023-07-16-18-13-01 :BLOCKSTORE_SERVER INFO: /home/ernado/nbswork/nbs/cloud/blockstore/libs/daemon/ydb/bootstrap.cpp:476: NotifyService initialized
UDFsDir is not specified, no dynamic UDFs will be loaded. 
VERIFY failed (2023-07-16T18:13:01.120651+0300): 
  /home/ernado/nbswork/nbs/ydb/core/driver_lib/run/kikimr_services_initializers.cpp:594
  InitializeServices(): requirement systemConfig.HasScheduler() failed
??+0 (0x8A08D57)
??+0 (0x8A005CB)
??+0 (0x8EF78A4)
??+0 (0x8EEFEB4)
??+0 (0x91D60A4)
??+0 (0x91C6C84)
??+0 (0x8EE9017)
??+0 (0x8A1EBB1)
??+0 (0x8AD2820)
??+0 (0x89953E9)
??+0 (0x89952E2)
??+0 (0x7F34C8609D90)
__libc_start_main+128 (0x7F34C8609E40)
??+0 (0x7F40025)
./3-start_nbs.sh: line 33: 1595767 Aborted                 (core dumped) $NBSD --domain Root --node-broker localhost:$GRPC_PORT --ic-port $IC_PORT --mon-port $MON_PORT --server-port $SERVER_PORT --data-server-port $DATA_SERVER_PORT --secure-server-port $SECURE_SERVER_PORT --discovery-file nbs/nbs-discovery.txt --domains-file nbs/nbs-domains.txt --ic-file nbs/nbs-ic.txt --log-file nbs/nbs-log.txt --sys-file nbs/nbs-sys.txt --server-file nbs/nbs-server.txt --storage-file nbs/nbs-storage.txt --naming-file nbs/nbs-names.txt --diag-file nbs/nbs-diag.txt --auth-file nbs/nbs-auth.txt --dr-proxy-file nbs/nbs-dr-proxy.txt --service kikimr --load-configs-from-cms --profile-file logs/profile-log.bin $@

Add request metrics for hive_proxy and ss_proxy

Need them to track problems with ydb hive and scheme shard. Required metrics are requests latency and requests max time

Start nbd-endpoint with a connected nbd-device

There are two subtask:

nbs should connect NBD device to NBD endpoint when nbs starts this NBD endpoint. NBD device file should be passed to StartEndpointRequest.
if there is no NBD device file in StartEndpointRequest, then nbs should choose any available NBD device.

Need for #455

[Disk Manager] move Disk Manager into a separate git repo

move Disk Manager into a separate git repo
move tasks into a separate git repo
tasks should be a Disk Manager submodule
get rid of the dependence on the nbs
create separate go mod for disk manager/tasks

[NBS] NRD/Mirrored disk default encryption

We can implement default encryption for STORAGE_MEDIA_SSD_{NONREPLICATED,MIRROR2,MIRROR3} both at rest and in transit by encrypting data at the endpoint level - just like user key-based encryption. The default key can be generated upon CreateVolume and stored in NProto::TVolumeConfig and then propagated to the endpoint upon MountVolume. The only remaining problem is the performance hit which is currently mostly caused by the need to update UsedBlockMap in the volume database. But actually we don't need this bitmap. A good encryption algorithm should produce data that's indistinguishable from the data generated by a uniform random generator. Therefore we don't need this bitmap - we can simply check that the block that we have just read from the storage node contains only zeroes - if so, we should not decrypt this block, otherwise - decrypt. We don't even have to check the whole block - checking the first 128 bytes should be more than enough. And just in case (which should be literally impossible - P=1/2^1024) an encrypted block contains zeroes in the first 128 bytes, we can check it upon write and report a E_IO or a E_ARGUMENT error to the client and raise a CriticalEvent. This endpoint-level encryption is also great because it won't impact the vhost-server-based fast rdma datapath - we will just disable it at the fastpath endpoint level.

[Blockstore] should add blobs to index in parallel while writing them

Currently, WriteBlocks request consists of two consecutive stages:

Writing a blob to the Blob Storage
Adding the blob to the partition tablet index

This approach needs improvement. The idea is simple:

Persisting 'unconfirmed blob' information in parallel during writing the blobs to Blob Storage
Deferring or returning a retriable error for 'unconfirmed ranges' at this time
Confirming the blobs by adding them to the index when it's known that all the blobs have been successfully written to Blob Storage
Finally, cleaning up the 'unconfirmed blobs', previously persisted in step 1

The only subtlety is what to do with partially-written blobs. Suppose the tablet is restarted in the middle of writing data. In this case, some blobs may be partially written and cannot be recovered. Therefore, we need to confirm/recover these blobs at tablet startup. This recovery process is done using the TEvBlobStorage::TEvGet request with the 'restore' flag for each blob.

[NBS] Any errors in the mirrored disk replica should be retriable

It is necessary not to throw an error into the guest system for any errors with the replica, because the replica will be replaced with a new one.

[Filestore] Implement authorization

Almost done, but still need to implement control endpoint via unix socket

[Filestore] Make Filestore read throughput scalable with respect to the number of clients

We want a single FS to be able to provide more read throughput when more clients (VMs) connect to the FS. The idea is simple: right now the bottleneck for large reads is the IndexTablet, because all data is proxied via it. But we don't really need to transfer all data via the tablet. The tablet can return <BSGroupId, BlobId, ByteRange> tupes to the client (filestore-vhost) instead of returning the requested data. The client can read the data from the specified BSGroups by itself. A fallback should be implemented for the case when the client cannot read the data by itself - e.g. if the specified BlobId has already been deleted or if the client has no direct access to the storage nodes.

[NBS] Support simultaneous black and whitelist in features config

Right now black and whitelist are mutually exclusive. We need to specify both to safely roll out features that can potentially disrupt operations

[NBS] неверно вычисляется MaxMigrationTime

Похоже что после окончания миграции не сбрасывается время начала миграции и если начинается новая миграция, считается что она началась тогда же когда началась первая, из-за этого если у диска было несколько миграций, метрика MaxMigrationTime в DiskRegistry начинает врать и зажигать алерты.

[NBS] support proxy overlay disks with multipartition base disks

now we create proxy overlay disks if base disk (user disk) have only one partition and don't use optimization they give

want to support proxy overlay disks created for base disk with several partitions

[Filestore] Get rid of libfuse/virtiofsd

Right now we use libfuse and virtiofsd as dynamically linked libraries because of their licenses. Plus they have fatal flaws:

constantly jumping back and forth between this lib and our code is crooked and also inconvenient;
there is an uaf bug at endpoint stop due to crooked resource ownership, potentially leading to data loss or corruptions;
no possibility of customization (for example, libfuse parses its own events and if parsing ended with an error, we won't know about it).

[NBS] If DA is temporarily unavailable, DR resets the disk session

If one of agents is unavailable, a request to reacquire a disk session may result in the termination of that session. Which will lead to IO freeze (until the disk is acquired again).

[NBS] Включение Configs Dispatcher в сервисах nbs/nfs/etc

Для того, что бы ноды захватывали конфигурации из CMS их нужно переключить на новый CMS, а так же включить Configs Dispatcher, что бы захватывать изменения в CMS без рестартов сервиса.

Нужно переключить получение CMS на новый путь при рестрате.

nbs/cloud/blockstore/libs/disk_agent/config_initializer.cpp

Line 27 in 1ec032c

void TConfigInitializer::ApplyCMSConfigs(NKikimrConfig::TAppConfig cmsConfig)
В NBS используются только следующие сервисы, однако Configs Dispatcher не должен захватывать конфиги для всех т.к. некоторые настраиваются локально.

nbs/contrib/ydb/core/driver_lib/run/config.h

Line 15 in a11ddd6

struct {

[NBS] Race between send and recv completions

Once in a while we get send completion that refers to a request that doesn't seem to be active. Since work request id is perfectly normal, we can be pretty certain that it was active at some point, but probably got completed after receiving a response and indeed no longer active

We want to make sure that it's actually the case

:BLOCKSTORE_RDMA WARN: [16392301608885229807] SEND E37D2776:0:FE error: request 0x000007A8F2268280 not found

[NBS] Automatically increase partition tablet count upon volume resize

Partition tablet count depends on volume size - the bigger the volume, the more partition tablets it should use. But right now we cannot migrate from one partition tablet "geometry" to another => we create multiple partition tablets only upon volume creation. If a user creates a small volume and then resizes it this volume will still have only 1 partition tablet and will provide less performance than a volume which is created big.

Upon a resize operation which requires a change in the number of partition tablets we can create a new partition tablet geometry and write data only to the new geometry, mirror zero requests to both geometries and migrate the data from the old geometry to the new one in background. After this migration finishes we should delete the old geometry.

[NBS] Local SSD over NBS

Need to enable and deploy all the necessary settings in order for local disks to start working under NBS control in all zones.

Invalid migration of nrd disk

Disk Registry tried to perform invalid migration. Namely, DR tried to migrate device X and assumed that this device belonged to disk Y. But device X didn't actually belong to disk Y.

This caused repeating DiskRegistryDeviceDoesNotBelongToDisk errors that lasted until DR tablet reboot.

Need to find the reason why the invalid migration occured and fix the bug that caused it.

[NBS] missing sensor for inflight requests on disk agent

We want to monitor I/O requests on disk to help us detect situations where we have I/O during secure erase, for example.
We do have some sensors for RDMA server, but we want to have additional sensors for the level above RDMA.

YDB client logger does not write fields from context

YDB client uses its own logger, which calls methods of underlying logger (code). It should write fields from context, but it does not. In particular, it does not write syslog identifier.

Proposed solution:

Make possible to add arbitrariy logging fields to context.
In the YDB client logger, add all necessary fields to the context and then call functions from the interface.

Implement support for unix domain sockets in NFS

This will allow compute node (or any other local agent) to issue endpoint-related commands without using tcp endpoint attached to network card and without passing secure token (of course socket should have corresponding permissions).

Note that only nfs-vhost needs this functionality. For nfs-server at svms it seems to be useless.

Potential race between StopEndpoint and RefreshEndpoint

The main problem is that you cannot safely use EndpointManager without EndpointService, cause all checks for processing sockets are implemented in EndpointService.
So first of all we need to move all these checks to EndpointManager. EndpointService should handle endpoint storage requests (KickEndpoint, RestoreEndpoints) and don't allow change endpoints while they are restoring.
Next step is to add RefreshEndpointRequest to these EndpointManager checks.

[NBS] NonreplicatedPartitionRdmaActor implementation lacks volume mode and checkpoint checks

When processing read/write request the NonreplicatedPartitionActor checks for volume mode (is read only/ has broken device) and whether checkpoint support is needed.

nbs/cloud/blockstore/libs/storage/partition_nonrepl/part_nonrepl_actor.cpp

Line 140 in f0bfcb0

if (PartConfig->IsReadOnly()) {

This functionality is missing from NonreplicatedPartitionRdmaActor

[NBS, Filestore] utils shard metrics for elapsedmicroseconds by activity does not contain blockstore activities

Seems like the way of setting activity type in actors has changed

[NBS] Test issuse

Test description
Изменение в описании

[NBS] NRD request errors continue to trigger alerts.

The disk sees that some of the devices are broken: IOMode is ReadOnly and MuteIOErrors is enabled.
But despite this, ReadBlocksLocal requests were not masked with a Silent flag.
At the same time, according to logs, all WriteBlocksLocal requests are masked.

[NBS] Mirrored disks: hedged read requests

Usually we can serve read requests via any replica. Right now we select the replica in a round robin manner. Sometimes one of the replicas becomes slower then the other ones for short periods of time. In this case we can send a hedged request to another replica and use the response that arrives faster.

Small pseudopaper regarding the calculation of the soft timeout after which a hedged request may be sent: avtopodbor-soft-timeout.pdf

[NBS] Check the performance of RDMA

Just need to run tests on disks with RDMA enabled and compare their metrics with the one obtained from disks without it.

Then enable nrd over libs/rdma on clusters

[NBS] Errors in rdma metrics

At this moment, the activeWrite counter from the rdma server is always negative because of WriteResponseData does not call Counters->WriteResponseStarted();

[NBS] Implement blob compression

The following sensors {project="nbs", cluster="mycluster", service="service", host="cluster", sensor="*ompressedBytesWritten", type="ssd"} observed on our clusters show that the data that our users store can be compressed very well (x2.5 - x3 with lz4 codec). We can save a lot of space if we store compressed blobs in blobstorage. But a naive implementation - simply compressing whole blobs - will not work - if we compress a 4MiB blob and store the compressed form we will need to read and decompress the whole blob when the user decides to read a single 4KiB block from it. That's why we will need to do something like this:

Compaction should split each 4MiB blob into 103 40KiB chunks
we should try to compress each chunk - if the compression ratio is better than, say, x3, this chunk will be stored in a compressed form, otherwise - non-compressed
we should store chunk offsets in TBlobMeta to be able to find and read only the chunks that are required to process the request (upon receiving a read request)

Chunk size and min compression ratio should be configurable via TStorageServiceConfig. In the future more complex logic may be implemented: e.g. we can track read request sizes and dynamically change chunk size and min compression ratio.

[NBS] Fast data path for NRD

Currently data path for NRD volume involves IO coming from guest through vhost user interface, going through actor system to the nrd partition actor and redirected to disk agent through interconnect or RDMA. This data path (slow data path) has some performance bottlenecks related to the actor system and usage of different logic layers involved in handling the IO. The advantage of this data path is that it supports migration, throttling and all the tools we inherit from using the actor system

We want to create simpler data path similar to what we have for local disks (fast data path). In this data path we will start external process (vhost-server) which will connect to vhost user inrerface and redirect io directly to disk agent through rdma without using the actor system. We expect better performance and latency for such solution.

To enable migration and other advanced features we will implement switching to the slow data path when we want to apply more advanced logic (for example switch to slow path, perform migration and switch back).

[NBS] Pass endpoint id to the server

We need to pass client endpoint id to the server during initial connection and use it inside TServerSession

On a side note: TServerSession is probably not a very good name

[Filestore] Compaction/Cleanup/FlushBytes optimization

Implement lazy compaction map loading DONE
Implement garbage-based compaction DONE
Use blob patching in compaction LATER
Think about some ways to optimize Cleanup and FlushBytes and the FreshBytes layer #1129
Global triggers for Compaction and Cleanup - triggers for total deletion marker count per fs, total blob count per fs, etc. DONE
Add metrics for background ops, display more background ops info on the tablet monpage DONE

[NBS] blockstore-server crashed in TExternalVhostEndpointListener

Core was generated by `/usr/bin/blockstore-server --domain pre-prod_vla --ic-port 29010 --mon-port 876'.
#0 TIntrusivePtr<TStdString<std::__y1::basic_string<char, std::__y1::char_traits<char>, std::__y1::allocator<char> > >, TStringPtrOps<TStdString<std::__y1::basic_string<char, std::__y1::char_traits<char>, std::__y1::allocator<char> > > > >::Get (this=0x0) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/ptr.h:560
[Current thread is 11716 (LWP 3461804)]

Thread 11716 (LWP 3461804):
#0 TIntrusivePtr<TStdString<std::__y1::basic_string<char, std::__y1::char_traits<char>, std::__y1::allocator<char> > >, TStringPtrOps<TStdString<std::__y1::basic_string<char, std::__y1::char_traits<char>, std::__y1::allocator<char> > > > >::Get (this=0x0) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/ptr.h +560
#1 TPointerCommon<TIntrusivePtr<TStdString<std::__y1::basic_string<char, std::__y1::char_traits<char>, std::__y1::allocator<char> > >, TStringPtrOps<TStdString<std::__y1::basic_string<char, std::__y1::char_traits<char>, std::__y1::allocator<char> > > > >, TStdString<std::__y1::basic_string<char, std::__y1::char_traits<char>, std::__y1::allocator<char> > > >::AsT (this=0x0) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/ptr.h +132
#2 TPointerBase<TIntrusivePtr<TStdString<std::__y1::basic_string<char, std::__y1::char_traits<char>, std::__y1::allocator<char> > >, TStringPtrOps<TStdString<std::__y1::basic_string<char, std::__y1::char_traits<char>, std::__y1::allocator<char> > > > >, TStdString<std::__y1::basic_string<char, std::__y1::char_traits<char>, std::__y1::allocator<char> > > >::operator* (this=0x0) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/ptr.h +148
#3 TBasicString<char, std::__y1::char_traits<char> >::StdStr (this=0x0) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/string.h +218
#4 TBasicString<char, std::__y1::char_traits<char> >::ConstRef (this=0x0) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/string.h +240
#5 TBasicString<char, std::__y1::char_traits<char> >::data (this=0x0) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/string.h +303
#6 TStringBase<TBasicString<char, std::__y1::char_traits<char> >, char, std::__y1::char_traits<char> >::Ptr (this=0x0) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/strbase.h +536
#7 TStringBase<TBasicString<char, std::__y1::char_traits<char> >, char, std::__y1::char_traits<char> >::data (this=0x0) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/strbase.h +128
#8 TBasicStringBuf<char, std::__y1::char_traits<char> >::TBasicStringBuf<TBasicString<char, std::__y1::char_traits<char> >, std::__y1::char_traits<char> > (str=..., this=<optimized out>) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/strbuf.h +121
#9 THashTable<std::__y1::pair<TBasicString<char, std::__y1::char_traits<char> > const, std::__y1::shared_ptr<NCloud::NBlockStore::NServer::IExternalEndpoint> >, TBasicString<char, std::__y1::char_traits<char> >, THash<TBasicString<char, std::__y1::char_traits<char> > >, TSelect1st, TEqualTo<TBasicString<char, std::__y1::char_traits<char> > >, std::__y1::allocator<TBasicString<char, std::__y1::char_traits<char> > > >::bkt_num_key<TBasicString<char, std::__y1::char_traits<char> > >(TBasicString<char, std::__y1::char_traits<char> > const&, NPrivate::TReciprocalDivisor<unsigned int, unsigned long, NPrivate::TMulUnsignedUpper<unsigned long, unsigned __int128, 64ul> >) const (this=0x1702ff8057a0, key=..., n=...) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/hash_table.h +923
#10 THashTable<std::__y1::pair<TBasicString<char, std::__y1::char_traits<char> > const, std::__y1::shared_ptr<NCloud::NBlockStore::NServer::IExternalEndpoint> >, TBasicString<char, std::__y1::char_traits<char> >, THash<TBasicString<char, std::__y1::char_traits<char> > >, TSelect1st, TEqualTo<TBasicString<char, std::__y1::char_traits<char> > >, std::__y1::allocator<TBasicString<char, std::__y1::char_traits<char> > > >::bkt_num_key<TBasicString<char, std::__y1::char_traits<char> > > (this=0x1702ff8057a0, key=...) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/hash_table.h +913
#11 THashTable<std::__y1::pair<TBasicString<char, std::__y1::char_traits<char> > const, std::__y1::shared_ptr<NCloud::NBlockStore::NServer::IExternalEndpoint> >, TBasicString<char, std::__y1::char_traits<char> >, THash<TBasicString<char, std::__y1::char_traits<char> > >, TSelect1st, TEqualTo<TBasicString<char, std::__y1::char_traits<char> > >, std::__y1::allocator<TBasicString<char, std::__y1::char_traits<char> > > >::find<TBasicString<char, std::__y1::char_traits<char> > > (this=0x1702ff8057a0, key=...) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/hash_table.h +767
#12 THashMap<TBasicString<char, std::__y1::char_traits<char> >, std::__y1::shared_ptr<NCloud::NBlockStore::NServer::IExternalEndpoint>, THash<TBasicString<char, std::__y1::char_traits<char> > >, TEqualTo<TBasicString<char, std::__y1::char_traits<char> > >, std::__y1::allocator<TBasicString<char, std::__y1::char_traits<char> > > >::find<TBasicString<char, std::__y1::char_traits<char> > > (this=0x1702ff8057a0, key=...) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/hash.h +213
#13 NCloud::NBlockStore::NServer::(anonymous namespace)::TExternalVhostEndpointListener::SwitchEndpoint (this=0x1702ff8056d0, request=..., volume=..., session=...) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/cloud/blockstore/libs/endpoints_vhost/external_vhost_server.cpp +675
#14 NCloud::NBlockStore::NServer::(anonymous namespace)::TEndpointManager::TrySwitchEndpoint (this=0x1702ff3c8620, diskId=...) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/cloud/blockstore/libs/endpoints/endpoint_manager.cpp +660
#15 NCloud::NBlockStore::NServer::(anonymous namespace)::TEndpointManager::OnVolumeConnectionEstablished(TBasicString<char, std::__y1::char_traits<char> > const&)::$_5::operator()() const (this=<optimized out>) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/cloud/blockstore/libs/endpoints/endpoint_manager.cpp +676
#16 NCloud::TSimpleTask<NCloud::NBlockStore::NServer::(anonymous namespace)::TEndpointManager::OnVolumeConnectionEstablished(TBasicString<char, std::__y1::char_traits<char> > const&)::$_5>::Execute() (this=<optimized out>) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/cloud/storage/core/libs/common/task_queue.h +61
#17 NCloud::(anonymous namespace)::TWorker::Execute (this=0x1703245722c0, c=<optimized out>) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/cloud/storage/core/libs/coroutine/executor.cpp +43
#18 ContHelperMemberFunc<NCloud::(anonymous namespace)::TWorker, &NCloud::(anonymous namespace)::TWorker::Execute> (c=<optimized out>, arg=0x0) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/library/cpp/coroutine/engine/impl.h +151
#19 TContExecutor::Create(void (*)(TCont*, void*), void*, char const*, TMaybe<unsigned int, NMaybe::TPolicyUndefinedExcept>)::$_3::operator()(TCont*) const (cont=0x1702ff8057a0, this=<optimized out>) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/library/cpp/coroutine/engine/impl.cpp +249
#20 std::__y1::__invoke<TContExecutor::Create(void (*)(TCont*, void*), void*, char const*, TMaybe<unsigned int, NMaybe::TPolicyUndefinedExcept>)::$_3&, TCont*>(TContExecutor::Create(void (*)(TCont*, void*), void*, char const*, TMaybe<unsigned int, NMaybe::TPolicyUndefinedExcept>)::$_3&, TCont*&&) (__f=..., __args=<optimized out>) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/contrib/libs/cxxsupp/libcxx/include/type_traits +3663
#21 std::__y1::__invoke_void_return_wrapper<void, true>::__call<TContExecutor::Create(void (*)(TCont*, void*), void*, char const*, TMaybe<unsigned int, NMaybe::TPolicyUndefinedExcept>)::$_3&, TCont*>(TContExecutor::Create(void (*)(TCont*, void*), void*, char const*, TMaybe<unsigned int, NMaybe::TPolicyUndefinedExcept>)::$_3&, TCont*&&) (__args=<optimized out>, __args=<optimized out>) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/contrib/libs/cxxsupp/libcxx/include/__functional/invoke.h +61
#22 std::__y1::__function::__alloc_func<TContExecutor::Create(void (*)(TCont*, void*), void*, char const*, TMaybe<unsigned int, NMaybe::TPolicyUndefinedExcept>)::$_3, std::__y1::allocator<TContExecutor::Create(void (*)(TCont*, void*), void*, char const*, TMaybe<unsigned int, NMaybe::TPolicyUndefinedExcept>)::$_3>, void (TCont*)>::operator()(TCont*&&) (this=0x76c487d7bd98f408, __arg=<optimized out>) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/contrib/libs/cxxsupp/libcxx/include/__functional/function.h +181
#23 std::__y1::__function::__func<TContExecutor::Create(void (*)(TCont*, void*), void*, char const*, TMaybe<unsigned int, NMaybe::TPolicyUndefinedExcept>)::$_3, std::__y1::allocator<TContExecutor::Create(void (*)(TCont*, void*), void*, char const*, TMaybe<unsigned int, NMaybe::TPolicyUndefinedExcept>)::$_3>, void (TCont*)>::operator()(TCont*&&) (this=0x76c487d7bd98f400, __arg=<optimized out>) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/contrib/libs/cxxsupp/libcxx/include/__functional/function.h +355
#24 std::__y1::__function::__value_func<void (TCont*)>::operator()(TCont*&&) const (this=0x1702f13fe9d0, [__args=@0x1702d20b8f58](mailto:__args=@0x1702d20b8f58): 0x1702f13fe910) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/contrib/libs/cxxsupp/libcxx/include/__functional/function.h +508
#25 std::__y1::function<void (TCont*)>::operator()(TCont*) const (this=0x1702f13fe9d0, __arg=0x1702f13fe910) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/contrib/libs/cxxsupp/libcxx/include/__functional/function.h +1192
#26 NCoro::TTrampoline::DoRun (this=0x1702f13fe930) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/library/cpp/coroutine/engine/trampoline.cpp +30
#27 NCoro::TTrampoline::DoRunNaked (this=0x1702ff8057a0) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/library/cpp/coroutine/engine/trampoline.cpp +46
#28 Run (arg=0x1702ff8057a0) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/system/context.cpp +47
#29 ContextTrampoLine (t1=0x1702f13fe930, t2=<optimized out>) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/system/context.cpp +124
#30 ?? ()

[Misc] Логирование в сервисе

Нужно настроить логирование во всех сервисах, таких как NBS, NFS, Disk Manager, Snapshot.

Основная задача состоит в том, что бы унифицировать сохранение логов и поиск по ним.

Идеальным выходом этого эпика будет запрос в систему хранения логов, например по id диска и уровню логирования, а в ответ все логи по этому диску с DM, nbs-control, nbs, snapshot.

[NBS] Полноценные снимки с NRD

Общая идея такая:

Добавляем еще одну реплику к существующему диску, так если это был mirror3 он станет mirror4.
Существующей машинерией наливаем эту новую реплику
Когда реплика будет налита, отключаем ее, переводим в состояние readonly
ассоццируем созданную рекплику с чекпоинтом
говорим что создали чекпоинт
все чтения из чекпоинта перенаправляем в реплику
при удаления данных чекпоинта или самого чекпоинта удаляем эту дополнительную реплику
если в процессе наливки у нас ломается девайс под репликой - мы заменяем его на новый и продолжает наливать реплику, тут должна работать уже существующая машинерия.
если после наливки реплики у нее ломается девайс, то мы отдаем ошибку чтения. Диск менджер в таком случае должен удалить чекпоинт и начать все с начала.

[NBS] Blob Patching-based partition tablet prototype w/o Compaction

We can build a partition tablet version which heavily relies on blob patching. It can simply store an array of 4MiB blobs corresponding to current 4MiB compaction ranges - i.e. disk content will be represented by an array of non-overlapping 4MiB blobs. We will store the array of blob ids in memory so the reads will be trivial - just find those blob ids that intersect with the read range and read the corresponding subranges of those blobs. Writes, unlike our current implementation, will not be blind: upon each write we will send an EvPatch request for the blobs that intersect with the write range, update our blob id array (both in memory and in local db), then mark the old blobs with DontKeep flags.

Mass deployment of this partition tablet implementation will require proper in-place patching at VDisks, which is a difficult task, so it requires a "proof-of-concept" implementation which will show that this kind of an optimization at the VDisk level is really useful. A single instance of such a tablet deployed on a cluster can show whether we can significantly simplify and optimize partition tablet implementation using patching. If such an implementation turns out to work well, we will think again about the in-place patching optimization at the VDisk level.

Detect zero chunks in snapshot mechanism for network-ssd-{nonreplicated,io-m2,io-m3}

Issue Definition: For nonreplicated disks, NBS does not mark chunks full of zeroes as zero-chunks. This causes the snapshot service to save those chunks as is, which leads to additional ~100KB per 4MB chunk and significantly increases snapshot operations to and from nonreplicated disks.
Proposed solution:
Compare chunks with zeroes if the chunk is not a zero chunk. The performance overhead of this check if insignificant compared to the necessity of each 4MB chunks lz4 compression and additional storage consumption.

Add tests for CSI driver

https://kubernetes-csi.github.io/docs/testing-drivers.html

Crash in TServiceActor::RenderDownDisks

Thread 1 (LWP 117000):
#0 raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 bt_terminate_handler () at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/contrib/libs/cxxsupp/libcxxrt/exception.cc +333
#3 std::terminate () at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/contrib/libs/cxxsupp/libcxxrt/exception.cc +1595
#4 report_failure (err=<optimized out>, [thrown_exception=thrown_exception@entry](mailto:thrown_exception=thrown_exception@entry)=0x1450b4f3d810) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/contrib/libs/cxxsupp/libcxxrt/exception.cc +780
#5 throw_exception ([ex=ex@entry](mailto:ex=ex@entry)=0x1450b4f3d810) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/contrib/libs/cxxsupp/libcxxrt/exception.cc +837
#6 __cxa_throw (thrown_exception=0x1450b4f3d890, tinfo=0x5613949dedd8 <typeinfo for yexception>, dest=0x561383170c30 <NPrivateException::yexception::~yexception()>) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/contrib/libs/cxxsupp/libcxxrt/exception.cc +868
#7 NMaybe::TPolicyUndefinedExcept::OnEmpty (valueTypeInfo=...) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/maybe.cpp +5
#8 TMaybe<NCloud::NBlockStore::NProto::TVolume, NMaybe::TPolicyUndefinedExcept>::CheckDefined (this=0x1450bd928828) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/maybe.h +305
#9 TMaybe<NCloud::NBlockStore::NProto::TVolume, NMaybe::TPolicyUndefinedExcept>::GetRef() const & (this=0x1450bd928828) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/maybe.h +318
#10 TMaybe<NCloud::NBlockStore::NProto::TVolume, NMaybe::TPolicyUndefinedExcept>::operator-> (this=0x1450bd928828) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/generic/maybe.h +358
#11 NCloud::NBlockStore::NStorage::TServiceActor::RenderDownDisks(IOutputStream&) const::$_0::operator()(NCloud::NBlockStore::NStorage::TVolumeInfo const&) const (volume=..., this=<optimized out>) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/cloud/blockstore/libs/storage/service/service_actor_monitoring.cpp +186
#12 NCloud::NBlockStore::NStorage::TServiceActor::RenderDownDisks ([this=this@entry](mailto:this=this@entry)=0x1450bf001260, out=...) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/cloud/blockstore/libs/storage/service/service_actor_monitoring.cpp +214
#13 NCloud::NBlockStore::NStorage::TServiceActor::RenderHtmlInfo ([this=this@entry](mailto:this=this@entry)=0x1450bf001260, out=...) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/cloud/blockstore/libs/storage/service/service_actor_monitoring.cpp +154
#14 NCloud::NBlockStore::NStorage::TServiceActor::HandleHttpInfo (this=0x1c8f4, ev=..., ctx=...) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/cloud/blockstore/libs/storage/service/service_actor_monitoring.cpp +142
#15 NCloud::NBlockStore::NStorage::TServiceActor::StateWork (this=0x1450bf001260, ev=...) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/cloud/blockstore/libs/storage/service/service_actor.cpp +258
#16 NActors::TActorCallbackBehaviour::Receive (this=<optimized out>, actor=<optimized out>, ev=...) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/library/cpp/actors/core/actor.cpp +149
#17 NActors::IActor::Receive ([this=this@entry](mailto:this=this@entry)=0x1450bf001260, ev=...) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/library/cpp/actors/core/actor.h +496
#18 NActors::TExecutorThread::Execute<NActors::TMailboxTable::TRevolvingMailbox, false> ([this=this@entry](mailto:this=this@entry)=0x1450bea45e80, [mailbox=mailbox@entry](mailto:mailbox=mailbox@entry)=0x426e3f57fe40, [hint=hint@entry](mailto:hint=hint@entry)=10237) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/library/cpp/actors/core/executor_thread.cpp +196
#19 NActors::TExecutorThread::ThreadProc()::$_0::operator()<false>(auto, unsigned int) const (this=<optimized out>, activation=<optimized out>) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/library/cpp/actors/core/executor_thread.cpp +375
#20 NActors::TExecutorThread::ThreadProc (this=0x1c8f4) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/library/cpp/actors/core/executor_thread.cpp +425
#21 (anonymous namespace)::ThreadProcWrapper<ISimpleThread> (param=0x1c8f4) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/system/thread.cpp +383
#22 (anonymous namespace)::TPosixThread::ThreadProxy (arg=0x1450b99e9c70) at /opt/buildagent/work/4ec98910e7de170b/__FUSE/mount_path/util/system/thread.cpp +244
#23 start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#24 clone () from /lib/x86_64-linux-gnu/libc.so.6

ydb-platform / nbs Goto Github PK

nbs's Introduction

Network Block Store

Quickstart

Documentation

How to Deploy

nbs's People

Stargazers

Watchers

Forkers

nbs's Issues

0. Binary names

1. Local binaries

2. can't open "nbs/nbs-log.txt" with mode RdOnly|Seq

3. can't open "nbs/nbs-log.txt" with mode RdOnly|Seq

4. requirement systemConfig.HasScheduler() failed

Recommend Projects

Recommend Topics

Recommend Org