khonsulabs / bonsaidb
A developer-friendly document database that grows with you, written in Rust
Home Page: https://bonsaidb.io/
License: Apache License 2.0
With the initial implementation of the server (#19), a choice was made to delay this work.
The problems this setting is aiming to solve:
The logic implemented should keep track of the last time a database was accessed, and when a database needs to be unloaded, the oldest database should be evicted first.
One question that needs to be answered: under high load, if a server is configured to have only, say, 20 databases open, should we allow temporary bursting when a queue of incoming requests exceeds that limit? Or should requests block until the open-database count drops below the limit? Maybe two settings -- a target and a maximum?
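A minimal sketch of the eviction bookkeeping described above, assuming a hypothetical registry of open databases and ignoring the bursting question:

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Hypothetical registry tracking when each open database was last used.
struct OpenDatabases {
    last_accessed: HashMap<String, Instant>,
    target: usize,
}

impl OpenDatabases {
    /// Records an access, then evicts the least-recently-used database
    /// if the count exceeds the configured target.
    fn touch(&mut self, name: &str) {
        self.last_accessed.insert(name.to_string(), Instant::now());
        if self.last_accessed.len() > self.target {
            let oldest = self
                .last_accessed
                .iter()
                .min_by_key(|(_, last_used)| **last_used)
                .map(|(name, _)| name.clone());
            if let Some(oldest) = oldest {
                self.last_accessed.remove(&oldest);
                // ... actually unload/close the evicted database here ...
            }
        }
    }
}
```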
In discussing some of the PubSub details yesterday, I reminded myself that the Client is "dumb" with respect to which topics its existing subscribers are subscribed to. The point of the conversation was pointing out how the pubsub API gives an Arc<Message>, and I mistakenly thought I had implemented a cool optimization in the client: if two subscribers subscribe to a single topic, the server would only send the message once.
This isn't true, and at the time, it seemed like just an optimization that could be done. However, in working on the reconnecting logic for #61, I realized that the reconnecting logic for all clients retained the same SubscriberMap. The effect is that if a client disconnect occurs, existing subscribers will never receive an error, nor will they receive any messages once a reconnect happens.
This could be fixed by implementing the optimization mentioned in the first paragraph. The client would keep track of all topics for all local subscribers. Upon connecting/reconnecting, the client would create a single remote subscriber and subscribe to all of those topics. From the subscriber's perspective, the disconnect would be transparent.
While that sounds amazing for many use cases, it also prevents a subscriber from ever knowing about a disconnect. Another approach would be to clear the subscriber map upon reconnect, thus forcibly disconnecting existing subscribers. The pubsub loops would just need an outer loop to manage recreating the subscriber upon an error.
I don't think these approaches are mutually exclusive, but it might be reasonable to only implement one of these approaches to solve this bug.
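A sketch of the first approach -- one remote subscription multiplexed out to many local subscribers. The types here (the map shape, the channel choices) are assumptions for illustration, not the actual client internals:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tokio::sync::mpsc;

/// Hypothetical message type, standing in for the real pubsub Message.
struct Message {
    topic: String,
    payload: Vec<u8>,
}

/// Topic -> local subscriber senders, shared between the connection task
/// and the subscriber handles.
type SubscriberMap = Arc<Mutex<HashMap<String, Vec<mpsc::UnboundedSender<Arc<Message>>>>>>;

/// Called by the connection task for every message arriving on the single
/// remote subscription; fans the Arc out to each local subscriber.
fn dispatch(map: &SubscriberMap, message: Arc<Message>) {
    let mut map = map.lock().unwrap();
    if let Some(senders) = map.get_mut(&message.topic) {
        // Drop senders whose receiving subscriber has gone away.
        senders.retain(|sender| sender.send(message.clone()).is_ok());
    }
}

/// On reconnect, the client re-subscribes to every topic in the map,
/// making the disconnect transparent to local subscribers.
fn topics_to_resubscribe(map: &SubscriberMap) -> Vec<String> {
    map.lock().unwrap().keys().cloned().collect()
}
```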
The process and structure of websocket_worker.rs and worker.rs are very similar. It should be possible to abstract most of the logic away and keep the transport-specific code minimized to small chunks of glue code.
Should create a folder structure à la:
export-path\collection-name\document-id.cbor
The ability to cross-convert to JSON or other output formats would be nice too.
This tool should be written in such a way that it can operate without the schema -- for now, requiring only a list of collection IDs.
Code coverage started failing recently, and it appears to be due to an ICE that dates back a while but only recently started cropping up, most likely because rustfmt was broken on nightly for a while. I tried to narrow it down, but I can't seem to get the ICE to occur outside of the project.
The earliest nightly that installs all of the default components and doesn't cause an ICE is 2021-03-25.
The commit to revert is: cd668bf
Updated after #44.
We now have properly namespaced strings in use everywhere. One issue with using strings for collection IDs is that they're sent across the network. At most, a collection ID should only need 4 bytes to represent, if we had a reliable, collision-free way to convert these strings to u32s. Originally, the idea was to use a hash. However, there's a more correct way to do it that ensures there will be no collisions.
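One collision-free scheme -- and this is my assumption of the intended approach, not something spelled out in the issue -- is to hand out sequential u32s from a registry persisted alongside the database:

```rust
use std::collections::HashMap;

/// Hypothetical registry mapping collection names to compact wire IDs.
/// Persisting this map with the database keeps the assignment stable.
#[derive(Default)]
struct CollectionIdRegistry {
    by_name: HashMap<String, u32>,
    next_id: u32,
}

impl CollectionIdRegistry {
    /// Returns the existing ID for `name`, or assigns the next sequential
    /// one. Sequential assignment cannot collide, unlike hashing.
    fn id_for(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.by_name.get(name) {
            return id;
        }
        let id = self.next_id;
        self.next_id += 1;
        self.by_name.insert(name.to_string(), id);
        id
    }
}
```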
Currently, push() and update() are implemented by creating single-entry transactions and issuing apply_transaction() with the created transaction.
We should have a way to build a transaction with more than one entry.
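A sketch of what a multi-entry builder might look like; the Operation and builder shapes here are assumptions, not the crate's actual types:

```rust
/// Hypothetical operation within a transaction.
enum Operation {
    Push { collection: String, contents: Vec<u8> },
    Update { collection: String, id: u64, contents: Vec<u8> },
}

/// Hypothetical builder collecting multiple operations before applying
/// them atomically via apply_transaction().
#[derive(Default)]
struct TransactionBuilder {
    operations: Vec<Operation>,
}

impl TransactionBuilder {
    fn push(mut self, collection: &str, contents: Vec<u8>) -> Self {
        self.operations.push(Operation::Push {
            collection: collection.to_string(),
            contents,
        });
        self
    }

    fn update(mut self, collection: &str, id: u64, contents: Vec<u8>) -> Self {
        self.operations.push(Operation::Update {
            collection: collection.to_string(),
            id,
            contents,
        });
        self
    }
}
```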
While we're not working on replication yet, we should add the necessary methods to read the transaction logs so that we can unit test the transactions.
While many developers have bash available to them, there's no reason for this to require bash. We should install a Rust executable as the pre-commit hook that runs the same commands but doesn't require bash. This will enable the pre-commit hook to work on Windows.
It appears the xtask repo also has some tools for this exact task.
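A minimal sketch of such an executable; the specific checks (fmt, clippy) are assumptions based on a typical Rust pre-commit, not necessarily the project's actual script:

```rust
use std::process::{exit, Command};

/// Runs a command and fails the hook if it fails. Substitute the
/// project's real checks for the placeholders in main().
fn run(program: &str, args: &[&str]) {
    let status = Command::new(program)
        .args(args)
        .status()
        .expect("failed to launch command");
    if !status.success() {
        exit(status.code().unwrap_or(1));
    }
}

fn main() {
    run("cargo", &["fmt", "--", "--check"]);
    run("cargo", &["clippy", "--", "-D", "warnings"]);
}
```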
The view query API needs to:
For Fabruic, we've opted for connect_with to resolve a hostname using the CloudFlare resolver with very secure settings. This is perfect for a game client, but not great for hosting a database server on a private network without hostnames or with private DNS.
Right now, PliantDB doesn't use connect_with and instead resolves the hostname using Tokio's ToSocketAddrs and uses connect with the resolved address. This means PliantDB currently works great for hosting a database, but for secure, trustworthy DNS resolution, we don't have a solution.
I see a couple of options:
Given the goals of fabruic, I'm leaning towards solving it completely in PliantDB with something like option 2.
We need to have a way to allow collections to upgrade themselves between versions.
- Add a CollectionSchema trait with a version() function, like View; also add a function like async fn upgrade_from_previous_version<C: Connection>(connection: &C, stored_version: u64). Maybe blocked by #113.
- Call upgrade_from_previous_version before allowing any operations on that collection.
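A sketch of the proposed trait, using the names from the description above; the Connection stand-in, error type, and no-op default are my assumptions:

```rust
/// Hypothetical stand-in for the crate's Connection trait.
trait Connection {}

/// Proposed trait: collections report a schema version and get a chance
/// to migrate stored data before any operations are allowed on them.
trait CollectionSchema {
    /// The current version of this collection's schema.
    fn version() -> u64;

    /// Invoked when the stored version is older than version(). The
    /// default is a no-op for collections that have never changed.
    async fn upgrade_from_previous_version<C: Connection>(
        _connection: &C,
        _stored_version: u64,
    ) -> Result<(), Box<dyn std::error::Error>> {
        Ok(())
    }
}
```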
Fabruic now supports using a certificate store to authenticate the QUIC connection instead of only using pinned certificates, and once #40 is updated, we can use the same TLS certificate for HTTP as we can for QUIC.
To make deploying easier, having built-in functionality to generate certificates using ACME would be incredibly useful.
Right now, BonsaiDb uses UDP Port 5645, unless otherwise specified, and is not registered with IANA. I've taken initial steps to attempt to register a port with IANA, but the reality is that this project is early in development. Because there have been deployments of BonsaiDb in non-experimental environments, we are using a currently-unassigned user port.
The only experimental UDP ports available are 1021 and 1022, both of which require superuser privileges to bind to on Linux.
This ticket is to serve as a reminder that there is no guarantee or expectation that the port used by default will be available at this time. Technically even registering a port doesn't give you that guarantee, but it at least gives us more of a right to use that port by default in deployments.
Initial implementation should reduce values into the MapEntry, which means that for unique key outputs, we keep the values minimized.
When querying, we re-reduce using all resulting values, using the cached entries from each of the nodes.
When a document is removed, we have to re-reduce the remaining values.
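A sketch of the caching shape this describes; MapEntry here is a simplification of whatever the real storage entry looks like:

```rust
/// Simplified view entry: one key, the emitted values, and the cached
/// reduction of those values.
struct MapEntry<K, V> {
    key: K,
    values: Vec<V>,
    reduced: V,
}

/// Recomputes the cached reduction for one entry -- needed both at
/// insert time and when a document removal shrinks `values`.
fn rereduce_entry<K, V>(entry: &mut MapEntry<K, V>, reduce: impl Fn(&[V]) -> V) {
    entry.reduced = reduce(&entry.values);
}

/// Query-time re-reduce across entries, using each entry's cached value
/// instead of its raw values.
fn reduce_query<K, V: Clone>(
    entries: &[MapEntry<K, V>],
    reduce: impl Fn(&[V]) -> V,
) -> V {
    let cached: Vec<V> = entries.iter().map(|entry| entry.reduced.clone()).collect();
    reduce(&cached)
}
```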
For both websocket and fabruic connections, we should have a shutdown handle that can be "selected" with each of the payload receivers, so that when a shutdown is requested, any existing requests for connections are handled.
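A sketch of that select loop using a Tokio watch channel as the shutdown handle; the payload receiver type is an assumption:

```rust
use tokio::sync::{mpsc, watch};

/// Per-connection loop: handles payloads until either the peer goes away
/// or a shutdown is requested, whichever comes first.
async fn connection_loop(
    mut payloads: mpsc::Receiver<Vec<u8>>,
    mut shutdown: watch::Receiver<bool>,
) {
    loop {
        tokio::select! {
            payload = payloads.recv() => match payload {
                Some(payload) => {
                    // ... process the request payload ...
                    let _ = payload;
                }
                None => break, // peer disconnected
            },
            _ = shutdown.changed() => {
                // Shutdown requested: finish up and close gracefully.
                break;
            }
        }
    }
}
```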
Right now the Query API forces using Range, but ideally it would accept any RangeBounds. This means the API needs to change, however.
I did some initial searching for a quick solution but didn't find any other range types that supported serde out of the box. It's not a tough problem, just something that seems low-priority for now.
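A sketch of a serializable range type of the sort described -- a custom pair of bounds that implements std::ops::RangeBounds (assuming serde derive is available):

```rust
use serde::{Deserialize, Serialize};

/// A serializable bound, mirroring std::ops::Bound.
#[derive(Serialize, Deserialize, Clone, Debug)]
pub enum Bound<T> {
    Included(T),
    Excluded(T),
    Unbounded,
}

/// A serializable range usable anywhere the query API needs one.
#[derive(Serialize, Deserialize, Clone, Debug)]
pub struct Range<T> {
    pub start: Bound<T>,
    pub end: Bound<T>,
}

impl<T> std::ops::RangeBounds<T> for Range<T> {
    fn start_bound(&self) -> std::ops::Bound<&T> {
        match &self.start {
            Bound::Included(value) => std::ops::Bound::Included(value),
            Bound::Excluded(value) => std::ops::Bound::Excluded(value),
            Bound::Unbounded => std::ops::Bound::Unbounded,
        }
    }

    fn end_bound(&self) -> std::ops::Bound<&T> {
        match &self.end {
            Bound::Included(value) => std::ops::Bound::Included(value),
            Bound::Excluded(value) => std::ops::Bound::Excluded(value),
            Bound::Unbounded => std::ops::Bound::Unbounded,
        }
    }
}
```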
We shouldn't ever use bincode::deserialize directly. The preferred method is to use the Options trait. The DefaultOptions struct documents what options are set up by default, and the important part for us is the byte limit.
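What that looks like in practice with bincode 1.x; the 16 KiB limit is an arbitrary placeholder:

```rust
use bincode::Options;

/// Deserializes untrusted bytes with a size limit, rather than calling
/// bincode::deserialize (which imposes no limit). Note that
/// DefaultOptions uses varint encoding, unlike the top-level functions.
fn deserialize_limited<'de, T: serde::Deserialize<'de>>(
    bytes: &'de [u8],
) -> bincode::Result<T> {
    bincode::DefaultOptions::new()
        .with_limit(16 * 1024) // placeholder limit; tune per payload type
        .deserialize(bytes)
}
```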
Technically, since PliantDb relies on Fabruic, to truly prevent this, the corresponding Fabruic issue should also be fixed before closing this one out fully.
For cbor, the situation is more complicated. Here's a devlog describing my experiment of writing my own serialization format. As of this edit (July 13), my thoughts are now that we should:
- Keep pbor. And... rename it for goodness sake. (done, now named pot)
- Use pot alongside bincode internally.
- Keep CBOR for export only; CBOR isn't an attack vector when it's export-only.

To implement the view system properly, we need to have a background job service. For now, the service can be simple: the ability to launch a job, or to check whether a job with a given ID is running or queued and, if so, wait for it to complete.
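A sketch of such a service built on Tokio; the job-ID type and in-memory registry are assumptions:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tokio::sync::watch;

/// Minimal job registry: launch a job, look it up by ID, await completion.
#[derive(Clone, Default)]
struct Jobs {
    running: Arc<Mutex<HashMap<u64, watch::Receiver<bool>>>>,
}

impl Jobs {
    /// Launches `work` under `id` unless a job with that ID already exists.
    fn launch<F>(&self, id: u64, work: F)
    where
        F: std::future::Future<Output = ()> + Send + 'static,
    {
        let mut running = self.running.lock().unwrap();
        if running.contains_key(&id) {
            return; // already running or queued
        }
        let (tx, rx) = watch::channel(false);
        running.insert(id, rx);
        let registry = self.clone();
        tokio::spawn(async move {
            work.await;
            let _ = tx.send(true);
            registry.running.lock().unwrap().remove(&id);
        });
    }

    /// Waits for the job with `id` to complete, returning immediately if
    /// no such job is running.
    async fn wait_for(&self, id: u64) {
        let rx = self.running.lock().unwrap().get(&id).cloned();
        if let Some(mut rx) = rx {
            while !*rx.borrow() {
                // changed() errors once the sender is dropped, which also
                // means the job has finished.
                if rx.changed().await.is_err() {
                    break;
                }
            }
        }
    }
}
```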
After #43, we should have a location to store client-specific information on the server. Once we do, we should track which subscriber ids each client created, and return errors if a client tries to unsubscribe or subscribe on an ID that doesn't belong to the client.
Right now, we're not using the QUIC protocol exchange at all. We should have some logic for major protocol version negotiation.
Originally, I thought it would be a long time before a book would be useful, preferring to put more documentation into the docs themselves. However, the more I think about it, a step-by-step guide to adopting, using, and administrating PliantDB would be incredibly useful for adoption.
I think the general goal should be for the book to be focused on practical use cases and "guide"-style documentation. The docs in the code should be focused on the functionality of the code, and reference the book when sections are available that are helpful.
An idea for the book would be to build an app from start to finish with sections highlighting the migration from a single-user executable all the way to a fully deployed cluster. Perhaps using Kludgine as the UI to tie all the projects together.
The initial implementation in #28 was a quick-and-dirty way to get a secondary transport to help test what bugs were in PliantDB vs the new QUIC transport layer.
Ideally, we would support a layer of routing to eventually support REST APIs and more. This would move the WebSocket endpoint to a URL.
I'm uncertain whether warp is the best choice for this or not. It's what I'm currently most familiar with, and it seems like it would support the composability I hope to offer someday.
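For illustration, mounting the WebSocket endpoint at a URL with warp might look like this; the /ws path, port, and handler body are placeholders:

```rust
use futures::StreamExt;
use warp::Filter;

#[tokio::main]
async fn main() {
    // WebSocket endpoint moved to a URL, leaving room for sibling routes
    // (REST APIs, static files) to be composed alongside it.
    let websocket = warp::path("ws").and(warp::ws()).map(|ws: warp::ws::Ws| {
        ws.on_upgrade(|mut socket| async move {
            // ... hand the socket to the existing websocket worker; here
            // we just drain messages until the client disconnects ...
            while let Some(_message) = socket.next().await {}
        })
    });

    warp::serve(websocket).run(([0, 0, 0, 0], 8080)).await;
}
```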
CouchDB View API for reference
Allow querying by:
In working on creating a more extensive demo, I found that server-side generated PubSub messages never reached the clients.
As I finished fixing it today, I realized that WebSocket clients and Pliant clients also wouldn't have heard each other. This is all fixed, but we should have unit tests covering these use cases.
The goal of this issue is to create a basic app platform that allows defining restful API endpoints, simple HTTP serving, a custom over-the-wire (WebSocket/PliantDb) API. A developer could use this platform to create an app that uses PliantDb to manage users and permissions, and optionally an HTTP layer with WebSocket support serving requests. The HTTP layer's main purposes are writing RESTful APIs and serving a single-page app that is powered by WASM + PliantDb over WebSockets.
At the local database level, this should be implemented as a lightweight, atomic-operation-focused key-value store looking to replicate a set of features from redis.
We can go above and beyond the default redis configuration and use Sled to allow the data set to be larger than what fits in memory: use a Sled tree to store each entry, but keep an in-memory cache of recently used keys. When keys are modified, track them and only write each changed entry when flushing to Sled.
The last piece of the puzzle is to enforce memory limits on the amount of data loaded in memory, evicting keys based on last usage.
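A sketch of the write-back cache over sled described above; eviction and memory accounting are elided:

```rust
use std::collections::{HashMap, HashSet};

/// Write-back key-value cache in front of a sled tree. Reads hit the
/// cache first; writes mark keys dirty and only reach sled on flush.
struct KeyValueStore {
    tree: sled::Tree,
    cache: HashMap<String, Vec<u8>>,
    dirty: HashSet<String>,
}

impl KeyValueStore {
    fn set(&mut self, key: &str, value: Vec<u8>) {
        self.cache.insert(key.to_string(), value);
        self.dirty.insert(key.to_string());
    }

    fn get(&mut self, key: &str) -> sled::Result<Option<Vec<u8>>> {
        if let Some(value) = self.cache.get(key) {
            return Ok(Some(value.clone()));
        }
        // Cache miss: fall back to sled and populate the cache.
        let value = self.tree.get(key)?.map(|ivec| ivec.to_vec());
        if let Some(value) = &value {
            self.cache.insert(key.to_string(), value.clone());
        }
        Ok(value)
    }

    /// Writes only the modified entries back to sled.
    fn flush(&mut self) -> sled::Result<()> {
        for key in self.dirty.drain() {
            if let Some(value) = self.cache.get(&key) {
                self.tree.insert(key.as_bytes(), value.clone())?;
            }
        }
        self.tree.flush()?;
        Ok(())
    }
}
```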
When exposed in this fashion, the API should fit into the existing DatabaseRequest/DatabaseResponse structures, which should make exposing it to the Client fairly straightforward, as these things go.
Update fabruic and use the new methods that take Uris when no pinned certificate is passed. Use the existing functionality otherwise.
Use case for this feature: storing a tags array in a document. A view to list all documents by tag would want to emit one entry per tag in the array.
Two approaches:
I slightly prefer the latter, but the former feels more functional in design.
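Whatever the API shape, the effect is the same; a sketch of the functional flavor, where map returns multiple mappings (the names here are illustrative, not the crate's API):

```rust
/// Illustrative document and mapping types.
struct Document {
    id: u64,
    tags: Vec<String>,
}

struct Mapping {
    source: u64,
    key: String,
}

/// One document yields one mapping per tag, so querying the view by a
/// tag key finds every document containing that tag.
fn map(document: &Document) -> Vec<Mapping> {
    document
        .tags
        .iter()
        .map(|tag| Mapping {
            source: document.id,
            key: tag.clone(),
        })
        .collect()
}
```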
This project looks interesting to me. How can I start contributing to the project? Where do I start?
Initial server requirements:
- A macro_rules macro to create a main entrypoint for you.

Right now the server and client do not perform any of the recommended WebSocket closing procedures. It doesn't really impact the protocol used in PliantDB, but my understanding from doing some research is that, if we want to support interacting with WebSockets from the browser, implementing graceful closing will prevent errors from popping up in browser consoles on expected disconnections.
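For reference, with tokio-tungstenite (assuming that's the WebSocket stack in play) the graceful close amounts to sending a Close frame and draining the stream; a minimal sketch:

```rust
use futures::{SinkExt, StreamExt};
use tokio::net::TcpStream;
use tokio_tungstenite::{tungstenite::Message, WebSocketStream};

/// Performs the closing handshake instead of just dropping the socket:
/// send a Close frame, then drain until the peer acknowledges.
async fn close_gracefully(
    mut socket: WebSocketStream<TcpStream>,
) -> tokio_tungstenite::tungstenite::Result<()> {
    socket.send(Message::Close(None)).await?;
    while let Some(message) = socket.next().await {
        if message.is_err() {
            break; // the connection is already closed
        }
    }
    Ok(())
}
```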
We should transparently support using the PliantDb client within WASM. The client will only be able to support using the WebSocket protocol.
The default behavior is to ensure the view is up-to-date before returning any data.
Keep this behavior by default, but add the ability to:
This allows ultimate flexibility: if you want eventually-consistent data access, use "update_after". If you know another process is updating the view regularly, you can request stale data always, allowing the other process to control the "caching".
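A sketch of the resulting knob as an enum passed to the query call; the names are illustrative:

```rust
/// Illustrative access policy for view queries.
enum AccessPolicy {
    /// Default: bring the view up to date before returning results.
    UpdateBefore,
    /// Return possibly-stale results now and queue an update afterwards
    /// (eventually consistent).
    UpdateAfter,
    /// Always return whatever is stored; another process owns updating.
    NoUpdate,
}
```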
- Connection -- the best method will probably be to create a PubSub client from a Connection.
- The Storage type.
Currently, reduce is implemented by default on the View trait with an Err result stating that the method isn't implemented. When executing the view code, we always call reduce, which means we always do a little extra work even if the view hasn't actually provided an implementation.
Either we should test whether it's actually implemented when registering the view, note that somewhere, and optimize this flow; or we should refactor how reduce is implemented -- can we leverage the type system better by splitting traits? Or, at the end of the day, is that more complicated?
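One way the type system could capture "reduce is implemented" is to split reduce into a subtrait and only run reduce for views registered through it; a sketch, illustrative rather than the crate's actual design:

```rust
/// Base view: every view can map.
trait View {
    type Key;
    type Value;

    fn map(&self, document: &[u8]) -> Vec<(Self::Key, Self::Value)>;
}

/// Views that actually implement reduce opt into this subtrait; the
/// executor only ever calls reduce on views registered as ReduceableView.
trait ReduceableView: View {
    fn reduce(&self, values: &[Self::Value]) -> Self::Value;
}
```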
While this project is being developed to be the core architecture of Cosmic Verge, it's a huge project, and it's nice to have more achievable goals.
For Cosmic Verge, there's a unified vision of using PliantDb as an app platform. We are developing Gooey with the goal of being able to write functional PliantDb-backed applications using mostly cross-platform code.
The idea of the example application is still nebulous. A typical example is a todo app, and honestly it's tempting because I'm between todo apps myself right now. But, for now, this is a placeholder issue.
To track this "milestone", refer to this project
To test this properly, we need to be able to query with stale results. Blocked by #13.
Unsure how this impacts local queries. When iterating in sled, each result is returned, so "skipping" ahead to catch up in a view would still scan those items; I'm not sure traditional pagination makes sense here. If, however, sled returned handles to data that could be loaded on demand, then we could support simple pagination by just skipping along the iterator.
If traditional pagination would make it too difficult to keep things performant, we should consider whether we care to expose pagination at all -- we could expose a "result limit" and make you responsible for incrementing your key request range.
As views are updated, the returned transaction ID should be able to be cached to allow for RwLock-level blocking (not potentially IO-blocking like accessing sled) when accessing a view that is updated.