beneath-hq / beneath

Beneath is a serverless real-time data platform ⚡️

Home Page: https://beneath.dev

License: Other

Dockerfile 0.11% JavaScript 0.32% TypeScript 35.01% Python 14.43% Jupyter Notebook 0.81% Shell 0.16% Go 45.12% Makefile 0.01% Mustache 0.15% Java 3.88%
dataops mlops data-science data-engineering data-pipelines developer-tools data-warehouse streaming etl analytics

beneath's Introduction

Beneath

Beneath is a serverless real-time data platform. Our goal is to create one end-to-end platform for data workers that combines data storage, processing, and visualization with data quality management and governance.


Go Report Card GoDoc Twitter

Beneath is a work in progress and your input makes a big difference! If you like it, star the project to show your support or reach out and tell us what you think.

🧠 Philosophy

The holy grail of data work is putting data science into production. It's glorious to build live dashboards that aggregate multiple data sources, send real-time alerts based on a machine learning model, or offer customer-specific analytics in your frontend.

But building a modern data management stack is a full-time job, and a lot can go wrong. If you were starting a project from scratch today, you might set up Postgres, BigQuery, Kafka, Airflow, DBT and Metabase just to cover the basics. Later, you would need more tools to do data quality management, data cataloging, data versioning, data lineage, permissions management, change data capture, stream processing, and so on.

Beneath is a new way of building data apps. It takes an end-to-end approach that combines data storage, processing, and visualization with data quality management and governance in one serverless platform. The idea is to provide one opinionated layer of abstraction, i.e. one SDK and UI, which under the hood builds on modern data technologies.

Beneath is inspired by services like Netlify and Vercel that make it remarkably easy for developers to build and run web apps. In that same spirit, we want to give data scientists and engineers the fastest developer experience for building data products.

🚀 Status

We started with the data storage and governance layers. You can use the Beneath Beta today to store, explore, query, stream, monitor and share data. It offers several interfaces, including a Python client, a CLI, websockets, and a web UI. The beta is stable for non-critical use cases. If you try out the beta and have any feedback to share, we'd love to hear it!

Next up, we're tackling the data processing and data visualization layers, which will bring expanded opportunities for data governance and data quality management (see the roadmap at the end of this README for progress).

🎬 Tour

The snippet below presents a whirlwind tour of the Python API:

# Connect with an authenticated client (assumes you have authenticated via the CLI)
import beneath
client = beneath.Client()

# Create a new table
table = await client.create_table("examples/demo/foo", schema="""
  type Foo @schema {
    foo: String! @key
    bar: Timestamp
  }
""")

# Write batch or real-time data
await table.write(data)

# Load into a dataframe
df = await beneath.load_full(table)

# Replay and subscribe to changes
await beneath.consume(table, callback, subscription_path="...")

# Analyze with SQL
data = await beneath.query_warehouse(f"SELECT count(*) FROM `{table}`")

# Lookup by key, range or prefix
data = await table.query_index(filter={"foo": {"_prefix": "bar"}})

The image below shows a screenshot from the Beneath console. Check out the home page for a demo video.

Source code example

🐣 Get started

The best way to try Beneath is with a free beta account. Sign up here. When you have created an account, you can:

  1. Install and authenticate the Beneath SDK
  2. Browse public projects and integrate using Python, JavaScript, Websockets and more
  3. Create a private or public project and start writing data

We're working on bundling a self-hosted version that you can run locally. If you're interested in self-hosting, let us know!

👋 Community and Support

🎓 Documentation

📦 Features and roadmap

  • Data storage
    • Log streaming for replay and subscribe
    • Replication to key-value store for fast indexed lookups
    • Replication to data warehouse for OLAP queries (SQL)
    • Schema management and enforcement
    • Data versioning
    • Schema evolution and migrations
    • Secondary indexes
    • Strongly consistent operations for OLTP
    • Geo-replicated storage
  • Data processing
    • Scheduled/triggered SQL queries
    • Compute sandbox for batch and streaming pipelines
    • Git-integration for continuous deployments
    • DAG view of tables and pipelines for data lineage
    • Data app catalog (one-click parameterized deployments)
  • Data visualization and exploration
    • Vega-based charts
    • Dashboards composed from charts and tables
    • Alerting layer
    • Python notebooks (Jupyter)
  • Data governance
    • Web console and CLI for creating and browsing resources
    • Usage dashboards for tables, services, users and organizations
    • Usage quota management
    • Granular permissions management
    • Service accounts with custom permissions and quotas
    • API secrets (tokens) that can be issued/revoked
    • Data search and discovery
    • Audit logs as meta-tables
  • Data quality management
    • Field validation rules, checked on write
    • Alert triggers
    • Data distribution tests
    • Machine learning model re-training and monitoring
  • Integrations
    • gRPC, REST and websockets APIs
    • Command-line interface (CLI)
    • Python client
    • JS and React client
    • PostgreSQL wire-protocol compatibility
    • GraphQL API for data
    • Row restricted access tokens for identity-centered apps
    • Self-hosted Beneath on Kubernetes with federation
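
As a concrete illustration of the field validation rules mentioned under data quality above, a rule set might be checked on write along these lines (the rule format and the `validate` helper are hypothetical sketches, not Beneath's actual API):

```python
from datetime import datetime

# Hypothetical validation rules: field name -> predicate checked on write
RULES = {
    "foo": lambda v: isinstance(v, str) and len(v) > 0,
    "bar": lambda v: v is None or isinstance(v, datetime),
}

def validate(record: dict) -> list:
    """Return a list of human-readable violations for a record."""
    return [
        f"field '{field}' failed validation"
        for field, check in RULES.items()
        if not check(record.get(field))
    ]

# An empty "foo" violates its rule; a missing "bar" is allowed
violations = validate({"foo": "", "bar": datetime(2021, 2, 1)})
```

In a real implementation, writes that produce violations could be rejected or routed to an alert trigger, per the roadmap items above.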

🍿 How it works

Check out the Concepts section of the docs for an overview of how Beneath works.

The contributing/ directory in this repository contains a deeper technical walkthrough of the software architecture.

🛒 License

This repository contains the full source code for Beneath. Beneath's core is source available, licensed under the Business Source License, which converts to the Apache 2.0 license after four years. All the client libraries (in the clients/ directory) and examples (in the examples/ directory) are open-source, licensed under the MIT license.

beneath's People

Contributors

begelundmuller, ericpgreen2, svantetobias


beneath's Issues

Simplify table client APIs

Summary

The client APIs for dealing with tables have gotten a little too nested and intricate. This issue investigates whether we can settle on a flatter and more straightforward interface. Ideally, we can unify the lower-level and "easy" functions in one interface. It seems it's okay to be slightly less verbose than we are today (the verbosity might actually be adding complexity).

Risks and challenges

  • Balance between verbosity and simplicity
  • Is it a sufficient leap in simplicity?
  • Make sure it's portable to JS and the upcoming Go client

Involved components

  • Python client
  • Data server

Proposed changes

WIP alert! Starting with the UX, I imagine a refactored Python client interface like this:

table = await client.table("user/project/table", schema=...)
table = await table.version(...) # optional

# writing
await table.write(records, deletes=...)

# querying
cursor = await table.log(replay=True, changes=True, subscription=...) # consumers embedded
cursor = await table.index("time >= '2021-02-01'")
cursor = await client.olap(f"SELECT count(*) FROM {table}")

# reading
df = await cursor.df()
records = await cursor.all()
async for record in cursor:
    print(record)
async for batch in cursor.batches(size=...):
    print(batch)
await cursor.consume(callback)
# + consumer APIs, including batching and time delay

# get/set features
await table.get({ ... }) # maps to table.index
await table.set({ ... }, { ... }) # oltp writes (+ ability to pass timestamp to fence writes)
await table.delete({ ... }) # oltp delete

# checkpointers?
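
The get/set semantics above, including passing a timestamp to fence writes, could behave like this in-memory sketch (class and method names here are illustrative, not the proposed client code):

```python
# In-memory sketch of fenced OLTP writes: a write is rejected if an
# equal-or-newer write for the same key has already been applied.
class FencedTable:
    def __init__(self):
        self._data = {}  # key -> (timestamp, value)

    def set(self, key, value, timestamp):
        current = self._data.get(key)
        if current is not None and current[0] >= timestamp:
            return False  # fenced: stale write rejected
        self._data[key] = (timestamp, value)
        return True

    def get(self, key):
        entry = self._data.get(key)
        return entry[1] if entry else None

    def delete(self, key, timestamp):
        # Delete is just a fenced write of a tombstone (None)
        return self.set(key, None, timestamp)

table = FencedTable()
table.set("foo", {"bar": 1}, timestamp=10)
table.set("foo", {"bar": 0}, timestamp=5)  # rejected: older than current
```

Fencing like this is what makes the checkpointer use case safe: a lagging writer cannot clobber a newer checkpoint.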

Refactor "streams" to "real-time tables"

Summary

Naming things is hard. One term that has been aching for a while is our use of the word "stream". In Beneath, a stream is at once a log with infinite retention, an indexed table, and a data warehouse table. The term "stream" is particularly bad for batch datasets, like uploaded dataframes ("batch stream", "finalized stream", ...).

I've considered and tested several alternative terms: tables, collections, topics, sets, datasets, frames, dataframes, slices, objects. We could also invent a new term altogether. But I think none of them beats the simplicity of "table", which everyone understands. By saying "real-time tables" or "subscribe to table" or "replay table", I think we can reasonably convey the log streaming feature.

So this refactor is about replacing our use of "stream" across the code base with "(real-time) table".

Risks and challenges

Our APIs (gql and grpc) use "stream" extensively, and changing it will break old client versions. It's a good exercise to see if we're geared to maintain backwards compatibility.

Involved components

Virtually the entire codebase. This is a superficial change, but will touch every part of the codebase.

Future-proofing GraphQL server schemas

Summary

Here and there in our control-plane GraphQL server schemas, there are some constructs that aren't future-proof. We might as well address them now to avoid breaking changes for clients later.

Risks and challenges

  • This will probably be a breaking change for the CLI (we'll use the deprecation notice system we built, so it's not so bad)

Involved components

  • Control server
  • Frontend queries
  • CLI code (in the Python client)

Proposed changes

  • Make sure all inputs are XXXInput types (regardless of number of values)
  • Replace separate ID and path lookups with XXXQualifier types
  • If path or ID lookup is unsupported, the resolver can just return an error
    • Consider using unions
  • Remove unpaginated sub-fields and replace with top-level fields
    • The sub-fields in question are: Organization.projects, Project.tables, Project.services
  • (The point is not to avoid paginated sub-fields altogether. But they require more care to implement with connections, and adding them in the future would be a non-breaking improvement)
  • Implement top-level list results as connections
  • Generally, go through every field name and consider if it's well-named

Login from CLI without manually creating a secret

Problem to solve and solution

It's a barrier to have to manually create a CLI secret to log in from the command line. We'll add a ticket-based login flow that opens in your browser when you run "beneath login".

Proposed solution and changes

  • Create a model AuthenticationTickets
  • Create resolvers for creating, approving and polling authentication tickets
  • Create a frontend flow for approving tickets (e.g. /-/auth/ticket?ticket=XXX)
    • If logged in, ask for approval
    • If not logged in, redirect to auth, then direct back to approval page
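
The ticket lifecycle sketched above could be modeled roughly like this (an in-memory illustration of create/approve/poll, not the actual resolvers or models):

```python
import secrets

# In-memory model of the proposed authentication tickets.
# States: "pending" -> "approved" (with an issued secret).
tickets = {}

def create_ticket():
    # Called by the CLI when "beneath login" starts
    ticket_id = secrets.token_urlsafe(16)
    tickets[ticket_id] = {"status": "pending", "secret": None}
    return ticket_id

def approve_ticket(ticket_id):
    # Called by the frontend flow after the user consents
    tickets[ticket_id]["status"] = "approved"
    tickets[ticket_id]["secret"] = secrets.token_urlsafe(32)

def poll_ticket(ticket_id):
    # The CLI polls until approved, then stores the secret locally
    t = tickets[ticket_id]
    return t["secret"] if t["status"] == "approved" else None

tid = create_ticket()
before = poll_ticket(tid)  # None: not yet approved
approve_ticket(tid)
after = poll_ticket(tid)   # the issued secret
```

A real implementation would also want ticket expiry and single-use semantics, so an unapproved ticket can't be polled indefinitely.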

Risks and challenges

  • Make sure the flow works for new users who haven't previously created a user (i.e. they'll be shown the welcome screen before redirecting to the approval page)

Record deletes and consistent writes

Problem to solve and solution

We want record deletes and strongly consistent writes. These would be really useful for e.g. checkpointers.

Proposed solution and changes

WIP:

  • Rewrite engine to distinguish keys and values. Null values will then mean deletes (what about keys that span all columns?).
  • Engine interface update to support deletes and immediate writes
  • Data server should execute immediate writes directly on the index driver, not pass over MQ first
  • Add tombstone support to the data server, client libraries, and frontend
  • The frontend interfaces must support tombstones
  • Initially, we'll do a weak implementation in the Bigtable driver with tombstones and simultaneous log and index writes
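
A minimal model of the tombstone idea above: the engine splits records into key and value, appends every write to the log, and materializes the index by replay, treating a None value as a delete (names and structure here are illustrative, not the actual engine interface):

```python
# Minimal model: every write is appended to the log; the index is
# materialized by replay, with value None acting as a tombstone.
log = []  # list of (key, value) tuples; value=None means delete

def write(key, value):
    log.append((key, value))

def materialize_index(entries):
    index = {}
    for key, value in entries:
        if value is None:
            index.pop(key, None)  # tombstone removes the key
        else:
            index[key] = value
    return index

write("a", {"n": 1})
write("b", {"n": 2})
write("a", None)  # delete "a"
index = materialize_index(log)
# the tombstone remains in the log until compaction removes it
```

This also makes the risk below concrete: without log compaction, the deleted record's tombstone (and its earlier value) persists in the log even though it no longer appears in the index.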

Risks and challenges

  • We really need transactional capabilities at the index layer to do this right. So this effectively adds the right interface, but the backend implementation is an MVP. When we adopt a better indexing engine (like FDB), we can do it right on the backend.
  • Without log compaction and warehouse merging, deletes won't completely remove data from our servers
