beneath-hq / beneath

Beneath is a serverless real-time data platform ⚡️

Home Page: https://beneath.dev

License: Other

Dockerfile 0.11% JavaScript 0.32% TypeScript 35.01% Python 14.43% Jupyter Notebook 0.81% Shell 0.16% Go 45.12% Makefile 0.01% Mustache 0.15% Java 3.88%
dataops mlops data-science data-engineering data-pipelines developer-tools data-warehouse streaming etl analytics

beneath's Introduction

Beneath

Beneath is a serverless real-time data platform. Our goal is to create one end-to-end platform for data workers that combines data storage, processing, and visualization with data quality management and governance.


Go Report Card GoDoc Twitter

Beneath is a work in progress and your input makes a big difference! If you like it, star the project to show your support or reach out and tell us what you think.

🧠 Philosophy

The holy grail of data work is putting data science into production. It's glorious to build live dashboards that aggregate multiple data sources, send real-time alerts based on a machine learning model, or offer customer-specific analytics in your frontend.

But building a modern data management stack is a full-time job, and a lot can go wrong. If you were starting a project from scratch today, you might set up Postgres, BigQuery, Kafka, Airflow, DBT and Metabase just to cover the basics. Later, you would need more tools to do data quality management, data cataloging, data versioning, data lineage, permissions management, change data capture, stream processing, and so on.

Beneath is a new way of building data apps. It takes an end-to-end approach that combines data storage, processing, and visualization with data quality management and governance in one serverless platform. The idea is to provide one opinionated layer of abstraction, i.e. one SDK and UI, which under the hood builds on modern data technologies.

Beneath is inspired by services like Netlify and Vercel that make it remarkably easy for developers to build and run web apps. In that same spirit, we want to give data scientists and engineers the fastest developer experience for building data products.

🚀 Status

We started with the data storage and governance layers. You can use the Beneath Beta today to store, explore, query, stream, monitor and share data. It offers several interfaces, including a Python client, a CLI, websockets, and a web UI. The beta is stable for non-critical use cases. If you try out the beta and have any feedback to share, we'd love to hear it!

Next up, we're tackling the data processing and data visualization layers, which will bring expanded opportunities for data governance and data quality management (see the roadmap at the end of this README for progress).

🎬 Tour

The snippet below presents a whirlwind tour of the Python API:

# Connect with an authenticated client (assumes you have authenticated via the CLI)
import beneath
client = beneath.Client()

# Create a new table
table = await client.create_table("examples/demo/foo", schema="""
  type Foo @schema {
    foo: String! @key
    bar: Timestamp
  }
""")

# Write batch or real-time data
await table.write(data)

# Load into a dataframe
df = await beneath.load_full(table)

# Replay and subscribe to changes
await beneath.consume(table, callback, subscription_path="...")

# Analyze with SQL
data = await beneath.query_warehouse(f"SELECT count(*) FROM `{table}`")

# Lookup by key, range or prefix
data = await table.query_index(filter={"foo": {"_prefix": "bar"}})

The image below shows a screenshot from the Beneath console. Check out the home page for a demo video.

Source code example

🐣 Get started

The best way to try Beneath is with a free beta account. Sign up here. When you have created an account, you can:

  1. Install and authenticate the Beneath SDK
  2. Browse public projects and integrate using Python, JavaScript, Websockets and more
  3. Create a private or public project and start writing data

We're working on bundling a self-hosted version that you can run locally. If you're interested in self-hosting, let us know!

👋 Community and Support

🎓 Documentation

📦 Features and roadmap

  • Data storage
    • Log streaming for replay and subscribe
    • Replication to key-value store for fast indexed lookups
    • Replication to data warehouse for OLAP queries (SQL)
    • Schema management and enforcement
    • Data versioning
    • Schema evolution and migrations
    • Secondary indexes
    • Strongly consistent operations for OLTP
    • Geo-replicated storage
  • Data processing
    • Scheduled/triggered SQL queries
    • Compute sandbox for batch and streaming pipelines
    • Git-integration for continuous deployments
    • DAG view of tables and pipelines for data lineage
    • Data app catalog (one-click parameterized deployments)
  • Data visualization and exploration
    • Vega-based charts
    • Dashboards composed from charts and tables
    • Alerting layer
    • Python notebooks (Jupyter)
  • Data governance
    • Web console and CLI for creating and browsing resources
    • Usage dashboards for tables, services, users and organizations
    • Usage quota management
    • Granular permissions management
    • Service accounts with custom permissions and quotas
    • API secrets (tokens) that can be issued/revoked
    • Data search and discovery
    • Audit logs as meta-tables
  • Data quality management
    • Field validation rules, checked on write
    • Alert triggers
    • Data distribution tests
    • Machine learning model re-training and monitoring
  • Integrations
    • gRPC, REST and websockets APIs
    • Command-line interface (CLI)
    • Python client
    • JS and React client
    • PostgreSQL wire-protocol compatibility
    • GraphQL API for data
    • Row restricted access tokens for identity-centered apps
    • Self-hosted Beneath on Kubernetes with federation
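
As a concrete illustration of the field validation rules mentioned under data quality above, a rule set might be checked on write along these lines (the rule format and the `validate` helper are hypothetical sketches, not Beneath's actual API):

```python
from datetime import datetime

# Hypothetical validation rules: field name -> predicate checked on write
RULES = {
    "foo": lambda v: isinstance(v, str) and len(v) > 0,
    "bar": lambda v: v is None or isinstance(v, datetime),
}

def validate(record: dict) -> list:
    """Return a list of human-readable violations for a record."""
    return [
        f"field '{field}' failed validation"
        for field, check in RULES.items()
        if not check(record.get(field))
    ]

# An empty "foo" violates its rule; a missing "bar" is allowed
violations = validate({"foo": "", "bar": datetime(2021, 2, 1)})
```

In a real implementation, writes that produce violations could be rejected or routed to an alert trigger, per the roadmap items above.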

🍿 How it works

Check out the Concepts section of the docs for an overview of how Beneath works.

The contributing/ directory in this repository contains a deeper technical walkthrough of the software architecture.

🛒 License

This repository contains the full source code for Beneath. Beneath's core is source available, licensed under the Business Source License, which converts to the Apache 2.0 license after four years. All the client libraries (in the clients/ directory) and examples (in the examples/ directory) are open-source, licensed under the MIT license.

beneath's People

Contributors

begelundmuller, ericpgreen2, svantetobias


beneath's Issues

Simplify table client APIs

Summary

The client APIs for dealing with tables have gotten a little too nested and intricate. This issue investigates whether we can settle on a flatter and more straightforward interface. Ideally, we can unify the lower-level and "easy" functions in one interface. It seems it's okay to be slightly less verbose than we are today (the verbosity might actually be adding complexity).

Risks and challenges

  • Balance between verbosity and simplicity
  • Is it a sufficient leap in simplicity?
  • Make sure it's portable to JS and the upcoming Go client

Involved components

  • Python client
  • Data server

Proposed changes

WIP alert! Starting with the UX, I imagine a refactored Python client interface like this:

table = await client.table("user/project/table", schema=...)
table = await table.version(...) # optional

# writing
await table.write(records, deletes=...)

# querying
cursor = await table.log(replay=True, changes=True, subscription=...) # consumers embedded
cursor = await table.index("time >= '2021-02-01'")
cursor = await client.olap(f"SELECT count(*) FROM {table}")

# reading
df = await cursor.df()
records = await cursor.all()
async for record in cursor:
    print(record)
async for batch in cursor.batches(size=...):
    print(batch)
await cursor.consume(callback)
# + consumer APIs, including batching and time delay

# get/set features
await table.get({ ... }) # maps to table.index
await table.set({ ... }, { ... }) # oltp writes (+ ability to pass timestamp to fence writes)
await table.delete({ ... }) # oltp delete

# checkpointers?
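
The get/set semantics above, including passing a timestamp to fence writes, could behave like this in-memory sketch (class and method names here are illustrative, not the proposed client code):

```python
# In-memory sketch of fenced OLTP writes: a write is rejected if an
# equal-or-newer write for the same key has already been applied.
class FencedTable:
    def __init__(self):
        self._data = {}  # key -> (timestamp, value)

    def set(self, key, value, timestamp):
        current = self._data.get(key)
        if current is not None and current[0] >= timestamp:
            return False  # fenced: stale write rejected
        self._data[key] = (timestamp, value)
        return True

    def get(self, key):
        entry = self._data.get(key)
        return entry[1] if entry else None

    def delete(self, key, timestamp):
        # Delete is just a fenced write of a tombstone (None)
        return self.set(key, None, timestamp)

table = FencedTable()
table.set("foo", {"bar": 1}, timestamp=10)
table.set("foo", {"bar": 0}, timestamp=5)  # rejected: older than current
```

Fencing like this is what makes the checkpointer use case safe: a lagging writer cannot clobber a newer checkpoint.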

Refactor "streams" to "real-time tables"

Summary

Naming things is hard. One term that has been aching for a while is our use of the word "stream". In Beneath, a stream is at once a log with infinite retention, an indexed table, and a data warehouse table. The term "stream" is particularly bad for batch datasets, like uploaded dataframes ("batch stream", "finalized stream", ...).

I've considered and tested several alternative terms: tables, collections, topics, sets, datasets, frames, dataframes, slices, objects. We could also invent a new term altogether. But I think none of them beats the simplicity of "table", which everyone understands. By saying "real-time tables" or "subscribe to table" or "replay table", I think we can reasonably convey the log streaming feature.

So this refactor is about replacing our use of "stream" across the code base with "(real-time) table".

Risks and challenges

Our APIs (gql and grpc) use "stream" extensively, and changing it will break old client versions. It's a good exercise to see if we're geared to maintain backwards compatibility.

Involved components

Virtually the entire codebase. This is a superficial change, but will touch every part of the codebase.

Future-proofing GraphQL server schemas

Summary

Here and there in our control-plane GraphQL server schemas, there are some constructs that aren't future-proof. We might as well address them now to avoid breaking changes for clients later.

Risks and challenges

  • This will probably be a breaking change for the CLI (we'll use the deprecation notice system we built, so it's not so bad)

Involved components

  • Control server
  • Frontend queries
  • CLI code (in the Python client)

Proposed changes

  • Make sure all inputs are XXXInput types (regardless of number of values)
  • Replace separate ID and path lookups with XXXQualifier types
  • If path or ID lookup is unsupported, the resolver can just return an error
    • Consider using unions
  • Remove unpaginated sub-fields and replace with top-level fields
    • The sub-fields in question are: Organization.projects, Project.tables, Project.services
  • (The point is not to avoid paginated sub-fields altogether. But they require more care to implement with connections, and adding them in the future would be a non-breaking improvement)
  • Implement top-level list results as connections
  • Generally, go through every field name and consider if it's well-named

Login from CLI without manually creating a secret

Problem to solve and solution

It's a barrier to have to manually create a CLI secret to log in from the command line. We'll add a ticket-based login flow that opens in your browser when you run "beneath login".

Proposed solution and changes

  • Create a model AuthenticationTickets
  • Create resolvers for creating, approving and polling authentication tickets
  • Create a frontend flow for approving tickets (e.g. /-/auth/ticket?ticket=XXX)
    • If logged in, ask for approval
    • If not logged in, redirect to auth, then direct back to approval page
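
The ticket lifecycle sketched above could be modeled roughly like this (an in-memory illustration of create/approve/poll, not the actual resolvers or models):

```python
import secrets

# In-memory model of the proposed authentication tickets.
# States: "pending" -> "approved" (with an issued secret).
tickets = {}

def create_ticket():
    # Called by the CLI when "beneath login" starts
    ticket_id = secrets.token_urlsafe(16)
    tickets[ticket_id] = {"status": "pending", "secret": None}
    return ticket_id

def approve_ticket(ticket_id):
    # Called by the frontend flow after the user consents
    tickets[ticket_id]["status"] = "approved"
    tickets[ticket_id]["secret"] = secrets.token_urlsafe(32)

def poll_ticket(ticket_id):
    # The CLI polls until approved, then stores the secret locally
    t = tickets[ticket_id]
    return t["secret"] if t["status"] == "approved" else None

tid = create_ticket()
before = poll_ticket(tid)  # None: not yet approved
approve_ticket(tid)
after = poll_ticket(tid)   # the issued secret
```

A real implementation would also want ticket expiry and single-use semantics, so an unapproved ticket can't be polled indefinitely.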

Risks and challenges

  • Make sure the flow works for new users who haven't previously created a user (i.e. they'll be shown the welcome screen before redirecting to the approval page)

Record deletes and consistent writes

Problem to solve and solution

We want record deletes and strongly consistent writes. These would be really useful for e.g. checkpointers.

Proposed solution and changes

WIP:

  • Rewrite engine to distinguish keys and values. Null values will then mean deletes (what about keys that span all columns?).
  • Engine interface update to support deletes and immediate writes
  • Data server should execute immediate writes directly on the index driver, not pass over MQ first
  • Add tombstone support to the data server, client libraries, and frontend
  • The frontend interfaces must support tombstones
  • Initially, we'll do a weak implementation in the Bigtable driver with tombstones and simultaneous log and index writes
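
A minimal model of the tombstone idea above: the engine splits records into key and value, appends every write to the log, and materializes the index by replay, treating a None value as a delete (names and structure here are illustrative, not the actual engine interface):

```python
# Minimal model: every write is appended to the log; the index is
# materialized by replay, with value None acting as a tombstone.
log = []  # list of (key, value) tuples; value=None means delete

def write(key, value):
    log.append((key, value))

def materialize_index(entries):
    index = {}
    for key, value in entries:
        if value is None:
            index.pop(key, None)  # tombstone removes the key
        else:
            index[key] = value
    return index

write("a", {"n": 1})
write("b", {"n": 2})
write("a", None)  # delete "a"
index = materialize_index(log)
# the tombstone remains in the log until compaction removes it
```

This also makes the risk below concrete: without log compaction, the deleted record's tombstone (and its earlier value) persists in the log even though it no longer appears in the index.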

Risks and challenges

  • We really need transactional capabilities at the index layer to do this right. So this effectively adds the right interface, but the backend implementation is an MVP. When we adopt a better indexing engine (like FDB), we can do it right on the backend.
  • Without log compaction and warehouse merging, deletes won't completely remove data from our servers
