
nebula's Introduction

Nebula

Extremely-fast Interactive Real-Time Analytics

Nebula is an extremely fast, end-to-end interactive big data analytics solution, designed as a high-performance columnar data store and tabular OLAP engine.

What is Nebula?

  • Extremely Fast Data Analytics System with Access Control.
  • Distributed Cache Tier for Tabular Data.
  • Unified Service API for Any Source (files, streaming, services, etc.)

Nebula can run on

  • Local box
  • VM cluster
  • Kubernetes

Design documents, internals, and stories will be shared in the project docs (under construction).

A Simple Story

In short, here is a typical story; see if it sounds interesting to you:

  1. You have some data: files on cloud storage, streaming (e.g. Kafka), or even just a bunch of CSV files on GitHub, pretty much any source...
  2. You deploy a Nebula cluster: either a single box, a cluster of a few EC2 machines on AWS, or a Kubernetes cluster. Nebula doesn't have external dependencies, just a couple of binaries (or docker images), so it's easy to maintain.
  3. Now, you add a table definition to the cluster config file. Right away, you have these available:
    • A web UI where you can slice/dice your data for interactive visualization. You can also write scripts to transform your data on the server side.
    • A REST API that you can build your own applications with.

Highlight - visualize your real-time streaming data from Kafka

(demo video)

Sound interesting? Read on...


Introduction

With Nebula, you could easily:

(chart: transform a column and aggregate by it with filters)

To learn more, check out these resources:

  1. 10-minute quick tutorial video
  2. Nebula presentation slides

Get Started

Run an example instance with sample data locally

  1. Clone the repo: git clone https://github.com/varchar-io/nebula.git
  2. Build the latest code: cd nebula && ./build.sh
  3. Launch the services: ./run.sh (the script uses the test config file build/configs/test.yml, which you can modify to connect to your own data)
  4. If everything is up and running, explore the Nebula UI in your browser at http://localhost:8088

Run an example instance with sample data on Kubernetes

Deploy a single-node k8s cluster on your local box. Assuming your current kubectl context points to that cluster, just run:

  • apply: kubectl apply -f deploy/k8s/nebula.yaml.
  • forward: kubectl port-forward nebula/server 8088:8088
  • explore: http://localhost:8088

Build Source & Test

The whole repo can be built on either macOS or Linux. Just run ./build.sh.

After the source builds successfully, the binaries can be found in the ./build directory. Now you can launch a simple cluster of "server" + "one worker" + "web server" like this:

  • launch a node: ~/nebula/build% ./NodeServer
  • launch the server: ~/nebula/build% ./NebulaServer --CLS_CONF configs/test.yml
  • launch the web server: ~/nebula/src/service/http/nebula% NS_ADDR=localhost:9190 NODE_PORT=8081 node node.js

If everything goes as expected, you should now be able to explore and query the sample data from the UI at http://localhost:8081

Bird's-eye View

(architecture overview diagram)

Common Scenarios

As shown in the previous section on running the sample locally, all Nebula data tables are defined by a YAML section in the cluster config file (configs/test.yml in the example). Each use case demonstrated here is a table definition, which you can copy into configs/test.yml and run in that test setup. (Just substitute the real values for your own data, such as the schema and file location.)

CASE-1: Static Data Analytics

Configure your data source from permanent storage (a file system) and run analytics on it. AWS S3 and Azure Blob Storage are commonly used storage systems, with support for file formats like CSV, Parquet, and ORC, all of which are frequently used in modern big data ecosystems.

For example, this simple config will let you analyze S3 data in Nebula:

seattle.calls:
  retention:
    max-mb: 40000
    max-hr: 0
  schema: "ROW<cad:long, clearence:string, type:string, priority:int, init_type:string, final_type:string, queue_time:string, arrive_time:string, precinct:string, sector:string, beat:string>"
  data: s3
  loader: Swap
  source: s3://nebula/seattle_calls.10k.tsv
  backup: s3://nebula/n202/
  format: csv
  csv:
    hasHeader: true
    delimiter: ","
  time:
    type: column
    column: queue_time
    pattern: "%m/%d/%Y %H:%M:%S"

CASE-2: Realtime Data Analytics

Connect Nebula to a real-time data source such as Kafka, with data in Thrift or JSON format, and do real-time data analytics.

For example, this config section asks Nebula to connect to one Kafka topic for real-time code profiling:

  k.pinterest-code:
    retention:
      max-mb: 200000
      max-hr: 48
    schema: "ROW<service:string, host:string, tag:string, lang:string, stack:string>"
    data: kafka
    loader: Streaming
    source: <brokers>
    backup: s3://nebula/n116/
    format: json
    kafka:
      topic: <topic>
    columns:
      service:
        dict: true
      host:
        dict: true
      tag:
        dict: true
      lang:
        dict: true
    time:
      # kafka will inject a time column when "provided" is specified
      type: provided
    settings:
      batch: 500

CASE-3: Ephemeral Data Analytics

Define a template in Nebula and load data through the Nebula API, allowing the data to live for a specific period. Nebula serves analytics queries during this ephemeral data's lifetime (a sketch follows).
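For illustration, an ephemeral table definition would presumably reuse the same YAML shape as the other cases. The sketch below is only an assumption of what that could look like: the table name, the data/loader values, and the retention numbers are made up for illustration and are not documented options.

  ephemeral.metrics:
    retention:
      # a short retention window bounds the ephemeral data's lifetime (illustrative values)
      max-mb: 1000
      max-hr: 2
    schema: "ROW<id:int, event:string, value:int>"
    # "custom" data with an "Api" loader is a placeholder guess; check the project
    # docs for the actual values used with API-loaded ephemeral tables
    data: custom
    loader: Api
    source: ""
    backup: s3://nebula/ephemeral/
    format: none
    time:
      type: provided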

CASE-4: Sparse Storage

Break input data down into a huge number of small data cubes living on Nebula nodes; a simple predicate (filter) can then massively prune the data to scan, yielding very low latency in your analytics.

For example, this config sets up internal partitioning that leverages sparse storage for very fast pruning of queries targeting a specific dimension. (It also demonstrates how to set up column-level access control: an access group and access action for specific columns.)

  nebula.test:
    retention:
      # max 10GB RAM assignment
      max-mb: 10000
      # max 10 days assignment
      max-hr: 240
    schema: "ROW<id:int, event:string, tag:string, items:list<string>, flag:bool, value:tinyint>"
    data: custom
    loader: NebulaTest
    source: ""
    backup: s3://nebula/n100/
    format: none
    # NOTE: reference only; column properties defined here will not take effect
    # because they are overridden by the definitions in TestTable.h
    columns:
      id:
        bloom_filter: true
      event:
        access:
          read:
            groups: ["nebula-users"]
            action: mask
      tag:
        partition:
          values: ["a", "b", "c"]
          chunk: 1
    time:
      type: static
      # get it from linux by "date +%s"
      value: 1565994194

SDK: Nebula Is Programmable

Thanks to the great project QuickJS, Nebula supports full ES6 programming through its simple UI code editor. Below is a code snippet that generates a pie chart from your SQL-like query code in JS.

The demo video at the top of the page shows how the Nebula client SDK is used, with tables and charts generated in milliseconds!

    // define a customized column
    const colx = () => nebula.column("value") % 20;
    nebula.apply("colx", nebula.Type.INT, colx);

    // get a data set from data stored in HTTPS or S3
    nebula
        .source("nebula.test")
        .time("2020-08-16", "2020-08-26")
        .select("colx", count("id"))
        .where(and(gt("id", 5), eq("flag", true)))
        .sortby(nebula.Sort.DESC)
        .limit(10)
        .run();

Open source

Open source is wonderful - it is the reason we can build software and innovate on top of others' work. Without these great open source projects, Nebula wouldn't be possible:

Many others are used by Nebula:

  • common tools (glog/gflags/gtest/yaml-cpp/fmt/leveldb)
  • serde (msgpack/rapidjson/rdkafka)
  • algos (xxhash, roaring bitmap, zstd, lz4)
  • ...

Adoptions

Pinterest


nebula's Issues

Support Ingesting Compressed Data Files

If the data source is a file system such as S3, we should be able to detect compressed files by extension (.gz, .zst, .lz4, etc.) and decompress them transparently before ingesting them (CSV/TSV files are much smaller in compressed form). Basically, we can decompress the file in place after downloading it from the object store.
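For illustration only, a hypothetical fragment of a table definition once this is supported; the .gz path is made up, and the transparent decompression is the requested behavior, not a current feature.

  seattle.calls:
    ...
    # hypothetical: Nebula would detect the .gz extension and decompress the
    # file in place after download, before ingesting the rows
    source: s3://nebula/seattle_calls.10k.tsv.gz
    format: csv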

ingestion UDF

Often, when a user reads an S3 directory into Nebula, the data needs certain transforms applied to match the target schema:

  • extract nested columns and explode them into multiple rows
  • translate data types, e.g. from a timestamp to a date_str

The goal is to allow users to define their own ingestion UDFs and apply them to each row read from S3 (see the hypothetical sketch after this list):

  • declare the post-transform schema in the yaml config
  • define the UDF and the columns it applies to in the yaml config
  • evaluate python/node.js? Users could define and apply UDFs on demand.
  • refactor IngestionSpec::build
  • add test coverage for ingestion UDFs
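A hypothetical YAML sketch of what such a declaration could look like, purely to make the proposal concrete; the transform block and every key under it are made up and do not exist in Nebula today.

  my.table:
    ...
    # proposed (not implemented): post-transform schema and per-column UDFs
    transform:
      schema: "ROW<accountid:long, date_str:string>"
      udfs:
        date_str:
          lang: js
          apply: "(row) => toDateString(row.timestamp)"

Declaring the post-transform schema next to the UDFs would keep the ingestion contract in one place, matching the first two bullets above.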

Enable cardinality function on all columns

This is a tiny task, a good first issue for newcomers.
Right now, the cardinality estimate function is only attached to NUMBER-typed columns when populating the UI; just allow it to populate for any type, since cardinality is not limited to numbers.

Remove display type from the interface.

The Nebula query engine should not care about which display type the front end uses.
The only things it cares about are the query fields:

  • does it have an aggregation field
  • is it aggregated by timeline/window

The client should be able to handle whatever the final query result is.

Setup Github Action

Set up GitHub Actions to build Nebula nightly.

Right now, making the Nebula build work on a fresh (Ubuntu) machine is not EASY!
@samprasyork probably has a good feel for it, :)

This issue tracks making the build experience easy for future developers on this project. GitHub Actions supports CMake-based projects; by enabling it, I guess we will fix most of the build issues.

Support flatten field in JSON ingestion

Some JSON data is structured, similar to a Thrift data object, which uses field mapping to ingest data.
For JSON, an easier way is to flatten the structure into field names.

For example, the schema "ROW<a.b:int, a.c:string>" can ingest JSON data like

{
  a: {
    b: 230,
    c: "xyz"
  }
}

An alternative approach is to fix all issues with nested types in Nebula, including Map, List, and Struct, which haven't been tested at all, so I assume there will be tons of work in this space.

Histogram view enhancement by updating min/max from predicates/filters

This is an enhancement based on user intent, not a priority.

(screenshot)

When there is no filter, or the filter has nothing to do with the column the histogram runs on, we use the min/max from metadata. But when the user puts a predicate like value > 90 and runs hist("value"), we should update the range to [90, max] and then run 10 buckets over it. This scenario is called a zoom-in histogram, which helps the user keep zooming into a smaller data range at different granularity.

@shuoshang1990 feel free to take a look, but you don't have to take it; it's essentially a new feature on top of the current hist.

Support Instant UDF In Filter

Currently an instant UDF (a JavaScript function/lambda) can be used to define a new column, so we can use it in the select clause.
However, we should support it in the where clause as well, so that the pseudo code below executes successfully:

    const x = () => { return nebula.column("age") % 3; };
    nebula.apply("x", nebula.Type.INT, x);
    nebula.select("type", count("id")).where(eq("x", 1));

coding IDE refresh

Better discoverability for users writing JavaScript UDFs for data analysis.
Keep the code logic in sync with the user interface.

Table only view

Some query results aren't a good fit for a graph visual, such as error messages and their counts. In this case, the user may want to hide the graph and display a table only.

We should provide an option that allows the user to hide the graph, i.e. a table-only display option.

Revamp Nebula UI To Remove Visualization Choices In Query

Today the Nebula UI requires the user to choose which visualization to use. This is unnecessary and not good UX.

Make the change as below:
Based on the fields (with/without aggregations), the Nebula UI will always display a table for the result data (truncating very long string columns), and above it the user can choose to display any visual they want.

Timeline queries need special treatment, always respecting the window column as the x-axis.

Support Time Pattern In Column Property

A column property can have a time pattern, which could be used to:

  1. ingest string literals from a data source that follows this pattern.
  2. display this column in the format defined by the pattern.

example:

    columnx:
      timestamp: "%Y-%m-%d %H:%M:%S"

Support dynamically updating the timeline based on interval and relative time range

The client will create a continuous query template and materialize it with a new interval start and end for each new data point.
Push the new data points into the timeline buckets and let the visual rendering part decide how to consume the growing queue.

If we cap the queue size, old data points will be pushed out as new data points are appended.

Support timeline visual for normal query result

The Nebula timeline requires a time column to be present, but in the ephemeral case a time column may be represented by some custom column, named anything (it may even be generated by an instant UDF).

For this type of query result, what if the user wants to visualize it as a timeline? We should add this kind of purely client-side visual support, as a parallel feature alongside the existing timeline query.

Experiment deploying Nebula with Kubernetes

One of the design goals for Nebula is elasticity, meaning Nebula nodes should be able to join and leave flexibly in reaction to workload changes. Kubernetes is a great environment to exercise this, and it will help push the Nebula design toward its ideal state.

As a first step, I would love to see somebody give it a try and tell us what the gaps are.

Use local time in UI

Right now, we're using UTC time throughout the whole stack, from the Nebula engine to the Nebula UI.

However, it's not intuitive for the UI to not use local time. For example, I'm in PST and I don't want to translate times back and forth in my head; it's very inconvenient. This work should be done in a single place serving UI time translation, ideally with a UI option that lets the user choose UTC or local time.

Work can start from https://github.com/varchar-io/nebula/blob/master/src/service/http/nebula/_/time.js

Update file header

Replace all occurrences of "Copyright 2017-present Shawn Cao" in file headers with "Copyright 2017-present varchar.io".

Headless mode

To support embedding the Nebula UI into other surfaces (dashboards, embedded environments), introduce a state property to indicate whether the current rendering includes the visualization display only.

Benchmark Nebula performance

Lots of unknowns for this item:

  • who to compare with? Druid/Pinot/ClickHouse?
  • did any of those engines publish benchmark work that we can rerun on Nebula?
  • what does the setup look like? single node / cluster?

Regarding performance metrics: data storage size/cost, query latency, query set to run, etc.

Ingest sampling support

Some use cases need sampling support during data ingestion for very heavy data sources, so users can get insights without scanning the full data.

A few initial thoughts (a hypothetical config sketch follows the list):

  • sampling policy: random sampling, shard-key sampling, partition-level sampling
  • sampling ratio: percentage
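A hypothetical sketch of how this could surface in a table definition, just to make the proposal concrete; the sampling block and its keys do not exist today.

  heavy.table:
    ...
    # proposed (not implemented): ingest-time sampling
    sampling:
      policy: random   # alternatives: shard-key, partition
      ratio: 0.01      # keep roughly 1% of ingested rows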

support compatible casting in type eval

A test case using constant value eval as an example:

  {                                                             
    int8_t value = 90;
    auto c = nebula::surface::eval::constant(value);
    auto ev = c->eval<int32_t>(ctx);        
    LOG(INFO) << "Origin=" << value << ", Eval=" << ev.value();
  }

This is debatable, as type enforcement is a good thing, but the issue is that if we get it wrong as in the example above, it fails at runtime rather than compile time. So I feel we should either support a compile-time check or add compatible casting at runtime.

Support compare view

This is an end-to-end feature request to show data in a diff view, including:

  • Table
  • Timeline
  • Bar / line
  • Flame (🔥)

I think there are two major comparison types we should support, and this will simplify how diff is defined and supported in Nebula.

  • Diff on two different time ranges. In this mode, we allow the user to input two different time ranges while the rest of the data/metrics stay the same. In some visualizations, we may want to introduce T1 vs T2 as dimensions for the diff view.

  • Diff on two different filter sets (a time range is essentially one specific filter). Here, we generate two types T1 and T2 to represent two different filters, and the result is presented in two groups for comparison.

Improve build steps

It's painful to make the C++ build pass today; the major problems are dependencies and linker issues.
Target platform: Linux Ubuntu 18

Though we may not be able to make the build fully automated, we can at least have a clear guide that is known to work.

At the same time, we should maintain the build for macOS as well.

Histogram values seem incorrect

steps:

  • clone latest code from master and build it locally
  • run local stack by invoking ./run.sh
  • go to http://localhost:8088

In the test set, run hist(value) only; the value distribution doesn't meet expectations. I think the value column should be evenly distributed, as it's randomly generated in the range [-128, 128] -

return std::numeric_limits<int8_t>::max() * rand_();

So I believe every bucket should have a similar number of values in the histogram chart.

Also, the query (value > 90) should produce non-zero buckets for values greater than 90; some bug out there! :)
(screenshot)

[Ingestion] Hourly Roll spec support

The goal here is to support reading from an hourly partitioned S3 directory.

https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/connectors/filesystem.html#full-example

  1. We need to improve genSpecs4Roll and add hourly partitioned specs instead of daily ones. Currently the macro is hard-coded as date (daily), hence some work to read the macro and decide how genSpecs4Roll does its listing accordingly.

  2. Often there is a commit flag _SUCCESS when an hourly partition is finished. fs->list() should be able to customize its behavior and let the user decide whether they want a fresh yet incomplete dataset or a complete dataset with a bit more latency.

Detailed discussion in #34.

Anomaly detection on top of timeline

(Experimental & direction exploration)
Investigate Facebook Prophet.

A timeline could be defined and wrapped by an anomaly detection module; this could be saved as an item, and when things go wild, it fires an event on a dashboard.

Support fixed-length string

In databases, there are types like the fixed-length string char(3) and the variable-length string varchar(30).

In many scenarios, fixed-length string types offer big reductions in both storage and computation.

Support Google Cloud Storage

Currently Nebula defines a file system interface in its storage component. Under this interface there are a few implementations, such as:

  1. local file system
  2. S3 file system

These two are the most used so far; now we're asking for new support to allow Nebula to read and write data with GCS. This will be needed for Nebula deployments on Google Cloud Platform.
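Once a GCS implementation exists behind the file system interface, a table definition could presumably point its source and backup at gs:// paths, e.g. (hypothetical; gs:// is not supported today):

  my.gcs.table:
    ...
    source: gs://my-bucket/path/data.csv
    backup: gs://my-bucket/nebula-backup/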

Revamp doc site https://nebula.bz

The feedback from Chris makes sense: we should write docs from the users' perspective instead of the developers' perspective.

For dev notes, we should use more .md files in the source code, leaving these docs for users who can actually adopt the project in their work.

Support filters in client SDK

Right now, the Nebula client (JavaScript) SDK doesn't support filters yet, basically the where clause function.
This issue will track support for that.

prototyping Nebula Ingestion DDL

Nebula Ingestion DDL

YAML is a powerful way to express configuration; it's easy for people to understand and change. At the same time, remembering all the different configurations and concepts becomes a high tax once we start supporting functions and preprocessing, indexing, or consistent hashing (a possible concept for expanding/shrinking the storage cluster). This may lead to inventing a new set of configurations and concepts that only experts can remember.

Moreover, an OLAP system works as part of a big data ecosystem; being able to transform and pre-process at ingestion time will give it an edge over other OLAP engines for users to adopt.

Consider an inspiring example that is not yet supported by Nebula.

A user has a Hive table and a Kafka stream ingested into Nebula. The Hive table has hourly partitions keeping the last 60 days of the moving average of business spend per account; the Kafka stream contains each account's business transactions in foreign currencies. The user wants to investigate account spending status in near real time in the home currency (e.g. USD).

The complexity of this use case is threefold:

  • the Hive table may read data and eventually shard on a per-account basis
  • the Kafka stream may need to make an RPC to convert the currency into USD
  • the Kafka stream may need to do a stream/table join on a per-account basis before landing the result for slicing and dicing

If the user wrote an RDBMS query, it would look like:

OPTION 1 Materialized View with schema as part of config

create view nebula.transaction_analytic as (select accountid, avg(spend), transactionid, TO_USD(transaction_amount) from hive right join kafka on hive.account = kafka.account where <all configs on hive, kafka>)

Alternatively, we can support a two-statement flow like:

OPTION 2 Full Table with schema inference

DDL

    // mapping of a hive table synced to a nebula table
    create table hive.account (
      accountid bigint PRIMARY KEY,
      spend double,
      dt varchar(20) UNIQUE NOT NULL
    ) with ();

    create table kafka.transaction (
      transactionid bigint PRIMARY KEY,
      accountid bigint not null,
      transaction_amount double,
      _time timestamp
    ) with ();

    create table transaction_analytic (
      accountid bigint PRIMARY KEY,
      avg_transaction double,
      transaction_amount_in_usd double,
      _time timestamp
    ) with ();

DML

    insert into transaction_analytic select accountid, avg(spend), transactionid, TO_USD(transaction_amount) from hive right join transaction on hive.account = transaction.account;

Be adaptive to stream speed

Today, for a real-time streaming data source such as Kafka, Nebula creates each data block based on an offset start and end, seals the block when all its records have arrived, and then puts the sealed block into the queryable block pool. Queries scan the blocks in the pool.

For busy streams this may not be an issue, as the system places data blocks very often. But a slow stream may wait several minutes to generate a sizable batch. Our current solution is to decrease the batch size for that stream, which ends up with more blocks to manage and is still not adaptive to stream speed, since traffic speeds up and down over different periods.

This issue asks for an improvement to make ingestion adaptive to stream speed while still maintaining an ideal block size. Options:

  • keep the latest block open for both query and append
  • make a copy of the in-progress block until it's sealed
  • open to other designs

Histogram query can have multiple charts based on the keys

We should be able to support queries that show a histogram for each key in the query.

For example, if the user queries tag, hist(value), we should display 4 charts in the UI, each showing a histogram view for one key (a, b, c, d in this example).

As a reference, we have similar handling for the TREEMERGE function in the flame/icicle view.
(screenshot)

Treemerge performance

Treemerge is the core algorithm that merges call stacks into a weighted tree, which is displayed as an icicle or flame graph in the frontend.

However, we have observed large performance degradation; it needs some investigation to understand where optimizations could be implemented.

A few initial leads:

  • The algorithm itself may not be efficient (it may spend a long time parsing large string blobs into frames).
  • Large data sets (the final collection could be further trimmed during aggregation: lift the threshold, introduce compression, etc.).
  • Use a vector/list to store the call stack frames? And probably introduce dictionary encoding for list items.

duplicate data returned

Metadata needs to track whether duplicate blocks exist for the same spec.

The Nebula server keeps generating new data specs, especially in the "swap" scenario; how do we guarantee a block belongs to the current spec rather than the last spec, even though they may have the same spec definition?

Revisit the structure of the metadata layers:

  • (offline, expired specs support)
  • Table -> Specs -> Blocks
  • Spec versioning

Tables are identified by name; there are no duplicate table names in Nebula.
Specs are identified by a signature, which is timestamped.
Blocks are identified by a signature.

Implement histogram view on numeric columns

Basically "select hist("col") from

", we can start with 100 buckets and make it configurable in query interface.
folly:Histogram can be used to achieve this:

  • figure out range of column value at query time (metadata support)
  • Histogram aggregation function to be added
  • UI/display with column/bar chart.

support data transformation on the query result before visualization

Sometimes users have defined metrics on the query result itself. For example:

  1. a new metric column per row = col(A) / col(B)
  2. further aggregation across different rows: (A B 1) (A C 2) => (A B/C=0.5)

No matter what, we will pass the raw JSON blob to the user and let them transform it using their own transform lambda; the new schema will be discovered in the transformed result.

MetaDB backup integrity

Currently we back up the internal metadata DB (leveldb) when it is 'dirty' and the backup interval has been exceeded.
However, the process is not atomic and may leave a broken version in the backup.

I think the fix could simply be to adjust the current design to ensure that any version left in the backup media is valid.
I'm also open to a different design, such as operating the MetaDB completely separately, which definitely requires more diligent work.

Though MetaDB only powers Nebula short links today, it is becoming pivotal and will be more and more important as we leverage it as a future source of data set integrity as well as for load balancing data sets across different Nebula nodes. Would love to see deeper thoughts on this issue.

Implement Bubble chart support

Bubble chart support is useful for some use cases. Bubble charts are popular; would love to see them available in the Nebula UI.

Decouple time macro in source path from time spec

Currently we support a roll spec with time MACROs (date, hour, min, second) by specifying the MACRO pattern in the time spec. I think we should decouple this for better flexibility.

For example, the spec below should be legit:

test-table:
  ...
  source: s3://xxx/{date}/{hour}/
  time:
      type: column
      column: col2
      pattern: UNIXTIME_MS

This spec basically asks us to scan an S3 file path with supported macros in it, while the time actually comes from an existing column. My understanding is that we don't support MACRO parsing if it isn't specified in the time spec.

cc @chenqin

Make a thread-safe paged slice

A paged slice can effectively reduce memory waste by organizing internal compressed blocks for the whole data block.
However, concurrent reads of the list will require lots of locking and isolation for thread safety.

Not high priority, but an interesting problem to tackle.

Histogram function may crash the whole nebula node

Some error logs captured when a Nebula node died due to a hist() call in production:

I0128 19:12:14.229609 20261 Dsl.cpp:354] Nodes to execute the query: 1
I0128 19:12:14.230307 20261 BlockManager.cpp:147] Fetch blcoks 1009 / 1009 for table cdn_requests in window [1611838876, 1611860964].
I0128 19:12:14.230335 20261 NodeExecutor.cpp:84] Processing total blocks: 1009
F0128 19:12:14.232291 36092 Histogram-inl.h:34] Check failed: bucketSize_ > ValueType(0) (0 vs. 0)
*** Check failure stack trace: ***
*** Aborted at 1611861134 (Unix time, try 'date -d @1611861134') ***
*** Signal 6 (SIGABRT) (0x3e800000006) received by PID 6 (pthread TID 0x7fe8c5ec2700) (linux TID 36092) (maybe from PID 6, UID 1000) (code: -6), stack trace: ***
(error retrieving stack trace)

Support hot swapping schema

The schema of a table (data set) can be updated. When a user adds/removes fields on an existing data set, Nebula does not update its schema. The reason is that Nebula keeps the schema immutable from when it first sees the table schema.

The current workaround is to add a new data set name and delete the old one, basically keeping the schema immutable.

A better schema compatibility fix is expected:

  1. Users can update an existing table's schema.
  2. Nebula can support different schemas (old, new) coexisting at a given time point.
  3. When querying, Nebula will treat a missing field as NULL on some data blocks, but compute correctly on new data blocks.
  4. The schema presented to the UI/API should be a union.
  5. The schema is dynamic and captures all column data at any given time point.

This type of support is good for the UI (flexibility) but may not be good for the consistency expectations of the API. We should think a bit more carefully about this issue before implementing the support.
