
Comments (3)

jinmingjian commented on May 23, 2024

@sanikolaev thanks for your interest. I have been busy with too many things these days :)

  1. DRAM is 32 GB × 6 = 192 GB (6-channel, 32 GB per channel, a standard configuration for a single-socket Xeon-SP bare-metal server). NOTE: the size of RAM is not important here, because we run each query multiple times (so the data sits in the various caches) and the query set is far smaller than 192 GB. (But it will be much more interesting to show a truly large dataset in the future.)

  2. The data is simple: 2 columns, one 32-bit integer per column (Datetime is implemented as 32-bit in both CH and TB), in a 1.47B-row stripped NYC taxi dataset. NOTE: the total number of columns in a table is not important here, because we are talking about column-wise stores.

  3. I am working on initial String support, so some early benchmark results from TPC-H should follow soon. (The alpha website was released sooner than I had imagined.)

  4. There is no paper yet, for lack of time... The initial storage engine is, in fact, primitive. The interesting part is how the data gets into storage: it does not use the common LSM tree (or anything similar), unlike CH and even the most popular open-source peers. The drawback of the LSM tree is that you pay for its fast writes over the long run. Two further questions you can ask here:

  • Does the LSM tree achieve the global optimum for servers that run 7x24?
  • How fast could we be if we discarded the LSM tree?
    And I think TensorBase gives its own innovative answers :smile:
  5. The full open-source release could come faster than I thought. Before that happens, I would like to invite some early users/people/partners to join the work more quickly. If you and others are interested in this, you can contact me through any channel.
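The "pay in the long run" point about LSM trees is usually quantified as write amplification: with leveled compaction, every byte a user writes is rewritten roughly once per level it passes through, multiplied by the level fanout. A minimal back-of-the-envelope sketch (the memtable size and fanout here are illustrative assumptions, not TensorBase or ClickHouse settings):

```python
import math

def leveled_write_amp(data_size_gb, memtable_gb=0.064, fanout=10):
    """Rough write amplification of a leveled-compaction LSM tree."""
    # Number of levels needed so fanout^levels * memtable covers the data.
    levels = max(1, math.ceil(math.log(data_size_gb / memtable_gb, fanout)))
    # Each byte is rewritten roughly `fanout` times per level, plus the
    # initial WAL write and memtable flush (the constant 2).
    return 2 + fanout * levels

print(leveled_write_amp(1000))  # → 52
```

So a 1 TB dataset can cost on the order of 50 device writes per logical write under this toy model, which is the long-run cost a write-optimized design avoids by not using an LSM tree.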

from tensorbase_frontier_edition.

sanikolaev commented on May 23, 2024

NOTE: the size of RAM is not important here, because we run each query multiple times (so the data sits in the various caches)

Why is that? It doesn't seem practical to me to measure only hot queries. In real analytics, running real queries, the chance of an IO operation is high.

and the query set is far smaller than 192 GB. (But it will be much more interesting to show a truly large dataset in the future.)

Yes, and even if you measure only hot queries it will be interesting to see the results when the data can't fully fit into RAM. Then you have to read from disk, and the storage format becomes the key thing: how well the data is compressed, in how many IOPS you can read it, what exactly you keep in the limited RAM while processing the query, etc.
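To make the disk-bound case concrete, here is a rough cold-scan model. Only the dataset shape (1.47B rows, two 32-bit columns) comes from the thread; the compression ratio, disk throughput, and decompression speed are illustrative assumptions:

```python
def cold_scan_seconds(raw_gb, compression_ratio, disk_mb_s, decomp_gb_s):
    """Estimate a cold full scan, assuming IO and decompression overlap."""
    compressed_gb = raw_gb / compression_ratio
    io_s = compressed_gb * 1024 / disk_mb_s   # time to read from disk
    cpu_s = raw_gb / decomp_gb_s              # time to decompress
    return max(io_s, cpu_s)

# The thread's dataset: 1.47B rows x 2 columns x 4 bytes ~= 11.76 GB raw.
raw_gb = 1.47e9 * 2 * 4 / 1e9
print(round(cold_scan_seconds(raw_gb, 3, 500, 4), 2))  # → 8.03
```

Under these assumptions the scan is IO-bound, so doubling the compression ratio roughly halves the query time, which is exactly why the storage format dominates once data spills out of RAM.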

I'll be happy to play with the open-source version when it's available.


jinmingjian commented on May 23, 2024

My answer is threefold:

  1. A benchmark should happen in an "apples to apples" context (in fact, many benchmarks fail to achieve this). Enabling the cache effect is exactly what makes the comparison apples to apples, because loading data from disk is too complex to control: cache mechanisms (how data is loaded from disk) vary widely, and the compression you mention is just one part of the bigger picture affecting loading. For example, I may use two layers of cache while you use only one; you may be faster on the one-shot first load, but I am better on overall performance. This is why the modern x86 CPU has a three-layer L1/L2/L3 cache.
  2. Also, as you said, hot data can be hit in the (memory) cache, so this comparison is still meaningful for a good share of real-world cases. This is why we have all these kinds of caches.
  3. You are right: IO-bound benchmarks are another important scenario, because data can't always fully fit into RAM. This is workload dependent, and it is in fact what TensorBase wants to solve better than its open-source peers. It should be possible to show interesting results in the next round of benchmarks, but the caveat is still "apples to apples". We can discuss this then.
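The "run each query multiple times so the data sits in cache" methodology discussed above can be sketched as a tiny benchmark harness: warmup passes populate whatever caches exist, and only hot runs are reported. This is a hypothetical illustration of the methodology, not the harness the authors actually used:

```python
import statistics
import time

def bench(fn, warmup=2, runs=5):
    """Time fn() hot: warm the caches first, then measure repeated runs."""
    # Warmup passes pull the data into the various caches, so the
    # measured runs compare engines "apples to apples" when hot.
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times), statistics.median(times)

data = list(range(1_000_000))
best, median = bench(lambda: sum(data))
print(f"best={best:.4f}s median={median:.4f}s")
```

Reporting both the best and the median of the hot runs makes it visible when a result is noisy rather than genuinely cache-resident.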

