
Comments (18)

HouzuoGuo commented on May 22, 2024

In the default configuration, the data file (documents) grows in 128 MB increments, and the hash table has an initial capacity of:

  • 2^14 keys (16384 distinct values)
  • 100 entries per key

The original reason for such a large initial size was to let benchmarks take accurate measurements (> 1 second per feature) without being interrupted by file capacity growth.

But you are absolutely correct - in real usage scenarios, a high initial capacity is not desirable. What do you think about 32 MB data + 32 MB index?
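
For a sense of scale, here is a minimal sketch of the initial index capacity those defaults imply (the constant names are illustrative, not tiedot's):

package main

import "fmt"

func main() {
	const initialKeys = 1 << 14 // 2^14 = 16384 distinct hash keys
	const entriesPerKey = 100   // entry slots reserved per key
	// Total entry slots allocated before a single document is inserted:
	fmt.Println(initialKeys * entriesPerKey) // 1638400
}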


alexandrestein commented on May 22, 2024

It's a good idea... 👍
I think we already discussed that. :-)


HouzuoGuo commented on May 22, 2024

There is actually a "small_disk" branch that cuts the initial size down to only 4 MB per collection =D but as a result, performance is about 100x worse.

That sounds like a plan - I will fix the benchmarks and reduce the initial file size as well.


olekukonko commented on May 22, 2024

I see 3 other possibilities:

  • Making the size configurable (4, 8, 16, 32, 64, 128 MB) and recommending 128 MB for best performance
  • Introducing pre-emptive growth (based on percentage usage and the volume of data inserted per second) rather than growing on demand
  • Storing and reading data from memory, with the file system serving only as a backup


alexandrestein commented on May 22, 2024

Going from 128 MB down to 4 MB, I can imagine it has an impact on performance.
But maybe somewhere between those values you can find a balance between performance and DB file size.

You spoke about 32 MB, it sounds good to me 👍


olekukonko commented on May 22, 2024

@alexandrestein I agree 32 MB sounds good, but the size should be flexible


HouzuoGuo commented on May 22, 2024

How about offering two options:

  • Small collection - grows in 32 MB increments
  • Large collection - grows in 128 MB increments (current config)

And by default, the HTTP API creates a small collection; a request parameter can be set to create a large collection.

Benchmarks will continue to use the large collection.
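
As a rough sketch, that choice could boil down to a single growth constant per collection (the function and flag names here are hypothetical, standing in for the proposed request parameter):

// growthSize picks the growth increment for a new collection.
// The "large" flag is illustrative only; the actual HTTP parameter
// name is not decided here.
func growthSize(large bool) int {
	if large {
		return 128 * 1024 * 1024 // large collection: grow in 128 MB steps
	}
	return 32 * 1024 * 1024 // small collection (default): grow in 32 MB steps
}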


olekukonko commented on May 22, 2024

Does the collection growth really need to be fixed? How about: a small collection grows in 32 MB increments while its data is <= X, because growing a large collection by only 32 MB at a time might incur too much overhead.

Example

func getIncrement() int {
	size := getSize() // current collection file size in bytes
	switch {
	case size > 512*1024*1024: // above 512 MB
		return 128 * 1024 * 1024 // grow by 128 MB
	case size > 128*1024*1024: // above 128 MB
		return 64 * 1024 * 1024 // grow by 64 MB
	default:
		return 32 * 1024 * 1024 // grow by 32 MB
	}
}

We can still look for better thresholds after proper testing, but this is just an example
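
To see the schedule those thresholds produce, here is a tiny driver around the example above (getSize is stubbed out purely for illustration):

package main

import "fmt"

// size stands in for the current collection file size; getSize is a
// stub so the example above compiles on its own.
var size = 16 * 1024 * 1024 // pretend the collection starts at 16 MB

func getSize() int { return size }

func main() {
	for i := 0; i < 6; i++ {
		inc := getIncrement() // the function from the example above
		fmt.Printf("at %4d MB grow by %3d MB\n", size>>20, inc>>20)
		size += inc
	}
}

Starting from 16 MB it grows in 32 MB steps, switches to 64 MB steps once the file passes 128 MB, and to 128 MB steps past 512 MB.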


HouzuoGuo commented on May 22, 2024

That sounds like a rather nice idea.


HouzuoGuo commented on May 22, 2024

The next question may be more interesting: what shall we do with the hash table?

There are some difficulties with downsizing the hash table:

  • The algorithm is a classic static hash table (unfortunately, dynamic resizing is close to impossible)
  • The initial "head" buckets must be allocated upfront
  • Performance worsens a lot if the initial hash table size is brought down
  • Downsizing the hash table configuration is not feasible right now - it would break everyone's existing hash tables

There seem to be two easy solutions:

  • Rewrite hash table to use a better algorithm
  • Make hash table parameter configurable

What do you think, any better idea?
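
For intuition on why an existing table cannot simply be downsized: in a static hash table the key-to-bucket mapping is fixed by the bucket count chosen at creation time, so changing that count re-maps almost every key. A minimal illustration (not tiedot's actual code):

// bucketFor maps a key's hash to its head bucket in a static table.
// Because numBuckets is fixed when the file is created, changing it
// would send existing keys to different buckets, invalidating every
// entry already on disk.
func bucketFor(hash uint64, numBuckets uint64) uint64 {
	return hash % numBuckets
}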


alexandrestein commented on May 22, 2024

I don't think this is a big deal.

In most real cases, data won't grow by more than a few MB per minute. And 1 MB of data per minute is a lot, even if you store logs or things like that (I exclude the cases where you store images or binary files).

And those whose data grows like this probably take care of setting up the database correctly :-)

I think the best thing to do is to set a small growth size by default and let the user configure the database properly if they have special needs (a server app that appends a lot, or millions of users adding content).

I'm maybe wrong...


HouzuoGuo commented on May 22, 2024

I think dynamically determining collection growth is a very good idea.

How about:

  • Dynamically determine collection growth
  • Make hash table size configurable (user has choice of small/large)
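
A sketch of what those two knobs could look like together (the type and field names are hypothetical, not an actual tiedot API):

// CollectionConfig is an illustrative shape for the proposed options.
type CollectionConfig struct {
	DynamicGrowth bool // scale the growth increment with file size
	LargeIndex    bool // pre-allocate the large (2^14-key) hash table
}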


HouzuoGuo commented on May 22, 2024

See my comment in #23

What do you think?


HouzuoGuo commented on May 22, 2024

Fixed in nextgen - the number of collection partitions is now configurable. A collection with one partition will only use 32 MB of disk storage to begin with.
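
Assuming each partition needs the same starting allocation, the initial footprint scales linearly with the partition count; a back-of-the-envelope helper (illustrative only, based on the 32 MB per-partition figure above):

// initialDiskMB estimates a nextgen collection's starting on-disk
// size, assuming 32 MB per partition as stated above.
func initialDiskMB(partitions int) int {
	return partitions * 32
}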


agozie commented on May 22, 2024

I have 11 collections, 256 MB each, with no data at all! Some collections will only hold a few entries. This is a major blow to my project.
Any thoughts?


HouzuoGuo commented on May 22, 2024

@agozie sorry - the size of collection files depends on the number of CPUs on the system:
https://github.com/HouzuoGuo/tiedot/blob/master/db/db.go#L50

I intended for the initial size of a collection to depend on GOMAXPROCS, so that line must have been a mistake.

The collection file size can be reduced to 64 MB by replacing runtime.NumCPU() with 1.
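
In other words, with the linked line in place, the initial size scales with the CPU count; a rough sketch of the effect (the 64 MB per-partition figure follows from the comment above, and the function is illustrative, not tiedot's code):

import "runtime"

// initialCollectionMB approximates the starting collection file size
// when the partition count follows runtime.NumCPU(), assuming 64 MB
// per partition as implied above.
func initialCollectionMB() int {
	return runtime.NumCPU() * 64
}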


agozie commented on May 22, 2024

How will reducing runtime.NumCPU() to one affect performance? Thanks a lot.


HouzuoGuo commented on May 22, 2024

Reducing it to one should retain approx. 30% in your scenario. Run the benchmark ./tiedot -mode=bench to be sure.

