
Comments (18)

HouzuoGuo commented on May 22, 2024

In the default configuration, the data file (documents) grows in 128 MB increments, and the hash table has an initial capacity of:

  • 2^14 keys (16384 distinct values)
  • 100 entries per key

The original reason for such a large initial size was to let benchmarks take accurate measurements (> 1 second per feature) without being interrupted by file capacity growth.

But you are absolutely correct - in real usage scenarios, a high initial capacity is not desirable. What do you think about 32 MB data + 32 MB index?
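
For a sense of scale, here is a minimal sketch of the initial index capacity those defaults imply (the constant names are illustrative, not tiedot's):

package main

import "fmt"

func main() {
	const initialKeys = 1 << 14 // 2^14 = 16384 distinct hash keys
	const entriesPerKey = 100   // entry slots reserved per key
	// Total entry slots allocated before a single document is inserted:
	fmt.Println(initialKeys * entriesPerKey) // 1638400
}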


alexandrestein commented on May 22, 2024

It's a good idea... 👍
I think we already discussed that. :-)


HouzuoGuo commented on May 22, 2024

There is actually a "small_disk" branch that cuts the initial size down to only 4 MB per collection =D but as a result, performance is about 100x worse.

That sounds like a plan - I will fix the benchmarks and reduce the initial file size as well.


olekukonko commented on May 22, 2024

I see 3 other possibilities:

  • Making the size configurable (4, 8, 16, 32, 64, 128 MB) and recommending 128 MB for best performance
  • Introducing pre-emptive growth (based on percentage usage and the volume of data inserted per second) rather than growing on demand
  • Storing and reading data from memory, with the file system serving only as a backup


alexandrestein commented on May 22, 2024

Going from 128 MB down to 4 MB, I can imagine it has an impact on performance.
But maybe somewhere between those values you can find a balance between performance and DB file size.

You spoke about 32 MB, it sounds good to me 👍


olekukonko commented on May 22, 2024

@alexandrestein I agree 32 MB sounds good, but the size should be flexible


HouzuoGuo commented on May 22, 2024

How about offering two options:

  • Small collection - grows in 32 MB increments
  • Large collection - grows in 128 MB increments (current config)

And by default, the HTTP API creates a small collection; a request parameter can be set to create a large collection.

Benchmarks will continue to use the large collection.
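
As a rough sketch, that choice could boil down to a single growth constant per collection (the function and flag names here are hypothetical, standing in for the proposed request parameter):

// growthSize picks the growth increment for a new collection.
// The "large" flag is illustrative only; the actual HTTP parameter
// name is not decided here.
func growthSize(large bool) int {
	if large {
		return 128 * 1024 * 1024 // large collection: grow in 128 MB steps
	}
	return 32 * 1024 * 1024 // small collection (default): grow in 32 MB steps
}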


olekukonko commented on May 22, 2024

Does the collection growth really need to be fixed? How about: a small collection grows in 32 MB increments while its data is <= X, because growing a large collection by only 32 MB at a time might incur too much overhead.

Example

func getIncrement() int {
	size := getSize() // current collection file size in bytes
	switch {
	case size > 512*1024*1024: // above 512 MB
		return 128 * 1024 * 1024 // grow by 128 MB
	case size > 128*1024*1024: // above 128 MB
		return 64 * 1024 * 1024 // grow by 64 MB
	default:
		return 32 * 1024 * 1024 // grow by 32 MB
	}
}

We can still look for better thresholds after proper testing, but this is just an example
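
To see the schedule those thresholds produce, here is a tiny driver around the example above (getSize is stubbed out purely for illustration):

package main

import "fmt"

// size stands in for the current collection file size; getSize is a
// stub so the example above compiles on its own.
var size = 16 * 1024 * 1024 // pretend the collection starts at 16 MB

func getSize() int { return size }

func main() {
	for i := 0; i < 6; i++ {
		inc := getIncrement() // the function from the example above
		fmt.Printf("at %4d MB grow by %3d MB\n", size>>20, inc>>20)
		size += inc
	}
}

Starting from 16 MB it grows in 32 MB steps, switches to 64 MB steps once the file passes 128 MB, and to 128 MB steps past 512 MB.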


HouzuoGuo commented on May 22, 2024

That sounds like a rather nice idea.


HouzuoGuo commented on May 22, 2024

The next question may be more interesting: what shall we do with the hash table?

There are some difficulties with downsizing the hash table:

  • The algorithm is a classic static hash table (unfortunately, dynamic resizing is close to impossible)
  • The initial "head" buckets must be allocated upfront
  • Performance worsens a lot if the initial hash table size is brought down
  • Downsizing the hash table configuration is not feasible right now - it would break everyone's existing hash tables

There seem to be two easy solutions:

  • Rewrite hash table to use a better algorithm
  • Make hash table parameter configurable

What do you think, any better idea?
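
For intuition on why an existing table cannot simply be downsized: in a static hash table the key-to-bucket mapping is fixed by the bucket count chosen at creation time, so changing that count re-maps almost every key. A minimal illustration (not tiedot's actual code):

// bucketFor maps a key's hash to its head bucket in a static table.
// Because numBuckets is fixed when the file is created, changing it
// would send existing keys to different buckets, invalidating every
// entry already on disk.
func bucketFor(hash uint64, numBuckets uint64) uint64 {
	return hash % numBuckets
}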


alexandrestein commented on May 22, 2024

I don't think this is a big deal.

In most real cases, data won't grow by more than a few MB per minute. And 1 MB of data per minute is a lot, even if you store logs or things like that (I exclude the cases where you store images or binary files).

And those whose data grows like this probably take care of setting up the database correctly :-)

I think the best thing to do is to set a small growth size by default and let the user configure the database properly if they have special needs (a server app that appends a lot, or millions of users adding content).

I'm maybe wrong...


HouzuoGuo commented on May 22, 2024

I think dynamically determining collection growth is a very good idea.

How about:

  • Dynamically determine collection growth
  • Make hash table size configurable (user has choice of small/large)
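
A sketch of what those two knobs could look like together (the type and field names are hypothetical, not an actual tiedot API):

// CollectionConfig is an illustrative shape for the proposed options.
type CollectionConfig struct {
	DynamicGrowth bool // scale the growth increment with file size
	LargeIndex    bool // pre-allocate the large (2^14-key) hash table
}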


HouzuoGuo commented on May 22, 2024

See my comment in #23

What do you think?


HouzuoGuo commented on May 22, 2024

Fixed in nextgen - the number of collection partitions is now configurable. A collection with one partition will only use 32 MB of disk storage to begin with.
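
Assuming each partition needs the same starting allocation, the initial footprint scales linearly with the partition count; a back-of-the-envelope helper (illustrative only, based on the 32 MB per-partition figure above):

// initialDiskMB estimates a nextgen collection's starting on-disk
// size, assuming 32 MB per partition as stated above.
func initialDiskMB(partitions int) int {
	return partitions * 32
}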


agozie commented on May 22, 2024

I have 11 collections, 256 MB each, with no data at all! Some collections will only hold a few entries. This is a major blow to my project.
Any thoughts?


HouzuoGuo commented on May 22, 2024

@agozie sorry - the size of collection files depends on the number of CPUs on the system:
https://github.com/HouzuoGuo/tiedot/blob/master/db/db.go#L50

I intended for the initial size of a collection to depend on GOMAXPROCS, so that line must have been a mistake.

The collection file size can be reduced to 64 MB by replacing runtime.NumCPU() with 1.
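
In other words, with the linked line in place, the initial size scales with the CPU count; a rough sketch of the effect (the 64 MB per-partition figure follows from the comment above, and the function is illustrative, not tiedot's code):

import "runtime"

// initialCollectionMB approximates the starting collection file size
// when the partition count follows runtime.NumCPU(), assuming 64 MB
// per partition as implied above.
func initialCollectionMB() int {
	return runtime.NumCPU() * 64
}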


agozie commented on May 22, 2024

How will reducing runtime.NumCPU() to one affect performance? Thanks a lot.


HouzuoGuo commented on May 22, 2024

Reducing it to one should retain approx. 30% in your scenario. Run the benchmark ./tiedot -mode=bench to be sure.

