Comments (18)
In the default configuration, the data file (documents) grows in 128 MB increments, and the hash table has an initial capacity of:
- 2^14 keys (16384 distinct values)
- 100 entries per key
The original reason for such a large initial size was to let benchmarks take accurate measurements (> 1 second per feature) without being interrupted by file capacity growth.
But you are absolutely correct - in real usage scenarios, a high initial capacity is not desirable. What do you think about 32 MB data + 32 MB index?
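As a quick sanity check, the default capacity numbers above work out like this (a minimal sketch; initialCapacity is an illustrative helper, not a tiedot function):

```go
// initialCapacity returns the default hash table capacity described
// above: 2^14 head buckets ("keys"), each holding 100 entries.
func initialCapacity() (keys, entries int) {
	keys = 1 << 14       // 16384 distinct hash values
	entries = keys * 100 // 1638400 entries before overflow buckets
	return keys, entries
}
```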
from tiedot.
It's a good idea... 👍
I think we already discussed that. :-)
There actually is a "small_disk" branch that cuts down initial size to only 4 MB per collection =D but as a result, the performance is about 100x worse.
That sounds like a plan - I will fix benchmarks, and reduce initial file size as well.
I see 3 other possibilities:
- Make the growth size optional (4, 8, 16, 32, 64, 128 MB) and recommend 128 MB for best performance
- Introduce preemption (based on percentage usage and volume of data inserted per second) for the increment
- Store and read data from memory, with the file system serving only as a backup
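The preemption idea in the list above could be sketched roughly like this (the function name and thresholds are assumptions for illustration, not tiedot code):

```go
// shouldGrow is a hypothetical preemptive-growth check: grow when the
// file is over 90% full, or when the recent insert rate would exhaust
// the remaining headroom within 10 seconds. Thresholds are illustrative.
func shouldGrow(usedBytes, capBytes, bytesPerSec int) bool {
	headroom := capBytes - usedBytes
	return usedBytes*10 >= capBytes*9 || bytesPerSec*10 >= headroom
}
```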
Going from 130 MB down to 4 MB, I can imagine it has an impact on performance.
But maybe somewhere among those values you can find a balance between performance and DB file size.
You spoke about 32 MB; it sounds good to me 👍
@alexandrestein I agree, 32 MB sounds good, but the size should be flexible.
How about offering two options:
- Small collection - grows every 32 MB
- Large collection - grows every 128MB (current config)
And by default, HTTP API creates a small collection; a request parameter will be set for creating the large collection.
Benchmarks will continue to use large collection.
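As a sketch, the two options might map to growth constants like this (the function and parameter names are hypothetical, not tiedot's actual API):

```go
const (
	smallGrowth = 32 * 1024 * 1024  // 32 MB - the proposed HTTP API default
	largeGrowth = 128 * 1024 * 1024 // 128 MB - current config, opt-in
)

// growthFor picks the growth increment based on a hypothetical request
// parameter indicating a "large" collection.
func growthFor(large bool) int {
	if large {
		return largeGrowth
	}
	return smallGrowth
}
```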
Does the collection growth really need to be fixed? How about: a small collection grows by 32 MB only when data <= X, because growing a large data file by just 32 MB might be too much overhead.
Example:
func getIncrement() int {
	size := getSize()
	switch {
	case size > 536870912: // above 512 MB, grow by 128 MB
		return 134217728
	case size > 134217728: // above 128 MB, grow by 64 MB
		return 67108864
	default: // otherwise, grow by 32 MB
		return 33554432
	}
}
We can still look for better thresholds after proper testing, but this is just an example.
That sounds like a rather nice idea.
The next question may be more interesting: what shall we do with hash table?
There are some difficulties with downsizing the hash table:
- The algorithm is a classic static hash table (unfortunately, dynamic resizing is close to impossible)
- The initial "head" buckets must be allocated upfront.
- Performance worsens a lot if the initial hash table size is brought down.
- Shrinking the hash table configuration is not feasible right now - it would break everyone's existing hash tables.
There seem to be two easy solutions:
- Rewrite the hash table to use a better algorithm
- Make the hash table parameters configurable
What do you think, any better idea?
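To illustrate why dynamic resizing is close to impossible for a static hash table: a record's head bucket is derived from the bucket count, so changing that count after data has been written sends lookups to the wrong bucket (a minimal illustration, not tiedot's actual hashing code):

```go
// bucketOf maps a key hash to a head bucket. Because the bucket count
// participates in the mapping, the head buckets must be allocated
// upfront and the count can never change without rehashing everything.
func bucketOf(hash, numBuckets uint64) uint64 {
	return hash % numBuckets
}
```

For example, a key with hash 5000 lives in bucket 5000 under 2^14 buckets but in bucket 904 under 2^12 buckets, so shrinking the table strands existing records.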
I don't think this is a big deal.
In most real cases, data won't grow by more than a few MB per minute, and 1 MB of data per minute is already a lot even if you store logs or similar things (excluding cases where you store images or binary files).
And those whose data grows like that probably take care of setting up the database correctly :-)
I think the best thing to do is to set a small growth size by default and let users configure the database properly if they have special needs (a server app that appends a lot, or millions of users adding content).
I may be wrong...
I think dynamically determining collection growth is a very good idea.
How about:
- Dynamically determine collection growth
- Make hash table size configurable (user has choice of small/large)
See my comment in #23
What do you think?
Fixed in nextgen - the number of collection partitions is now configurable. A collection with one partition will only use 32 MB of disk storage in the beginning.
I have 11 collections, 256 MB each, with no data at all! Some collections will only hold a few entries. This is a major blow to my project.
Any thoughts?
@agozie sorry - the size of collection files depends on the number of CPUs on the system:
https://github.com/HouzuoGuo/tiedot/blob/master/db/db.go#L50
I intended for the initial size of a collection to depend on GOMAXPROCS, so that line must have been a mistake.
The collection file size can be reduced to 64 MB by replacing runtime.NumCPU() with 1.
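Based on the figures in this thread (64 MB with one partition; 256 MB observed, presumably on a 4-CPU machine), the initial size appears to scale like this (a hedged sketch; the per-partition constant is inferred from the comments, not read from the source):

```go
// initialCollectionSize estimates the initial on-disk size of a
// collection when the partition count follows runtime.NumCPU().
// The 64 MB per-partition figure is inferred from this thread.
func initialCollectionSize(partitions int) int {
	return partitions * 64 * 1024 * 1024
}
```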
How will reducing runtime.NumCPU() to one affect performance? Thanks a lot.
Reducing it to one should retain approximately 30% in your scenario. Run the benchmark (./tiedot -mode=bench) to be sure.