Comments (15)
I would agree to add count() and bsize() methods for the whole space, but I won't support adding counts by a condition. In too many cases this will be a full scan.
In almost all cases I know, it is safe to not show the total count of whatever you return to the user.
from crud.
len() and bsize() don't help if the customer wants to count filtered tuples without actually loading them to the client (over the network, in the general case).
A full scan on a sharded space is not a big deal if it is not done often. We cannot avoid full scans completely in other cases anyway, so developers will always need to know how this works.
from crud.
It is a big deal because it stops other things from accessing a database. And yes, we definitely can avoid full scans. We just won't allow them through the API.
from crud.
If we don't allow full scans, they will not disappear from customers' tasks. The pain will just shift to another place.
We need some kind of support for such tasks to be able to implement these things in connectors.
from crud.
Your statement is demonstrably false. Aerospike and Redis can exist without such queries. They have the ability to iterate over the collection on the client the same way we propose to do with select or pairs.
To count the items of a large collection you can create a separate space with counters. With interactive transactions, you can atomically update both of those spaces from the client. This will not require you to write any additional code.
If you have only a few items to count, you can just select them all.
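A minimal sketch of the counter-space idea above, written as plain Lua so it runs without Tarantool. The `orders` and `counters` tables and all function names are hypothetical stand-ins for two spaces; in a real cluster the two writes in each function would have to happen inside one transaction.

```lua
-- Plain-Lua model of "data space + counter space". No Tarantool API is used;
-- both tables are hypothetical stand-ins for spaces.
local orders = {}      -- data space: id -> tuple
local counters = {}    -- counter space: status -> number of tuples

local function insert_order(id, status)
    assert(orders[id] == nil, 'duplicate id')
    -- In a real system these two writes belong to one transaction.
    orders[id] = { id = id, status = status }
    counters[status] = (counters[status] or 0) + 1
end

local function set_status(id, new_status)
    local tuple = assert(orders[id], 'no such order')
    counters[tuple.status] = counters[tuple.status] - 1
    counters[new_status] = (counters[new_status] or 0) + 1
    tuple.status = new_status
end

local function count_by_status(status)
    return counters[status] or 0   -- a single lookup, no scan
end

insert_order(1, 'NEW')
insert_order(2, 'NEW')
insert_order(3, 'DONE')
set_status(2, 'DONE')
-- NEW is now 1 and DONE is now 2
print(count_by_status('NEW'), count_by_status('DONE'))
```

The trade-off is that every write path has to remember to update the counter, which is the boilerplate the following comments argue about.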
from crud.
But counters in special spaces look like an implementation detail. Why can't we have a count() method in the CRUD API that does all this boilerplate under the hood?
As I see it, for every simple task like count that may involve scan complexity, we are going to push customers to reinvent the wheel. And connectors cannot help to avoid this, because there is no DDL API for now.
UPD: Another problem is that the CRUD API doesn't rely on any DDL API at the moment either.
from crud.
It seems it's time to triage this once more, because we have the following use case:
The user has to get a count by arbitrary conditions, and the user agrees that the result will not be accurate.
So I suggest implementing count_async:
- arguments and options like crud.select/crud.pairs;
- implement storage_count_async as a loop over pairs that counts the rows in the space, yielding every batch_size rows;
- the router must call storage_count_async on all replicasets.
To avoid locks and slowdowns, implement a mutex that guarantees storage_count_async runs no more than N times simultaneously on each storage.
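The proposal above could be sketched roughly as follows. This is plain Lua, not the crud implementation: coroutines stand in for Tarantool fibers, `coroutine.yield()` stands in for a fiber yield, and `storage_count`, `router_count`, and `MAX_RUNNING` are hypothetical names.

```lua
-- Plain-Lua model: each "storage" is just an array of tuples.
local MAX_RUNNING = 2   -- the "N" from the proposal (hypothetical value)
local running = 0       -- counts currently in progress on this storage

-- Counts matching tuples, yielding after every batch_size visited tuples.
-- Note: an error inside predicate would leak a slot; a real version needs cleanup.
local function storage_count(tuples, predicate, batch_size)
    if running >= MAX_RUNNING then
        return nil, 'too many concurrent counts'
    end
    running = running + 1
    local count, visited = 0, 0
    for _, tuple in ipairs(tuples) do
        if predicate(tuple) then count = count + 1 end
        visited = visited + 1
        if visited % batch_size == 0 then
            coroutine.yield()   -- let other work run between batches
        end
    end
    running = running - 1
    return count
end

-- Router side: run the count on every storage and sum the partial results.
local function router_count(storages, predicate, batch_size)
    local total = 0
    for _, tuples in ipairs(storages) do
        local co = coroutine.create(storage_count)
        local ok, res, err = coroutine.resume(co, tuples, predicate, batch_size)
        while ok and coroutine.status(co) ~= 'dead' do
            ok, res, err = coroutine.resume(co)
        end
        if not ok then return nil, res end       -- error raised in the storage
        if res == nil then return nil, err end   -- limiter refused the call
        total = total + res
    end
    return total
end

local storages = {
    { { status = 'NEW' }, { status = 'DONE' }, { status = 'NEW' } },
    { { status = 'NEW' } },
}
local total = router_count(storages, function(t) return t.status == 'NEW' end, 2)
print(total)
```

Because the scan yields between batches, tuples may change while it runs, which is why the result is only approximate.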
from crud.
@no1seman
Here we are solving a special case of a general problem: a map-reduce call for crud.
I suggest thinking about this in the direction of sending a stored procedure with a special contract for the return value and calling this procedure from the router.
For example, there is still a frequent task on the cluster: writing a set of data on the storage in a transaction, where many different operations have to be performed inside that transaction on the storage.
It is not necessary to send the procedure code through crud; you can simply teach it to call an already existing stored procedure.
from crud.
local count = crud.count({{'=', 'status', 'NEW'}})
Let's do something like
crud.count({ '=', 'status', 'NEW' }, {options})
where options:
- sec_scan, default value is false
Implementation:
count looks through the space indexes and finds an index for status.
- If the index is found, count iterates using it.
- If the index is not found, count iterates using pk if options.sec_scan == true
The same for bsize, pairs.
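The index lookup described above might look like this in plain Lua. The space layout, `find_index`, and `plan_count` are hypothetical, and returning an error when no index exists and sec_scan is not set is an assumption; the comment does not say what should happen in that case.

```lua
-- Plain-Lua model of choosing how to iterate for a single condition.
-- The space description below is hypothetical, not a Tarantool structure.
local function find_index(space, field)
    for _, index in ipairs(space.indexes) do
        if index.parts[1] == field then return index end
    end
    return nil
end

local function plan_count(space, condition, opts)
    opts = opts or {}
    local field = condition[2]   -- condition is { operator, field, value }
    local index = find_index(space, field)
    if index ~= nil then
        return { scan = 'index', index = index.name }
    end
    if opts.sec_scan == true then
        -- No suitable index: scan by the primary key (first index).
        return { scan = 'full', index = space.indexes[1].name }
    end
    -- Assumption: refuse the query rather than scan silently.
    return nil, ('no index for field "%s" and sec_scan is not set'):format(field)
end

local space = { indexes = {
    { name = 'pk', parts = { 'id' } },
    { name = 'by_status', parts = { 'status' } },
} }
local indexed = plan_count(space, { '=', 'status', 'NEW' })
local refused, err = plan_count(space, { '=', 'city', 'X' })
local scanned = plan_count(space, { '=', 'city', 'X' }, { sec_scan = true })
```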
from crud.
@unera Why not use the same API as select/pairs? The main difference from select/pairs: count does not fetch the data, and it works with yields. So it seems we need the following options:
- batch_size: the number of pairs iterations after which to yield
- use_box_count: in some cases, for example when the space is not huge, we may need to count precisely by index; but if the space is huge, we need to count approximately, with yields. (This option may be automatic: we can get the len of the space on the particular instance, and if it is larger than COUNT_HARD_LIMIT, we fall back to the approximate algorithm.)
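A small sketch of the automatic fallback mentioned above, assuming an explicit use_box_count always wins over the automatic choice. The threshold value and the `choose_count_mode` name are hypothetical.

```lua
-- Plain-Lua model of picking the counting strategy on one instance.
local COUNT_HARD_LIMIT = 1000   -- hypothetical threshold, in tuples

local function choose_count_mode(space_len, opts)
    opts = opts or {}
    if opts.use_box_count ~= nil then
        -- Assumption: an explicit option overrides the automatic choice.
        return opts.use_box_count and 'exact' or 'approximate'
    end
    if space_len > COUNT_HARD_LIMIT then
        return 'approximate'    -- scan with yields, result may be stale
    end
    return 'exact'              -- count by index without yielding
end

print(choose_count_mode(10))
print(choose_count_mode(5000))
print(choose_count_mode(5000, { use_box_count = true }))
```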
from crud.
One more way to kill the whole cluster with a single wrong query.
Fullscan and filters are pure evil for Tarantool.
from crud.
Why not to use the same API as select/pairs?
I agree :)
I didn't think that the question and select/pairs are different.
So, let's do it like select/pairs. Disregard my comment from 1 Oct.
from crud.
local objects, err = crud.count(space_name, conditions, opts)
Syntax is the same, excluding options:
- first
- after
- batch_size
- fields
from crud.
@unera batch_size may be used as the number of pairs iterations between yields, or there may be some other option for that.
from crud.
What about this case:
select count(field) from t1
when the field is nullable?
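To make the question concrete: in SQL, count(field) skips rows where the field is NULL, so it can differ from the row count. A plain-Lua illustration, where a missing key models a NULL field and both helper names are hypothetical:

```lua
-- Plain-Lua model: a missing key stands in for a NULL field.
local function count_rows(tuples)
    return #tuples
end

local function count_field(tuples, field)
    local n = 0
    for _, tuple in ipairs(tuples) do
        if tuple[field] ~= nil then n = n + 1 end   -- NULLs are not counted
    end
    return n
end

local t1 = {
    { id = 1, field = 'a' },
    { id = 2 },                 -- field is NULL here
    { id = 3, field = 'b' },
}
-- 3 rows in total, but only 2 with a non-NULL field
print(count_rows(t1), count_field(t1, 'field'))
```

A count API that only takes conditions has to say which of these two numbers it returns.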
Instead of inventing one more non-working 'killer feature', can we make a general map/reduce?
from crud.