
Comments (12)

jamadden commented on July 18, 2024

I have some greenlets writing changes to a BTree, and another greenlet committing the changes... I probably should have used a separate transaction manager for every single greenlet.

I suspect that's your problem. BTrees are not safe for concurrent use; they do no locking to protect themselves from ending up in inconsistent internal states. They're meant to be used by a single thread at a time. In general, I think ZODB Connection and Persistent objects are not meant for concurrent use either: in the best case they load their state out of the database multiple times; in the worst case their state gets corrupted and you get errors like this. Asynchronously committing that transaction, and thus syncing the connection, doesn't help matters. (Note that I mean identical objects, where id(x) == id(y); each thread can load the same key from its own Connection and access it that way, but then each thread has its own copy of the object. A short sketch after the list below illustrates this.)

For this meaning, threads and greenlets are equivalent. They both potentially provide concurrency where concurrency was not expected.

  • Using ZEO with greenlets allows lots of concurrency as each ZEO operation crosses a socket, allowing a greenlet switch to occur.
  • Using greenlets with a FileStorage, in contrast, allows for no automatic concurrency because file operations do not switch greenlets, effectively serializing database access.
  • Using native threads with a FileStorage, however, might expose the same issues.
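
To make the parenthetical above concrete, here is a minimal sketch (using an in-memory storage and a made-up 'tree' root key) showing that the same persistent object loaded through two different Connections is two distinct Python objects:

    import transaction
    from ZODB import DB
    from BTrees.IOBTree import IOBTree

    db = DB(None)  # None gives an in-memory MappingStorage, handy for a demo
    setup = db.open()
    setup.root()['tree'] = IOBTree()  # 'tree' is just an illustrative key
    transaction.commit()
    setup.close()

    tm1 = transaction.TransactionManager()
    tm2 = transaction.TransactionManager()
    c1 = db.open(transaction_manager=tm1)
    c2 = db.open(transaction_manager=tm2)

    # Same OID, same database, but two separate in-memory copies:
    print(c1.root()['tree'] is c2.root()['tree'])  # False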


mgedmin commented on July 18, 2024

I do not think sharing a single ZODB connection object between multiple greenlets is a good idea either.

Normally, in a multi-threaded app, each thread would get its own connection and its own transaction manager (the latter managed transparently by transaction.get()).
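
For example, a per-worker pattern along these lines (a sketch; the add_value helper and the 'index' root key are made up for illustration):

    import transaction

    def add_value(db, key, value):
        # Give this worker its own transaction manager and its own
        # Connection; neither the Connection nor the objects loaded
        # through it are shared with any other thread or greenlet.
        tm = transaction.TransactionManager()
        conn = db.open(transaction_manager=tm)
        try:
            conn.root()['index'][key] = value  # 'index' is assumed to be a BTree
            tm.commit()
        except Exception:
            tm.abort()
            raise
        finally:
            conn.close()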


ml31415 commented on July 18, 2024

Hmm, looks like a lot of rewriting :/ I'm also a bit worried about performance, as I really have a zoo of microthreads. Might there be a way to force serialized access on the socket, just like with FileStorage?


jamadden commented on July 18, 2024
  • greenlets are really cheap and you probably don't have to worry about having too many
  • If you don't monkey patch the whole system (e.g., pass socket=False to the patch_all call), then using sockets won't allow greenlets to switch and ClientStorage will behave more like FileStorage. But that rather defeats the point of using greenlets because you've lost the opportunity to increase CPU utilization by improved concurrency. And you'd still be using Connection, BTrees and Persistent objects in a non-recommended configuration (sharing them between "threads").
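
For completeness, the selective patch mentioned in the second bullet would look roughly like this (gevent's patch_all accepts per-module flags):

    from gevent import monkey

    # Patch everything except the socket module: ZEO's socket calls then
    # block without yielding to other greenlets, which serializes database
    # access much like FileStorage does (and gives up the concurrency win).
    monkey.patch_all(socket=False)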


ml31415 commented on July 18, 2024

I was rather worried about the zoo of connections that would result if I attached one to every single greenlet. But anyway, it looks like there is no way around this. Thanks for the help, much appreciated! I just hope the segfault disappears as well once I fix this.


ml31415 commented on July 18, 2024

Now that I've refactored it so that each greenlet gets its own connection, I run into a myriad of ConflictErrors.

2016-03-29 10:11:36:ERROR:gutils:>(264)> failed with ConflictError
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/gevent/greenlet.py", line 327, in run
    result = self._run(*self.args, **self.kwargs)
  File "/home/michael/workspace/.../core/monitor/eventmangement.py", line 103, in _full_update
    joinall([self.workers.spawn(ds.update_if_required, event, conn=conn) for ds in self.simple_datasources])
  File "/home/michael/workspace/.../utils/database/bbdatabase.py", line 49, in __exit__
    self.close()
  File "/home/michael/workspace/.../utils/database/database.py", line 447, in close
    self.tm.commit()
  File "/usr/local/lib/python2.7/dist-packages/transaction/_manager.py", line 111, in commit
    return self.get().commit()
  File "/usr/local/lib/python2.7/dist-packages/transaction/_transaction.py", line 280, in commit
    reraise(t, v, tb)
  File "/usr/local/lib/python2.7/dist-packages/transaction/_transaction.py", line 271, in commit
    self._commitResources()
  File "/usr/local/lib/python2.7/dist-packages/transaction/_transaction.py", line 417, in _commitResources
    reraise(t, v, tb)
  File "/usr/local/lib/python2.7/dist-packages/transaction/_transaction.py", line 391, in _commitResources
    rm.commit(self)
  File "/usr/local/lib/python2.7/dist-packages/ZODB/Connection.py", line 572, in commit
    self._commit(transaction)
  File "/usr/local/lib/python2.7/dist-packages/ZODB/Connection.py", line 628, in _commit
    self._store_objects(ObjectWriter(obj), transaction)
  File "/usr/local/lib/python2.7/dist-packages/ZODB/Connection.py", line 687, in _store_objects
    s = self._storage.store(oid, serial, p, '', transaction)
  File "/usr/local/lib/python2.7/dist-packages/zc/zlibstorage/__init__.py", line 91, in store
    transaction)
  File "/usr/local/lib/python2.7/dist-packages/ZODB/FileStorage/FileStorage.py", line 517, in store
    oldserial, data)
  File "/usr/local/lib/python2.7/dist-packages/ZODB/ConflictResolution.py", line 303, in tryToResolveConflict
    data=newpickle)
ConflictError: database conflict error (oid 0x022c2b, class BTrees.IOBTree.IOBucket, serial this txn started with 0x03b62699589bbc00 2016-03-29 10:01:20.767579, serial currently committed 0x03b626a3034660cc 2016-03-29 10:11:00.767558)

Did I overlook something, or are BTrees not able to deal with concurrent writes at all? Since I use them as a kind of index, every data insertion I do ends up modifying several BTrees. Concurrent insertions are common there, and it's just something that I expect the database to handle and resolve by itself. All in all, I figure it's probably best for me to migrate my database away from ZODB completely, instead of trying to stretch the current approach further. :(


jamadden commented on July 18, 2024

If you are adding/removing/mutating the same key concurrently (from different greenlets) conflicts are inevitable; you can't expect BTrees to magically pick a winner in that case.

If you are adding/removing/mutating keys that sort into the same bucket, the chance of conflicts depends on exactly what you modified and how. BTrees does its best to resolve such conflicts, but they can still occur. It is for this reason that things like zope.intid use random keys; something like an incrementing counter, where the keys are basically guaranteed to sort into the same bucket, is a very bad idea.
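
A rough sketch of the random-key approach, in the spirit of zope.intid (the helper below is made up for illustration):

    import random
    from BTrees.IOBTree import IOBTree

    def add_with_random_key(tree, value):
        # Random keys scatter concurrent inserts across many buckets, so
        # two transactions rarely have to modify the same bucket.
        while True:
            key = random.randrange(0, 0x7FFFFFFF)
            if tree.insert(key, value):  # insert() returns 1 only if the key was new
                return key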

Note that small BTrees (with few keys) have very few buckets and so conflicts are much more likely, even with widely spaced keys. As the tree gets bigger and has more buckets, conflicts become less likely.

All of this comes down to the concurrency design of ZODB. ZODB uses optimistic concurrency instead of pessimistic locking: let as many transactions proceed as possible and resolve conflicts at commit time. If transactions conflict unresolvably, the first one to commit wins and the other is expected to retry (where it may find that it has no work left to do, because the work was already handled in the first transaction, or it otherwise typically doesn't generate a conflict the second time around).

Middleware like pyramid_tm and repoze.retry are meant to handle this automatically for WSGI apps. Likewise, the transaction package provides a context manager that can be used for the same purpose.
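
Outside WSGI, the retry loop can also be written by hand with the transaction manager's attempts() helper; a sketch (commit_with_retries and my_changes are placeholders):

    import transaction

    def commit_with_retries(tm, my_changes):
        # Each attempt runs the body and commits on a clean exit; a
        # retryable error such as ConflictError aborts and retries, and
        # is only re-raised once the attempts are exhausted.
        for attempt in tm.attempts(3):
            with attempt:
                my_changes()

Here tm would be the greenlet's own TransactionManager (the one its Connection was opened with), and my_changes should re-read whatever state it depends on, since the conflicting transaction may already have done part of the work.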


ml31415 commented on July 18, 2024

Writes to the same key should not happen. It's basically constant appending of more or less sequential numbers, so probably all ending up in the same bucket as you described, even though the trees are rather large (2M+ keys). Switching to random IDs now is not an option. I also really believe that two concurrent insertions into the same bucket should be automatically resolvable, something no user should ever be bothered with. I don't know of any other database that has issues with that. But OK, thanks for the explanation!


tseaver commented on July 18, 2024

Two clients inserting different keys into the same bucket do not cause conflicts, right up to the point that the bucket splits: at that point, both the target bucket and its parent must be mutated, generating a conflict.


jamadden commented on July 18, 2024

Ahh, that's why I said it depends. I knew that most cases didn't conflict but that some did; I just couldn't remember exactly what the conflict case was. Thanks for the details.


jamadden commented on July 18, 2024

So in the case of sequential-but-distinct keys, the rate of conflicts would be directly related to the rate at which buckets split, which in turn is defined by the class property max_leaf_size. This currently defaults to 120 but that can be changed (although only by subclassing, so not in existing trees, see #8). The defaults may be changing soon, see #28.
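
A sketch of the subclassing approach described above (assuming a BTrees version where the subclass attribute is honored; the value 500 is arbitrary, and as noted it only affects trees created from the subclass, not existing ones):

    from BTrees.IOBTree import IOBTree

    class WideIOBTree(IOBTree):
        # Larger leaves split less often, so sequential keys trigger
        # fewer bucket-split conflicts, at the cost of bigger pickles.
        max_leaf_size = 500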


tseaver commented on July 18, 2024

Yup. Note that increasing the bucket size to reduce conflicts trades off space (on disk and in RAM), but may also make it more likely that the whole tree can be held in the connection cache: the actual effect on a given application's performance is hard to predict and should be benchmarked.

