Comments (8)
I used to get this error when creating assets on the demo setup, but I can't reproduce it locally. I'll run my test case against the demo as soon as it's available again and fetch the traceback.
from substra-backend.
I've just written a document about this error.
It is available here
Please review it and share your feedback.
Thanks @jmorel, do you have the associated traceback from the backend?
@jmorel Did you manage to reproduce it? It would be nice to have the backend traceback so that we can find a fix for this issue.
I suspect that we will need to change the way we initialize the Fabric client!
I got the error again while running a very large compute plan (about 1,400 tuples) on the demo env. Here is the stack trace from org4-worker:
[2020-01-08 08:04:52,559: ERROR/ForkPoolWorker-1] <_Rendezvous of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1578470692.558659550","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3934,"referenced_errors":[{"created":"@1578470692.558654186","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}"
>
Traceback (most recent call last):
  File "/usr/src/app/substrapp/ledger_utils.py", line 167, in call_ledger
    response = loop.run_until_complete(chaincode_calls[call_type](**params))
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/dist-packages/hfc/fabric/client.py", line 1640, in chaincode_invoke
    res = await asyncio.gather(*responses)
  File "/usr/local/lib/python3.6/dist-packages/aiogrpc/channel.py", line 40, in __call__
    return await fut
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1578470692.558659550","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3934,"referenced_errors":[{"created":"@1578470692.558654186","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/substrapp/ledger_utils.py", line 180, in call_ledger
    response = [r for r in e.args[0] if r.response.status != 200][0].response.message
IndexError: tuple index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/substrapp/tasks/tasks.py", line 472, in on_success
    log_success_tuple(tuple_type, subtuple['key'], retval['result'])
  File "/usr/src/app/substrapp/ledger_utils.py", line 371, in log_success_tuple
    _update_tuple_status(tuple_type, tuple_key, 'done', extra_kwargs=extra_kwargs)
  File "/usr/src/app/substrapp/ledger_utils.py", line 324, in _update_tuple_status
    update_ledger(fcn=invoke_fcn, args=invoke_args, sync=True)
  File "/usr/src/app/substrapp/ledger_utils.py", line 107, in _wrapper
    return fn(*args, **kwargs)
  File "/usr/src/app/substrapp/ledger_utils.py", line 233, in update_ledger
    return _invoke_ledger(*args, **kwargs)
  File "/usr/src/app/substrapp/ledger_utils.py", line 212, in _invoke_ledger
    response = call_ledger('invoke', fcn=fcn, args=args, kwargs=params)
  File "/usr/src/app/substrapp/ledger_utils.py", line 182, in call_ledger
    raise LedgerError(str(e))
substrapp.ledger_utils.LedgerError: <_Rendezvous of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses"
    debug_error_string = "{"created":"@1578470692.558659550","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3934,"referenced_errors":[{"created":"@1578470692.558654186","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}"
>
Thank you @jmorel.
We can see that we do not retry on this kind of error: https://github.com/SubstraFoundation/substra-backend/blob/master/backend/substrapp/ledger_utils.py#L89-L117
We could add this error to the list of errors we retry on.
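Adding it could look roughly like this, in the spirit of the `_wrapper` retry logic linked above (a sketch only; the function names, attempt count, and delays are assumptions, not the real `ledger_utils` code):

```python
import time

# Error messages we treat as transient (gRPC StatusCode.UNAVAILABLE).
RETRYABLE_MARKERS = (
    "failed to connect to all addresses",
    "UNAVAILABLE",
)

def retry_on_unavailable(fn, attempts=3, delay=0.1):
    """Wrap fn so transient connection errors are retried with backoff."""
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception as e:
                transient = any(m in str(e) for m in RETRYABLE_MARKERS)
                if not transient or attempt == attempts - 1:
                    raise  # non-retryable, or out of attempts
                time.sleep(delay * (2 ** attempt))  # exponential backoff
    return wrapper
```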
Thanks @jmorel, this is clearer, and it does look like a new error.
As seen in the traceback, the backend also fails to parse this error correctly (this should be fixed).
Before adding a retry, I think it would be worth understanding the cause of this error. Retrying may not be the only solution, nor the best long-term one.
Where are we on this one?
Since we now have a ledger retry strategy, it should protect against short connection interruptions between two nodes. For longer interruptions, it may be a bigger problem that should not be handled directly in the backend, no?
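One way to make the short-vs-long interruption split concrete is to bound retries by a wall-clock budget rather than an attempt count: blips shorter than the budget are absorbed, longer outages surface the error to the caller for higher-level handling. A sketch (the budget and delay values are illustrative, not backend settings):

```python
import time

def call_with_deadline(fn, budget_s=5.0, delay=0.1):
    """Retry fn until it succeeds or the wall-clock budget is exhausted.

    Short network interruptions are absorbed by the retry loop; anything
    lasting longer than budget_s re-raises so the caller can escalate.
    """
    deadline = time.monotonic() + budget_s
    while True:
        try:
            return fn()
        except Exception:
            if time.monotonic() + delay >= deadline:
                raise  # outage outlasted the budget: escalate
            time.sleep(delay)
```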