
substra-backend's Introduction



Substra


Substra is an open source federated learning (FL) software. It enables the training and validation of machine learning models on distributed datasets. It provides a flexible Python interface and a web application to run federated learning training at scale. This specific repository is the low-level Python library used to interact with a Substra network.

Substra's main usage is in production environments. It has already been deployed and used by hospitals and biotech companies (see the MELLODDY project for instance). Substra can also be used on a single machine to perform FL simulations and debug code.

Substra was originally developed by Owkin and is now hosted by the Linux Foundation for AI and Data. Today Owkin is the main contributor to Substra.

Join the discussion on Slack and subscribe to our newsletter here.

To start using Substra

Have a look at our documentation.

Try out our MNIST example.

Support

If you need support, please either raise an issue on GitHub or ask on Slack.

Contributing

Substra warmly welcomes any contribution. Feel free to fork the repo and create a pull request.

Setup

To set up the project in development mode, run:

pip install -e ".[dev]"

To run all tests, use the following command:

make test

Some of the tests require Docker to be running on your machine.
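
You can check that the Docker daemon is reachable beforehand with:

docker info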

Code formatting

You can opt into auto-formatting of code on pre-commit using Black.

This relies on hooks managed by pre-commit, which you can set up as follows.

Install pre-commit, then run:

pre-commit install
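
Once the hooks are installed, you can also run them manually on the whole code base:

pre-commit run --all-files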

Documentation generation

To generate the command-line interface, SDK, and schemas documentation, Python 3.8 is required. Run the following command:

make doc

Documentation will be available in the references/ directory.

Changelog generation

The changelog is managed with towncrier. To add a new entry to the changelog, add a file in the changes folder. The file name should have the following structure: <unique_id>.<change_type>. The unique_id is a unique identifier; we currently use the PR number. The change_type can be one of the following: added, changed, removed, fixed.
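
For example, to add an entry for a hypothetical PR 123:

echo "Describe your change here." > changes/123.added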

To generate the changelog (for example during a release), use the following command (you must have the dev dependencies installed):

towncrier build --version=<x.y.z>

You can use the --draft option to see what would be generated without actually writing to the changelog (and without removing the fragments).

substra-backend's People

Contributors

acellard, alexandrepicosson, alexisdeh, aureliengasser, camillemarinisonos, cboillet-dev, clairephi, clementgautier, dependabot[bot], esadruhn, guilhem-barthes, guillaumecisco, hamdyd, ic-1101asterisk, inalgnu, jmorel, kaanyagci, kelvin-m, louishulot, maeldebon, mblottiere, milouu, oleobal, samlesu, sdgjlbl, sergebouchut2, thbcmlowk, thibaultfy, thibaultrobert, thibowk


substra-backend's Issues

Skaffold dev fails due to git clone timing out

When I run skaffold dev in the substra-backend repo, I get the following error:

Step 8/16 : RUN pip3 install -r requirements.txt
 ---> Running in e5731adf43f3
Collecting git+git://github.com/hyperledger/fabric-sdk-py.git@df19cf51ff4f21507869184901988c094658367a (from -r requirements.txt (line 31))
  Cloning git://github.com/hyperledger/fabric-sdk-py.git (to revision df19cf51ff4f21507869184901988c094658367a) to /tmp/pip-req-build-ydgheoes
  Running command git clone -q git://github.com/hyperledger/fabric-sdk-py.git /tmp/pip-req-build-ydgheoes
  fatal: unable to connect to github.com:
  github.com[0: 140.82.114.4]: errno=Connection timed out

ERROR: Command errored out with exit status 128: git clone -q git://github.com/hyperledger/fabric-sdk-py.git /tmp/pip-req-build-ydgheoes Check the logs for full command output.

I am on Ubuntu 18.04 (inside a VM).

When I run git clone git://github.com/hyperledger/fabric-sdk-py.git outside of the substra-backend repo, it also fails (Connection timed out), but git clone https://github.com/hyperledger/fabric-sdk-py.git works.
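
A possible workaround (not an official fix) is to make git rewrite git:// URLs to https:// before the clone runs:

git config --global url."https://github.com/".insteadOf git://github.com/

Note that this has to be applied wherever the clone actually happens (here, inside the Docker build).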

Static files are generated in the init container

With the new version of the server deployment, the static files are generated in the init container in production but are not present in the server container at the end. We should either remove this init container or copy those static files into the server container once they are generated.

single-node compute plan: worker tries to delete missing image

Setup:

  • Setup with 2 nodes
  • We run a compute plan that only has training tasks on node 2

The worker in node 1 encounters an error: it tries to delete an image that doesn't exist:

ERROR 2020-05-27 19:38:56,487 substrapp.tasks.tasks 696 140190742099776 404 Client Error: Not Found ("reference does not exist")
 Traceback (most recent call last):
   File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 261, in _raise_for_status
     response.raise_for_status()
   File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 940, in raise_for_status
     raise HTTPError(http_error_msg, response=self)
 requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.35/images/substra/algo_dc32f8dd?force=True&noprune=False
 During handling of the above exception, another exception occurred:
 Traceback (most recent call last):
   File "/usr/src/app/substrapp/tasks/tasks.py", line 983, in remove_algo_images
     client.images.remove(algo_docker, force=True)
   File "/usr/local/lib/python3.6/dist-packages/docker/models/images.py", line 463, in remove
     self.client.api.remove_image(*args, **kwargs)
   File "/usr/local/lib/python3.6/dist-packages/docker/utils/decorators.py", line 19, in wrapped
     return f(self, resource_id, *args, **kwargs)
   File "/usr/local/lib/python3.6/dist-packages/docker/api/image.py", line 495, in remove_image
     return self._result(res, True)
   File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 267, in _result
     self._raise_for_status(response)
   File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 263, in _raise_for_status
     raise create_api_error_from_http_exception(e)
   File "/usr/local/lib/python3.6/dist-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
     raise cls(e, response=response, explanation=explanation)
 docker.errors.NotFound: 404 Client Error: Not Found ("reference does not exist")

(After a quick investigation: it looks like we're not catching the correct exception type; NotFound is raised instead of ImageNotFound.)
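
For illustration, a hypothetical sketch of a cleanup helper that tolerates a missing image; catching docker.errors.NotFound (the parent of ImageNotFound) covers both cases:

import docker
from docker.errors import NotFound

client = docker.from_env()

def remove_algo_image(image_name):
    # Hypothetical helper, not the actual remove_algo_images implementation:
    # ignore images that were never built on this node (e.g. the compute plan
    # only ran on another node) instead of crashing on a 404.
    try:
        client.images.remove(image_name, force=True)
    except NotFound:
        pass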

LedgerTimeout and retry strategy

I ran into a complex issue with a wait_for_event call that times out.

It's not perfectly reproducible but I often see it.

When creating a traintuple with LEDGER_CALL_RETRY = True, the default in prod (with False, the default in dev settings, the user would simply get a Timeout error), I see:

create second traintuple                                                                                               
Object with key(s) '363f70dcc3bf22fdce65e36c957e855b7cd3e2828e6909f34ccc97ee6218541a' already exists. 

If we look at the log in the backend

exception calling callback for <Future at 0x7f1217d4bb00 state=finished raised _Rendezvous>                                                                                                                                                   
Traceback (most recent call last):                                                                                                                                                                                                            
  File "/usr/local/lib/python3.6/concurrent/futures/thread.py", line 56, in run                                                                                                                                                               
    result = self.fn(*self.args, **self.kwargs)                                                                                                                                                                                               
  File "/usr/local/lib/python3.6/site-packages/aiogrpc/utils.py", line 126, in _next                                                                                                                                                          
    return next(self._iterator)                                                                                                                                                                                                               
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 388, in __next__                                                                                                                                                       
    return self._next()                                                                                                                                                                                                                       
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 382, in _next                                                                                                                                                          
    raise self                                                                                                                                                                                                                                
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:                                                                                                                                                                          
        status = StatusCode.CANCELLED                                                                                                                                                                                                         
        details = "Locally cancelled by application!"                                                                                                                                                                                         
        debug_error_string = "None"                                                                                                                                                                                                           
>                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                       
During handling of the above exception, another exception occurred:                                                                                                                                                                           
                                                                                                                                                                                                                                              
Traceback (most recent call last):                                                                                                                                                                                                            
  File "/usr/local/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks                                                                                                                                                 
    callback(self)                                                                                                                                                                                                                            
  File "/usr/local/lib/python3.6/asyncio/futures.py", line 417, in _call_set_state                                                                                                                                                            
    dest_loop.call_soon_threadsafe(_set_state, destination, source)                                                                                                                                                                           
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 637, in call_soon_threadsafe                                                                                                                                                   
    self._check_closed()                                                                                                                                                                                                                      
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 377, in _check_closed                                                                                                                                                          
    raise RuntimeError('Event loop is closed')                                                                                                                                                                                                
RuntimeError: Event loop is closed                                                                                                                                                                                                            
exception calling callback for <Future at 0x7f1217dba940 state=finished raised _Rendezvous>                                                                                                                                                   
Traceback (most recent call last):            
  File "/usr/local/lib/python3.6/concurrent/futures/thread.py", line 56, in run                                                                                                                                                               
    result = self.fn(*self.args, **self.kwargs)                                      
  File "/usr/local/lib/python3.6/site-packages/aiogrpc/utils.py", line 126, in _next                                                                                                                                                          
    return next(self._iterator)
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 388, in __next__                                                                                                                                                       
    return self._next()                            
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 382, in _next                                   
    raise self                                                                                                                                                                                                                                
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.CANCELLED                      
        details = "Locally cancelled by application!"
        debug_error_string = "None"                                                                                   
>                 
                                                                                                                                                                                                                                              
During handling of the above exception, another exception occurred:                        
                                                           
Traceback (most recent call last):                                                                                                                                                                                                            
  File "/usr/local/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks                          
    callback(self)                                                                                                                                                                                                                            
  File "/usr/local/lib/python3.6/asyncio/futures.py", line 417, in _call_set_state                                                                                                                                                            
    dest_loop.call_soon_threadsafe(_set_state, destination, source)                                                                                                                                                                           
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 637, in call_soon_threadsafe                            
    self._check_closed()                                                                                                                                                                                                                      
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 377, in _check_closed                                  
    raise RuntimeError('Event loop is closed')             
RuntimeError: Event loop is closed                                                                                                                                                                                                            
Function invoke_ledger failed (<class 'substrapp.ledger_utils.LedgerTimeout'>): waitForEvent timed out. retrying in 2s     

We can see that the wait_for_event times out and triggers the retry strategy. But the traintuple was already committed in the ledger and created a todo event.
So when retrying, we just get a 409, which is returned to the user.

INFO - 2019-10-24 09:23:47,455 - events.apps - Processing task a2171a1c09738c677748346d22d2b5eea47f874a3b4f4b75224674235892de72: type=traintuple status=todo with tx status: VALID                                                                                                             
[24/Oct/2019 09:23:47] "POST /traintuple/ HTTP/1.1" 409 115                                                                                                                                                                                   

This misleads the end user.

Moreover, it could turn into a bigger issue: for assets stored in localdb, such a failure (the 409) will trigger an instance.delete().
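
For illustration only, a hypothetical sketch of a retry wrapper that treats the 409 as a success; apart from LedgerTimeout (seen in the logs above), the names are made up:

class LedgerTimeout(Exception):
    """Stand-in for substrapp.ledger_utils.LedgerTimeout."""

class LedgerConflict(Exception):
    """Hypothetical exception for the 409 'already exists' case."""

def invoke_with_retry(invoke, *args, **kwargs):
    # If waitForEvent timed out but the transaction was actually committed,
    # the retry hits a 409 'already exists'; treat that as a success instead
    # of returning an error to the user (which, for assets stored in localdb,
    # would also trigger instance.delete()).
    try:
        return invoke(*args, **kwargs)
    except LedgerTimeout:
        try:
            return invoke(*args, **kwargs)
        except LedgerConflict:
            return {'already_exists': True}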

Chart not compatible with K8S >=1.16

Hello there,

One of the daemonsets deployed when you use the "gpu" feature flag (daemonset-nvidia-plugin.yaml) uses the wrong API version, which results in this error: no matches for kind "DaemonSet" in version "extensions/v1beta1". I suggest removing the feature and pointing to the NVIDIA device plugin documentation: https://github.com/NVIDIA/k8s-device-plugin

Worker: have a proper error if the medias volume is not mounted

If the medias volume is not mounted properly on the worker, the tuple execution fails with the following message:

ERROR 2020-04-15 13:43:08,556 substrapp.tasks.tasks 708 139767323916096 [00-01-0126-1e30844]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/src/app/substrapp/tasks/tasks.py", line 550, in compute_task
    max_retries=int(getattr(settings, 'CELERY_TASK_MAX_RETRIES')))
  File "/usr/local/lib/python3.6/dist-packages/celery/app/task.py", line 704, in retry
    raise_with_context(exc)
  File "/usr/src/app/substrapp/tasks/tasks.py", line 543, in compute_task
    prepare_materials(subtuple, tuple_type)
  File "/usr/src/app/substrapp/tasks/utils.py", line 26, in timed
    result = function(*args, **kw)
  File "/usr/src/app/substrapp/tasks/tasks.py", line 580, in prepare_materials
    prepare_opener(directory, subtuple)
  File "/usr/src/app/substrapp/tasks/utils.py", line 26, in timed
    result = function(*args, **kw)
  File "/usr/src/app/substrapp/tasks/tasks.py", line 333, in prepare_opener
    raise Exception('DataOpener Hash in Subtuple is not the same as in local db')
Exception: DataOpener Hash in Subtuple is not the same as in local db

This message is ambiguous and doesn't point to the actual cause of the error, which makes debugging difficult.

The function get_hash must be improved.
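
For illustration, a hypothetical version of get_hash that surfaces the missing-file case explicitly (the actual signature and hashing scheme may differ):

import hashlib
import os

def get_hash(path, key=None):
    # Fail loudly when the file is missing, which typically means the medias
    # volume is not mounted on the worker, instead of letting the later hash
    # comparison fail with a cryptic "Hash ... is not the same" message.
    if not os.path.exists(path):
        raise Exception(f'{path} does not exist: is the medias volume mounted on the worker?')
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(64 * 1024), b''):
            h.update(chunk)
    if key is not None:
        h.update(key if isinstance(key, bytes) else key.encode())
    return h.hexdigest()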

Create an aggregatetuple without in model key

Original issue from the substra repository

It is possible to create an aggregatetuple without passing any --in-model-key in the command.

Issue in the backend

The in_models_keys field of an aggregatetuple isn't required and has a minimum length of 0.

As the sole purpose of the aggregatetuple is to aggregate at least 2 models together, the in_models_keys should be required and should have a minimum length of 2.
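
A hypothetical DRF-style sketch of the stricter validation described above (serializer and field names are illustrative):

from rest_framework import serializers

class AggregatetupleSerializer(serializers.Serializer):
    # Require at least two input model keys: aggregating fewer than two
    # models defeats the purpose of an aggregatetuple.
    in_models_keys = serializers.ListField(
        child=serializers.CharField(),
        required=True,
        min_length=2,
    )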

Node register app can fail

Sometimes, when we start substra-backend with docker-compose (cf. docker/start.py) with dev settings, the node-register app can fail (see logs below) and block the container, as runserver will not exit.

In production, we start substra-backend with uwsgi, whose need-app parameter prevents this issue.

INFO - 2019-11-04 13:43:30,885 - events.apps - Start the event application.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/hfc/fabric/client.py", line 1711, in chaincode_invoke
    timeout=wait_for_event_timeout)
  File "/usr/local/lib/python3.6/asyncio/tasks.py", line 362, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/substrapp/ledger_utils.py", line 159, in call_ledger
    response = loop.run_until_complete(chaincode_calls[call_type](**params))
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/site-packages/hfc/fabric/client.py", line 1721, in chaincode_invoke
    raise TimeoutError('waitForEvent timed out.')
TimeoutError: waitForEvent timed out.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "manage.py", line 15, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.6/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.6/site-packages/django/core/management/__init__.py", line 357, in execute
    django.setup()
  File "/usr/local/lib/python3.6/site-packages/django/__init__.py", line 24, in setup
    apps.populate(settings.INSTALLED_APPS)
  File "/usr/local/lib/python3.6/site-packages/django/apps/registry.py", line 120, in populate
    app_config.ready()
  File "/usr/src/app/node-register/apps.py", line 10, in ready
    invoke_ledger(fcn='registerNode', args=[''], sync=True)
exception calling callback for <Future at 0x7f60aeeb2cc0 state=finished raised _Rendezvous>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.6/site-packages/aiogrpc/utils.py", line 126, in _next
    return next(self._iterator)
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 388, in __next__
    return self._next()
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 382, in _next
    raise self
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.CANCELLED
        details = "Locally cancelled by application!"
        debug_error_string = "None"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "/usr/local/lib/python3.6/asyncio/futures.py", line 417, in _call_set_state
    dest_loop.call_soon_threadsafe(_set_state, destination, source)
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 637, in call_soon_threadsafe
    self._check_closed()
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 377, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed  File "/usr/src/app/substrapp/ledger_utils.py", line 90, in _wrapper

    return fn(*args, **kwargs)
  File "/usr/src/app/substrapp/ledger_utils.py", line 208, in invoke_ledger
    response = call_ledger('invoke', fcn=fcn, args=args, kwargs=params)
  File "/usr/src/app/substrapp/ledger_utils.py", line 161, in call_ledger
    raise LedgerTimeout(str(e))
substrapp.ledger_utils.LedgerTimeout: waitForEvent timed out.
exception calling callback for <Future at 0x7f60aef01cf8 state=finished raised _Rendezvous>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.6/site-packages/aiogrpc/utils.py", line 126, in _next
    return next(self._iterator)
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 388, in __next__
    return self._next()
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 382, in _next
    raise self
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.CANCELLED
        details = "Locally cancelled by application!"
        debug_error_string = "None"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "/usr/local/lib/python3.6/asyncio/futures.py", line 417, in _call_set_state
    dest_loop.call_soon_threadsafe(_set_state, destination, source)
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 637, in call_soon_threadsafe
    self._check_closed()
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 377, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

Backend - Ubuntu broken apt source?

On macOS Catalina
Substra-backend --> master branch --> skaffold dev --> breaks at step 4
It fails to fetch this resource, which blocks the install: http://archive.ubuntu.com/ubuntu/pool/main/p/publicsuffix/publicsuffix_20180223.1310-1_all.deb

Any suggestion please? :'(

Step 4/13 : RUN apt-get install -y git curl netcat
---> Running in 7f0fd0bf027b
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
git-man krb5-locales less libbsd0 libcurl3-gnutls libcurl4 libedit2
liberror-perl libgssapi-krb5-2 libk5crypto3 libkeyutils1 libkrb5-3
libkrb5support0 libnghttp2-14 libpsl5 librtmp1 libssl1.0.0 libx11-6
libx11-data libxau6 libxcb1 libxdmcp6 libxext6 libxmuu1 multiarch-support
netcat-traditional openssh-client publicsuffix xauth
Suggested packages:
gettext-base git-daemon-run | git-daemon-sysvinit git-doc git-el git-email
git-gui gitk gitweb git-cvs git-mediawiki git-svn krb5-doc krb5-user
keychain libpam-ssh monkeysphere ssh-askpass
The following NEW packages will be installed:
curl git git-man krb5-locales less libbsd0 libcurl3-gnutls libcurl4 libedit2
liberror-perl libgssapi-krb5-2 libk5crypto3 libkeyutils1 libkrb5-3
libkrb5support0 libnghttp2-14 libpsl5 librtmp1 libssl1.0.0 libx11-6
libx11-data libxau6 libxcb1 libxdmcp6 libxext6 libxmuu1 multiarch-support
netcat netcat-traditional openssh-client publicsuffix xauth
0 upgraded, 32 newly installed, 0 to remove and 6 not upgraded.
Need to get 8949 kB of archives.
After this operation, 50.7 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/main amd64 multiarch-support amd64 2.27-3ubuntu1 [6916 B]
Get:2 http://archive.ubuntu.com/ubuntu bionic/main amd64 libxau6 amd64 1:1.0.8-1 [8376 B]
Get:3 http://archive.ubuntu.com/ubuntu bionic/main amd64 libbsd0 amd64 0.8.7-1 [41.5 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic/main amd64 libxdmcp6 amd64 1:1.1.2-3 [10.7 kB]
Get:5 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libxcb1 amd64 1.13-2~ubuntu18.04 [45.5 kB]
Get:6 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libx11-data all 2:1.6.4-3ubuntu0.2 [113 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libx11-6 amd64 2:1.6.4-3ubuntu0.2 [569 kB]
Get:8 http://archive.ubuntu.com/ubuntu bionic/main amd64 libxext6 amd64 2:1.3.3-1 [29.4 kB]
Get:9 http://archive.ubuntu.com/ubuntu bionic/main amd64 less amd64 487-0.1 [112 kB]
Get:10 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 krb5-locales all 1.16-2ubuntu0.1 [13.5 kB]
Get:11 http://archive.ubuntu.com/ubuntu bionic/main amd64 libedit2 amd64 3.1-20170329-1 [76.9 kB]
Get:12 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libkrb5support0 amd64 1.16-2ubuntu0.1 [30.9 kB]
Get:13 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libk5crypto3 amd64 1.16-2ubuntu0.1 [85.6 kB]
Get:14 http://archive.ubuntu.com/ubuntu bionic/main amd64 libkeyutils1 amd64 1.5.9-9.2ubuntu2 [8720 B]
Get:15 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libkrb5-3 amd64 1.16-2ubuntu0.1 [279 kB]
Get:16 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libgssapi-krb5-2 amd64 1.16-2ubuntu0.1 [122 kB]
Get:17 http://archive.ubuntu.com/ubuntu bionic/main amd64 libpsl5 amd64 0.19.1-5build1 [41.8 kB]
Get:18 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libssl1.0.0 amd64 1.0.2n-1ubuntu5.3 [1088 kB]
Get:19 http://archive.ubuntu.com/ubuntu bionic/main amd64 libxmuu1 amd64 2:1.1.2-2 [9674 B]
Get:20 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 openssh-client amd64 1:7.6p1-4ubuntu0.3 [614 kB]
Get:21 http://archive.ubuntu.com/ubuntu bionic/main amd64 publicsuffix all 20180223.1310-1 [97.6 kB]
Err:21 http://archive.ubuntu.com/ubuntu bionic/main amd64 publicsuffix all 20180223.1310-1
Undetermined Error [IP: 91.189.88.31 80]
Get:22 http://archive.ubuntu.com/ubuntu bionic/main amd64 xauth amd64 1:1.0.10-1 [24.6 kB]
Get:23 http://archive.ubuntu.com/ubuntu bionic/main amd64 libnghttp2-14 amd64 1.30.0-1ubuntu1 [77.8 kB]
Get:24 http://archive.ubuntu.com/ubuntu bionic/main amd64 librtmp1 amd64 2.4+20151223.gitfa8646d.1-1 [54.2 kB]
Get:25 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libcurl4 amd64 7.58.0-2ubuntu3.8 [214 kB]
Get:26 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 curl amd64 7.58.0-2ubuntu3.8 [159 kB]
Get:27 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libcurl3-gnutls amd64 7.58.0-2ubuntu3.8 [213 kB]
Get:28 http://archive.ubuntu.com/ubuntu bionic/main amd64 liberror-perl all 0.17025-1 [22.8 kB]
Get:29 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 git-man all 1:2.17.1-1ubuntu0.5 [803 kB]
Get:30 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 git amd64 1:2.17.1-1ubuntu0.5 [3912 kB]
Get:31 http://archive.ubuntu.com/ubuntu bionic/universe amd64 netcat-traditional amd64 1.10-41.1 [61.7 kB]
Get:32 http://archive.ubuntu.com/ubuntu bionic/universe amd64 netcat all 1.10-41.1 [3436 B]
Fetched 8851 kB in 5s (1666 kB/s)
E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/p/publicsuffix/publicsuffix_20180223.1310-1_all.deb Undetermined Error [IP: 91.189.88.31 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
FATA[0174] exiting dev mode because first build failed: build failed: build failed: building [substrafoundation/celeryworker]: build artifact: unable to stream build output: The command '/bin/sh -c apt-get install -y git curl netcat' returned a non-zero code: 100

403 on substra-backend.node-1.com

Spoiler alert: this seems to be a Firefox-only issue, so you know...

I am getting a 403 - Forbidden: Request for an Unsupported Host Name (webpage title) when I try to access http://substra-backend.node-1.com/, but everything is fine on http://substra-backend.node-2.com/. Frontends 1 & 2 are working as expected. And curl http://substra-backend.node-1.com/ works perfectly. CLI login on this node is OK. I am not finding any errors in the logs...

The webpage displayed is:

Unknown Host Request Forbidden

Your request to this server is for a Host Name that is unknown to this server or unsupported by this server.

Additional Information:

You are seeing this message because a request for a Web Site or Domain Name was directed to this server, but this server has not been configured to support requests for that Web Site or Domain Name.  Possible causes are (1) the domain name you were requesting has been incorrectly configured to point to the IP address of this server, (2) a Host File on your system/network has been incorrectly configured to direct requests for the domain name you specified to the IP address of this server, or (3) another web site on the internet has been incorrectly configured to redirect requests for the domain name you specified to the IP address of this server. It may also be possible that the incorrect configuration is not accidental but rather intentional designed with a malevolent intent. You should contact the holder or the administrator of the domain name you were requesting for investigation and resolution. 
  • GET Request headers:
Host: substra-backend.node-1.com
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
Connection: keep-alive
Cookie: csrftoken=CRw0ntGtzVEeIhHWW9PMPQeGXqZoR1orD8996ZSwnodIohqLTTFPvxTRadENrwAw; sessionid=lgip14h9bko6vq3jlm6kxdpuxpbfymmj
Upgrade-Insecure-Requests: 1
If-Modified-Since: Mon, 21 Mar 2005 22:15:26 GMT
If-None-Match: "5a6-3f2da10a25b80"
Cache-Control: max-age=0
  • GET Response headers (note: the HTTP status is 304 (Not Modified), not the 403 displayed in the webpage title):
HTTP/1.1 304 Not Modified
Date: Mon, 10 Feb 2020 14:14:55 GMT
Connection: keep-alive
Keep-Alive: timeout=30
Server: Apache/2
ETag: "5a6-3f2da10a25b80"
Expires: Mon, 10 Feb 2020 15:14:55 GMT
Cache-Control: max-age=3600
Vary: Host
Accept-Ranges: bytes
Age: 0

Configuration:

  • Ubuntu Desktop: 19.10
  • Firefox: 72.0.2
  • Tried the same request deactivating all extensions => Same result
  • No weird hosts config
  • Works perfectly with Chromium...

Do you have any clue to make it work with Firefox? Any specific log to check?

Fix swagger doc

The current auto-generated swagger documentation (/doc) is broken, probably as a result of our last DRF update.

The library it relied on has also been deprecated since mid-2019, so instead of just fixing it, we'll need to switch to something new.

A potential replacement is https://github.com/axnsan12/drf-yasg

We need to make sure that the fixed swagger doc correctly specifies how file uploads are handled. That was the purpose of the SchemaGenerator class in backend/views.py.
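
For reference, a minimal sketch of what wiring drf-yasg could look like (untested here; the file-upload behaviour previously handled by SchemaGenerator would still need to be covered):

from django.urls import path
from drf_yasg import openapi
from drf_yasg.views import get_schema_view
from rest_framework import permissions

schema_view = get_schema_view(
    openapi.Info(title='substra-backend API', default_version='v0'),
    public=True,
    permission_classes=(permissions.AllowAny,),
)

urlpatterns = [
    # Serve the interactive swagger UI on /doc, as before.
    path('doc/', schema_view.with_ui('swagger', cache_timeout=0), name='schema-swagger-ui'),
]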

Local setup (skaffold): Incorrect DNS assumptions

The default "local" installation of the susbtra stack (skaffold run) assume that custom /etc/hosts entries present on the host are used during DNS resolution from inside running pods.

That appears to be the case in some setups:

  • Docker-for-mac
  • minikube with virtualbox driver

However, it appears not to be the case in other setups:

So, effectively, the default local installation of the substra stack only works in some environments.

Algo fetch: HTTP errors aren't displayed in the logs

HTTP errors occurring during the algo fetch aren't reported in the logs. This makes it hard to understand what is going on.

For instance, in this issue, the logs only show a hash doesn't match [...] error, which is unhelpful: the real source of the problem is the HTTP 403 error that occurred prior to the hash computation.
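
A hypothetical sketch of how the fetch could log HTTP errors before the hash check (names are illustrative, not the actual backend code):

import logging
import requests

logger = logging.getLogger(__name__)

def fetch_resource(url, **kwargs):
    # Log and re-raise HTTP errors (e.g. a 403) so they show up in the logs
    # instead of surfacing later as an unrelated "hash doesn't match" error.
    response = requests.get(url, **kwargs)
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        logger.error('Fetching %s failed: %s', url, exc)
        raise
    return response.content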

Error PHANTOM_READ_CONFLICT

I was adding lots of chained tuples on the demo env and the execution of the very first one failed with the following traceback in the worker:

[2020-01-09 10:29:11,067: ERROR/ForkPoolWorker-1] ['PHANTOM_READ_CONFLICT', 'PHANTOM_READ_CONFLICT', 'PHANTOM_READ_CONFLICT', 'PHANTOM_READ_CONFLICT']
Traceback (most recent call last):
  File "/usr/src/app/substrapp/ledger_utils.py", line 167, in call_ledger
    response = loop.run_until_complete(chaincode_calls[call_type](**params))
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/dist-packages/hfc/fabric/client.py", line 1729, in chaincode_invoke
    raise Exception(statuses)
Exception: ['PHANTOM_READ_CONFLICT', 'PHANTOM_READ_CONFLICT', 'PHANTOM_READ_CONFLICT', 'PHANTOM_READ_CONFLICT']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/substrapp/ledger_utils.py", line 180, in call_ledger
    response = [r for r in e.args[0] if r.response.status != 200][0].response.message
  File "/usr/src/app/substrapp/ledger_utils.py", line 180, in <listcomp>
    response = [r for r in e.args[0] if r.response.status != 200][0].response.message
AttributeError: 'str' object has no attribute 'response'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/substrapp/tasks/tasks.py", line 472, in on_success
    log_success_tuple(tuple_type, subtuple['key'], retval['result'])
  File "/usr/src/app/substrapp/ledger_utils.py", line 371, in log_success_tuple
    _update_tuple_status(tuple_type, tuple_key, 'done', extra_kwargs=extra_kwargs)
  File "/usr/src/app/substrapp/ledger_utils.py", line 324, in _update_tuple_status
    update_ledger(fcn=invoke_fcn, args=invoke_args, sync=True)
  File "/usr/src/app/substrapp/ledger_utils.py", line 107, in _wrapper
    return fn(*args, **kwargs)
  File "/usr/src/app/substrapp/ledger_utils.py", line 233, in update_ledger
    return _invoke_ledger(*args, **kwargs)
  File "/usr/src/app/substrapp/ledger_utils.py", line 212, in _invoke_ledger
    response = call_ledger('invoke', fcn=fcn, args=args, kwargs=params)
  File "/usr/src/app/substrapp/ledger_utils.py", line 182, in call_ledger
    raise LedgerError(str(e))
substrapp.ledger_utils.LedgerError: ['PHANTOM_READ_CONFLICT', 'PHANTOM_READ_CONFLICT', 'PHANTOM_READ_CONFLICT', 'PHANTOM_READ_CONFLICT']
[2020-01-09 10:29:11,078: INFO/ForkPoolWorker-1] Task substrapp.tasks.tasks.compute_task[96d4f391-356b-41a8-a90f-419a43c8ce00] succeeded in 2.1781632349884603s: {'worker': 'Org1.worker', 'queue': 'Org1.worker', 'computePlanID': '0ac595c5cfcd1711446bcb8f68df720e09f0252
aac948994e7d54e236d198536', 'result': {'end_head_model_file_hash': '6485c28e994d905ef74b77026a48b18914cf1562049085a329f99981841185ed', 'end_head_model_file': 'https://substra-backend.org1.substra-demo.owkin.com/model/6485c28e994d905ef74b77026a48b18914cf1562049085a329f
99981841185ed/file/', 'end_trunk_model_file_hash': 'ed30311078bfc91cec7a2b6d027bfa4cd45de7f5056609fe29d92d4e5ffad984', 'end_trunk_model_file': 'https://substra-backend.org1.substra-demo.owkin.com/model/ed30311078bfc91cec7a2b6d027bfa4cd45de7f5056609fe29d92d4e5ffad984/f
ile/'}}

See #94 for a potential fix.
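
Separately, the error handling itself crashes ('str' object has no attribute 'response'). For illustration, a hypothetical, more defensive extraction:

def extract_ledger_error_message(exc):
    # The items of exc.args[0] are sometimes plain strings (e.g.
    # 'PHANTOM_READ_CONFLICT') rather than proposal responses carrying a
    # .response attribute, so fall back to the raw exception text.
    try:
        failed = [r for r in exc.args[0] if r.response.status != 200]
        return failed[0].response.message
    except (AttributeError, IndexError, TypeError):
        return str(exc)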

Error 500 when linking dataset with datasamples

It seems that there is first an error in the chaincode, and that this error is then not properly caught in the backend, which raises a new exception.

It can be reproduced using this end-to-end test: Substra/substra-tests#106

Traceback from server

[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend] INFO - 2020-05-06 14:20:48,103 - substrapp.ledger_utils - smartcontract invoke:updateDataSample; elaps=93.40ms; error=LedgerBadRequest
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend] ERROR - 2020-05-06 14:20:48,165 - django.request - Internal Server Error: /data_sample/bulk_update/
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend] Traceback (most recent call last):
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/src/app/substrapp/views/datasample.py", line 250, in bulk_update
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     data = ledger.update_datasample(args)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/src/app/substrapp/ledger.py", line 148, in update_datasample
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     return _create_asset('updateDataSample', args=args)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/src/app/substrapp/ledger.py", line 93, in _create_asset
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     return __create_asset(fcn, args=args, sync=True, **extra_kwargs)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/src/app/substrapp/ledger.py", line 88, in __create_asset
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     return invoke_ledger(fcn=fcn, args=args, sync=sync, **extra_kwargs)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/src/app/substrapp/ledger_utils.py", line 147, in _wrapper
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     return fn(*args, **kwargs)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/src/app/substrapp/ledger_utils.py", line 326, in invoke_ledger
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     return _invoke_ledger(*args, **kwargs)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/src/app/substrapp/ledger_utils.py", line 310, in _invoke_ledger
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     response = call_ledger('invoke', fcn=fcn, args=args, kwargs=params)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/src/app/substrapp/ledger_utils.py", line 285, in call_ledger
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     return _call_ledger(call_type, fcn, *args, **kwargs)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/src/app/substrapp/ledger_utils.py", line 275, in _call_ledger
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     _raise_for_status(response)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/src/app/substrapp/ledger_utils.py", line 116, in _raise_for_status
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     raise exception_class.from_response_dict(response)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend] substrapp.ledger_utils.LedgerBadRequest: problem when reading json arg: {"hashes": "6ce482fae0cf23dc654a18667b2a194ce7e7c1191e8385777d591003f98cd7fd", "dataManagerKeys": "f5df98681ebb1d4737f4707eb2ba379a49e513be33b2ae30f19c48fbdfcb7df9"}, error is: json: cannot unmarshal string into Go struct field inputUpdateDataSample.hashes of type []string
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend] During handling of the above exception, another exception occurred:
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend] Traceback (most recent call last):
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/local/lib/python3.6/site-packages/django/core/handlers/exception.py", line 34, in inner
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     response = get_response(request)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/local/lib/python3.6/site-packages/django/core/handlers/base.py", line 115, in _get_response
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     response = self.process_exception_by_middleware(e, request)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/local/lib/python3.6/site-packages/django/core/handlers/base.py", line 113, in _get_response
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     response = wrapped_callback(request, *callback_args, **callback_kwargs)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/local/lib/python3.6/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     return view_func(*args, **kwargs)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/local/lib/python3.6/site-packages/rest_framework/viewsets.py", line 114, in view
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     return self.dispatch(request, *args, **kwargs)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/local/lib/python3.6/site-packages/rest_framework/views.py", line 505, in dispatch
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     response = self.handle_exception(exc)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/local/lib/python3.6/site-packages/rest_framework/views.py", line 465, in handle_exception
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     self.raise_uncaught_exception(exc)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/local/lib/python3.6/site-packages/rest_framework/views.py", line 476, in raise_uncaught_exception
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     raise exc
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/local/lib/python3.6/site-packages/rest_framework/views.py", line 502, in dispatch
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     response = handler(request, *args, **kwargs)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]   File "/usr/src/app/substrapp/views/datasample.py", line 252, in bulk_update
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend]     return Response({'message': str(e.msg)}, status=e.st)
[backend-org-1-substra-backend-server-545687975c-v5bg2 substra-backend] AttributeError: 'LedgerBadRequest' object has no attribute 'st'

Traceback from SDK:

  File "/Users/samlesu/.virtualenvs/sb/lib/python3.7/site-packages/substra/sdk/client.py", line 787, in link_dataset_with_data_samples
    data=data,
  File "/Users/samlesu/.virtualenvs/sb/lib/python3.7/site-packages/substra/sdk/utils.py", line 170, in wrapper
    return f(*args, **kwargs)
  File "/Users/samlesu/.virtualenvs/sb/lib/python3.7/site-packages/substra/sdk/rest_client.py", line 192, in request
    **request_kwargs,
  File "/Users/samlesu/.virtualenvs/sb/lib/python3.7/site-packages/substra/sdk/rest_client.py", line 170, in _request
    return self.__request(request_name, url, **request_kwargs)
  File "/Users/samlesu/.virtualenvs/sb/lib/python3.7/site-packages/substra/sdk/rest_client.py", line 156, in __request
    raise exceptions.InternalServerError.from_request_exception(e)
substra.sdk.exceptions.InternalServerError: 500 Server Error: Internal Server Error for url: http://substra-backend.node-1.com/data_sample/bulk_update/
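
Reading the LedgerBadRequest message above, the root cause looks like a payload shape mismatch: the chaincode expects JSON arrays where strings were sent. Shown here as Python literals for illustration (values copied from the error message):

# What was sent: single strings.
sent = {
    "hashes": "6ce482fae0cf23dc654a18667b2a194ce7e7c1191e8385777d591003f98cd7fd",
    "dataManagerKeys": "f5df98681ebb1d4737f4707eb2ba379a49e513be33b2ae30f19c48fbdfcb7df9",
}

# What the inputUpdateDataSample struct expects: []string fields, i.e. lists.
expected = {
    "hashes": ["6ce482fae0cf23dc654a18667b2a194ce7e7c1191e8385777d591003f98cd7fd"],
    "dataManagerKeys": ["f5df98681ebb1d4737f4707eb2ba379a49e513be33b2ae30f19c48fbdfcb7df9"],
}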

Testtuple execution failed

[backend-org-1-substra-backend-server-54b7879994-j262b substra-backend] INFO - 2019-12-16 15:27:30,977 - events.apps - Processing task 3298e66712747719fd670cba24164479a4c778fa217856e441c98a6ca98ee770: type=testtuple status=todo with tx status: MVCC_READ_CONFLICT
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker] [2019-12-16 15:27:30,994: INFO/MainProcess] Received task: substrapp.tasks.tasks.prepare_tuple[3298e66712747719fd670cba24164479a4c778fa217856e441c98a6ca98ee770]
[backend-org-2-substra-backend-server-5bcfbc8888-7tdqh substra-backend] INFO - 2019-12-16 15:27:31,007 - events.apps - Processing task 3298e66712747719fd670cba24164479a4c778fa217856e441c98a6ca98ee770: type=testtuple status=todo with tx status: MVCC_READ_CONFLICT
[backend-org-2-substra-backend-server-5bcfbc8888-7tdqh substra-backend] DEBUG - 2019-12-16 15:27:31,007 - events.apps - Skipping task 3298e66712747719fd670cba24164479a4c778fa217856e441c98a6ca98ee770: owner does not match (MyOrg1MSP vs MyOrg2MSP)
[backend-org-2-substra-backend-worker-64d655d689-cf7kg worker] [2019-12-16 15:27:31,010: ERROR/ForkPoolWorker-1] MVCC read conflict for ('logSuccessCompositeTrain', ['{"key": "503e6b9b0700c9f857cdf750d641e5d758720b93e9d8bbb6fb6836fdc8a3c849", "log": "", "outHeadModel": {"hash": "5cfbe4ac6f29b0a34fa881c209239fc91f547f89327deaa5a6dea5026eab0e68", "storageAddress": "http://substra-backend.node-2.com/model/5cfbe4ac6f29b0a34fa881c209239fc91f547f89327deaa5a6dea5026eab0e68/file/"}, "outTrunkModel": {"hash": "4af8d42042e5a3ad19bd12e98e1ff731c56a856a5f9263bb96cfc497c45e00c4", "storageAddress": "http://substra-backend.node-2.com/model/4af8d42042e5a3ad19bd12e98e1ff731c56a856a5f9263bb96cfc497c45e00c4/file/"}}'])
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker] [2019-12-16 15:27:31,027: INFO/ForkPoolWorker-1] DISCOVERY: adding channel peers query
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker] [2019-12-16 15:27:31,027: INFO/ForkPoolWorker-1] DISCOVERY: adding config query
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker] [2019-12-16 15:27:31,027: INFO/ForkPoolWorker-1] DISCOVERY: adding chaincodes/collection query
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker] [2019-12-16 15:27:31,103: INFO/ForkPoolWorker-1] create peer delivery stream
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker] [2019-12-16 15:27:31,107: INFO/ForkPoolWorker-1] create peer delivery stream
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker] [2019-12-16 15:27:31,134: INFO/MainProcess] Received task: substrapp.tasks.tasks.compute_task[28fc383e-dfd7-4368-bc46-29f0d173afc7]
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker] [2019-12-16 15:27:31,145: INFO/ForkPoolWorker-1] Task substrapp.tasks.tasks.prepare_tuple[8dc933581e8e54c448c6a288f291d60625a0fd0ebcd7acf117d25ce4cce62d89] succeeded in 0.19399090000115393s: None
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker] [2019-12-16 15:27:31,174: INFO/ForkPoolWorker-1] DISCOVERY: adding channel peers query
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker] [2019-12-16 15:27:31,174: INFO/ForkPoolWorker-1] DISCOVERY: adding config query
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker] [2019-12-16 15:27:31,174: INFO/ForkPoolWorker-1] DISCOVERY: adding chaincodes/collection query
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker] [2019-12-16 15:27:31,206: ERROR/ForkPoolWorker-1] Task substrapp.tasks.tasks.prepare_tuple[3298e66712747719fd670cba24164479a4c778fa217856e441c98a6ca98ee770] raised unexpected: update testtuple 3298e66712747719fd670cba24164479a4c778fa217856e441c98a6ca98ee770 failed: cannot change status from waiting to doing
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker] Traceback (most recent call last):
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]   File "/usr/local/lib/python3.6/dist-packages/celery/app/trace.py", line 382, in trace_task
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]     R = retval = fun(*args, **kwargs)
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]   File "/usr/local/lib/python3.6/dist-packages/celery/app/trace.py", line 641, in __protected_call__
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]     return self.run(*args, **kwargs)
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]   File "/usr/src/app/substrapp/tasks/tasks.py", line 449, in prepare_tuple
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]     log_start_tuple(tuple_type, subtuple['key'])
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]   File "/usr/src/app/substrapp/ledger_utils.py", line 330, in log_start_tuple
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]     _update_tuple_status(tuple_type, tuple_key, 'doing')
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]   File "/usr/src/app/substrapp/ledger_utils.py", line 324, in _update_tuple_status
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]     update_ledger(fcn=invoke_fcn, args=invoke_args, sync=True)
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]   File "/usr/src/app/substrapp/ledger_utils.py", line 107, in _wrapper
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]     return fn(*args, **kwargs)
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]   File "/usr/src/app/substrapp/ledger_utils.py", line 233, in update_ledger
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]     return _invoke_ledger(*args, **kwargs)
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]   File "/usr/src/app/substrapp/ledger_utils.py", line 212, in _invoke_ledger
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]     response = call_ledger('invoke', fcn=fcn, args=args, kwargs=params)
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]   File "/usr/src/app/substrapp/ledger_utils.py", line 196, in call_ledger
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker]     raise exception_class.from_response(response)
[backend-org-1-substra-backend-worker-859bdb6d6-q9c6g worker] substrapp.ledger_utils.LedgerResponseError: update testtuple 3298e66712747719fd670cba24164479a4c778fa217856e441c98a6ca98ee770 failed: cannot change status from waiting to doing

Error 400 when adding a dataset

Requests error status 400: {"message":[{"name":["This field may not be null."]}],"pkhash":"3a768e71c323e3cf62fb43c5fca6b4e8f7f2975bc13972163a8946e7c5ee6b8c"}
substra.sdk.exceptions.InvalidRequest: 400 Client Error: Bad Request for url: http://substra-backend.node-1.com/data_manager/: [{'name': ['This field may not be null.']}]
> /Users/mypath/register_my_dataset.py(119)main()
-> dataset_key = client.add_dataset(DATASET, exist_ok=True)['pkhash']
(Pdb) pp DATASET
{'data_opener': './dataset/opener.py',
 'description': './dataset/description.md',
 'name': 'My Dataset',
 'permissions': {'authorized_ids': [], 'public': True},
 'type': 'csv'}

This error has been reported twice today.
Downgrading to an earlier version of the backend and hlf-k8s seems to fix the issue.

[Edge case] Crash when composite traintuple head and trunk models are identical

For a composite traintuple, if the head and the trunk out models are identical, the saving fails.

That's because not only do the two models have the same composite traintuple key (expected), they ALSO have the same value.

The hash is computed from the traintuple key and the value. So the hashes are the same for the head and the trunk models, which leads to a pkhash conflict.

There was an aborted attempt at fixing this issue.
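
A hypothetical sketch of how including the model's role in the hash input would avoid the collision described above (not the actual hashing code):

import hashlib

def compute_model_pkhash(traintuple_key, model_bytes, role):
    # Mixing the role ("head" or "trunk") into the hash input keeps the two
    # out-models of a composite traintuple distinct even when their bytes
    # are identical.
    h = hashlib.sha256()
    h.update(traintuple_key.encode())
    h.update(role.encode())
    h.update(model_bytes)
    return h.hexdigest()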

Error 500 when trying to use /api-token-auth with a node-to-node login

I tried logging in through the substra client using a node-to-node login, with both a valid and an invalid password. This means using the /api-token-auth endpoint.

In both cases, I get this 500 response:

Requests error status 500: ValueError at /api-token-auth/
save() prohibited to prevent data loss due to unsaved related object 'user'.

Request Method: POST
Request URL: http://substra-backend.node-1.com/api-token-auth/
Django Version: 2.1.11
Python Executable: /usr/local/bin/python
Python Version: 3.6.10
Python Path: ['/usr/src/app', '/usr/local/lib/python36.zip', '/usr/local/lib/python3.6', '/usr/local/lib/python3.6/lib-dynload', '/usr/local/lib/python3.6/site-packages', '/usr/src/app', '/usr/src/app', '/usr/src/app/libs']
Server time: Tue, 7 Jan 2020 14:05:39 +0000
Installed Applications:
['django.contrib.admin',
 'django.contrib.auth',
 'django.contrib.contenttypes',
 'django.contrib.sessions',
 'django.contrib.messages',
 'django.contrib.staticfiles',
 'django.contrib.sites',
 'django_celery_results',
 'rest_framework_swagger',
 'rest_framework',
 'rest_framework.authtoken',
 'rest_framework_simplejwt.token_blacklist',
 'substrapp',
 'node',
 'users',
 'corsheaders',
 'events',
 'node-register']
Installed Middleware:
['corsheaders.middleware.CorsMiddleware',
 'django.middleware.security.SecurityMiddleware',
 'django.contrib.sessions.middleware.SessionMiddleware',
 'django.middleware.common.CommonMiddleware',
 'django.middleware.csrf.CsrfViewMiddleware',
 'django.contrib.auth.middleware.AuthenticationMiddleware',
 'django.contrib.auth.middleware.RemoteUserMiddleware',
 'django.contrib.messages.middleware.MessageMiddleware',
 'django.middleware.clickjacking.XFrameOptionsMiddleware',
 'libs.SQLPrintingMiddleware.SQLPrintingMiddleware',
 'libs.HealthCheckMiddleware.HealthCheckMiddleware']


Traceback:

File "/usr/local/lib/python3.6/site-packages/django/db/models/query.py" in get_or_create
  486.             return self.get(**lookup), False

File "/usr/local/lib/python3.6/site-packages/django/db/models/query.py" in get
  399.                 self.model._meta.object_name

During handling of the above exception (Token matching query does not exist.), another exception occurred:

File "/usr/local/lib/python3.6/site-packages/django/core/handlers/exception.py" in inner
  34.             response = get_response(request)

File "/usr/local/lib/python3.6/site-packages/django/core/handlers/base.py" in _get_response
  126.                 response = self.process_exception_by_middleware(e, request)

File "/usr/local/lib/python3.6/site-packages/django/core/handlers/base.py" in _get_response
  124.                 response = wrapped_callback(request, *callback_args, **callback_kwargs)

File "/usr/local/lib/python3.6/site-packages/django/views/decorators/csrf.py" in wrapped_view
  54.         return view_func(*args, **kwargs)

File "/usr/local/lib/python3.6/site-packages/django/views/generic/base.py" in view
  68.             return self.dispatch(request, *args, **kwargs)

File "/usr/local/lib/python3.6/site-packages/rest_framework/views.py" in dispatch
  483.             response = self.handle_exception(exc)

File "/usr/local/lib/python3.6/site-packages/rest_framework/views.py" in handle_exception
  443.             self.raise_uncaught_exception(exc)

File "/usr/local/lib/python3.6/site-packages/rest_framework/views.py" in dispatch
  480.             response = handler(request, *args, **kwargs)

File "/usr/src/app/backend/views.py" in post
  124.         token, created = Token.objects.get_or_create(user=user)

File "/usr/local/lib/python3.6/site-packages/django/db/models/manager.py" in manager_method
  82.                 return getattr(self.get_queryset(), name)(*args, **kwargs)

File "/usr/local/lib/python3.6/site-packages/django/db/models/query.py" in get_or_create
  488.             return self._create_object_from_params(lookup, params)

File "/usr/local/lib/python3.6/site-packages/django/db/models/query.py" in _create_object_from_params
  522.                 obj = self.create(**params)

File "/usr/local/lib/python3.6/site-packages/django/db/models/query.py" in create
  413.         obj.save(force_insert=True, using=self.db)

File "/usr/local/lib/python3.6/site-packages/rest_framework/authtoken/models.py" in save
  35.         return super(Token, self).save(*args, **kwargs)

File "/usr/local/lib/python3.6/site-packages/django/db/models/base.py" in save
  670.                         "unsaved related object '%s'." % field.name

Exception Type: ValueError at /api-token-auth/
Exception Value: save() prohibited to prevent data loss due to unsaved related object 'user'.
Request information:
USER: AnonymousUser

GET: No GET data

POST:
username = 'MyOrg1MSP'
password = 'selfSecret1'

FILES: No FILES data

COOKIES: No cookie data

META:
BACKEND_DB_NAME = 'substra'
BACKEND_DB_PWD = 'postgres'
BACKEND_DB_USER = 'postgres'
BACKEND_DEFAULT_PORT = '8000'
BACKEND_ORG = 'MyOrg1'
BACKEND_ORG_1_POSTGRESQL_PORT = 'tcp://10.99.253.201:5432'
BACKEND_ORG_1_POSTGRESQL_PORT_5432_TCP = 'tcp://10.99.253.201:5432'
BACKEND_ORG_1_POSTGRESQL_PORT_5432_TCP_ADDR = '10.99.253.201'
BACKEND_ORG_1_POSTGRESQL_PORT_5432_TCP_PORT = '5432'
BACKEND_ORG_1_POSTGRESQL_PORT_5432_TCP_PROTO = 'tcp'
BACKEND_ORG_1_POSTGRESQL_SERVICE_HOST = '10.99.253.201'
BACKEND_ORG_1_POSTGRESQL_SERVICE_PORT = '5432'
BACKEND_ORG_1_POSTGRESQL_SERVICE_PORT_POSTGRESQL = '5432'
BACKEND_ORG_1_RABBITMQ_PORT = 'tcp://10.108.182.80:4369'
BACKEND_ORG_1_RABBITMQ_PORT_15672_TCP = 'tcp://10.108.182.80:15672'
BACKEND_ORG_1_RABBITMQ_PORT_15672_TCP_ADDR = '10.108.182.80'
BACKEND_ORG_1_RABBITMQ_PORT_15672_TCP_PORT = '15672'
BACKEND_ORG_1_RABBITMQ_PORT_15672_TCP_PROTO = 'tcp'
BACKEND_ORG_1_RABBITMQ_PORT_25672_TCP = 'tcp://10.108.182.80:25672'
BACKEND_ORG_1_RABBITMQ_PORT_25672_TCP_ADDR = '10.108.182.80'
BACKEND_ORG_1_RABBITMQ_PORT_25672_TCP_PORT = '25672'
BACKEND_ORG_1_RABBITMQ_PORT_25672_TCP_PROTO = 'tcp'
BACKEND_ORG_1_RABBITMQ_PORT_4369_TCP = 'tcp://10.108.182.80:4369'
BACKEND_ORG_1_RABBITMQ_PORT_4369_TCP_ADDR = '10.108.182.80'
BACKEND_ORG_1_RABBITMQ_PORT_4369_TCP_PORT = '4369'
BACKEND_ORG_1_RABBITMQ_PORT_4369_TCP_PROTO = 'tcp'
BACKEND_ORG_1_RABBITMQ_PORT_5672_TCP = 'tcp://10.108.182.80:5672'
BACKEND_ORG_1_RABBITMQ_PORT_5672_TCP_ADDR = '10.108.182.80'
BACKEND_ORG_1_RABBITMQ_PORT_5672_TCP_PORT = '5672'
BACKEND_ORG_1_RABBITMQ_PORT_5672_TCP_PROTO = 'tcp'
BACKEND_ORG_1_RABBITMQ_SERVICE_HOST = '10.108.182.80'
BACKEND_ORG_1_RABBITMQ_SERVICE_PORT = '4369'
BACKEND_ORG_1_RABBITMQ_SERVICE_PORT_AMQP = '5672'
BACKEND_ORG_1_RABBITMQ_SERVICE_PORT_DIST = '25672'
BACKEND_ORG_1_RABBITMQ_SERVICE_PORT_EPMD = '4369'
BACKEND_ORG_1_RABBITMQ_SERVICE_PORT_STATS = '15672'
BACKEND_ORG_1_SUBSTRA_BACKEND_FLOWER_PORT = 'tcp://10.97.215.88:5555'
BACKEND_ORG_1_SUBSTRA_BACKEND_FLOWER_PORT_5555_TCP = 'tcp://10.97.215.88:5555'
BACKEND_ORG_1_SUBSTRA_BACKEND_FLOWER_PORT_5555_TCP_ADDR = '10.97.215.88'
BACKEND_ORG_1_SUBSTRA_BACKEND_FLOWER_PORT_5555_TCP_PORT = '5555'
BACKEND_ORG_1_SUBSTRA_BACKEND_FLOWER_PORT_5555_TCP_PROTO = 'tcp'
BACKEND_ORG_1_SUBSTRA_BACKEND_FLOWER_SERVICE_HOST = '10.97.215.88'
BACKEND_ORG_1_SUBSTRA_BACKEND_FLOWER_SERVICE_PORT = '5555'
BACKEND_ORG_1_SUBSTRA_BACKEND_FLOWER_SERVICE_PORT_HTTP = '5555'
BACKEND_ORG_1_SUBSTRA_BACKEND_SERVER_PORT = 'tcp://10.99.149.30:8000'
BACKEND_ORG_1_SUBSTRA_BACKEND_SERVER_PORT_8000_TCP = 'tcp://10.99.149.30:8000'
BACKEND_ORG_1_SUBSTRA_BACKEND_SERVER_PORT_8000_TCP_ADDR = '10.99.149.30'
BACKEND_ORG_1_SUBSTRA_BACKEND_SERVER_PORT_8000_TCP_PORT = '8000'
BACKEND_ORG_1_SUBSTRA_BACKEND_SERVER_PORT_8000_TCP_PROTO = 'tcp'
BACKEND_ORG_1_SUBSTRA_BACKEND_SERVER_SERVICE_HOST = '10.99.149.30'
BACKEND_ORG_1_SUBSTRA_BACKEND_SERVER_SERVICE_PORT = '8000'
BACKEND_ORG_1_SUBSTRA_BACKEND_SERVER_SERVICE_PORT_HTTP = '8000'
BACKEND_PEER_PORT = 'internal'
CELERY_BROKER_URL = 'amqp://rabbitmq:rabbitmq@backend-org-1-rabbitmq:5672//'
CONTENT_LENGTH = '39'
CONTENT_TYPE = 'application/x-www-form-urlencoded'
DATABASE_HOST = 'backend-org-1-postgresql'
DEFAULT_DOMAIN = 'http://substra-backend.node-1.com'
DJANGO_SETTINGS_MODULE = 'backend.settings.server.dev'
GATEWAY_INTERFACE = 'CGI/1.1'
GPG_KEY = '0D96DF4D4110E5C43FBFB17F2D347EA6AA65421D'
GRPC_MAX_RECEIVE_MESSAGE_LENGTH = '0'
GRPC_MAX_SEND_MESSAGE_LENGTH = '0'
GRPC_SSL_CIPHER_SUITES = 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA256:ECDHE-RSA-AES256-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384'
HOME = '/root'
HOSTNAME = 'backend-org-1-substra-backend-server-5f4bc85f8-2pz82'
HTTP_ACCEPT = 'application/json;version=0.0'
HTTP_ACCEPT_ENCODING = 'gzip, deflate'
HTTP_HOST = 'substra-backend.node-1.com'
HTTP_USER_AGENT = 'python-requests/2.22.0'
HTTP_X_FORWARDED_FOR = '192.168.65.3'
HTTP_X_FORWARDED_HOST = 'substra-backend.node-1.com'
HTTP_X_FORWARDED_PORT = '80'
HTTP_X_FORWARDED_PROTO = 'http'
HTTP_X_ORIGINAL_URI = '/api-token-auth/'
HTTP_X_REAL_IP = '192.168.65.3'
HTTP_X_REQUEST_ID = '29e880b1643d1386411caecab8207997'
HTTP_X_SCHEME = 'http'
KUBERNETES_PORT = 'tcp://10.96.0.1:443'
KUBERNETES_PORT_443_TCP = 'tcp://10.96.0.1:443'
KUBERNETES_PORT_443_TCP_ADDR = '10.96.0.1'
KUBERNETES_PORT_443_TCP_PORT = '443'
KUBERNETES_PORT_443_TCP_PROTO = 'tcp'
KUBERNETES_SERVICE_HOST = '10.96.0.1'
KUBERNETES_SERVICE_PORT = '443'
KUBERNETES_SERVICE_PORT_HTTPS = '443'
LANG = 'C.UTF-8'
LEDGER_CONFIG_FILE = '/conf/MyOrg1/substra-backend/conf.json'
MEDIA_ROOT = '/tmp/org-1/medias/'
NETWORK_ORG_1_PEER_1_CA_PORT = 'tcp://10.98.200.198:7054'
NETWORK_ORG_1_PEER_1_CA_PORT_7054_TCP = 'tcp://10.98.200.198:7054'
NETWORK_ORG_1_PEER_1_CA_PORT_7054_TCP_ADDR = '10.98.200.198'
NETWORK_ORG_1_PEER_1_CA_PORT_7054_TCP_PORT = '7054'
NETWORK_ORG_1_PEER_1_CA_PORT_7054_TCP_PROTO = 'tcp'
NETWORK_ORG_1_PEER_1_CA_SERVICE_HOST = '10.98.200.198'
NETWORK_ORG_1_PEER_1_CA_SERVICE_PORT = '7054'
NETWORK_ORG_1_PEER_1_CA_SERVICE_PORT_HTTP = '7054'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_CONTROLLER_PORT = 'tcp://10.109.205.170:80'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_CONTROLLER_PORT_443_TCP = 'tcp://10.109.205.170:443'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_CONTROLLER_PORT_443_TCP_ADDR = '10.109.205.170'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_CONTROLLER_PORT_443_TCP_PORT = '443'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_CONTROLLER_PORT_443_TCP_PROTO = 'tcp'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_CONTROLLER_PORT_80_TCP = 'tcp://10.109.205.170:80'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_CONTROLLER_PORT_80_TCP_ADDR = '10.109.205.170'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_CONTROLLER_PORT_80_TCP_PORT = '80'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_CONTROLLER_PORT_80_TCP_PROTO = 'tcp'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_CONTROLLER_SERVICE_HOST = '10.109.205.170'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_CONTROLLER_SERVICE_PORT = '80'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_CONTROLLER_SERVICE_PORT_HTTP = '80'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_CONTROLLER_SERVICE_PORT_HTTPS = '443'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_DEFAULT_BACKEND_PORT = 'tcp://10.111.238.187:80'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_DEFAULT_BACKEND_PORT_80_TCP = 'tcp://10.111.238.187:80'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_DEFAULT_BACKEND_PORT_80_TCP_ADDR = '10.111.238.187'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_DEFAULT_BACKEND_PORT_80_TCP_PORT = '80'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_DEFAULT_BACKEND_PORT_80_TCP_PROTO = 'tcp'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_DEFAULT_BACKEND_SERVICE_HOST = '10.111.238.187'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_DEFAULT_BACKEND_SERVICE_PORT = '80'
NETWORK_ORG_1_PEER_1_NGINX_INGRESS_DEFAULT_BACKEND_SERVICE_PORT_HTTP = '80'
NETWORK_ORG_1_PEER_1_PORT = 'tcp://10.98.102.223:7051'
NETWORK_ORG_1_PEER_1_PORT_7051_TCP = 'tcp://10.98.102.223:7051'
NETWORK_ORG_1_PEER_1_PORT_7051_TCP_ADDR = '10.98.102.223'
NETWORK_ORG_1_PEER_1_PORT_7051_TCP_PORT = '7051'
NETWORK_ORG_1_PEER_1_PORT_7051_TCP_PROTO = 'tcp'
NETWORK_ORG_1_PEER_1_PORT_7053_TCP = 'tcp://10.98.102.223:7053'
NETWORK_ORG_1_PEER_1_PORT_7053_TCP_ADDR = '10.98.102.223'
NETWORK_ORG_1_PEER_1_PORT_7053_TCP_PORT = '7053'
NETWORK_ORG_1_PEER_1_PORT_7053_TCP_PROTO = 'tcp'
NETWORK_ORG_1_PEER_1_SERVICE_HOST = '10.98.102.223'
NETWORK_ORG_1_PEER_1_SERVICE_PORT = '7051'
NETWORK_ORG_1_PEER_1_SERVICE_PORT_EVENT = '7053'
NETWORK_ORG_1_PEER_1_SERVICE_PORT_REQUEST = '7051'
PATH = '/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'
PATH_INFO = '/api-token-auth/'
PWD = '/usr/src/app'
PYTHONUNBUFFERED = '1'
PYTHON_GET_PIP_SHA256 = 'b86f36cc4345ae87bfd4f10ef6b2dbfa7a872fbff70608a1e43944d283fd0eee'
PYTHON_GET_PIP_URL = 'https://github.com/pypa/get-pip/raw/ffe826207a010164265d9cc807978e3604d18ca0/get-pip.py'
PYTHON_PIP_VERSION = '19.3.1'
PYTHON_VERSION = '3.6.10'
QUERY_STRING = ''
REMOTE_ADDR = '10.1.1.123'
REMOTE_HOST = ''
REQUEST_METHOD = 'POST'
SCRIPT_NAME = ''
SERVER_NAME = 'backend-org-1-substra-backend-server-5f4bc85f8-2pz82'
SERVER_PORT = '8000'
SERVER_PROTOCOL = 'HTTP/1.1'
SERVER_SOFTWARE = 'WSGIServer/0.2'
SHLVL = '0'
TZ = 'UTC'
_ = '/usr/local/bin/python'
wsgi.errors = <_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
wsgi.file_wrapper = ''
wsgi.input = <django.core.handlers.wsgi.LimitedStream object at 0x7f3257f0e160>
wsgi.multiprocess = False
wsgi.multithread = True
wsgi.run_once = False
wsgi.url_scheme = 'http'
wsgi.version = '(1, 0)'

Settings:
Using settings module backend.settings.server.dev
ABSOLUTE_URL_OVERRIDES = {}
ADMINS = []
ALLOWED_HOSTS = ['*']
APPEND_SLASH = True
AUTHENTICATION_BACKENDS = ['django.contrib.auth.backends.ModelBackend', 'node.authentication.NodeBackend']
AUTH_PASSWORD_VALIDATORS = '********************'
AUTH_USER_MODEL = 'auth.User'
BASE_DIR = '/usr/src/app/backend'
CACHES = {'default': {'BACKEND': 'django.core.cache.backends.locmem.LocMemCache'}}
CACHE_MIDDLEWARE_ALIAS = 'default'
CACHE_MIDDLEWARE_KEY_PREFIX = '********************'
CACHE_MIDDLEWARE_SECONDS = 600
CELERY_ACCEPT_CONTENT = ['application/json']
CELERY_BROKER_URL = "('amqp://rabbitmq:rabbitmq@backend-org-1-rabbitmq:5672//',)"
CELERY_RESULT_BACKEND = 'django-db'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TASK_MAX_RETRIES = 1
CELERY_TASK_RETRY_DELAY_SECONDS = 0
CELERY_TASK_SERIALIZER = 'json'
CELERY_TASK_TRACK_STARTED = True
CELERY_WORKER_CONCURRENCY = 1
CORS_ALLOW_CREDENTIALS = True
CORS_ALLOW_HEADERS = "('accept', 'accept-encoding', 'authorization', 'content-type', 'dnt', 'origin', 'user-agent', 'x-csrftoken', 'x-requested-with', 'token')"
CORS_ORIGIN_ALLOW_ALL = True
CSRF_COOKIE_AGE = 31449600
CSRF_COOKIE_DOMAIN = None
CSRF_COOKIE_HTTPONLY = False
CSRF_COOKIE_NAME = 'csrftoken'
CSRF_COOKIE_PATH = '/'
CSRF_COOKIE_SAMESITE = 'Lax'
CSRF_COOKIE_SECURE = False
CSRF_FAILURE_VIEW = 'django.views.csrf.csrf_failure'
CSRF_HEADER_NAME = 'HTTP_X_CSRFTOKEN'
CSRF_TRUSTED_ORIGINS = []
CSRF_USE_SESSIONS = False
DATABASES = {'default': {'ENGINE': 'django.db.backends.postgresql_psycopg2', 'NAME': 'substra', 'USER': 'postgres', 'PASSWORD': '********************', 'HOST': 'backend-org-1-postgresql', 'PORT': 5432, 'ATOMIC_REQUESTS': False, 'AUTOCOMMIT': True, 'CONN_MAX_AGE': 0, 'OPTIONS': {}, 'TIME_ZONE': None, 'TEST': {'CHARSET': None, 'COLLATION': None, 'NAME': None, 'MIRROR': None}}}
DATABASE_ROUTERS = []
DATA_UPLOAD_MAX_MEMORY_SIZE = 2621440
DATA_UPLOAD_MAX_NUMBER_FIELDS = 10000
DATETIME_FORMAT = 'N j, Y, P'
DATETIME_INPUT_FORMATS = ['%Y-%m-%d %H:%M:%S', '%Y-%m-%d %H:%M:%S.%f', '%Y-%m-%d %H:%M', '%Y-%m-%d', '%m/%d/%Y %H:%M:%S', '%m/%d/%Y %H:%M:%S.%f', '%m/%d/%Y %H:%M', '%m/%d/%Y', '%m/%d/%y %H:%M:%S', '%m/%d/%y %H:%M:%S.%f', '%m/%d/%y %H:%M', '%m/%d/%y']
DATE_FORMAT = 'N j, Y'
DATE_INPUT_FORMATS = ['%Y-%m-%d', '%m/%d/%Y', '%m/%d/%y', '%b %d %Y', '%b %d, %Y', '%d %b %Y', '%d %b, %Y', '%B %d %Y', '%B %d, %Y', '%d %B %Y', '%d %B, %Y']
DEBUG = True
DEBUG_PROPAGATE_EXCEPTIONS = False
DECIMAL_SEPARATOR = '.'
DEFAULT_CHARSET = 'utf-8'
DEFAULT_CONTENT_TYPE = 'text/html'
DEFAULT_DOMAIN = 'http://substra-backend.node-1.com'
DEFAULT_EXCEPTION_REPORTER_FILTER = 'django.views.debug.SafeExceptionReporterFilter'
DEFAULT_FILE_STORAGE = 'django.core.files.storage.FileSystemStorage'
DEFAULT_FROM_EMAIL = 'webmaster@localhost'
DEFAULT_INDEX_TABLESPACE = ''
DEFAULT_PORT = '8000'
DEFAULT_TABLESPACE = ''
DISALLOWED_USER_AGENTS = []
EMAIL_BACKEND = 'django.core.mail.backends.smtp.EmailBackend'
EMAIL_HOST = 'localhost'
EMAIL_HOST_PASSWORD = '********************'
EMAIL_HOST_USER = ''
EMAIL_PORT = 25
EMAIL_SSL_CERTFILE = None
EMAIL_SSL_KEYFILE = '********************'
EMAIL_SUBJECT_PREFIX = '[Django] '
EMAIL_TIMEOUT = None
EMAIL_USE_LOCALTIME = False
EMAIL_USE_SSL = False
EMAIL_USE_TLS = False
EXPIRY_TOKEN_LIFETIME = '********************'
FILE_CHARSET = 'utf-8'
FILE_UPLOAD_DIRECTORY_PERMISSIONS = None
FILE_UPLOAD_HANDLERS = ['django.core.files.uploadhandler.MemoryFileUploadHandler', 'django.core.files.uploadhandler.TemporaryFileUploadHandler']
FILE_UPLOAD_MAX_MEMORY_SIZE = 2621440
FILE_UPLOAD_PERMISSIONS = None
FILE_UPLOAD_TEMP_DIR = None
FIRST_DAY_OF_WEEK = 0
FIXTURE_DIRS = []
FORCE_SCRIPT_NAME = None
FORMAT_MODULE_PATH = None
FORM_RENDERER = 'django.forms.renderers.DjangoTemplates'
IGNORABLE_404_URLS = []
INSTALLED_APPS = ['django.contrib.admin', 'django.contrib.auth', 'django.contrib.contenttypes', 'django.contrib.sessions', 'django.contrib.messages', 'django.contrib.staticfiles', 'django.contrib.sites', 'django_celery_results', 'rest_framework_swagger', 'rest_framework', 'rest_framework.authtoken', 'rest_framework_simplejwt.token_blacklist', 'substrapp', 'node', 'users', 'corsheaders', 'events', 'node-register']
INTERNAL_IPS = []
LANGUAGES = [('af', 'Afrikaans'), ('ar', 'Arabic'), ('ast', 'Asturian'), ('az', 'Azerbaijani'), ('bg', 'Bulgarian'), ('be', 'Belarusian'), ('bn', 'Bengali'), ('br', 'Breton'), ('bs', 'Bosnian'), ('ca', 'Catalan'), ('cs', 'Czech'), ('cy', 'Welsh'), ('da', 'Danish'), ('de', 'German'), ('dsb', 'Lower Sorbian'), ('el', 'Greek'), ('en', 'English'), ('en-au', 'Australian English'), ('en-gb', 'British English'), ('eo', 'Esperanto'), ('es', 'Spanish'), ('es-ar', 'Argentinian Spanish'), ('es-co', 'Colombian Spanish'), ('es-mx', 'Mexican Spanish'), ('es-ni', 'Nicaraguan Spanish'), ('es-ve', 'Venezuelan Spanish'), ('et', 'Estonian'), ('eu', 'Basque'), ('fa', 'Persian'), ('fi', 'Finnish'), ('fr', 'French'), ('fy', 'Frisian'), ('ga', 'Irish'), ('gd', 'Scottish Gaelic'), ('gl', 'Galician'), ('he', 'Hebrew'), ('hi', 'Hindi'), ('hr', 'Croatian'), ('hsb', 'Upper Sorbian'), ('hu', 'Hungarian'), ('ia', 'Interlingua'), ('id', 'Indonesian'), ('io', 'Ido'), ('is', 'Icelandic'), ('it', 'Italian'), ('ja', 'Japanese'), ('ka', 'Georgian'), ('kab', 'Kabyle'), ('kk', 'Kazakh'), ('km', 'Khmer'), ('kn', 'Kannada'), ('ko', 'Korean'), ('lb', 'Luxembourgish'), ('lt', 'Lithuanian'), ('lv', 'Latvian'), ('mk', 'Macedonian'), ('ml', 'Malayalam'), ('mn', 'Mongolian'), ('mr', 'Marathi'), ('my', 'Burmese'), ('nb', 'Norwegian Bokmål'), ('ne', 'Nepali'), ('nl', 'Dutch'), ('nn', 'Norwegian Nynorsk'), ('os', 'Ossetic'), ('pa', 'Punjabi'), ('pl', 'Polish'), ('pt', 'Portuguese'), ('pt-br', 'Brazilian Portuguese'), ('ro', 'Romanian'), ('ru', 'Russian'), ('sk', 'Slovak'), ('sl', 'Slovenian'), ('sq', 'Albanian'), ('sr', 'Serbian'), ('sr-latn', 'Serbian Latin'), ('sv', 'Swedish'), ('sw', 'Swahili'), ('ta', 'Tamil'), ('te', 'Telugu'), ('th', 'Thai'), ('tr', 'Turkish'), ('tt', 'Tatar'), ('udm', 'Udmurt'), ('uk', 'Ukrainian'), ('ur', 'Urdu'), ('vi', 'Vietnamese'), ('zh-hans', 'Simplified Chinese'), ('zh-hant', 'Traditional Chinese')]
LANGUAGES_BIDI = ['he', 'ar', 'fa', 'ur']
LANGUAGE_CODE = 'en-us'
LANGUAGE_COOKIE_AGE = None
LANGUAGE_COOKIE_DOMAIN = None
LANGUAGE_COOKIE_NAME = 'django_language'
LANGUAGE_COOKIE_PATH = '/'
LEDGER = {'name': 'MyOrg1', 'core_peer_mspconfigpath': '/var/hyperledger/msp', 'channel_name': 'mychannel', 'chaincode_name': 'mycc', 'chaincode_version': '1.0', 'client': {'name': 'user', 'org': 'MyOrg1', 'state_store': '/tmp/hfc-cvs', 'key_path': '********************', 'cert_path': '/var/hyperledger/msp/signcerts/cert.pem', 'msp_id': 'MyOrg1MSP'}, 'peer': {'name': 'peer', 'host': 'network-org-1-peer-1.org-1', 'port': {'internal': 7051, 'external': 7051}, 'docker_core_dir': '/var/hyperledger/fabric_cfg', 'tlsCACerts': '/var/hyperledger/ca/cacert.pem', 'clientKey': '********************', 'clientCert': '/var/hyperledger/tls/client/pair/tls.crt', 'grpcOptions': {'grpc-max-send-message-length': 15, 'grpc.ssl_target_name_override': 'network-org-1-peer-1.org-1'}}, 'requestor': <hfc.fabric.user.User object at 0x7f327278e550>, 'hfc': <function get_hfc_client at 0x7f3276035598>}
LEDGER_CALL_RETRY = True
LEDGER_CONFIG_FILE = '/conf/MyOrg1/substra-backend/conf.json'
LEDGER_MAX_RETRY_TIMEOUT = 5
LEDGER_SYNC_ENABLED = True
LOCALE_PATHS = []
LOGGING = {'version': 1, 'disable_existing_loggers': False, 'formatters': {'verbose': {'format': '%(levelname)s %(asctime)s %(module)s %(process)d %(thread)d %(message)s'}, 'simple': {'format': '%(levelname)s - %(asctime)s - %(name)s - %(message)s'}}, 'filters': {'require_debug_false': {'()': 'django.utils.log.RequireDebugFalse'}}, 'handlers': {'mail_admins': {'level': 'ERROR', 'filters': ['require_debug_false'], 'class': 'django.utils.log.AdminEmailHandler'}, 'console': {'level': 'DEBUG', 'class': 'logging.StreamHandler', 'formatter': 'simple'}, 'error_file': {'level': 'INFO', 'filename': '/usr/src/app/backend.log', 'class': 'logging.handlers.RotatingFileHandler', 'maxBytes': 1048576, 'backupCount': 2, 'formatter': 'verbose'}}, 'loggers': {'django.request': {'handlers': ['mail_admins', 'error_file'], 'level': 'INFO', 'propagate': False}, 'events': {'handlers': ['console'], 'level': 'DEBUG', 'propagate': True}}}
LOGGING_CONFIG = 'logging.config.dictConfig'
LOGIN_REDIRECT_URL = '/accounts/profile/'
LOGIN_URL = '/accounts/login/'
LOGOUT_REDIRECT_URL = None
MANAGERS = []
MEDIA_ROOT = '/tmp/org-1/medias/'
MEDIA_URL = '/media/'
MESSAGE_STORAGE = 'django.contrib.messages.storage.fallback.FallbackStorage'
MIDDLEWARE = ['corsheaders.middleware.CorsMiddleware', 'django.middleware.security.SecurityMiddleware', 'django.contrib.sessions.middleware.SessionMiddleware', 'django.middleware.common.CommonMiddleware', 'django.middleware.csrf.CsrfViewMiddleware', 'django.contrib.auth.middleware.AuthenticationMiddleware', 'django.contrib.auth.middleware.RemoteUserMiddleware', 'django.contrib.messages.middleware.MessageMiddleware', 'django.middleware.clickjacking.XFrameOptionsMiddleware', 'libs.SQLPrintingMiddleware.SQLPrintingMiddleware', 'libs.HealthCheckMiddleware.HealthCheckMiddleware']
MIGRATION_MODULES = {}
MONTH_DAY_FORMAT = 'F j'
NUMBER_GROUPING = 0
ORG = 'MyOrg1'
ORG_NAME = 'MyOrg1'
PASSWORD_HASHERS = '********************'
PASSWORD_RESET_TIMEOUT_DAYS = '********************'
PEER_PORT = 7051
PREPEND_WWW = False
PROJECT_ROOT = '/usr/src/app'
REST_FRAMEWORK = {'TEST_REQUEST_DEFAULT_FORMAT': 'json', 'DEFAULT_RENDERER_CLASSES': ('rest_framework.renderers.JSONRenderer', 'rest_framework.renderers.BrowsableAPIRenderer'), 'DEFAULT_AUTHENTICATION_CLASSES': ['users.authentication.SecureJWTAuthentication', 'libs.expiryTokenAuthentication.ExpiryTokenAuthentication', 'libs.sessionAuthentication.CustomSessionAuthentication'], 'DEFAULT_PERMISSION_CLASSES': ['rest_framework.permissions.IsAuthenticated'], 'UNICODE_JSON': False, 'DEFAULT_VERSIONING_CLASS': 'libs.versioning.AcceptHeaderVersioningRequired', 'ALLOWED_VERSIONS': ('0.0',), 'DEFAULT_VERSION': '0.0'}
ROOT_URLCONF = 'backend.urls'
SECRET_FILE = '********************'
SECRET_KEY = '********************'
SECURE_BROWSER_XSS_FILTER = False
SECURE_CONTENT_TYPE_NOSNIFF = False
SECURE_HSTS_INCLUDE_SUBDOMAINS = False
SECURE_HSTS_PRELOAD = False
SECURE_HSTS_SECONDS = 0
SECURE_PROXY_SSL_HEADER = None
SECURE_REDIRECT_EXEMPT = []
SECURE_SSL_HOST = None
SECURE_SSL_REDIRECT = False
SERVER_EMAIL = 'root@localhost'
SESSION_CACHE_ALIAS = 'default'
SESSION_COOKIE_AGE = 1209600
SESSION_COOKIE_DOMAIN = None
SESSION_COOKIE_HTTPONLY = True
SESSION_COOKIE_NAME = 'sessionid'
SESSION_COOKIE_PATH = '/'
SESSION_COOKIE_SAMESITE = 'Lax'
SESSION_COOKIE_SECURE = False
SESSION_ENGINE = 'django.contrib.sessions.backends.db'
SESSION_EXPIRE_AT_BROWSER_CLOSE = False
SESSION_FILE_PATH = None
SESSION_SAVE_EVERY_REQUEST = False
SESSION_SERIALIZER = 'django.contrib.sessions.serializers.JSONSerializer'
SETTINGS_MODULE = 'backend.settings.server.dev'
SHORT_DATETIME_FORMAT = 'm/d/Y P'
SHORT_DATE_FORMAT = 'm/d/Y'
SIGNING_BACKEND = 'django.core.signing.TimestampSigner'
SILENCED_SYSTEM_CHECKS = []
SIMPLE_JWT = {'ACCESS_TOKEN_LIFETIME': '********************', 'AUTH_HEADER_TYPES': ('JWT',)}
SITE_HOST = 'substra-backend.MyOrg1.xyz'
SITE_ID = 1
SITE_PORT = '8000'
STATICFILES_DIRS = []
STATICFILES_FINDERS = ['django.contrib.staticfiles.finders.FileSystemFinder', 'django.contrib.staticfiles.finders.AppDirectoriesFinder']
STATICFILES_STORAGE = 'django.contrib.staticfiles.storage.StaticFilesStorage'
STATIC_ROOT = None
STATIC_URL = '/static/'
SUBSTRA_FOLDER = '/substra'
TASK = {'CAPTURE_LOGS': True, 'CLEAN_EXECUTION_ENVIRONMENT': True, 'CACHE_DOCKER_IMAGES': False}
TEMPLATES = [{'BACKEND': 'django.template.backends.django.DjangoTemplates', 'DIRS': [], 'APP_DIRS': True, 'OPTIONS': {'context_processors': ['django.template.context_processors.debug', 'django.template.context_processors.request', 'django.contrib.auth.context_processors.auth', 'django.contrib.messages.context_processors.messages']}}]
TEST_NON_SERIALIZED_APPS = []
TEST_RUNNER = 'django.test.runner.DiscoverRunner'
THOUSAND_SEPARATOR = ','
TIME_FORMAT = 'P'
TIME_INPUT_FORMATS = ['%H:%M:%S', '%H:%M:%S.%f', '%H:%M']
TIME_ZONE = 'UTC'
TRUE_VALUES = {'TRUE', 1, '1', 'on', 'yes', 'true', 'ON', 'YES', 'True', 'Y', 'On', 'y', 't', 'T'}
USE_I18N = True
USE_L10N = True
USE_THOUSAND_SEPARATOR = False
USE_TZ = True
USE_X_FORWARDED_HOST = False
USE_X_FORWARDED_PORT = False
WSGI_APPLICATION = 'backend.wsgi.application'
X_FRAME_OPTIONS = 'SAMEORIGIN'
YEAR_MONTH_FORMAT = 'F Y'


You're seeing this error because you have DEBUG = True in your
Django settings file. Change that to False, and Django will
display a standard page generated by the handler for this status code.
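
What the traceback suggests (an assumption, not a confirmed diagnosis): the node-to-node authentication backend returns a User instance that was never persisted, so creating a DRF token pointing at it fails. A minimal reproduction sketch, requiring a configured Django project:

from django.contrib.auth.models import User
from rest_framework.authtoken.models import Token

user = User(username="MyOrg1MSP")  # instantiated but never user.save()
# Raises ValueError: save() prohibited to prevent data loss due to unsaved related object 'user'.
token, created = Token.objects.get_or_create(user=user)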

Add asset creation date

It would be really helpful when debugging.
It could be set either by the backend(s) or the chaincode.

Worker: missing information in logs

Follow-up to #124

Currently, some important information isn't displayed in the logs. Not having access to this information makes it harder to troubleshoot errors/bugs. In particular, some of the algo exceptions on prod are being swallowed (every exception until the final retry), so we potentially lose some critical troubleshooting data.

Not available from the logs:

  • Task id
  • Retry attempts (can be inferred, but not explicitly mentioned)
  • Exception that caused each retry (only the exception that caused the final failure is displayed)

Solutions:

Option 1

Restore celery logs: #213

Option 2

Maybe celery logging gives out too much information. In that case, we could explore writing log messages ourselves (e.g. logger.info(f'Starting task {task_id}'), etc.); see the sketch after this list.

Option 3

(your idea here?)
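
As a sketch of Option 2 (the task name follows the tracebacks above; the logging calls are illustrative, not the current implementation):

import logging
from celery import shared_task

logger = logging.getLogger(__name__)

@shared_task(bind=True, max_retries=1)
def prepare_tuple(self, subtuple, tuple_type):
    logger.info("Starting task %s (attempt %d)", self.request.id, self.request.retries + 1)
    try:
        ...  # actual task body
    except Exception as exc:
        # log every intermediate exception, not only the one causing the final failure
        logger.exception("Task %s failed on attempt %d", self.request.id, self.request.retries + 1)
        raise self.retry(exc=exc, countdown=0)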

CLI: Invalid key in the leaderboard command

When I try to use the leaderboard command on a key which doesn't exist, I get the following error:
Error: Request failed: InternalServerError: 500 Server Error: Internal Server Error for url: http://substra-backend.node-1.com/objective/foo/leaderboard/?sort=desc

Maybe we could have a more specific error explaining that the key doesn't exist?

Add requirements.txt for dev uses

We should probably add another requirements file (for example requirements-dev.txt) that would include the dependencies of scripts used for local development (start.py, populate.py, etc.).

502 when under load

When under load, nginx sporadically returns 502 responses.

Repro

(Linux, docker driver, minikube w/ ingress addon)

substra login

for i in `seq 200`; do 
    substra get traintuple $i & 
done

In the logs

2020/07/13 17:44:56 [error] 2217#2217: *672834 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.17.0.1, server: substra-backend.node-1.com, request: "GET /traintuple/184/ HTTP/1.1", upstream: "http://172.18.0.47:8000/traintuple/184/", host: "substra-backend.node-1.com"
[...]
172.17.0.1 - - [13/Jul/2020:17:44:56 +0000] "GET /traintuple/184/ HTTP/1.1" 400 26 "-" "python-requests/2.24.0" 260 0.027 [org-1-backend-org-1-substra-backend-server-http] [] 172.18.0.47:8000, 172.18.0.47:8000 0, 26 0.000, 0.024 502, 400 be8dae30c7f82749a8b130ddf459875f

For 200 consecutive requests, I consistently get 1-4 "502" responses.

Interestingly, the first time I run the test, I only get one 502. When I re-run the test, I get 3-4 502s. I then keep on getting 3-4 502s in subsequent tests. This might be related to the fact that we currently use the cheaper algorithm.

Event App exit on socket closed

I ran into a particular issue.

The event app crashes because of a closed socket. I didn't manage to get a local setup to reproduce it easily.

Here are some logs:

On peer side

 2020-01-22 09:20:21.257 UTC [comm.grpc.server] 1 -> INFO 1083 streaming call completed grpc.service=protos.Deliver grpc.method=Deliver grpc.peer_address=10.1.1.1:43510 grpc.peer_subject="CN=user,OU=peer,O=Hyperledger,ST=North Carolina, C=US" error="context finished before block retrieved: context canceled" grpc.code=Unknown grpc.call_duration=30.002261971s  

On backend side

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "./events/apps.py", line 134, in wait
    loop.run_until_complete(stream)
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/site-packages/hfc/fabric/channel/channel_eventhub.py", line 545, in handle_stream
    async for event in stream:
  File "/usr/local/lib/python3.6/site-packages/aiogrpc/utils.py", line 138, in __anext__
    return await asyncio.shield(self._next_future, loop=self._loop)
  File "/usr/local/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.6/site-packages/aiogrpc/utils.py", line 126, in _next
    return next(self._iterator)
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 392, in __next__
    return self._next()
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 561, in _next
    raise self
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "{"created":"@1579684585.868278838","description":"Error received from peer ipv4:194.167.143.126:443","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"Socket closed","grpc_status":14}"

I found some information suggesting that we may need to change some gRPC parameters.
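
For reference, these are the kinds of gRPC channel options usually involved (the keys are standard gRPC settings, the values are examples only; where to plug them into the fabric SDK event hub is still to be determined):

import grpc

options = [
    ("grpc.keepalive_time_ms", 120000),          # send a keepalive ping every 2 minutes
    ("grpc.keepalive_timeout_ms", 20000),        # wait 20s for the ping ack before failing
    ("grpc.keepalive_permit_without_calls", 1),  # keep pinging even when no RPC is active
]

channel = grpc.insecure_channel("network-org-1-peer-1.org-1:7051", options=options)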

Docker container name conflict

While executing tuples that output pretty large models (1GB), I ran into the following issue:

ERROR 2020-02-05 15:30:50,114 substrapp.tasks.tasks 15 140710911403840 [00-01-0004-969dae2]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 261, in _raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: http+docker://localhost/v1.35/containers/create?name=compositeTraintuple_12031be8_train

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/src/app/substrapp/tasks/tasks.py", line 530, in compute_task
    max_retries=int(getattr(settings, 'CELERY_TASK_MAX_RETRIES')))
  File "/usr/local/lib/python3.6/dist-packages/celery/app/task.py", line 704, in retry
    raise_with_context(exc)
  File "/usr/src/app/substrapp/tasks/tasks.py", line 524, in compute_task
    res = do_task(subtuple, tuple_type)
  File "/usr/src/app/substrapp/tasks/tasks.py", line 590, in do_task
    org_name
  File "/usr/src/app/substrapp/tasks/tasks.py", line 743, in _do_task
    environment=environment
  File "/usr/src/app/substrapp/tasks/utils.py", line 252, in compute_docker
    client.containers.run(**task_args)
  File "/usr/local/lib/python3.6/dist-packages/docker/models/containers.py", line 803, in run
    detach=detach, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/models/containers.py", line 861, in create
    resp = self.client.api.create_container(**create_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/container.py", line 430, in create_container
    return self.create_container_from_config(config, name)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/container.py", line 441, in create_container_from_config
    return self._result(res, True)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 267, in _result
    self._raise_for_status(response)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 263, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.6/dist-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 409 Client Error: Conflict ("Conflict. The container name "/compositeTraintuple_12031be8_train" is already in use by container "3d5cb300703d3dcec4321ee83612d7c0cee2a83743faf070dfa54e969461f8e2". You have to remove (or rename) that container to be a
ble to reuse that name.")

Do we need to keep URL as an asset attribute and/or even expose it to the user?

The download_assets mission (PR) changes the way the URL is used and exposed to the user.
Indeed, the solution of hiding the ledger URL and replacing it with the node URL makes the URL nearly useless.

For instance, if, from node 1, we want to download an asset owned by node 2 at this URL:

http://substrabac.node2/dataset/key/download, the implemented feature proxies the URL to http://substrabac.node1/dataset/node2key/download and hides from the user the fact that the asset comes from node 2.

We can still know who owns the asset by looking at the owner attribute, if there is one.

Moreover, a solution like transforming http://substrabac.node2/dataset/key/download into http://substrabac.node1/proxy?url=http://substrabac.node2/dataset/node2key/download wasn't chosen, so when we look at the list of assets on a node, all URLs are proxied implicitly:

On node 1

dataset

[
    [
        {
            "objectiveKey": "3d70ab46d710dacb0f48cb42db4874fac14e048a0d415e266aad38c09591ee71",
            "description": {
                "hash": "15863c2af1fcfee9ca6f61f04be8a0eaaf6a45e4d50c421788d450d198e580f1",
                "storageAddress": "http://owkin.substrabac:8000/data_manager/8dd01465003a9b1e01c99c904d86aa518b3a5dd9dc8d40fe7d075c726ac073ca/description/"
            },
            "key": "8dd01465003a9b1e01c99c904d86aa518b3a5dd9dc8d40fe7d075c726ac073ca",
            "name": "ISIC 2018",
            "opener": {
                "hash": "8dd01465003a9b1e01c99c904d86aa518b3a5dd9dc8d40fe7d075c726ac073ca",
                "storageAddress": "http://owkin.substrabac:8000/data_manager/8dd01465003a9b1e01c99c904d86aa518b3a5dd9dc8d40fe7d075c726ac073ca/opener/"
            },
            "owner": "chu-nantesMSP",
            "permissions": {
                "process": {
                    "public": true,
                    "authorizedIDs": []
                }
            },
            "type": "Images"
        },
        {
            "objectiveKey": "3d70ab46d710dacb0f48cb42db4874fac14e048a0d415e266aad38c09591ee71",
            "description": {
                "hash": "258bef187a166b3fef5cb86e68c8f7e154c283a148cd5bc344fec7e698821ad3",
                "storageAddress": "http://owkin.substrabac:8000/data_manager/ce9f292c72e9b82697445117f9c2d1d18ce0f8ed07ff91dadb17d668bddf8932/description/"
            },
            "key": "ce9f292c72e9b82697445117f9c2d1d18ce0f8ed07ff91dadb17d668bddf8932",
            "name": "Simplified ISIC 2018",
            "opener": {
                "hash": "ce9f292c72e9b82697445117f9c2d1d18ce0f8ed07ff91dadb17d668bddf8932",
                "storageAddress": "http://owkin.substrabac:8000/data_manager/ce9f292c72e9b82697445117f9c2d1d18ce0f8ed07ff91dadb17d668bddf8932/opener/"
            },
            "owner": "owkinMSP",
            "permissions": {
                "process": {
                    "public": true,
                    "authorizedIDs": []
                }
            },
            "type": "Images"
        }
    ]
]

algo

[
    [
        {
            "key": "0acc5180e09b6a6ac250f4e3c172e2893f617aa1c22ef1f379019d20fe44142f",
            "name": "Neural Network",
            "content": {
                "hash": "0acc5180e09b6a6ac250f4e3c172e2893f617aa1c22ef1f379019d20fe44142f",
                "storageAddress": "http://owkin.substrabac:8000/algo/0acc5180e09b6a6ac250f4e3c172e2893f617aa1c22ef1f379019d20fe44142f/file/"
            },
            "description": {
                "hash": "b9463411a01ea00869bdffce6e59a5c100a4e635c0a9386266cad3c77eb28e9e",
                "storageAddress": "http://owkin.substrabac:8000/algo/0acc5180e09b6a6ac250f4e3c172e2893f617aa1c22ef1f379019d20fe44142f/description/"
            },
            "owner": "chu-nantesMSP",
            "permissions": {
                "process": {
                    "public": true,
                    "authorizedIDs": []
                }
            }
        },
        {
            "key": "9c3d8777e11fd72cbc0fd672bec3a0848f8518b4d56706008cc05f8a1cee44f9",
            "name": "Random Forest",
            "content": {
                "hash": "9c3d8777e11fd72cbc0fd672bec3a0848f8518b4d56706008cc05f8a1cee44f9",
                "storageAddress": "http://owkin.substrabac:8000/algo/9c3d8777e11fd72cbc0fd672bec3a0848f8518b4d56706008cc05f8a1cee44f9/file/"
            },
            "description": {
                "hash": "4acea40c4b51996c88ef279c5c9aa41ab77b97d38c5ca167e978a98b2e402675",
                "storageAddress": "http://owkin.substrabac:8000/algo/9c3d8777e11fd72cbc0fd672bec3a0848f8518b4d56706008cc05f8a1cee44f9/description/"
            },
            "owner": "chu-nantesMSP",
            "permissions": {
                "process": {
                    "public": true,
                    "authorizedIDs": []
                }
            }
        },
        {
            "key": "7c9f9799bf64c10002381583a9ffc535bc3f4bf14d6f0c614d3f6f868f72a9d5",
            "name": "Logistic regression",
            "content": {
                "hash": "7c9f9799bf64c10002381583a9ffc535bc3f4bf14d6f0c614d3f6f868f72a9d5",
                "storageAddress": "http://owkin.substrabac:8000/algo/7c9f9799bf64c10002381583a9ffc535bc3f4bf14d6f0c614d3f6f868f72a9d5/file/"
            },
            "description": {
                "hash": "124a0425b746d7072282d167b53cb6aab3a31bf1946dae89135c15b0126ebec3",
                "storageAddress": "http://owkin.substrabac:8000/algo/7c9f9799bf64c10002381583a9ffc535bc3f4bf14d6f0c614d3f6f868f72a9d5/description/"
            },
            "owner": "owkinMSP",
            "permissions": {
                "process": {
                    "public": true,
                    "authorizedIDs": []
                }
            }
        }
    ]
]

On node 2

dataset

[
    [
        {
            "objectiveKey": "3d70ab46d710dacb0f48cb42db4874fac14e048a0d415e266aad38c09591ee71",
            "description": {
                "hash": "15863c2af1fcfee9ca6f61f04be8a0eaaf6a45e4d50c421788d450d198e580f1",
                "storageAddress": "http://chunantes.substrabac:8001/data_manager/8dd01465003a9b1e01c99c904d86aa518b3a5dd9dc8d40fe7d075c726ac073ca/description/"
            },
            "key": "8dd01465003a9b1e01c99c904d86aa518b3a5dd9dc8d40fe7d075c726ac073ca",
            "name": "ISIC 2018",
            "opener": {
                "hash": "8dd01465003a9b1e01c99c904d86aa518b3a5dd9dc8d40fe7d075c726ac073ca",
                "storageAddress": "http://chunantes.substrabac:8001/data_manager/8dd01465003a9b1e01c99c904d86aa518b3a5dd9dc8d40fe7d075c726ac073ca/opener/"
            },
            "owner": "chu-nantesMSP",
            "permissions": {
                "process": {
                    "public": true,
                    "authorizedIDs": []
                }
            },
            "type": "Images"
        },
        {
            "objectiveKey": "3d70ab46d710dacb0f48cb42db4874fac14e048a0d415e266aad38c09591ee71",
            "description": {
                "hash": "258bef187a166b3fef5cb86e68c8f7e154c283a148cd5bc344fec7e698821ad3",
                "storageAddress": "http://chunantes.substrabac:8001/data_manager/ce9f292c72e9b82697445117f9c2d1d18ce0f8ed07ff91dadb17d668bddf8932/description/"
            },
            "key": "ce9f292c72e9b82697445117f9c2d1d18ce0f8ed07ff91dadb17d668bddf8932",
            "name": "Simplified ISIC 2018",
            "opener": {
                "hash": "ce9f292c72e9b82697445117f9c2d1d18ce0f8ed07ff91dadb17d668bddf8932",
                "storageAddress": "http://chunantes.substrabac:8001/data_manager/ce9f292c72e9b82697445117f9c2d1d18ce0f8ed07ff91dadb17d668bddf8932/opener/"
            },
            "owner": "owkinMSP",
            "permissions": {
                "process": {
                    "public": true,
                    "authorizedIDs": []
                }
            },
            "type": "Images"
        }
    ]
]

algo

[
    [
        {
            "key": "0acc5180e09b6a6ac250f4e3c172e2893f617aa1c22ef1f379019d20fe44142f",
            "name": "Neural Network",
            "content": {
                "hash": "0acc5180e09b6a6ac250f4e3c172e2893f617aa1c22ef1f379019d20fe44142f",
                "storageAddress": "http://chunantes.substrabac:8001/algo/0acc5180e09b6a6ac250f4e3c172e2893f617aa1c22ef1f379019d20fe44142f/file/"
            },
            "description": {
                "hash": "b9463411a01ea00869bdffce6e59a5c100a4e635c0a9386266cad3c77eb28e9e",
                "storageAddress": "http://chunantes.substrabac:8001/algo/0acc5180e09b6a6ac250f4e3c172e2893f617aa1c22ef1f379019d20fe44142f/description/"
            },
            "owner": "chu-nantesMSP",
            "permissions": {
                "process": {
                    "public": true,
                    "authorizedIDs": []
                }
            }
        },
        {
            "key": "9c3d8777e11fd72cbc0fd672bec3a0848f8518b4d56706008cc05f8a1cee44f9",
            "name": "Random Forest",
            "content": {
                "hash": "9c3d8777e11fd72cbc0fd672bec3a0848f8518b4d56706008cc05f8a1cee44f9",
                "storageAddress": "http://chunantes.substrabac:8001/algo/9c3d8777e11fd72cbc0fd672bec3a0848f8518b4d56706008cc05f8a1cee44f9/file/"
            },
            "description": {
                "hash": "4acea40c4b51996c88ef279c5c9aa41ab77b97d38c5ca167e978a98b2e402675",
                "storageAddress": "http://chunantes.substrabac:8001/algo/9c3d8777e11fd72cbc0fd672bec3a0848f8518b4d56706008cc05f8a1cee44f9/description/"
            },
            "owner": "chu-nantesMSP",
            "permissions": {
                "process": {
                    "public": true,
                    "authorizedIDs": []
                }
            }
        },
        {
            "key": "7c9f9799bf64c10002381583a9ffc535bc3f4bf14d6f0c614d3f6f868f72a9d5",
            "name": "Logistic regression",
            "content": {
                "hash": "7c9f9799bf64c10002381583a9ffc535bc3f4bf14d6f0c614d3f6f868f72a9d5",
                "storageAddress": "http://chunantes.substrabac:8001/algo/7c9f9799bf64c10002381583a9ffc535bc3f4bf14d6f0c614d3f6f868f72a9d5/file/"
            },
            "description": {
                "hash": "124a0425b746d7072282d167b53cb6aab3a31bf1946dae89135c15b0126ebec3",
                "storageAddress": "http://chunantes.substrabac:8001/algo/7c9f9799bf64c10002381583a9ffc535bc3f4bf14d6f0c614d3f6f868f72a9d5/description/"
            },
            "owner": "owkinMSP",
            "permissions": {
                "process": {
                    "public": true,
                    "authorizedIDs": []
                }
            }
        }
    ]
]

As we see, the URL becomes nearly useless, as we can always infer it from the key and the URL of the node we download the asset from (user/front API), or from the key and the URL of the node that owns the asset (server/back API).
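
A minimal sketch of that inference (the path layout mirrors the examples above; the helper name is hypothetical):

def storage_address(node_url, asset_type, key, part):
    return f"{node_url}/{asset_type}/{key}/{part}/"

key = "8dd01465003a9b1e01c99c904d86aa518b3a5dd9dc8d40fe7d075c726ac073ca"

# Same asset, addressed through node 1 or node 2:
print(storage_address("http://owkin.substrabac:8000", "data_manager", key, "opener"))
print(storage_address("http://chunantes.substrabac:8001", "data_manager", key, "opener"))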

It raises the question of the mapping between a node ID and its node URL, which could be stored in the ledger for instance.

Feel free to comment :) !

Error "Rendezvous of RPC that terminated - failed to connect to all addresse"

While trying to add a bunch of assets, I regularly end up with the following message (as returned by the SDK):

ERROR    substra.sdk.rest_client:rest_client.py:116 Requests error status 400: {"message":"<_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"failed to connect to all addresses\"\n\tdebug_error_string = \"{\"created\":\"@1576051644.524371597\",\"description\":\"Failed to pick subchannel\",\"file\":\"src/core/ext/filters/client_channel/client_channel.cc\",\"file_line\":3934,\"referenced_errors\":[{\"created\":\"@1576051644.524366407\",\"description\":\"failed to connect to all addresses\",\"file\":\"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc\",\"file_line\":393,\"grpc_status\":14}]}\"\n>"}

~10 celeryworker docker images get pulled when I add a traintuple

With backend/celeryworker/celerybeat on version 0.0.12-alpha.4:

# After starting network/peers
$ docker images | grep worker
substrafoundation/celeryworker                                   0.0.12-alpha.4       2351fa3aa980        4 days ago          794MB

# Add traintuple...
# It takes ~5 min. With version 0.0.11 it's much faster.

# After the traintuple has been processed
$ docker images | grep worker
substrafoundation/celeryworker                                                                latest               c2ae5a9c7489        3 days ago          794MB
substrafoundation/celeryworker                                                                0.0.12-alpha.4       2351fa3aa980        4 days ago          794MB
substrafoundation/celeryworker                                                                0.0.12-alpha.3       14611cb4d69f        11 days ago         794MB
substrafoundation/celeryworker                                                                0.0.12-alpha.2       1df63f2ea2e5        2 weeks ago         766MB
substrafoundation/celeryworker                                                                0.0.12-alpha.1       29afa2954ce2        2 weeks ago         766MB
substrafoundation/celeryworker                                                                0.0.11               098db9729666        6 weeks ago         766MB
substrafoundation/celeryworker                                                                0.0.11-alpha.3       d010a2653162        6 weeks ago         766MB
substrafoundation/celeryworker                                                                0.0.11-alpha.2       dc825690761c        6 weeks ago         766MB
substrafoundation/celeryworker                                                                0.0.11-alpha.1       33370e58b413        6 weeks ago         766MB
substrafoundation/celeryworker                                                                0.0.10               f3a397018bd4        7 weeks ago         766MB
substrafoundation/celeryworker                                                                dev                  4a4002ef6992        8 weeks ago         766MB
substrafoundation/celeryworker                                                                0.0.9                2d5a2efe39e5        2 months ago        765MB

Worker: retry in case of timeout when calling log_start_tuple seems buggy

Traceback:

[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker] [2019-12-09 21:33:15,334: ERROR/ForkPoolWorker-1] exception calling callback for <Future at 0x7f50303a4908 state=finished raised _Rendezvous>
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker] Traceback (most recent call last):
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]   File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]     result = self.fn(*self.args, **self.kwargs)
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]   File "/usr/local/lib/python3.6/dist-packages/aiogrpc/utils.py", line 126, in _next
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]     return next(self._iterator)
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]   File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 392, in __next__
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]     return self._next()
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]   File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 561, in _next
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]     raise self
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker] grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker] 	status = StatusCode.CANCELLED
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker] 	details = "Locally cancelled by application!"
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker] 	debug_error_string = "None"
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker] >
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker] During handling of the above exception, another exception occurred:
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker] Traceback (most recent call last):
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]   File "/usr/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]     callback(self)
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]   File "/usr/lib/python3.6/asyncio/futures.py", line 417, in _call_set_state
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]     dest_loop.call_soon_threadsafe(_set_state, destination, source)
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]   File "/usr/lib/python3.6/asyncio/base_events.py", line 637, in call_soon_threadsafe
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]     self._check_closed()
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]   File "/usr/lib/python3.6/asyncio/base_events.py", line 377, in _check_closed
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]     raise RuntimeError('Event loop is closed')
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker] RuntimeError: Event loop is closed
[backend-org-1-substra-backend-server-6b867c94c-p7t5q substra-backend]
[backend-org-1-substra-backend-server-6b867c94c-p7t5q substra-backend]
[backend-org-1-substra-backend-server-6b867c94c-p7t5q substra-backend]   [SQL Queries for] /composite_traintuple/7d3419fa32def5678c555ce985305b12faa1e5c2cbc9cff449f4d5a17e3d0b84/
[backend-org-1-substra-backend-server-6b867c94c-p7t5q substra-backend]
[backend-org-1-substra-backend-server-6b867c94c-p7t5q substra-backend]   [0.003] SELECT authtoken_token.key,  authtoken_token.user_id,  authtoken_token.created,  auth_user.id,  auth_user.password,  auth_user.last_login,  auth_user.is_superuser,  auth_user.username,  auth_user.first_name,  auth_user.last_name,  auth_user.email,  auth_user.is_staff,  auth_user.is_active,  auth_user.date_joined FROM authtoken_token INNER JOIN auth_user ON (authtoken_token.user_id = auth_user.id) WHERE authtoken_token.key = '41c18852d55e6dfdc751f75c7f84bdb850a7f0cc'
[backend-org-1-substra-backend-server-6b867c94c-p7t5q substra-backend]
[backend-org-1-substra-backend-server-6b867c94c-p7t5q substra-backend]   [TOTAL TIME: 0.003 seconds (1 queries)]
[backend-org-1-substra-backend-server-6b867c94c-p7t5q substra-backend] [09/Dec/2019 21:33:15] "GET /composite_traintuple/7d3419fa32def5678c555ce985305b12faa1e5c2cbc9cff449f4d5a17e3d0b84/ HTTP/1.1" 200 1806
[backend-org-1-substra-backend-server-6b867c94c-p7t5q substra-backend] [09/Dec/2019 21:33:16] "GET /readiness HTTP/1.1" 200 2
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker] [2019-12-09 21:33:17,333: WARNING/ForkPoolWorker-1] Function invoke_ledger failed (<class 'substrapp.ledger_utils.LedgerTimeout'>): waitForEvent timed out. retrying in 2s
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker] [2019-12-09 21:33:17,386: INFO/ForkPoolWorker-1] DISCOVERY: adding channel peers query
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker] [2019-12-09 21:33:17,386: INFO/ForkPoolWorker-1] DISCOVERY: adding config query
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker] [2019-12-09 21:33:17,387: INFO/ForkPoolWorker-1] DISCOVERY: adding chaincodes/collection query
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker] [2019-12-09 21:33:17,468: ERROR/ForkPoolWorker-1] Task substrapp.tasks.tasks.prepare_tuple[7d3419fa32def5678c555ce985305b12faa1e5c2cbc9cff449f4d5a17e3d0b84] raised unexpected: cannot update traintuple 7d3419fa32def5678c555ce985305b12faa1e5c2cbc9cff449f4d5a17e3d0b84 - status already doing
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker] Traceback (most recent call last):
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]   File "/usr/local/lib/python3.6/dist-packages/celery/app/trace.py", line 382, in trace_task
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]     R = retval = fun(*args, **kwargs)
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]   File "/usr/local/lib/python3.6/dist-packages/celery/app/trace.py", line 641, in __protected_call__
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]     return self.run(*args, **kwargs)
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]   File "/usr/src/app/substrapp/tasks/tasks.py", line 428, in prepare_tuple
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]     log_start_tuple(tuple_type, subtuple['key'])
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]   File "/usr/src/app/substrapp/ledger_utils.py", line 98, in _wrapper
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]     return fn(*args, **kwargs)
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]   File "/usr/src/app/substrapp/ledger_utils.py", line 320, in log_start_tuple
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]     sync=True)
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]   File "/usr/src/app/substrapp/ledger_utils.py", line 98, in _wrapper
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]     return fn(*args, **kwargs)
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]   File "/usr/src/app/substrapp/ledger_utils.py", line 210, in invoke_ledger
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]     response = call_ledger('invoke', fcn=fcn, args=args, kwargs=params)
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]   File "/usr/src/app/substrapp/ledger_utils.py", line 187, in call_ledger
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker]     raise exception_class.from_response(response)
[backend-org-2-substra-backend-worker-56c79d57bd-tdv9b worker] substrapp.ledger_utils.LedgerResponseError: cannot update traintuple 7d3419fa32def5678c555ce985305b12faa1e5c2cbc9cff449f4d5a17e3d0b84 - status already doing

Seen while executing the test (random failure): tests/test_execution_compute_plan.py::test_compute_plan_aggregate_composite_traintuples
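
One possible reading of these logs (an assumption): the first logStartTuple invoke actually landed on the ledger but timed out locally, so the retry hits "status already doing". A sketch of a tolerant retry, with simplified and hypothetical call signatures (only LedgerResponseError and its module path come from the tracebacks above):

from substrapp.ledger_utils import LedgerResponseError

def log_start_tuple_tolerant(update_ledger, tuple_type, tuple_key):
    try:
        update_ledger(fcn=f"logStart{tuple_type}", args={"key": tuple_key}, sync=True)
    except LedgerResponseError as exc:
        if "status already doing" in str(exc):
            return  # the timed-out first attempt already went through: treat as success
        raise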

_GatheringFuture exception was never retrieved

While inspecting the logs of the worker I stumbled upon:

[2020-02-05 16:01:41,804: ERROR/ForkPoolWorker-1] _GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
concurrent.futures._base.CancelledError
INFO 2020-02-05 16:01:57,298 substrapp.ledger_utils 15 140710911403840 smartcontract invoke:logSuccessTrain; elaps=16270.34ms; error=None
[2020-02-05 16:01:57,524: ERROR/ForkPoolWorker-1] _GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
concurrent.futures._base.CancelledError
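
For what it's worth, this warning usually means that an asyncio.gather() aggregate was cancelled and nobody ever consumed its exception. A minimal sketch, assuming the per-peer ledger calls are gathered somewhere in the fabric-sdk/aiogrpc layer (the helper below is hypothetical, not the actual code path), of gathering with return_exceptions=True so cancellations are consumed rather than left unretrieved:

import asyncio


async def gather_peer_calls(coros):
    # Gather the per-peer calls with return_exceptions=True so that every
    # outcome, including CancelledError, ends up in the results list and the
    # aggregate future's exception is always retrieved.
    results = await asyncio.gather(*coros, return_exceptions=True)
    errors = [r for r in results
              if isinstance(r, BaseException) and not isinstance(r, asyncio.CancelledError)]
    if errors:
        raise errors[0]
    return results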

Objective unicity check

The unicity of an objective is currently based only on the description content. This allows objectives with the same metrics to be uploaded multiple times so that they can be associated with different datasets.

This however prevents the creation of an objective with the same description but a different metrics archive.

We could have the unicity check include both the description and the metrics. This would keep the current flexibility of re-uploading the same metrics archive with a different description, while also allowing multiple metrics archives to be uploaded with the same description.
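
A minimal sketch of what such a combined check could look like, assuming the unicity key is a simple checksum (the function name and the choice of a SHA-256 digest are assumptions for illustration, not the backend's actual implementation):

import hashlib


def objective_unicity_key(description_bytes, metrics_bytes):
    # Hypothetical: derive the unicity key from both the description and the
    # metrics archive, so that only the (description, metrics) pair has to be
    # unique, not the description alone.
    checksum = hashlib.sha256()
    checksum.update(description_bytes)
    checksum.update(metrics_bytes)
    return checksum.hexdigest()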

Use public image of substra-tools in the fixtures

As a consequence of the move from gcr.io to Docker Hub, we can now use the public substrafoundation/substra-tools images instead of eu.gcr.io/substra-208412/substra-tools.

There are references to this image in the Dockerfiles and charts here:

charts/substra-backend/values.yaml
12:    # - eu.gcr.io/substra-208412/substra-tools:0.0.1

fixtures/chunantes/objectives/objective0/Dockerfile
1:FROM eu.gcr.io/substra-208412/substra-tools:0.0.1

fixtures/owkin/objectives/objective0/Dockerfile
1:FROM eu.gcr.io/substra-208412/substra-tools:0.0.1

but also inside the zip and tar.gz files here:

./fixtures/chunantes/algos/algo0/algo.zip
./fixtures/chunantes/algos/algo4/algo.zip
./fixtures/chunantes/algos/algo2/algo.zip
./fixtures/chunantes/algos/algo1/algo.tar.gz
./fixtures/chunantes/algos/algo3/algo.tar.gz
./fixtures/chunantes/algos/algo0/algo.tar.gz
./fixtures/chunantes/algos/algo4/algo.tar.gz

I may have forgotten a location, but this should cover all the cases we use right now.
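
To double-check, here is a minimal sketch of a helper that scans the fixtures (plain files and the algo archives) for the old reference; the script itself is hypothetical, only the paths and the search string come from this issue:

import pathlib
import tarfile
import zipfile

# The old registry reference we want to hunt down.
OLD_IMAGE = b"eu.gcr.io/substra-208412/substra-tools"


def find_references(root="fixtures"):
    """List files (including members of algo archives) still referencing OLD_IMAGE."""
    hits = []
    for path in pathlib.Path(root).rglob("*"):
        if not path.is_file():
            continue
        if path.suffix == ".zip":
            with zipfile.ZipFile(path) as archive:
                hits += [f"{path}:{name}" for name in archive.namelist()
                         if OLD_IMAGE in archive.read(name)]
        elif path.name.endswith(".tar.gz"):
            with tarfile.open(path) as archive:
                for member in archive.getmembers():
                    if member.isfile() and OLD_IMAGE in archive.extractfile(member).read():
                        hits.append(f"{path}:{member.name}")
        elif OLD_IMAGE in path.read_bytes():
            hits.append(str(path))
    return hits


if __name__ == "__main__":
    print("\n".join(find_references()))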

gRPC error "StatusCode.CANCELLED"

While adding a batch of tuples individually, as a whole other batch of tuples was being executed, I got the following error in the worker's logs. The traceback is incomplete (keyboard mishaps, and the logs were gone), but I hope this is enough to investigate.

[2020-01-16 10:09:16,738: ERROR/ForkPoolWorker-1] exception calling callback for <Future at 0x7f79d456f630 state=finished raised _MultiThreadedRendezvous>
Traceback (most recent call last):
  File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.6/dist-packages/aiogrpc/utils.py", line 126, in _next
    return next(self._iterator)
  File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 416, in __next__
    return self._next()
  File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 703, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.CANCELLED
    details = "Locally cancelled by application!"
    debug_error_string = "None"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "/usr/lib/python3.6/asyncio/futures.py", line 417, in _call_set_state
    dest_loop.call_soon_threadsafe(_set_state, destination, source)
  File "/usr/lib/python3.6/asyncio/base_events.py", line 637, in call_soon_threadsafe
    self._check_closed()
  File "/usr/lib/python3.6/asyncio/base_events.py", line 377, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
[2020-01-16 10:09:16,743: ERROR/ForkPoolWorker-1] exception calling callback for <Future at 0x7f79d458abe0 state=finished raised _MultiThreadedRendezvous>
Traceback (most recent call last):
  File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.6/dist-packages/aiogrpc/utils.py", line 126, in _next
    return next(self._iterator)
  File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 416, in __next__
    return self._next()
  File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 703, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.CANCELLED
    details = "Locally cancelled by application!"
    debug_error_string = "None"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 324, in _invoke_callbacks
    callback(self)
  File "/usr/lib/python3.6/asyncio/futures.py", line 417, in _call_set_state
    dest_loop.call_soon_threadsafe(_set_state, destination, source)
  File "/usr/lib/python3.6/asyncio/base_events.py", line 637, in call_soon_threadsafe

Streamline model view

The model view has a lot of peculiarities:

  • /model doesn't return models per se, but instead returns the list of all traintuples/composite traintuples/aggregatetuples, each with the associated certified testtuple (even though this concept doesn't really exist anymore)
  • /model/<traintuple_key> only works for traintuples and returns the traintuple and all the linked testtuples. It also creates a local cache of each outModel. It fails with a 500 if the key matches, for example, a composite traintuple rather than a traintuple.
  • /model/<tuple_key>/details (where tuple_key is either a composite_traintuple_key, a traintuple_key or an aggregate_key) returns the tuple and all the linked testtuples
  • /model/<model_hash>/file streams the content of an outModel

A streamlined schema could be:

  • /model returns the list of all traintuples/composite traintuples/aggregatetuples
  • /model/<tuple_key> returns the matching traintuple / composite traintuple / aggregatetuple with all linked testtuples
  • /model/<tuple_key>/file streams the content of the outModel (for traintuple / aggregatetuple) or the content of the outTrunkModel (for composite traintuple); a view sketch follows this list
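
A minimal sketch of what the streamlined view could look like with Django REST Framework (the class and the method bodies are hypothetical placeholders, not the current implementation):

from rest_framework import viewsets
from rest_framework.decorators import action


class ModelViewSet(viewsets.ViewSet):
    # Hypothetical sketch of the streamlined model view.

    def list(self, request):
        # GET /model/ : all traintuples / composite traintuples / aggregatetuples
        raise NotImplementedError

    def retrieve(self, request, pk=None):
        # GET /model/<tuple_key>/ : the matching tuple with all linked testtuples,
        # whatever the tuple type (no more 500 on composite traintuple keys)
        raise NotImplementedError

    @action(detail=True)
    def file(self, request, pk=None):
        # GET /model/<tuple_key>/file/ : stream the outModel (traintuple,
        # aggregatetuple) or the outTrunkModel (composite traintuple)
        raise NotImplementedError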

Evicted training task pods are never deleted

When a training task pod is evicted by Kubernetes, the backend waits forever for the pod to either complete or fail.

However, since "Evicted" is not a failure condition that is currently supported in the backend:

  • the pod stays in the Evicted state forever
  • the backend waits forever for the pod to complete
  • the training tasks (and possibly the compute plan) stay in the doing state forever (see the sketch after this list)
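
A minimal sketch, assuming the kubernetes Python client, of a check the wait loop could use to detect this case (the function name and the namespace handling are assumptions):

from kubernetes import client, config


def pod_is_failed_or_evicted(namespace, pod_name):
    # Hypothetical check: an evicted pod ends up with phase "Failed" and
    # reason "Evicted", which should be treated as a terminal failure
    # instead of being waited on forever.
    config.load_incluster_config()
    status = client.CoreV1Api().read_namespaced_pod_status(pod_name, namespace).status
    return status.phase == "Failed" or status.reason == "Evicted"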

List index out of range

While trying to run the tests on the demo environment, lots of calls fail with the following traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.6/site-packages/django/core/handlers/base.py", line 126, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.6/site-packages/django/core/handlers/base.py", line 124, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.6/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/rest_framework/viewsets.py", line 103, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/rest_framework/views.py", line 483, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.6/site-packages/rest_framework/views.py", line 443, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.6/site-packages/rest_framework/views.py", line 480, in dispatch
    response = handler(request, *args, **kwargs)
  File "./node/views/node.py", line 18, in list
    nodes = query_ledger(fcn=self.ledger_query_call)
  File "./substrapp/ledger_utils.py", line 98, in _wrapper
    return fn(*args, **kwargs)
  File "./substrapp/ledger_utils.py", line 192, in query_ledger
    return call_ledger('query', fcn=fcn, args=args)
  File "./substrapp/ledger_utils.py", line 123, in call_ledger
    with get_hfc() as (loop, client):
  File "/usr/local/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "./substrapp/ledger_utils.py", line 113, in get_hfc
    loop, client = LEDGER['hfc']()
  File "./backend/settings/deps/ledger.py", line 103, in get_hfc_client
    update_client_with_discovery(client, results)
  File "./backend/settings/deps/ledger.py", line 123, in update_client_with_discovery
    peer_info = msp[0]
IndexError: list index out of range
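
The crash happens in update_client_with_discovery when discovery returns an MSP with an empty peer list. A minimal sketch of a defensive guard that could replace the direct msp[0] indexing (the helper name is hypothetical):

import logging

logger = logging.getLogger(__name__)


def first_peer_info(msp_id, msp):
    # Hypothetical guard: discovery can return an MSP with no peer yet
    # (e.g. while the network is still coming up), which currently raises
    # IndexError and turns into a 500. Skipping lets a later call retry.
    if not msp:
        logger.warning("discovery returned no peer for MSP %s", msp_id)
        return None
    return msp[0]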

Cannot launch using docker-compose

hlf-k8s creates configuration files at <substra folder>/conf/config/conf-<org>.json, but the backend's start.py script looks for them at <substra folder>/conf/<org>/substra-backend/conf.json. As a result, start.py doesn't launch any backend.
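
A minimal sketch, with hypothetical path templates, of how start.py could fall back to the location hlf-k8s actually writes to:

import os

# Hypothetical candidate locations for an org's backend configuration file:
# the path start.py currently expects, and the path hlf-k8s actually writes.
CANDIDATE_TEMPLATES = [
    "{root}/conf/{org}/substra-backend/conf.json",
    "{root}/conf/config/conf-{org}.json",
]


def find_org_config(root, org):
    for template in CANDIDATE_TEMPLATES:
        path = template.format(root=root, org=org)
        if os.path.exists(path):
            return path
    raise FileNotFoundError(f"no backend configuration found for org {org!r}")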

Can't launch all celery workers/schedulers

I'm trying to start all the workers/schedulers for the 2-org setup.
If I run the command:

DJANGO_SETTINGS_MODULE=backend.settings.dev BACKEND_ORG=owkin BACKEND_DEFAULT_PORT=8000 BACKEND_PEER_PORT_EXTERNAL=9051 celery -E -A backend worker -l info -B -n owkin -Q owkin,scheduler,celery --hostname owkin.scheduler

the logs stay attached to my terminal and I can't run the commands to start the following instances. So I decided to pass the --detach argument, which works, but when trying to launch a new celery worker I get an error saying:

ERROR: Pidfile (celeryd.pid) already exists. Seems we're already running? (pid: 4815)
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/util.py", line 319, in _exit_function
    p.join()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 122, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process

I could run the command with the --pidfile= (with no path) flag, but is that the right way to go?
Thanks!

MVCC read conflict when reporting the success of a task

Hello,

It looks like an MVCC READ CONFLICT occurred when registering the success of a training task. The logs are attached.
logs.txt
For context, I had submitted two composite_traintuples and an aggregatetuple taking the two composite_traintuples as in-models. The error occurred for one of the two composite_traintuples.

I am not sure I understand why we get it here. Is it because, when logging the success of this task and updating the status of its child (the aggregatetuple), it reads an older version of the other parent (the other composite_traintuple), whose status has been updated at the same time?

Couldn't we add a retry mechanism to fix this?
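
A retry seems worth trying, since MVCC read conflicts are transient by nature. A minimal sketch of what it could look like, assuming the conflict surfaces as an exception whose message contains "MVCC READ CONFLICT" (the decorator and the error matching are assumptions, not existing ledger_utils code):

import functools
import time


def retry_on_mvcc_conflict(attempts=3, delay=1):
    # Hypothetical decorator for ledger invocations: an MVCC read conflict means
    # two transactions updated the same key in the same block, so retrying the
    # invocation after a short pause should let it go through.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:  # e.g. a LedgerResponseError
                    if "MVCC READ CONFLICT" not in str(exc) or attempt == attempts - 1:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator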

Local volume not handled properly when executing traintuples of the same compute plan on different workers

Traceback:

[backend-org-2-substra-backend-worker-655dc75d9b-h8fxq worker] Traceback (most recent call last):
[backend-org-2-substra-backend-worker-655dc75d9b-h8fxq worker]   File "/usr/src/app/substrapp/tasks/tasks.py", line 460, in compute_task
[backend-org-2-substra-backend-worker-655dc75d9b-h8fxq worker]     res = do_task(subtuple, tuple_type)
[backend-org-2-substra-backend-worker-655dc75d9b-h8fxq worker]   File "/usr/src/app/substrapp/tasks/tasks.py", line 540, in do_task
[backend-org-2-substra-backend-worker-655dc75d9b-h8fxq worker]     local_volume = client.volumes.get(volume_id=volume_id)
[backend-org-2-substra-backend-worker-655dc75d9b-h8fxq worker]   File "/usr/local/lib/python3.6/dist-packages/docker/models/volumes.py", line 76, in get
[backend-org-2-substra-backend-worker-655dc75d9b-h8fxq worker]     return self.prepare_model(self.client.api.inspect_volume(volume_id))
[backend-org-2-substra-backend-worker-655dc75d9b-h8fxq worker]   File "/usr/local/lib/python3.6/dist-packages/docker/api/volume.py", line 114, in inspect_volume
[backend-org-2-substra-backend-worker-655dc75d9b-h8fxq worker]     return self._result(self._get(url), True)
[backend-org-2-substra-backend-worker-655dc75d9b-h8fxq worker]   File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 235, in _result
[backend-org-2-substra-backend-worker-655dc75d9b-h8fxq worker]     self._raise_for_status(response)
[backend-org-2-substra-backend-worker-655dc75d9b-h8fxq worker]   File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 231, in _raise_for_status
[backend-org-2-substra-backend-worker-655dc75d9b-h8fxq worker]     raise create_api_error_from_http_exception(e)
[backend-org-2-substra-backend-worker-655dc75d9b-h8fxq worker]   File "/usr/local/lib/python3.6/dist-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
[backend-org-2-substra-backend-worker-655dc75d9b-h8fxq worker]     raise cls(e, response=response, explanation=explanation)
[backend-org-2-substra-backend-worker-655dc75d9b-h8fxq worker] docker.errors.NotFound: 404 Client Error: Not Found ("get local-acbe3fabadb14a63d29e2ae9e23388d4b7e05037f0528ef12562281ef1f034c8-MyOrg2: no such volume")

To reproduce, use substra-tests:

pytest tests/test_execution_compute_plan.py::test_compute_plan
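
A minimal sketch, using the Docker SDK for Python, of how do_task could create the compute plan's local volume when it does not exist yet on the worker handling the tuple (the helper is hypothetical; the volume naming follows the error above):

import docker


def get_or_create_local_volume(client, volume_id):
    # Hypothetical helper: the compute plan's local volume may have been created
    # by a tuple executed on another worker, so create it on this worker instead
    # of assuming it already exists.
    try:
        return client.volumes.get(volume_id)
    except docker.errors.NotFound:
        return client.volumes.create(name=volume_id)


# usage: get_or_create_local_volume(docker.from_env(), "local-<compute plan key>-MyOrg2")

Note that this would only avoid the 404: the data written by the previous tuple would still live on the other worker, so properly supporting this case would likely also require shared storage or pinning all tuples of a compute plan to the same worker.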
