
Comments (4)

ClementMayer commented on June 25, 2024

Work is in progress in the contributivity working group (see the dedicated repository). A library is being set up, already offering 8 to 9 different calculation methods, but no findings have been shared at this stage.
The working group on contributivity will start thinking about a potential integration into the framework and about the right layer at which to integrate the work done so far. In particular, one possibility to explore is the generation of compute plans.

from substra-documentation.

bowni commented on June 25, 2024

Ok, so here we go for some thoughts on that one:

  1. What we call "contributivity" is a measurement of how much the different partners and their respective datasets have contributed to the performance of a model trained in a distributed fashion on these datasets
  2. The baseline measurement approach is the computation of the Shapley value of each dataset, where the characteristic function attributes to a given set of partners the performance of the model trained on those partners' datasets. It has all the desirable theoretical properties; however, it is not computationally viable beyond 3 or 4 partners, since the number of subsets to evaluate grows exponentially with the number of partners. As such, most approaches are in fact approximations of Shapley. They all rely on being able to compute the performance of the model trained on subsets of the partners and their datasets (see the dedicated MPLC project for more information)
    • Note 1: the Shapley approximations studied so far are all based on sampling, and on optimizing the sampling of certain subsets. They are relevant in use cases with large numbers of partners (e.g. 100+). In our typical cases, with a few to a few tens of partners, they aren't really applicable
    • Note 2: on the other hand, exact Shapley is fine up to 3 or 4 partners. So we have a gap between roughly 4 and 100 partners
  3. Other types of approaches, not based on Shapley, are being contemplated and tested, including approaches based on:
    • model ensembling (also Shapley-based, but with a different characteristic function which doesn't require additional intensive model training, but many model aggregations instead)
    • measuring performance increments after each local minibatch
    • reinforcement learning (DVRL / PVRL)
  4. If we now explore what it would mean to integrate some of this with Substra:
    1. For Shapley:
      • We need to be able to train models on subsets of the available partners, test them to get performance figures, and in the end do some computations on all those performance figures. This could be accomplished "manually": registering dedicated compute plans for each model training and testing, fetching the performance figures, and computing the Shapley values ourselves at the end
      • To go one step further, we could design one large compute plan encompassing all the model trainings and tests that are needed. Since all those tasks are known from the start, this should be possible
      • Ultimately, to really automate it, the final task of the large compute plan could be a traintuple that executes the computation of the Shapley values, using as inputs the performance figures from the compute plan's past testtuples. This would not require training on a dataset, so it would be an "edge use" of a traintuple, just executing a script.
    2. For the federated step-by-step increments:
      • In this approach, after each local training on a mini-batch, we run a performance test and save the performance figure. After the whole training is completed, we take all those performance figures and compile them
      • So the high-level analysis of how to run this on a multi-partner Substra setup is similar to Shapley above: we would need to carefully design a large compute plan encompassing all train and test tasks, with a final task launching the script that compiles all the performance figures into synthetic contributivity values
    3. For model ensembling:
      • Here again it should be similar. The only specific aspect is that we need to be able to aggregate individually trained models
    4. For DVRL / PVRL: requires more work; these approaches are complex and I don't yet properly understand how they work
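To make the baseline concrete, here is a minimal sketch of the exact Shapley computation described in point 2. The `score` characteristic function below is purely illustrative (a toy stand-in for "train a model on this subset's datasets and measure its performance"); in practice each call to it would be a full training-plus-testing run, which is exactly why the exponential number of subsets becomes prohibitive beyond 3 or 4 partners.

```python
from itertools import combinations
from math import factorial

def shapley_values(partners, score):
    """Exact Shapley value of each partner.

    score(frozenset_of_partners) is the characteristic function: the
    performance of a model trained on that subset's datasets.
    Cost is O(2^n) score evaluations, hence impractical for many partners.
    """
    n = len(partners)
    values = {}
    for p in partners:
        others = [q for q in partners if q != p]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Standard Shapley weight for a coalition of size k
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal contribution of p to this coalition
                total += weight * (score(s | {p}) - score(s))
        values[p] = total
    return values

# Toy characteristic function: performance grows with total data volume,
# with diminishing returns (purely illustrative, no real training).
data_sizes = {"A": 100, "B": 100, "C": 50}
def score(subset):
    return sum(data_sizes[p] for p in subset) ** 0.5

print(shapley_values(list(data_sizes), score))
```

By construction the values sum to `score(all partners) - score(empty set)` (the efficiency property), and the two partners with identical datasets here get identical values (symmetry); the approximation methods mentioned above trade away exactness to avoid evaluating all 2^n subsets.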

from substra-documentation.

ClementMayer commented on June 25, 2024

Update from last MAP committee (10/09/2020):

  • Work is in progress in the contributivity working group. The package already offers 8 to 9 different calculation methods. No large benchmark has been run and no findings have been shared yet
  • It seems possible to design a large compute plan with all the train and test tasks needed to run certain contributivity measurement methods
  • Interesting next steps:
    • quick review by Camille to see whether Eric's first thoughts are coherent
    • Include in Arthur’s internship roadmap to work on a “PoC” of a simple contributivity measurement approach on Substra
    • Identify another dataset in order to have other data / thoughts
  • Want to join the discussion? Join the public Slack channel #workgroup-mpl-contributivity and participate in the discussions and the dedicated repos.
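On the "large compute plan" point above: since every subset training and its test are known in advance, the full task graph can be enumerated up front. The sketch below only lays out illustrative task descriptions as plain dicts — this is not the Substra SDK schema, just a shape for the dependency structure (train task per subset, test task depending on it, and one final task that consumes all test metrics).

```python
from itertools import combinations

partners = ["org1", "org2", "org3"]  # hypothetical partner identifiers

# Enumerate every non-empty subset of partners; each subset yields one
# training task and one testing task depending on it. Task dicts are
# purely illustrative, not the actual Substra API.
tasks = []
for k in range(1, len(partners) + 1):
    for subset in combinations(partners, k):
        key = "+".join(subset)
        tasks.append({"kind": "train", "id": f"train_{key}", "datasets": list(subset)})
        tasks.append({"kind": "test", "id": f"test_{key}", "depends_on": f"train_{key}"})

# Final task: consumes all test metrics and computes contributivity scores
# (the "edge use" traintuple discussed earlier, just executing a script).
tasks.append({
    "kind": "aggregate",
    "id": "compute_contributivity",
    "depends_on": [t["id"] for t in tasks if t["kind"] == "test"],
})

print(len(tasks))  # 2 * (2**3 - 1) + 1 = 15 tasks
```

With 3 partners this is already 15 tasks; the count doubles with each additional partner, which matches the scaling concern raised in the Shapley discussion.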

from substra-documentation.

RomainGoussault commented on June 25, 2024

Closing stale issue.

from substra-documentation.
