
lymph's Introduction

Hey, I'm Roman 👋

🔭 Working on probabilistic models to predict how cancer spreads
👯 Interested in collaborating on datasets of lymphatic progression patterns in head & neck cancer
💬 Always happy to hear feedback on our interactive Lymphatic Progression eXplorer (LyProX)

📚🔍 Research fields

I am a PostDoc in the medical physics research group of Prof. Jan Unkelbach at the University of Zurich and the University Hospital Zurich.

In our main project, we model the risk of metastases in the lymphatic system of patients with squamous cell carcinomas in the head & neck region. You can read more about that in an excellent paper by a PostDoc in our group: Pouymayou et al. You can also check out the code for our lymph model, a Python package to learn and compute this risk of lymphatic metastases using Bayesian networks (as in the mentioned paper) and, more recently, hidden Markov models (Ludwig et al).

Another project deals with optimal fractionation schemes. Fractionation is the practice of splitting a prescribed dose of radiation, designed to kill cancer cells in a tumor, into multiple sessions so that the healthy parts of the body can recover better. Innovative technologies like the MR-Linac at our institution enable us to tackle this problem with reinforcement learning.

🔭 Topics I'm interested in

  • probabilistic models
  • interpretable machine learning methods
  • statistical learning theory

and also (though not necessarily research-related)

  • 🌌 (theoretical) astrophysics (I did my master's in this group)
  • web development
  • open source

🛠️ Tech Stack

Writing: Markdown, Quarto, LaTeX
Coding: Python, NumPy, Pandas, SciPy, Jupyter Notebook, Django
Dev: Git, GitHub, GitHub Actions, CodeCov
Software: Microsoft Office, Affinity Photo, Inkscape
Learning: Exercism, Julia, JavaScript

Thanks a lot for reading 😃

📫 In case you want to reach me: [email protected]

lymph's People

Contributors

larstwi, rmnldwg, yoelph


lymph's Issues

trinary model

Implement the trinary model, where every LNL can be in one of three states: healthy, microscopic involvement, or macroscopic involvement.

Setting for shared trinary params not accessible via bilateral model

From the Bilateral model, it is somewhat cumbersome to set the is_micro_mod_shared and is_growth_shared attributes of the two Unilateral instances. While they can be set via the unilateral_kwargs in the Bilateral model's constructor, changing that afterwards is annoying. Also, the default should probably be True for these attributes, but it is False.

parameter assignment "not elegant"

I just started exploring the newest version of the code. I noticed the following thing:

After setting up a model and running the get_params function, I get the following output:

model.get_params(as_dict=True)
----------------------------------------------------------------
{'primarytoII_spread': 0.0,
 'primarytoIII_spread': 0.0,
 'primarytoIV_spread': 0.0,
 'IItoIII_spread': 0.0,
 'IIItoIV_spread': 0.0,
 'late_p': 0.5}

There are no underscores between primary and to, for example, which does not look very nice in my opinion. I think we should put them back in, or is there a reason why we do not want that?

Edit: I just went through the code and noticed that you did this on purpose. So there is no real issue; I was just used to the old naming convention :)

diagnose time prior updating is not checked

There is an issue when sampling, since NaN values are produced in the likelihood. I pinned the problem down to the diagnose time file.

In line 188 we have:

    self._kwargs.update({p: kwargs[p] for p in params_to_set})

which means that we do not check whether the parameter is in an allowed range. I exchanged it with:

    for p in params_to_set:
        if not 0. <= kwargs[p] <= 1.:
            raise ValueError("diagnose probability must be between 0 and 1!")
        self._kwargs.update({p: kwargs[p]})

multiple modalities

While the formalism with the observation matrix B and the data matrix C is quite elegant and general, it comes with significant performance issues as soon as one wants to use more than two diagnostic modalities. This is mainly because these matrices grow rapidly with the number of modalities.

There might be a simple solution though: instead of computing the probability of every possible combination of diagnoses given any possible hidden state, only compute the likelihood of the actually observed diagnoses given all possible hidden states. This would be carried out once, when the data is loaded, and would result in a matrix of size 2^N x P, where N is the number of LNLs and P the number of patients.
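A minimal numpy sketch of that precomputation (with illustrative sensitivity/specificity values for a single modality; not the package's actual API):

```python
import itertools
import numpy as np

def diagnose_likelihoods(observed, sens=0.8, spec=0.9):
    """Likelihood of each patient's observed diagnosis given every
    possible hidden state.

    ``observed`` is a binary array of shape (P, N): one diagnosis per
    LNL per patient. The result has shape (2^N, P) and only needs to
    be computed once, when the data is loaded.
    """
    num_lnls = observed.shape[1]
    # enumerate all 2^N hidden states as rows of a (2^N, N) array
    states = np.array(list(itertools.product([0, 1], repeat=num_lnls)))
    # P(positive finding | state): sensitivity if involved, else 1 - specificity
    p_pos = np.where(states == 1, sens, 1.0 - spec)           # (2^N, N)
    # per-LNL probability of the actually observed finding, multiplied over LNLs
    obs = observed[np.newaxis, :, :]                          # (1, P, N)
    pos = p_pos[:, np.newaxis, :]                             # (2^N, 1, N)
    return np.where(obs == 1, pos, 1.0 - pos).prod(axis=-1)   # (2^N, P)
```

For N = 3 LNLs and P = 5 patients, this yields an 8 x 5 matrix.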

vectorize likelihood

For performance reasons, it would be great if the likelihood function (as well as the risk and set_theta functions) could take a list of parameter vectors and consequently return a list of likelihoods (or risks, respectively).
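What such a vectorized signature could look like, sketched with a simple stand-in likelihood (generic numpy, not the package's API):

```python
import numpy as np

def bernoulli_log_likelihood(params, data):
    """Log-likelihood of binary ``data`` under per-feature Bernoulli
    probabilities, evaluated for MANY parameter vectors at once.

    ``params`` has shape (K, N): K candidate parameter vectors.
    ``data`` has shape (P, N): P patients. Returns shape (K,).
    """
    params = np.clip(params, 1e-12, 1 - 1e-12)
    # broadcast to (K, P, N), then sum log-probs over patients and features
    log_p = np.where(data[np.newaxis], np.log(params[:, np.newaxis]),
                     np.log1p(-params[:, np.newaxis]))
    return log_p.sum(axis=(1, 2))
```

Passing a (K, N) array instead of looping K times over a (N,) vector lets numpy do the heavy lifting in one go.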

Graph gets messed up

Apparently, the graph sometimes gets disordered, which is likely due to the use of set(value) in the __init__ call of the Unilateral class. Sets in Python are unordered, and it seems this causes a random order of the elements when iterating over them.
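A generic sketch of a fix (not the package's code): deduplicating with dict.fromkeys keeps the first-seen order, since dicts preserve insertion order from Python 3.7 on.

```python
def unique_ordered(items):
    """Deduplicate ``items`` while preserving their first-seen order.

    Unlike ``set(items)``, the result is deterministic, so iterating
    over the graph's nodes always yields them in the order in which
    they were defined.
    """
    return list(dict.fromkeys(items))
```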

type hinting

Improve the documentation by using Python's type hinting according to PEP 484. There's also a Sphinx extension capable of handling this correctly.
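A small example of the kind of annotations PEP 484 enables (hypothetical function, for illustration only):

```python
from __future__ import annotations

def involved_lnls(risks: dict[str, float], threshold: float = 0.1) -> list[str]:
    """Return the names of all LNLs whose risk exceeds ``threshold``.

    The annotations document the interface for readers and let tools
    like mypy check call sites; extensions such as
    sphinx-autodoc-typehints can render them in the generated docs.
    """
    return [lnl for lnl, risk in risks.items() if risk > threshold]
```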

Plotting

Similar to the issue with all the code I copy and paste for every round of sampling, I often find myself doing similarly pointless activities just to get the inferred samples plotted nicely.

An idea to solve this could be a wrapper of the corner() function from the corner.py package that styles and displays the corner plots the way I usually do it.

assign spread parameters by keyword

Right now, it is sometimes confusing to assign the correct parameters, because they are simply passed in one - sometimes rather long - list to the model. It would be great to have a system of assigning spread parameters by keyword.

For example, if we had a method assign_parameters in the Unilateral class, we could allow setting parameters like this:

model.assign_parameters(spread_T_to_III=0.45, growth=0.82)

The names of the keywords could be auto-computed from the names of the tumor and the LNLs. In this example, the tumor would be called T and one of the LNLs III. Hence, the keyword to assign to the Edge instance between the two respective Node instances would be called spread_T_to_III.

This has several advantages:

  1. It avoids confusion and is more readable.
  2. One could easily set single parameters without the need to pass an entire array.
  3. The emcee package has a couple of quality-of-life features that we can use if parameters are assigned by keyword.
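A toy sketch of how such keyword-based assignment could work (class and attribute names are hypothetical stand-ins, not the package's implementation):

```python
class ToyModel:
    """Minimal stand-in for ``Unilateral``, holding named parameters."""

    def __init__(self, param_names):
        # e.g. auto-computed names like "spread_T_to_III" or "growth"
        self.params = dict.fromkeys(param_names, 0.0)

    def assign_parameters(self, **kwargs):
        """Set individual parameters by keyword, rejecting unknown names."""
        for name, value in kwargs.items():
            if name not in self.params:
                raise KeyError(f"unknown parameter: {name}")
            self.params[name] = value

model = ToyModel(["spread_T_to_II", "spread_T_to_III", "growth"])
model.assign_parameters(spread_T_to_III=0.45, growth=0.82)
```

Unknown keywords raise immediately, which also catches typos that a positional list would silently misassign.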

make library more modular and reusable

Currently, most of the methods do their job well but can't be reused. There is also the issue of overlapping functionality. For example, both the likelihood and the risk methods compute the prior over the hidden states given the spread parameters.

To resolve this, large parts of the code base should be broken apart and refactored into smaller and more general methods. A consequence of this would also be that large methods like the risk and the likelihood are easier to understand on an abstract level and the smaller methods they are made up of can be tested more effectively.

package anatomy

I have noticed that the file structure as well as the importing isn't quite standard.

Bug in risk comp_posterior_state_dist() in unilateral

There is an issue in the comp_diagnose_encoding function in [unilateral.py](https://github.com/rmnldwg/lymph/blob/main/lymph/models/unilateral.py).

In line 876 we have:

    diagnose_given_state = diagnose_encoding @ self.observation_matrix

This does not work, as the two matrices do not have matching dimensions. After going through all the underlying code, I assume that the dimensions of self.observation_matrix are simply transposed. The following fix makes the function work:

        diagnose_given_state = diagnose_encoding @ self.observation_matrix.T

improve risk method

The risk method handles the involvement of interest and the observed diagnoses inconsistently. I would like to clean that up and, e.g., standardize the argument names and how these arguments are structured.

Logging

Implement some logging for more feedback while using the package.

marginalization over contralateral states wrong

In the current implementation, it seems I forgot the marginalization over the exact time-point of midline crossing in the contralateral state evolution again. I simply weight the contralateral state-distribution evolution where midext was never present, and the one where it was always present, with the evolution over midline extension. This is WRONG!

Synchronization is unreadable and error-prone

I think my attempt at synchronizing the attributes of the ipsi and contra instances in the Bilateral model class is bad: it's hard to understand what's going on, the state of some objects changes "magically", one cannot even look up what is synchronized, and I am losing my mind over it.

However, I think there may be a simpler and more elegant solution: a SynchronizerMixin, in a similar fashion to the DelegatorMixin. Such a mixin class could dynamically create methods and properties that keep e.g. some ipsi and contra attributes in sync. The advantage would be that the entire synchronization becomes the responsibility of the Bilateral (or Midline) class. The ipsi and contra objects would, on their own, still work as expected and not leak anything to each other via obscure synchronization callbacks.

This could be much easier to test, as well.

use NetworkX library

Just found out that there's a library for generating and analysing graph structures out there. It's called NetworkX and it might make parts of my code more readable and/or performant, so I should look into this... But on the other hand it might mean rewriting large parts of the code base....

Sampler

I find myself copying and pasting lots of settings and code snippets from one Jupyter notebook to the other whenever I want to do a sampling/inference round.

Maybe I could come up with a utility function/class that takes care of most of these things and makes the whole setup more reproducible by neatly storing the settings and results in e.g. an HDF5 file.

One idea would be something like a LymphSampler class that inherits from emcee's awesome EnsembleSampler but adds some functionality like storing the settings and the results automatically in an HDF5 file.

display graph nicely

Use the plotting library daft to optionally display the structure of the graph in a visually appealing way.

matrices not updated in bilateral model

While debugging my code I stumbled over an interesting problem which only arises when changing the modalities of a bilateral model.

Updating the modalities in any model should update the observation_matrix. This works for the ipsilateral side, but not for the contralateral side. It seems as if the callback for the contralateral side is not successful. I am not 100% sure whether this is due to the synchronization or if there is another problem.

I am working on a fix, but I wanted to notify you since you probably can detect the problem faster and produce a better fix.

Edit: my fast fix was quite simple. Since the problem is that the contralateral side does not trigger callbacks anymore, I am applying a different syncing function for the modalities:

def init_dict_sync2(
    this: AbstractLookupDict,
    other: AbstractLookupDict,
) -> None:
    """Add callback to ``this`` to sync with ``other``."""
    def sync():
        other.clear()
        other.update(this)

    this.trigger_callbacks.append(sync)

Thus we also need to change the way modalities are synced in the init_synchronization function:

        # Sync modalities
        if self.is_symmetric["modalities"]:
            init_dict_sync2(
                this=self.ipsi.modalities,
                other=self.contra.modalities,
            )

I am sure that you can come up with a more elegant solution, as I am struggling a bit to get a good overview of all the callbacks and syncing functions.

notebook

add notebook that makes creation of plots and computation of results reproducible

loading an empty `DataFrame` raises error

When loading an empty DataFrame (i.e., one with no rows), the Unilateral model raises the following exception:

ValueError: Cannot set a DataFrame with multiple columns to the single column ('_model', '#', 't_stage')

semantic likelihood

I found out that emcee's EnsembleSampler has a parameter_names argument. When it is provided, the sampler passes a dictionary of values to the likelihood function. With this, I could make the likelihood function semantic, instead of describing in a lot of detail where to put which set of parameters.
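A sketch of what such a semantic likelihood could look like (the parameter names are illustrative, and the actual computation is a stand-in):

```python
def log_prob(params: dict) -> float:
    """Semantic log-probability: parameters arrive by name, not position.

    With something like ``emcee.EnsembleSampler(..., parameter_names=
    ["spread_T_to_II", "late_p"])``, the sampler builds this dict from
    the flat walker positions automatically.
    """
    spread, late_p = params["spread_T_to_II"], params["late_p"]
    if not (0.0 <= spread <= 1.0 and 0.0 <= late_p <= 1.0):
        return float("-inf")  # parameters outside their allowed range
    return 0.0  # stand-in for the actual likelihood computation
```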

package does not follow separation of concerns

Currently, the Unilateral class handles passing spread probabilities to the Node instances to help compute its trans_prob. But the Node class should be able to do that itself, based on its incoming Edge instances.

diagnose matrices not aligned with data

Because the diagnose matrices are computed separately for each T-stage, they are not aligned with the patient data stored in the model anymore.

A solution could be to store the diagnose matrices in the patient data DataFrame and filter that by T-stage when model.diagnose_matrices[t_stage] is called.

This has benefits for both the Bayesian network implementation and the mixture model. And if I didn't overlook anything, this should be possible without breaking changes.
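A sketch of that idea with pandas (column names and shapes are hypothetical):

```python
import numpy as np
import pandas as pd

# patient table with a T-stage column plus one column per hidden state,
# holding each patient's diagnose likelihood for that state
patient_data = pd.DataFrame({"t_stage": ["early", "late", "early"]})
num_states = 4  # 2^N hidden states for N = 2 LNLs
rng = np.random.default_rng(0)
for i in range(num_states):
    patient_data[f"state_{i}"] = rng.random(len(patient_data))

def diagnose_matrix(data: pd.DataFrame, t_stage: str) -> np.ndarray:
    """Slice the (2^N, P) diagnose matrix for one T-stage out of the
    patient table, so its columns can never get misaligned with the
    stored patients."""
    state_cols = [c for c in data.columns if c.startswith("state_")]
    return data.loc[data["t_stage"] == t_stage, state_cols].to_numpy().T
```

Because the matrix lives in the same DataFrame as the patient rows, filtering by T-stage keeps both in lockstep by construction.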

Mid-sagittal extension

Incorporate the information of whether or not a tumour extends over the mid-sagittal plane into the model, as a risk factor for contralateral involvement.

The core idea so far is to introduce (ideally only) one new parameter that modulates the contralateral base probabilities depending on the extension risk factor.

Regarding this parameter, the best idea I have had so far is to write the contralateral base probabilities for patients with midline extension as a linear combination of the ipsilateral base parameters and the contralateral base probabilities for patients without midline extension. Whether I will really use this is yet to be decided.

symmetric tumor spread in `bilateral` does not trigger tensor deletion of `_transition_tensor`

I found another issue which I can't quite figure out yet.

In bilateral.py there is the option to set up tumor-edge symmetry (e.g. for a "tumor on the midline" model). However, assigning new base spread parameters (after setting them initially) does not trigger a deletion of the _transition_tensor in the edges connecting tumor to LNL. Interestingly, this issue does not show up for edges which connect LNLs (when setting LNL transition symmetry).
I tried to figure out whether there is a difference in how the edges are treated, or a different callback between tumor edges and LNL edges, but found none.
The parameters are correctly assigned to the contralateral edges, but the callback seems not to function properly. I will try to find the origin of this bug and report it here.

Allow `Tumor` nodes to have states (aka T-categories) as well

Given our new implementation of the graph representation and particularly the Tumor node implementation, it should be straightforward to enable tracking a tumor's T-category as a random variable:

We basically need to give the Tumor node the allowed_states = [0, 1, 2, 3, 4] and add an Edge instance that has this Tumor node both as start and as end (similar to the growth edges). This edge would then be parametrized with a probability that during one time-step the T-category increases by one.

If functions like generate_transition() then don't just iterate over LNLs, but over all nodes, and consider their possible evolutions, this Tumor node's state would be automatically incorporated and tracked as a random variable, just like all other LymphNodeLevel nodes' states.

When tracking T-category as a random variable, the distributions over diagnose times make no sense anymore, and we would need to define a "diagnosis probability" that depends on all the nodes' states. We discussed this idea already.

Wrong product in `comp_posterior_joint_state_dist` function.

I found a second bug in the bilateral code, in the comp_posterior_joint_state_dist function. In line 605 we multiply by the ipsilateral diagnose vector, which is correct, but the transposition of the matrix does not have the desired effect. To achieve it, we need to build a column vector instead.

Current state:

        joint_state_dist = self.comp_joint_state_dist(t_stage=t_stage, mode=mode)
        # matrix with P(Zi=zi,Zc=zc|Xi,Xc) * P(Xi,Xc) for all states Xi,Xc.
        joint_diagnose_and_state = (
            diagnose_given_state["ipsi"].T
            * joint_state_dist
            * diagnose_given_state["contra"]
        )

Potential fix:

joint_state_dist = self.comp_joint_state_dist(t_stage=t_stage, mode=mode)
# matrix with P(Zi=zi,Zc=zc|Xi,Xc) * P(Xi,Xc) for all states Xi,Xc.
joint_diagnose_and_state = (
    diagnose_given_state["ipsi"][:, np.newaxis] 
    * joint_state_dist
    * diagnose_given_state["contra"]
)

Originally posted by @YoelPH in #60 (comment)
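The root cause is a numpy broadcasting detail worth spelling out: transposing a 1-D array is a no-op, whereas inserting a new axis turns it into a column vector whose product with a row vector broadcasts to the full matrix (toy shapes and values for illustration):

```python
import numpy as np

ipsi = np.array([0.1, 0.2])      # P(diagnosis | state), ipsilateral
contra = np.array([0.3, 0.4])    # P(diagnosis | state), contralateral
joint = np.full((2, 2), 0.25)    # joint distribution over hidden states

# .T does nothing to a 1-D array, so the original code multiplied element-wise
assert np.array_equal(ipsi.T, ipsi)

# the column vector broadcasts against the row vector:
# result[i, j] = ipsi[i] * joint[i, j] * contra[j]
result = ipsi[:, np.newaxis] * joint * contra
```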

double bilateral model

Add the option to not use the mixing parameter for the MidlineBilateral class to effectively turn the model into a "double Bilateral" one.

multiple tumours

I forgot to add the possibility of multiple tumours. I think it's rather simple to implement, as I can simply treat any tumour as a normal node that is involved from the beginning. Of course, a primary tumour's connection to the LNLs might be different than the connections among the LNLs, but that's already captured in the implementation.

fuzzy observations & sublevels

In some data we received, the involvement is not always reported per level, but sometimes it is reported for a group of levels. For example, a pathology report might state that there have been metastases found in a tissue sample taken from the levels III and IV. From a probabilistic standpoint, this is easy to deal with as it means that LNL III or IV are involved and we can simply marginalize over the respective diagnoses.

The problem now lies with the implementation. I think it would be important to add this possibility of reporting "fuzzy" involvement also for the cases where we don't have detailed sub-level reports (e.g. level IIa & IIb). But the implementation should be consistent and comprehensible.
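The marginalization itself can be sketched in a few lines (hypothetical helper; independent per-LNL probabilities assumed purely for illustration):

```python
import itertools

def fuzzy_likelihood(per_lnl_probs, fuzzy_lnls):
    """P(metastases found in at least one of ``fuzzy_lnls``).

    ``per_lnl_probs`` maps each LNL name to its probability of a
    positive finding; a report like "involvement in III or IV"
    marginalizes over every combination except the all-negative one.
    """
    total = 0.0
    for combo in itertools.product([0, 1], repeat=len(fuzzy_lnls)):
        if not any(combo):
            continue  # at least one of the grouped levels must be positive
        p = 1.0
        for lnl, state in zip(fuzzy_lnls, combo):
            p *= per_lnl_probs[lnl] if state else 1 - per_lnl_probs[lnl]
        total += p
    return total
```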

`nodes` in `graph` not correctly accessed

I found another small issue, originating from the fact that graph.nodes is no longer a list of objects but a dictionary.
In graph.py, two functions, to_dict and get_mermaid, both try to iterate over self.nodes as if it were a list:

    def to_dict(self) -> dict[tuple[str, str], set[str]]:
        """Returns graph representing this instance's nodes and egdes as dictionary."""
        res = {}
        for node in self.nodes:
            node_type = "tumor" if isinstance(node, Tumor) else "lnl"
            res[(node_type, node.name)] = {o.child.name for o in node.out}
        return res


    def get_mermaid(self) -> str:
        """Prints the graph in mermaid format.

        Example:

        >>> graph_dict = {
        ...    ("tumor", "T"): ["II", "III"],
        ...    ("lnl", "II"): ["III"],
        ...    ("lnl", "III"): [],
        ... }
        >>> graph = Representation(graph_dict)
        >>> graph.edge_params["spread_T_to_II"].set_param(0.1)
        >>> graph.edge_params["spread_T_to_III"].set_param(0.2)
        >>> graph.edge_params["spread_II_to_III"].set_param(0.3)
        >>> print(graph.get_mermaid())  # doctest: +NORMALIZE_WHITESPACE
        flowchart TD
            T-->|10%| II
            T-->|20%| III
            II-->|30%| III
        <BLANKLINE>
        """
        mermaid_graph = "flowchart TD\n"

        for idx, node in enumerate(self.nodes):
            for edge in self.nodes[idx].out:
                mermaid_graph += f"\t{node.name}-->|{edge.spread_prob:.0%}| {edge.child.name}\n"

        return mermaid_graph

This does not work anymore. For get_mermaid, for example, the fix looks like this:

        for node in self.nodes:
            for edge in self.nodes[node].out:
                mermaid_graph += f"\t{node}-->|{edge.spread_prob:.0%}| {edge.child.name}\n"

        return mermaid_graph

(the idx from enumerate is not needed anymore)

Issues with `DistributionsUserDict`

There are a number of issues with the current implementation of the AbstractUserDict subclass that need to be addressed:

  1. The method set_distribution_params() is outdated and expects every distribution in the dict to have only one parameter.
  2. It is also not consistent with the set_params() methods in the way it handles incoming parameters.
  3. The syncing function in the bilateral.py module initiates only a one-way sync between linked instances of the dict.

use pdoc for docs

I actually like pdoc a lot more than any other doc generator, so I would like to change the documentation to use it.

Parameter assignment in bilateral model

There is a small issue in bilateral.py: https://github.com/rmnldwg/lymph/blob/dev/lymph/models/bilateral.py

In line 343 we have:

        remaining_args, remainings_kwargs = self.contra.assign_params(
            *remaining_args, **contra_kwargs, **remainings_kwargs
        )

This does not work properly with the trinary model, where we need to assign "global" parameters like growth. A fix would be to replace **remainings_kwargs with **general_kwargs:

        remaining_args, remainings_kwargs = self.contra.assign_params(
            *remaining_args, **contra_kwargs, **general_kwargs
        )

By doing so, the general kwargs, like universal growth, microscopic spread, and (I guess) the time-distribution prior, are passed to both the ipsi- and the contralateral model.

if things are expensive, they should look expensive

Right now, the package makes heavy use of Python's @property decorator. For example, loading data is as "easy" as writing

model.patient_data = some_pandas_table

This looks like I am only storing the table inside the model object. But actually, it is computing the diagnose matrix, which is rather expensive. So, making this a function instead of a property would give it the look and feel of something potentially resource-intensive:

model.load_patient_data(some_pandas_table)
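A generic sketch of the difference (hypothetical class, not the package's actual code):

```python
class ToyModel:
    """Illustrates making expensive work look expensive."""

    def __init__(self):
        self.patient_data = None
        self.diagnose_matrix = None

    def load_patient_data(self, table):
        """Store the table AND run the expensive precomputation.

        A method named ``load_...`` signals that more happens here
        than a plain attribute assignment would suggest.
        """
        self.patient_data = table
        # stand-in for computing the (expensive) diagnose matrix
        self.diagnose_matrix = [sum(row) for row in table]

model = ToyModel()
model.load_patient_data([[1, 0], [1, 1]])
```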

incorporate HPV status

HPV (human papillomavirus) status is widely considered an important prognostic factor in oropharyngeal SCC. Thus, our model should be able to factor it into its predictions. However, we do not yet have a good idea how to do that.

One idea is to allow HPV+ and HPV- patients to have slightly different "time priors", indicating that not the patterns of lymphatic progression, but their speed is different for HPV+ patients.

marginalization over diagnose times

Currently, the way I provide and use the distributions that marginalize over the possible diagnose times is quite messy. This is mostly because there are often T-stages (mostly the earliest one) for which I would like to provide a fixed time marginalization, while for all others I'd like a parametrized one that gets updated when new parameters are provided (during sampling, for example).

risk from precomputed state distribution

It could be useful to implement a method that can compute risks and prevalences from a set of precomputed distributions over hidden states. This method should take basically the same arguments as the models' risk() method, but instead of samples, it should expect arrays like the ones produced by the state_dist() method.

generate data

Write a function that generates data based on some set spread probabilities and other settings. This could be useful for testing and validating.
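A minimal sketch of such a generator (hypothetical parametrization, numpy only): sample each LNL's true involvement from a marginal spread probability, then sample a noisy diagnosis on top. The real model's involvement probabilities would of course follow from the graph and the diagnose-time distribution, not from independent marginals.

```python
import numpy as np

def generate_data(spread_probs, num_patients, sens=0.8, spec=0.9, seed=42):
    """Generate synthetic per-LNL diagnoses.

    ``spread_probs`` holds one marginal involvement probability per
    LNL (a deliberate simplification for illustration).
    """
    rng = np.random.default_rng(seed)
    probs = np.asarray(spread_probs)
    # sample true hidden involvement, shape (num_patients, num_lnls)
    hidden = rng.random((num_patients, probs.size)) < probs
    # observe with imperfect sensitivity/specificity
    p_obs = np.where(hidden, sens, 1.0 - spec)
    return rng.random(hidden.shape) < p_obs
```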
