rvandewater / yaib

🧪 Yet Another ICU Benchmark: a holistic framework for the standardization of clinical prediction model experiments. Provide custom datasets, cohorts, prediction tasks, endpoints, preprocessing, and models. Paper: https://arxiv.org/abs/2306.05109

Home Page: https://github.com/rvandewater/YAIB/wiki

License: MIT License

Makefile 0.53% Python 99.26% Shell 0.21%
amsterdamumcdb benchmark clinical-data clinical-ml deep-learning ehr eicu-crd framework hirid-dataset icu machine-learning mimic-iii mimic-iv patient-monitoring time-series

yaib's Introduction


Hi 👋, I am a PhD candidate in AI in healthcare at the Hasso Plattner Institute. I am a computer scientist and data scientist by education, and I specialize in using predictive modelling methods to derive insights from healthcare data and thereby prolong patients' lives. I am interested in applying ML methods to improve healthcare.

TLDR:

  • 🧠 AI in Health Researcher
  • 🧑‍🎓 Ph.D. Candidate
  • 🏛️ Hasso Plattner Institute
  • 🏠 Berlin, Germany

yaib's People

Contributors

alisher-turubayev, anna-shopova, dependabot[bot], fabianlange18, hendrikschmidt, hugoych, mhueser, mlondschien, prockenschaub, rvandewater, snagnar, vayvy, xinruilyu


yaib's Issues

Fix example configs

Because the example configs include the other configs, and includes aren't parsed for random searches, these files currently don't work correctly and need fixing.

[REC] Group should be optional

Groups in a Recipe/Step should be optional. A Step should either never care about groups, or, when .group == True, use a group if one is specified (and otherwise behave as if there were no group). The current implementation of _apply_group(), however, forces a group for Steps with .group == True and raises an error if there is no group.
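
A minimal sketch of the optional-group behaviour, assuming a simplified stand-in for the recipes Step class (this is not the actual implementation):

from typing import Optional
import pandas as pd

class Step:
    # simplified stand-in for a recipe Step; .group mirrors the flag above
    group = True

    def _apply_group(self, data: pd.DataFrame, group: Optional[str] = None):
        # Use the grouping column only if one is specified; otherwise treat
        # the whole frame as a single group instead of raising an error.
        if self.group and group is not None:
            return data.groupby(group)
        return data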

TCN raises error

The current encoder expects num_channels to be a list whose length determines the number of layers.
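
For illustration (this is not the repository's encoder, just a sketch of the expected interface), a TCN-style stack where num_channels must be a list and its length fixes the depth:

import torch.nn as nn

def tcn_stack(in_dim: int, num_channels: list, kernel_size: int = 3) -> nn.Sequential:
    # Each entry in num_channels is one temporal block's width, so the list
    # length determines the number of layers; a scalar would fail here.
    layers = []
    for i, out_dim in enumerate(num_channels):
        dilation = 2 ** i
        layers += [
            nn.Conv1d(in_dim if i == 0 else num_channels[i - 1], out_dim,
                      kernel_size, padding=(kernel_size - 1) * dilation,
                      dilation=dilation),
            nn.ReLU(),
        ]
    return nn.Sequential(*layers)

encoder = tcn_stack(in_dim=10, num_channels=[64, 64, 64])  # pass a list, not 64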

Write tests for recipes

Write unit and integration tests for the preprocessing with recipes.

Create a GitHub Action to run these tests as a pre-commit hook.

Preprocessing via external file

For us to claim the benchmark is "an easily customisable framework", we might want to include a way to do preprocessing via an external file, as sketched below. This could be a Python file supplied via the CLI or linked with a gin-config.
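
One possible shape for this, loading a user-supplied Python file at runtime and calling a known entry point (the function name preprocess and the CLI flag are assumptions, not an existing interface):

import importlib.util
from pathlib import Path

def load_preprocessing_fn(path: str, fn_name: str = "preprocess"):
    # Import the user's file as a module and return its entry-point function.
    spec = importlib.util.spec_from_file_location(Path(path).stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, fn_name)

# e.g. supplied via a hypothetical CLI flag: --preprocessing my_steps.py
# preprocess = load_preprocessing_fn("my_steps.py")
# df = preprocess(df)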

Implement missing Selector methods

The Selector is missing a few methods that need to be implemented. Once starts_with is implemented, it can be used to make the tests for the sklearn step more robust by checking that the right number of columns is created.
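
A sketch of what starts_with could look like (the signature is an assumption), returning a selector over DataFrame columns, which a test could then use to count the columns the sklearn step produced:

from typing import Callable, List
import pandas as pd

def starts_with(prefix: str) -> Callable[[pd.DataFrame], List[str]]:
    # Select all columns whose name begins with the given prefix.
    def selector(data: pd.DataFrame) -> List[str]:
        return [col for col in data.columns if col.startswith(prefix)]
    return selector

# hypothetical test usage:
# assert len(starts_with("cat_")(baked_df)) == expected_n_columns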

Generalise splits generation

The function that generates the splits is closely tied to our dataset: it hardcodes the split proportions and the grouping variable ('stay_id'). This needs to be more general.

It also takes paths as parameters and writes to disk, which needs to be reconsidered.
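
A sketch of a more general version, with the proportions and grouping variable as parameters and the frames returned instead of written to disk (the defaults shown are assumptions):

import numpy as np
import pandas as pd

def make_splits(df: pd.DataFrame, group_var: str = "stay_id",
                fractions=(0.7, 0.1, 0.2), seed: int = 42):
    # Shuffle the unique group IDs and cut them into train/val/test,
    # so all rows of one stay end up in the same split.
    rng = np.random.default_rng(seed)
    groups = rng.permutation(df[group_var].unique())
    n_train = int(fractions[0] * len(groups))
    n_val = int(fractions[1] * len(groups))
    train_g, val_g, test_g = np.split(groups, [n_train, n_train + n_val])
    return {name: df[df[group_var].isin(g)]
            for name, g in [("train", train_g), ("val", val_g), ("test", test_g)]}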

Redesign how `Ingredients` remember roles

Pandas allows adding persistent metadata to a DataFrame by subclassing the DataFrame and specifying _metadata = ["metadata_varname"]. This is what we currently use in Ingredients to remember roles. Unfortunately, metadata is propagated by reference whenever the DataFrame is subset, meaning any roles set on a slice are also set on the original.

I got around this behaviour and enforced strict propagation by copy by overriding the __finalize__ function, which is responsible for carrying over the metadata. However, this function is annotated with @final, hinting that it should not be overridden because doing so may be error-prone. We should therefore find some other way to enforce copying of roles.
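
For reference, a condensed version of the workaround described above (heavily simplified; the real Ingredients class does more):

import copy
import pandas as pd

class Ingredients(pd.DataFrame):
    _metadata = ["roles"]  # roles survive pandas operations via _metadata

    @property
    def _constructor(self):
        return Ingredients

    def __finalize__(self, other, method=None, **kwargs):
        # Deep-copy the metadata instead of propagating the reference, so
        # roles set on a slice do not leak back into the original frame.
        # Overriding __finalize__ is exactly what the @final annotation warns against.
        for name in self._metadata:
            object.__setattr__(self, name, copy.deepcopy(getattr(other, name, None)))
        return self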

Include demo data

Preprocess demo data for MIMIC and eICU and use it for examples and tests.

Update LGBM

The current architecture uses a deprecated feature:

UserWarning: 'early_stopping_rounds' argument is deprecated and will be removed in a future release of LightGBM. Pass 'early_stopping()' callback via 'callbacks' argument instead.

Update to the current library method; see the sketch below.
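
The non-deprecated form passes the lgb.early_stopping() callback through the callbacks argument, e.g. with the sklearn interface (synthetic data here, purely for illustration):

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = lgb.LGBMClassifier(n_estimators=1000)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    # replaces the deprecated early_stopping_rounds= argument
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)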

Get training to work in parallel on different architectures

Right now, when trying to run the training of the DL models, the multiprocessing throws an error on my machine (MacBook Pro with M1 Pro).

python -m icu_benchmarks.run train \
                             -c configs/hirid/Classification/LSTM.gin \
                             -l logs/random_search/24h_multiclass/LSTM/run \
                             -t Phenotyping_APACHEGroup \
                             --num-class 15 \
                             --maxlen 288 \
                             -rs True\
                             -lr  3e-4 1e-4 3e-5 1e-5\
                             -sd 1111 2222 3333 \
                             --hidden 32 64 128 256 \
                             --do 0.0 0.1 0.2 0.3 0.4 \
                             --depth 1 2 3
OMP: Info #271: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
2022-08-30 11:16:08,290 - INFO: Model will be trained using CPU Hardware. This should be considerably slower
Traceback (most recent call last):
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/hendrikschmidt/projects/thesis/YAIB/icu_benchmarks/run.py", line 426, in <module>
    main()
  File "/Users/hendrikschmidt/projects/thesis/YAIB/icu_benchmarks/run.py", line 406, in main
    train_with_gin(model_dir=log_dir_seed,
  File "/Users/hendrikschmidt/projects/thesis/YAIB/icu_benchmarks/models/train.py", line 46, in train_with_gin
    train_common(model_dir, overwrite, load_weights)
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/site-packages/gin/config.py", line 1531, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/site-packages/gin/config.py", line 1508, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/Users/hendrikschmidt/projects/thesis/YAIB/icu_benchmarks/models/train.py", line 88, in train_common
    model.train(dataset, val_dataset, weight)
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/site-packages/gin/config.py", line 1531, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/site-packages/gin/config.py", line 1508, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/Users/hendrikschmidt/projects/thesis/YAIB/icu_benchmarks/models/wrappers.py", line 179, in train
    train_loss, train_metric_results = self._do_training(train_loader, weight, metrics)
  File "/Users/hendrikschmidt/projects/thesis/YAIB/icu_benchmarks/models/wrappers.py", line 134, in _do_training
    for t, elem in tqdm(enumerate(train_loader)):
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 355, in __iter__
    return self._get_iterator()
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 301, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 914, in __init__
    w.start()
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/Users/hendrikschmidt/opt/anaconda3/envs/icu-benchmark/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'WeakValueDictionary.__init__.<locals>.remove'
  In call to configurable 'train' (<function DLWrapper.train at 0x7fcc5bec0790>)
  In call to configurable 'train_common' (<function train_common at 0x7fcc3b5000d0>)
Closing remaining open files:/Users/hendrikschmidt/projects/thesis/data/hirid_preprocessed/ml_stage/ml_stage_12h.h5...done/Users/hendrikschmidt/projects/thesis/data/hirid_preprocessed/ml_stage/ml_stage_12h.h5...done

The issue might be the dataloader or the way the H5 file is opened; a possible solution is described in pytorch/pytorch#11929 (comment).
Ideally, training would work on all architectures, not only Linux, to keep development fast.
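
The workaround described in that thread is to open the HDF5 file lazily inside each worker rather than pickling an open handle. A minimal sketch (the dataset key "data" is an assumption):

import h5py
from torch.utils.data import Dataset

class H5Dataset(Dataset):
    def __init__(self, path: str):
        self.path = path
        self._file = None  # opened per worker on first access, so nothing
                           # unpicklable crosses the spawn boundary
        with h5py.File(path, "r") as f:
            self._len = len(f["data"])

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.path, "r")
        return self._file["data"][idx]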

Investigate issue with split index

At the moment, StepHistorical leads to issues because it sometimes drops the complete index. The splits are implemented as an index on the DataFrame, so they get lost as well and can't be lined up with the result.
This could also lead to problems when trying to do the following:

rec.prep(data=train_df)
rec.bake(data=val_df)
rec.bake(data=test_df)

Investigate whether a different setup for the splits makes sense (e.g. a dict with three keys, which implies changing the dataloader too), or whether the HistoricalSteps (and maybe more of the preprocessing) need to be changed.
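
For example, keeping the splits as explicit frames in a dict, rather than as an index level, would make them robust to steps that drop the index. This is only a sketch (the dataloader would need to accept the same structure), continuing the snippet above:

splits = {"train": train_df, "val": val_df, "test": test_df}

rec.prep(data=splits["train"])
baked = {name: rec.bake(data=df) for name, df in splits.items()}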

Add Sampling Options

We want users to be able to specify a type of balanced sampling, so that the train and test sets have the same ratio of positive and negative cases (for binary classification).

Potential candidates for oversampling (see the sketch after this list):

  • Random oversampling
  • Synthetic Minority Oversampling (SMOTE)
  • Adaptive Synthetic (ADASYN)
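
All three candidates are available with a uniform fit_resample interface in imbalanced-learn, which would make the sampler easy to select from a config. A sketch with synthetic data (the config-key names are assumptions):

from imblearn.over_sampling import ADASYN, RandomOverSampler, SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

samplers = {"random": RandomOverSampler(), "smote": SMOTE(), "adasyn": ADASYN()}
X_res, y_res = samplers["smote"].fit_resample(X, y)  # balanced classes out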

Change behaviour of prep and bake

At the moment, the behaviour of prep and bake is confusing, because prep transforms the data in addition to fitting it. Either consolidate the two functions into one or distinguish between them more clearly, and make sure that fitting on one split and transforming on another works.
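
One way to make the contract explicit is to mirror sklearn's fit/transform split. A sketch of the intended usage, not the current behaviour:

rec.prep(data=train_df)            # fit only: learn means, categories, ...
train = rec.bake(data=train_df)    # transform using the fitted state
val = rec.bake(data=val_df)        # same fitted state applied to another split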

Common tasks in `Step.transform()`

@HendrikSchmidt in #41 raised the question of whether common tasks in Step.transform(), such as calling _check_ingredients(), should be moved into the parent class. This is already the case for Step.fit(), which delegates to a dedicated do_fit() function that must be overridden by child classes.
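
Moving the common tasks up would follow the same template-method pattern already used for fit()/do_fit(). A simplified sketch, not the actual class:

class Step:
    def transform(self, data):
        # common tasks live here, once, in the parent ...
        self._check_ingredients(data)
        return self.do_transform(data)

    def do_transform(self, data):
        # ... and children only override the actual transformation
        raise NotImplementedError

    def _check_ingredients(self, data):
        pass  # shared validation logic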

Pass preprocessed DFs as parameter instead of writing to disk

At the moment, all preprocessed DataFrames (splits, features, imputation) are written to disk before being read back by the loader. To make the process more adaptable, let the main method pass the DataFrames directly to the loader/Dataset and have the train method use them. Caching the results on disk could still be an option.
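
In outline, the main method would then look something like this (preprocess, PredictionDataset, and train_model are hypothetical names, not the current API):

splits = preprocess(raw_df)  # {"train": ..., "val": ..., "test": ...}, kept in memory
train_ds = PredictionDataset(splits["train"])
val_ds = PredictionDataset(splits["val"])
train_model(train_ds, val_ds)  # caching splits to disk stays optional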

Remove superfluous `train_with_gin`

train_with_gin no longer seems necessary, as gin files are now parsed elsewhere. Remove it in favour of a simple train function.

Numpy version conflict

Installing the conda environment from the YML file results in a NumPy problem: TypeError: <class 'numpy.typing._dtype_like._SupportsDType'> is not a generic class

For now, I have updated NumPy to 1.21.6 in my codebase, which seems to solve that particular issue. Further investigation is needed to find out what impact this has.

Redo Result Saving

The results are currently stored as a pickled object. It would make more sense to store them in a readable CSV file that can easily be copied into a results sheet, e.g. as sketched below.
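
For instance, collecting per-run metrics as rows and writing them with pandas (dummy values; the column names are assumptions):

import pandas as pd

results = [{"model": "LSTM", "seed": 1111, "auroc": 0.0, "auprc": 0.0}]  # dummy values
pd.DataFrame(results).to_csv("results.csv", index=False)  # readable, sheet-friendly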

Fix GRU architecture

The output of the GRU encoder is too short, which leads to failures during training. Investigate the architecture and parameters, and check whether this bug existed in the original benchmarks as well.
