
mld3 / fiddle


FlexIble Data-Driven pipeLinE – a preprocessing pipeline that transforms structured EHR data into feature vectors to be used with ML algorithms. https://doi.org/10.1093/jamia/ocaa139

Home Page: http://tiny.cc/get_FIDDLE

License: MIT License

Python 5.57% Jupyter Notebook 94.30% R 0.12% Dockerfile 0.01%
electronic-health-records preprocessing machine-learning jamia data-science

fiddle's Introduction

FIDDLE

FIDDLE – FlexIble Data-Driven pipeLinE – is a preprocessing pipeline that transforms structured EHR data into feature vectors that can be used with ML algorithms, relying on only a small number of user-defined arguments.

Try a quick demo here: tiny.cc/FIDDLE-demo

Contributions and feedback are welcome; please submit issues on the GitHub site: https://github.com/MLD3/FIDDLE/issues.

To enhance reproducibility, we have released preprocessed features for MIMIC-III and eICU, along with accompanying code for the experiments in the paper; refer to the linked websites for more details.

If you use FIDDLE in your research, please cite the following publication:

@article{FIDDLE,
    author = {Tang, Shengpu and Davarmanesh, Parmida and Song, Yanmeng and Koutra, Danai and Sjoding, Michael W and Wiens, Jenna},
    title = "{Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data}",
    journal = {Journal of the American Medical Informatics Association},
    year = {2020},
    month = {10},
    doi = {10.1093/jamia/ocaa139},
}

System Requirements

Pip

Requires Python 3.7 or above (older versions may still work but have not been tested). Required packages and versions are listed in requirements.txt. Run the following command to install them:

pip install -r requirements.txt

Docker

To build the docker image, run the following command:

docker build -t fiddle-v020 .

Refer to the notebook tests/small_test/Run-docker.ipynb for an example of running FIDDLE in Docker.

Usage Notes

FIDDLE generates feature vectors based on data within the observation period $t\in[0,T]$. This feature representation can be used to make predictions of adverse outcomes at $t=T$. More specifically, FIDDLE outputs a set of binary feature vectors for each example $i$, $\{(s_i, x_i)\}_{i=1}^{N}$, where $s_i \in \mathbb{R}^d$ contains time-invariant features and $x_i \in \mathbb{R}^{L \times D}$ contains time-dependent features.

Input:

  • formatted EHR data: .csv or .p/.pickle file, a table with 4 columns [ID, t, variable_name, variable_value]
  • population file: a list of unique IDs you want processed
    • the output feature matrix will correspond to IDs in lexicographically sorted order
  • config file:
    • specifies additional settings by providing a custom config.yaml file
    • a default config file is located at FIDDLE/config-default.yaml
  • arguments:
    • T: The time of prediction; time-dependent features will be generated using data in $t\in[0,T]$.
    • dt: the temporal granularity at which to "window" time-dependent data.
    • theta_1: The threshold for Pre-filter.
    • theta_2: The threshold for Post-filter.
    • theta_freq: The threshold at which we deem a variable “frequent” (for which summary statistics will be calculated).
    • stats_functions: A set of 𝐾 statistics functions (e.g., min, max, mean). Each function is used to calculate a summary value using all recordings within a single time bin. These functions are only applicable to “frequent” variables as determined by theta_freq.
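As an illustration, a formatted input table with the four required columns can be constructed as below. The IDs, variable names, and values are hypothetical, and rows with an empty t are assumed here to represent time-invariant variables:

```python
import pandas as pd

# Hypothetical formatted EHR data: one row per recorded value, with
# columns [ID, t, variable_name, variable_value]. Rows with an empty t
# are assumed here to carry time-invariant variables (e.g., AGE).
rows = [
    (101, None, 'AGE',        '65'),
    (101, 0.0,  'HEART_RATE', '88'),
    (101, 1.0,  'HEART_RATE', '91'),
    (102, None, 'AGE',        '72'),
    (102, 0.5,  'LACTATE',    '2.1'),
]
df = pd.DataFrame(rows, columns=['ID', 't', 'variable_name', 'variable_value'])
print(df)
```

Such a table can then be saved as the .csv or .p/.pickle input file, with the population file listing the unique IDs (101 and 102 here).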

Output: The generated features and associated metadata are located in {data_path}/:

  • s.npz: a sparse array of shape (N, d)
  • X.npz: a sparse tensor of shape (N, L, D)
  • s.feature_names.json: names of d time-invariant features
  • X.feature_names.json: names of D time-series features
  • s.feature_aliases.json: aliases of duplicated time-invariant features
  • X.feature_aliases.json: aliases of duplicated time-series features

To load the generated features (the arrays are saved with the sparse package):

import sparse
X = sparse.load_npz('{data_path}/X.npz'.format(data_path=...)).todense()
s = sparse.load_npz('{data_path}/s.npz'.format(data_path=...)).todense()
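Once loaded, the dense arrays can be combined for downstream models. A minimal sketch using random stand-in arrays (the shapes are hypothetical): for non-temporal models, flatten X over time and concatenate it with s to get one feature vector per example:

```python
import numpy as np

# Hypothetical dense arrays standing in for FIDDLE's outputs:
# s has shape (N, d) time-invariant; X has shape (N, L, D) time-dependent.
N, d, L, D = 100, 5, 24, 10
rng = np.random.default_rng(0)
s = rng.integers(0, 2, size=(N, d))
X = rng.integers(0, 2, size=(N, L, D))

# Flatten the time and feature axes of X, then concatenate with s.
features = np.hstack([s, X.reshape(N, L * D)])
print(features.shape)  # (100, 245)
```

Temporal models such as LSTMs would instead consume X directly in its (N, L, D) form.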

Example usage:

python -m FIDDLE.run \
    --data_path='./tests/small_test/' \
    --population='./tests/small_test/pop.csv' \
    --T=24 --dt=5 \
    --theta_1=0.001 --theta_2=0.001 --theta_freq=1 \
    --stats_functions 'min' 'max' 'mean'

Guidelines on argument settings

The user-defined arguments of FIDDLE include: T, dt, theta_1, theta_2, theta_freq, and K statistics functions. The settings of these arguments can affect the features and how they can be used. We provide reasonable default values in the implementation, and here list some practical considerations: (i) prediction time and frequency, (ii) temporal density of data, and (iii) class balance.

(i) The prediction time and frequency determine the appropriate settings for T and dt. The risk stratification tasks we considered all involve a single prediction at the end of a fixed prediction window. It is thus most reasonable to set T to the length of the prediction window. Another possible formulation is to make multiple predictions, where each prediction depends only on data from the past (not the future), using models like LSTMs or fully convolutional networks. In that case, for example, if a prediction needs to be made every 4 hours over a 48-hour period, then T should be 48 hours, whereas dt should be at most 4 hours.
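The arithmetic behind this is simply that the number of time bins is T/dt. Using the illustrative multiple-prediction setup above:

```python
# Number of time bins implied by T and dt (illustrative values).
T, dt = 48, 4       # 48-hour observation window, predictions every 4 hours
num_bins = T // dt
print(num_bins)  # 12 bins, so a prediction can be made at the end of each
```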

(ii) The temporal density of data, that is, how often the variables are usually measured, also affects the setting of dt. This can be assessed by plotting a histogram of recording frequency. In our case, we observed that the maximum hourly frequency is ~1.2 times, which suggests dt should not be smaller than 1 hour. While most variables are recorded on average <0.1 times per hour (i.e., most of the time they are not recorded), the 6 vital signs are recorded slightly more than once per hour. Thus, given that in the ICU vital signs are usually collected once per hour, we set dt=1. This also implies setting θ_freq to 1. Besides determining the value of dt from context (how granularly we want to encode the data), we can also sweep a range of values (given sufficient computational resources and time), informed by the prediction frequency and the temporal density of the data.
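One way to assess temporal density is to compute each variable's average recording frequency and then histogram those frequencies. A sketch using pandas, where the input table and window length are hypothetical:

```python
import pandas as pd

# Hypothetical long-format EHR table [ID, t, variable_name] (values omitted).
df = pd.DataFrame({
    'ID':            [1, 1, 1, 1, 2, 2],
    't':             [0.2, 0.5, 1.1, 1.8, 0.3, 2.4],
    'variable_name': ['HR', 'HR', 'HR', 'HR', 'LACTATE', 'LACTATE'],
})
T = 3.0  # hypothetical observation window length in hours

# Average recordings per hour for each variable, averaged over stays.
counts = df.groupby(['ID', 'variable_name']).size()
freq_per_hour = (counts / T).groupby('variable_name').mean()
print(freq_per_hour)
# A histogram of freq_per_hour across all variables would guide the
# choice of dt: dt should not be much smaller than 1/max frequency.
```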

(iii) We recommend setting θ_1=θ_2=θ and being conservative to avoid removing information that could be potentially useful. For binary classification, the rule of thumb we suggest is to set θ to about 1/100 of the minority class rate. For example, our cohorts consist of ~10% positive cases, so setting θ=0.001 is appropriate, whereas for a cohort with only 1% positive cases, θ=0.0001 is more appropriate. Given sufficient computational resources and time, the value of θ can also be swept and optimized.
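The rule of thumb amounts to a one-line computation (the prevalence value below is illustrative):

```python
# theta ≈ 1/100 of the minority class rate (rule of thumb from the text).
positive_rate = 0.10         # hypothetical cohort with ~10% positive cases
theta = positive_rate / 100  # -> theta_1 = theta_2 = 0.001
print(theta)
```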

Finally, for the summary statistics functions, the defaults include the most basic ones: minimum, maximum, and mean. If, on average, we expect more than one value per time bin, then we can also include higher-order statistics such as standard deviation and linear slope.
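To make concrete what a statistics function computes, here is a pandas sketch (the column names and values are hypothetical) applying min/max/mean to all recordings of one frequent variable within each dt-sized bin:

```python
import pandas as pd

# Hypothetical recordings of one frequent variable for a single ID.
df = pd.DataFrame({
    't':     [0.2, 0.7, 1.1, 1.6, 1.9],   # hours since admission
    'value': [80,  84,  90,  88,  86],
})
dt = 1.0
df['bin'] = (df['t'] // dt).astype(int)    # assign each recording to a bin

# Apply the K statistics functions within each time bin.
summary = df.groupby('bin')['value'].agg(['min', 'max', 'mean'])
print(summary)
# bin 0 holds recordings at t=0.2 and 0.7; bin 1 holds t=1.1, 1.6, 1.9
```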

Experiments

In this repository, we release FIDDLE as standalone software. To demonstrate the flexibility and utility of FIDDLE, we conducted several experiments using data from MIMIC-III and eICU. The code to reproduce the results is located at https://github.com/MLD3/FIDDLE-experiments. The experiments were performed using FIDDLE v0.1.0 and reported in the JAMIA paper; bug fixes and new functionalities have since been implemented and may affect the numerical results.

Publications & Resources

  • Title: Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data.
  • Authors: Shengpu Tang, Parmida Davarmanesh, Yanmeng Song, Danai Koutra, Michael W. Sjoding, and Jenna Wiens.
  • Published in JAMIA (Journal of the American Medical Informatics Association), October 2020: article link
  • Previously presented at MLHC 2019 (Machine Learning for Healthcare) as a clinical abstract
  • News coverage on HealthcareITNews: link
  • Poster | Slides

fiddle's People

Contributors

shengpu-tang


fiddle's Issues

Having trouble processing ICD codes

I am not using MIMIC-III or eICU data; since this pipeline should be applicable to other EHR data sets, I am using it for in-house EHR data. No matter how I preprocess the ICD codes (e.g., ICD9:V50.2 vs V50.2 vs V502), I always encounter the error below:

--------------------------------------------------------------------------------
2-B) Transform time-dependent data
--------------------------------------------------------------------------------
Total variables    : 31734
Traceback (most recent call last):
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\indexes\base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'icd_code:0'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\bo\envs\bd\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "D:\bo\envs\bd\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 141, in <module>
    main()
  File "D:\bo\EOBD_prediction\FIDDLE\run.py", line 138, in main
    X, X_feature_names, X_feature_aliases = FIDDLE_steps.process_time_dependent(df_time_series, args)
  File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 235, in process_time_dependent
    df_time_series, dtypes_time_series = transform_time_series_table(df_data_time_series, args)
  File "D:\bo\EOBD_prediction\FIDDLE\steps.py", line 430, in transform_time_series_table
    variables_num_freq = get_frequent_numeric_variables(df_in, variables, theta_freq, args)
  File "D:\bo\EOBD_prediction\FIDDLE\helpers.py", line 93, in get_frequent_numeric_variables
    numeric_vars = [col for col in variables if df_types[col] == 'Numeric']
  File "D:\bo\EOBD_prediction\FIDDLE\helpers.py", line 93, in <listcomp>
    numeric_vars = [col for col in variables if df_types[col] == 'Numeric']
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\series.py", line 942, in __getitem__
    return self._get_value(key)
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\series.py", line 1051, in _get_value
    loc = self.index.get_loc(label)
  File "D:\bo\envs\bd\lib\site-packages\pandas\core\indexes\base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'icd_code:0'

So my df_types contains only one ICD-related variable name, icd_code, which is correct. However, the parse_variable_data_type process has made a whole new list of variable names starting with icd_code, which is why variables has a long list of "icd_code:*" elements. The whole process is very confusing and vague in its details. Would you please enlighten me on the source of the error? Many thanks.

Generating features for train and test data

To have a reasonable experimental setting, I need to generate features for the training data and keep the feature names. Then the features for the testing data should be generated using these feature names from the training step. Is there any way to do this with FIDDLE? Thanks!

Mapping data between datasets

I'm having an issue mapping data between institutions when one dataset has a subset of the variables in the second dataset. When I run the first dataset through FIDDLE, I get a set of discretization bins that I want to apply to the second dataset. However, since the second dataset has variables that are not available in the first dataset, FIDDLE doesn't have any discretization bins to apply to those variables. Is it possible to have FIDDLE drop any data that it doesn't have discretization bins for?

Pre-Filter ID Issue

Hello,

I am trying to replicate the results from the paper and I am stuck at getting the features for the time-variant variables. I run the following and I get a KeyError: ID in the Pre-Filter step. Could you please help in figuring out what I am doing wrong?
I have run all the steps from the mimic3_experiments to generate the required files.
Thank you!

Error when not discretizing MIMIC-III time-series data - TypeError: bad operand type for unary ~: 'float'

I am running FIDDLE on data extracted from MIMIC-III using the pipeline outlined in FIDDLE-experiments. I have my population of ICU stays and am running FIDDLE with these parameters:

--T=240.0
--dt=1.0
--theta_1=0.003
--theta_2=0.003
--theta_freq=1
--stats_functions 'mean'

and other default ones found in run_make_all.sh.

I get the following error:

Traceback (most recent call last):  
  File "/home/hodgman/miniconda3/envs/FIDDLE-env/lib/python3.7/runpy.py", line 193, in _run_module_as_main  
    "__main__", mod_spec)  
  File "/home/hodgman/miniconda3/envs/FIDDLE-env/lib/python3.7/runpy.py", line 85, in _run_code  
    exec(code, run_globals)  
  File "/home/hodgman/FIDDLE-experiments/FIDDLE/FIDDLE/run.py", line 141, in <module>  
    main()  
  File "/home/hodgman/FIDDLE-experiments/FIDDLE/FIDDLE/run.py", line 138, in main  
    X, X_feature_names, X_feature_aliases = FIDDLE_steps.process_time_dependent(df_time_series, args)  
  File "/home/hodgman/FIDDLE-experiments/FIDDLE/FIDDLE/steps.py", line 244, in process_time_dependent  
    X_all, X_all_feature_names, X_discretization_bins = map_time_series_features(df_time_series, dtypes_time_series, args)  
  File "/home/hodgman/FIDDLE-experiments/FIDDLE/FIDDLE/steps.py", line 604, in map_time_series_features  
    df.loc[~numeric_mask, col] = np.nan  
  File "/home/hodgman/miniconda3/envs/FIDDLE-env/lib/python3.7/site-packages/pandas/core/generic.py", line 1532, in __invert__  
    new_data = self._mgr.apply(operator.invert)  
  File "/home/hodgman/miniconda3/envs/FIDDLE-env/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 325, in apply  
    applied = b.apply(f, **kwargs)  
  File "/home/hodgman/miniconda3/envs/FIDDLE-env/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 381, in apply  
    result = func(self.values, **kwargs)  
TypeError: bad operand type for unary ~: 'float'

Do you know what could be causing this error? I was able to determine that it first occurs in the column 225958, and numeric_mask contains at least one NaN value, which must mean column 225958 contains None values; however, in my input_data.p file there are no None or NaN variable_values for variable_name == '225958'.
