emmarocheteau / tpc-los-prediction

This repository contains the code used for Temporal Pointwise Convolutional Networks for Length of Stay Prediction in the Intensive Care Unit (https://dl.acm.org/doi/10.1145/3450439.3451860).

Home Page: https://dl.acm.org/doi/10.1145/3450439.3451860

License: MIT License

PLpgSQL 1.81% Python 98.19%

patient-outcomes mortality-prediction length-of-stay convolutional-neural-networks

tpc-los-prediction's Issues

kappa for los

Hi, when I calculate kappa for LoS using sklearn:
from sklearn.metrics import cohen_kappa_score
kappa = cohen_kappa_score(y_true, y_pred)

it raises ValueError: continuous is not supported. It seems kappa is not suitable for regression, only for classification?
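
cohen_kappa_score treats its inputs as class labels, which is why continuous LoS values are rejected. One common workaround is to bin the true and predicted LoS into ordinal buckets first and compute a weighted kappa on the bucket indices. The sketch below does that; the bucket edges (in days) and the helper name los_kappa are assumptions chosen for illustration, not necessarily what the paper or this repository uses.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Assumed bucket edges in days: <1, 1-2, ..., 7-8, 8-14, 14+ (illustrative only)
BIN_EDGES = [1, 2, 3, 4, 5, 6, 7, 8, 14]


def los_kappa(y_true_days, y_pred_days, edges=BIN_EDGES):
    """Bin continuous LoS values, then compute linearly weighted Cohen's kappa."""
    true_bins = np.digitize(y_true_days, edges)  # integer bucket index per stay
    pred_bins = np.digitize(y_pred_days, edges)
    # weights="linear" penalises predictions more the further they are from the true bucket
    return cohen_kappa_score(true_bins, pred_bins, weights="linear")


y_true = np.array([0.5, 1.5, 9.0, 20.0])
y_pred = np.array([0.7, 1.2, 10.5, 16.0])
kappa = los_kappa(y_true, y_pred)
```

Because the score depends on the bucket edges, kappa values are only comparable between models evaluated with the same binning.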

Computation Power needed

Hi @EmmaRocheteau. I wanted to know how much computation power is needed to run inference with the models. Would a standard work laptop suffice?

Thanks

some conceptual questions about temp_pointwise

Hi Emma,

I have some conceptual questions about the temp_pointwise implementation. I have marked 3 steps in the following source code. The comments are my understanding, and the 4 lines of code below them are extracted from your source code.

def temp_pointwise(...):
  ...
  # temp_skip(batch_size, ts_feature_value_dim, ts_feature_conv_dim+1, n_measure_of_patient)
  # temp_skip is combination of temporal convolution and skip connection. Each ts_feature_conv_dim(12 values in 1 layer) values are 
  # concatenated with a feature value from skip connection.
  # step 1
  temp_skip = cat((point_skip.unsqueeze(2),  # B * (F + Zt) * 1 * T
                         X_temp.view(B, point_skip.shape[1], temp_kernels, T)),  # B * (F + Zt) * temp_kernels * T
                        dim=2)  # B * (F + Zt) * (1 + temp_kernels) * T

  # point_output(batch_size * n_measure_of_patient, point_size)
  #   -> view(batch_size, n_measure_of_patient, point_size, 1)
  #   -> permute(batch_size, point_size, 1, n_measure_of_patient)
  #   -> X_point_rep(batch_size, point_size, ts_feature_pattern_dim+1, n_measure_of_patient)
  # X_point_rep contains representation of each measure in low-dimensional space
  # step 2
  X_point_rep = point_output.view(B, T, point_size, 1).permute(0, 2, 3, 1).repeat(1, 1, (1 + temp_kernels), 1)  # B * point_size * (1 + temp_kernels) * T
  
  # X_combined(batch_size, ts_feature_value_dim + point_size, ts_feature_conv_dim+1, n_measure_of_patient)
  # temp_skip and X_point_rep are concatenated along ts_feature_value_dim axis.
  # step 3
  X_combined = self.relu(cat((temp_skip, X_point_rep), dim=1))  # B * (F + Zt + point_size) * (1 + temp_kernels) * T
  next_X = X_combined.contiguous().view(B, (point_skip.shape[1] + point_size) * (1 + temp_kernels), T)  # B * ((F + Zt + point_size) * (1 + temp_kernels)) * T
  ...

At step 3 (X_combined), my understanding is that temp_skip and X_point_rep are concatenated along ts_feature_value_dim because X_point_rep holds a representation at the ts_feature_value_dim level. If so, why not do the following instead:

X_combined = self.relu(cat(
      (temp_skip.view(B, point_skip.shape[1] * (temp_kernels+1), T),
      point_output.view(B, T, point_size).permute(0, 2, 1)  # B * point_size * T
      ),
  dim=1
 ))

That is, flatten temp_skip so that it can be concatenated with point_output along the ts_feature_value_dim axis.

I actually have difficulty understanding the reason for repeating each point_size value (1 + temp_kernels) times at step 2 (X_point_rep). The only reason I can think of is to match the dimensions of temp_skip. But with this repetition, next_X will contain each value repeated (1 + temp_kernels) times along dim=1. Doesn't that add no information for the network?

Asking about source code in text is a bit difficult, so I am not sure I have stated my question clearly.

Thanks in advance for your time and help,
Cheng
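
A shape check can make the comparison concrete. The sketch below uses numpy as a stand-in for the PyTorch calls (np.tile for repeat, transpose for permute), with small made-up sizes, and omits the ReLU. It shows that the repeated point-wise outputs are identical copies, so the repetition adds no new information by itself. What it changes is the layout: every feature, including each point-wise one, ends up with exactly (1 + temp_kernels) channels. That would matter if the next temporal convolution is grouped per feature, which is an assumption here and not something confirmed in this thread.

```python
import numpy as np

# Small illustrative sizes; n_features stands in for (F + Zt)
B, T = 2, 5
n_features = 3
temp_kernels = 4
point_size = 2
k = 1 + temp_kernels

rng = np.random.default_rng(0)
temp_skip = rng.normal(size=(B, n_features, k, T))    # output of step 1
point_output = rng.normal(size=(B * T, point_size))   # pointwise branch output

# Step 2: repeat each pointwise value k times along the kernel axis
X_point_rep = np.tile(
    point_output.reshape(B, T, point_size, 1).transpose(0, 2, 3, 1),
    (1, 1, k, 1),
)  # B * point_size * k * T

# Step 3 as in the quoted code (without ReLU): concatenate on the feature axis, then flatten
X_combined = np.concatenate((temp_skip, X_point_rep), axis=1)
next_X = X_combined.reshape(B, (n_features + point_size) * k, T)

# Proposed alternative: flatten temp_skip and append the pointwise output once
flat_alt = np.concatenate(
    (temp_skip.reshape(B, n_features * k, T),
     point_output.reshape(B, T, point_size).transpose(0, 2, 1)),
    axis=1,
)

# The k copies are identical, so no new information is introduced by repeating
copies_identical = np.allclose(X_point_rep[:, :, 0, :], X_point_rep[:, :, -1, :])
# Only the repeated layout splits evenly into groups of k channels per feature
splits_evenly = (next_X.shape[1] % k == 0, flat_alt.shape[1] % k == 0)
```

With these sizes next_X has (3 + 2) * 5 = 25 channels, while the alternative has 3 * 5 + 2 = 17, which cannot be divided into per-feature groups of 5.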

File "pandas\_libs\parsers.pyx", line 545, in pandas._libs.parsers.TextReader.cinit pandas.errors.EmptyDataError: No columns to parse from file

I got an empty data error when running -m MIMIC_preprocessing.run_all_preprocessing

Is it ok that my flat_features.csv file is empty...?
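
pandas raises EmptyDataError ("No columns to parse from file") when a CSV has no content at all, so an empty flat_features.csv most likely means an earlier step (for example the database export) produced no output, rather than being an expected state. A small standard-library sketch for checking the intermediate files before rerunning; the helper name csv_status is hypothetical and the temporary file only stands in for the real one.

```python
import csv
import os
import tempfile


def csv_status(path):
    """Classify a CSV file as 'missing', 'empty', 'header only' or 'ok'."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"  # this is the case that triggers EmptyDataError in pandas
    with open(path, newline="") as handle:
        n_rows = sum(1 for _ in csv.reader(handle))
    return "header only" if n_rows <= 1 else "ok"


# Usage on a throwaway file standing in for flat_features.csv
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "flat_features.csv")
status_missing = csv_status(path)
open(path, "w").close()
status_empty = csv_status(path)
with open(path, "w", newline="") as handle:
    handle.write("patient,age\n1,63\n")
status_ok = csv_status(path)
```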

n_epochs does not work as an argument.

I set n_epochs to 5 for a model, but it still runs 14 epochs as usual.

Issues preprocessing MIMIC-IV using BigQuery

Hi, I'm having trouble preprocessing MIMIC-IV with BigQuery. I'm using their query translation tool, but I'm getting errors. I'm trying to translate the following query to BigQuery:

create table ld_commonlabs as
  -- extracting the itemids for all the labevents that occur within the time bounds for our cohort
  with labsstay as (
    select l.itemid, la.stay_id
    from labevents as l
    inner join ld_labels as la
      on la.hadm_id = l.hadm_id
    where l.valuenum is not null  -- stick to the numerical data
      -- epoch extracts the number of seconds since 1970-01-01 00:00:00-00, we want to extract measurements between
      -- admission and the end of the patients' stay
      and (date_part('epoch', l.charttime) - date_part('epoch', la.intime))/(60*60*24) between -1 and la.los),
  -- getting the average number of times each itemid appears in an icustay (filtering only those that are more than 2)
  avg_obs_per_stay as (
    select itemid, avg(count) as avg_obs
    from (select itemid, count(*) from labsstay group by itemid, stay_id) as obs_per_stay
    group by itemid
    having avg(count) > 3)  -- we want the features to have at least 3 values entered for the average patient
  select d.label, count(distinct labsstay.stay_id) as count, a.avg_obs
    from labsstay
    inner join d_labitems as d
      on d.itemid = labsstay.itemid
    inner join avg_obs_per_stay as a
      on a.itemid = labsstay.itemid
    group by d.label, a.avg_obs
    -- only keep data that is present at some point for at least 25% of the patients, this gives us 45 lab features
    having count(distinct labsstay.stay_id) > (select count(distinct stay_id) from ld_labels)*0.25
    order by count desc;

My resulting BigQuery SQL is:

CREATE TABLE mimic_iv.ld_commonlabs
  AS
    WITH labsstay AS (
      SELECT
          --  extracting the itemids for all the labevents that occur within the time bounds for our cohort
          l.itemid,
          la.stay_id
        FROM
          physionet-data.mimiciv_hosp.labevents AS l
          INNER JOIN mimic_iv.ld_labels AS la ON la.hadm_id = l.hadm_id
        WHERE l.valuenum IS NOT NULL
         AND (UNIX_SECONDS(CAST(CAST(l.charttime as DATE) AS TIMESTAMP)) - CAST(UNIX_SECONDS(CAST(CAST(la.intime as DATE) AS TIMESTAMP)) as FLOAT64)) / (60 * 60 * 24) BETWEEN -1 AND la.los
    ), avg_obs_per_stay AS (
      SELECT
          --  stick to the numerical data
          --  epoch extracts the number of seconds since 1970-01-01 00:00:00-00, we want to extract measurements between
          --  admission and the end of the patients' stay
          --  getting the average number of times each itemid appears in an icustay (filtering only those that are more than 2)
          obs_per_stay.itemid,
          avg(CAST(obs_per_stay.count as BIGNUMERIC)) AS avg_obs
        FROM
          (
            SELECT
                labsstay.itemid,
                count(*) AS count
              FROM
                labsstay
              GROUP BY 1, labsstay.stay_id
          ) AS obs_per_stay
        GROUP BY 1
        HAVING avg(CAST(obs_per_stay.count as BIGNUMERIC)) > 3
    )
    SELECT
        --  we want the features to have at least 3 values entered for the average patient
        d.label,
        count(DISTINCT labsstay.stay_id) AS count,
        a.avg_obs
      FROM
        labsstay
        INNER JOIN physionet-data.mimiciv_hosp.d_labitems AS d ON d.itemid = labsstay.itemid
        INNER JOIN avg_obs_per_stay AS a ON a.itemid = labsstay.itemid
      GROUP BY 1, 3
      HAVING count(DISTINCT labsstay.stay_id) > (
        SELECT
            --  only keep data that is present at some point for at least 25% of the patients, this gives us 45 lab features
            count(DISTINCT labsstay.stay_id) AS count
          FROM
            mimic_iv.ld_labels
      ) * NUMERIC '0.25'

However, this produces the following error:

An expression references labsstay.stay_id which is neither grouped nor aggregated at [46:28]

I'm not very good with SQL, and I had other issues setting up the PostgreSQL database locally. Could you explain what this query is doing and how best to translate it to Google BigQuery syntax? I would like to generate the CSV files.

Thanks.
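
On what the query does: it selects the lab tests worth keeping as features. It first collects every numeric lab measurement taken between one day before ICU admission and the end of the stay, then keeps a lab item only if it has more than 3 measurements per stay on average and appears in more than 25% of the stays in ld_labels. The reported position [46:28] falls on the subquery inside HAVING, where the translated query counts labsstay.stay_id while selecting from mimic_iv.ld_labels. The original PostgreSQL counts the unqualified stay_id of ld_labels at that point, so dropping the labsstay. prefix on that line is the likely fix (untested here). Note also that the translation casts charttime and intime to DATE before taking UNIX_SECONDS, which discards the time of day, unlike the original epoch arithmetic.

The selection logic can be sketched in plain Python. This is a simplified illustration with hypothetical names: it groups by itemid rather than by label, and assumes the rows have already been filtered to the time window.

```python
from collections import Counter, defaultdict


def common_labs(lab_rows, n_cohort_stays, min_avg_obs=3, min_stay_fraction=0.25):
    """lab_rows: iterable of (itemid, stay_id) pairs, one per numeric measurement.

    Returns {itemid: (number_of_stays, average_observations_per_stay)} for the
    items that pass both thresholds, mirroring the two HAVING clauses.
    """
    # Inner subquery: count(*) grouped by (itemid, stay_id)
    obs_per_stay = Counter(lab_rows)

    counts_by_item = defaultdict(list)
    for (itemid, _stay_id), n_obs in obs_per_stay.items():
        counts_by_item[itemid].append(n_obs)

    kept = {}
    for itemid, counts in counts_by_item.items():
        avg_obs = sum(counts) / len(counts)   # avg(count) per itemid
        n_stays = len(counts)                 # count(distinct stay_id)
        if avg_obs > min_avg_obs and n_stays > n_cohort_stays * min_stay_fraction:
            kept[itemid] = (n_stays, avg_obs)
    return kept


# Item 1 is measured 4 times in each of 2 stays; item 2 once in a single stay
rows = [(1, "a")] * 4 + [(1, "b")] * 4 + [(2, "a")]
selected = common_labs(rows, n_cohort_stays=4)
```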

AttributeError: 'Config' object has no attribute 'model_type'

There was an error in \models\run_tpc.py, line 14.

But I cannot see any 'model_type' attribute in run_tpc.py or initialise_arguments.py.

Should I set the argument 'model_type' in config=c?

Some questions about performance

Hi Emma,

For the figures in the performance tables (Table 2, for example), are the scores calculated on the test set (which is what I assume) or on the validation set?

Another question, about the transformer model in transformer_model.py.

class TransformerEncoder(nn.Module):
  ...
  def forward(self, X, T):
    ...
    # question about this line.
    X = self.transformer_encoder(src=X.permute(2, 0, 1), mask=self._causal_mask(size=T))  # T * B * d_model
    ....

Is _causal_mask telling the transformer to mask padded data?

Thanks,
Cheng
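
On the _causal_mask question: a causal mask in the usual sense does not target padding. It stops position t from attending to any position after t, so a prediction at a given hour cannot use later measurements. Padding is normally handled separately, for example with a key padding mask or by masking the loss. A minimal numpy sketch, assuming the additive convention accepted by torch.nn.TransformerEncoder (0 means allowed, -inf means blocked); the function below is illustrative and is not copied from transformer_model.py.

```python
import numpy as np


def causal_mask(size):
    """Square additive attention mask: row t may attend to columns 0..t only."""
    mask = np.zeros((size, size))
    # Entries strictly above the diagonal correspond to future positions
    mask[np.triu_indices(size, k=1)] = -np.inf
    return mask


mask = causal_mask(4)
# Row 0 sees only itself; row 3 sees all four positions
allowed_per_row = (mask == 0).sum(axis=1).tolist()
```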

configuration of GPU machine for training?

Hi Emma,

Thanks for sharing the detailed code implementation. I am studying your paper, which looks very interesting. May I ask what GPU machine configuration you used for training, and roughly how long it took to train the best TPC model on the eICU dataset? I am trying to train the model on an AWS ml.p3.2xlarge instance (NVIDIA V100 with 16GB of GPU memory) with the eICU dataset. GPU utilization looks pretty low when I inspect it with 'nvidia-smi' (I set batch_size to 64, which occupies about 11GB of GPU memory). The GPU usage percentage fluctuates a lot, often dropping back to 0%, and most of the time it stays below 80%.

Thanks, Cheng

Question regarding masked datafields in timeseries.csv processed file

Hello Emma,

I ran the preprocessing scripts on the original eICU dataset and noticed that some data fields in the timeseries.csv file have a "_mask" suffix, e.g. "temperature_mask" and "total protein_mask".
Could you please help me understand the reason for creating masked data fields in the processed timeseries.csv file?

Best,
Kinara Pandya

raise DataError('No numeric types to aggregate') pandas.core.base.DataError: No numeric types to aggregate

I am getting this error:
raise DataError('No numeric types to aggregate') pandas.core.base.DataError: No numeric types to aggregate

I am using the pandas version from requirements.txt (0.24.2) and Python 3.6.8.

Preprocessing MIMIC-IV issue

When running the command
\copy D_HCPCS FROM 'd_hcpcs.csv' DELIMITER ',' CSV HEADER NULL ''

in PostgreSQL, it reports:

ERROR: 0xe2 0x80 byte combined character (encoding: "UHC") has no corresponding character code in "UTF8" encoding Syntax: COPY d_hcpcs, line 88856

Channel mismatch in model_type 'tpc'

Command
python -m models.run_tpc --model_type tpc --mode test --n_epochs 5

Error
Traceback (most recent call last):
File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "D:\git\cs598_replication_tpc_los_prediction\models\run_tpc.py", line 39, in <module>
run_tpc()
File "D:\git\cs598_replication_tpc_los_prediction\models\run_tpc.py", line 34, in run_tpc
tpc.run()
File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\site-packages\trixi\experiment\experiment.py", line 108, in run
self.process_err(e)
File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\site-packages\trixi\experiment\pytorchexperiment.py", line 391, in process_err
raise e
File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\site-packages\trixi\experiment\experiment.py", line 90, in run
self.validate(epoch=self._epoch_idx)
File "D:\git\cs598_replication_tpc_los_prediction\models\experiment_template.py", line 221, in validate
self.test()
File "D:\git\cs598_replication_tpc_los_prediction\models\experiment_template.py", line 254, in test
y_hat_los, y_hat_mort = self.model(padded, diagnoses, flat)
File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "D:\git\cs598_replication_tpc_los_prediction\models\tpc_model.py", line 618, in forward
diagnoses_enc = self.relu(self.main_dropout(self.bn_diagnosis_encoder(self.diagnosis_encoder(diagnoses)))) # B * diagnosis_size
File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "D:\git\cs598_replication_tpc_los_prediction\models\tpc_model.py", line 78, in forward
training=True, momentum=exponential_average_factor, eps=self.eps) # set training to True so it calculates the norm of the batch
File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\functional.py", line 2279, in batch_norm
_verify_batch_size(input.size())
File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\functional.py", line 2247, in _verify_batch_size
raise ValueError("Expected more than 1 value per channel when training, got input size {}".format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 64])

I will share the fix here when I solve this issue. Please let me know if anybody knows the solution to this error.
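
The final ValueError is raised by PyTorch batch normalisation when it is asked to compute batch statistics from a single sample, which is what the input size torch.Size([1, 64]) indicates. The comment visible in the traceback (tpc_model.py, line 78) suggests this layer uses batch statistics even at test time, so a last batch containing exactly one stay would be enough to trigger it. That is a reading of the traceback, not a confirmed diagnosis. A small pure-Python check, with hypothetical helper names, for whether a given dataset size and batch size leave a one-sample batch:

```python
def batch_sizes(n_samples, batch_size, drop_last=False):
    """Sizes of the batches produced by splitting n_samples sequentially."""
    n_full, remainder = divmod(n_samples, batch_size)
    sizes = [batch_size] * n_full
    if remainder and not drop_last:
        sizes.append(remainder)  # the final, incomplete batch
    return sizes


def has_singleton_batch(n_samples, batch_size, drop_last=False):
    """True if any batch would contain exactly one sample."""
    return 1 in batch_sizes(n_samples, batch_size, drop_last)


# 65 samples with batch size 32 leave a final batch of one sample
example_sizes = batch_sizes(65, 32)
example_flag = has_singleton_batch(65, 32)
```

If the check returns True for the test set, choosing a different batch size or skipping the incomplete final batch should avoid the one-sample case.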

Preprocessing eICU issue

I got the following error while running the preprocessing scripts with python3 -m eICU_preprocessing.run_all_preprocessing:

File "/anaconda3/envs/TPC_Proj/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/anaconda3/envs/TPC_Proj/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/eICU/TPC-LoS-prediction/eICU_preprocessing/run_all_preprocessing.py", line 18, in <module>
timeseries_main(eICU_path, test=False)
File "/eICU/TPC-LoS-prediction/eICU_preprocessing/timeseries.py", line 228, in timeseries_main
gen_timeseries_file(eICU_path, test)
File "/eICU/TPC-LoS-prediction/eICU_preprocessing/timeseries.py", line 166, in gen_timeseries_file
merged = timeseries_lab.loc[patient_chunk].append(timeseries_resp.loc[patient_chunk], sort=False)
File "/anaconda3/envs/TPC_Proj/lib/python3.8/site-packages/pandas/core/indexing.py", line 967, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/anaconda3/envs/TPC_Proj/lib/python3.8/site-packages/pandas/core/indexing.py", line 1194, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
File "/anaconda3/envs/TPC_Proj/lib/python3.8/site-packages/pandas/core/indexing.py", line 1132, in _getitem_iterable
keyarr, indexer = self._get_listlike_indexer(key, axis)
File "/anaconda3/envs/TPC_Proj/lib/python3.8/site-packages/pandas/core/indexing.py", line 1330, in _get_listlike_indexer
keyarr, indexer = ax._get_indexer_strict(key, axis_name)
File "/anaconda3/envs/TPC_Proj/lib/python3.8/site-packages/pandas/core/indexes/multi.py", line 2587, in _get_indexer_strict
self._raise_if_missing(key, indexer, axis_name)
File "/anaconda3/envs/TPC_Proj/lib/python3.8/site-packages/pandas/core/indexes/multi.py", line 2605, in _raise_if_missing
raise KeyError(f"{keyarr[cmask]} not in index")
KeyError: '[141939 142056 142476 142521 142560 146391 147447 149039 149606 153006\n 160529 162431 166572 166709 167391 167417 171174 175528 177651 178069\n 178858 179142 179554] not in index'

performance

Hi Emma, great work!

I am wondering why your R2 performance is much lower than those reported in this paper: https://github.com/mostafaalishahi/eICU_Benchmark

BTW, what is the current (reliable) SOTA performance?

Python Version

May I ask which Python version your environment uses? Thanks!

KeyError: '[141939 142056 142476 142521 142560 146391 147447 149039 149606 153006\n 160529 162431 166572 166709 167391 167417 171174 175528 177651 178069\n 178858 179142 179554] not in index'

Hi,
I am getting the following error while running the command:
python -m eICU_preprocessing.run_all_preprocessing

/opt/conda/lib/python3.6/runpy.py:85: DtypeWarning: Columns (3) have mixed types.Specify dtype option on import or set low_memory=False.
exec(code, run_globals)
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Code/TPC-LoS-prediction-master/eICU_preprocessing/run_all_preprocessing.py", line 19, in <module>
timeseries_main(eICU_path, test=False)
File "/Code/TPC-LoS-prediction-master/eICU_preprocessing/timeseries.py", line 228, in timeseries_main
gen_timeseries_file(eICU_path, test)
File "/Code/TPC-LoS-prediction-master/eICU_preprocessing/timeseries.py", line 166, in gen_timeseries_file
merged = timeseries_lab.loc[patient_chunk].append(timeseries_resp.loc[patient_chunk], sort=False)
File "/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py", line 879, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py", line 1099, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
File "/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py", line 1037, in _getitem_iterable
keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
File "/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py", line 1240, in _get_listlike_indexer
indexer, keyarr = ax._convert_listlike_indexer(key)
File "/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 2397, in _convert_listlike_indexer
raise KeyError(f"{keyarr[mask]} not in index")
KeyError: '[141939 142056 142476 142521 142560 146391 147447 149039 149606 153006\n 160529 162431 166572 166709 167391 167417 171174 175528 177651 178069\n 178858 179142 179554] not in index'
==> Removing the stays.txt file if it exists...

==> Removing the preprocessed_timeseries.csv file if it exists...
==> Loading data from timeseries files...
==> Reconfiguring lab timeseries...
==> Reconfiguring respiratory timeseries...
==> Reconfiguring nurse timeseries...
==> Reconfiguring aperiodic timeseries...
==> Reconfiguring periodic timeseries...
==> Starting main processing loop...

Any idea how to fix this?
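
This KeyError is what recent pandas versions raise when .loc is given a list containing labels that are not in the index; here, some patient ids in patient_chunk have no rows in one of the timeseries tables. Older pandas releases, such as the 0.24.2 pinned in requirements.txt (mentioned in another issue on this page), were more permissive, so matching that version is one option. Another is to select only the ids that are actually present. The sketch below shows that idea on toy data; the helper name select_present and the column names are hypothetical, and pd.concat is used in place of DataFrame.append.

```python
import pandas as pd


def select_present(frame, patient_chunk):
    """Rows whose first index level is in patient_chunk; absent ids are ignored."""
    first_level = frame.index.get_level_values(0)
    return frame[first_level.isin(patient_chunk)]


# Toy stand-ins for timeseries_lab and timeseries_resp, indexed by (patient, time)
lab_index = pd.MultiIndex.from_tuples([(1, 0), (1, 1), (3, 0)], names=["patient", "time"])
labs = pd.DataFrame({"value": [0.1, 0.2, 0.3]}, index=lab_index)
resp_index = pd.MultiIndex.from_tuples([(2, 0)], names=["patient", "time"])
resp = pd.DataFrame({"value": [9.0]}, index=resp_index)

chunk = [1, 2]  # patient 2 has no lab rows, patient 1 has no respiratory rows
merged = pd.concat([select_present(labs, chunk), select_present(resp, chunk)], sort=False)
```

Unlike .loc with a list, the boolean mask keeps rows in their original order rather than in the order of the ids in the chunk, which may or may not matter for the later steps.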
