Spatial-LDA

Spatial-LDA is a probabilistic topic model for identifying characteristic cellular microenvironments from in-situ multiplexed imaging data such as MIBI-ToF or CODEX.

This repository contains an implementation of the Spatial-LDA model as described in the paper Modeling Multiplexed Images with Spatial-LDA Reveals Novel Tissue Microenvironments.

Please cite our work if you find this tool useful.

Modeling Multiplexed Images with Spatial-LDA Reveals Novel Tissue Microenvironments

Zhenghao Chen, Ilya Soifer, Hugo Hilton, Leeat Keren, and Vladimir Jojic

Journal of Computational Biology 2020.04.03; doi: http://doi.org/10.1089/cmb.2019.0340

BibTeX

@article{chen2020modeling,
  title={Modeling Multiplexed Images with Spatial-LDA Reveals Novel Tissue Microenvironments},
  author={Chen, Zhenghao and Soifer, Ilya and Hilton, Hugo and Keren, Leeat and Jojic, Vladimir},
  journal={Journal of Computational Biology},
  year={2020},
  publisher={Mary Ann Liebert, Inc., publishers 140 Huguenot Street, 3rd Floor New~…}
}

The repository also contains notebooks that generate the results and figures presented in the paper as examples of how to use Spatial-LDA.

Installation

The easiest and preferred way to install the Spatial-LDA package is via pip:

pip install spatial_lda

Alternatively, you can clone this repository and run setup.py directly (assuming you have setuptools installed).

python setup.py install

Examples

Please refer to the included notebooks below for examples of how to train a Spatial-LDA model. We include two notebooks:

(1) Applying Spatial-LDA to a CODEX dataset of mouse spleen tissues

We apply Spatial-LDA to a dataset of mouse spleens from Deep Profiling of Mouse Splenic Architecture with CODEX Multiplexed Imaging to validate that it recovers known spatial relationships between immune cells in the mouse spleen.

Mouse Spleen Analysis

(2) Applying Spatial-LDA to a MIBI-ToF dataset of Triple Negative Breast Cancer (TNBC) tumors

We apply Spatial-LDA to a dataset of TNBC tumors from A Structured Tumor-Immune Microenvironment in Triple Negative Breast Cancer Revealed by Multiplexed Ion Beam Imaging to identify prototypical tumor-immune microenvironments in TNBC.

TNBC Analysis

For convenience, we have included pre-processed versions of the two datasets above under 'data/' and pretrained models (the output of these notebooks) under 'models/'.

Please note that in order to download the data and model files you will need to install and enable Git Large File Storage (LFS) before cloning this repository.
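
For example, with the Git LFS client installed (the clone URL is a placeholder):

git lfs install
git clone <repository-url>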

Usage

Featurization

The Spatial-LDA model requires a dataset of index cells and neighborhood features, along with an undirected graph whose nodes are index cells and whose edges connect index cells that should be regularized to have similar topic priors.

We provide utilities in the featurization module to generate required neighborhood features (featurization.featurize_samples) and adjacency matrices (featurization.make_merged_difference_matrices) from dataframes containing the location and features of index and background cells.
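
A minimal sketch of this step, assuming per-sample dataframes keyed by sample id with 'x'/'y' coordinate columns and a boolean 'is_index' column (the column names, radius, and neighborhood_to_marker featurizer follow the example notebooks; treat them as assumptions for your own data):

import functools
from spatial_lda.featurization import (featurize_samples,
                                       make_merged_difference_matrices,
                                       neighborhood_to_marker)

# sample_dfs = {sample_id: df, ...}; each row of df is one cell.
# markers: list of marker-channel column names in your dataframes.
# Summarize each index cell's neighborhood by the markers of nearby cells.
neighborhood_feature_fn = functools.partial(neighborhood_to_marker, markers=markers)

sample_features = featurize_samples(sample_dfs,
                                    neighborhood_feature_fn,
                                    100,           # neighborhood radius, in the units of x/y
                                    'is_index',    # boolean column marking index cells
                                    'x', 'y',
                                    n_processes=4)

# Difference matrices linking neighboring index cells within each sample.
difference_matrices = make_merged_difference_matrices(sample_features, sample_dfs, 'x', 'y')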

Training and inference

To fit a Spatial-LDA model, call spatial_lda.model.train on the feature matrix and difference matrix generated in the featurization step. E.g.,

spatial_lda_model = spatial_lda.model.train(train_tumor_marker_features,
                                            train_difference_matrices,
                                            n_topics=N_TOPICS,                      # number of microenvironment topics
                                            difference_penalty=DIFFERENCE_PENALTY,  # strength of the spatial regularization
                                            verbosity=1,
                                            n_parallel_processes=3,
                                            n_iters=3,                              # outer training iterations
                                            admm_rho=0.1,                           # ADMM penalty parameter
                                            primal_dual_mu=2)                       # primal-dual solver parameter

To run inference (computing regularized topic weights for a pre-trained set of topics):

complete_lda = spatial_lda.model.infer(
      spatial_lda_model.components_, tumor_marker_features, 
      complete_difference_matrices, difference_penalty=DIFFERENCE_PENALTY,
      n_parallel_processes=N_PARALLEL_PROCESSES)

For reference, we also include the primal-dual-based implementation described in an earlier version of our paper. However, the ADMM-based solver should be preferred, as it should be significantly faster.

spatial_lda's People

Contributors: bcollica, magdalenamat, vjojic, z-chen

spatial_lda's Issues

Repeat call in line_search()

I think there is an extra call to compute_r() in the line_search() function on line 137 of admm.py. It seems like build_linear_system() returns a value for r on line 128, and then compute_r() is called again on line 137 using the same unchanged arguments.

This is the call on line 128:

M, r = build_linear_system(gamma, u, C, e, rho, s, t)

This is the definition of build_linear_system(), which already calls compute_r():

def build_linear_system(gamma, u, C, e, rho, s, t):
    """Build the linear system for ADMM.primal_dual (see appendix section 5.2.7)."""
    n, k = e.shape
    H = hessian_f0(gamma, e, rho, s)
    uC = scipy.sparse.diags(np.squeeze(u), 0).dot(C)
    Cg = scipy.sparse.diags(np.squeeze(C.dot(gamma)))
    M = scipy.sparse.vstack((scipy.sparse.hstack((H, C.T)),
                             scipy.sparse.hstack((-uC, -Cg)))).tocsr()
    r = compute_r(gamma, u, C, e, rho, s, t)
    return M, r

And then compute_r() is called again with the same variables on line 137:

r = compute_r(gamma, u, C, e, rho, s, t)

I'm not sure whether this was intentional, and I haven't had a chance to profile compute_r() to see how expensive it is. But if the call is redundant, it is potentially executed up to 1,500 more times than necessary per topic (15 iterations of update_xis inside admm(), each with up to 100 iterations of line_search() inside primal_dual()).
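
If the arguments really are unchanged between lines 128 and 137, a minimal sketch of the fix is simply to keep using the r that build_linear_system() already returned (hypothetical, shown out of context of the full line_search() body):

M, r = build_linear_system(gamma, u, C, e, rho, s, t)
# ... solve the linear system and take the step ...
# r = compute_r(gamma, u, C, e, rho, s, t)  # redundant second call: same arguments,
#                                           # so reuse the r returned above instead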

how to run the spatial-LDA on all cells

I'd like to run Spatial-LDA on all the cells in my dataset. I tried to set all cells as index cells by setting the is_anchor_col argument of featurize_samples to a column where all values are True, but that produces an error.

if os.path.exists(PATH_TO_TUMOR_MARKER_FEATURES_PKL):
    with open(PATH_TO_TUMOR_MARKER_FEATURES_PKL, 'rb') as f:
        tumor_marker_features = pickle.load(f)
else:
    neighborhood_feature_fn = functools.partial(neighborhood_to_marker,
                                                markers=markers)
    tumor_marker_features = featurize_samples(patient_dfs, neighborhood_feature_fn,
                                              100, 'is_index', 'x', 'y',
                                              n_processes=N_PARALLEL_PROCESSES)
    with open(PATH_TO_TUMOR_MARKER_FEATURES_PKL, 'wb') as f:
        pickle.dump(tumor_marker_features, f)

ERROR:

RemoteTraceback:
"""
Traceback (most recent call last):
  File ".../multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File ".../spatial_lda/featurization.py", line 51, in _featurize_sample
    is_anchor_col, x_col, y_col, z_col=z_col)
  File ".../spatial_lda/featurization.py", line 38, in _featurize_cells
    neighborhood_kdTree = KDTree(neighborhood_cells[coord_cols].values)
  File ".../scipy/spatial/kdtree.py", line 239, in __init__
    self.maxes = np.amax(self.data, axis=0)
  File ".../numpy/core/fromnumeric.py", line 2505, in amax
    initial=initial)
  File ".../numpy/core/fromnumeric.py", line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity
"""

The above exception was the direct cause of the following exception (notebook traceback condensed):

<ipython-input> in <module>
----> tumor_marker_features = featurize_samples(patient_dfs, neighborhood_feature_fn, 100, 'is_index',
                                                'x', 'y', n_processes=N_PARALLEL_PROCESSES)

.../spatial_lda/featurization.py in featurize_samples(sample_dfs, neighborhood_feature_fn, radius, is_anchor_col, x_col, y_col, z_col, n_processes)
----> all_sample_features = list(tqdm(pool.imap(featurize_sample_fn, sample_dfs.items()), total=total))

.../multiprocessing/pool.py in next(self, timeout)
----> raise value

ValueError: zero-size array to reduction operation maximum which has no identity

I think it's because neighborhood_cells becomes an empty array if I set all cells as index cells.
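
(My guess at the failing logic, assuming _featurize_cells splits each sample on the anchor column; a hypothetical reconstruction, not the actual source:)

anchor_cells = sample_df[sample_df[is_anchor_col]]           # every cell, when all values are True
neighborhood_cells = sample_df[~sample_df[is_anchor_col]]    # then empty, so KDTree(...) fails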

Can you kindly advise on how to run the pipeline on all the cells, please?

Thanks a lot!

Magda

neural spatial LDA

Just wondering: will you share the code for your new paper on neural spatial LDA?

Thank you.

add max_ls_iter and max_dirichlet_iter as options in model.train

When increasing the number of topics to be discovered, I get an exception:

Exception: Stopping in admm.newton_regularized_dirichlet.

As you suggested, it could be fixed by increasing the number of line-search iterations (ls_iter in admm.newton_regularized_dirichlet) and the number of Dirichlet iterations (max_dirichlet_iter in admm.admm).

Could you add these options to the model.train, please?
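
A sketch of what the requested interface might look like (the keyword names are hypothetical; they would simply be forwarded to the corresponding admm functions):

spatial_lda_model = spatial_lda.model.train(
    train_features, train_difference_matrices,
    n_topics=N_TOPICS,
    difference_penalty=DIFFERENCE_PENALTY,
    max_dirichlet_iter=30,      # hypothetical: forwarded to admm.admm
    max_dirichlet_ls_iter=100)  # hypothetical: forwarded to admm.newton_regularized_dirichlet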

Option to decrease the max_iter argument in admm().

The max_iter argument used in the admm() function currently has a default value of 15, with no option to decrease it and no criterion for breaking out of the loop early. Given that a substantial amount of training time is spent in this part of the code, the option to decrease these iterations manually might speed things up, especially if there is little to be gained from all 15 iterations.

Incorporating a break in the loop when the marginal gain from the last iteration falls below some tolerance could also be helpful (see the sketch after the listing below). When I ran the example code on the TNBC data, the objective was changing by less than 1% as early as the fifth iteration (see the logging output for sample 6 below). Note that I set TRAIN_SIZE_FRACTION = 0.1 instead of 0.9 for the purposes of running the demo locally on my machine.

N_PARALLEL_PROCESSES = 1  #@param
TRAIN_SIZE_FRACTION = 0.1  #@param
N_TOPICS = 3  #@param
DIFFERENCE_PENALTY = 250  #@param
RETRAIN_MODEL = True  #@param

logger = logging.getLogger()
logger.setLevel(logging.INFO)

N_IMMUNE_CLUSTERS = 12

with open(PATH_TO_PATIENT_DFS_PKL, 'rb') as f:
   patient_dfs = pickle.load(f)
   

for patient_id in patient_dfs.keys():
  df = patient_dfs[patient_id]
  df['combined_cluster_id'] = (df['immune_cluster'].fillna(0) + 
                               (df.cluster_id + N_IMMUNE_CLUSTERS).fillna(0))
  df.loc[df['combined_cluster_id'] == 0, 'combined_cluster_id'] = None
  df.loc[:, 'is_tumor'] = ~df['isimmune']
  patient_dfs[patient_id] = df
  
compartmentalized_tumors = [3, 4, 5, 6, 9, 10, 16, 28, 32, 35, 36, 37, 40, 41]
noncompartmentalized_tumors = [x for x in patient_dfs.keys() 
                               if x not in compartmentalized_tumors]
immune_cluster_names={8:'Treg',6:'MF/Glia',1:'B',4:'CD11c-high',7:'NK',0:'CD4T',
                      2:'DC',3:'CD8T',10:'MF',11:'Neutrophil',9:'Other', 5:'MF'}
other_cluster_names={0:'Epithelial',2:'Tumor/Keratin',3:'Tumor/EGFR',
                     4:'Endothelial/Vim',1:'Mesenchymal/SMA'}
markers = patient_dfs[1].columns[2:44]
num_patients = 41

if os.path.exists(PATH_TO_TUMOR_MARKER_FEATURES_PKL):
  with open(PATH_TO_TUMOR_MARKER_FEATURES_PKL, 'rb') as f:
    tumor_marker_features = pickle.load(f)
else:
  neighborhood_feature_fn = functools.partial(neighborhood_to_marker, 
                                              markers=markers)
  tumor_marker_features = featurize_tumors(patient_dfs, 
                                           neighborhood_feature_fn, 
                                           radius=75,
                                           n_processes=N_PARALLEL_PROCESSES)
  
  with open(PATH_TO_TUMOR_MARKER_FEATURES_PKL, 'wb') as f:
    pickle.dump(tumor_marker_features, f)


# Subsample tumor cells
all_tumor_idxs = tumor_marker_features.index.map(lambda x: x[0])
_sets = train_test_split(tumor_marker_features, 
                         test_size=1. - TRAIN_SIZE_FRACTION,
                         stratify=all_tumor_idxs)
train_tumor_marker_features, test_tumor_marker_features = _sets
train_difference_matrices = make_merged_difference_matrices(
    train_tumor_marker_features, patient_dfs, 'x', 'y')
tumor_idxs = train_tumor_marker_features.index.map(lambda x: x[0])

path_to_train_model = '_'.join((f'{PATH_TO_MODELS}/tnbc_training',
                                f'penalty={DIFFERENCE_PENALTY}',
                                f'topics={N_TOPICS}',
                                f'trainfrac={TRAIN_SIZE_FRACTION}')) + '.pkl'

if os.path.exists(path_to_train_model) and not RETRAIN_MODEL:
  print('Loading existing model and inferred topics.')
  with open(path_to_train_model, 'rb') as f:
    spatial_lda_model = pickle.load(f)
else:
  spatial_lda_model = spatial_lda.model.train(train_tumor_marker_features,
                                              train_difference_matrices,
                                              n_topics=N_TOPICS,
                                              difference_penalty=DIFFERENCE_PENALTY,
                                              verbosity=1,
                                              n_parallel_processes=N_PARALLEL_PROCESSES,
                                              n_iters=1,
                                              admm_rho=0.1,
                                              primal_dual_mu=2)
  with open(path_to_train_model, 'wb') as f:
    pickle.dump(spatial_lda_model, f)

[Screenshot of logging output for sample 6 omitted.]
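
A sketch of the suggested tolerance-based break (hypothetical; run_admm_iteration and the 1% tolerance stand in for the real admm() internals):

prev_objective = None
for it in range(max_iter):
    objective = run_admm_iteration(...)  # hypothetical stand-in for one xi-update pass
    if prev_objective is not None:
        rel_change = abs(objective - prev_objective) / abs(prev_objective)
        if rel_change < 1e-2:  # marginal gain below tolerance: stop early
            break
    prev_objective = objective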

Error encountered when training the model

Hi,

I wanted to fit the spatial LDA model on ~1.6 million cells (using all cells except ~50000 tumor cells as anchor cells), but I ran into some issues when training the model. Here is a snippet of the code:


spatial_lda_models = {}
difference_penalty = 0.25

N_TOPICS_LIST = [3, 4, 5, 6, 7, 8, 9, 10, 11]

for n in N_TOPICS_LIST:
    path_to_train_model = '_'.join((f'spatialLDA_model/', f'penalty = {difference_penalty}',
                                    f'topics={n}', f'trainfrac=0.9')) + '.pkl'
    print(f'Running n_topics = {n}, d = {difference_penalty}\n')
    spatial_lda_model = spatial_lda.model.train(sample_features=train_hodgkin_features,
                                                difference_matrices=train_difference_matrices,
                                                difference_penalty=difference_penalty,
                                                n_topics=n,
                                                n_parallel_processes=20,
                                                verbosity=1,
                                                admm_rho=0.1,
                                                primal_dual_mu=10)
    spatial_lda_models[n] = spatial_lda_model
    with open(path_to_train_model, 'wb') as f:
        pickle.dump(spatial_lda_model, f)

order_topics_consistently(spatial_lda_models.values())

When training the model, I got the exception stopping in admm.newton_regularized_dirichlet. From the traceback, it seems this is caused by the hyperparameter primal_dual_mu. I tried to train the model on a subset of the data with the same hyperparameters, and the code ran successfully. Can you provide some insight into the issue I am facing? Could the large sample size, combined with an improper hyperparameter choice, be causing the optimization algorithm to diverge?

[Screenshot of the traceback omitted.]

hard-coded number of parallel processes in infer function

The infer() function accepts an argument for the number of parallel processes to use, but it then hard-codes the number of processes in the xi-update step to 2. See below:

xis = _update_xis(sample_features,
                  difference_matrices,
                  difference_penalty,
                  gamma=gamma,
                  n_parallel_processes=2,
                  verbosity=0)

This forces inference to be very slow.
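
A minimal sketch of the fix is to forward the function's own argument instead of the literal:

xis = _update_xis(sample_features,
                  difference_matrices,
                  difference_penalty,
                  gamma=gamma,
                  n_parallel_processes=n_parallel_processes,  # was hard-coded to 2
                  verbosity=0)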

Recommendations for Updates

  • Provide a description of the dataframe/input that the model expects.
  • Provide a list of featurization functions for each input data type from which the user can choose:
    -- Example:

func_type = {"marker": neighborhood_to_marker,
             "cluster": neighborhood_to_cluster,
             "avg_marker": neighborhood_to_avg_marker,
             "count": neighborhood_to_count}

if TYPE in ("marker", "avg_marker"):
    neighborhood_feature_fn = functools.partial(func_type[TYPE], markers=markers)
else:
    neighborhood_feature_fn = functools.partial(func_type[TYPE])

  • Make "anchor cell" more generalizable/define more clearly (since most data sets will probably have all cells as anchors
  • Enable user to plot 1 FOV at a time (plot_samples_in_a_row does not work when only one fov number is input)
    -- Maybe rename visualization functions as "sample" instead of "tumor" since not all MIBI images will be of tumors
  • Enable user to save resulting plots (currently plots are produced within the function, so to plot, it would require user to modify function)
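
One common pattern for the last point (a sketch only; plot_one_sample and its signature are hypothetical, not the library's current API) is to have plotting functions accept or return a Matplotlib Axes so the caller controls saving:

import matplotlib.pyplot as plt

def plot_one_sample(features, ax=None):  # hypothetical plotting helper
    if ax is None:
        _, ax = plt.subplots()
    # ... draw the FOV onto ax ...
    return ax

ax = plot_one_sample(features)
ax.figure.savefig('sample_fov.png', dpi=300)  # caller decides where to save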

logsumexp import error

The online_lda.py module tries to import logsumexp from sklearn, but I think this is outdated and throws an ImportError. Importing from scipy.special seems to work.

from sklearn.utils.fixes import logsumexp

ImportError: cannot import name 'logsumexp'
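
A minimal sketch of the suggested change, with a fallback for older environments:

try:
    from scipy.special import logsumexp
except ImportError:  # very old SciPy: fall back to scikit-learn's copy
    from sklearn.utils.fixes import logsumexp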
