
geome's Introduction

geome


The repo provides a set of tools for creating PyTorch Geometric (PyG) data objects from AnnData objects, which are commonly used for storing and manipulating single-cell genomics data. In addition, the repo includes functionality for creating PyTorch Lightning DataModule objects from the PyG data objects, which can be used to build graph neural network (GNN) data pipelines. The PyG data objects represent graphs whose nodes are cells and whose edges are relationships between those cells, and they can be used for GNN tasks such as node classification, graph classification, and link prediction. The repo is written in Python and builds on the PyTorch, PyTorch Geometric, and PyTorch Lightning libraries.

Getting started

Please refer to the documentation.

Installation

You need to have Python 3.9 or newer installed on your system. If you don't have Python installed, we recommend installing Mambaforge.

There are several alternative options to install geome:

  1. Install the latest development version:

Without GPU support:

mamba create -n geome python=3.10
mamba activate geome
pip install torch==2.1.0
pip install torch-scatter torch-sparse torch-cluster
pip install git+https://github.com/theislab/geome.git@main

With GPU:

mamba create -n geome python=3.10
mamba activate geome
mamba install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install torch-scatter torch-sparse torch-cluster -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
pip install git+https://github.com/theislab/geome.git@main

Release notes

See the changelog.

Contact

For questions and help requests, you can reach out in the scverse discourse. If you found a bug, please use the issue tracker.

Credits

Some of the data for DatasetHartmann is distributed in this package. It was originally retrieved from: https://zenodo.org/record/3951613#.Y9flQS-B1qv

This project was generated from @cjolowicz's Hypermodern Python Cookiecutter template.

Citation

t.b.a

geome's People

Contributors

chelseabright96, grst, ivirshup, selmanozleyen, tothmarcella


geome's Issues

Edge weight integration

I would like to select the nodes that are closest to a specific target node. Therefore, I require an additional parameter edge_weight to be saved for the graphs.

The edge weight should represent the Euclidean distance between two nodes given their coordinates, so: $d=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$.

The goal should be to use the edge weights to create subgraphs for nodes given their Euclidean distance.
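
A minimal sketch of how such weights could be attached, assuming the graph already has an edge_index and the node coordinates are available as a tensor (the helper name is hypothetical):

import torch

def add_euclidean_edge_weight(data, coords):
    # coords: (num_nodes x 2) float tensor of spatial coordinates.
    src, dst = data.edge_index
    # One Euclidean distance per edge, matching the formula above.
    data.edge_weight = torch.linalg.norm(coords[src] - coords[dst], dim=1)
    return data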

LinearNCEM fully reproduce an experiment

To see if the training loop works and if we have an interface that complies with the requirements, I suggest that we reproduce a LinearNCEM experiment (preferably the spatial type). Here are the steps I suggest; these can be converted into their own issues.

  • We need to specify the experiment configurations from the manuscript:
    • Optimization configurations (e.g., lr, optimizer).
    • Monitored stats (e.g., r2, SALS, ...)
    • Preprocessing of data: #11
    • Sampling/Loading schemes (e.g., masking stuff, neighbor loading etc.) #10
    • Train, test, val split schemes: #8

I suggest we create an examples/linear_ncem_exp1.ipynb file in the linear-exp-1 branch.

Nodewise splitting of data

Provide data splitting into training, test, and validation sets for a small number of graphs, where one data object would be one node.
Possible solutions:

  • masking the loss function (sketched below)
  • randomly selecting subgraphs
  • selecting subgraphs and appending them to include all neighbors of selected nodes (through directed edges)
  • creating a subgraph for all nodes and splitting like in the case of several graphs
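
A minimal sketch of the first option, assuming a single Data object whose train_mask was already attached (e.g., via RandomNodeSplit):

import torch.nn.functional as F

def masked_loss(out, data):
    # Only nodes selected by train_mask contribute to the training loss;
    # validation and test nodes stay in the graph but are ignored here.
    return F.mse_loss(out[data.train_mask], data.y[data.train_mask])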

Should there be one DataModule?

We can decide to use graphs or nodes as batches in one data module. Should we write one class that does both, or should separate classes handle them?
This is how it could look with one class:

if len(self.data) == 1:
    self.data = self.data[0]
    self.dataloader = RandomNodeSampler(
        self.data, num_parts=self.batch_size, num_workers=self.num_workers
    )
else:
    self.dataloader = DataLoader(
        self.data, batch_size=self.batch_size, shuffle=True,
        num_workers=self.num_workers,
    )

Maybe creating a builder class instead of the adata2data_fn callable?

I realized a weakness of the interface I proposed. Maybe we should create a "Builder" class (from design patterns) to create the dataset we wish (e.g., a list of Data objects, or a Batch object), so that it has an adata2data_fn method for each separate case (train, test, val), similar to the DataModule object having a train_dataloader method just for creating a dataloader.

class AnnData2DataTransform:
    def __init__(self, adata):
        ...

    def train_dataset(self):
        ...

    def test_dataset(self):
        ...

But for this we would need more use cases, so we should wait until the anndata2data function in the utils module is as general as it gets.

ValueError when subgraph creation with library_key for PyG datas

Report

Description: Error message when the category to iterate on (i.e., library_key) is different from the category to be predicted.

Expected behaviour

Given an AnnData object with four cell types and three image_ids, I want to create subgraphs based on image_id but predict the cell types (y). To do so, I need to pass image_id as the library_key to the sq.gr.spatial_neighbors() function. This splits the graph into three subgraphs according to image_id, as shown below:

import anndata as ad
import numpy as np
import pandas as pd
import squidpy as sq

np.random.seed(42)
cluster_centers = np.array([
    [1, 1],  # Center for the first 20 points
    [5, 5],  # Center for the next 20 points
    [1, 5]  # Center for the last 10 points
])

points_cluster_1 = cluster_centers[0] + np.random.randn(20, 2)
points_cluster_2 = cluster_centers[1] + np.random.randn(20, 2)
points_cluster_3 = cluster_centers[2] + np.random.randn(10, 2)

coordinates = np.vstack((points_cluster_1, points_cluster_2, points_cluster_3))
adata_gt = ad.AnnData(
    np.random.rand(50, 2),
    obs={"cell_type": list("ab" * 20) + ["c"] * 5 + ["d"] * 5, "image_id": ["x"] * 20 + ["y"] * 20 + ["z"] * 10},
    obsm={"spatial_init": coordinates},
)

adata_gt.obs['image_id'] = pd.Categorical(adata_gt.obs['image_id'])

sq.gr.spatial_neighbors(adata_gt, spatial_key='spatial_init', library_key="image_id")
print(adata_gt)
sq.pl.spatial_scatter(adata_gt, spatial_key='spatial_init', shape=None, color="image_id", size=30, connectivity_key="spatial_connectivities",)


If I pass different categories to iterables.ToCategoryIterator("image_id", axis="obs", preserve_categories=["cell_type"]), it fails.

Reproducible Example

Inspired by test_ann2data_by_category.py:

from geome import ann2data, transforms

coordinates = np.random.rand(50, 2)
func_args = {"radius": 4.0, "coord_type": "generic", "library_key": "image_id"}
coordinates[:25, 0] += 100
adata_gt = ad.AnnData(
    np.random.rand(50, 2),
    obs={"cell_type": ["a"] * 20 + ["b"] * 20 + ["c"] * 5 + ["d"] * 5, "image_id": list("xy" * 20) + ["z"] * 10},
    obsm={"spatial_init": coordinates},
)
a2d = ann2data.Ann2DataByCategory(
    fields={"x": ["X"], "edge_index": ["uns/edge_index"], "edge_weight": ["uns/edge_weight"], "y": ["obs/cell_type"]},
    category="image_id",
    preprocess=transforms.Categorize(keys=["cell_type", "image_id"]),
    transform=transforms.AddEdgeIndex(
        spatial_key="spatial_init",
        key_added="graph",
        edge_index_key="edge_index",
        edge_weight_key="edge_weight",
        func_args=func_args,
    ),
)
datas = list(a2d(adata_gt.copy()))
assert len(datas) == 3

Error:

ValueError: Found array with 0 sample(s) (shape=(0, 2)) while a minimum of 1 is required by NearestNeighbors.

Version information

No response

Adding non-spatial split support

Since we now have LinearNCEM, we should support the splitting specific to its case. The edge_index attribute is still given in the data object, so the splitting and loading are still done spatially. For example, we split with RandomNodeSplit and load with NeighborLoader.

Besides Graph or Node splitting and loading, I think we should also have a Spatial option. If set, this option will expect an edge_index attribute on the created data objects. Would you agree?

Fix the doc formatting and deploy the page

Description of feature

Here are the steps:

  • fix the docstring formatting in the classes
  • fix the notebooks so they don't run
  • if they are run by the doc build flow, add CI for notebook tests (lots of dependency issues if so)

Missing dependencies

Report

I installed the geome package in a new conda environment with pip install geome. The install does not seem to pull in all required dependencies, such as squidpy. For example, when trying to import from the geome package (e.g., from geome import transforms), I get No module named 'squidpy'.

Version information

No response

closest kNN subgraph creation

Description of feature

Similar to PyTorch Geometric, it would be great to have functions that can create subgraphs from the PyTorch Geometric object. One possibility for subgraph selection that is not implemented in PyTorch Geometric is selecting a subgraph given the nodes' Euclidean distance (edge_weight).

For example, PyTorch Geometric has multiple functions that create subgraphs in torch_geometric.utils.subgraph.

The function should select the n nearest neighbours (given their Euclidean distance) of a target node node_idx and return the subgraph as a PyTorch Geometric object.

subgraph = kNN_subgraph(node_idx: int, n: int)
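
A minimal sketch of how such a helper could look, assuming the Data object already carries an edge_weight holding Euclidean distances (kNN_subgraph and its exact signature are hypothetical):

import torch
from torch_geometric.data import Data
from torch_geometric.utils import subgraph

def kNN_subgraph(data: Data, node_idx: int, n: int) -> Data:
    # Edges leaving the target node and their Euclidean distances.
    mask = data.edge_index[0] == node_idx
    neighbors = data.edge_index[1][mask]
    distances = data.edge_weight[mask]

    # Keep the n closest neighbors plus the target node itself.
    order = torch.argsort(distances)[:n]
    keep = torch.cat([torch.tensor([node_idx]), neighbors[order]]).unique()

    edge_index, edge_weight = subgraph(
        keep, data.edge_index, data.edge_weight,
        relabel_nodes=True, num_nodes=data.num_nodes,
    )
    return Data(x=data.x[keep], edge_index=edge_index, edge_weight=edge_weight)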

anndata2data callable builder class

I know I am not good with names, so all these names can change. I wrote example code showing how this might look.

The main goal is to make the models agnostic to the attributes we deal with. For this we can use a class to create the transformations that the model will use.

Usage will be like this

xs = {"X": "numeric", "obs/domain": "categorical"}
ys = {"obs/cell_type": "categorical"}
adata2data_fn = AnnData2DataCallable(is_sq=True, has_edge_index=False, xs=xs, ys=ys)
dm = GraphAnnDataModule(adata, adata2data_fn, learning_type='graph')

So the most recent version looks like this, and it is almost complete:

from typing import Dict

import squidpy as sq
import torch
from torch_geometric.data import Data


class AnnData2DataCallable:
    def __init__(
        self,
        x_names: Dict[str, str],
        y_names: Dict[str, str],
        is_sq: bool = False,
        has_edge_index: bool = True,
    ):
        self.is_sq = is_sq
        self.has_edge_index = has_edge_index
        if is_sq:
            self._adata_iter = AnnData2DataCallable._get_sq_adata_iter
        else:
            self._adata_iter = lambda x: x  # identity

        self.xs = x_names
        self.ys = y_names

    @staticmethod
    def _get_sq_adata_iter(adata):
        # Yield one AnnData view per library_id category.
        cats = adata.obs.library_id.cat.categories
        for cat in cats:
            yield adata[adata.obs.library_id == cat]

    @staticmethod
    def _get_adj_matrix(adata):
        """Helper function to create the adjacency matrix depending on the adata."""
        # Get adjacency matrices
        if "adjacency_matrix_connectivities" in adata.obsp.keys():
            spatial_connectivities = adata.obsp["adjacency_matrix_connectivities"]
        else:
            spatial_connectivities, _ = sq.gr.spatial_neighbors(
                adata,
                coord_type="generic",
                key_added="spatial",
                copy=True,
            )
        return spatial_connectivities

    def _extract_features(self, adata, obj):
        raise NotImplementedError()

    def _concat_features(self, adata, obj, features_dict):
        raise NotImplementedError()

    def _create_data_obj(self, adata, spatial_connectivities):
        obj = dict()
        if self.has_edge_index:
            # Convert the sparse adjacency matrix to a 2 x edge_count index tensor.
            nodes1, nodes2 = spatial_connectivities.nonzero()
            obj["edge_index"] = torch.vstack(
                [
                    torch.from_numpy(nodes1).to(torch.long),
                    torch.from_numpy(nodes2).to(torch.long),
                ]
            )

        # extract general features, e.g., gene_expression, cell_type
        features_dict = self._extract_features(adata, obj)

        # assign x's and y's here as one torch tensor
        obj = self._concat_features(adata, obj, features_dict)

        return Data(**obj)

    def __call__(self, adatas):
        dataset = []

        for adata in self._adata_iter(adatas):
            spatial_connectivities = self._get_adj_matrix(adata)
            data = self._create_data_obj(adata, spatial_connectivities)
            dataset.append(data)

        return dataset

Determining the behavior for all cases for splitting and loading

Case 1

Split By: Node
Yield Edges: True

Examples:
  • NonLinearNCEM with all datasets.

Strategy

We enforce adata2data_fn to give a list of Data objects with:

  • edge_index attribute (2 x edge_count)

We recommend adata2data_fn to give a list of Data objects with:

  • x attribute as input (node_count x in_features)
  • (depending on the model) y attribute as target (node_count x out_features)
  • (depending on the model) edge_weight attribute (edge_count)

Then do the following:

  1. Merge the list of data to one big graph.
  2. Add train, test, val masks to batch with RandomNodeSplit.
  3. Load with NeighborLoader with the masks as input.
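
A minimal sketch of this strategy, assuming datas is the list of Data objects produced by adata2data_fn:

import torch_geometric.transforms as T
from torch_geometric.data import Batch
from torch_geometric.loader import NeighborLoader

# 1. Merge the list of graphs into one big graph.
big_graph = Batch.from_data_list(datas)

# 2. Attach train/val/test node masks.
big_graph = T.RandomNodeSplit(num_val=0.1, num_test=0.1)(big_graph)

# 3. Sample neighborhoods, seeding only from training nodes.
train_loader = NeighborLoader(
    big_graph,
    num_neighbors=[10, 10],  # neighbors per hop; these values are an assumption
    input_nodes=big_graph.train_mask,
    batch_size=64,
)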

Case 2

Split By: Node
Yield Edges: False

Strategy

This case should raise a NotImplementedError because splitting by nodes and then loading non-spatially is equivalent to the case
where we consider each node as a small graph and do Case 4. I don't think this is an urgent thing to do.

Case 3

Split By: Graph
Yield Edges : True

Strategy

This case should raise a NotImplementedError because splitting by graphs and then loading spatially is equivalent to adding a train_mask full of ones to the graphs chosen for training and zeros to the others (similarly for val & test), merging the graphs into one big graph, and then doing Case 1.

Case 4

Split By: Graph
Yield Edges: False

Examples:
  • LinearNCEM with all datasets, both the spatial and non-spatial variants.

Strategy

We recommend adata2data_fn to give a list of Data objects with:

  • x attribute as input (node_count x in_features)
  • (depending on the model) y attribute as target (node_count x out_features)

Then do the following:

  1. Split the Python list of graphs by indices into train/val/test subsets.
  2. Load each subset with DataListLoader.
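
A minimal sketch of this strategy, assuming datas is the list of Data objects and an 80/10/10 split (the split ratios are an assumption):

import random
from torch_geometric.loader import DataListLoader

random.shuffle(datas)
n = len(datas)
train = datas[: int(0.8 * n)]
val = datas[int(0.8 * n) : int(0.9 * n)]
test = datas[int(0.9 * n) :]

# Each batch yielded by this loader is a plain Python list of Data objects.
train_loader = DataListLoader(train, batch_size=4, shuffle=True)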

notebooks

Report

The notebooks don't work.

Version information

No response

LinearSpatial implementation doesn't use Message Passing. Also not GPU compatible.

We currently have this implementation of the LinearSpatial forward method:

  def forward(self, x, edge_index):
      """
      Inputs:
          x - Input features per node
          edge_index - input edge indices. Shape 2 x 2*(No. edges)
      """

      X_cell_type = x[:, 0 : self.num_cell_types + 1]

      if self.mult_features:
          X_domain = x[:, self.num_cell_types + 1 :]

      num_obs = x.shape[0]

      # Define adjacency matrix
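      # NOTE: torch.eye allocates on the CPU by default, so this line
      # breaks when x and edge_index live on the GPU.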
      adj_matrix = torch.eye(num_obs) + torch.squeeze(to_dense_adj(edge_index))  # NxN

      # Compute discrete target cell interactions
      Xs = torch.matmul(adj_matrix, X_cell_type)
      Xt = Xs > 0

      # Compute interaction matrix
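      # NOTE: torch.empty also allocates on the CPU, and the Python double
      # loop over cell-type pairs below should be vectorized.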
      Xts = torch.empty(num_obs, self.num_cell_types**2)
      i = 0
      for col1 in range(self.num_cell_types):
          for col2 in range(self.num_cell_types):
              Xts[:, i] = x[:, col1] * Xt[:, col2]
              i += 1

      # Define design matrix
      if self.mult_features:
          Xd = torch.cat([X_cell_type, Xts, X_domain], dim=1)  # Nx(L+L^2+C)
      else:
          Xd = torch.cat([X_cell_type, Xts], dim=1)  # Nx(L+L^2)

      return self.linear(Xd)

The things we should focus on:

  • Using vectorized code (i.e., no loops); see the sketch after this list.
  • Making use of the edge_index instead of creating an adj matrix. Otherwise, this model wouldn't even need pytorch geometric.
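
For the first point, the double loop could be replaced with a broadcasted outer product. A minimal sketch, writing L for self.num_cell_types:

# [i, c1, c2] = x[i, c1] * Xt[i, c2]; reshaping reproduces the loop's
# column order col1 * L + col2.
L = self.num_cell_types
Xts = (x[:, :L].unsqueeze(2) * Xt[:, :L].float().unsqueeze(1)).reshape(num_obs, L * L)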

Also, one thing I currently don't understand: could we do these operations outside the forward method? It seems like nothing except self.linear is being optimized. So isn't the part before self.linear just a preprocessing step? @chelseabright96
