
geome's Introduction

geome


The repo provides a set of tools for creating PyTorch Geometric (PyG) data objects from AnnData objects, which are commonly used for storing and manipulating single-cell genomics data. In addition, the repo includes functionality for creating PyTorch Lightning DataModule objects from the PyG data objects, which can be used to build graph neural network (GNN) data pipelines. The PyG data objects represent graphs whose nodes are cells and whose edges are relationships between those cells, and they can be used for GNN tasks such as node classification, graph classification, and link prediction. The repo is written in Python and builds on the PyTorch, PyTorch Geometric, and PyTorch Lightning libraries.

Getting started

Please refer to the documentation.

Installation

You need to have Python 3.9 or newer installed on your system. If you don't have Python installed, we recommend installing Mambaforge.

There are several alternative options to install geome:

  1. Install the latest development version:

Without GPU support:

mamba create -n geome python=3.10
mamba activate geome
pip install torch==2.1.0
pip install torch-scatter torch-sparse torch-cluster
pip install git+https://github.com/theislab/geome.git@main

With GPU:

mamba create -n geome python=3.10
mamba activate geome
mamba install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install torch-scatter torch-sparse torch-cluster -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
pip install git+https://github.com/theislab/geome.git@main

Release notes

See the changelog.

Contact

For questions and help requests, you can reach out in the scverse discourse. If you found a bug, please use the issue tracker.

Credits

Some of the data for DatasetHartmann is distributed in this package. It was originally retrieved from: https://zenodo.org/record/3951613#.Y9flQS-B1qv

This project was generated from @cjolowicz's Hypermodern Python Cookiecutter template.

Citation

t.b.a

geome's People

Contributors

chelseabright96, grst, ivirshup, selmanozleyen, tothmarcella


geome's Issues

Edge weight integration

I would like to select the nodes that are closest to a specific target node. Therefore, I require an additional parameter edge_weight to be saved for the graphs.

The edge weight should represent the Euclidean distance between two nodes given their coordinates, so: $d=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$.

The goal should be to use the edge weights to create subgraphs for nodes given their Euclidean distance.
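
A minimal sketch of how such weights could be attached, assuming the graph already has an edge_index and the node coordinates are available as a tensor (the helper name is hypothetical):

import torch

def add_euclidean_edge_weight(data, coords):
    # coords: (num_nodes x 2) float tensor of spatial coordinates.
    src, dst = data.edge_index
    # One Euclidean distance per edge, matching the formula above.
    data.edge_weight = torch.linalg.norm(coords[src] - coords[dst], dim=1)
    return data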

LinearNCEM fully reproduce an experiment

To see if the training loop works and if we have an interface that complies with the requirements, I suggest that we reproduce a LinearNCEM experiment (preferably the spatial type). Here are the steps I suggest; these can be converted into their own issues.

  • We need to specify the experiment configurations from the manuscript:
    • Optimization configurations (e.g., lr, optimizer).
    • Monitored stats (e.g., r2, SALS, ...)
    • Preprocessing of data: #11
    • Sampling/Loading schemes (e.g., masking stuff, neighbor loading etc.) #10
    • Train, test, val split schemes: #8

I suggest we create an examples/linear_ncem_exp1.ipynb file in the linear-exp-1 branch.

Nodewise splitting of data

Provide data splitting into training, test, and validation sets for a small number of graphs, where one data object would be one node.
Possible solutions:

  • masking the loss function (sketched below)
  • randomly selecting subgraphs
  • selecting subgraphs and appending them to include all neighbors of selected nodes (through directed edges)
  • creating a subgraph for all nodes and splitting like in the case of several graphs
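
A minimal sketch of the first option, assuming a single Data object whose train_mask was already attached (e.g., via RandomNodeSplit):

import torch.nn.functional as F

def masked_loss(out, data):
    # Only nodes selected by train_mask contribute to the training loss;
    # validation and test nodes stay in the graph but are ignored here.
    return F.mse_loss(out[data.train_mask], data.y[data.train_mask])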

Should there be one DataModule?

We can decide to use graphs or nodes as batches in one data module. Should we write one class that does both, or should separate classes handle them?
This is how it could look with one class:

if len(self.data) == 1:
    self.data = self.data[0]
    self.dataloader = RandomNodeSampler(
        self.data, num_parts=self.batch_size, num_workers=self.num_workers
    )
else:
    self.dataloader = DataLoader(
        self.data, batch_size=self.batch_size, shuffle=True,
        num_workers=self.num_workers,
    )

Maybe creating a builder class instead of the adata2data_fn callable?

I realized a weakness of the interface I proposed. Maybe we should create a "Builder" class (from design patterns) to create the dataset we wish (e.g., a list of Data objects, or a Batch object), so that it has an adata2data_fn method for each separate case (train, test, val), similar to the DataModule object having a train_dataloader method just for creating a dataloader.

class AnnData2DataTransform:
    def __init__(self, adata):
        ...

    def train_dataset(self):
        ...

    def test_dataset(self):
        ...

But for this we would need more use cases, so we should wait until the anndata2data function in the utils module is as general as it gets.

ValueError when subgraph creation with library_key for PyG datas

Report

Description: Error message when the category to iterate on (i.e., library_key) is different from the category to be predicted.

Expected behaviour

Given an AnnData object with four cell types and three image_ids, I want to create subgraphs based on image_id but predict the cell types (y). To do so, I need to pass image_id as the library_key to the sq.gr.spatial_neighbors() function. This splits the graph into three subgraphs according to image_id, as shown below:

import anndata as ad
import numpy as np
import pandas as pd
import squidpy as sq

np.random.seed(42)
cluster_centers = np.array([
    [1, 1],  # Center for the first 20 points
    [5, 5],  # Center for the next 20 points
    [1, 5]  # Center for the last 10 points
])

points_cluster_1 = cluster_centers[0] + np.random.randn(20, 2)
points_cluster_2 = cluster_centers[1] + np.random.randn(20, 2)
points_cluster_3 = cluster_centers[2] + np.random.randn(10, 2)

coordinates = np.vstack((points_cluster_1, points_cluster_2, points_cluster_3))
adata_gt = ad.AnnData(
    np.random.rand(50, 2),
    obs={"cell_type": list("ab" * 20) + ["c"] * 5 + ["d"] * 5, "image_id": ["x"] * 20 + ["y"] * 20 + ["z"] * 10},
    obsm={"spatial_init": coordinates},
)

adata_gt.obs['image_id'] = pd.Categorical(adata_gt.obs['image_id'])

sq.gr.spatial_neighbors(adata_gt, spatial_key='spatial_init', library_key="image_id")
print(adata_gt)
sq.pl.spatial_scatter(adata_gt, spatial_key='spatial_init', shape=None, color="image_id", size=30, connectivity_key="spatial_connectivities",)


If I pass different categories to iterables.ToCategoryIterator("image_id", axis="obs", preserve_categories=["cell_type"]), it fails.

Reproducible Example

Inspired by test_ann2data_by_category.py:

from geome import ann2data, transforms

coordinates = np.random.rand(50, 2)
func_args = {"radius": 4.0, "coord_type": "generic", "library_key": "image_id"}
coordinates[:25, 0] += 100
adata_gt = ad.AnnData(
    np.random.rand(50, 2),
    obs={"cell_type": ["a"] * 20 + ["b"] * 20 + ["c"] * 5 + ["d"] * 5, "image_id": list("xy" * 20) + ["z"] * 10},
    obsm={"spatial_init": coordinates},
)
a2d = ann2data.Ann2DataByCategory(
    fields={"x": ["X"], "edge_index": ["uns/edge_index"], "edge_weight": ["uns/edge_weight"], "y": ["obs/cell_type"]},
    category="image_id",
    preprocess=transforms.Categorize(keys=["cell_type", "image_id"]),
    transform=transforms.AddEdgeIndex(
        spatial_key="spatial_init",
        key_added="graph",
        edge_index_key="edge_index",
        edge_weight_key="edge_weight",
        func_args=func_args,
    ),
)
datas = list(a2d(adata_gt.copy()))
assert len(datas) == 3

Error:

ValueError: Found array with 0 sample(s) (shape=(0, 2)) while a minimum of 1 is required by NearestNeighbors.

Version information

No response

Adding non-spatial split support

Since we now have LinearNCEM, we should support the splitting specific to its case. The edge_index attribute is still given in the data object, so the splitting and loading are still done spatially. For example, we split with RandomNodeSplit and load with NeighborLoader.

Besides Graph or Node splitting and loading, I think we should also have a Spatial option. If set, this option will expect an edge_index attribute on the created data objects. Would you agree?

Fix the doc formatting and deploy the page

Description of feature

Here are the steps:

  • fix the docstring formatting in the classes
  • fix the notebooks so they don't run
  • if they are run by the doc build flow, add CI for notebook tests (lots of dependency issues if so)

Missing dependencies

Report

I installed the geome package in a new conda environment with pip install geome. The install does not seem to pull in all required dependencies, such as squidpy. For example, when trying to import from the geome package (e.g., from geome import transforms), I get No module named 'squidpy'.

Version information

No response

closest kNN subgraph creation

Description of feature

Similar to PyTorch Geometric, it would be great to have functions that can create subgraphs from the PyTorch Geometric object. One possibility for subgraph selection that is not implemented in PyTorch Geometric is selecting a subgraph given the nodes' Euclidean distance (edge_weight).

For example, PyTorch Geometric has multiple functions that create subgraphs in torch_geometric.utils.subgraph.

The function should select the n nearest neighbours (given their Euclidean distance) of a target node node_idx and return the subgraph as a PyTorch Geometric object.

subgraph = kNN_subgraph(node_idx: int, n: int)
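
A minimal sketch of how such a helper could look, assuming the Data object already carries an edge_weight holding Euclidean distances (kNN_subgraph and its exact signature are hypothetical):

import torch
from torch_geometric.data import Data
from torch_geometric.utils import subgraph

def kNN_subgraph(data: Data, node_idx: int, n: int) -> Data:
    # Edges leaving the target node and their Euclidean distances.
    mask = data.edge_index[0] == node_idx
    neighbors = data.edge_index[1][mask]
    distances = data.edge_weight[mask]

    # Keep the n closest neighbors plus the target node itself.
    order = torch.argsort(distances)[:n]
    keep = torch.cat([torch.tensor([node_idx]), neighbors[order]]).unique()

    edge_index, edge_weight = subgraph(
        keep, data.edge_index, data.edge_weight,
        relabel_nodes=True, num_nodes=data.num_nodes,
    )
    return Data(x=data.x[keep], edge_index=edge_index, edge_weight=edge_weight)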

anndata2data callable builder class

I know I am not good with names, so all these names can change. I wrote example code showing how this might look.

The main goal is to make the models agnostic to the attributes we deal with. For this we can use a class to create the transformations that the model will use.

Usage will be like this

xs = {"X": "numeric", "obs/domain": "categorical"}
ys = {"obs/cell_type": "categorical"}
adata2data_fn = AnnData2DataCallable(is_sq=True, has_edge_index=False, xs=xs, ys=ys)
dm = GraphAnnDataModule(adata, adata2data_fn, learning_type='graph')

So the most recent version looks like this, and it is almost complete:

from typing import Dict

import squidpy as sq
import torch
from torch_geometric.data import Data


class AnnData2DataCallable:
    def __init__(
        self,
        x_names: Dict[str, str],
        y_names: Dict[str, str],
        is_sq: bool = False,
        has_edge_index: bool = True,
    ):
        self.is_sq = is_sq
        self.has_edge_index = has_edge_index
        if is_sq:
            self._adata_iter = AnnData2DataCallable._get_sq_adata_iter
        else:
            self._adata_iter = lambda x: x  # identity

        self.xs = x_names
        self.ys = y_names

    @staticmethod
    def _get_sq_adata_iter(adata):
        # Yield one AnnData view per library_id category.
        cats = adata.obs.library_id.cat.categories
        for cat in cats:
            yield adata[adata.obs.library_id == cat]

    @staticmethod
    def _get_adj_matrix(adata):
        """Helper function to create the adjacency matrix depending on the adata."""
        # Get adjacency matrices
        if "adjacency_matrix_connectivities" in adata.obsp.keys():
            spatial_connectivities = adata.obsp["adjacency_matrix_connectivities"]
        else:
            spatial_connectivities, _ = sq.gr.spatial_neighbors(
                adata,
                coord_type="generic",
                key_added="spatial",
                copy=True,
            )
        return spatial_connectivities

    def _extract_features(self, adata, obj):
        raise NotImplementedError()

    def _concat_features(self, adata, obj, features_dict):
        raise NotImplementedError()

    def _create_data_obj(self, adata, spatial_connectivities):
        obj = dict()
        if self.has_edge_index:
            # Convert the sparse adjacency matrix to a 2 x edge_count index tensor.
            nodes1, nodes2 = spatial_connectivities.nonzero()
            obj["edge_index"] = torch.vstack(
                [
                    torch.from_numpy(nodes1).to(torch.long),
                    torch.from_numpy(nodes2).to(torch.long),
                ]
            )

        # extract general features, e.g., gene_expression, cell_type
        features_dict = self._extract_features(adata, obj)

        # assign x's and y's here as one torch tensor
        obj = self._concat_features(adata, obj, features_dict)

        return Data(**obj)

    def __call__(self, adatas):
        dataset = []

        for adata in self._adata_iter(adatas):
            spatial_connectivities = self._get_adj_matrix(adata)
            data = self._create_data_obj(adata, spatial_connectivities)
            dataset.append(data)

        return dataset

Determining the behavior for all cases for splitting and loading

Case 1

Split By: Node
Yield Edges: True

Examples:
  • NonLinearNCEM with all datasets.

Strategy

We enforce adata2data_fn to give a list of Data objects with:

  • edge_index attribute (2 x edge_count)

We recommend adata2data_fn to give a list of Data objects with:

  • x attribute as input (node_count x in_features)
  • (depending on the model) y attribute as target (node_count x out_features)
  • (depending on the model) edge_weight attribute (edge_count)

Then do the following:

  1. Merge the list of data to one big graph.
  2. Add train, test, val masks to batch with RandomNodeSplit.
  3. Load with NeighborLoader with the masks as input.
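
A minimal sketch of this strategy, assuming datas is the list of Data objects produced by adata2data_fn:

import torch_geometric.transforms as T
from torch_geometric.data import Batch
from torch_geometric.loader import NeighborLoader

# 1. Merge the list of graphs into one big graph.
big_graph = Batch.from_data_list(datas)

# 2. Attach train/val/test node masks.
big_graph = T.RandomNodeSplit(num_val=0.1, num_test=0.1)(big_graph)

# 3. Sample neighborhoods, seeding only from training nodes.
train_loader = NeighborLoader(
    big_graph,
    num_neighbors=[10, 10],  # neighbors per hop; these values are an assumption
    input_nodes=big_graph.train_mask,
    batch_size=64,
)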

Case 2

Split By: Node
Yield Edges: False

Strategy

This case should raise a NotImplementedError because splitting by nodes and then loading non-spatially is equivalent to the case
where we consider each node as a small graph and do Case 4. I don't think this is an urgent thing to do.

Case 3

Split By: Graph
Yield Edges : True

Strategy

This case should raise a NotImplementedError because splitting by graphs and then loading spatially is equivalent to adding a train_mask full of ones to the graphs chosen for training and zeros to the others (similarly for val & test), merging the graphs into one big graph, and then doing Case 1.

Case 4

Split By: Graph
Yield Edges: False

Examples:
  • LinearNCEM with all datasets, both the spatial and non-spatial variants.

Strategy

We recommend adata2data_fn to give a list of Data objects with:

  • x attribute as input (node_count x in_features)
  • (depending on the model) y attribute as target (node_count x out_features)

Then do the following:

  1. Split the Python list of graphs by indices into train/val/test subsets.
  2. Load each subset with DataListLoader.
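
A minimal sketch of this strategy, assuming datas is the list of Data objects and an 80/10/10 split (the split ratios are an assumption):

import random
from torch_geometric.loader import DataListLoader

random.shuffle(datas)
n = len(datas)
train = datas[: int(0.8 * n)]
val = datas[int(0.8 * n) : int(0.9 * n)]
test = datas[int(0.9 * n) :]

# Each batch yielded by this loader is a plain Python list of Data objects.
train_loader = DataListLoader(train, batch_size=4, shuffle=True)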

notebooks

Report

The notebooks don't work.

Version information

No response

LinearSpatial implementation doesn't use Message Passing. Also not GPU compatible.

We currently have this implementation of the LinearSpatial forward method:

  def forward(self, x, edge_index):
      """
      Inputs:
          x - Input features per node
          edge_index - input edge indices. Shape 2 x 2*(No. edges)
      """

      X_cell_type = x[:, 0 : self.num_cell_types + 1]

      if self.mult_features:
          X_domain = x[:, self.num_cell_types + 1 :]

      num_obs = x.shape[0]

      # Define adjacency matrix
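      # NOTE: torch.eye allocates on the CPU by default, so this line
      # breaks when x and edge_index live on the GPU.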
      adj_matrix = torch.eye(num_obs) + torch.squeeze(to_dense_adj(edge_index))  # NxN

      # Compute discrete target cell interactions
      Xs = torch.matmul(adj_matrix, X_cell_type)
      Xt = Xs > 0

      # Compute interaction matrix
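      # NOTE: torch.empty also allocates on the CPU, and the Python double
      # loop over cell-type pairs below should be vectorized.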
      Xts = torch.empty(num_obs, self.num_cell_types**2)
      i = 0
      for col1 in range(self.num_cell_types):
          for col2 in range(self.num_cell_types):
              Xts[:, i] = x[:, col1] * Xt[:, col2]
              i += 1

      # Define design matrix
      if self.mult_features:
          Xd = torch.cat([X_cell_type, Xts, X_domain], dim=1)  # Nx(L+L^2+C)
      else:
          Xd = torch.cat([X_cell_type, Xts], dim=1)  # Nx(L+L^2)

      return self.linear(Xd)

The things we should focus on:

  • Using vectorized code (i.e., no loops); see the sketch after this list.
  • Making use of the edge_index instead of creating an adj matrix. Otherwise, this model wouldn't even need pytorch geometric.
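
For the first point, the double loop could be replaced with a broadcasted outer product. A minimal sketch, writing L for self.num_cell_types:

# [i, c1, c2] = x[i, c1] * Xt[i, c2]; reshaping reproduces the loop's
# column order col1 * L + col2.
L = self.num_cell_types
Xts = (x[:, :L].unsqueeze(2) * Xt[:, :L].float().unsqueeze(1)).reshape(num_obs, L * L)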

Also, one thing I currently don't understand: could we do these operations outside the forward method? It seems like nothing except self.linear is being optimized. So isn't the part before self.linear just a preprocessing step? @chelseabright96
