giotto-ai / giotto-deep
Deep learning made topological.
License: Other
Accumulating class probabilities as in
giotto-deep/gdeep/pipeline/pipeline.py
Line 110 in 2965f49
leads to CUDA out of memory problems. The accumulation should happen inside a with torch.no_grad() scope so that gradients are not tracked; model.eval() alone does not disable gradient tracking, and in this case CUDA runs out of memory.
In short: we should not accumulate tensors that require grad.
See also "My model reports 'cuda runtime error(2): out of memory'" in
https://pytorch.org/docs/stable/notes/faq.html
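A minimal sketch of the fix (the model and dataloader here are hypothetical stand-ins, not the pipeline's actual objects): accumulate the probabilities inside a torch.no_grad() scope so that no autograd graph is kept alive.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins for the pipeline's model and dataloader.
model = nn.Linear(4, 3)
loader = DataLoader(
    TensorDataset(torch.randn(64, 4), torch.zeros(64, dtype=torch.long)),
    batch_size=16,
)

model.eval()
probs = []
with torch.no_grad():                # nothing inside this scope builds an autograd graph
    for x, _ in loader:
        logits = model(x)
        probs.append(torch.softmax(logits, dim=-1).cpu())
all_probs = torch.cat(probs)         # safe to accumulate: no graph is attached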
Is your feature request related to a problem? Please describe.
There is a need for a method that extracts the gradients of the loss with respect to the activations, i.e.
\frac{\partial\,\mathrm{Loss}}{\partial\,\mathrm{activation}_i}
Describe the solution you'd like
A possible solution would be to deepcopy the model, add bias terms, and compute the gradients with respect to those biases.
Describe alternatives you've considered
Not sure, open to suggestions.
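One alternative sketch (plain PyTorch; none of these names come from gdeep): a forward hook combined with retain_grad() exposes the gradient with respect to an activation directly.

import torch
from torch import nn

# Hypothetical two-layer model; the hook mechanism, not the architecture, is the point.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        output.retain_grad()                  # keep d(Loss)/d(activation) after backward()
        activations[name] = output
    return hook

model[1].register_forward_hook(save_activation("relu"))

x = torch.randn(5, 4)
targets = torch.zeros(5, dtype=torch.long)
loss = nn.CrossEntropyLoss()(model(x), targets)
loss.backward()
grad_wrt_activation = activations["relu"].grad   # tensor of shape (5, 8)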
Additional context
To compute the model accuracy correctly when the model contains dropout layers, use model.eval(); see
https://stackoverflow.com/questions/53879727/pytorch-how-to-deactivate-dropout-in-evaluation-mode
in the code
giotto-deep/gdeep/pipeline/pipeline.py
Line 94 in 2965f49
Description: The gdeep library is difficult to use for developers who are not familiar with it. This can lead to developers not using the library, or using it incorrectly and causing bugs.
Resolution: Write tutorial notebooks that show how to use the different features of the gdeep library and add illustrative examples.
I have different kinds of models that require different input sizes.
I have written a small workaround for that but it would be great to have a proper implementation.
giotto-deep/gdeep/pipeline/pipeline.py
Line 166 in e47feea
When training a model for a regression task, the accuracy is still displayed.
Suggestion: allow for custom metrics, such as F1 score, accuracy, R squared/adjusted R squared, and mean squared error/root mean squared error.
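One possible shape for such a mechanism, sketched with hypothetical names (nothing here exists in gdeep yet): a registry of metric callables that the trainer could consult.

from typing import Callable, Dict, List

import torch

# Hypothetical metric registry: each metric maps (predictions, targets) to a float.
Metric = Callable[[torch.Tensor, torch.Tensor], float]

METRICS: Dict[str, Metric] = {
    "mse": lambda pred, target: torch.mean((pred - target) ** 2).item(),
    "rmse": lambda pred, target: torch.sqrt(torch.mean((pred - target) ** 2)).item(),
}

def evaluate(pred: torch.Tensor, target: torch.Tensor, names: List[str]) -> Dict[str, float]:
    """Compute only the metrics requested by name."""
    return {name: METRICS[name](pred, target) for name in names}

scores = evaluate(torch.randn(10), torch.randn(10), ["mse", "rmse"])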
Description: When dealing with persistence diagrams as input to machine learning models one wants a generic way to deal with data processing.
One such way is a generic Pipeline implemented as a chain of transformations: each transformation takes data as input and returns an output.
Additional transformations can be added by registering them with a register method, so that the implementation of the pipeline itself does not have to change. Furthermore, the pipeline has to be easily storable as a JSON file so that it can be loaded later for inference. This is crucial since the trained model heavily depends on the transformations performed on the input data.
import json
from typing import Callable, Generic, Iterator, List, Optional, TypeVar

import jsonpickle
import torch

T = TypeVar('T')


class Pipeline(Generic[T]):
    """A chain of transformations applied to data in registration order."""

    _transformations: List[Callable[[T], T]]

    def __init__(self, transformations: Optional[List[Callable[[T], T]]] = None) -> None:
        if transformations is None:
            self._transformations = []
        else:
            self._transformations = transformations

    def register(self, transformation: Callable[[T], T]) -> None:
        self._transformations.append(transformation)

    def __call__(self, data: T) -> T:
        for transformation in self._transformations:
            data = transformation(data)
        return data

    def __len__(self) -> int:
        return len(self._transformations)

    def __getitem__(self, index: int) -> Callable[[T], T]:
        return self._transformations[index]

    def __iter__(self) -> Iterator[Callable[[T], T]]:
        return iter(self._transformations)

    def __repr__(self) -> str:
        return f'Pipeline({self._transformations})'

    def __add__(self, other: 'Pipeline[T]') -> 'Pipeline[T]':
        return Pipeline(self._transformations + other._transformations)

    def save_to_json(self, path: str) -> None:
        with open(path, 'w') as f:
            json_transformation = jsonpickle.encode(self)
            json.dump(json_transformation, f)

    @classmethod
    def load_from_json(cls, path: str) -> 'Pipeline[T]':
        with open(path, 'r') as f:
            transformations = json.load(f)
        transform: Pipeline[T] = jsonpickle.decode(transformations)
        return transform


def pipeline(*transformations: Callable[[T], T]) -> Pipeline[T]:
    """Creates a pipeline that applies the given transformations in order."""
    pipe: Pipeline[T] = Pipeline()
    for transformation in transformations:
        pipe.register(transformation)
    return pipe


# Example usage
Tensor = torch.Tensor

def add_one(x: Tensor) -> Tensor:
    return x + 1

def multiply_by_two(x: Tensor) -> Tensor:
    return x * 2

pipe: Pipeline[Tensor] = pipeline(add_one, multiply_by_two)
pipe.save_to_json('pipeline.json')
del pipe
pipe = Pipeline.load_from_json('pipeline.json')

x = torch.tensor([[1, 2], [3, 4]])
y = pipe(x)  # equals (x + 1) * 2
Is your feature request related to a problem? Please describe.
Build a notebook that implements the analysis we did in https://arxiv.org/abs/2112.15210
Describe the solution you'd like
Prepare a user-friendly notebook that reproduces the analysis of the paper in a few lines of code, with nice pictures and descriptions.
Describe alternatives you've considered
N/A
Additional context
Is your feature request related to a problem? Please describe.
Some code in the Trainer class's optimization_step method contains a weird, always-true condition that checks whether the batch index is divisible by 1 ("% 1").
Describe the solution you'd like
Add a print_every parameter to the class in order to restore the feature that this leftover code was presumably meant to implement.
Describe alternatives you've considered
Remove condition altogether if no print delay is expected/desired
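A rough sketch of what the restored behaviour could look like (print_every and train_epoch are illustrative names, not existing Trainer attributes):

from typing import Iterable, Tuple

import torch

def train_epoch(batches: Iterable[Tuple[torch.Tensor, torch.Tensor]],
                print_every: int = 10) -> None:
    """Log the running loss only every `print_every` batches."""
    for batch_idx, (x, y) in enumerate(batches):
        loss = (x.sum() - y.sum()) ** 2            # dummy loss standing in for the real step
        if batch_idx % print_every == 0:           # replaces the always-true `batch_idx % 1 == 0`
            print(f"batch {batch_idx}: loss={loss.item():.4f}")

train_epoch([(torch.randn(4), torch.randn(4)) for _ in range(25)], print_every=10)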
I want to have a torch dataset that loads numpy arrays from a folder. The straightforward place to put it would be in a module datasets. But it would do something totally different from all the other classes in torch_datasets, since those are all dataloader builders. Hence my suggestion is to rename the torch_datasets module to torch_dataloader_builder and put the datasets in the torch_datasets module.
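For illustration, a minimal version of such a dataset could look like this (the class name and file layout are assumptions, not existing gdeep code):

import os

import numpy as np
import torch
from torch.utils.data import Dataset

class NumpyFolderDataset(Dataset):
    """Yield one tensor per .npy file found in a folder."""

    def __init__(self, folder: str) -> None:
        self.files = sorted(
            os.path.join(folder, name)
            for name in os.listdir(folder)
            if name.endswith(".npy")
        )

    def __len__(self) -> int:
        return len(self.files)

    def __getitem__(self, idx: int) -> torch.Tensor:
        return torch.from_numpy(np.load(self.files[idx]))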
Is your feature request related to a problem? Please describe.
modularise the Persformer
Describe the solution you'd like
Modularise the Persformer: have a factory design that takes the config and builds the model.
Describe alternatives you've considered
Additional context
Is your feature request related to a problem? Please describe.
The building of the latest Docker images for K8s should be automatic.
Describe the solution you'd like
Make a CI job for this
Describe alternatives you've considered
N/A
Additional context
When installing as recommended with python -m pip install -U giotto-deep,
it installs the latest versions of torch and other packages; it does not install torch==1.12.1.
This breaks basic_tutorial_image.ipynb, which yields:
ValueError: keyword grid_b is not recognized; valid keywords are ['size', ... 'grid_ms']
If one instead installs the versions pinned in requirements.txt (pip install -r requirements.txt), the notebook yields:
ValueError: No builder registered for key GuidedGradCam
Description: The codebase does not have a comprehensive suite of tests. This makes
it difficult to make changes without introducing bugs, and to know whether the code
is working correctly.
Resolution: Write tests for all new code, and aim to increase test coverage over time.
Hey maintainers 👋🏽 ☕ , I noticed that most (if not all) of the steps of the Python Coding Style can be automated via pre-commit, such as flake8 and pytest, or even the ones in the GH actions such as mypy. Having pre-commit would make that process much easier and enforce that developers follow the coding conventions. Maybe this could even help with #67 if we add black as a hook.
Additional context
Note that this would add pre-commit as a dependency for the project.
Describe the bug
When using giotto's DatasetBuilder to build a torchvision dataset (CIFAR, caltechXXX, ...), providing a custom "root" argument for the PyTorch dataset causes an error: "TypeError: type object got multiple values for keyword argument 'root'". This is because the DatasetBuilder already tries to provide a default root folder.
To reproduce
Try to use the DatasetBuilder as per the following code to create any dataset from the torchvision collection
caltech_bd = DatasetBuilder("Caltech256")
caltech_ds, _, _ = caltech_bd.build(root="~/customFolder")
Expected behavior
The documentation specifies that the kwargs in the "build" method can be used to transfer arguments to the torchvision dataset but nowhere is it mentioned that the "root" argument should not be used. As such, trying to use it should not result in an error
Actual behaviour
The aforementioned error is raised
Is your feature request related to a problem? Please describe.
The pain point is that, most often, plain datasets are not in the right input format or do not have the desired statistical characteristics. Furthermore, standard techniques like data augmentation need to be implemented.
Describe the solution you'd like
We build a generic API class (AbstractClass) for the preprocessing. It should look similar to this one:
from abc import ABC, abstractmethod


class AbstractPreprocessing(ABC):
    """The abstract class to define the interface of preprocessing"""

    @abstractmethod
    def __call__(self, *args, **kwargs):
        """This method deals with datum-wise transformations"""
        pass

    @abstractmethod
    def fit_to_data(self, *args, **kwargs):
        """This method deals with getting dataset-level information"""
        pass
Each of the methods shall be implemented, as it will be called automatically inside the Dataset classes:
__getitem__ will be transformed by item_transform. The data inside item_transform that are needed to perform the transformation will be stored in self. The methods dataset_level_data and batch_level_data will be called only once, before the first time that __getitem__ is called.
Describe alternatives you've considered
Only doing point 3 above (without 1 and 2); however, I find it is always possible to use only that approach, it is much easier to implement, and it is less bound to the generic pipeline.
Additional context
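For concreteness, a hypothetical preprocessing implementing the interface above could look as follows (the Normalize class and its attributes are illustrative, not part of gdeep):

from abc import ABC, abstractmethod

import torch

class AbstractPreprocessing(ABC):           # interface repeated from above for completeness
    @abstractmethod
    def __call__(self, *args, **kwargs): ...

    @abstractmethod
    def fit_to_data(self, *args, **kwargs): ...

class Normalize(AbstractPreprocessing):
    """Standardise each datum with dataset-level statistics."""

    def fit_to_data(self, data: torch.Tensor) -> None:
        # dataset-level information, computed once and stored on self
        self.mean = data.mean(dim=0)
        self.std = data.std(dim=0)

    def __call__(self, datum: torch.Tensor) -> torch.Tensor:
        # datum-wise transformation, applied inside __getitem__
        return (datum - self.mean) / (self.std + 1e-8)

pre = Normalize()
pre.fit_to_data(torch.randn(100, 3))
x = pre(torch.randn(3))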
After running an HPO I get RuntimeWarnings.
c:\Users\Raphael\Documents\GitHub\giotto-deep-new\venv\lib\site-packages\numpy\lib\function_base.py:2821: RuntimeWarning:
Degrees of freedom <= 0 for slice
c:\Users\Raphael\Documents\GitHub\giotto-deep-new\venv\lib\site-packages\numpy\lib\function_base.py:2674: RuntimeWarning:
invalid value encountered in subtract
c:\Users\Raphael\Documents\GitHub\giotto-deep-new\venv\lib\site-packages\numpy\lib\function_base.py:2680: RuntimeWarning:
divide by zero encountered in true_divide
c:\Users\Raphael\Documents\GitHub\giotto-deep-new\venv\lib\site-packages\numpy\lib\function_base.py:2680: RuntimeWarning:
invalid value encountered in multiply
giotto-deep/gdeep/models/simple_nn.py
Line 43 in bbb382a
I think we should not have any activation here since the cross-entropy loss uses the logits.
https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
The input is expected to contain raw, unnormalized scores for each class.
It is also more efficient to compute cross-entropy directly from the logits.
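A small illustration of the point (generic PyTorch, not the library's code): CrossEntropyLoss expects the raw logits and applies the normalization internally.

import torch
from torch import nn

logits = torch.randn(8, 3, requires_grad=True)   # raw, unnormalized scores from the last layer
targets = torch.randint(0, 3, (8,))

loss = nn.CrossEntropyLoss()(logits, targets)    # log-softmax + NLL computed internally
loss.backward()

# Adding a softmax activation in the model and passing probabilities here would
# effectively apply the normalization twice and is numerically less stable.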
Writing in this issue the list of suggestions coming from the review process: once the review process is completed, I will work on implementing these items.
When I use n_accumulated_grads with a value greater than 1 together with batch norm layers, the batch normalization statistics are computed only over the micro-batches and not over the whole accumulated batch. This could cause problems with training stability and validation results.
I think batch norm layers should be treated specially so that the mean and variance are ultimately computed over the whole batch. This is a hard problem and I don't have a solution; it may be worth looking at how the PyTorch Lightning library handles it.
Is your feature request related to a problem? Please describe.
We shall think about submitting a short paper about the usefulness of giotto-deep
Describe the solution you'd like
I would like to propose JOSS. Basically, we need to create a paper branch in which we put paper.md and paper.bib.
Describe alternatives you've considered
Other standard journals like JMLR
Additional context
Describe the bug
The code needs to run smoothly on GPUs as well as on CPUs. For GPUs, there are inconsistencies in loading the data to the device.
To reproduce
Just run the notebooks.
Expected behavior
Actual behaviour
Versions
Additional context
Is your feature request related to a problem? Please describe.
Create notebook recreating the experiments of this paper
In the current version, the model at the end of training is used to get the test accuracy.
giotto-deep/gdeep/pipeline/pipeline.py
Line 150 in 2965f49
Using the model with the highest validation accuracy (or the lowest validation loss) during training would be better.
The weights of that model have to be stored temporarily.
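A minimal sketch of the proposed behaviour (illustrative names, not the pipeline's current code): keep a copy of the best-performing weights during training and load them back before testing.

import copy

import torch
from torch import nn

model = nn.Linear(4, 2)
best_val_acc = float("-inf")
best_state = None

for epoch in range(10):
    val_acc = torch.rand(1).item()               # placeholder for the real validation accuracy
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_state = copy.deepcopy(model.state_dict())   # keep the best weights temporarily

model.load_state_dict(best_state)                # compute the test accuracy with this model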
Is your feature request related to a problem? Please describe.
When distributing computations over K8s, only CPUs are used.
Describe the solution you'd like
We shall probably add a NodeSelector description in the yaml files and test whether the current worker Docker images support GPUs: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#ubuntu
Describe alternatives you've considered
None
Additional context
It would be nice to have a tutorial on how to use pre-trained Huggingface transformer models with giotto-deep.
This would show how to fine-tune within giotto-deep and how to use some interpretability methods on them.
This would replace the existing notebooks on QA and translation which use a T5-style model with random initialization.
Issue:
When using persistence diagrams as PyTorch tensors, one always has to check whether the input has the right format (2-dimensional, one-hot-encoded homology type, ...), which is a huge overhead.
Suggestion:
We could use the typing module to define a subtype of tensor which represents one-hot-encoded persistence diagrams. This would allow safely passing around persistence diagrams without the overhead of additional assertions everywhere.
Sample code:
from typing import NewType

import torch

# Create a subtype of tensor representing one-hot persistence diagrams with
# almost zero runtime overhead.
OneHotPersistenceDiagram = NewType('OneHotPersistenceDiagram', torch.Tensor)


def one_hot_persistence_diagram(persistence_diagram: torch.Tensor) \
        -> OneHotPersistenceDiagram:
    """Convert a persistence diagram to one-hot encoding."""
    assert persistence_diagram.ndim == 2
    # ..... additional assertions here ...
    return OneHotPersistenceDiagram(persistence_diagram)


x = torch.tensor([[0.0, 1.0, 0.0, 1.0],
                  [0.2, 0.4, 1.0, 0.0]])
pd = one_hot_persistence_diagram(x)


def get_number_of_homology_type(persistence_diagram: OneHotPersistenceDiagram) \
        -> int:
    """Get the number of homology types in a persistence diagram."""
    return persistence_diagram.shape[1] - 2


get_number_of_homology_type(pd)  # = 2
get_number_of_homology_type(x)   # flagged as a type error by a static checker such as mypy
                                 # (NewType adds no runtime check, so this does not raise)
# Otherwise we can use pd as a usual pytorch tensor
This should be a torch tensor
If self.dataloaders has length 3, the batch_size parameter is ignored.
See
giotto-deep/gdeep/pipeline/pipeline.py
Line 179 in 2965f49
There should at least be a warning for the user.
In the TorchDataLoader class:
giotto-deep/gdeep/data/torch_datasets.py
Line 84 in f62b2ef
add the clean_up_files decorator in gdeep/search/tests/test_persformer_hyperparamter_search.py to the
test_pipe_1 in gdeep/pipeline/tests/test_pipeline.py
In _train_loop, _val_loop, and _test_loop the loss is averaged by the dataset size:
giotto-deep/gdeep/pipeline/pipeline.py
Line 122 in 2965f49
CrossEntropyLoss is already averaged over the minibatch; hence the losses are averaged twice.
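A small illustration of the double averaging (generic PyTorch, not the pipeline code):

import torch
from torch import nn

criterion = nn.CrossEntropyLoss()                # default reduction="mean"
losses = []
for _ in range(4):                               # 4 minibatches of 8 samples each
    logits = torch.randn(8, 3)
    targets = torch.randint(0, 3, (8,))
    losses.append(criterion(logits, targets).item())   # each value is already a per-sample mean

wrong = sum(losses) / 32            # dividing the sum of means by the dataset size averages twice
right = sum(losses) / len(losses)   # mean of per-batch means (correct for equally sized batches)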
The documentation should contain example use cases and
Is your feature request related to a problem? Please describe.
Need better documentation to allow users to understand the code
Describe the solution you'd like
We need to:
Describe alternatives you've considered
Additional context
Description: Some parts of the codebase have type annotations while others do not. This can make code difficult to understand for developers who are not familiar with the codebase, and it can lead to bugs if the types of variables are not consistent.
Resolution: Add type annotations to all new code and gradually add them to existing code as time permits. Use a type checker like mypy (https://github.com/python/mypy) to ensure consistent types.
Tools: Mypy can be run automatically on commit.
Is your feature request related to a problem? Please describe.
The repository contains a lot of branches that are outdated. My suggestion is to remove them, to get an overview of all useful branches.
For pull requests, use forks, because they can easily be removed or merged back into another branch.
Describe the solution you'd like
Just delete branches that have long been outdated.
Additional context
The aim is to remove the outdated branch
Why should it be removed?
– Because it’s outdated
– Because the codebase can confuse people who contribute code
– Because removing it will make maintaining the codebase easier
According to PEP8:
To better support introspection, modules should explicitly declare the names in their public API using the __all__ attribute. Setting __all__ to an empty list indicates that the module has no public API.
https://peps.python.org/pep-0008/#public-and-internal-interfaces
E.g.
giotto-deep/gdeep/utility/__init__.py
Line 11 in f62b2ef
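For example, a module could declare its public API like this (hypothetical module, shown only to illustrate the convention):

"""Hypothetical module illustrating an explicit public API, as suggested by PEP 8."""

__all__ = ["public_function"]       # only this name belongs to the module's public interface

def public_function() -> int:
    return 42

def _internal_helper() -> int:      # leading underscore and absence from __all__: internal
    return -1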
Is your feature request related to a problem? Please describe.
The current basic way of distributing jobs in K8s is not effective. We need automation.
Describe the solution you'd like
The solution would be to use python-rq, deploy many workers on K8s, and, every time one wants to run heavy HPOs, simply enqueue the jobs and let the workers work.
This also requires a MySQL DB so that Optuna can keep track of the progress and properly distribute the computations.
Describe alternatives you've considered
Using kubectl and volumes, but this easily creates race conditions.
Additional context
ValueError Traceback (most recent call last)
/tmp/ipykernel_3173/723289580.py in
1 # persistent homology after the optimisation
----> 2 vr.fit_transform_plot([dist.detach().numpy()])
/opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/site-packages/gtda/base.py in fit_transform_plot(self, X, y, sample, **plot_params)
116 """
117 self.fit(X, y)
--> 118 Xt = self.transform_plot(X, sample=sample, **plot_params)
119 return Xt
120
/opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/site-packages/gtda/base.py in transform_plot(self, X, sample, **plot_params)
141 """
142 Xt = self.transform(X[sample:sample+1])
--> 143 self.plot({sample: Xt[0]}, sample=sample, **plot_params).show()
144 return Xt
/opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/site-packages/gtda/homology/simplicial.py in plot(Xt, sample, homology_dimensions, plotly_params)
345 return plot_diagram(
346 Xt[sample], homology_dimensions=homology_dimensions,
--> 347 plotly_params=plotly_params
348 )
349
/opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/site-packages/gtda/plotting/persistence_diagrams.py in plot_diagram(diagram, homology_dimensions, plotly_params)
41 posinfinite_mask = np.isposinf(diagram_no_dims)
42 neginfinite_mask = np.isneginf(diagram_no_dims)
---> 43 max_val = np.max(np.where(posinfinite_mask, -np.inf, diagram_no_dims))
44 min_val = np.min(np.where(neginfinite_mask, np.inf, diagram_no_dims))
45 parameter_range = max_val - min_val
<__array_function__ internals> in amax(*args, **kwargs)
/opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/site-packages/numpy/core/fromnumeric.py in amax(a, axis, out, keepdims, initial, where)
2753 """
2754 return _wrapreduction(a, np.maximum, 'max', axis, None, out,
-> 2755 keepdims=keepdims, initial=initial, where=where)
2756
2757
/opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
84 return reduction(axis=axis, out=out, **passkwargs)
85
---> 86 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
87
88
ValueError: zero-size array to reduction operation maximum which has no identity
Error: Process completed with exit code 1.
I got this error while importing DataLoaderBuilder.
To reproduce
from gdeep.data.datasets.base_dataloaders import DataLoaderBuilder
Versions
Linux-5.4.188+-x86_64-with-glibc2.31
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:58:50)
[GCC 10.3.0]
NumPy 1.23.1
SciPy 1.8.1
Joblib 1.1.0
Scikit-learn 1.1.1
Giotto-tda 0.5.1
Additional context
I tried downgrading to pillow 6.1 but got the following:
UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:
Specifications:
- pillow=6.1 -> python[version='2.7.*|3.5.*|3.6.*|3.4.*']
- pillow=6.1 -> python[version='>=2.7,<2.8.0a0|>=3.6,<3.7.0a0|>=3.7,<3.8.0a0']
Your python: python=3.9
To achieve high performance, the data loader should have
pin_memory=True: in this way one avoids the extra copy between pageable and pinned host memory, see:
https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
This can result in roughly a 2x speedup of the host-to-device transfer.
Here is a reference to the code:
giotto-deep/gdeep/pipeline/pipeline.py
Line 204 in 2965f49
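A generic sketch of the pattern (not the library's current dataloader code); pin_memory=True pairs naturally with non_blocking=True on the device transfer:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 3, 32, 32), torch.zeros(256, dtype=torch.long))
loader = DataLoader(dataset, batch_size=64, pin_memory=True)   # batches land in pinned host memory

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for x, y in loader:
    # non_blocking=True lets the host-to-device copy overlap with computation
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)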
Describe the bug
All in the title
To reproduce
Check the example folder
Expected behavior
Best practice suggests removing notebook outputs before pushing them to the repo.
Actual behaviour
Output is there....
Versions
Additional context
self.train_epoch is increased by 1 for every batch.
giotto-deep/gdeep/pipeline/pipeline.py
Line 69 in 2965f49
Is your feature request related to a problem? Please describe.
When using the Gridsearch class for testing different model architectures, one has to pass a dictionary of models to the init function. The problem with that is that
Describe the solution you'd like
Instead of passing a dictionary of models, allow passing a model architecture that depends on hyperparameters, plus a search space of possible model architecture hyperparameters.
Additional context
giotto-deep/gdeep/search/gridsearch.py
Line 16 in 2965f49
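A rough sketch of the proposed interface (all names are hypothetical; nothing here exists in gdeep today):

from typing import Dict, List

from torch import nn

# Hypothetical search space of architecture hyperparameters.
search_space: Dict[str, List[int]] = {"hidden_dim": [32, 64, 128], "num_layers": [1, 2, 3]}

def build_model(hidden_dim: int, num_layers: int) -> nn.Module:
    """Build a fresh model from sampled hyperparameters instead of a fixed dictionary of models."""
    layers: List[nn.Module] = [nn.Linear(10, hidden_dim), nn.ReLU()]
    for _ in range(num_layers - 1):
        layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
    layers.append(nn.Linear(hidden_dim, 2))
    return nn.Sequential(*layers)

# The search would sample from search_space and call build_model(**params) for each trial.
model = build_model(hidden_dim=64, num_layers=2)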
Please add a reference to the paper your code is based on. This makes it easier to look up how the algorithm works.
Description: Different parts of the code will use different conventions for
variable and function names, comments, and general organization. This
leads to inconsistency which can make code difficult to read.
Resolution: Follow a style guide so that all code has a consistent appearance.
Agree on a style guide with all developers in our team and define it in a config file.
Tools: Use Black (https://github.com/psf/black) for code formatting. It can be
set up to run automatically on commit.
For the sake of better readability, it is better to use ModuleList (https://pytorch.org/docs/stable/generated/torch.nn.ModuleList.html) instead of exec in
giotto-deep/gdeep/models/simple_nn.py
Line 28 in 2965f49