Comments (10)

joein commented on May 18, 2024

We discussed this, and here are two possible solutions we came up with:

  • torch.Tensor, np.ndarray, and everything built on them (e.g. pandas objects) are bound to break: the former raises RuntimeError, the latter ValueError. We can suppress these exceptions in a helper like obj_in_list and wait for feedback on whether that is sufficient.
  • replace fetch_unique_objects with emit_objects, as discussed earlier in this thread

@generall

from quaterion.

joein commented on May 18, 2024

Unfortunately, we can't just calculate a hash here, because objects may not be hashable, so we need to pass key_extractor_fn somehow.
Also, there can be several key_extractor_fn, because there can be several encoders, so we would need to apply either each of them or just one of them, which is debatable. And we do the same thing in CacheDataLoader.

My proposal is to replace fetch_unique_objects with just emit_objects, implemented roughly as follows:

For PairSimilarityDataLoader:

@classmethod
def emit_objects(cls, batch: List[SimilarityPairSample]) -> Iterator[Any]:
    # generator: yields both objects of every pair, duplicates included
    for sample in batch:
        yield sample.obj_a
        yield sample.obj_b

For GroupSimilarityDataLoader:

@classmethod
def emit_objects(cls, batch: List[SimilarityGroupSample]) -> Iterator[Any]:
    # generator: yields the object of every sample, duplicates included
    for sample in batch:
        yield sample.obj
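
A self-contained sketch of how these two generators might behave. The sample classes and loader classes below are simplified stand-ins written for this example only; they are not Quaterion's actual definitions.

```python
from dataclasses import dataclass
from typing import Any, Iterator, List


@dataclass
class PairSample:  # hypothetical stand-in for SimilarityPairSample
    obj_a: Any
    obj_b: Any


@dataclass
class GroupSample:  # hypothetical stand-in for SimilarityGroupSample
    obj: Any
    group: int


class PairLoaderSketch:  # hypothetical, reduced to the method under discussion
    @classmethod
    def emit_objects(cls, batch: List[PairSample]) -> Iterator[Any]:
        # Yield every object, duplicates included; no == comparison is needed.
        for sample in batch:
            yield sample.obj_a
            yield sample.obj_b


class GroupLoaderSketch:  # hypothetical, reduced to the method under discussion
    @classmethod
    def emit_objects(cls, batch: List[GroupSample]) -> Iterator[Any]:
        for sample in batch:
            yield sample.obj


pair_objects = list(
    PairLoaderSketch.emit_objects([PairSample("a", "b"), PairSample("a", "c")])
)
group_objects = list(
    GroupLoaderSketch.emit_objects([GroupSample("x", 0), GroupSample("y", 0)])
)
```

Since nothing is compared, the repeated "a" is emitted twice; any deduplication would have to happen later, on the extracted keys.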

monatis commented on May 18, 2024

> replace fetch_unique_objects with just emit_objects

In this case, we will lose the whole functionality to prevent multiple calculation, won't we?

> we can't just calculate a hash here,

Ok, so what about this one:

def obj_in_list(obj, obj_list):
    try:
        return obj in obj_list
    except (RuntimeError, ValueError):
        # RuntimeError (torch) or ValueError (numpy) is the reported exception;
        # assume obj is not in the list so that it gets added anyway
        return False

Now we can use it like this: if not obj_in_list(sample.obj, unique_objects):
This works as usual unless we hit the reported bug, and if we do, the object is still processed safely. WDYT?
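
A minimal sketch of how that helper behaves in a deduplication loop. To keep it runnable without torch or numpy, AmbiguousEq is a hypothetical stand-in whose == result has no single truth value, which is the property that triggers the reported error; catching only RuntimeError and ValueError is an assumption based on the exceptions named earlier in this thread.

```python
class _ElementwiseResult:
    """Stand-in for the array that == returns on tensors and ndarrays."""

    def __bool__(self):
        # Mirrors the error raised when a multi-element result is used as a bool.
        raise ValueError("truth value of an elementwise comparison is ambiguous")


class AmbiguousEq:
    """Hypothetical stand-in for a tensor-like object."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        return _ElementwiseResult()


def obj_in_list(obj, obj_list):
    try:
        return obj in obj_list
    except (RuntimeError, ValueError):
        # Comparison failed, so treat obj as "not seen" and let it be added.
        return False


a = AmbiguousEq([1, 2, 3])
b = AmbiguousEq([1, 2, 3])  # same values as a, but a different object

unique_objects = []
for obj in (a, b, a):
    if not obj_in_list(obj, unique_objects):
        unique_objects.append(obj)
```

Here the second occurrence of a is skipped through the identity check, while b is kept even though it holds the same values. The fallback also means an object can be added again if an earlier element in the list raises before the identity match is reached, so this protects against the crash rather than guaranteeing uniqueness.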

joein commented on May 18, 2024

> In this case, we will lose the whole functionality to prevent multiple calculation, won't we?

Actually, only part of it. The current flow is:

  • fetch the unique objects in the batch
  • for every unique object, calculate its key via the key extractors
  • if the calculated key has not been seen by the current dataloader yet, calculate its embeddings; otherwise do nothing

So fetch_unique_objects only saves us from repeated key calculation, which I guess is not that crucial.
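
The flow without the first step can be sketched as follows. This is only an illustration under assumptions: fill_cache, encode, and the dict-based cache are hypothetical names for this example, not Quaterion's API.

```python
from typing import Any, Callable, Dict, Hashable, Iterable, List


def fill_cache(
    objects: Iterable[Any],
    key_extractor_fn: Callable[[Any], Hashable],
    encode: Callable[[Any], List[float]],
    cache: Dict[Hashable, List[float]],
) -> int:
    """Encode only objects whose key is not cached yet; return the number of encode calls."""
    encode_calls = 0
    for obj in objects:
        key = key_extractor_fn(obj)  # computed for every object, duplicates included
        if key not in cache:  # compares hashable keys, never the objects themselves
            cache[key] = encode(obj)
            encode_calls += 1
    return encode_calls


batch = [
    {"path_to_image": "a.png", "value": [1, 2, 3]},
    {"path_to_image": "b.png", "value": [1, 2, 2]},
    {"path_to_image": "a.png", "value": [1, 2, 3]},  # duplicate of the first
]
cache: Dict[Hashable, List[float]] = {}
calls = fill_cache(
    batch,
    key_extractor_fn=lambda obj: obj["path_to_image"],
    encode=lambda obj: [float(v) for v in obj["value"]],
    cache=cache,
)
```

The key is extracted three times but the encoder runs only twice, so the expensive step is still deduplicated.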

joein commented on May 18, 2024

> Ok, so what about this one:

It could be a solution; I need to look into it more thoroughly.

monatis commented on May 18, 2024

And here's the minimal code to reproduce this bug:

import numpy as np
import torch

l = []
t1 = torch.from_numpy(np.array([1, 2, 3]))  # remove `torch.from_numpy()` for the numpy version
t2 = torch.from_numpy(np.array([1, 2, 2]))
ts = [t1, t2]

for t in ts:
    if t not in l:
        l.append(t)

print("everything fine")

monatis commented on May 18, 2024

Another note on a surprising tensor behavior we found: two tensors with the same values do not produce the same hash, because Tensor.__hash__ hashes by id(tensor).

import numpy as np
import torch

# create two tensors with the same values
t1 = torch.from_numpy(np.array([1, 2, 3]))
t2 = torch.from_numpy(np.array([1, 2, 3]))

d = {hash(t1): "some value"}
print(hash(t2) in d)  # this is False to our surprise

d = {t1: "some value"}
print(t2 in d)  # this is also False

# only this one is True
print(t1 in d)

generall commented on May 18, 2024

What would definitely help this discussion: tests with examples that reproduce the issue.

joein commented on May 18, 2024

The exception comes from the way the in operator works. It compares the new object with those already in the collection: for each pair it first checks whether they are the same object (obj_a is obj_b), and otherwise whether they are equal via ==. That second check is where the exception occurs, because == on tensors and similar objects returns an elementwise result that cannot be reduced to a single boolean.
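
The identity-then-equality order can be observed with a small class that counts its == calls. Tracked is a hypothetical helper written for this illustration only.

```python
class Tracked:
    eq_calls = 0  # class-wide counter of == evaluations

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        Tracked.eq_calls += 1
        return isinstance(other, Tracked) and self.name == other.name


x = Tracked("x")
twin = Tracked("x")  # equal to x, but a different object

found_by_identity = x in [x]  # identity matches, == is never evaluated
calls_after_identity = Tracked.eq_calls

found_by_equality = twin in [x]  # identity fails, so == is evaluated
calls_after_equality = Tracked.eq_calls
```

With a tensor in place of Tracked, the second membership test is the one that fails, since the result of == there has no single truth value.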

If, instead of a raw tensor, we pass a dict like

d = {
    "value": torch.Tensor(...),
    "path_to_image": "source/path/to/image.png",
}

then, if value is compared first, the same exception is raised again.

So we can't fetch unique objects from a batch just by wrapping them in a dict. Maybe we need a special class like the following to handle such cases.

class ComparableClass:
    def __init__(self, comparison_feature, value):
        self.comparison_feature = comparison_feature  # the only field used for comparison
        self.value = value  # e.g. a torch.Tensor; never compared directly

    def __eq__(self, other):
        return self.comparison_feature == other.comparison_feature

    # defining __eq__ sets __hash__ to None, so we can provide
    # a default hash implementation here as well
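
A usage sketch for such a wrapper, under assumptions: ComparableSample is a hypothetical variant that also defines __hash__ from the comparison feature, and plain lists stand in for tensors so the example runs without torch.

```python
class ComparableSample:
    """Hypothetical wrapper: equality and hash depend only on comparison_feature."""

    def __init__(self, comparison_feature, value):
        self.comparison_feature = comparison_feature
        self.value = value  # payload (a tensor in practice); never compared

    def __eq__(self, other):
        if not isinstance(other, ComparableSample):
            return NotImplemented
        return self.comparison_feature == other.comparison_feature

    def __hash__(self):
        return hash(self.comparison_feature)


samples = [
    ComparableSample("img/1.png", [1, 2, 3]),
    ComparableSample("img/2.png", [1, 2, 2]),
    ComparableSample("img/1.png", [1, 2, 3]),  # duplicate of the first
]

unique = []
for sample in samples:
    if sample not in unique:  # uses __eq__, which ignores the payload
        unique.append(sample)

unique_via_set = set(samples)  # works because __hash__ is defined
```

The trade-off is that every dataset has to wrap its objects and choose a comparison feature, which is roughly the same work as writing a key_extractor_fn.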

The alternative would be to drop the idea of fetching unique objects from a batch.
In that case we can handle complicated objects via a dict and a custom key_extractor_fn and still use the cache successfully.
The drawback is that we need to extract a key from every object in the batch for each encoder, but I don't think that is crucial.

monatis commented on May 18, 2024

Fixed in #34
