
Comments (6)

mberr commented on June 16, 2024

Does the latter only set the initial entity vectors, without jointly training the encoder with the KGE model, while the former performs end-to-end training?

Yes, exactly.
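
To make the distinction concrete, here is a toy illustration in plain Python rather than PyKEEN, so it runs without any dependencies. Everything in it is hypothetical: a one-parameter "encoder" `w * x` stands in for the language model, a squared error stands in for the KGE loss, and the function names are made up for this sketch. It only shows which parameter receives gradient updates in each setup, not how PyKEEN implements either one.

```python
def train_initialized(x, w, target, lr=0.1, steps=50):
    """Initializer-style: encode once, then train only the free embedding."""
    e = w * x  # one-off encoding; from here on the encoder weight w is unused
    for _ in range(steps):
        grad_e = 2 * (e - target)  # d/de of (e - target)^2
        e -= lr * grad_e
    return e, w  # w is returned unchanged


def train_end_to_end(x, w, target, lr=0.1, steps=50):
    """End-to-end style: re-encode every step and update the encoder weight."""
    for _ in range(steps):
        e = w * x
        grad_w = 2 * (e - target) * x  # chain rule through e = w * x
        w -= lr * grad_w
    return w * x, w


e_init, w_init = train_initialized(x=2.0, w=0.5, target=3.0)
e_joint, w_joint = train_end_to_end(x=2.0, w=0.5, target=3.0)
print("initializer only:", e_init, "encoder weight:", w_init)
print("end-to-end:      ", e_joint, "encoder weight:", w_joint)
```

Both variants fit the target, but only the second one changes the encoder weight. In the first, the encoder merely supplies the starting point.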

from pykeen.

lorenzobalzani commented on June 16, 2024

Thanks 🙏🏻


cthoyt commented on June 16, 2024

Can you do the following:

  1. Please copy and paste the text of the stack trace rather than a screenshot.
  2. Can you create a minimal reproducible example? The complexity of the pipeline makes this hard to debug. Similarly, a small built-in dataset would be preferable to asking people to manually download and open datasets.


lorenzobalzani commented on June 16, 2024

Sure!

Stack trace

INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [672253, 121116, 121117]
WARNING:pykeen.utils:No cuda devices were available. The model runs on CPU

AttributeError                            Traceback (most recent call last)
Cell In[47], line 47
     43     return result.model
     46 tf, train_set, val_set, test_set = load_umls_kg("UMLS_KG/triplets.txt", val_ratio = .1, test_ratio = .1)
---> 47 model = train_coder_on_umls(tf, train_set, val_set, test_set, save_model = True)

Cell In[47], line 15, in train_coder_on_umls(tf, train_set, val_set, test_set, CODER_checkpoint, save_model)
     14 def train_coder_on_umls(tf, train_set, val_set, test_set, CODER_checkpoint = "./coder_eng_pp", save_model: bool = True):
---> 15     entity_representations = TextRepresentation.from_triples_factory(
     16         triples_factory=tf, 
     17         encoder="transformer",
     18         encoder_kwargs=dict(pretrained_model_name_or_path=CODER_checkpoint, max_length=512)
     19     )
     21     result = pipeline(
     22         training=train_set,
     23         validation=val_set,
   (...)
     39         ),
     40     )
     41     if (save_model):

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/representation.py:1067, in TextRepresentation.from_triples_factory(cls, triples_factory, for_entities, **kwargs)
   1053 """
   1054 Prepare a text representations with labels from a triples factory.
   1055 
   (...)
   1064     a text representation from the triples factory
   1065 """
   1066 labeling: Labeling = triples_factory.entity_labeling if for_entities else triples_factory.relation_labeling
-> 1067 return cls(labels=labeling.all_labels(), **kwargs)

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/representation.py:1040, in TextRepresentation.__init__(self, labels, max_id, shape, encoder, encoder_kwargs, missing_action, **kwargs)
   1038 labels = _clean_labels(labels, missing_action)
   1039 # infer shape
-> 1040 shape = ShapeError.verify(shape=encoder.encode_all(labels[0:1]).shape[1:], reference=shape)
   1041 super().__init__(max_id=max_id, shape=shape, **kwargs)
   1042 self.labels = list(labels)

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/text.py:113, in TextEncoder.encode_all(self, labels, batch_size)
     92 @torch.inference_mode()
     93 def encode_all(
     94     self,
     95     labels: Sequence[str],
     96     batch_size: Optional[int] = None,
     97 ) -> torch.FloatTensor:
     98     """Encode all labels (inference mode & batched).
     99 
    100     :param labels:
   (...)
    111         a tensor representing the encodings for all labels
    112     """
--> 113     return _encode_all_memory_utilization_optimized(
    114         encoder=self, labels=labels, batch_size=batch_size or len(labels)
    115     ).detach()

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/torch_max_mem/api.py:511, in MemoryUtilizationMaximizer.__call__.<locals>.inner(*args, **kwargs)
    509     values = tuple(bound.arguments[name] for name in self.parameter_names)
    510 kwargs.update(zip(self.parameter_names, values))
--> 511 result, self.parameter_value[h] = wrapped(*args, **kwargs)
    512 return result

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/torch_max_mem/api.py:368, in maximize_memory_utilization_decorator.<locals>.decorator_maximize_memory_utilization.<locals>.wrapper_maximize_memory_utilization(*args, **kwargs)
    366 bound_arguments.arguments.update(p_kwargs)
    367 try:
--> 368     return func(*bound_arguments.args, **bound_arguments.kwargs), tuple(
    369         max_values
    370     )
    371 except (torch.cuda.OutOfMemoryError, RuntimeError) as error:
    372     # raise errors unrelated to out-of-memory
    373     if not is_oom_error(error):

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/text.py:57, in _encode_all_memory_utilization_optimized(encoder, labels, batch_size)
     35 @memory_utilization_maximizer
     36 def _encode_all_memory_utilization_optimized(
     37     encoder: "TextEncoder",
     38     labels: Sequence[str],
     39     batch_size: int,
     40 ) -> torch.Tensor:
     41     """
     42     Encode all labels with the given batch-size.
     43 
   (...)
     54         the encoded labels
     55     """
     56     return torch.cat(
---> 57         [encoder(batch) for batch in chunked(tqdm(map(str, labels), leave=False), batch_size)],
     58         dim=0,
     59     )

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/text.py:57, in <listcomp>(.0)
     35 @memory_utilization_maximizer
     36 def _encode_all_memory_utilization_optimized(
     37     encoder: "TextEncoder",
     38     labels: Sequence[str],
     39     batch_size: int,
     40 ) -> torch.Tensor:
     41     """
     42     Encode all labels with the given batch-size.
     43 
   (...)
     54         the encoded labels
     55     """
     56     return torch.cat(
---> 57         [encoder(batch) for batch in chunked(tqdm(map(str, labels), leave=False), batch_size)],
     58         dim=0,
     59     )

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/text.py:77, in TextEncoder.forward(self, labels)
     75 labels = upgrade_to_sequence(labels)
     76 labels = list(map(str, labels))
---> 77 return self.forward_normalized(texts=labels)

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/text.py:222, in TransformerTextEncoder.forward_normalized(self, texts)
    221 def forward_normalized(self, texts: Sequence[str]) -> torch.FloatTensor:  # noqa: D102
--> 222     return self.model(
    223         **self.tokenizer(
    224             texts,
    225             return_tensors="pt",
    226             padding=True,
    227             truncation=True,
    228             max_length=self.max_length,
    229         ).to(get_preferred_device(self.model))
    230     ).pooler_output

AttributeError: 'tuple' object has no attribute 'pooler_output'

Code example

from pykeen.datasets import get_dataset
from pykeen.nn import TextRepresentation

dataset = get_dataset(dataset="umls")
entity_representations = TextRepresentation.from_dataset(
    dataset=dataset,
    encoder="transformer",
    encoder_kwargs=dict(pretrained_model_name_or_path="./coder_eng_pp", max_length=512)
)


mberr commented on June 16, 2024

Hi @lorenzobalzani ,

I think the issue is that TransformerTextEncoder assumes a BERT-style model, but not all Hugging Face models share the same API. You can add your own adjusted encoder by subclassing TransformerTextEncoder and providing a custom forward_normalized suited to your choice of model:

from typing import Sequence

import torch
from pykeen.datasets import get_dataset
from pykeen.nn import TextRepresentation
from pykeen.nn.text import TransformerTextEncoder


class MyEncoder(TransformerTextEncoder):
    def forward_normalized(self, texts: Sequence[str]) -> torch.FloatTensor:
        # todo: do something to encode the texts into vectors
        raise NotImplementedError


dataset = get_dataset(dataset="umls")
entity_representations = TextRepresentation.from_dataset(
    dataset=dataset,
    encoder=MyEncoder,
    encoder_kwargs=dict(pretrained_model_name_or_path="./coder_eng_pp", max_length=512)
)

I took a quick look at https://huggingface.co/GanjinZero/coder_eng_pp and the referenced repo at https://github.com/GanjinZero/CODER, but neither of them contains example code for obtaining text encodings, so you may need to look into that.
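
As a rough sketch of what such a forward_normalized may have to deal with: the AttributeError above means the model call returned a plain tuple instead of an output object. The helper below (the name `pooled_from_output` is hypothetical) assumes the BERT-style tuple ordering (last_hidden_state, pooler_output, ...), which is what Hugging Face BERT models return when called with return_dict=False. Whether that ordering holds for the CODER checkpoint has not been verified here. Strings stand in for tensors so the snippet runs without any dependencies.

```python
from types import SimpleNamespace


def pooled_from_output(output):
    """Return the pooled embedding from an object-style or tuple-style model output."""
    pooled = getattr(output, "pooler_output", None)
    if pooled is not None:
        return pooled
    # Assumption: tuple outputs follow BERT ordering, with the pooled vector at index 1.
    if isinstance(output, tuple) and len(output) > 1:
        return output[1]
    raise ValueError("model output has no pooled embedding in a known position")


# Stand-ins for real model outputs (strings instead of tensors).
object_style = SimpleNamespace(last_hidden_state="hidden", pooler_output="pooled")
tuple_style = ("hidden", "pooled")

print(pooled_from_output(object_style))
print(pooled_from_output(tuple_style))
```

Inside a subclass, a helper like this would wrap the result of self.model(...) in place of the direct .pooler_output access.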


lorenzobalzani commented on June 16, 2024

It's been quite a while since my last response, @mberr. In the meantime, I rearranged some things, and my final result looks like the following:

from typing import List

import torch
from pykeen.datasets import get_dataset
from pykeen.models import ERModel
from pykeen.nn import TextRepresentation
from pykeen.nn.text import TransformerTextEncoder
from sentence_transformers import SentenceTransformer

class SentenceTransformerEncoder(TransformerTextEncoder):
    def __init__(self, encoder_model_name_or_path: str, device: str) -> None:
        super().__init__()
        self._model = SentenceTransformer(encoder_model_name_or_path, device=device)

    def forward_normalized(self, texts: List[str]) -> torch.FloatTensor:
        encoder_output = self._model.encode(texts, convert_to_tensor=True)
        # L2-normalize each embedding (row) separately; matrix_norm(ord=2) would
        # instead divide the whole batch by a single spectral norm
        return torch.nn.functional.normalize(encoder_output, p=2, dim=-1, eps=1e-12)

dataset = get_dataset(dataset="umls")
random_seed: int = 42
entity_representations = TextRepresentation.from_dataset(
    dataset=dataset,
    encoder=SentenceTransformerEncoder,
    encoder_kwargs=dict(
        encoder_model_name_or_path="GanjinZero/coder_eng_pp",
        device="cuda:0",
    ),
)
pykeen_model = ERModel(
    # ERModel takes a triples factory rather than a dataset
    triples_factory=dataset.training,
    interaction="ermlpe",
    interaction_kwargs=dict(
        embedding_dim=entity_representations.shape[0],
    ),
    entity_representations=entity_representations,
    relation_representations_kwargs=dict(
        shape=entity_representations.shape,
    ),
    random_seed=random_seed,
)

Now, is there really a tangible difference between the following two approaches?

  • Using TextRepresentation, as mentioned above.
  • Specifying only the embedding_dim and the entity_initializer parameters in the model_kwargs dictionary.

from pykeen.nn.init import PretrainedInitializer
from pykeen.pipeline import pipeline

entity_labels: List[str] = [...]
model = SentenceTransformer("GanjinZero/coder_eng_pp")
embeddings = model.encode(entity_labels, convert_to_tensor=True, device="cuda:0")

result_lm = pipeline(
    dataset=dataset,
    model='DistMult',
    stopper='early',
    epochs=10,
    model_kwargs=dict(
        embedding_dim=embeddings.shape[-1],
        entity_initializer=PretrainedInitializer(tensor=embeddings),
    ),
    device="cuda:0",
)

Does the latter only set the initial entity vectors, without jointly training the encoder with the KGE model, while the former performs end-to-end training?
