Describe the bug I am trying to produce embeddings for UMLS - and

Usage of a custom encoder about pykeen HOT 6 CLOSED

lorenzobalzani commented on June 16, 2024

Usage of a custom encoder


Comments (6)

mberr commented on June 16, 2024 1

Does the latter only specify the initial entity vectors but does not jointly train the encoder with the KGE model, while the former does indeed perform an end-to-end training?

Yes, exactly.
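The distinction confirmed above can be illustrated with a toy PyTorch sketch. It is not PyKEEN code: `encoder` is a stand-in linear layer, `label_features` is made-up data playing the role of tokenized labels, and `table` mimics an embedding initialized from precomputed vectors. The sketch only shows where gradients flow in each setup.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for a text encoder (assumption: any differentiable module would do).
encoder = nn.Linear(4, 3)
label_features = torch.randn(5, 4)  # 5 entities with made-up input features
ids = torch.tensor([0, 2])

# Pretrained-initializer style: encode once, then copy the vectors into a
# free-standing embedding table. The encoder is not part of the training graph.
with torch.no_grad():
    initial = encoder(label_features)
table = nn.Embedding.from_pretrained(initial.clone(), freeze=False)
table(ids).sum().backward()
encoder_grad_after_table_step = encoder.weight.grad  # still None
table_grad = table.weight.grad  # the table itself receives gradients

# Text-representation style: the encoder is called inside the forward pass,
# so the loss backpropagates into the encoder's parameters.
encoder(label_features[ids]).sum().backward()
encoder_grad_after_joint_step = encoder.weight.grad
```

In the first setup only the copied vectors are updated during training; in the second, the encoder's own weights receive gradients.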


lorenzobalzani commented on June 16, 2024 1

Thanks 🙏🏻


cthoyt commented on June 16, 2024

Can you do the following:

- Please copy and paste the text of the stack trace, not a screenshot.
- Can you create a minimum viable example? The complexity of the pipeline doesn't make this easy to debug. Similarly, using a small built-in dataset would be preferable to asking people to manually download and open datasets.


lorenzobalzani commented on June 16, 2024

Sure!

Stack trace

INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [672253, 121116, 121117]
WARNING:pykeen.utils:No cuda devices were available. The model runs on CPU

AttributeError                            Traceback (most recent call last)
Cell In[47], line 47
     43     return result.model
     46 tf, train_set, val_set, test_set = load_umls_kg("UMLS_KG/triplets.txt", val_ratio = .1, test_ratio = .1)
---> 47 model = train_coder_on_umls(tf, train_set, val_set, test_set, save_model = True)

Cell In[47], line 15, in train_coder_on_umls(tf, train_set, val_set, test_set, CODER_checkpoint, save_model)
     14 def train_coder_on_umls(tf, train_set, val_set, test_set, CODER_checkpoint = "./coder_eng_pp", save_model: bool = True):
---> 15     entity_representations = TextRepresentation.from_triples_factory(
     16         triples_factory=tf, 
     17         encoder="transformer",
     18         encoder_kwargs=dict(pretrained_model_name_or_path=CODER_checkpoint, max_length=512)
     19     )
     21     result = pipeline(
     22         training=train_set,
     23         validation=val_set,
   (...)
     39         ),
     40     )
     41     if (save_model):

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/representation.py:1067, in TextRepresentation.from_triples_factory(cls, triples_factory, for_entities, **kwargs)
   1053 """
   1054 Prepare a text representations with labels from a triples factory.
   1055 
   (...)
   1064     a text representation from the triples factory
   1065 """
   1066 labeling: Labeling = triples_factory.entity_labeling if for_entities else triples_factory.relation_labeling
-> 1067 return cls(labels=labeling.all_labels(), **kwargs)

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/representation.py:1040, in TextRepresentation.__init__(self, labels, max_id, shape, encoder, encoder_kwargs, missing_action, **kwargs)
   1038 labels = _clean_labels(labels, missing_action)
   1039 # infer shape
-> 1040 shape = ShapeError.verify(shape=encoder.encode_all(labels[0:1]).shape[1:], reference=shape)
   1041 super().__init__(max_id=max_id, shape=shape, **kwargs)
   1042 self.labels = list(labels)

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/text.py:113, in TextEncoder.encode_all(self, labels, batch_size)
     92 @torch.inference_mode()
     93 def encode_all(
     94     self,
     95     labels: Sequence[str],
     96     batch_size: Optional[int] = None,
     97 ) -> torch.FloatTensor:
     98     """Encode all labels (inference mode & batched).
     99 
    100     :param labels:
   (...)
    111         a tensor representing the encodings for all labels
    112     """
--> 113     return _encode_all_memory_utilization_optimized(
    114         encoder=self, labels=labels, batch_size=batch_size or len(labels)
    115     ).detach()

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/torch_max_mem/api.py:511, in MemoryUtilizationMaximizer.__call__.<locals>.inner(*args, **kwargs)
    509     values = tuple(bound.arguments[name] for name in self.parameter_names)
    510 kwargs.update(zip(self.parameter_names, values))
--> 511 result, self.parameter_value[h] = wrapped(*args, **kwargs)
    512 return result

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/torch_max_mem/api.py:368, in maximize_memory_utilization_decorator.<locals>.decorator_maximize_memory_utilization.<locals>.wrapper_maximize_memory_utilization(*args, **kwargs)
    366 bound_arguments.arguments.update(p_kwargs)
    367 try:
--> 368     return func(*bound_arguments.args, **bound_arguments.kwargs), tuple(
    369         max_values
    370     )
    371 except (torch.cuda.OutOfMemoryError, RuntimeError) as error:
    372     # raise errors unrelated to out-of-memory
    373     if not is_oom_error(error):

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/text.py:57, in _encode_all_memory_utilization_optimized(encoder, labels, batch_size)
     35 @memory_utilization_maximizer
     36 def _encode_all_memory_utilization_optimized(
     37     encoder: "TextEncoder",
     38     labels: Sequence[str],
     39     batch_size: int,
     40 ) -> torch.Tensor:
     41     """
     42     Encode all labels with the given batch-size.
     43 
   (...)
     54         the encoded labels
     55     """
     56     return torch.cat(
---> 57         [encoder(batch) for batch in chunked(tqdm(map(str, labels), leave=False), batch_size)],
     58         dim=0,
     59     )

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/text.py:57, in <listcomp>(.0)
     35 @memory_utilization_maximizer
     36 def _encode_all_memory_utilization_optimized(
     37     encoder: "TextEncoder",
     38     labels: Sequence[str],
     39     batch_size: int,
     40 ) -> torch.Tensor:
     41     """
     42     Encode all labels with the given batch-size.
     43 
   (...)
     54         the encoded labels
     55     """
     56     return torch.cat(
---> 57         [encoder(batch) for batch in chunked(tqdm(map(str, labels), leave=False), batch_size)],
     58         dim=0,
     59     )

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/text.py:77, in TextEncoder.forward(self, labels)
     75 labels = upgrade_to_sequence(labels)
     76 labels = list(map(str, labels))
---> 77 return self.forward_normalized(texts=labels)

File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/text.py:222, in TransformerTextEncoder.forward_normalized(self, texts)
    221 def forward_normalized(self, texts: Sequence[str]) -> torch.FloatTensor:  # noqa: D102
--> 222     return self.model(
    223         **self.tokenizer(
    224             texts,
    225             return_tensors="pt",
    226             padding=True,
    227             truncation=True,
    228             max_length=self.max_length,
    229         ).to(get_preferred_device(self.model))
    230     ).pooler_output

AttributeError: 'tuple' object has no attribute 'pooler_output'

Code example

from pykeen.datasets import get_dataset
from pykeen.nn import TextRepresentation

dataset = get_dataset(dataset="umls")
entity_representations = TextRepresentation.from_dataset(
    dataset=dataset,
    encoder="transformer",
    encoder_kwargs=dict(pretrained_model_name_or_path="./coder_eng_pp", max_length=512)
)


mberr commented on June 16, 2024

Hi @lorenzobalzani ,

I think the issue is that TransformerTextEncoder assumes a BERT-style model, but not all models from Hugging Face share the same API. You can add your own adjusted encoder by creating a subclass of TransformerTextEncoder with a custom forward_normalized suited to your choice of model:

from typing import Sequence

import torch
from pykeen.datasets import get_dataset
from pykeen.nn import TextRepresentation
from pykeen.nn.text import TransformerTextEncoder


class MyEncoder(TransformerTextEncoder):
    def forward_normalized(self, texts: Sequence[str]) -> torch.FloatTensor:
        # todo: encode the texts into a tensor with one row per text
        raise NotImplementedError


dataset = get_dataset(dataset="umls")
entity_representations = TextRepresentation.from_dataset(
    dataset=dataset,
    encoder=MyEncoder,
    encoder_kwargs=dict(pretrained_model_name_or_path="./coder_eng_pp", max_length=512)
)

I took a quick look at https://huggingface.co/GanjinZero/coder_eng_pp and the referenced repo at https://github.com/GanjinZero/CODER, but neither of them contains example code for obtaining text encodings, so you may need to look into that.
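The AttributeError in the stack trace suggests that the model returned a plain tuple rather than an output object with named fields. A minimal sketch of a helper that tolerates both shapes is given below. `select_pooled` is a hypothetical name, and the assumption that index 1 of the tuple holds the pooled vector is only known to hold for BERT-style models called with `return_dict=False`, so it should be checked against the actual checkpoint before use.

```python
from types import SimpleNamespace
from typing import Any


def select_pooled(output: Any) -> Any:
    """Pick the pooled sentence vector from a model output (hypothetical helper)."""
    # Output objects with named fields expose the pooled vector directly.
    pooled = getattr(output, "pooler_output", None)
    if pooled is not None:
        return pooled
    # Assumption: tuple outputs follow the BERT layout (last_hidden_state, pooler_output, ...).
    if isinstance(output, tuple) and len(output) > 1:
        return output[1]
    raise TypeError(f"cannot find a pooled output in {type(output).__name__}")


# Stand-ins for real model outputs; inside forward_normalized these would hold tensors.
named_output = SimpleNamespace(last_hidden_state="hidden", pooler_output="pooled")
tuple_output = ("hidden", "pooled")
```

Inside a custom `forward_normalized`, such a helper would wrap the call to `self.model(...)` in place of the direct `.pooler_output` access.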


lorenzobalzani commented on June 16, 2024

It's been quite a while since my last response, @mberr. In the meantime, I rearranged some things, and my final result looks like the following:

from typing import List

import torch
from pykeen.datasets import get_dataset
from pykeen.models import ERModel
from pykeen.nn import TextRepresentation
from pykeen.nn.text import TransformerTextEncoder
from sentence_transformers import SentenceTransformer


class SentenceTransformerEncoder(TransformerTextEncoder):
    def __init__(self, encoder_model_name_or_path: str, device: str) -> None:
        super().__init__()
        self._model = SentenceTransformer(encoder_model_name_or_path, device=device)

    def forward_normalized(self, texts: List[str]) -> torch.FloatTensor:
        encoder_output = self._model.encode(texts, convert_to_tensor=True)
        # Note: matrix_norm(ord=2) is the spectral norm of the whole batch matrix,
        # so this rescales the batch by one scalar rather than normalizing each row.
        return (
            encoder_output
            / torch.linalg.matrix_norm(encoder_output, ord=2, keepdim=True)
            .clamp(min=1e-12)
            .contiguous()
        )


dataset = get_dataset(dataset="umls")
random_seed: int = 42
entity_representations = TextRepresentation.from_dataset(
    dataset=dataset,
    encoder=SentenceTransformerEncoder,
    encoder_kwargs=dict(
        encoder_model_name_or_path="GanjinZero/coder_eng_pp",
        device="cuda:0",
    ),
)
pykeen_model = ERModel(
    # ERModel expects a triples factory, not a dataset
    triples_factory=dataset.training,
    interaction="ermlpe",
    interaction_kwargs=dict(
        embedding_dim=entity_representations.shape[0],
    ),
    entity_representations=entity_representations,
    relation_representations_kwargs=dict(
        shape=entity_representations.shape,
    ),
    random_seed=random_seed,
)

Now, is there really a tangible difference between the following two approaches?

1. Using TextRepresentation, as shown above.
2. Specifying only the embedding_dim and the entity_initializer parameters in the model_kwargs dictionary.

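from typing import List

from pykeen.nn.init import PretrainedInitializer
from pykeen.pipeline import pipeline
from sentence_transformers import SentenceTransformer

# one label per entity; row i of the embeddings is used for entity ID i
entity_labels: List[str] = [...]
model = SentenceTransformer("GanjinZero/coder_eng_pp")
embeddings = model.encode(entity_labels, convert_to_tensor=True, device="cuda:0")

result_lm = pipeline(
    dataset=dataset,
    model='DistMult',
    stopper='early',
    epochs=10,
    model_kwargs=dict(
        embedding_dim=embeddings.shape[-1],
        entity_initializer=PretrainedInitializer(tensor=embeddings),
    ),
    device="cuda:0",
)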

Does the latter only specify the initial entity vectors but does not jointly train the encoder with the KGE model, while the former does indeed perform an end-to-end training?
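One detail of the pretrained-initializer approach is that the rows of the embedding tensor are matched to entities by integer ID, so the labels need to be encoded in ID order. A minimal sketch of building such a list follows; the dictionary here is made-up data shaped like a triples factory's `entity_to_id` mapping, which for a PyKEEN dataset would presumably be read from `dataset.training.entity_to_id`.

```python
# Made-up label-to-ID mapping, shaped like a triples factory's `entity_to_id`.
entity_to_id = {"virus": 2, "bacterium": 0, "enzyme": 1}

# Sort labels by their integer ID so that row i of the encoded matrix belongs to entity i.
entity_labels = [
    label for label, _ in sorted(entity_to_id.items(), key=lambda item: item[1])
]
```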
