Comments (6)
Does the latter only specify the initial entity vectors but does not jointly train the encoder with the KGE model, while the former does indeed perform an end-to-end training?
Yes, exactly.
from pykeen.
Thanks 🙏🏻
Can you do the following:
- Please copy paste the text of the stack trace, not a screenshot
- Can you create a minimum viable example? The complexity of the pipeline doesn't make this easy to debug. Similarly, using a small built in dataset would be preferable to asking people to manually download and open datasets
Sure!
Stack trace
INFO:pykeen.triples.splitting:done splitting triples to groups of sizes [672253, 121116, 121117]
WARNING:pykeen.utils:No cuda devices were available. The model runs on CPU
AttributeError Traceback (most recent call last)
Cell In[47], line 47
43 return result.model
46 tf, train_set, val_set, test_set = load_umls_kg("UMLS_KG/triplets.txt", val_ratio = .1, test_ratio = .1)
---> 47 model = train_coder_on_umls(tf, train_set, val_set, test_set, save_model = True)
Cell In[47], line 15, in train_coder_on_umls(tf, train_set, val_set, test_set, CODER_checkpoint, save_model)
14 def train_coder_on_umls(tf, train_set, val_set, test_set, CODER_checkpoint = "./coder_eng_pp", save_model: bool = True):
---> 15 entity_representations = TextRepresentation.from_triples_factory(
16 triples_factory=tf,
17 encoder="transformer",
18 encoder_kwargs=dict(pretrained_model_name_or_path=CODER_checkpoint, max_length=512)
19 )
21 result = pipeline(
22 training=train_set,
23 validation=val_set,
(...)
39 ),
40 )
41 if (save_model):
File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/representation.py:1067, in TextRepresentation.from_triples_factory(cls, triples_factory, for_entities, **kwargs)
1053 """
1054 Prepare a text representations with labels from a triples factory.
1055
(...)
1064 a text representation from the triples factory
1065 """
1066 labeling: Labeling = triples_factory.entity_labeling if for_entities else triples_factory.relation_labeling
-> 1067 return cls(labels=labeling.all_labels(), **kwargs)
File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/representation.py:1040, in TextRepresentation.__init__(self, labels, max_id, shape, encoder, encoder_kwargs, missing_action, **kwargs)
1038 labels = _clean_labels(labels, missing_action)
1039 # infer shape
-> 1040 shape = ShapeError.verify(shape=encoder.encode_all(labels[0:1]).shape[1:], reference=shape)
1041 super().__init__(max_id=max_id, shape=shape, **kwargs)
1042 self.labels = list(labels)
File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/text.py:113, in TextEncoder.encode_all(self, labels, batch_size)
92 @torch.inference_mode()
93 def encode_all(
94 self,
95 labels: Sequence[str],
96 batch_size: Optional[int] = None,
97 ) -> torch.FloatTensor:
98 """Encode all labels (inference mode & batched).
99
100 :param labels:
(...)
111 a tensor representing the encodings for all labels
112 """
--> 113 return _encode_all_memory_utilization_optimized(
114 encoder=self, labels=labels, batch_size=batch_size or len(labels)
115 ).detach()
File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/torch_max_mem/api.py:511, in MemoryUtilizationMaximizer.__call__.<locals>.inner(*args, **kwargs)
509 values = tuple(bound.arguments[name] for name in self.parameter_names)
510 kwargs.update(zip(self.parameter_names, values))
--> 511 result, self.parameter_value[h] = wrapped(*args, **kwargs)
512 return result
File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/torch_max_mem/api.py:368, in maximize_memory_utilization_decorator.<locals>.decorator_maximize_memory_utilization.<locals>.wrapper_maximize_memory_utilization(*args, **kwargs)
366 bound_arguments.arguments.update(p_kwargs)
367 try:
--> 368 return func(*bound_arguments.args, **bound_arguments.kwargs), tuple(
369 max_values
370 )
371 except (torch.cuda.OutOfMemoryError, RuntimeError) as error:
372 # raise errors unrelated to out-of-memory
373 if not is_oom_error(error):
File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/text.py:57, in _encode_all_memory_utilization_optimized(encoder, labels, batch_size)
35 @memory_utilization_maximizer
36 def _encode_all_memory_utilization_optimized(
37 encoder: "TextEncoder",
38 labels: Sequence[str],
39 batch_size: int,
40 ) -> torch.Tensor:
41 """
42 Encode all labels with the given batch-size.
43
(...)
54 the encoded labels
55 """
56 return torch.cat(
---> 57 [encoder(batch) for batch in chunked(tqdm(map(str, labels), leave=False), batch_size)],
58 dim=0,
59 )
File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/text.py:57, in <listcomp>(.0)
35 @memory_utilization_maximizer
36 def _encode_all_memory_utilization_optimized(
37 encoder: "TextEncoder",
38 labels: Sequence[str],
39 batch_size: int,
40 ) -> torch.Tensor:
41 """
42 Encode all labels with the given batch-size.
43
(...)
54 the encoded labels
55 """
56 return torch.cat(
---> 57 [encoder(batch) for batch in chunked(tqdm(map(str, labels), leave=False), batch_size)],
58 dim=0,
59 )
File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/text.py:77, in TextEncoder.forward(self, labels)
75 labels = upgrade_to_sequence(labels)
76 labels = list(map(str, labels))
---> 77 return self.forward_normalized(texts=labels)
File ~/anaconda3/envs/prometeia/lib/python3.9/site-packages/pykeen/nn/text.py:222, in TransformerTextEncoder.forward_normalized(self, texts)
221 def forward_normalized(self, texts: Sequence[str]) -> torch.FloatTensor: # noqa: D102
--> 222 return self.model(
223 **self.tokenizer(
224 texts,
225 return_tensors="pt",
226 padding=True,
227 truncation=True,
228 max_length=self.max_length,
229 ).to(get_preferred_device(self.model))
230 ).pooler_output
AttributeError: 'tuple' object has no attribute 'pooler_output'
Code example
from pykeen.datasets import get_dataset
from pykeen.nn import TextRepresentation

dataset = get_dataset(dataset="umls")
entity_representations = TextRepresentation.from_dataset(
    dataset=dataset,
    encoder="transformer",
    encoder_kwargs=dict(pretrained_model_name_or_path="./coder_eng_pp", max_length=512),
)
Hi @lorenzobalzani,
I think the issue is that `TransformerTextEncoder` assumes the model is BERT-style, but not all models from Hugging Face share the same API. You can add your own adjusted encoder by creating a subclass of `TransformerTextEncoder` with a custom `forward_normalized` suited to your choice of model:
from typing import Sequence

import torch
from pykeen.datasets import get_dataset
from pykeen.nn import TextRepresentation
from pykeen.nn.text import TransformerTextEncoder


class MyEncoder(TransformerTextEncoder):
    def forward_normalized(self, texts: Sequence[str]) -> torch.FloatTensor:
        # todo: do something to encode the texts into vectors
        raise NotImplementedError


dataset = get_dataset(dataset="umls")
entity_representations = TextRepresentation.from_dataset(
    dataset=dataset,
    encoder=MyEncoder,
    encoder_kwargs=dict(pretrained_model_name_or_path="./coder_eng_pp", max_length=512),
)
I took a quick look at https://huggingface.co/GanjinZero/coder_eng_pp and the referenced repo at https://github.com/GanjinZero/CODER, but neither of them contains example code for obtaining text encodings, so you may need to look into that.
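The `AttributeError` in the stack trace shows that the model call returned a plain tuple rather than an output object with a `pooler_output` attribute. One possible cause, which is an assumption and not confirmed in this thread, is a checkpoint configured to return tuples. A minimal sketch of a helper that a custom `forward_normalized` could apply to the model output follows. The name `extract_pooled` is hypothetical and not part of PyKEEN, and the tuple branch assumes the BERT-style ordering `(last_hidden_state, pooler_output, ...)`, which may not hold for every model.

```python
from types import SimpleNamespace
from typing import Any


def extract_pooled(output: Any) -> Any:
    """Return the pooled vector from a model output (hypothetical helper)."""
    # Output objects (e.g. BERT-style model outputs) expose `pooler_output`.
    pooled = getattr(output, "pooler_output", None)
    if pooled is not None:
        return pooled
    # Assumption: tuple outputs follow (last_hidden_state, pooler_output, ...).
    if isinstance(output, tuple) and len(output) >= 2:
        return output[1]
    raise TypeError(f"Cannot find a pooled output in {type(output).__name__}")


# Stand-ins for real model outputs, only to show both branches.
object_style = SimpleNamespace(last_hidden_state="hidden", pooler_output="pooled")
tuple_style = ("hidden", "pooled")
print(extract_pooled(object_style), extract_pooled(tuple_style))
```

If the model has no pooling head at all, a different strategy such as taking the first token or mean pooling over the last hidden state would be needed instead.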
It's been quite a while since my last response, @mberr. In the meantime, I rearranged some things, and my final result looks like the following:
from typing import List

import torch
from pykeen.datasets import get_dataset
from pykeen.models import ERModel
from pykeen.nn import TextRepresentation
from pykeen.nn.text import TransformerTextEncoder
from sentence_transformers import SentenceTransformer


class SentenceTransformerEncoder(TransformerTextEncoder):
    def __init__(self, encoder_model_name_or_path: str, device: str) -> None:
        super().__init__()
        self._model = SentenceTransformer(encoder_model_name_or_path, device=device)

    def forward_normalized(self, texts: List[str]) -> torch.FloatTensor:
        encoder_output = self._model.encode(texts, convert_to_tensor=True)
        return (
            encoder_output
            / torch.linalg.matrix_norm(encoder_output, ord=2, keepdim=True)
            .clamp(min=1e-12)
            .contiguous()
        )


dataset = get_dataset(dataset="umls")
random_seed: int = 42
entity_representations = TextRepresentation.from_dataset(
    dataset=dataset,
    encoder=SentenceTransformerEncoder,
    encoder_kwargs=dict(
        encoder_model_name_or_path="GanjinZero/coder_eng_pp",
        device="cuda:0",
    ),
)
pykeen_model = ERModel(
    # ERModel takes a triples factory rather than a dataset
    triples_factory=dataset.training,
    interaction="ermlpe",
    interaction_kwargs=dict(
        embedding_dim=entity_representations.shape[0],
    ),
    entity_representations=entity_representations,
    relation_representations_kwargs=dict(
        shape=entity_representations.shape,
    ),
    random_seed=random_seed,
)
Now, is there really a tangible difference between the following two approaches?
- Using `TextRepresentation`, as mentioned above.
- Specifying only the `embedding_dim` and the `entity_initializer` parameters in the `model_kwargs` dictionary:
from pykeen.nn.init import PretrainedInitializer
from pykeen.pipeline import pipeline

entity_labels: List[str] = [...]
model = SentenceTransformer("GanjinZero/coder_eng_pp")
embeddings = model.encode(entity_labels, convert_to_tensor=True, device="cuda:0")
result_lm = pipeline(
    dataset=dataset,
    model="DistMult",
    stopper="early",
    epochs=10,
    model_kwargs=dict(
        embedding_dim=embeddings.shape[-1],
        entity_initializer=PretrainedInitializer(tensor=embeddings),
    ),
    device="cuda:0",
)
Does the latter only specify the initial entity vectors but does not jointly train the encoder with the KGE model, while the former does indeed perform an end-to-end training?
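The reply at the top of the thread confirms this reading. The distinction can be illustrated with a small sketch in plain PyTorch rather than PyKEEN. Everything here is a stand-in chosen for illustration: the `nn.Linear` layer plays the role of the text encoder, `features` plays the role of the tokenized labels, and the squared-sum loss replaces a real KGE loss.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny "encoder" and fixed inputs for 5 entities.
encoder = nn.Linear(8, 4)
features = torch.randn(5, 8)
encoder_before = encoder.weight.detach().clone()

# Option A: encode once, then use the vectors only to initialize a free
# embedding table. Training updates the table; the encoder is never touched.
with torch.no_grad():
    pretrained = encoder(features)
table = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)
optimizer = torch.optim.SGD(table.parameters(), lr=0.1)
table(torch.arange(5)).pow(2).sum().backward()
optimizer.step()
table_changed = not torch.equal(table.weight, pretrained)
encoder_unchanged_a = torch.equal(encoder.weight, encoder_before)

# Option B: compute the vectors through the encoder at every step, so the
# gradient of the loss reaches the encoder parameters (end-to-end training).
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.1)
optimizer.zero_grad()
encoder(features).pow(2).sum().backward()
optimizer.step()
encoder_changed_b = not torch.equal(encoder.weight, encoder_before)

print(table_changed, encoder_unchanged_a, encoder_changed_b)
```

In option A the entity vectors drift away from what the encoder would produce, while in option B they always remain a function of the labels, at the cost of running the encoder during training.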