I downloaded the 36 GB gzipped file (wikidata_translation_v1.tsv.gz) containing the pr

Using pre-trained WikiData embeddings for nearest neighbor search (pytorch-biggraph issue, closed)

facebookresearch commented on August 18, 2024

Using pre-trained WikiData embeddings for nearest neighbor search

from pytorch-biggraph.

Comments (7)

lw commented on August 18, 2024

Most of what you are asking is explained in the documentation so, rather than copy-pasting it here, I suggest you read it there and let me know if anything is unclear.

The one thing that I believe we did not explain in the doc is what we mean by "nearest neighbor" search. You are right in saying that, to properly compute how "close" (similar) two entities are, one should apply the proper operators and take the dot product. However, it turns out that once the embeddings are fully trained, their distance in L2 space already captures some semantic similarity and can thus be used to get a rough sense of the neighbors. This is an approximation we made in that example and should have explained better.

If you want a more exact search, there are a few options. I believe you can tell FAISS to use the dot product instead of the L2 distance, although not all indices support it. You cannot tell FAISS to apply the operator for you, but you can apply the operator to your query yourself before searching in the untransformed embeddings (if you use "standard" relations, this only allows you to query the nearest left-hand side neighbors of a right-hand side entity; if you use dynamic relations you can do it on either side). If for some reason this doesn't work for you, you can drop FAISS entirely and do a slower but more correct evaluation of the scores between an entity and all the other ones, similar to what is done when ranking.
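To make the difference between the two kinds of search concrete, here is a minimal NumPy sketch of the brute-force variant, without FAISS. It assumes a translation operator (a vector added to the right-hand side embedding) and a dot-product comparator. The function names and the random toy data are made up for illustration and are not part of PBG.

```python
import numpy as np

def l2_neighbors(embeddings, query_idx, k):
    """Rough neighbors: smallest squared L2 distance between raw embeddings."""
    diffs = embeddings - embeddings[query_idx]
    dists = np.einsum("ij,ij->i", diffs, diffs)
    dists[query_idx] = np.inf  # exclude the query itself
    return np.argsort(dists)[:k]

def lhs_neighbors_for_rhs(embeddings, rhs_idx, translation, k):
    """Rank left-hand side candidates for one right-hand side entity.

    Assumption: the relation's operator is a translation applied to the
    right-hand side embedding, and scores are plain dot products.
    """
    transformed = embeddings[rhs_idx] + translation  # apply the operator to the query
    scores = embeddings @ transformed                # score against untransformed embeddings
    return np.argsort(-scores)[:k]                   # highest score first

# Toy stand-ins for real embeddings and for one relation's parameters.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(1000, 16)).astype(np.float32)
toy_translation = rng.normal(size=16).astype(np.float32)

near = l2_neighbors(toy_embeddings, 0, 5)
ranked = lhs_neighbors_for_rhs(toy_embeddings, 0, toy_translation, 5)
```

The two functions can return quite different lists. The first ignores the relation entirely, while the second answers "which entities are likely left-hand sides for this entity under this relation".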

As mentioned in the page about the Wikidata embeddings, the TSV is in almost the same format as the output of the export_to_tsv command, which is explained in the readme. The parameters of the operator of each relation type are at the end of the TSV file. They are not pre-applied to the embeddings (this would be impossible if there were more than one relation type, each with different parameters).
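Since the gzipped TSV is large, it is probably easiest to stream it rather than decompress it first. The sketch below assumes each entity row is an identifier followed by tab-separated floats, so please check the exact layout against the readme. The function name and the tiny demo file are hypothetical. Rows that do not fit this pattern are skipped, so the operator parameters at the end of the real file would need their own handling, which is not shown here.

```python
import gzip
import itertools
import os
import tempfile

def iter_embedding_rows(path, limit=None):
    """Yield (identifier, vector) pairs from a gzipped TSV, one line at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in itertools.islice(f, limit):
            fields = line.rstrip("\n").split("\t")
            try:
                values = [float(x) for x in fields[1:]]
            except ValueError:
                continue  # not "identifier + floats"; skipped in this sketch
            if values:
                yield fields[0], values

# Demo on a tiny stand-in file (not the real Wikidata dump).
demo_path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.tsv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write("Q1\t0.1\t0.2\nQ2\t0.3\t0.4\nsome_relation\tnot\tnumeric\n")

rows = list(iter_embedding_rows(demo_path))
```

Passing a small `limit` is a cheap way to inspect the first few lines of the real file before committing to a full pass.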

You will also find in the doc that, in addition to TSV, there's a machine-readable format for these embeddings (i.e., .npy). There we also explain how to load it.
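The doc remains the reference for loading the real file. As a generic illustration only, NumPy can memory-map a large .npy array so that only the rows you touch are read from disk. The file name and toy array below are stand-ins, and mapping row indices back to entity names is not shown.

```python
import os
import tempfile

import numpy as np

# Create a small stand-in .npy file (the real one is far larger).
toy_path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.npy")
np.save(toy_path, np.arange(12, dtype=np.float32).reshape(4, 3))

# mmap_mode="r" maps the file read-only instead of loading it all into RAM.
embeddings = np.load(toy_path, mmap_mode="r")

# Indexing reads only the requested row; np.asarray copies it into memory.
row = np.asarray(embeddings[2])
```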

Dynamic relations are explained here. The parameters for the left-hand side operators also appear in the TSV file, with a _reverse_relation suffix.

Then, the FAISS example should work just the same for the Wikidata embeddings. Due to their size you may want to use a different index type for better performance, but that depends on your application and you should turn to the FAISS developers for help with tuning.


kadimaolivier commented on August 18, 2024

Hello, I would like to work with PyTorch-BigGraph. My aim is, starting from a graph dataset, to find similarities between entities, including entities that have numerical attributes (my data is in RDF format). Finally, how can I apply the TransEA model with numerical attributes after determining the similarity between entities?


lw commented on August 18, 2024

Your questions are very broad and they are basically about how to design a full ML pipeline, which is something that is up to you, rather than how to employ PBG as one block of it, which is what we're here to help with. I advise you to check out the README and documentation and get back to us if you have specific issues.


kadimaolivier commented on August 18, 2024

Thank you very much for your advice and your feedback. Could you please guide me, since I am a newbie in this field: is it possible to get vectors from an RDF dataset using the PBG tool? If yes, which steps should I follow? Secondly, after getting the vectors, I would like to compare them and find which entities' vectors are similar; is it also possible to do that with PBG? Once again, thanks in advance for your guidance and orientation.


lw commented on August 18, 2024

PBG, by itself, doesn't read RDF. You need to convert it to either the native format (explained here) or to TSV (tab-separated values), for which there already is an importer (i.e., torchbiggraph_from_tsv). The N-Triples format of RDF is somewhat similar to TSV, so that may be the easiest route. Once you have your data in the right format, the doc explains how to train embeddings for its entities.
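As a rough idea of what that conversion could look like, here is a deliberately simplistic sketch that turns N-Triples lines into tab-separated edges. It only keeps triples whose object is an IRI or blank node, so literal-valued triples (such as numerical attributes) are dropped. It is not a full N-Triples parser, the helper name is made up, and the left/relation/right column order is an assumption to check against the importer's options.

```python
import re

# subject: IRI or blank node; predicate: IRI; object: IRI or blank node.
TRIPLE_RE = re.compile(
    r"^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(<[^>]*>|_:\S+)\s*\.\s*$"
)

def ntriples_to_edges(lines):
    """Yield (lhs, relation, rhs) for entity-to-entity triples only."""
    for line in lines:
        match = TRIPLE_RE.match(line)
        if match:  # comments, blank lines and literal objects do not match
            yield tuple(term.strip("<>") for term in match.groups())

sample = [
    "<http://example.org/a> <http://example.org/knows> <http://example.org/b> .",
    '<http://example.org/a> <http://example.org/age> "42"^^<http://www.w3.org/2001/XMLSchema#integer> .',
    "# a comment line",
]
edges = list(ntriples_to_edges(sample))
tsv_text = "\n".join("\t".join(edge) for edge in edges)
```

For real data, a proper RDF library would be more robust than a regular expression; this sketch only shows the shape of the output.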


kadimaolivier commented on August 18, 2024

Thank you very much for the details...


lw commented on August 18, 2024

Closing this as I think everything has been answered and there were no follow-ups. If I missed something or new questions arise, please reopen or create a new issue.
