This repository contains the code for our Graph Split
algorithm, the concept embeddings we generated using concept names and concept definitions from Wikipedia, and the data we used for the experiments presented in the paper: Contextual Embeddings and Graph Convolutional Networks for Concept Prerequisite Learning.
.
├───AL_CPL_Originial_Data #Contains the original AL-CPL data that can be found here: https://github.com/harrylclc/AL-CPL-dataset
├───Generated_Images
├───Graph_Split #This is the training and testing data generated from our graph split
│ ├───data_mining # Each subfolder contains the data for a domain + the statistics of each split in `x_split_statistics.csv`
│ ├───geometry
│ ├───physics
│ └───precalculus
├───Random_Split
│ ├───data_mining
│ ├───geometry
│ ├───physics
│ └───precalculus
└───Scripts #Contains the scripts useful for both types of splits
By examining the concept prerequisite graph generated from the AL-CPL dataset, we noticed that some of the prerequisite relations are easier to deduce than others due to the transitive property. Therefore, we devise a novel graph-based stochastic split algorithm that avoids cases where the prerequisite inference in the test set is trivial due to the transitivity property (like in Subfigure 5.a). Our Graph Split algorithm is in the script Graph_Split.py
.
The following section explains how we split our data for our experiments and how we generate our embeddings.
For reproducibility purposes, below is a schema that explains how we generate the appropriate splits to train our models.
What is called a 'Random Split' is simply the random split of the data you would get after running a function similar to sklearn.model_selection.train_test_split
, whereas the 'Graph Split' is the random split obtained by our Graph Split
algorithm (see Graph_split.py
). The in-domain split, whether from a 'Graph Split' or a 'Random Split', generates 5 different train/val/test splits for each of the domains. The transductive split is the method of splitting the training data that is recommended by Lescovec for transductive link prediction. His slides are available here (see slide 68.), as well as the code we used to reproduce this split (we used option 2 with 'torch_geometric').
We use FastText, Transformers, and Sentence Transformers to generate concept embeddings to solve Concept Prerequisite Learning on the AL-CPL dataset. Below, you will find the methodology we followed to generate our embeddings and the metrics we used to evaluate them.
The Notebook Metrics_for_Emebddings.ipynb
shows how to load the embeddings in Concepts_with_Description_and_Embeddings.csv
, provides a visualization, and computes the metrics for each embedding method. If needed, Metrics_for_Emebddings.ipynb
can be opened and executed in Google Colaboratory, you would just need to upload the Concepts_with_Description_and_Embeddings.csv
file.
We use the following methodology to represent concepts, with the name of the corresponding column in the file Concepts_with_Description_and_Embeddings.csv
in parentheses:
-
FastText Embeddings (
ConceptEmbeddings_FastText
): We use FastText'scc.en.300.bin
model. Itsget_sentence_vector
method is used on the name of concepts. -
OpenAI Embeddings (
Phrase_Embedding_OpenAI
): We use Open AI's API to generate concept embeddings using the definitions of concepts (Query_Phrase
) and their modeltext-embedding-ada-002
. -
Sentence Transformer Embeddings (
Phrase_Embedding_all-*
): We use the English models of the library Sentence Transformers and the definitions of concepts (Query_Phrase
) to generate concept embeddings. The name of the model we used is included in the column's name. I.e. for the embeddings in columnPhrase_Embedding_all-mpnet-base-v2
, the modelall-mpnet-base-v2
was used.
Additionally, we propose a method to evaluate embeddings for our CPL task, drawing inspiration from recommender systems. Our evaluation process involves the following steps: First, we compute the cosine similarity matrix for all embeddings. Next, we rank the results for each concept. Finally, we compute the evaluation metrics based on the definitions provided below:
Where for any given concept C, the term "related concepts" refers to either the prerequisites of the given concept C (ancestors) or the concepts for which C is a prerequisite (descendants). The query set, denoted as Q, contains all the related concepts to concept C.
All of these metrics return values in the interval [0,1]. The higher the value of the metric, the higher the probability of concepts being 'related' when their embeddings have a high cosine similarity (see the definition of "related concepts" given above).
This work would not have been possible without:
Please, do cite our work if this repository was helpful to your research!
@article{X2023PrereqGCN,
title={Contextual Embeddings and Graph Convolutional Networks for Concept Prerequisite Learning},
author={X},
year={2023},
booktitle = {X},
publisher = {X},
}