Contextual Embeddings and Graph Convolutional Networks for Concept Prerequisite Learning

This repository contains the code for our Graph Split algorithm, the concept embeddings we generated using concept names and concept definitions from Wikipedia, and the data we used for the experiments presented in the paper: Contextual Embeddings and Graph Convolutional Networks for Concept Prerequisite Learning.

Structure of the Repository

.
├───AL_CPL_Originial_Data #Contains the original AL-CPL data that can be found here: https://github.com/harrylclc/AL-CPL-dataset
├───Generated_Images
├───Graph_Split #This is the training and testing data generated from our graph split
│   ├───data_mining # Each subfolder contains the data for a domain + the statistics of each split in `x_split_statistics.csv`
│   ├───geometry
│   ├───physics
│   └───precalculus
├───Random_Split
│   ├───data_mining
│   ├───geometry
│   ├───physics
│   └───precalculus
└───Scripts #Contains the scripts useful for both types of splits

Overview

While examining the concept prerequisite graph built from the AL-CPL dataset, we noticed that some prerequisite relations are easier to deduce than others because prerequisite relations are transitive. We therefore devised a novel graph-based stochastic split algorithm that avoids cases where inferring a prerequisite relation in the test set is trivial because of transitivity (as in Subfigure 5.a). Our Graph Split algorithm is implemented in the script Graph_Split.py.
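To make the transitivity issue concrete, here is a minimal sketch of how one might check whether a test relation is already implied by the training relations. It is not the Graph Split algorithm from Graph_Split.py; the function name `is_trivially_inferable`, the `(prerequisite, concept)` edge convention, and the concept names are assumptions made for illustration only.

```python
from collections import defaultdict, deque


def is_trivially_inferable(train_edges, test_edge):
    """Return True if a directed path in train_edges already links the
    two endpoints of test_edge, i.e. the relation follows by transitivity.

    Edges are (prerequisite, concept) pairs (an assumed convention).
    """
    graph = defaultdict(set)
    for src, dst in train_edges:
        graph[src].add(dst)

    start, goal = test_edge
    seen = {start}
    queue = deque([start])
    # Breadth-first search from the prerequisite towards the target concept.
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt == goal:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical toy relations, not taken from the AL-CPL data.
train = [("algebra", "calculus"), ("calculus", "differential equations")]
print(is_trivially_inferable(train, ("algebra", "differential equations")))
print(is_trivially_inferable(train, ("algebra", "geometry")))
```

A split that leaves many test relations for which this check returns True would let a model score well by chaining training relations rather than by learning from the concepts themselves.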

The following section explains how we split our data for our experiments and how we generate our embeddings.

Split Architecture

For reproducibility purposes, below is a schema that explains how we generate the appropriate splits to train our models.

A 'Random Split' is simply the random split of the data you would get from a function similar to sklearn.model_selection.train_test_split, whereas a 'Graph Split' is the random split produced by our Graph Split algorithm (see Graph_split.py). The in-domain split, whether it comes from a 'Graph Split' or a 'Random Split', generates 5 different train/val/test splits for each domain. The transductive split is the way of splitting the training data that Leskovec recommends for transductive link prediction. His slides are available here (see slide 68), as well as the code we used to reproduce this split (we used option 2 with 'torch_geometric').
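As a rough illustration of the 'Random Split' case, the sketch below produces 5 seeded train/val/test splits using only the standard library. The split fractions, the `random_split` helper, and the toy pairs are assumptions; the repository's actual proportions and file formats are not stated in this README.

```python
import random


def random_split(pairs, val_frac=0.1, test_frac=0.2, seed=0):
    """Shuffle labelled concept pairs and cut them into train/val/test.

    val_frac and test_frac are illustrative defaults, not the values
    used in the paper.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical (concept_a, concept_b, label) triples.
pairs = [(f"concept_{i}", f"concept_{i + 1}", i % 2) for i in range(100)]

# One split per seed gives 5 different train/val/test splits.
splits = [random_split(pairs, seed=s) for s in range(5)]
for train, val, test in splits:
    print(len(train), len(val), len(test))
```

Fixing the seed per split keeps each of the 5 splits reproducible while still differing from one another.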

Embeddings

We use FastText, Transformers, and Sentence Transformers to generate concept embeddings to solve Concept Prerequisite Learning on the AL-CPL dataset. Below, you will find the methodology we followed to generate our embeddings and the metrics we used to evaluate them.

The notebook Metrics_for_Emebddings.ipynb shows how to load the embeddings in Concepts_with_Description_and_Embeddings.csv, provides a visualization, and computes the metrics for each embedding method. If needed, Metrics_for_Emebddings.ipynb can be opened and executed in Google Colaboratory; you only need to upload the Concepts_with_Description_and_Embeddings.csv file.

How our Embeddings Were Generated

We use the following methodology to represent concepts, with the name of the corresponding column in the file Concepts_with_Description_and_Embeddings.csv in parentheses:

FastText Embeddings (ConceptEmbeddings_FastText): We use FastText's cc.en.300.bin model. Its get_sentence_vector method is used on the name of concepts.
OpenAI Embeddings (Phrase_Embedding_OpenAI): We use OpenAI's API and their model text-embedding-ada-002 to generate concept embeddings from the definitions of concepts (Query_Phrase).
Sentence Transformer Embeddings (Phrase_Embedding_all-*): We use the English models of the Sentence Transformers library and the definitions of concepts (Query_Phrase) to generate concept embeddings. The name of the model we used is included in the column's name. For example, the embeddings in column Phrase_Embedding_all-mpnet-base-v2 were generated with the model all-mpnet-base-v2.
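The column naming convention above can be sketched as follows. This is only an illustration under stated assumptions: the helpers `model_name_from_column` and `embed_definitions` are hypothetical and not part of the repository, and the embedding step needs the third-party `sentence-transformers` package plus a model download, so it is wrapped in a function and not called here.

```python
SENTENCE_TRANSFORMER_PREFIX = "Phrase_Embedding_"


def model_name_from_column(column):
    """Recover the Sentence Transformers model name from a column name,
    e.g. 'Phrase_Embedding_all-mpnet-base-v2' -> 'all-mpnet-base-v2'."""
    if not column.startswith(SENTENCE_TRANSFORMER_PREFIX + "all-"):
        raise ValueError(f"not a Sentence Transformer column: {column!r}")
    return column[len(SENTENCE_TRANSFORMER_PREFIX):]


def embed_definitions(definitions, column):
    """Encode concept definitions (the Query_Phrase texts) with the model
    named in the column. Requires `pip install sentence-transformers`."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name_from_column(column))
    return model.encode(list(definitions))


print(model_name_from_column("Phrase_Embedding_all-mpnet-base-v2"))
```

The check on the `all-` suffix keeps the OpenAI column (Phrase_Embedding_OpenAI) from being mistaken for a Sentence Transformers model name.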

Metrics

Additionally, we propose a method to evaluate embeddings for our CPL task, drawing inspiration from recommender systems. Our evaluation process involves the following steps: First, we compute the cosine similarity matrix for all embeddings. Next, we rank the results for each concept. Finally, we compute the evaluation metrics based on the definitions provided below:

For any given concept C, the term "related concepts" refers to either the prerequisites of C (its ancestors) or the concepts for which C is a prerequisite (its descendants). The query set, denoted Q, contains all the concepts related to C.

All of these metrics return values in the interval [0,1]. The higher the value of the metric, the higher the probability of concepts being 'related' when their embeddings have a high cosine similarity (see the definition of "related concepts" given above).
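The three evaluation steps (cosine similarity, per-concept ranking, metric computation) can be sketched in plain Python as below. The metric definitions themselves are not reproduced in this text, so `precision_at_k` is only a stand-in for a recommender-style metric with values in [0,1] and should not be read as one of the paper's metrics. The toy embeddings and the related set Q are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_neighbours(embeddings, concept):
    """Rank all other concepts by decreasing cosine similarity to `concept`."""
    others = [c for c in embeddings if c != concept]
    return sorted(
        others,
        key=lambda c: cosine(embeddings[concept], embeddings[c]),
        reverse=True,
    )


def precision_at_k(ranking, related, k):
    """Fraction of the top-k ranked concepts that belong to the related set Q.
    Stand-in metric (an assumption), always in [0, 1]."""
    return sum(1 for c in ranking[:k] if c in related) / k


# Hypothetical 2-D embeddings and related set for concept "A".
embeddings = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0], "D": [-1.0, 0.0]}
related_to_a = {"B"}

ranking = rank_neighbours(embeddings, "A")
print(ranking)
print(precision_at_k(ranking, related_to_a, k=1))
print(precision_at_k(ranking, related_to_a, k=2))
```

For the real data, the vectors would come from a column of Concepts_with_Description_and_Embeddings.csv, and Q would be built from each concept's ancestors and descendants in the prerequisite graph.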

Acknowledgements

This work would not have been possible without:

Citation

Please cite our work if this repository was helpful to your research!

@article{X2023PrereqGCN,
  title={Contextual Embeddings and Graph Convolutional Networks for Concept Prerequisite Learning},
  author={X},
  year={2023},
  booktitle = {X},
  publisher = {X},
}
