
HDT: Hierarchical Document Transformer

This repository contains code for our COLM'24 paper "HDT: Hierarchical Document Transformer"

[PDF] [Project]

πŸ“– Overview

We present HDT, a novel sparse Transformer architecture tailored for structured hierarchical documents. HDT exploits document structure by introducing auxiliary anchor tokens and redesigning the attention mechanism into a sparse multi-level hierarchy. By developing a novel sparse attention kernel that considers the hierarchical structure of documents, HDT achieves computational efficiency as well as higher sample efficiency for pre-training and better performance on downstream tasks.

[Figure: HDT architecture]
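
For intuition, here is a minimal sketch (not the repository's implementation) of how a hierarchical document could be flattened into a single token stream with auxiliary anchor tokens. The anchor names [DOC], [SEC], and [CLS] follow the pre-training note further below; the helper itself is purely illustrative.

# Illustrative sketch only: one [DOC] anchor per document, one [SEC] anchor
# per section, one [CLS] anchor per sentence; the anchors act as the
# higher-level nodes of the sparse attention hierarchy.
def flatten_with_anchors(document):
    """document: a list of sections, each section a list of sentence strings."""
    tokens = ["[DOC]"]
    for section in document:
        tokens.append("[SEC]")
        for sentence in section:
            tokens.append("[CLS]")
            tokens.extend(sentence.split())  # stand-in for a real tokenizer
    return tokens

example = [["This is the first sentence of the abstract."],
           ["This is the first sentence of the introduction."]]
print(flatten_with_anchors(example))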

🌟 Requirements

The Python packages required to run this repo are listed in requirements.txt. To install them all at once, please run

pip install -r requirements.txt

🧩 ListOps

To verify our hierarchical attention, we first run experiments on ListOps before training on language tasks, using the scripts in the ListOPs directory. The entry point is run_experiment.py. Model names and hyperparameters are passed as command-line arguments. For example, to run the HDT vs. BERT vs. HAT vs. Longformer comparison used in the paper:

cd ListOPs
python run_experiment.py 0.25 5 20 90000 12 128 1 512 300 120 0.0003 fixed blue 512 HDT hdt_testrun
python run_experiment.py 0.25 5 20 90000 12 128 1 512 300 120 0.0003 fixed blue 512 BERT bert_testrun
python run_experiment.py 0.25 5 20 90000 12 128 1 512 300 120 0.0003 fixed blue 512 Longformer Longformer_testrun
python run_experiment.py 0.25 5 20 90000 12 128 1 512 300 120 0.0003 fixed blue 512 HAT HAT_testrun

Note

Our customized attention kernel currently supports only a three-level hierarchy, so we do not use it for the ListOps tasks, where the depth can be much larger (e.g., 20). Instead, we create a hierarchical attention mask and apply it directly to the attention score matrix. A more flexible kernel supporting arbitrary levels of hierarchy will be released soon. We use Cython to speed up the computation of the sparse attention mask; the Cython code needs to be compiled on your system before running the code:

python setup.py build_ext --inplace
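
For illustration, the following toy sketch (with assumed helper names, not the repository's Cython code) shows the idea of applying a hierarchical mask directly to the attention score matrix: tokens attend within their own sentence plus to and from anchor positions, and disallowed entries are set to -inf before the softmax.

import torch

# Toy sketch of the masking idea; the real mask in this repo encodes the full
# document hierarchy and is computed with Cython for speed.
def apply_hierarchical_mask(scores, sentence_ids, anchor_positions):
    """scores: (L, L) attention scores; sentence_ids: (L,) sentence index per token."""
    same_sentence = sentence_ids.unsqueeze(0) == sentence_ids.unsqueeze(1)
    is_anchor = torch.zeros_like(sentence_ids, dtype=torch.bool)
    is_anchor[anchor_positions] = True
    allowed = same_sentence | is_anchor.unsqueeze(0) | is_anchor.unsqueeze(1)
    return torch.softmax(scores.masked_fill(~allowed, float("-inf")), dim=-1)

scores = torch.randn(8, 8)
sentence_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
attn = apply_hierarchical_mask(scores, sentence_ids, anchor_positions=[0, 4])
print(attn.shape)  # torch.Size([8, 8])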

πŸ“Š Datasets

Pre-training Data

For pre-training, we collect structured documents from HUPD, unarXive, and Wikipedia. In our implementation, all documents are further preprocessed into a list of sections, where each section is a list of sentences. This hierarchical format allows the HDT model to efficiently process and exploit the structural information present in a document. Here is an example of the document data structure:

document = [
        [
            "Title",
            "Abstract",
            "This is the first sentence of abstract.",
            "This is the second sentence of abstract.",
            ...
        ],
        [
            "Introduction",
            "This is the first sentence of the introduction.",
            ...
        ],
    ...
]
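
If your own data is not yet in this nested format, a hypothetical conversion helper could look like the sketch below (very naive sentence splitting; the actual preprocessing of HUPD, unarXive, and Wikipedia is more involved).

# Hypothetical helpers (not part of the repository): turn a list of
# (section_title, section_text) pairs into the list-of-sections /
# list-of-sentences structure shown above.
def split_sentences(text):
    # naive period-based segmentation; a proper splitter (e.g. nltk) is preferable
    return [part.strip() + "." for part in text.split(".") if part.strip()]

def to_hierarchical(sections):
    document = []
    for title, text in sections:
        document.append([title] + split_sentences(text))
    return document

doc = to_hierarchical([
    ("Abstract", "This is the first sentence of abstract. This is the second sentence of abstract."),
    ("Introduction", "This is the first sentence of the introduction."),
])
print(doc)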

To download the preprocessed data, run

from datasets import load_dataset

unarxive = load_dataset('howey/unarXive')
hupd = load_dataset('howey/hupd')
wikipedia = load_dataset('howey/wiki_en')

FullTextSciRepEval

In our experiments, we extend SciRepEval with publicly accessible arXiv full-text data, resulting in a subset called FullTextSciRepEval that contains full-text scientific papers together with the labels from SciRepEval. FullTextSciRepEval is used to benchmark long-document representation in our paper.

πŸš€ Training

HDT-E

Pre-training for HDT uses pretrain.py. The encoder-only and encoder-decoder models share the same training script with different argument settings. For instance, to train an HDT encoder-only model (HDT-E) on the Masked Language Modeling (MLM) task, run

python pretrain.py --encoder_only --tok_name google-bert/bert-base-uncased

Here we directly use the BERT tokenizer for simplicity.

Following CRAMMING, we pre-train our models within an academic budget to evaluate the efficiency of our method. By default, the model is trained on a single GPU for 24 hours.
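
To make this default explicit, and assuming the --num_gpus and --budget flags used for the encoder-decoder model below also apply to the encoder-only setting, the default run corresponds roughly to:

python pretrain.py --encoder_only --tok_name google-bert/bert-base-uncased --num_gpus 1 --budget 24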

HDT-ED

In addition, to pre-train an encoder-decoder model for generation tasks with multiple GPUs and a longer time budget (e.g., 48 hours), run

python pretrain.py --tok_name google-t5/t5-base --num_encoder_layers 6 --num_decoder_layers 6 --num_gpus 4 --budget 48 

Note

  • We use UL2 as the pre-training objective for the encoder-decoder model and MLM for the encoder-only model, following the default configurations from the original papers.
  • Anchor tokens [DOC], [SEC], and [CLS] are not masked during pre-training (see the sketch below).
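
As a toy illustration of the second point (not the repository's actual data collator), the sketch below samples MLM positions while never selecting anchor tokens; the token ids are made up for the example.

import torch

# Toy sketch: sample 15% MLM mask positions, excluding anchor tokens.
def sample_mlm_mask(input_ids, anchor_token_ids, mask_prob=0.15):
    probs = torch.full(input_ids.shape, mask_prob)
    probs[torch.isin(input_ids, anchor_token_ids)] = 0.0  # anchors are never masked
    return torch.bernoulli(probs).bool()  # True = position selected for masking

# Hypothetical ids: 1/2/3 stand for the [DOC]/[SEC]/[CLS] anchor tokens.
input_ids = torch.tensor([[1, 2, 3, 17, 54, 3, 23, 99, 2, 3, 41, 8]])
anchor_ids = torch.tensor([1, 2, 3])
print(sample_mlm_mask(input_ids, anchor_ids))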

Available Models on the Hugging Face Hub

Model Name      Encoder Layers   Decoder Layers   Hidden Units   Attention Heads   Vocab Size   Parameters
howey/HDT-E     12               --               768            12                32,768       109M
howey/HDT-ED    6                6                768            12                32,128       112M
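
The checkpoints can presumably be loaded with the standard transformers auto classes, as sketched below; trust_remote_code=True is an assumption made here for the custom HDT architecture, not a documented requirement of this repository.

from transformers import AutoModel, AutoTokenizer

# Sketch under the assumptions stated above: load the encoder-only checkpoint.
tokenizer = AutoTokenizer.from_pretrained("howey/HDT-E", trust_remote_code=True)
model = AutoModel.from_pretrained("howey/HDT-E", trust_remote_code=True)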

πŸ“šCitation

If you use or extend our work, please consider citing our paper. Thank you for your support! πŸ₯°

@inproceedings{He2024COLM,
  author = {He, Haoyu and Flicke, Markus and Buchman, Jan and Gurevych, Iryna and Geiger, Andreas},
  title = {HDT: Hierarchical Document Transformer},
  publisher = {Conference on Language Modeling},
  year = {2024},
}
