
HDT: Hierarchical Document Transformer

This repository contains code for our COLM'24 paper "HDT: Hierarchical Document Transformer"

[PDF] [Project]

πŸ“– Overview

We present HDT, a novel sparse Transformer architecture tailored for structured hierarchical documents. HDT exploits document structure by introducing auxiliary anchor tokens and redesigning the attention mechanism into a sparse multi-level hierarchy. By developing a novel sparse attention kernel that considers the hierarchical structure of documents, HDT achieves computational efficiency as well as higher sample efficiency for pre-training and better performance on downstream tasks.

[Figure: HDT architecture]
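
For intuition, here is a minimal sketch (not the repository's implementation) of how a hierarchical document could be flattened into a single token stream with auxiliary anchor tokens. The anchor names [DOC], [SEC], and [CLS] follow the pre-training note further below; the helper itself is purely illustrative.

# Illustrative sketch only: one [DOC] anchor per document, one [SEC] anchor
# per section, one [CLS] anchor per sentence; the anchors act as the
# higher-level nodes of the sparse attention hierarchy.
def flatten_with_anchors(document):
    """document: a list of sections, each section a list of sentence strings."""
    tokens = ["[DOC]"]
    for section in document:
        tokens.append("[SEC]")
        for sentence in section:
            tokens.append("[CLS]")
            tokens.extend(sentence.split())  # stand-in for a real tokenizer
    return tokens

example = [["This is the first sentence of the abstract."],
           ["This is the first sentence of the introduction."]]
print(flatten_with_anchors(example))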

🌟 Requirements

The Python packages required to run this repo are listed in requirements.txt. To install them all at once, please run

pip install -r requirements.txt

🧩 ListOps

To verify our hierarchical attention, we first run experiments on ListOps before training on language tasks, using the scripts in the ListOPs directory. The entry point is run_experiment.py. Model names and hyperparameters are passed as command-line arguments. For example, to run the HDT vs. BERT vs. HAT vs. Longformer comparison used in the paper:

cd ListOPs
python run_experiment.py 0.25 5 20 90000 12 128 1 512 300 120 0.0003 fixed blue 512 HDT hdt_testrun
python run_experiment.py 0.25 5 20 90000 12 128 1 512 300 120 0.0003 fixed blue 512 BERT bert_testrun
python run_experiment.py 0.25 5 20 90000 12 128 1 512 300 120 0.0003 fixed blue 512 Longformer Longformer_testrun
python run_experiment.py 0.25 5 20 90000 12 128 1 512 300 120 0.0003 fixed blue 512 HAT HAT_testrun

Note

Our customized attention kernel currently supports only a three-level hierarchy, so we do not use it for the ListOps tasks, where the depth can be much larger (e.g., 20). Instead, we create a hierarchical attention mask and apply it directly to the attention score matrix. A more flexible kernel supporting arbitrary levels of hierarchy will be released soon. We use Cython to speed up the computation of the sparse attention mask; the Cython code needs to be compiled on your system before running the code:

python setup.py build_ext --inplace
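
For illustration, the following toy sketch (with assumed helper names, not the repository's Cython code) shows the idea of applying a hierarchical mask directly to the attention score matrix: tokens attend within their own sentence plus to and from anchor positions, and disallowed entries are set to -inf before the softmax.

import torch

# Toy sketch of the masking idea; the real mask in this repo encodes the full
# document hierarchy and is computed with Cython for speed.
def apply_hierarchical_mask(scores, sentence_ids, anchor_positions):
    """scores: (L, L) attention scores; sentence_ids: (L,) sentence index per token."""
    same_sentence = sentence_ids.unsqueeze(0) == sentence_ids.unsqueeze(1)
    is_anchor = torch.zeros_like(sentence_ids, dtype=torch.bool)
    is_anchor[anchor_positions] = True
    allowed = same_sentence | is_anchor.unsqueeze(0) | is_anchor.unsqueeze(1)
    return torch.softmax(scores.masked_fill(~allowed, float("-inf")), dim=-1)

scores = torch.randn(8, 8)
sentence_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
attn = apply_hierarchical_mask(scores, sentence_ids, anchor_positions=[0, 4])
print(attn.shape)  # torch.Size([8, 8])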

πŸ“Š Datasets

Pre-training Data

For pre-training, we collect structured documents from HUPD, unarXive, and Wikipedia. In our implementation, all documents are further preprocessed into a list of sections, where each section is a list of sentences. This hierarchical format allows the HDT model to efficiently process and exploit the structural information present in a document. Here is an example of the document data structure:

document = [
        [
            "Title",
            "Abstract",
            "This is the first sentence of abstract.",
            "This is the second sentence of abstract.",
            ...
        ],
        [
            "Introduction",
            "This is the first sentence of the introduction.",
            ...
        ],
    ...
]
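
If your own data is not yet in this nested format, a hypothetical conversion helper could look like the sketch below (very naive sentence splitting; the actual preprocessing of HUPD, unarXive, and Wikipedia is more involved).

# Hypothetical helpers (not part of the repository): turn a list of
# (section_title, section_text) pairs into the list-of-sections /
# list-of-sentences structure shown above.
def split_sentences(text):
    # naive period-based segmentation; a proper splitter (e.g. nltk) is preferable
    return [part.strip() + "." for part in text.split(".") if part.strip()]

def to_hierarchical(sections):
    document = []
    for title, text in sections:
        document.append([title] + split_sentences(text))
    return document

doc = to_hierarchical([
    ("Abstract", "This is the first sentence of abstract. This is the second sentence of abstract."),
    ("Introduction", "This is the first sentence of the introduction."),
])
print(doc)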

To download the preprocessed data, run

from datasets import load_dataset

unarxive = load_dataset('howey/unarXive')
hupd = load_dataset('howey/hupd')
wikipedia = load_dataset('howey/wiki_en')

FullTextSciRepEval

In our experiments, we extend SciRepEval with publicly accessible arXiv full-text data, resulting in a subset called FullTextSciRepEval that contains full-text scientific papers together with the labels from SciRepEval. FullTextSciRepEval is used to benchmark long-document representation in our paper.

πŸš€ Training

HDT-E

Pre-training for HDT uses pretrain.py. The encoder-only and encoder-decoder models share the same training script with different argument settings. For instance, to train an HDT encoder-only model (HDT-E) on the Masked Language Modeling (MLM) task, run

python pretrain.py --encoder_only --tok_name google-bert/bert-base-uncased

Here we directly use the BERT tokenizer for simplicity.

Following CRAMMING, we pre-train our models within an academic budget to evaluate the efficiency of our method. By default, the model is trained on a single GPU for 24 hours.
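
To make this default explicit, and assuming the --num_gpus and --budget flags used for the encoder-decoder model below also apply to the encoder-only setting, the default run corresponds roughly to:

python pretrain.py --encoder_only --tok_name google-bert/bert-base-uncased --num_gpus 1 --budget 24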

HDT-ED

In addition, to pre-train an encoder-decoder model for generation tasks with multiple GPUs and a longer time budget (e.g., 48 hours), run

python pretrain.py --tok_name google-t5/t5-base --num_encoder_layers 6 --num_decoder_layers 6 --num_gpus 4 --budget 48 

Note

  • We use UL2 as the pre-training objective for the encoder-decoder model and MLM for the encoder-only model, following the default configurations from the original papers.
  • Anchor tokens [DOC], [SEC], and [CLS] are not masked during pre-training (see the sketch below).
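
As a toy illustration of the second point (not the repository's actual data collator), the sketch below samples MLM positions while never selecting anchor tokens; the token ids are made up for the example.

import torch

# Toy sketch: sample 15% MLM mask positions, excluding anchor tokens.
def sample_mlm_mask(input_ids, anchor_token_ids, mask_prob=0.15):
    probs = torch.full(input_ids.shape, mask_prob)
    probs[torch.isin(input_ids, anchor_token_ids)] = 0.0  # anchors are never masked
    return torch.bernoulli(probs).bool()  # True = position selected for masking

# Hypothetical ids: 1/2/3 stand for the [DOC]/[SEC]/[CLS] anchor tokens.
input_ids = torch.tensor([[1, 2, 3, 17, 54, 3, 23, 99, 2, 3, 41, 8]])
anchor_ids = torch.tensor([1, 2, 3])
print(sample_mlm_mask(input_ids, anchor_ids))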

Available Models on the Hugging Face Hub

Model Name      Encoder Layers   Decoder Layers   Hidden Units   Attention Heads   Vocab Size   Parameters
howey/HDT-E     12               --               768            12                32,768       109M
howey/HDT-ED    6                6                768            12                32,128       112M
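
The checkpoints can presumably be loaded with the standard transformers auto classes, as sketched below; trust_remote_code=True is an assumption made here for the custom HDT architecture, not a documented requirement of this repository.

from transformers import AutoModel, AutoTokenizer

# Sketch under the assumptions stated above: load the encoder-only checkpoint.
tokenizer = AutoTokenizer.from_pretrained("howey/HDT-E", trust_remote_code=True)
model = AutoModel.from_pretrained("howey/HDT-E", trust_remote_code=True)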

πŸ“šCitation

If you use or extend our work, please consider citing our paper. Thank you for your support! πŸ₯°

@inproceedings{He2024COLM,
  author = {He, Haoyu and Flicke, Markus and Buchman, Jan and Gurevych, Iryna and Geiger, Andreas},
  title = {HDT: Hierarchical Document Transformer},
  publisher = {Conference on Language Modeling},
  year = {2024},
}
