

Chart-to-Text: Generating Natural Language Explanations for Charts by Adapting the Transformer Model

Code for Chart-to-Text: Generating Natural Language Explanations for Charts by Adapting the Transformer Model.

Much of the code is adapted from Enhanced Transformer Model for Data-to-Text Generation [PDF] (Gong, Crego, Senellart; WNGT2019). https://github.com/gongliym/data2text-transformer

This project aims to automatically generate salient summaries from a chart's data table using a modified Transformer model.

2022 Update

Newer paper + dataset on this topic: Chart-to-Text: A Large-Scale Benchmark for Chart Summarization (Paper on arXiv)

Also visit SciCap: Scientific Figures Dataset for a related project with a larger and more detailed dataset.

Dataset

The dataset is stored in the dataset directory. There are individual files for each statistic's title, data, and caption in their respective folders.

Statistics with only two columns (i.e., simple bar or line charts) are found in the dataset folder, while statistics with more than two columns (i.e., grouped bar or multi-line charts) are found in the dataset/multiColumn folder.

There are also uncleaned versions found in the _old folders.

Chart type breakdown:

|         | Line | Bar  | Total |
|---------|------|------|-------|
| Simple  | 3564 | 3199 | 6763  |
| Complex | 902  | 640  | 1542  |
| Total   | 4466 | 3839 | 8305  |

The chart images are available separately at https://github.com/JasonObeid/Chart2TextImages due to their large size (~1 GB).
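For quick inspection, the three files belonging to one statistic can be gathered with a small helper. The subfolder names used here ("titles", "data", "captions") are assumptions for illustration — check the actual subfolder names under dataset/:

```python
from pathlib import Path

def load_statistic(root, name):
    """Gather the title, data table, and caption files that belong to
    one statistic into a single dict.

    NOTE: the subfolder names below are illustrative assumptions;
    verify them against the actual dataset layout."""
    stat = {}
    for part in ("titles", "data", "captions"):
        path = Path(root) / part / f"{name}.txt"
        stat[part] = path.read_text().strip() if path.exists() else None
    return stat
```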

(Added by Genglin) Step 0: Install environment

Create a new conda environment and install the necessary packages:

conda create --name chart2text
conda activate chart2text
cd Chart2Text
pip install -r requirements.txt

Note that en_core_web_md==2.2.5 cannot be installed with pip; install it instead with

python -m spacy download en_core_web_md

according to https://spacy.io/models/en

(However, if you have already done all the preprocessing, the en_core_web_md package can be skipped; jump to the Generation section of the readme to proceed.)

Step 1: Cleaning the dataset

Clean the text within the chart titles and summaries

python utils/refactorTitles.py

python utils/refactorCaptions.py

Step 2: Preprocessing

python etc/templatePreprocess.py
  • Converts data tables into a sequence of records (taken as input by the model): data/*split*/trainData.txt
  • Cleans summary tokens and substitutes any possible tokens with data variables (e.g., 2018 -> templateValue[0][0]): data/*split*/trainSummary.txt
  • Cleans the title tokens: data/*split*/trainTitle.txt
  • Labels the occurrences of records mentioned within the summary: data/*split*/trainDataLabel.txt
  • Labels the summary tokens which match a record: data/*split*/trainSummaryLabel.txt
  • Saves the gold summaries: data/*split*/testOriginalSummary.txt
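The data-variable substitution in the second bullet can be sketched as follows. This is a simplified, exact-match-only re-implementation for illustration; the real logic lives in etc/templatePreprocess.py:

```python
def substitute_data_variables(table, summary_tokens):
    """Replace summary tokens that exactly match a table cell with a
    templateValue[row][col] placeholder (illustrative sketch only)."""
    # Map each cell value to its (row, col) position; first occurrence wins.
    positions = {}
    for r, row in enumerate(table):
        for c, cell in enumerate(row):
            positions.setdefault(str(cell), (r, c))
    out = []
    for tok in summary_tokens:
        if tok in positions:
            r, c = positions[tok]
            out.append(f"templateValue[{r}][{c}]")
        else:
            out.append(tok)
    return out
```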

Step 3: Extract vocabulary for each split

cd etc
python extract_vocab.py --table ../data/valid/validData.txt --summary ../data/valid/validSummary.txt
python extract_vocab.py --table ../data/test/testData.txt --summary ../data/test/testSummary.txt
python extract_vocab.py --table ../data/train/trainData.txt --summary ../data/train/trainSummary.txt

This generates a vocabulary file for each input:

  • data/*split*/*split*Data.txt_vocab
  • data/*split*/*split*Summary.txt_vocab
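What such a vocabulary file contains can be approximated by frequency-counting whitespace tokens. This is a sketch of the assumed behavior of etc/extract_vocab.py, not its exact output format:

```python
from collections import Counter

def extract_vocab(lines, min_freq=1):
    """Count whitespace tokens across all lines and return them sorted
    by descending frequency, keeping tokens seen at least min_freq times
    (assumed to mirror the *_vocab files, one token per line)."""
    counts = Counter(tok for line in lines for tok in line.split())
    return [tok for tok, n in counts.most_common() if n >= min_freq]
```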

Step 4: Binarize the data for each split

cd ../model
python preprocess_table_data.py --table ../data/valid/validData.txt --table_vocab ../data/valid/validData.txt_vocab --table_label ../data/valid/validDataLabel.txt
python preprocess_table_data.py --table ../data/test/testData.txt --table_vocab ../data/test/testData.txt_vocab --table_label ../data/test/testDataLabel.txt
python preprocess_table_data.py --table ../data/train/trainData.txt --table_vocab ../data/train/trainData.txt_vocab --table_label ../data/train/trainDataLabel.txt

python preprocess_summary_data.py --summary ../data/valid/validSummary.txt --summary_vocab ../data/valid/validSummary.txt_vocab --summary_label ../data/valid/validSummaryLabel.txt
python preprocess_summary_data.py --summary ../data/test/testSummary.txt --summary_vocab ../data/test/testSummary.txt_vocab --summary_label ../data/test/testSummaryLabel.txt
python preprocess_summary_data.py --summary ../data/train/trainSummary.txt --summary_vocab ../data/train/trainSummary.txt_vocab --summary_label ../data/train/trainSummaryLabel.txt

Outputs the binarized data for each split:

  • Data records: data/*split*/*split*Data.txt.pth
  • Summaries: data/*split*/*split*Summary.txt.pth

Note: if you get a dictionary assertion error, delete the old .pth files in the data subfolders and try again.
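A quick way to do that cleanup from the repository root (a convenience sketch; adjust the root path if your layout differs):

```python
from pathlib import Path

def remove_stale_pth(root="data"):
    """Delete previously binarized *.pth files under `root` so that
    preprocess_table_data.py / preprocess_summary_data.py rebuild them
    with the current vocabularies. Returns the deleted paths."""
    removed = []
    for pth in Path(root).glob("**/*.pth"):
        pth.unlink()
        removed.append(str(pth))
    return removed
```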

Model Training

MODELPATH=$PWD/model
export PYTHONPATH=$MODELPATH:$PYTHONPATH

## main parameters
python model/train.py \
    --model_path "experiments" \
    --exp_name "chart2text" \
    --exp_id "run1" \
    --train_cs_table_path data/train/trainData.txt.pth \
    --train_sm_table_path data/train/trainData.txt.pth \
    --train_sm_summary_path data/train/trainSummary.txt.pth \
    --valid_table_path data/valid/validData.txt.pth \
    --valid_summary_path data/valid/validSummary.txt.pth \
    --cs_step True \
    --lambda_cs "1" \
    --sm_step True \
    --lambda_sm "1" \
    --label_smoothing 0.05 \
    --sm_step_with_cc_loss False \
    --sm_step_with_cs_proba False \
    --share_inout_emb True \
    --share_srctgt_emb False \
    --emb_dim 512 \
    --enc_n_layers 1 \
    --dec_n_layers 6 \
    --dropout 0.1 \
    --save_periodic 40 \
    --batch_size 6 \
    --beam_size 4 \
    --epoch_size 1000 \
    --max_epoch 81 \
    --eval_bleu True \
    --sinusoidal_embeddings True \
    --encoder_positional_emb True \
    --gelu_activation True \
    --validation_metrics valid_mt_bleu

Generation

Use the following commands to generate from the above models:

Download our model (trained with data variables)

Or our baseline model adapted from Gong et al. https://github.com/gongliym/data2text-transformer (trained without data variables)

python model/summarize.py \
  --model_path aug17-80.pth \
  --table_path data/test/testData.txt \
  --output_path results/aug17/templateOutput-p80.txt \
  --title_path data/test/testTitle.txt \
  --beam_size 4 \
  --batch_size 8

Postprocessing after generation

Substitute any predicted data variables:

python etc/summaryComparison.py
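This step is the inverse of the preprocessing substitution: predicted templateValue[row][col] placeholders are replaced by the matching table cells. A hypothetical sketch (the actual logic lives in etc/summaryComparison.py):

```python
import re

TEMPLATE = re.compile(r"templateValue\[(\d+)\]\[(\d+)\]")

def fill_data_variables(table, summary):
    """Replace predicted templateValue[row][col] placeholders in a
    generated summary with the corresponding table cell (sketch only)."""
    def lookup(match):
        r, c = int(match.group(1)), int(match.group(2))
        try:
            return str(table[r][c])
        except IndexError:
            return match.group(0)  # out-of-range prediction: keep the token
    return TEMPLATE.sub(lookup, summary)
```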

Evaluation

"Content Selection" evaluation

python studyOutcome/automatedEvaluation.py
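The content-selection idea — how well the generated summary mentions the same data records as the gold summary — can be illustrated with a precision/recall sketch. This is an illustration only; the exact metric is defined in studyOutcome/automatedEvaluation.py:

```python
def content_selection_scores(gold_records, predicted_records):
    """Precision and recall of the records mentioned in a generated
    summary against those mentioned in the gold summary (hypothetical
    sketch of a content-selection metric)."""
    gold, pred = set(gold_records), set(predicted_records)
    if not gold or not pred:
        return 0.0, 0.0
    matched = gold & pred
    return len(matched) / len(pred), len(matched) / len(gold)
```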

BLEU evaluation

The BLEU evaluation script can be obtained from Moses:

perl model/src/evaluation/multi-bleu.perl data/test/testOriginalSummary.txt < results/aug17/generated-p80.txt
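For a rough sanity check without Perl, BLEU can be approximated in a few lines (single reference, uniform n-gram weights). For reported numbers, use multi-bleu.perl as shown above, since tokenization and smoothing details differ:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, hypothesis, max_n=4):
    """Minimal BLEU sketch: geometric mean of clipped n-gram precisions
    times a brevity penalty (approximation of multi-bleu.perl)."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(ngrams(hyp, n))
        ref_ngrams = Counter(ngrams(ref, n))
        overlap = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
        precisions.append(overlap / max(sum(hyp_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```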

Contributors

genglinliu, jasonobeid
