This repository contains the code for our Financial Text Summarization task, carried out for the Deep Natural Language Processing class (2021-2022). Our goal was to reproduce the work of Zmander, who ranked 2nd in the FNS2021 competition. The approach tackles long-document summarization by combining an extractive and an abstractive model through a Reinforcement Learning policy. In addition, we propose a distributional analysis to identify the most salient parts of the documents and to cut them according to the distribution we found. Furthermore, we extend the task to CNN headline generation and apply the model to the FNS2022 dataset, which covers three languages: English, Spanish and Greek. The pipeline we propose, and which you can reproduce, is the following:
- Data preprocessing, including document truncation, corpus generation, etc.
- Training of the Extractor, Abstractor and RL models
- Model evaluation
- Python 3 (tested on python 3.6)
- PyTorch 0.4.0
- with GPU and CUDA enabled installation (though the code is runnable on CPU, it would be way too slow)
- gensim
- cytoolz
- tensorboardX
- pyrouge (for evaluation)
You can download the datasets we used from the following links:
Below we suggest how to re-run our work, so that you can check our results and change the training settings if you like. A Colab notebook with the most salient steps is provided at . Make a copy and play with it as you prefer!
Moreover, we provide the preprocessed files and pretrained models:
Download the whole directories and copy them in your main Google Drive folder. Enjoy!
This step can be skipped: if it is not run, the cut simply keeps the first 1000 sentences of each document. If you would like to inspect the distribution of importance in your documents, run:
%run preprocess/distribution_analysis.py --data <DATASET> --language <language> --stage <stage you want to start from> --top_M <top sentences to compute the rouge with> --jit <True if you want to use jit, False otherwise>
Note that if the parameter --jit is not specified, JIT is enabled by default.
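As a toy illustration of what the distribution analysis computes, the sketch below accumulates a histogram of the positions of the most summary-relevant sentences across a corpus. This is purely illustrative: the actual script scores sentences with ROUGE, while here plain unigram overlap is used as a hypothetical stand-in.

```python
from collections import Counter

def sentence_importance(doc_sents, summary, top_m=2):
    """Rank sentence positions by unigram overlap with the gold summary
    (a stand-in for the ROUGE score used in the actual script)."""
    summ_tokens = set(summary.lower().split())
    scores = [(i, len(set(s.lower().split()) & summ_tokens))
              for i, s in enumerate(doc_sents)]
    scores.sort(key=lambda x: x[1], reverse=True)
    return [i for i, _ in scores[:top_m]]

# Accumulate a positional histogram over a (toy) corpus
corpus = [
    (["profits rose sharply", "the weather was nice", "revenue grew"],
     "profits and revenue grew"),
    (["strong revenue growth", "board meeting notes", "minor items"],
     "revenue growth was strong"),
]
hist = Counter()
for sents, summ in corpus:
    hist.update(sentence_importance(sents, summ))
print(dict(hist))
```

Positions that accumulate the highest counts indicate where, on average, the salient content sits, which is the information the cut is based on.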
The following sketch illustrates how our preprocessing pipeline works. Note that the documents are cut according to their distribution.
First of all, you will need to preprocess your data. To do that, use the script pipeline.py inside the "preprocess" folder, which transforms your data into the format expected by the models.
!python preprocess/pipeline.py --data <DATASET> --language <language> --max_len <maximum length. Set this parameter even if you want to use distribution.> --stage <stage_you_want_to_start_from> --jit <True if you want to use jit, False otherwise> --use_distribution <stores true if you want to cut documents according to their distribution>
Note: if the parameter --jit is not specified, JIT is enabled by default.
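The two cutting strategies (plain truncation vs. distribution-based cut) can be sketched as follows. This is a hypothetical simplification; the real pipeline operates on the tokenized corpus files.

```python
def cut_document(sents, max_len=1000, keep_positions=None):
    """Default cut: keep the first `max_len` sentences.
    If a positional distribution is available, keep those positions instead."""
    if keep_positions is None:
        return sents[:max_len]
    return [s for i, s in enumerate(sents) if i in keep_positions]

doc = [f"sentence {i}" for i in range(5)]
print(cut_document(doc, max_len=3))              # plain truncation
print(cut_document(doc, keep_positions={0, 4}))  # distribution-based cut
```

This is why --max_len must be set even when --use_distribution is active: it bounds the fallback truncation.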
The next step is to train the extractor. The image displays the main idea of its architecture.
To train it run the following cell:
!python train_extractor_ml.py --data <DATASET> --language <language> --lstm_layer <number of LSTM layers> --lstm_hidden <LSTM hidden size> --batch <batch size> --ckpt_freq <checkpoint frequency> --max_word <words in a sentence are cut according to this parameter> --max_sent <sentences in a document are cut according to this parameter>
To train the abstractor, execute the following line:
!python train_abstractor.py --data <DATASET> --language <language> --n_layer <number of layers> --n_hidden <hidden size> --batch <batch size> --ckpt_freq <checkpoint frequency>
Last but not least, the Reinforcement Learning agent needs to be trained. The scheme shows the main steps of how it works.
For RL training, the parameter --abs_dir passes the abstractor directory to the RL model. If you want to reproduce our ablation study, do not pass --abs_dir. To train the RL part, run the following cell:
!python train_full_rl.py --data <DATASET> --language <language> --batch <batch size> --abs_dir <directory of the abstractor (use "model\abs"). If you want to perform ablation study, do not pass this argument>
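The core RL idea, rewarding the extraction policy with a ROUGE-like score, can be sketched with a toy REINFORCE update. This is purely illustrative (three candidate "sentences", a hand-picked reward per choice); the actual agent is the extractor policy trained by train_full_rl.py.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy setup: the agent picks one of three candidate sentences; the
# reward stands in for the ROUGE score of the resulting summary.
random.seed(0)
logits = [0.0, 0.0, 0.0]
rewards = [0.1, 0.9, 0.2]
lr = 0.5
for _ in range(300):
    probs = softmax(logits)
    action = random.choices(range(3), weights=probs)[0]
    # advantage = reward minus the expected reward (a simple baseline)
    advantage = rewards[action] - sum(p * r for p, r in zip(probs, rewards))
    # REINFORCE: grad of log pi(action) w.r.t. logits is onehot(action) - probs
    for i in range(3):
        grad = (1.0 if i == action else 0.0) - probs[i]
        logits[i] += lr * advantage * grad
print(softmax(logits))
```

After training, the policy concentrates its probability on the highest-reward choice, which is exactly how the ROUGE reward steers sentence selection.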
Finally, evaluate the model using the script decode_full_model.py:
!python decode_full_model.py --data <DATASET> --language <language> --batch <batch size>
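For intuition on the metrics reported below, here is a minimal unigram ROUGE-1 computation. This is only a sketch: the actual evaluation relies on pyrouge.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """ROUGE-1 recall, precision and F1 via unigram overlap
    (illustrative only; the repo evaluates with pyrouge)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return recall, precision, f1

r, p, f = rouge_1("profits grew in 2021", "profits grew strongly in 2021")
print(round(r, 3), round(p, 3), round(f, 3))  # → 0.8 1.0 0.889
```

ROUGE-2 and ROUGE-L follow the same overlap idea over bigrams and longest common subsequences, respectively.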
You should get the following results if you cut the documents according to their distribution:
Language | HypPar | R-type | R-1 | R-2 | R-L |
---|---|---|---|---|---|
English | n_hidden=128, n_LSTM=1, batch_size=4 | F1 | 0.332 | 0.118 | 0.326 |
 | | Recall | 0.315 | 0.117 | 0.309 |
Greek | n_hidden=256, n_LSTM=2, batch_size=4 | F1 | 0.489 | 0.311 | 0.479 |
 | | Recall | 0.227 | 0.140 | 0.224 |
Spanish | n_hidden=128, n_LSTM=1, batch_size=4 | F1 | 0.340 | 0.094 | 0.334 |
 | | Recall | 0.292 | 0.081 | 0.286 |