microsoft / table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.

License: MIT License

table-detection table-extraction table-structure-recognition table-functional-analysis

table-transformer's Introduction

Table Transformer (TATR)

A deep learning model based on object detection for extracting tables from PDFs and images.

First proposed in "PubTables-1M: Towards comprehensive table extraction from unstructured documents".


This repository also contains the official code for these papers:

  • "GriTS: Grid table similarity metric for table structure recognition"
  • "Aligning benchmark datasets for table structure recognition"

Note: If you are looking to use Table Transformer to extract your own tables, here are some helpful things to know:

  • TATR can be trained to work well across many document domains and everything needed to train your own model is included here. But at the moment pre-trained model weights are only available for TATR trained on the PubTables-1M dataset. (See the additional documentation for how to train your own multi-domain model.)
  • TATR is an object detection model that recognizes tables from image input. The inference code built on TATR needs text extraction (from OCR or directly from PDF) as a separate input in order to include text in its HTML or CSV output.

Additional information about this project for both users and researchers, including data, training, evaluation, and inference code is provided below.
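
For orientation, here is a minimal sketch (not the repository's inference code; the box format and the center-containment heuristic are illustrative assumptions) of how word boxes from OCR or a PDF can be combined with predicted row and column boxes to assemble CSV output:

import csv

def box_center(box):
    # box is assumed to be [xmin, ymin, xmax, ymax]
    x0, y0, x1, y1 = box
    return (x0 + x1) / 2, (y0 + y1) / 2

def words_to_csv(row_boxes, col_boxes, words, out_path):
    # row_boxes/col_boxes: predicted row and column boxes for one table
    # words: [{"bbox": [x0, y0, x1, y1], "text": str}, ...] from OCR or the PDF
    grid = [["" for _ in col_boxes] for _ in row_boxes]
    for word in words:
        cx, cy = box_center(word["bbox"])
        # Assign each word to the row/column whose box contains its center point.
        r = next((i for i, (_, y0, _, y1) in enumerate(row_boxes) if y0 <= cy <= y1), None)
        c = next((j for j, (x0, _, x1, _) in enumerate(col_boxes) if x0 <= cx <= x1), None)
        if r is not None and c is not None:
            grid[r][c] = (grid[r][c] + " " + word["text"]).strip()
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(grid)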

News

08/22/2023: We have released 3 new pre-trained models for TATR-v1.1 (trained on 1. PubTables-1M, 2. FinTabNet.c, and 3. both datasets combined) according to the details in our paper.
04/19/2023: Our latest papers (link and link) have been accepted at ICDAR 2023.
03/09/2023: We have added more image cropping to the official training script (like we do in our most recent paper) and updated the code and environment.yml to use Python 3.10.9, PyTorch 1.13.1, and Torchvision 0.14.1, among others.
03/07/2023: We have released a new simple inference pipeline for TATR. Now you can easily detect and recognize tables from images and convert them to HTML or CSV.
03/07/2023: We have released a collection of scripts to create training data for TATR and to canonicalize pre-existing datasets, such as FinTabNet and SciTSR.
03/01/2023: New paper "Aligning benchmark datasets for table structure recognition" is now available on arXiv.
11/25/2022: We have made the full PubTables-1M dataset alternatively available for download from Hugging Face.
05/05/2022: We have released the pre-trained weights for the table structure recognition model trained on PubTables-1M.
03/23/2022: Our paper "GriTS: Grid table similarity metric for table structure recognition" is now available on arXiv.
03/04/2022: We have released the pre-trained weights for the table detection model trained on PubTables-1M.
03/03/2022: "PubTables-1M: Towards comprehensive table extraction from unstructured documents" has been accepted at CVPR 2022.
11/21/2021: Our updated paper "PubTables-1M: Towards comprehensive table extraction from unstructured documents" is available on arXiv.
10/21/2021: The full PubTables-1M dataset has been officially released on Microsoft Research Open Data.
06/08/2021: Initial version of the Table Transformer (TATR) project is released.

PubTables-1M

The goal of PubTables-1M is to create a large, detailed, high-quality dataset for training and evaluating a wide variety of models for the tasks of table detection, table structure recognition, and functional analysis.

It contains:

  • 575,305 annotated document pages containing tables for table detection.
  • 947,642 fully annotated tables including text content and complete location (bounding box) information for table structure recognition and functional analysis.
  • Full bounding boxes in both image and PDF coordinates for all table rows, columns, and cells (including blank cells), as well as other annotated structures such as column headers and projected row headers.
  • Rendered images of all tables and pages.
  • Bounding boxes and text for all words appearing in each table and page image.
  • Additional cell properties not used in the current model training.

Additionally, cells in the headers are canonicalized and we implement multiple quality control steps to ensure the annotations are as free of noise as possible. For more details, please see our paper.

Pre-trained Model Weights

We provide different pre-trained models for table detection and table structure recognition.

Table Detection:

Model | Training Data | Model Card | File | Size
DETR R18 | PubTables-1M | Model Card | Weights | 110 MB

Table Structure Recognition:

Model | Training Data | Model Card | File | Size
TATR-v1.0 | PubTables-1M | Model Card | Weights | 110 MB
TATR-v1.1-Pub | PubTables-1M | Model Card | Weights | 110 MB
TATR-v1.1-Fin | FinTabNet.c | Model Card | Weights | 110 MB
TATR-v1.1-All | PubTables-1M + FinTabNet.c | Model Card | Weights | 110 MB

Evaluation Metrics

Table Detection:

Model | Test Data | AP50 | AP75 | AP | AR
DETR R18 | PubTables-1M | 0.995 | 0.989 | 0.970 | 0.985

Table Structure Recognition:

Model | Test Data | AP50 | AP75 | AP | AR | GriTS_Top | GriTS_Con | GriTS_Loc | Acc_Con
TATR-v1.0 | PubTables-1M | 0.970 | 0.941 | 0.902 | 0.935 | 0.9849 | 0.9850 | 0.9786 | 0.8243

Training and Evaluation Data

PubTables-1M is available for download from Microsoft Research Open Data.

We have also uploaded the full set of archives to Hugging Face.

The dataset on Microsoft Research Open Data comes in 5 tar.gz files:

  • PubTables-1M-Image_Page_Detection_PASCAL_VOC.tar.gz: Training and evaluation data for the detection model
    • /images: 575,305 JPG files; one file for each page image
    • /train: 460,589 XML files containing bounding boxes in PASCAL VOC format
    • /test: 57,125 XML files containing bounding boxes in PASCAL VOC format
    • /val: 57,591 XML files containing bounding boxes in PASCAL VOC format
  • PubTables-1M-Image_Page_Words_JSON.tar.gz: Bounding boxes and text content for all of the words in each page image
    • One JSON file per page image (plus some extra unused files)
  • PubTables-1M-Image_Table_Structure_PASCAL_VOC.tar.gz: Training and evaluation data for the structure (and functional analysis) model
    • /images: 947,642 JPG files; one file for each page image
    • /train: 758,849 XML files containing bounding boxes in PASCAL VOC format
    • /test: 93,834 XML files containing bounding boxes in PASCAL VOC format
    • /val: 94,959 XML files containing bounding boxes in PASCAL VOC format
  • PubTables-1M-Image_Table_Words_JSON.tar.gz: Bounding boxes and text content for all of the words in each cropped table image
    • One JSON file per cropped table image (plus some extra unused files)
  • PubTables-1M-PDF_Annotations_JSON.tar.gz: Detailed annotations for all of the tables appearing in the source PubMed PDFs. All annotations are in PDF coordinates.
    • 401,733 JSON files; one file per source PDF

To download from the command line:

  1. Visit the dataset home page with a web browser and click Download in the top left corner. This will create a link to download the dataset from Azure with a unique access token for you that looks like https://msropendataset01.blob.core.windows.net/pubtables1m?[SAS_TOKEN_HERE].
  2. You can then use the command line tool azcopy to download all of the files with the following command:
azcopy copy "https://msropendataset01.blob.core.windows.net/pubtables1m?[SAS_TOKEN_HERE]" "/path/to/your/download/folder/" --recursive

Then extract each of the archives from the command line using:

tar -xzvf yourfile.tar.gz
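
After extraction, the detection and structure annotations are standard PASCAL VOC XML files. The snippet below is a small illustrative sketch (the file path is hypothetical) for listing the labeled boxes in one annotation file:

import xml.etree.ElementTree as ET

def read_voc_boxes(xml_path):
    # Parse one PASCAL VOC annotation file and return (label, [xmin, ymin, xmax, ymax]) pairs.
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        box = [float(bb.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((name, box))
    return boxes

# Example (path is illustrative):
# for name, box in read_voc_boxes("train/PMC1234567_table_0.xml"):
#     print(name, box)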

Code Installation

Create a conda environment from the yml file and activate it as follows

conda env create -f environment.yml
conda activate tables-detr

Model Training

The code trains models for 2 different sets of table extraction tasks:

  1. Table Detection
  2. Table Structure Recognition + Functional Analysis

For a detailed description of these tasks and the models, please refer to the paper.

To train, you need to cd to the src directory and specify: 1. the path to the dataset, 2. the task (detection or structure), and 3. the path to the config file, which contains the hyperparameters for the architecture and training.

To train the detection model:

python main.py --data_type detection --config_file detection_config.json --data_root_dir /path/to/detection_data

To train the structure recognition model:

python main.py --data_type structure --config_file structure_config.json --data_root_dir /path/to/structure_data

Evaluation

The evaluation code computes standard object detection metrics (AP, AP50, etc.) for both the detection model and the structure model. When running evaluation for the structure model it also computes grid table similarity (GriTS) metrics for table structure recognition. GriTS is a measure of table cell correctness and is defined as the average correctness of each cell averaged over all tables. GriTS can measure the correctness of predicted cells based on: 1. cell topology alone, 2. cell topology and the reported bounding box location of each cell, or 3. cell topology and the reported text content of each cell. For more details on GriTS, please see our papers.
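
As a rough illustration of the aggregation step (a sketch of the description above, not the repository's evaluation code): once the ground truth grid A and the predicted grid B have been aligned on their most similar substructures, the per-cell similarity scores are summed and normalized by the total number of cells in both grids, giving an F-score-style value between 0 and 1.

def grits_score(cell_similarities, num_true_cells, num_pred_cells):
    # cell_similarities: per-cell similarity values from the best 2D alignment of the
    # ground truth and predicted grids (topology-, location-, or content-based).
    # The harmonic-mean-style normalization over both grid sizes is assumed here
    # for illustration; see the papers for the exact definition.
    return 2.0 * sum(cell_similarities) / (num_true_cells + num_pred_cells)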

To compute object detection metrics for the detection model:

python main.py --mode eval --data_type detection --config_file detection_config.json --data_root_dir /path/to/pascal_voc_detection_data --model_load_path /path/to/detection_model  

To compute object detection and GriTS metrics for the structure recognition model:

python main.py --mode eval --data_type structure --config_file structure_config.json --data_root_dir /path/to/pascal_voc_structure_data --model_load_path /path/to/structure_model --table_words_dir /path/to/json_table_words_data

Optionally, you can add flags to control parallelization, save detailed metrics, and save visualizations (an example command combining several of these follows the list of flags):
--device cpu: Change the default device from cuda to cpu.
--batch_size 4: Control the batch size to use during the forward pass of the model.
--eval_pool_size 4: Control the worker pool size for CPU parallelization during GriTS metric computation.
--eval_step 2: Control the number of batches of processed input data to accumulate before passing all samples to the parallelized worker pool for GriTS metric computation.
--debug: Create and save visualizations of the model inference. For each input image "PMC1234567_table_0.jpg", this will save two visualizations: "PMC1234567_table_0_bboxes.jpg" containing the bounding boxes output by the model, and "PMC1234567_table_0_cells.jpg" containing the final table cell bounding boxes after post-processing. By default these are saved to a new folder "debug" in the current directory.
--debug_save_dir /path/to/folder: Specify the folder to save visualizations to.
--test_max_size 500: Run evaluation on a randomly sampled subset of the data. Useful for quick verifications and checks.
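
For example, a CPU-only evaluation of the structure model on a small random subset, with visualizations saved to a custom folder, might look like this (paths are placeholders):

python main.py --mode eval --data_type structure --config_file structure_config.json --data_root_dir /path/to/pascal_voc_structure_data --model_load_path /path/to/structure_model --table_words_dir /path/to/json_table_words_data --device cpu --test_max_size 500 --debug --debug_save_dir /path/to/folder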

Fine-tuning and Other Model Training Scenarios

If model training is interrupted, it can be easily resumed by using the flag --model_load_path /path/to/model.pth and specifying the path to the saved dictionary file that contains the saved optimizer state.

If you want to restart training by fine-tuning a saved checkpoint, such as model_20.pth, use the flag --model_load_path /path/to/model_20.pth and the flag --load_weights_only to indicate that the previous optimizer state is not needed for resuming training.
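
For concreteness (paths are placeholders), resuming an interrupted structure-model training run might look like:

python main.py --data_type structure --config_file structure_config.json --data_root_dir /path/to/structure_data --model_load_path /path/to/model.pth

and fine-tuning from a saved checkpoint without restoring the optimizer state:

python main.py --data_type structure --config_file structure_config.json --data_root_dir /path/to/structure_data --model_load_path /path/to/model_20.pth --load_weights_only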

Whether fine-tuning or training a new model from scratch, you can optionally create a new config file with different training parameters than the default ones we used. Specify the new config file using: --config_file /path/to/new_structure_config.json. Creating a new config file is useful, for example, if you want to use a different learning rate lr during fine-tuning.

Alternatively, many of the arguments in the config file can be specified as command line arguments using their associated flags. Any argument specified as a command line argument overrides the value of the argument in the config file.
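
For example, assuming the learning rate is exposed as a --lr flag (as the paragraph above implies for config-file arguments), a fine-tuning run with a lower learning rate could be launched as follows (paths are placeholders):

python main.py --data_type structure --config_file /path/to/new_structure_config.json --data_root_dir /path/to/structure_data --model_load_path /path/to/model_20.pth --load_weights_only --lr 1e-5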

Citing

Our work can be cited using:

@software{smock2021tabletransformer,
  author = {Smock, Brandon and Pesala, Rohith},
  month = {06},
  title = {{Table Transformer}},
  url = {https://github.com/microsoft/table-transformer},
  version = {1.0.0},
  year = {2021}
}
@inproceedings{smock2022pubtables,
  title={Pub{T}ables-1{M}: Towards comprehensive table extraction from unstructured documents},
  author={Smock, Brandon and Pesala, Rohith and Abraham, Robin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages={4634--4642},
  year={2022},
  month={June}
}
@inproceedings{smock2023grits,
  title={Gri{TS}: Grid table similarity metric for table structure recognition},
  author={Smock, Brandon and Pesala, Rohith and Abraham, Robin},
  booktitle={International Conference on Document Analysis and Recognition},
  pages={535--549},
  year={2023},
  organization={Springer}
}
@inproceedings{smock2023aligning,
  title={Aligning benchmark datasets for table structure recognition},
  author={Smock, Brandon and Pesala, Rohith and Abraham, Robin},
  booktitle={International Conference on Document Analysis and Recognition},
  pages={371--386},
  year={2023},
  organization={Springer}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

table-transformer's People

Contributors

bsmock, microsoftopensource, rohithpv


table-transformer's Issues

About the logical structure of the cells

How can I convert the annotation of this dataset into a logical structural annotation (e.g., for each cell, get its start row, end row, start column and end column)?
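
(One possible approach, not taken from this repository: since the dataset annotates bounding boxes for cells, rows, and columns, a cell's grid span can be recovered by checking which row and column boxes it overlaps. A minimal sketch, assuming [xmin, ymin, xmax, ymax] boxes:)

def overlaps_vertically(cell, row):
    return cell[1] < row[3] and cell[3] > row[1]

def overlaps_horizontally(cell, col):
    return cell[0] < col[2] and cell[2] > col[0]

def cell_grid_span(cell_box, row_boxes, col_boxes):
    # Returns (start_row, end_row, start_col, end_col); assumes the cell overlaps
    # at least one annotated row box and one annotated column box.
    rows = [i for i, r in enumerate(row_boxes) if overlaps_vertically(cell_box, r)]
    cols = [j for j, c in enumerate(col_boxes) if overlaps_horizontally(cell_box, c)]
    return min(rows), max(rows), min(cols), max(cols)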

Fine-tuning Tutorial

I'm confused about how to fine-tune the model on a custom dataset for table structure recognition. I had a few questions regarding the fine-tuning process.

  1. What should the folder structure for the dataset be?
  2. Also, I tried to execute the main.py file and got the following error.

[error screenshot]

It would be really helpful if someone could provide a fine-tuning example on any sample dataset.
Thank You so much.
cc'ing @bsmock for visibility

CUBLAS_STATUS_EXECUTION_FAILED

I'm getting this error while running the eval script:

python main.py --mode eval --data_type structure --config_file structure_config.json --data_root_dir data/ --model_load_path data/model/structure.pth --debug
{'lr': 5e-05, 'lr_backbone': 1e-05, 'batch_size': 2, 'weight_decay': 0.0001, 'epochs': 20, 'lr_drop': 1, 'lr_gamma': 0.9, 'clip_max_norm': 0.1, 'backbone': 'resnet18', 'num_classes': 6, 'dilation': False, 'position_embedding': 'sine', 'emphasized_weights': {}, 'enc_layers': 6, 'dec_layers': 6, 'dim_feedforward': 2048, 'hidden_dim': 256, 'dropout': 0.1, 'nheads': 8, 'num_queries': 125, 'pre_norm': True, 'masks': False, 'aux_loss': False, 'mask_loss_coef': 1, 'dice_loss_coef': 1, 'ce_loss_coef': 1, 'bbox_loss_coef': 5, 'giou_loss_coef': 2, 'eos_coef': 0.4, 'set_cost_class': 1, 'set_cost_bbox': 5, 'set_cost_giou': 2, 'device': 'cuda', 'seed': 42, 'start_epoch': 0, 'num_workers': 2, 'data_root_dir': 'data/', 'config_file': 'structure_config.json', 'data_type': 'structure', 'model_load_path': 'data/model/structure.pth', 'metrics_save_filepath': '', 'table_words_dir': None, 'mode': 'eval', 'debug': True, 'checkpoint_freq': 1, '__module__': '__main__', '__dict__': <attribute '__dict__' of 'Args' objects>, '__weakref__': <attribute '__weakref__' of 'Args' objects>, '__doc__': None}
----------------------------------------------------------------------------------------------------
loading model
loading model from checkpoint
loading data
creating index...
index created!
Traceback (most recent call last):
  File "main.py", line 373, in <module>
    main()
  File "main.py", line 365, in main
    eval_coco(model, criterion, postprocessors, data_loader_test, dataset_test, device)
  File "/home/ali/AI/nexus/table-transformer/src/eval.py", line 653, in eval_coco
    device, None)
  File "/home/ali/AI/nexus/table-transformer/venv/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "../detr/engine.py", line 97, in evaluate
    outputs = model(samples)
  File "/home/ali/AI/nexus/table-transformer/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "../detr/models/detr.py", line 65, in forward
    hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]
  File "/home/ali/AI/nexus/table-transformer/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "../detr/models/transformer.py", line 56, in forward
    memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
  File "/home/ali/AI/nexus/table-transformer/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "../detr/models/transformer.py", line 78, in forward
    src_key_padding_mask=src_key_padding_mask, pos=pos)
  File "/home/ali/AI/nexus/table-transformer/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "../detr/models/transformer.py", line 183, in forward
    return self.forward_pre(src, src_mask, src_key_padding_mask, pos)
  File "../detr/models/transformer.py", line 171, in forward_pre
    key_padding_mask=src_key_padding_mask)[0]
  File "/home/ali/AI/nexus/table-transformer/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ali/AI/nexus/table-transformer/venv/lib/python3.7/site-packages/torch/nn/modules/activation.py", line 845, in forward
    attn_mask=attn_mask)
  File "/home/ali/AI/nexus/table-transformer/venv/lib/python3.7/site-packages/torch/nn/functional.py", line 3827, in multi_head_attention_forward
    q = linear(query, _w, _b)
  File "/home/ali/AI/nexus/table-transformer/venv/lib/python3.7/site-packages/torch/nn/functional.py", line 1612, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Support Distributed Data Parallel Implementation of DETR

DETR has the option to train on multiple machines with multiple GPUs. Right now the code can only train with one GPU.
One epoch on a Tesla V100 GPU, as described in the paper, takes approximately 3 hours. I tried to parallelize the code using the PyTorch DataParallel wrapper for the model as a quick fix, but could not make it work.

Is there a way to achieve faster training using multiple GPUs right now?
And if not, are you considering implementing something like DistributedDataParallel?

https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

Dependency versions are no longer available via pip

I was trying to install dependencies via pip and getting errors. Please upgrade the code and dependencies.

pip3 install pytorch==1.5.0
ERROR: Could not find a version that satisfies the requirement pytorch==1.5.0 (from versions: 0.1.2, 1.0.2)

pip3 install torchvision==0.6.0
ERROR: Could not find a version that satisfies the requirement torchvision==0.6.0 (from versions: 0.1.6, 0.1.7, 0.1.8, 0.1.9, 0.2.0, 0.2.1, 0.2.2, 0.2.2.post2, 0.2.2.post3, 0.8.2, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.11.1, 0.11.2, 0.11.3, 0.12.0, 0.13.0)
ERROR: No matching distribution found for torchvision==0.6.0

Prediction incorrect with increased image dimensions or high resolution

Thanks for the great work. I am trying to use custom data for table structure recognition, but the images are high resolution and the model does not perform well on them. To address this, I tried fine-tuning the Table Transformer model.
First, I changed the backbone architecture to resnet50 instead of resnet18, and then to resnet101 instead of resnet18. But those trained models do not perform as well as the resnet18 pre-trained model (with weights loaded as pre-trained).
Given the above, my question is:

  1. How many epochs are needed to train the model on custom data when using resnet50 or resnet101 as the backbone?

Inference on an individual image for table detection

Hi authors,
I would like to visualize the table detection result for a specific image. Which output in the code should I take and modify in order to get the coordinates of the predicted bounding boxes and visualize them on the inferred image?

Needleman-Wunsch algorithm

In your paper it was stated that you used the Needleman-Wunsch algorithm for sequence alignment. Can you please provide the code for that?

And can you please release a few samples of the dataset? The whole dataset is very large and not usable with limited resources, but it would be very helpful to get a sense of what the dataset looks like.

Problem with table structure inference when tables are very close to image borders

Hello,
I have trained the table structure algorithm for 14 epochs and managed to obtain acceptable results on your test data images. However, when I use the algorithm to perform inference on some table images of my own, I observe problems like the one below. This is a similar image to the one produced by your grits.py code, where all classes are plotted together:
PMC5730189_table_0_no_white_w_box_cropped

I believe the problem is related to the distance of the table itself from the image borders. If I perform inference on the same table but with a larger distance between the table and the image borders, these are the results:

PMC5730189_table_0_w_box_cropped

The table border and all rows and columns are much better predicted. The image used for the examples is PMC5730189_table_0 from your dataset.

The same happens for many other tables. Moreover, I looked at the xml files with the class labels and bounding box data, and a large percentage of the tables used for training (more than 95%) have a distance from the table border to the image border of almost 40 pixels, for all borders (top, bottom, left & right).

So I was wondering how the algorithm could be made more robust for these cases, in which I need to predict the table structure and the table border is really close to the image border (less than 5-10 pixels). Should I change something in the training? Or something else?

Thanks in advance,

How to handle "Rect" is not available in fitz

When the evaluation command "run main.py --mode eval --data_type detection --config_file detection_config.json --data_root_dir ../PubTables1M-Detection-PASCAL-VOC/ --model_load_path ./pubtables1m_detection_detr_r18.pth" is executed, the error log below is reported. How can we move on if we stay with Python 3.9?

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
File /workspace/externalhome/XAI/table-transformer/src/main.py:22, in <module>
     20 import table_datasets as TD
     21 from table_datasets import PDFTablesDataset
---> 22 from eval import eval_coco
     25 def get_args():
     26     parser = argparse.ArgumentParser()

File /workspace/externalhome/XAI/table-transformer/src/eval.py:21, in <module>
     19 import matplotlib.pyplot as plt
     20 import matplotlib.patches as patches
---> 21 from fitz import Rect
     22 from PIL import Image
     24 sys.path.append("../detr")

ImportError: cannot import name 'Rect' from 'fitz' (/opt/conda/lib/python3.9/site-packages/fitz/__init__.py)

Model Weights

Hi,
Thank you for the paper and data. Are there any plans to make the model weights available?

Thanks

Data error

In Image_Table_Structure_PASCAL_VOC.tar.gz, there is no val or test data; only train data is in the archive.

Visualize model predictions

I ran the pre-trained model in eval mode and got this output:

python main.py --mode eval --data_type structure --config_file structure_config.json --data_root_dir data/ --model_load_path data/model/structure.pth --debug --device cpu
{'lr': 5e-05, 'lr_backbone': 1e-05, 'batch_size': 2, 'weight_decay': 0.0001, 'epochs': 20, 'lr_drop': 1, 'lr_gamma': 0.9, 'clip_max_norm': 0.1, 'backbone': 'resnet18', 'num_classes': 6, 'dilation': False, 'position_embedding': 'sine', 'emphasized_weights': {}, 'enc_layers': 6, 'dec_layers': 6, 'dim_feedforward': 2048, 'hidden_dim': 256, 'dropout': 0.1, 'nheads': 8, 'num_queries': 125, 'pre_norm': True, 'masks': False, 'aux_loss': False, 'mask_loss_coef': 1, 'dice_loss_coef': 1, 'ce_loss_coef': 1, 'bbox_loss_coef': 5, 'giou_loss_coef': 2, 'eos_coef': 0.4, 'set_cost_class': 1, 'set_cost_bbox': 5, 'set_cost_giou': 2, 'device': 'cpu', 'seed': 42, 'start_epoch': 0, 'num_workers': 2, 'data_root_dir': 'data/', 'config_file': 'structure_config.json', 'data_type': 'structure', 'model_load_path': 'data/model/structure.pth', 'metrics_save_filepath': '', 'table_words_dir': None, 'mode': 'eval', 'debug': True, 'checkpoint_freq': 1, '__module__': '__main__', '__dict__': <attribute '__dict__' of 'Args' objects>, '__weakref__': <attribute '__weakref__' of 'Args' objects>, '__doc__': None}
----------------------------------------------------------------------------------------------------
loading model
loading model from checkpoint
loading data
creating index...
index created!
Test:  [0/1]  eta: 0:00:00  class_error: 0.00  loss: 0.3392 (0.3392)  loss_ce: 0.0231 (0.0231)  loss_bbox: 0.0250 (0.0250)  loss_giou: 0.2912 (0.2912)  loss_ce_unscaled: 0.0231 (0.0231)  class_error_unscaled: 0.0000 (0.0000)  loss_bbox_unscaled: 0.0050 (0.0050)  loss_giou_unscaled: 0.1456 (0.1456)  cardinality_error_unscaled: 0.0000 (0.0000)  time: 0.3716  data: 0.0614  max mem: 0
Test: Total time: 0:00:00 (0.3762 s / it)
Averaged stats: class_error: 0.00  loss: 0.3392 (0.3392)  loss_ce: 0.0231 (0.0231)  loss_bbox: 0.0250 (0.0250)  loss_giou: 0.2912 (0.2912)  loss_ce_unscaled: 0.0231 (0.0231)  class_error_unscaled: 0.0000 (0.0000)  loss_bbox_unscaled: 0.0050 (0.0050)  loss_giou_unscaled: 0.1456 (0.1456)  cardinality_error_unscaled: 0.0000 (0.0000)
Accumulating evaluation results...
DONE (t=0.01s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.619
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.750
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.629
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.619
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.281
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.506
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.638
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.638
pubmed: AP50: 0.750, AP75: 0.629, AP: 0.619, AR: 0.638

How can I visualize the model predictions on input images, like this?

Error when running in debug mode: Runtime Error: Expected all tensors to be on the same device, but found at least 2 devices, cuda:0 and cpu!

Dear authors,
I've just run the command below in debug mode in order to visualize the reconstruction result on a PDF file:

!python main.py --data_root_dir path/to/structure --model_load_path path/to/model --table_words_dir path/to/words --mode grits --metrics_save_filepath path/to/metrics_save_file --debug

And I ran into this bug. It says "RuntimeError: Expected all tensors to be on the same device, but found at least 2 devices, cuda:0 and cpu!"
[error screenshot]

I ran this on a Colab GPU runtime and this error occurred. When I tried to run in CPU-only mode, it said there is no GPU device.
I couldn't figure out what is causing this error. Could you help me identify where the problem is? Thanks for considering my request.

Homemade evaluation script not working properly + Eval dataset not available

Hi all,

I am very interested in your table detection model and wanted to check it out myself. I encountered several difficulties trying to do so and wanted to get some help.

1 - Eval dataset not available

I used an Azure VM to load the dataset and explore it. In your README.md, it is explicitly stated that the detection dataset is in PubTables-1M-Image_Page_Detection_PASCAL_VOC.tar.gz, and there should be 4 folders inside: images, train, test and val. However, when I opened the archive, there were only 2 folders, images and train, and three text files, train_filelist.txt, test_filelist.txt and val_filelist.txt, containing the paths to the XML annotation files.

test_filelist.txt and val_filelist.txt clearly reference files that should be in /test/ and /val/, even though those folders don't exist. I also checked whether the test and val annotations were simply placed in the train folder, and they are not.

I don't know where to find the test and val annotations, you've probably changed the dataset since the readme was written, and it would be nice to update it.

2 - Homemade inference script not working

Because I didn't have the eval dataset, I evaluated the detection model on some samples from the train dataset (I know, big warning because the model saw them during training, but I just wanted to see good results, since I was struggling to use the detection model).

Here is my code:
First, I instantiate the model and load the weights (which I downloaded through the link in the README.md):

import os
import xml.etree.ElementTree as ET
from PIL import Image, ImageDraw

from torchvision import transforms
import torchvision.transforms.functional as F

import torch
from detr.models.position_encoding import PositionEmbeddingSine
from detr.models.detr import DETR
from detr.models.transformer import Transformer
from detr.models.backbone import Backbone, Joiner

position_embedding = PositionEmbeddingSine(128)
backbone = Backbone("resnet18", False, False, False)
backbone_model = Joiner(backbone, position_embedding)
backbone_model.num_channels = backbone.num_channels
backbone = backbone_model

transformer = Transformer(
    d_model=256,
    dropout=0.1,
    nhead=8,
    dim_feedforward=2048,
    num_encoder_layers=6,
    num_decoder_layers=6,
    normalize_before=True,
    return_intermediate_dec=True,
)

model = DETR(
    backbone,
    transformer,
    num_classes=2,
    num_queries=15,
    aux_loss=False,
)

weights = torch.load("~/Projects/table-parsing/models/pubtables1m_detection_detr_r18.pth", map_location=torch.device('cpu'))
model.load_state_dict(weights)

I consider this part successful because I am greeted by an <All keys matched successfully> message. If I had instantiated the model incorrectly, I would see the usual Missing key(s) or Unexpected key(s) warnings from PyTorch.

Secondly, I created a simple pipeline to reproduce the image preprocessing done in the repo:

convert_tensor = transforms.ToTensor()
mean = torch.tensor([0.485, 0.456, 0.406])
std = torch.tensor([0.229, 0.224, 0.225])
final_size = 800
max_size = 1333

def detr_pipeline(image):

    # Resizing image
    w, h = image.size
    min_original_size = float(min((w, h)))
    max_original_size = float(max((w, h)))
    if max_original_size / min_original_size * final_size > max_size:
        size = int(round(max_size * min_original_size / max_original_size))
    else:
        size = final_size

    if (w <= h and w == size) or (h <= w and h == size):
        new_h, new_w = h, w
    elif w < h:
        new_w = size
        new_h = int(size * h / w)
    else:
        new_h = size
        new_w = int(size * w / h)

    rescaled_image = F.resize(image, (new_h, new_w))
    image_tensor = convert_tensor(rescaled_image)

    # Normalizing image
    image_tensor = image_tensor - torch.broadcast_to(mean.unsqueeze(-1).unsqueeze(-1), image_tensor.shape)
    image_tensor = image_tensor / torch.broadcast_to(std.unsqueeze(-1).unsqueeze(-1), image_tensor.shape)

    # Inference
    output = model([image_tensor])
    return output

The hardcoded means and stds come from detr.datasets.coco.make_coco_transforms

Finally, I used this pipeline to evaluate 20 examples from the training set

dataset_path = "~/Data/PubTables1M-Detection-PASCAL-VOC"
annotation_folder = "train"

train_annotations = []
with open(os.path.join(dataset_path, "train_filelist.txt")) as file:
    for line in file:
        train_annotations.append(line[:-1])

found_examples = 0
current = 0

while found_examples < 20:
    ann = train_annotations[current]
    current += 1
    xml_path = os.path.join(dataset_path, ann)
    assert os.path.isfile(xml_path), 'Annotation not found'
    data = ET.parse(xml_path)
    root = data.getroot()
    image_path = os.path.join(dataset_path, "images", root[1].text)
    if not os.path.isfile(image_path):
        print(f"Skipping {root[1].text}, as file doesn't exist")
        continue
    else:
        print(image_path)
    found_examples += 1
    with Image.open(image_path) as im:
        outputs = detr_pipeline(im)
        bboxes, logits = outputs['pred_boxes'], outputs['pred_logits']
        probas_per_class = logits.softmax(-1)[:, :, :-1]
        objects_to_keep = probas_per_class.max(-1).values > 0.5
        pred_boxes = bboxes[objects_to_keep]

        draw = ImageDraw.Draw(im)
        for elem in root:
            if elem.tag == "object":
                x0, y0, xmax, ymax = [float(i.text) for i in elem.getchildren()[-1].getchildren()]
                draw.rectangle(
                    (x0, y0, xmax, ymax),
                    outline="blue",
                    width=3,
                )
        for box in pred_boxes:
                centre_x, centre_y, width, height = box
                x0 = int(im.size[0] * (centre_x - width / 2))
                y0 = int(im.size[1] * (centre_y - height / 2))
                x1 = int(im.size[0] * (centre_x + width / 2))
                y1 = int(im.size[1] * (centre_y + height / 2))
                draw.rectangle(
                    [x0, y0, x1, y1],
                    outline="red",
                    width=3
                )
        im.save(os.path.join("~/Desktop/output/table", root[1].text))

Note that here I used a confidence threshold of 0.5, which is very low compared to some other DETR models, where a 0.9 confidence level is usually used. Hence I expect to see some false positives.
Also, I want to point out that there are many annotation files that reference an image that is not in the image folder (that's why I used a while loop and not a for loop).
But when I look at the results, none of them are correct. Here are a few samples (the annotations are in blue and the predictions are in red):

PMC6062540_3
PMC6620314_8
PMC6589332_11

It is very weird, considering the model saw these samples during training. I tried removing the preprocessing, but it doesn't change the results very much; they still look completely random. Could you please help me with this inference script? What am I doing wrong here?

BBox squeezed during inference

I tried inference on the images provided in your repo as well as other images. It works perfectly for the samples you provide, but the bounding boxes seem to be squeezed for my samples.
[screenshot of squeezed bounding boxes]

Canonicalization of column header

Hi, thanks for releasing the PubTables1M dataset. It took me a lot of time to clean the PubTabNet dataset, and the oversegmentation problem is probably the most tricky part. The release of PubTables1M not only increases the amount of data but also provides a good solution for the oversegmentation problem.

However, in Algorithm 1, step 10

for each cell in the column header do recursively merge the cell with any adjacent cells above and below in the column header that span the exact same columns

might lead to problems like:

  • Mistakenly merging nonblank cells in column header. For example, in PMC1064102_table_2:
    • nonblank cell 1-1 (text: None), blank cell 2-1 and nonblank cell 3-1 (text: 3 (4)b) are merged. However, None and 3 (4)b are not semantically coherent and they correspond to different row headers (Addition and Gene), so we should only merge nonblank cell 1-1 (text: None) and blank cell 2-1.
    • Similarly, blank cell 0-0, nonblank cell 1-0 (text: Addition...a), blank cell 2-0 and nonblank cell 3-0 (text: Gene) are merged, but we should only merge nonblank cell 1-0 (text: Addition...a) and blank cell 2-0.
  • The vanilla row just below the column header might be mistakenly recognized as part of the column header, and then it will be merged into the last row of the real column header under the rule of step 10. This might cause a significant mismatch between correct and wrong samples, since the visible border between the column header and the adjacent vanilla row is a strict rule for splitting cells. For example, in PMC1064102_table_0:
    • nonblank cell 0-0 (text: RNA no.) and nonblank cell 1-0 (text: 1) are mistakenly merged.

* Sorry I can not upload the images since I am using the company's network.

Column oversegmentation usually occurs in top-aligned spanning cells with one or zero text lines. Hence, it is helpful to merge a (nonblank or blank) cell with the blank cells below it, but I suspect it is not worthwhile to merge nonblank cells.

Besides, errors caused by step 10 cannot be easily corrected; maybe it should be removed from the algorithm?

Batched/parallel implementation of GriTS

Hi @bsmock @rohithpv ,

I was wondering if you explored implementing a batched/parallel version of GriTS. It might be quite helpful when tuning a bunch of model hyperparameters on a (relatively) large set of document images.

Or, if not, could you provide any suggestions on how to approach this? I'd be happy to try and work on a PR.

Adding Table Transformer models to HuggingFace Transformers

Hi Table Transformer team :)

As I've implemented DETR in 🤗 HuggingFace Transformers a few months ago, it was relatively straightforward to port the 2 checkpoints you released. Here's a notebook that illustrates inference with DETR for table detection and table structure recognition: https://colab.research.google.com/drive/1lLRyBr7WraGdUJm-urUm_utArw6SkoCJ?usp=sharing

As you may or may not know, any model on the HuggingFace hub has its own Github repository. E.g. the DETR-table-detection checkpoint can be found here: https://huggingface.co/nielsr/detr-table-detection. If you check the "files and versions" tab, it includes the weights. The model hub uses git-LFS (large file storage) to use Git with large files such as model weights. This means that any model has its own Git commit history!

A model card can also be added to the repo, which is just a README.

Are you interested in joining the Microsoft organization on the hub, such that we can store all model checkpoints there (rather than under my user name)?

Also, it would be great to add PubTables-1M (and potentially other datasets, useful for improving AI on unstructured documents) to the 🤗 hub. Would you be up for that?

Let me know!

Kind regards,

Niels
ML Engineer @ HuggingFace

Thoughts about TD + TSR using a single model

@bsmock @rohithpv

The existing approach described in the paper trains two models, one for each for the tasks: table detection and table structure recognition. Did you also explore performing TSR directly given an input (document) image, instead of using the cropped table provided by TD?

Ignoring the cases where a single image has multiple tables, do you have thoughts on what the pros/cons are for such a model?

Thanks!

How to develop my own datasets according to your methods?

Hi authors,
Thanks for your great work! I want to develop a program to generate my own TSR datasets from Word files, but I don't have enough experience to do it. So I want to ask for your help in developing my own datasets. It would be appreciated if you could release the code for the process used to develop PubTables-1M, or offer any other help. Thank you again!

Typo in readme

PubTables-1M-Image_Page_Words_JSON.tar.gz: Bounding boxes and text content for all of the words in each cropped table image

This should say page images, not table images.

The dataset home page is not working

The website kept saying "Loading Dataset Details..." and didn't return anything even after waiting a long time. Another weird problem is that I can't see any datasets or use the search function on msropendata.com. I tried to file an issue on msropendata.com, but it could not be submitted. Is the backend server of msropendata down, or is it something else?
Thanks for your help!

Simple inference code

Hi! I have an image containing a table and I want to try the pretrained model for table structure recognition. I am unable to download the whole PubTables dataset since it is too big. What can I do to run a simple inference?

Table detection coco format

Hi,
I want to use the PubTables dataset for table detection with DETR, but this dataset is in XML format. Can I get it in COCO format?

TypeError: 'numpy.float64' object cannot be interpreted as an integer during evaluation

Hi,

I encountered this numpy type error during the evaluation phase. Any idea how to fix this?


How to reproduce the error

(env)$ python main.py 
  --data_type detection 
  --config_file detection_config.json 
  --data_root_dir ~/../pubtables/PubTables1M-Detection-PASCAL-VOC/ 

Error Message

{'lr': 5e-05, 'lr_backbone': 1e-05, 'batch_size': 2, 'weight_decay': 0.0001, 'epochs': 20, 'lr_drop': 1, 'lr_gamma': 0.9, 'clip_max_norm': 0.1, 'backbone': 'resnet18', 'num_classes': 2, 'dilation': False, 'position_
embedding': 'sine', 'emphasized_weights': {}, 'enc_layers': 6, 'dec_layers': 6, 'dim_feedforward': 2048, 'hidden_dim': 256, 'dropout': 0.1, 'nheads': 8, 'num_queries': 15, 'pre_norm': True, 'masks': False, 'aux_loss
': False, 'mask_loss_coef': 1, 'dice_loss_coef': 1, 'ce_loss_coef': 1, 'bbox_loss_coef': 5, 'giou_loss_coef': 2, 'eos_coef': 0.4, 'set_cost_class': 1, 'set_cost_bbox': 5, 'set_cost_giou': 2, 'device': 'cuda', 'seed'
: 42, 'start_epoch': 0, 'num_workers': 1, 'data_root_dir': '/home/lxyuan/../pubtables/PubTables1M-Detection-PASCAL-VOC/', 'config_file': 'detection_config.json', 'data_type': 'detection', 'model_load_path': None, 'l
oad_weights_only': False, 'model_save_dir': None, 'metrics_save_filepath': '', 'debug_save_dir': 'debug', 'table_words_dir': None, 'mode': 'train', 'debug': False, 'checkpoint_freq': 1, 'train_max_size': None, 'val_
max_size': None, 'test_max_size': None, 'eval_pool_size': 1, 'eval_step': 1, '__module__': '__main__', '__dict__': <attribute '__dict__' of 'Args' objects>, '__weakref__': <attribute '__weakref__' of 'Args' objects>
, '__doc__': None}
----------------------------------------------------------------------------------------------------
loading model
loading data
loading data
creating index...
index created!
finished loading data in : 0:00:04.291752
Max batches per epoch: 230294
Output directory:  /home/lxyuan/../pubtables/PubTables1M-Detection-PASCAL-VOC/output/20220815202559
Output model path:  /home/lxyuan/../pubtables/PubTables1M-Detection-PASCAL-VOC/output/20220815202559/model.pth
Start training
----------------------------------------------------------------------------------------------------
Epoch: [0]  [     0/230294]  eta: 21:14:22  lr: 0.000050  class_error: 33.33  loss: 7.6202 (7.6202)  loss_ce: 1.3217 (1.3217)  loss_bbox: 4.0440 (4.0440)  loss_giou: 2.2545 (2.2545)  loss_ce_unscaled: 1.3217 (1.3217
)  class_error_unscaled: 33.3333 (33.3333)  loss_bbox_unscaled: 0.8088 (0.8088)  loss_giou_unscaled: 1.1273 (1.1273)  cardinality_error_unscaled: 12.5000 (12.5000)  time: 0.3320  data: 0.1073  max mem: 796
Epoch: [0]  [  1000/230294]  eta: 5:37:18  lr: 0.000050  class_error: 100.00  loss: 2.3491 (3.9134)  loss_ce: 0.4271 (0.4534)  loss_bbox: 1.0936 (2.2053)  loss_giou: 0.8212 (1.2548)  loss_ce_unscaled: 0.4271 (0.4534
)  class_error_unscaled: 100.0000 (96.9789)  loss_bbox_unscaled: 0.2187 (0.4411)  loss_giou_unscaled: 0.4106 (0.6274)  cardinality_error_unscaled: 1.0000 (1.0919)  time: 0.0870  data: 0.0046  max mem: 1393
Epoch: [0]  [  2000/230294]  eta: 5:37:05  lr: 0.000050  class_error: 100.00  loss: 3.2153 (3.2962)  loss_ce: 0.3987 (0.4452)  loss_bbox: 1.6650 (1.7887)  loss_giou: 1.0126 (1.0623)  loss_ce_unscaled: 0.3987 (0.4452
)  class_error_unscaled: 100.0000 (94.8372)  loss_bbox_unscaled: 0.3330 (0.3577)  loss_giou_unscaled: 0.5063 (0.5312)  cardinality_error_unscaled: 1.0000 (1.0160)  time: 0.0845  data: 0.0045  max mem: 1393
Epoch: [0]  [  3000/230294]  eta: 5:35:23  lr: 0.000050  class_error: 100.00  loss: 2.4226 (2.9530)  loss_ce: 0.3809 (0.4328)  loss_bbox: 1.1276 (1.5422)  loss_giou: 0.8819 (0.9780)  loss_ce_unscaled: 0.3809 (0.4328
)  class_error_unscaled: 100.0000 (92.2951)  loss_bbox_unscaled: 0.2255 (0.3084)  loss_giou_unscaled: 0.4409 (0.4890)  cardinality_error_unscaled: 1.0000 (0.9888)  time: 0.0883  data: 0.0045  max mem: 1393
Epoch: [0]  [  4000/230294]  eta: 5:35:24  lr: 0.000050  class_error: 0.00  loss: 1.8408 (2.7103)  loss_ce: 0.3210 (0.4222)  loss_bbox: 0.7209 (1.3707)  loss_giou: 0.6109 (0.9174)  loss_ce_unscaled: 0.3210 (0.4222)
 class_error_unscaled: 50.0000 (89.8711)  loss_bbox_unscaled: 0.1442 (0.2741)  loss_giou_unscaled: 0.3055 (0.4587)  cardinality_error_unscaled: 0.5000 (0.9609)  time: 0.0906  data: 0.0049  max mem: 1393
Epoch: [0]  [  5000/230294]  eta: 5:34:34  lr: 0.000050  class_error: 0.00  loss: 2.0806 (2.5365)  loss_ce: 0.3440 (0.4120)  loss_bbox: 0.7721 (1.2546)  loss_giou: 0.7145 (0.8699)  loss_ce_unscaled: 0.3440 (0.4120)
 class_error_unscaled: 75.0000 (86.8418)  loss_bbox_unscaled: 0.1544 (0.2509)  loss_giou_unscaled: 0.3572 (0.4349)  cardinality_error_unscaled: 0.5000 (0.9352)  time: 0.0928  data: 0.0048  max mem: 1393
Epoch: [0]  [  6000/230294]  eta: 5:33:42  lr: 0.000050  class_error: 50.00  loss: 1.5561 (2.4004)  loss_ce: 0.3442 (0.4008)  loss_bbox: 0.5955 (1.1669)  loss_giou: 0.5303 (0.8327)  loss_ce_unscaled: 0.3442 (0.4008)
  class_error_unscaled: 66.6667 (82.8963)  loss_bbox_unscaled: 0.1191 (0.2334)  loss_giou_unscaled: 0.2652 (0.4163)  cardinality_error_unscaled: 0.5000 (0.8982)  time: 0.0910  data: 0.0048  max mem: 1393
Epoch: [0]  [  7000/230294]  eta: 5:32:53  lr: 0.000050  class_error: 100.00  loss: 1.9024 (2.2844)  loss_ce: 0.2432 (0.3884)  loss_bbox: 0.6760 (1.0965)  loss_giou: 0.6833 (0.7995)  loss_ce_unscaled: 0.2432 (0.3884
)  class_error_unscaled: 50.0000 (79.1719)  loss_bbox_unscaled: 0.1352 (0.2193)  loss_giou_unscaled: 0.3416 (0.3998)  cardinality_error_unscaled: 0.5000 (0.8579)  time: 0.0856  data: 0.0047  max mem: 1393
Epoch: [0]  [  8000/230294]  eta: 5:31:30  lr: 0.000050  class_error: 50.00  loss: 1.3197 (2.1904)  loss_ce: 0.2045 (0.3753)  loss_bbox: 0.5773 (1.0416)  loss_giou: 0.6363 (0.7734)  loss_ce_unscaled: 0.2045 (0.3753)
  class_error_unscaled: 33.3333 (75.1935)  loss_bbox_unscaled: 0.1155 (0.2083)  loss_giou_unscaled: 0.3182 (0.3867)  cardinality_error_unscaled: 0.0000 (0.8116)  time: 0.0903  data: 0.0047  max mem: 1393
Epoch: [0]  [  9000/230294]  eta: 5:30:25  lr: 0.000050  class_error: 100.00  loss: 1.2540 (2.1009)  loss_ce: 0.2317 (0.3612)  loss_bbox: 0.4740 (0.9915)  loss_giou: 0.5079 (0.7482)  loss_ce_unscaled: 0.2317 (0.3612
)  class_error_unscaled: 50.0000 (71.3004)  loss_bbox_unscaled: 0.0948 (0.1983)  loss_giou_unscaled: 0.2539 (0.3741)  cardinality_error_unscaled: 0.5000 (0.7655)  time: 0.0909  data: 0.0048  max mem: 1393

<truncated>

Epoch: [0]  [230293/230294]  eta: 0:00:00  lr: 0.000050  class_error: 0.00  loss: 0.2878 (0.4740)  loss_ce: 0.0005 (0.0355)  loss_bbox: 0.1188 (0.2152)  loss_giou: 0.1408 (0.2233)  loss_ce_unscaled: 0.0005 (0.0355)
 class_error_unscaled: 0.0000 (4.7529)  loss_bbox_unscaled: 0.0238 (0.0430)  loss_giou_unscaled: 0.0704 (0.1116)  cardinality_error_unscaled: 0.0000 (0.0790)  time: 0.0888  data: 0.0057  max mem: 1393
Epoch: [0] Total time: 5:45:45 (0.0901 s / it)
Averaged stats: lr: 0.000050  class_error: 0.00  loss: 0.2878 (0.4740)  loss_ce: 0.0005 (0.0355)  loss_bbox: 0.1188 (0.2152)  loss_giou: 0.1408 (0.2233)  loss_ce_unscaled: 0.0005 (0.0355)  class_error_unscaled: 0.00
00 (4.7529)  loss_bbox_unscaled: 0.0238 (0.0430)  loss_giou_unscaled: 0.0704 (0.1116)  cardinality_error_unscaled: 0.0000 (0.0790)
Epoch completed in  5:45:45.451181
    main()
  File "/home/lxyuan/playground/table-transformer/src/main.py", line 368, in main
    train(args, model, criterion, postprocessors, device)
  File "/home/lxyuan/playground/table-transformer/src/main.py", line 317, in train
    pubmed_stats, coco_evaluator = evaluate(model, criterion,
  File "/home/lxyuan/playground/table-transformer/env/lib64/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/lxyuan/playground/table-transformer/src/../detr/engine.py", line 81, in evaluate
    coco_evaluator = CocoEvaluator(base_ds, iou_types)
  File "/home/lxyuan/playground/table-transformer/src/../detr/datasets/coco_eval.py", line 31, in __init__
    self.coco_eval[iou_type] = COCOeval(coco_gt, iouType=iou_type)
  File "/home/lxyuan/playground/table-transformer/env/lib64/python3.9/site-packages/pycocotools/cocoeval.py", line 76, in __init__
    self.params = Params(iouType=iouType) # parameters
  File "/home/lxyuan/playground/table-transformer/env/lib64/python3.9/site-packages/pycocotools/cocoeval.py", line 527, in __init__
    self.setDetParams()
  File "/home/lxyuan/playground/table-transformer/env/lib64/python3.9/site-packages/pycocotools/cocoeval.py", line 507, in setDetParams
    self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)
  File "<__array_function__ internals>", line 180, in linspace
  File "/home/lxyuan/playground/table-transformer/env/lib64/python3.9/site-packages/numpy/core/function_base.py", line 120, in linspace
    num = operator.index(num)
TypeError: 'numpy.float64' object cannot be interpreted as an integer

It seems like I was able to complete one training epoch but got the numpy error message when trying to evaluate model performance on the validation set (i.e., src/main.py:L317).


Similar error when I tried to use main.py to evaluate model performance directly.

How to reproduce the error

(env)$ python main.py 
  --mode eval 
  --data_type detection 
  --config_file detection_config.json
  --data_root_dir ~/../pubtables/PubTables1M-Detection-PASCAL-VOC/ 
  --model_load_path ../pretrained_models/pubtables1m_detection_detr_r18.pth

Error Message

{'lr': 5e-05, 'lr_backbone': 1e-05, 'batch_size': 2, 'weight_decay': 0.0001, 'epochs': 20, 'lr_drop': 1, 'lr_gamma': 0.9, 'clip_max_norm': 0.1, 'backbone': 'resnet18', 'num_classes': 2, 'dilation': False, 'pos
ition_embedding': 'sine', 'emphasized_weights': {}, 'enc_layers': 6, 'dec_layers': 6, 'dim_feedforward': 2048, 'hidden_dim': 256, 'dropout': 0.1, 'nheads': 8, 'num_queries': 15, 'pre_norm': True, 'masks': Fals
e, 'aux_loss': False, 'mask_loss_coef': 1, 'dice_loss_coef': 1, 'ce_loss_coef': 1, 'bbox_loss_coef': 5, 'giou_loss_coef': 2, 'eos_coef': 0.4, 'set_cost_class': 1, 'set_cost_bbox': 5, 'set_cost_giou': 2, 'devic
e': 'cuda', 'seed': 42, 'start_epoch': 0, 'num_workers': 1, 'data_root_dir': '/home/lxyuan/mini-pubtables/PubTables1M-Dectection-PASCAL-VOC/', 'config_file': 'detection_config.json', 'data_type': 'detection',
'model_load_path': '../pretrained_models/pubtables1m_detection_detr_r18.pth', 'load_weights_only': False, 'model_save_dir': None, 'metrics_save_filepath': '', 'debug_save_dir': 'debug', 'table_words_dir': None
, 'mode': 'eval', 'debug': False, 'checkpoint_freq': 1, 'train_max_size': None, 'val_max_size': None, 'test_max_size': None, 'eval_pool_size': 1, 'eval_step': 1, '__module__': '__main__', '__dict__': <attribut
e '__dict__' of 'Args' objects>, '__weakref__': <attribute '__weakref__' of 'Args' objects>, '__doc__': None}
----------------------------------------------------------------------------------------------------
loading model
loading model from checkpoint
loading data
creating index...
index created!
Traceback (most recent call last):
  File "/home/lxyuan/playground/table-transformer/src/main.py", line 375, in <module>
    main()
  File "/home/lxyuan/playground/table-transformer/src/main.py", line 371, in main
    eval_coco(args, model, criterion, postprocessors, data_loader_test, dataset_test, device)
  File "/home/lxyuan/playground/table-transformer/src/eval.py", line 693, in eval_coco
    pubmed_stats, coco_evaluator = evaluate(args, model, criterion, postprocessors,
  File "/home/lxyuan/playground/table-transformer/env/lib64/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/lxyuan/playground/table-transformer/src/eval.py", line 586, in evaluate
    coco_evaluator = CocoEvaluator(base_ds, iou_types)
  File "/home/lxyuan/playground/table-transformer/src/../detr/datasets/coco_eval.py", line 31, in __init__
    self.coco_eval[iou_type] = COCOeval(coco_gt, iouType=iou_type)
  File "/home/lxyuan/playground/table-transformer/env/lib64/python3.9/site-packages/pycocotools/cocoeval.py", line 76, in __init__
    self.params = Params(iouType=iouType) # parameters
  File "/home/lxyuan/playground/table-transformer/env/lib64/python3.9/site-packages/pycocotools/cocoeval.py", line 527, in __init__
    self.setDetParams()
  File "/home/lxyuan/playground/table-transformer/env/lib64/python3.9/site-packages/pycocotools/cocoeval.py", line 507, in setDetParams
    self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)
  File "<__array_function__ internals>", line 180, in linspace
  File "/home/lxyuan/playground/table-transformer/env/lib64/python3.9/site-packages/numpy/core/function_base.py", line 120, in linspace
    num = operator.index(num)
TypeError: 'numpy.float64' object cannot be interpreted as an integer

NOTE: I am using numpy==1.23.2 and python3.9

Error after one epoch of training is completed

I trained the model for the structure task on the PubTables dataset; however, I got the error below at the end of the first epoch. Any clues why this is happening?

Epoch completed in 18:21:31.466229
Traceback (most recent call last):
  File "main.py", line 373, in <module>
    main()
  File "main.py", line 362, in main
    train(args, model, criterion, postprocessors, device)
  File "main.py", line 319, in train
    pubmed_stats, coco_evaluator = evaluate(model, criterion,
  File "/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/table-transformer/src/../detr/engine.py", line 81, in evaluate
    coco_evaluator = CocoEvaluator(base_ds, iou_types)
  File "/table-transformer/src/../detr/datasets/coco_eval.py", line 31, in __init__
    self.coco_eval[iou_type] = COCOeval(coco_gt, iouType=iou_type)
  File "/lib/python3.8/site-packages/pycocotools/cocoeval.py", line 76, in __init__
    self.params = Params(iouType=iouType) # parameters
  File "/lib/python3.8/site-packages/pycocotools/cocoeval.py", line 527, in __init__
    self.setDetParams()
  File "/lib/python3.8/site-packages/pycocotools/cocoeval.py", line 507, in setDetParams
    self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)
  File "<__array_function__ internals>", line 5, in linspace
  File "/lib/python3.8/site-packages/numpy/core/function_base.py", line 120, in linspace
    num = operator.index(num)
TypeError: 'numpy.float64' object cannot be interpreted as an integer

Definition of GriTS

In the paper, GriTS is defined as a fraction of inverses (a harmonic mean). I doubt the validity of that form.

  1. If A and B are very dissimilar, they can get f(...) = 0 for every cell, in which case the expression becomes undefined (division by zero).
  2. Following from 1, a dissimilar table pair could end up with a higher GriTS than a similar pair. If so, calling GriTS a 'similarity' seems odd.
  3. If GriTS-Recall = f(...)/|A| and GriTS-Precision = f(...)/|B|, then the F-score should be 2/(Recall^-1 + Precision^-1) = 2/(|A|/f(...) + |B|/f(...)) = 2*f(...)/(|A| + |B|).

Hence, I suspect the inverses should not be taken in the definition. How is this implemented in the code?

(screenshot of the definition from the paper)
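
For reference, a minimal sketch of the F-score form derived in point 3 above; this is only an illustration of the algebra, not the repository's actual GriTS implementation:

def grits_fscore(cell_scores, num_cells_a, num_cells_b):
    """F-score form from point 3: 2 * sum(f) / (|A| + |B|).
    Unlike the harmonic-mean form 2 / (|A|/sum(f) + |B|/sum(f)),
    it stays well defined (and equals 0) when every per-cell score is 0."""
    return 2.0 * sum(cell_scores) / (num_cells_a + num_cells_b)

# A very dissimilar pair where every per-cell score is 0:
print(grits_fscore([0.0, 0.0, 0.0], num_cells_a=3, num_cells_b=4))  # 0.0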

ModuleNotFoundError: No module named 'engine'

I tried the detection model evaluation code and I get the error "ModuleNotFoundError: No module named 'engine'" on the line "from engine import evaluate, train_one_epoch". How can this be fixed?
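
This usually indicates that the DETR code, which provides engine.py, is not importable from where the script is run. As a hypothetical workaround sketch (assuming the detr folder sits next to src, as the src/../detr paths in the tracebacks above suggest), the folder can be put on sys.path before the import:

import os
import sys

# Hypothetical workaround: engine.py comes from the DETR code in the sibling
# "detr" folder; make sure that folder is importable before
# "from engine import ..." runs.
DETR_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "detr")
sys.path.insert(0, DETR_DIR)

from engine import evaluate, train_one_epoch  # noqa: E402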

Couldn't download PubTables-1M dataset

I'm stuck trying to download the PubTables-1M dataset.
I followed the instructions on this GitHub page for downloading the PubTables-1M dataset.
However, the dataset is locked. (The situation is as in the picture below.)
(screenshot: the dataset page shown as locked)
Also, I couldn't log in to the page to download the dataset. (The error is as in the picture below.)
(screenshot: the login error)
Please let me know how to solve this problem.
Thanks.

Question on post-processing table structure with text bounding boxes

Hello,
I am working with the table structure detection model, using it over table images. I extract the structure and the text, using CRAFT for the detection of the text bounding boxes and the table-transformer model for the table structure. To post-process the table structure prediction I use the text bounding boxes with the postprocess functions.

I encounter the following problem with this approach. For some table images in which the text in a cell is a single character, CRAFT commonly detects those individual characters together, producing large text bounding boxes like in the image below (second column).
(screenshot: 22_07_28_18_17_20_high)

The issue is that when I use these bounding boxes, some of the predicted rows are enlarged so that they contain these large OCR bounding boxes. The image below shows the raw predicted rows, without any postprocessing.

(screenshot: Empty table-07_in_table row)

As you can see, the predicted rows are accurate. But when I take the predicted table structure and combine it with the OCR bounding boxes, using the postprocess module and the objects_to_cells function, the rows transform to this:
(screenshot: Empty table-07_out_rows)

I hope it is visible that there is a green dotted row that stretches from the B to the H characters, enclosing exactly that text bounding box. I have been looking into this problem and it seems to originate in the table_structure_to_cells function, in lines 810-844 of the postprocess module.

I was wondering if you could suggest a way to improve the postprocessing operations so this does not occur, perhaps by adding a further postprocessing step or modifying those lines of code. If you know of a text detection algorithm that works better than CRAFT, I am also interested.
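
For illustration, one possible extra step (a hypothetical sketch, not part of the repository's postprocess module) is to split any OCR box that spans more than one predicted column before passing it to objects_to_cells, so that a merged CRAFT box cannot stretch a single row:

def split_boxes_spanning_columns(text_boxes, column_boxes):
    """Split OCR boxes [xmin, ymin, xmax, ymax] that horizontally overlap more
    than one predicted column into one piece per column, clipped to that column."""
    result = []
    for tb in text_boxes:
        overlapping = [cb for cb in column_boxes if tb[0] < cb[2] and cb[0] < tb[2]]
        if len(overlapping) <= 1:
            result.append(tb)
            continue
        for cb in overlapping:
            result.append([max(tb[0], cb[0]), tb[1], min(tb[2], cb[2]), tb[3]])
    return result

Each clipped piece then lies within a single column, so no predicted row has to be enlarged to cover the full merged box.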

Many thanks in advance


Structure recognition results from the provided weights are weird

Hi, thanks for releasing the data, codes and weights!

But when I ran the TSR inference code with the provided weights on the PubTables-1M dataset, it gives me no-object table objects on almost all tables, like this (the gray boxes are no-object boxes):
(screenshot of the predictions)

I tested in Docker and created a conda environment using the provided yml file.

Any suggestions?

Input for TSR model?

Hi @bsmock,

Using the Hugging Face Colab notebook, the tables were being detected flawlessly. However, when I applied TSR to the entire PDF-page image, I got this: it tries to identify rows even in non-table zones.
(screenshot: TSR output on the full page image)
And then when I tried to pass only the table image, it misses the 4 edges of the table:
(screenshot: TSR output on the cropped table image)

Am I missing something here?
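
For context, a minimal sketch of the kind of cropping step typically used between detection and structure recognition; the padding value here is an arbitrary assumption, not a setting from this repository. The idea is to crop the detected table with a small margin rather than passing the full page or a tight crop:

from PIL import Image

def crop_table(page_image: Image.Image, table_bbox, padding: int = 25):
    """Crop the detected table region with a small margin so that the outer
    rows and columns are not cut off before structure recognition."""
    xmin, ymin, xmax, ymax = table_bbox
    width, height = page_image.size
    return page_image.crop((
        max(0, int(xmin) - padding),
        max(0, int(ymin) - padding),
        min(width, int(xmax) + padding),
        min(height, int(ymax) + padding),
    ))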

Also, how would you suggest the post-processing from postprocessing.py should work? Are there any particular steps you used to obtain a table in a structured format?

Many thanks in advance.

Cell predictions do not match the text bboxes

Hi,
I used one of the checkpoints (model_20.pth provided at this link: https://drive.google.com/drive/folders/1Ko4Trk48u99AAPNU41RcUKAoMP0BoDmU) and tested it on some of the images (PMC1064078_table_0.jpg) in the PubTables-1M dataset. However, it seems the predictions are not correct, i.e., the predicted cell bboxes deviate from the correct bboxes. The predicted bboxes are obtained from this function: https://github.com/microsoft/table-transformer/blob/main/src/eval.py#L490.
I added some padding to the image, but it did not help to account for the shifted cells. Do we need to do any preprocessing to the image, or something else?

Thanks.

(screenshot: PMC1064078_table_0)

Fine-tune the table-transformer model.

I'm new to this field and was wondering if someone could help me with a question: how do I annotate images and fine-tune the Microsoft Table Transformer model on a custom dataset?

Thank you so much in advance.

Is there a smaller dataset?

Hi, the provided datasets are too large for my limited resources. Could you provide a smaller dataset? Thank you!
