tablesense's Introduction

TableSense: Spreadsheet table detection with convolutional neural networks

Spreadsheet table detection is the task of detecting all tables on a given sheet and locating their respective ranges. Automatic table detection is a key enabling technique and an initial step in spreadsheet data intelligence. To enable data-driven models, we annotated a large number of table ranges on real spreadsheet data. Our annotations are based on three public datasets (VEnron2, VEUSES, and VFUSE), which are widely used in the spreadsheet domain. To avoid duplicated labeling effort on near-identical spreadsheets, we use the published versions of these datasets, in which SpreadCluster has already clustered similar sheets:

  1. VEnron2 is built on the Enron email archive by SpreadCluster (MSR 2017). It contains 1,609 evolution groups and 12,254 spreadsheets.
  2. VEUSES is built on EUSES by SpreadCluster (MSR 2017). It contains 177 evolution groups and 363 spreadsheets.
  3. VFUSE is built on FUSE by SpreadCluster (MSR 2017). It contains 188 evolution groups and 1,143 spreadsheets.
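Because these datasets already group similar spreadsheets into evolution groups, near-duplicates can be avoided by keeping a single representative per group. The following is a minimal sketch of that idea, not the authors' actual tooling. It assumes, hypothetically, that each evolution group is stored in its own folder, and it picks the lexicographically first file as the representative, which is an arbitrary choice. The function name and example paths are made up for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def pick_representatives(paths):
    """Return one file per evolution group.

    Assumption: the parent folder of a file identifies its evolution group.
    The representative is the lexicographically smallest path in the group.
    """
    groups = defaultdict(list)
    for p in paths:
        path = PurePosixPath(p)
        groups[path.parent.as_posix()].append(path.as_posix())
    return {group: min(files) for group, files in groups.items()}


# Hypothetical file listing; real folder and file names will differ.
example_paths = [
    "VEnron2/1027/a_v1.xlsx",
    "VEnron2/1027/a_v2.xlsx",
    "VEnron2/1028/b_v1.xlsx",
]
representatives = pick_representatives(example_paths)
```

Any other selection rule (newest version, largest file) could be substituted for `min` without changing the grouping logic.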

Note that the WebSheet dataset introduced by TableSense must clear compliance issues before it can be published, so we are first publishing the annotations for VEnron2, VEUSES, and VFUSE to facilitate current research.

To process the raw Excel files, we first converted them from .xls to .xlsx. Second, we tried to read and extract features from these files using ClosedXML, and excluded the files that failed to convert or process. We then selected one file from each cluster and, for files containing multiple sheets, labeled only the first two sheets. Every sheet was labeled and checked by at least two people, and we excluded cases on which the annotators disagreed.

In total, we obtained 2,615 tables from 1,645 spreadsheets. Since VEnron2 has the greatest number of clusters, it contributes the most annotated table ranges. The annotation schema looks like the following example:

| File name | Sheet name | Training/testing | File folder | Table region 1 | Table region ... |
| --- | --- | --- | --- | --- | --- |
| 1_AGAVE.x | February | training_set | VEnron2\1027 | B3:F5 | ... |
| ... | ... | ... | ... | ... | ... |
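The table region columns in the schema above hold ranges in spreadsheet A1 notation, such as B3:F5. Models usually need numeric bounds instead, so the sketch below converts such a range into 1-based row and column indices. It is an illustration only: the function names and the returned dictionary keys are assumptions, not part of the published annotation files, and it handles only simple `A1:B2` ranges without sheet names or `$` anchors.

```python
import re

# Matches simple ranges such as "B3:F5" (column letters, then row digits).
_RANGE_RE = re.compile(r"^([A-Z]+)(\d+):([A-Z]+)(\d+)$")


def column_to_index(letters):
    """Convert column letters to a 1-based index: A -> 1, Z -> 26, AA -> 27."""
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def parse_table_region(region):
    """Parse an A1-style range into 1-based, inclusive bounds."""
    match = _RANGE_RE.match(region.strip().upper())
    if match is None:
        raise ValueError(f"not a simple A1 range: {region!r}")
    col1, row1, col2, row2 = match.groups()
    return {
        "top": int(row1),
        "left": column_to_index(col1),
        "bottom": int(row2),
        "right": column_to_index(col2),
    }


bounds = parse_table_region("B3:F5")
n_rows = bounds["bottom"] - bounds["top"] + 1
n_cols = bounds["right"] - bounds["left"] + 1
```

For the example row, B3:F5 spans rows 3 to 5 and columns 2 to 6, so the annotated table is 3 rows by 5 columns.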

tablesense's People

Contributors

haoareyudong, microsoftopensource, xilv

tablesense's Issues

Model Training Source Code

Hi,

Do you plan to open-source the model training scripts for TableSense? If so, could you share a link to them?
