Giter VIP home page Giter VIP logo

math-comp-corpus's Introduction

MathComp Corpus

A corpus of code for the Coq proof assistant along with several independently machine-readable representations, derived from verification projects related to the Mathematical Components (MathComp) library. The corpus can be used, e.g., in machine learning and data mining applications. Machine-readable representations are in the form of S-expressions (sexps), and were created using the SerAPI library.

Obtaining the corpus

The latest released version of the corpus can be downloaded from GitHub. The archive includes both the Coq source files and corresponding machine-readable representations.

Corpus contents

The latest corpus release is based on Coq 8.10.2 and MathComp 1.9.0, and contains 449 source files from 21 Coq projects --- in total over 297k lines of code (LOC). For each source file (e.g., theory.v), there are two corresponding files with lists of sexps, organized at the Coq sentence level, for tokens (theory.tok.sexp) and abstract syntax trees (theory.ast.sexp). Moreover, three sexp representations are provided for each Coq lemma statement in the corpus: tokens, abstract syntax tree, and elaborated term. All machine-readable representations were created using SerAPI 0.7.1, via the SerAPI-bundled programs sercomp, sertok, and sername.

A research paper describes the corpus in more detail and provides additional statistics. The corpus is divided into three tiers based on how well projects conform to the MathComp conventions.

Project Revision SHA No. files LOC Tier License
finmap 27642a8 4 6451 1 CECILL-B
fourcolor 0851d49 60 37138 1 CECILL-B
math-comp 748d716 89 84713 1 CECILL-B
odd-order ca602a4 34 36125 1 CECILL-B
analysis 9e5fe1d 17 11739 2 CECILL-C
bigenough 5f79a32 1 78 2 CECILL-B
elliptic-curves 631af89 18 9596 2 CECILL-B
grobner dfa54f9 1 1330 2 CECILL-B
infotheo 6c17242 81 42295 2 GPL-3.0-only
multinomials 691d795 5 7363 2 CECILL-B
real-closed 495a1fa 10 8925 2 CECILL-B
robot b341ad1 13 11130 2 LGPL-3.0-only
two-square 1c09aca 2 1721 2 CECILL-B
bits 3cd298a 10 4041 3 Apache-2.0
comp-dec-pdl c1f813b 16 4419 3 CECILL-B
disel e8aa80c 20 4473 3 BSD-2-Clause
fcsl-pcm eef4503 12 5789 3 Apache-2.0
games 3d3bd31 12 4953 3 BSD-2-Clause
monae 9d0e461 18 6655 3 GPL-3.0-only
reglang da333e0 12 3033 3 CECILL-B
toychain 97bd697 14 5275 3 BSD-2-Clause

The structure of the corpus is as follows:

File/directory Contents
projects.yml List of projects, along with their URL, SHA, build command, installation command, etc.
raw-files Project source files and their machine-readable representations.
lemmas Lemmas for all projects in the corpus and their machine-readable statements.
lemmas-filtered Subset of lemmas obeying restrictions on the maximum sizes of elaborated terms.
definitions Definitions extracted from the corpus.

Applications

The corpus was used to train and evaluate the tool Roosterize, which suggests lemma names in Coq code using neural networks.

If you have used the corpus in a research project, please cite the corresponding research paper in any related publication:

@inproceedings{NieETAL20Roosterize,
  author = {Nie, Pengyu and Palmskog, Karl and Li, Junyi Jessy and Gligoric, Milos},
  title = {Deep Generation of {Coq} Lemma Names Using Elaborated Terms},
  booktitle = {International Joint Conference on Automated Reasoning},
  pages = {97--118},
  doi = {10.1007/978-3-030-51054-1_6},
  year = {2020},
}

Example data

The Coq lemma sentence

Lemma mg_eq_proof L1 L2 (N1 : mgClassifier L1) : L1 =i L2 -> nerode L2 N1.

is serialized into the following tokens (simplified):

(Sentence((IDENT Lemma)(IDENT mg_eq_proof)(IDENT L1)(IDENT L2)
  (KEYWORD"(")(IDENT N1)(KEYWORD :)(IDENT mgClassifier)
  (IDENT L1)(KEYWORD")")(KEYWORD :)(IDENT L1)(KEYWORD =i)(IDENT L2)
  (KEYWORD ->)(IDENT nerode)(IDENT L2)(IDENT N1)(KEYWORD .)))

and the corresponding parsed syntax (simplified):

(VernacExpr()(VernacStartTheoremProof Lemma (Id mg_eq_proof)
 (((CLocalAssum(Name(Id L1))(CHole()IntroAnonymous()))
   (CLocalAssum(Name(Id L2))(CHole()IntroAnonymous()))
   (CLocalAssum(Name(Id N1))
    (CApp(CRef(Ser_Qualid(DirPath())(Id mgClassifier)))(CRef(Ser_Qualid(DirPath())(Id L1))))))
  (CNotation(InConstrEntrySomeLevel"_ -> _")
   (CNotation(InConstrEntrySomeLevel"_ =i _")
    (CRef(Ser_Qualid(DirPath())(Id L1)))(CRef(Ser_Qualid(DirPath())(Id L2))))
   (CApp(CRef(Ser_Qualid(DirPath())(Id nerode)))
    (CRef(Ser_Qualid(DirPath())(Id L2)))(CRef(Ser_Qualid(DirPath())(Id N1))))))))

and finally, the corresponding elaborated term (simplified):

(Prod (Name (Id char)) ... (Prod (Name (Id L1)) ...
 (Prod (Name (Id L2)) ... (Prod (Name (Id N1)) ...
  (Prod Anonymous (App (Ref (DirPath ((Id ssrbool) (Id ssr) (Id Coq))) (Id eq_mem)) ...
   (Var (Id L1)) ... (Var (Id L2)))
  (App (Ref (DirPath ((Id myhill_nerode) (Id RegLang))) (Id nerode)) ...
   (Var (Id L2)) ... (Var (Id N1))))))))

License

As described in more detail in the license file, all corpus metadata is licensed under the MIT license, while all Coq source files and corresponding S-expression files are licensed under their respective original open-source licenses.

Authors

math-comp-corpus's People

Contributors

palmskog avatar pengyunie avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

Forkers

brando90

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.