korashughes / llmcodegen

An exploration in code extraction and generation with large language models.

Home Page: https://www.proquest.com/docview/3059107595?%20Theses&sourcetype=Dissertations%20

Languages: Jupyter Notebook 25.34%, Starlark 2.92%, Shell 0.38%, Python 6.95%, Smarty 3.63%, Makefile 0.05%, Tcl 1.18%, C 7.81%, SystemVerilog 39.87%, Verilog 0.03%, Stata 0.01%, C++ 0.97%, Emacs Lisp 2.67%, Assembly 2.07%, HTML 0.32%, Rust 5.51%, CSS 0.12%, JavaScript 0.08%, Handlebars 0.04%, SCSS 0.06%
Topics: llm, python, systemverilog, verification, verilog

llmcodegen's Introduction

CodeGen

An exploration in code & assertion generation with large language models. An explanation of the prompting schema, data acquisition, and methodology can be found in my Master's thesis.

Project Abstract:

Software assertions play a critical role in the creation of test benches and the overall verification of systems. In the case of formal property verification, complex design specifications are interpreted by industry experts and translated into SystemVerilog Assertions (SVA). Recent research has pointed toward large language models as a potential tool for SVA generation; however, a lack of data and of standardization in software assertions has produced mixed results across evaluation methods. Thus, this paper proposes a dataset of code and natural language data containing assertions in SystemVerilog and Python that can be used to train and test future collaborative coding models. Additionally, this paper provides a preliminary analysis and a novel schema for the consistent generation of quality software assertions with OpenAI’s GPT-4.

A Brief Tour:

  • Main Components:
    • AllInOne.ipynb aggregates many of the processing files, including code cleaning, prompt generation, and analysis. For more information on textbook analysis, see GPT-ImageScrapper.ipynb.
    • Data/exploration.ipynb gathers statistics and other visual analytics from the asserted supervised, asserted unsupervised, and textbook datasets.
    • Organized Dataset/* contains cleaned results, prompts, image directories, & LLM responses.
      • Most of this is asserted code, with the notable exception of Organized Dataset/Verilog Textbook Code/*, which is all interpreted Verilog code.
    • Data/* contains most of the intermediary states of the data as it is processed.
      • Data/BigQuery/* contains the raw code data from Google Cloud.
      • Data/Data/example-code/supervised-textbook/* contains the raw & partially-processed textbook data.
      • Data/example-code/verilog examples/open-titan/* contains the raw & partially-processed testbench data.
  • Generally Helpful Documents:
    • Gpt-Tutorial.ipynb has a brief overview of how to query OpenAI's API and automate the process of extracting LLM-generated code.
    • GPTest.ipynb shows the train of thought and some preliminary analysis done with LLM responses.
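
The extraction step that Gpt-Tutorial.ipynb automates can be illustrated with a minimal sketch. This is not the notebook's actual code: it assumes the LLM reply wraps generated code in Markdown-style fences, and the names `extract_code_blocks` and `sample_response` are hypothetical. The API call itself is left out so the example runs offline.

```python
import re

# Build the fence marker programmatically so this example can itself sit in a fence.
FENCE = "`" * 3

# Assumption: replies wrap code as <fence><optional language tag>\n<code><fence>.
FENCE_RE = re.compile(
    FENCE + r"([A-Za-z0-9_+-]*)[ \t]*\n(.*?)" + FENCE,
    re.DOTALL,
)


def extract_code_blocks(response_text):
    """Return a list of (language, code) pairs found in an LLM reply.

    language is None when the fence carries no language tag.
    """
    blocks = []
    for match in FENCE_RE.finditer(response_text):
        language = match.group(1).lower() or None
        code = match.group(2).rstrip("\n")
        blocks.append((language, code))
    return blocks


# Hypothetical reply text, standing in for a real model response.
sample_response = (
    "Here is the assertion you asked for:\n"
    + FENCE + "systemverilog\n"
    + "assert property (@(posedge clk) req |-> ##1 ack);\n"
    + FENCE + "\n"
    + "Let me know if you need more."
)

for language, code in extract_code_blocks(sample_response):
    print(language, "->", code)
```

Replies without fences yield an empty list, so a caller would need to decide how to treat unfenced responses.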

Key:

  • Any file with "supervised" in the title relates to the connection between a piece of code's behavioral description (as given by an LLM) and the code text itself.
    • Likewise, anything "unsupervised" relates exclusively to the code and its assertions.
  • Any file with "response" in the title contains LLM output.
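
The naming key above can be applied mechanically when sorting files. The sketch below is illustrative only: `classify_file` and the example file names are hypothetical, not taken from the repository. Note that "unsupervised" contains "supervised" as a substring, so it has to be checked first.

```python
def classify_file(name):
    """Return the set of key tags that apply to a file name."""
    lowered = name.lower()
    tags = set()
    # "unsupervised" contains "supervised", so test the longer word first.
    if "unsupervised" in lowered:
        tags.add("unsupervised")
    elif "supervised" in lowered:
        tags.add("supervised")
    if "response" in lowered:
        tags.add("response")
    return tags


# Hypothetical file names, used only to exercise the function.
for example in (
    "asserted-supervised-prompts.csv",
    "asserted-unsupervised-responses.json",
    "exploration.ipynb",
):
    print(example, sorted(classify_file(example)))
```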

For further questions feel free to reach me at [email protected]

