unit-mesh / unit-gen

UnitGen is a code fine-tuning data framework that generates data from your existing codebase.

Home Page: https://gen.unitmesh.cc/

License: Mozilla Public License 2.0

Kotlin 90.92% Jupyter Notebook 1.05% Python 8.03%
data-engineering evaluating finetuning llm

unit-gen's Introduction

UnitGen

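UnitGen is a data framework for fine-tuning code models. It generates fine-tuning data directly from your codebase: code completion, test generation, documentation generation, and more.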

Docs: https://gen.unitmesh.cc/

Thanks to OpenBayes for providing computing resources.

Finetune Model Examples:

name          | model download (HuggingFace) | finetune notebook | model download (OpenBayes)
DeepSeek 6.7B | unit-mesh/autodev-coder      | finetune.ipynb    | AutoDev Coder

Language support by Chapi

  • supported:
    • Java
    • Kotlin
  • in progress:
    • TypeScript/JavaScript
    • Rust
  • planned:
    • Go
    • Python
    • C/C++
    • C#
    • Scala

Features:

Architecture

Layered Architecture

[Figure: layered architecture diagram]

Workflow

[Figure: UnitGen workflow diagram]

Design Philosophy

  • Unique prompt. The same prompt is used across fine-tuning, evaluation, and tooling.
  • Code quality pipeline. Code is assessed for complexity, bad smells, test bad smells, and other rules.
  • Extensible, customizable quality thresholds. Custom rules, custom thresholds, custom quality types, and more.

Unique Prompt

Keep the same prompt: AutoDev <-> UnitGen <-> UnitEval

AutoDev prompt

AutoDev prompt template example:

Write unit test for following code.

${context.coc}

${context.framework}

${context.related_model}

```${context.language}
${context.selection}
```
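The `${context.*}` placeholders in the template above could be filled by plain string substitution. The following is a minimal sketch of that idea, not AutoDev's actual template engine: `PromptTemplate`, `render`, and the sample values are hypothetical names introduced only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of AutoDev or UnitGen. */
final class PromptTemplate {
    // Matches ${name} and dotted names such as ${context.language}.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_.]*)\\}");

    static String render(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            // Unknown placeholders are kept verbatim so missing context is easy to spot.
            String value = values.getOrDefault(matcher.group(1), matcher.group());
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String fence = "`".repeat(3); // code fence, built at runtime for readability
        String template = "Write unit test for following code.\n\n"
                + fence + "${context.language}\n${context.selection}\n" + fence;

        Map<String, String> context = new LinkedHashMap<>();
        context.put("context.language", "java");                      // illustrative value
        context.put("context.selection", "int one() { return 1; }");  // illustrative value

        System.out.println(render(template, context));
    }
}
```

Leaving unknown placeholders untouched is a design choice of this sketch: a prompt that still contains `${...}` is easier to detect than one with silently dropped context.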

Unit Picker prompt

The Unit Picker prompt should keep the same structure as the AutoDev prompt. Prompt example:

Instruction(
    instruction = "Complete ${it.language} code, return rest code, no explaining",
    output = it.output,
    input = """
    |```${it.language}
    |${it.relatedCode}
    |```
    |
    |Code:
    |```${it.language}
    |${it.beforeCursor}
    |```""".trimMargin()
)

UnitGen prompt

The UnitGen prompt should keep the same structure as the AutoDev prompt. Prompt example:

Complete ${language} code, return rest code, no explaining

```${language}
${relatedCode}
```

Code:
```${language}
${beforeCursor}
```
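To make the shape of such a completion sample concrete, here is a minimal sketch that splits a snippet at a cursor offset: the text before the cursor goes into the prompt and the rest becomes the expected output. This is one plausible way to derive the pair, not necessarily how UnitGen chooses cursor positions; `CompletionSample` and `split` are hypothetical names.

```java
/** Hypothetical sketch; field names mirror the Instruction example above. */
final class CompletionSample {
    final String instruction;
    final String input;
    final String output;

    private CompletionSample(String instruction, String input, String output) {
        this.instruction = instruction;
        this.input = input;
        this.output = output;
    }

    /** Splits code at cursor: the prefix goes into the prompt, the suffix is the target. */
    static CompletionSample split(String language, String relatedCode, String code, int cursor) {
        if (cursor < 0 || cursor > code.length()) {
            throw new IllegalArgumentException("cursor out of range: " + cursor);
        }
        String fence = "`".repeat(3); // code fence, built at runtime for readability
        String beforeCursor = code.substring(0, cursor);
        String input = fence + language + "\n" + relatedCode + "\n" + fence + "\n"
                + "\n"
                + "Code:\n"
                + fence + language + "\n" + beforeCursor + "\n" + fence;
        return new CompletionSample(
                "Complete " + language + " code, return rest code, no explaining",
                input,
                code.substring(cursor));
    }

    public static void main(String[] args) {
        String code = "int add(int a, int b) {\n    return a + b;\n}";
        CompletionSample sample =
                split("java", "class Calculator {}", code, code.indexOf("return"));
        System.out.println(sample.input);
        System.out.println("--- expected completion ---");
        System.out.println(sample.output);
    }
}
```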

Code quality pipeline

[Figure: code quality workflow diagram]

Extensible, customizable quality thresholds

Optional quality types:

enum class CodeQualityType {
    BadSmell,
    TestBadSmell,
    JavaController,
    JavaRepository,
    JavaService,
}

Custom threshold configuration:

data class BsThresholds(
    val bsLongParasLength: Int = 5,
    val bsIfSwitchLength: Int = 8,
    val bsLargeLength: Int = 20,
    val bsMethodLength: Int = 30,
    val bsIfLinesLength: Int = 3,
)
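As a rough illustration of how thresholds like these might be applied, the sketch below flags a method whose parameter count or length exceeds a configured limit. The reading of `bsLongParasLength` as a parameter-count limit and `bsMethodLength` as a line-count limit is an assumption, and `BadSmellCheck`, `Thresholds`, and the smell labels are hypothetical, not UnitGen's API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; UnitGen's own checks may differ. */
final class BadSmellCheck {
    /** Mirrors two fields of BsThresholds; the meaning given to each is an assumption. */
    static final class Thresholds {
        final int longParasLength; // assumed: maximum number of parameters
        final int methodLength;    // assumed: maximum number of lines in a method

        Thresholds(int longParasLength, int methodLength) {
            this.longParasLength = longParasLength;
            this.methodLength = methodLength;
        }

        /** Same default values as the data class above. */
        static Thresholds defaults() {
            return new Thresholds(5, 30);
        }
    }

    /** Returns a label for every threshold the method exceeds. */
    static List<String> smells(Thresholds t, int parameterCount, int methodLines) {
        List<String> found = new ArrayList<>();
        if (parameterCount > t.longParasLength) {
            found.add("LongParameterList");
        }
        if (methodLines > t.methodLength) {
            found.add("LongMethod");
        }
        return found;
    }

    public static void main(String[] args) {
        // Stricter custom thresholds flag code that the defaults would accept.
        System.out.println(smells(Thresholds.defaults(), 4, 25));
        System.out.println(smells(new Thresholds(3, 20), 4, 25));
    }
}
```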

Custom rules:

val apis = apiAnalyser.toContainerServices()
val ruleset = RuleSet(
    RuleType.SQL_SMELL,
    "normal",
    UnknownColumnSizeRule(),
    LimitTableNameLengthRule()
    // more rules
)

val issues = WebApiRuleVisitor(apis).visitor(listOf(ruleset))
// if issues is not empty, the code has a bad smell
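The rule-set pattern above (a list of rules, a visitor that applies them, and a list of issues) can be sketched in a self-contained form. Everything here is a hypothetical stand-in rather than the ArchGuard or UnitGen API: the `Rule` interface, `TableNameLengthRule`, its limit, and the `visit` helper exist only to show the shape of the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of a rule set and visitor; not the real rule API. */
final class RuleSketch {
    /** A rule returns an issue message when its input violates it. */
    interface Rule {
        Optional<String> check(String tableName);
    }

    /** Stand-in for a table-name length rule; the limit is illustrative. */
    static final class TableNameLengthRule implements Rule {
        private final int maxLength;

        TableNameLengthRule(int maxLength) {
            this.maxLength = maxLength;
        }

        @Override
        public Optional<String> check(String tableName) {
            return tableName.length() > maxLength
                    ? Optional.of("table name too long: " + tableName)
                    : Optional.empty();
        }
    }

    /** Applies every rule to every table name and collects the issues found. */
    static List<String> visit(List<? extends Rule> rules, List<String> tableNames) {
        List<String> issues = new ArrayList<>();
        for (String name : tableNames) {
            for (Rule rule : rules) {
                rule.check(name).ifPresent(issues::add);
            }
        }
        return issues;
    }

    public static void main(String[] args) {
        List<String> issues = visit(
                List.of(new TableNameLengthRule(10)),
                List.of("users", "customer_order_history"));
        // A non-empty list means at least one rule was violated.
        issues.forEach(System.out::println);
    }
}
```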

Quick Start

For examples, see the examples folder.

Use the CLI

See config-examples.

Download the latest version from GitHub Releases.

Generate Instructions

  1. Configure the project in processor.yml.
  2. Run the picker: java -jar unit-gen.jar

Use the Java API

See config-example.

1. Add the dependencies

dependencies {
    implementation("cc.unitmesh:unit-picker:0.1.5")
    implementation("cc.unitmesh:code-quality:0.1.5")
}

2. Configure the unit-gen.yml and connection.yml files

3. Write the code

import java.util.ArrayList;
import java.util.List;
// The UnitGen classes used below (PickerOption, SimpleCodePicker, Instruction, ...)
// must also be imported from the unit-picker and code-quality dependencies.

public class App {
    public static void main(String[] args) {
        // Instruction types to generate
        List<InstructionType> builderTypes = new ArrayList<>();
        builderTypes.add(InstructionType.RELATED_CODE_COMPLETION);

        // Quality checks to apply
        List<CodeQualityType> codeQualityTypes = new ArrayList<>();
        codeQualityTypes.add(CodeQualityType.BadSmell);
        codeQualityTypes.add(CodeQualityType.JavaService);

        PickerOption pickerOption = new PickerOption(
                "https://github.com/unit-mesh/unit-gen-testing", "master", "java",
                ".", builderTypes, codeQualityTypes, new BuilderConfig()
        );

        // Run the picker and block until it finishes
        SimpleCodePicker simpleCodePicker = new SimpleCodePicker(pickerOption);
        List<Instruction> output = simpleCodePicker.blockingExecute();

        // handle output in here
    }
}
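One way to handle the output is to write each instruction as a line of JSON (JSONL), a common format for fine-tuning data. The sketch below assumes the three fields (instruction, input, output) have already been read from each Instruction as strings. `JsonlWriter` and its methods are hypothetical helpers, the JSON escaping is hand-written to keep the example dependency-free, and this is not necessarily the format UnitGen itself emits.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper for writing instruction triples as JSONL. */
final class JsonlWriter {
    /** Quotes a string as a JSON string literal. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters must be escaped numerically.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Builds one JSONL record from the three instruction fields. */
    static String toJsonLine(String instruction, String input, String output) {
        return "{\"instruction\":" + quote(instruction)
                + ",\"input\":" + quote(input)
                + ",\"output\":" + quote(output) + "}";
    }

    /** Writes one record per line, UTF-8 encoded. */
    static void write(Path path, List<String> jsonLines) throws IOException {
        Files.write(path, jsonLines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(toJsonLine(
                "Complete java code, return rest code, no explaining",
                "int add(int a, int b) {",
                "\n    return a + b;\n}"));
    }
}
```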

Thanks to

  • Abstract syntax tree: Chapi. Used feature: parsing multiple languages into the same data structure.
  • Legacy system analysis: Coca. Inspiration for: Bad Smell, Test Bad Smell.
  • Architecture governance tool: ArchGuard. Used features: Estimation, Rule Lint (API, SQL).
  • Code database: CodeDB. Used feature: code analysis pipeline.

LICENSE

This code is distributed under the MPL 2.0 license. See LICENSE in this directory.

unit-gen's People

Contributors

jialiu-github, phodal

unit-gen's Issues

Refactor: TypedIns to template

In the current design, we use toInstruction to convert each TypedIns to an Instruction, which does not work for IDE tools outside AutoDev.

interface TypedIns {
    val type: InstructionBuilderType

    /**
     * Build final instruction.
     */
    fun toInstruction(): Instruction
}

Bootstrap

Use Unit Eval to generate data

  • Test Code
  • Documentation
  • Code completion

Issue>>

  • object CodeDataStructUtil {}
  • fun CodeDataStruct.toUml(): String {} extension function
  • data class ?

Generate in-block data in CLI mode

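Could the CLI accept configuration parameters, for example to generate only in-block data? In our tests, the command line does not appear to support passing any additional parameters.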
