beomi / koalpaca

KoAlpaca: An open-source language model that understands Korean instructions

Home Page: https://chat.koalpaca.com

License: Apache License 2.0

Jupyter Notebook 87.25% Python 12.24% Shell 0.51%

alpaca koalpaca korean-nlp llama chatkoalpaca polyglot-ko

koalpaca's Introduction

Update Logs

2023.07.28
- Added example training code based on TensorParallel using four RTX 3090 (4090/A5000) GPUs
2023.06.12
- Removed the Kakao/Telegram bot notice
2023.06.08
- Added a Colab notebook: load an adapter weight trained with QLoRA and run inference
  - It can be loaded on Colab Free (T4 GPU).
  - The example uses the 🤗beomi/qlora-koalpaca-polyglot-12.8b-50step repo (an example trained for 50 steps).
2023.06.02
- Trained and released 🐧KoAlpaca-KoRWKV-6B🤗 v1.1a, based on the 🐧KoRWKV-6B🤗 28B-token checkpoint
  
  The 🐧KoAlpaca-KoRWKV-6B Hugging Face repo🤗 has several sample generations.
- Added a guide to full fine-tuning of Polyglot-ko 5.8b on RTX 3090/4090/A5000
2023.05.30
- Added a training example that runs on free Colab (T4 GPU): training KoAlpaca v1.1a on the Polyglot-ko 12.8B model with QLoRA in 4-bit
  - Colab Free (T4): about 4 minutes per 50 samples, about 15 hours per epoch (exceeds the maximum session time)
  - Colab Pro (A100): about 44 seconds per 50 samples, about 2-3 hours per epoch
2023.05.03
- Released the 🤗Polyglot-ko 5.8B-based KoAlpaca-Polyglot-5.8B-v1.1b fp16 model and 🤗8bit model
2023.04.28
- Logo updated 🎉
2023.04.26
- Applied Hugging Face ChatUI to chat.koalpaca.com
  - The ChatKoAlpaca GPU server is run with support from the Modulabs (모두의연구소) K-Digital Platform.
2023.04.20
- Released the KoAlpaca v1.1 dataset
- Released the 🤗Polyglot-ko 12.8B-based KoAlpaca-Polyglot-12.8B-v1.1b fp16 model and 🤗8bit model
- Released the ChatKoAlpaca Web UI (v1.1b)
2023.03.26: Released the 🤗LLAMA 30B-based KoAlpaca model. (Trained with LoRA)
- LLAMA 30B training was made possible by A100 support from Prof. Sundong Kim at GIST. Thank you 😄
2023.03.24: Released the 🤗LLAMA 13B-based KoAlpaca model. (Trained with LoRA)
2023.03.23: Released the 🤗LLAMA 65B-based KoAlpaca model. (Trained with LoRA)
2023.03.22: Added a KoAlpaca bot ported to KakaoTalk.

KoAlpaca: Korean Alpaca Model based on Stanford Alpaca (feat. LLAMA and Polyglot-ko)

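An Alpaca model that understands Korean, trained in the same way as the Stanford Alpaca model.

Try it now: meet it through the chat web page (v1.1b-chat), the KakaoTalk bot, or the Telegram bot!

ChatKoAlpaca: chat-style KoAlpaca (Polyglot-ko-12.8b-v1.1b)

This demo is run by the Modulabs (모두의연구소) KoAlpaca Lab with support from the Modulabs K-Digital Platform. Thank you 😍

Available directly on the web, with no installation.
It remembers the context of the previous conversation when generating answers.
It generates responses of up to 512 tokens and remembers up to 1024 tokens of previous context (including newly generated answers).
The demo is served with Tensor Parallel across two A5000 24G GPUs, using the fp16 model.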
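
As a rough illustration of why the 12.8B fp16 demo is split across two 24G GPUs, the sketch below estimates the memory taken by the weights alone. It assumes 2 bytes per parameter for fp16 and an even split across GPUs; activations, the KV cache, and framework overhead are ignored, and the helper name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Polyglot-ko 12.8B in fp16 (2 bytes per parameter).
total = weight_memory_gb(12.8e9)
# Assumption: tensor parallelism splits the weights evenly across 2 GPUs.
per_gpu = total / 2

print(f"total: {total:.1f} GB, per GPU: {per_gpu:.1f} GB")
print("fits on one 24 GB GPU:", total <= 24)
```

Under these assumptions the weights alone exceed a single 24 GB card, while each half leaves headroom for activations and the generation cache.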

Link: https://chat.koalpaca.com

Released models: Korean-based models (Polyglot-ko) & English-Korean models (LLAMA)

As backbones, KoAlpaca uses Polyglot-ko (5.8B and 12.8B) for the Korean models and LLAMA for the English+Korean models.

Polyglot-ko 12.8B based [Full Finetune v1.1] -> 🤗 beomi/KoAlpaca-Polyglot-12.8B

This model is used in ChatKoAlpaca.
A 🤗 8-bit quantized version of this model is also provided.

Polyglot-ko 5.8B based [Full Finetune] -> 🤗 beomi/KoAlpaca-Polyglot
Meta LLAMA 7B based [Full Finetune] -> 🤗 beomi/KoAlpaca
Meta LLAMA 13B based [LoRA] -> 🤗 KoAlpaca-13B-LoRA
Meta LLAMA 30B based [LoRA] -> 🤗 KoAlpaca-30B-LoRA
Meta LLAMA 65B based [LoRA] -> 🤗 KoAlpaca-65B-LoRA

Note: for LLAMA 13B, 🤗chansung/koalpaca-lora-13b, trained by Chansung (@deep-diver), is also publicly available.

Meta's LLAMA models were not trained on enough Korean data, so Korean performance turns out low in actual inference.

To get better quality in Korean, models using Polyglot-ko 5.8B and 12.8B as backbones were also trained.

Subsequently, together with the v1.1 data, KoAlpaca v1.1 is released: a model that understands both the context of the previous conversation and the user's follow-up input.

Example code for running KoAlpaca models

Running with the Hugging Face Pipeline

Install the latest torch / HF libraries

pip install -U torch transformers tokenizers accelerate safetensors

You can try it with the example code below.

import torch
from transformers import pipeline, AutoModelForCausalLM

MODEL = 'beomi/KoAlpaca-Polyglot-5.8B'

model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(device="cuda", non_blocking=True)
model.eval()

pipe = pipeline(
    'text-generation', 
    model=model,
    tokenizer=MODEL,
    device=0
)

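def ask(x, context='', is_input_full=False):
    # Build the prompt in the KoAlpaca format; the context section is added only when given.
    # Note: is_input_full is accepted but not used.
    if context:
        prompt = f"### 질문: {x}\n\n### 맥락: {context}\n\n### 답변:"
    else:
        prompt = f"### 질문: {x}\n\n### 답변:"
    ans = pipe(
        prompt,
        do_sample=True,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        return_full_text=False,
        eos_token_id=2,
    )
    print(ans[0]['generated_text'])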

ask("딥러닝이 뭐야?")
# 딥러닝은 인공신경망을 통해 입력과 출력 사이의 복잡한 관계를 학습하는 머신러닝의 한 분야입니다. 이 기술은 컴퓨터가 인간의 학습 능력과 유사한 방식으로 패턴을 학습하도록 하며, 인간의 개입 없이도 데이터를 처리할 수 있는 기술입니다. 최근에는 딥러닝을 활용한 인공지능 애플리케이션이 많이 개발되고 있습니다. 예를 들어, 의료 진단 애플리케이션에서는 딥러닝 기술을 활용하여 환자의 특징을 파악하고, 이를 통해 빠르고 정확한 진단을 내리는 데 사용됩니다. 또한, 금융 분야에서는 딥러닝 기술을 활용하여 주가 예측 모형을 학습하는 데 사용되기도 합니다.

Gradio sample - Full fine-tune & 8bit 🌟 Recommended 🌟

See the webui folder in this repo.

Gradio sample - LoRA

Try running it by following Chansung's (@deep-diver) deep-diver/Alpaca-LoRA-Serve repo.

# After installing as described in the guide of the repo above..
BASE_URL=decapoda-research/llama-13b-hf  # For the 30B model use llama-30b-hf, for the 65B model use llama-65b-hf. Note: the 65B base model is over 100GB.
FINETUNED_CKPT_URL=beomi/KoAlpaca-13B-LoRA  # For the 30B model use KoAlpaca-30B-LoRA, for the 65B model use KoAlpaca-65B-LoRA. Note: the 65B model only runs on GPUs with very large VRAM, such as the A100 80G.

python app.py --base_url $BASE_URL --ft_ckpt_url $FINETUNED_CKPT_URL --port 6006

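How the v1.1 dataset was built

The v1.1 dataset was built to address the original Alpaca model's tendency to give short answers and its tendency to miss context.

Dataset selection and collection

The v1.1 dataset was built by crawling Naver Knowledge iN (네이버 지식인).

All questions from the Naver Knowledge iN "Best" section were collected.

Data processing

The actual training data was newly generated from each question's title, the question body, and the body of the accepted answer.

The data was generated with ChatGPT, using batch decoding.

Note: the original Knowledge iN questions and answers were used as seeds for ChatGPT to refer to.

How the v1.0 dataset was built

The dataset is based on the 52K-example dataset provided by Stanford Alpaca.

Dataset translation

The Alpaca dataset consists of an Instruct part, an Input part, and an Output part, as shown below.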

...
{
    "instruction": "Describe a time when you had to make a difficult decision.",
    "input": "",
    "output": "I had to make a difficult decision when I was working as a project manager at a construction company. I was in charge of a project that needed to be completed by a certain date in order to meet the client\u2019s expectations. However, due to unexpected delays, we were not able to meet the deadline and so I had to make a difficult decision. I decided to extend the deadline, but I had to stretch the team\u2019s resources even further and increase the budget. Although it was a risky decision, I ultimately decided to go ahead with it to ensure that the project was completed on time and that the client\u2019s expectations were met. The project was eventually successfully completed and this was seen as a testament to my leadership and decision-making abilities."
},
{
    "instruction": "Identify the odd one out.",
    "input": "Twitter, Instagram, Telegram",
    "output": "Telegram"
},
...

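Because the Output is generated by OpenAI's text-davinci-003 model, that part was not translated.

Accordingly, only the Instruct and Input columns were translated, and the few examples whose Input part contained errors were excluded.

The DeepL API service was used for translation, and the translated data looks like this.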

...
{
    "instruction": "홀수 중 하나를 밝히세요.",
    "input": "트위터, 인스타그램, 텔레그램"
}
...
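
The translation step described above can be sketched as follows. This is a minimal illustration, not the repository's actual script: `translate` is a hypothetical callable standing in for whatever translation client is used (the README names the DeepL API, but no specific client call is assumed here), and the filtering of examples with broken inputs is not reproduced because the README does not say how it was done.

```python
from typing import Callable, Dict, List


def translate_examples(
    examples: List[Dict[str, str]],
    translate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Translate only the instruction and input fields; drop the English output."""
    translated = []
    for ex in examples:
        translated.append({
            "instruction": translate(ex["instruction"]),
            # Empty inputs stay empty instead of being sent to the translator.
            "input": translate(ex["input"]) if ex["input"] else "",
        })
    return translated


if __name__ == "__main__":
    # Stand-in translator for demonstration only: it just tags the text.
    fake_translate = lambda text: f"[ko] {text}"
    sample = [{
        "instruction": "Identify the odd one out.",
        "input": "Twitter, Instagram, Telegram",
        "output": "Telegram",
    }]
    print(translate_examples(sample, fake_translate))
```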

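Output data generation

The Instruct and Input pairs are then sent to the OpenAI ChatGPT API (gpt-3.5-turbo) using the batch decoding technique to generate the outputs, as follows.

As in the Alpaca work, only a single response was generated for each Instruct/Input pair.

The original work generated 20 examples at once, but when generating in Korean, generation was stable at 10 at a time, so responses were generated in batches of 10.

Below is the prompt used to generate the answers.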

PROMPT = """\
다양한 작업에 대한 답변을 생성해주세요. 이러한 작업 지침은 ChatGPT 모델에 주어지며, ChatGPT 모델이 지침을 완료하는지 평가합니다.

요구 사항은 다음과 같습니다:
1. 다양성을 극대화하기 위해 각 지시에 대해 동사를 반복하지 않도록 하세요.
2. 지시에 사용되는 언어도 다양해야 합니다. 예를 들어, 질문과 명령형 지시를 결합해야 합니다.
3. 지시 사항의 유형이 다양해야 합니다. 목록에는 개방형 생성, 분류, 편집 등과 같은 다양한 유형의 작업이 포함되어야 합니다.
2. GPT 언어 모델은 지시를 완료할 수 있어야 합니다. 예를 들어 어시스턴트에게 시각적 또는 오디오 출력을 생성하도록 요청하지 마세요. 또 다른 예로, 어시스턴트가 어떤 작업도 수행할 수 없으므로 오후 5시에 깨우거나 미리 알림을 설정하도록 요청하지 마세요.
3. 답변은 한국어로 작성해야 합니다.
4. 답변을 1~2문장으로 작성하세요. 명령문이나 질문도 허용됩니다.
5. 지시 사항에 대한 적절한 입력을 생성해야 합니다. 입력 필드에는 지시에 대한 구체적인 예가 포함되어야 합니다. 실제 데이터를 포함해야 하며 단순한 자리 표시자를 포함해서는 안 됩니다. 입력은 지시 사항을 어렵게 만들 수 있는 상당한 내용을 제공해야 하지만 100단어를 넘지 않는 것이 이상적입니다.
6. 일부 지시사항은 추가 입력이 있고, 일부 지시에는 입력 필드가 비어있습니다. 예를 들어 "세계에서 가장 높은 봉우리는 무엇인가?"라는 일반적인 정보를 묻는 지시의 경우 구체적인 맥락을 제공할 필요가 없어, 입력 필드가 비어있을 수 있습니다.
7. 출력은 명령어와 입력에 대한 적절한 응답이어야 합니다.

아래에 10개의 명령어와 입력(옵션)에 따라 적절한 응답을 생성하세요. 
응답은 아래와 같은 형식으로 10가지를 0번 부터 9번 까지, 번호에 따라 해당 번호의 명령어와 입력에 알맞게 작성하세요.

각 응답 사이는 ### 으로 내용을 분리해주세요.

응답0: 첫 번째 응답내용###
응답1: 두 번째 응답내용###
...
응답9: 마지막 응답내용"""

In addition, the following system prompt was added for the ChatGPT API.

"content": "아래는 작업을 설명하는 명령어입니다. 입력이 없으면 입력을 제외하고, 명령어에 따른 요청을 적절히 완료하는 응답을 작성하세요. 추가적 입력이 있다면 작업을 설명하는 명령어와 추가 컨텍스트를 제공하는 입력에 따라 요청을 적절히 완료하는 응답을 작성하세요.",

The completed Korean dataset is stored in ko_alpaca_data.json.
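
To make the batch-of-10 scheme concrete, here is a small sketch of the two pieces of bookkeeping it implies: splitting the instructions into groups of 10, and splitting one reply of the form `응답0: ...###` back into individual answers. The function names are made up for this example, the API call itself is left out, and the repository's own parsing may differ.

```python
import re
from typing import List, Optional


def chunked(items: list, size: int = 10) -> List[list]:
    """Split the instruction list into batches of `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def parse_batch_response(text: str, expected: int = 10) -> Optional[List[str]]:
    """Split a '응답N: ...###' style reply into a list ordered by N.

    Returns None when any index from 0 to expected-1 is missing, so the
    caller can retry that batch instead of storing misaligned answers.
    """
    answers = {}
    for part in text.split("###"):
        match = re.match(r"응답(\d+):\s*(.*)", part.strip(), flags=re.DOTALL)
        if match:
            answers[int(match.group(1))] = match.group(2).strip()
    if sorted(answers) != list(range(expected)):
        return None
    return [answers[i] for i in range(expected)]


if __name__ == "__main__":
    reply = "응답0: 첫 번째 답###\n응답1: 두 번째 답###\n응답2: 세 번째 답"
    print(parse_batch_response(reply, expected=3))
```

One limitation of this sketch: an answer that itself contains `###` would be split incorrectly, so a real pipeline would need to validate or retry such batches.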

Training models with full fine-tuning

QLoRA 4-bit training: Google Colab example

Training KoAlpaca v1.1a on the Polyglot-ko 12.8B model with QLoRA in 4-bit

Example generations from a model trained on only 1,000 samples

### 질문: 풋옵션과 콜옵션의 차이, 그리고 일반 개미 투자자가 선택해야 할 포지션은?

### 답변: 풋이란 주식을 사는 것(구매)을 의미하고, 콜이란 주식을 파는(판매) 것입니다. 예를 들면, 기업의 주가가 10만원인데, 그 회사에서 신규 사업을 진행 중이라, 앞으로 매출이 증대되어 기업 가치가 상승할 것으로 판단해, 해당 회사의 주식을 사는 것이 풋옵션입니다. 이는 주식을 사는 시점과 해당 회사의 주가가 하락할 경우에는 기업의 주가와 함께 주가가 하락하는 리스크가 있습니다. 반면, 콜옵션은 주식을 파는 것입니다. 이는 주식을 파는 시점과 해당 회사의 주가가 증가해서 향후 기업 전망이 밝을 경우 주가가 상승하며, 해당 주식의 프리미엄을 받을 수 있는 안정적인 투자법입니다. 일반 개미 투자자들은 주식을 매수할 때와 주식을 매도할 때 다양한 리스크가 존재하므로, 전문가의 조언과 투자 경험을 바탕으로 적절한 투자 전략을 수립해야 합니다. 투자에 대한 자세한 사항은 아래에서 무료로 얻을 수 있습니다.

### 질문: 마진콜이 발생하는 이유가 뭐야? 그리고 어떻게 해야 마진콜을 막을 수 있어?

### 답변: 마진콜이란, 선물/옵션 매매 시 증거금이 부족해서 추가증거금 예치를 요구받는 것을 말합니다. 추가증거금은 선물/옵션 거래 계약시 미리 정해진 약정증거금에서 추가로 부담해야 하는 증거금을 의미합니다. 통상적으로 매수 계약 시에는 계약 금액의 10%를 증거금으로 예치하고, 매도 계약 시에는 해당 금액의 3%를 예치합니다. 따라서, 매수 계약 시에는 총 20%의 증거금이 필요하고, 매도 계약 시에는 총 9%의 증거금이 필요합니다. 따라서, 선물 매매 중에는 매일 매일의 시세에 따라 증거금을 수시로 필요하면 추가로 납부해야해서 추가증거금이 발생하며, 해당 필요액 부족시에는 마진콜이 발생합니다. 따라서, 선물 거래를 하기 전에는 꼭 추가증거금을 예치하는 것이 중요합니다. 추가증거금은 '선물/옵션 -> 증거금 -> 추가예치/청산' 방법으로 조회 가능하며, 인터넷이나 전화로도 조회할 수 있습니다. 또, '주문/증거금 -> 추가증거금'을 통해 추가증거금을 입금할 수도 있습니다. 추가증거금을 예치

Training the ChatKoAlpaca (KoAlpaca Polyglot-ko-12.8B-v1.1b) model

The ChatKoAlpaca (KoAlpaca Polyglot-ko-12.8B-v1.1b) model was trained on four A100 80GB GPUs with full sharding via DeepSpeed ZeRO-3.

See the train_1.1b folder in this repo for the code.

Training the Polyglot-ko 5.8B model

The model was trained on a single A100 80GB GPU.

Training the LLAMA 7B model

The LLAMA-based model was trained on four A100 80GB GPUs using FSDP.

The training code is in train.py, and the training script is in train.sh.

Notes on FSDP training

LLAMA models are included in Hugging Face Transformers 4.28.0 and later. With earlier versions, an error reports that llama does not exist. When using FSDP, use LlamaDecoderLayer as the layer to wrap.

# train.sh
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'

Training models with LoRA

Training the LLAMA 13B model

🤗 Huggingface Repo: https://huggingface.co/beomi/KoAlpaca-13B-LoRA

The model was trained on four RTX 3090 24GB GPUs. The training code is based on https://github.com/tloen/alpaca-lora.

The training dataset combines alpaca_data_cleaned.json (English) from that repo with ko_alpaca_data.json from this repo.
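
Combining the English and Korean files can be sketched as below, assuming both are JSON lists of example dictionaries. The shuffle and its seed are assumptions added for illustration; the README only says the two files were combined.

```python
import json
import random
from typing import Dict, List


def merge_datasets(paths: List[str], seed: int = 42) -> List[Dict[str, str]]:
    """Concatenate several Alpaca-style JSON lists into one list."""
    merged = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            merged.extend(json.load(f))
    # Assumption: shuffle so English and Korean examples are interleaved.
    random.Random(seed).shuffle(merged)
    return merged


# Usage (paths are placeholders for wherever the two files live):
# data = merge_datasets(["alpaca_data_cleaned.json", "ko_alpaca_data.json"])
# with open("merged.json", "w", encoding="utf-8") as f:
#     json.dump(data, f, ensure_ascii=False, indent=2)
```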

Training the LLAMA 30B model

🤗 Huggingface Repo: https://huggingface.co/beomi/KoAlpaca-30B-LoRA

The model was trained on two A100 40GB GPUs. The training code is based on https://github.com/tloen/alpaca-lora.

LLAMA 30B training was made possible by A100 support from Prof. Sundong Kim at GIST.

The training dataset combines alpaca_data_cleaned.json (English) from that repo with ko_alpaca_data.json from this repo.

Training the LLAMA 65B model

🤗 Huggingface Repo: https://huggingface.co/beomi/KoAlpaca-65B-LoRA

The model was trained on eight A100 80GB GPUs. The training code is based on https://github.com/tloen/alpaca-lora.

The training dataset combines alpaca_data_cleaned.json (English) from that repo with ko_alpaca_data.json from this repo.

Benchmarks

NSMC Benchmark Test

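Benchmark test method

Benchmark comparison between polyglot-ko 5.8b and KoAlpaca-polyglot
Few-shot learning tests run with two prompt types

1. Few-shot learning setup

few-shot configuration
- k=10
- Only sentences with a text token length of 25 or less are used
- Only data up to roughly the 75th percentile of token length is included
NSMC token length distribution

Mean token length of few-shot cases:  20.229
Maximum token length of few-shot cases:  280
Standard deviation of few-shot case token length:  16.488
80th percentile of few-shot case token length:  27.0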
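
The few-shot selection described above can be sketched like this. It is a minimal illustration under stated assumptions: samples are dictionaries with `document` and `label` keys (the NSMC column names), `count_tokens` is a hypothetical callable that would wrap the model's tokenizer, and random sampling with a fixed seed is an assumption, since the README does not say how the 10 examples were picked.

```python
import random
from typing import Callable, Dict, List


def select_few_shot(
    samples: List[Dict],
    count_tokens: Callable[[str], int],
    k: int = 10,
    max_tokens: int = 25,
    seed: int = 0,
) -> List[Dict]:
    """Keep sentences of at most `max_tokens` tokens, then draw k of them."""
    short = [s for s in samples if count_tokens(s["document"]) <= max_tokens]
    # Raises ValueError if fewer than k short sentences are available.
    return random.Random(seed).sample(short, k)


if __name__ == "__main__":
    # Whitespace splitting stands in for a real tokenizer here.
    data = [{"document": " ".join(["w"] * n), "label": n % 2} for n in range(1, 41)]
    shots = select_few_shot(data, lambda text: len(text.split()))
    print(len(shots), max(len(s["document"].split()) for s in shots))
```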

2. Prompt construction

Prompt type 1: a very simple prompt. The analysis task is not stated explicitly.

def build_prompt_text(sent):
    return "문장: " + sent + '\n감정:'

Prompt type 2: unlike prompt type 1, includes the intent of the task. The analysis task is expressed as a question.

def build_prompt_text(sent):
    return '다음 문장은 긍정일까요 부정일까요?\n' + sent + '\n정답:'
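
A per-sentence builder like the two above still has to be combined with the k labelled examples to form the final few-shot prompt. The sketch below shows one way to do that with the type 1 builder. The label words ("부정" for 0, "긍정" for 1) and the blank-line separator are assumptions for illustration; the benchmark notebook may format its examples differently.

```python
from typing import Callable, Dict, List, Sequence


def build_prompt_text(sent: str) -> str:
    # Prompt type 1 from the README.
    return "문장: " + sent + "\n감정:"


def build_few_shot_prompt(
    shots: List[Dict],
    query: str,
    build: Callable[[str], str] = build_prompt_text,
    label_words: Sequence[str] = ("부정", "긍정"),  # assumed: 0 = negative, 1 = positive
) -> str:
    """Join labelled examples and the unlabelled query into one prompt."""
    blocks = [build(s["document"]) + " " + label_words[s["label"]] for s in shots]
    blocks.append(build(query))  # the model is expected to continue after the last label slot
    return "\n\n".join(blocks)


if __name__ == "__main__":
    examples = [
        {"document": "정말 재미있어요", "label": 1},
        {"document": "시간이 아까웠다", "label": 0},
    ]
    print(build_few_shot_prompt(examples, "배우들 연기가 좋았다"))
```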

3. Benchmark test results

acc.	Prompt type 1	Prompt type 2
polyglot-ko	0.5752	0.7223
koalpaca-polyglot	0.6939	0.7683

Stanford Alpaca: An Instruction-following LLaMA Model

This is the repo for the Stanford Alpaca project, which aims to build and share an instruction-following LLaMA model. The repo contains:

A web demo to interact with our Alpaca model
The 52K data used for fine-tuning the model
The code for generating the data
The code for fine-tuning the model

Overview

The current Alpaca model is fine-tuned from a 7B LLaMA model [1] on 52K instruction-following data generated by the techniques in the Self-Instruct [2] paper, with some modifications that we discuss in the next section. In a preliminary human evaluation, we found that the Alpaca 7B model behaves similarly to the text-davinci-003 model on the Self-Instruct instruction-following evaluation suite [2].

Alpaca is still under development, and there are many limitations that have to be addressed. Importantly, we have not yet fine-tuned the Alpaca model to be safe and harmless. We thus encourage users to be cautious when interacting with Alpaca, and to report any concerning behavior to help improve the safety and ethical considerations of the model.

Our initial release contains the data generation procedure, dataset, and training recipe. We intend to release the model weights if we are given permission to do so by the creators of LLaMA. For now, we have chosen to host a live demo to help readers better understand the capabilities and limits of Alpaca, as well as a way to help us better evaluate Alpaca's performance on a broader audience.

Please read our release blog post for more details about the model, our discussion of the potential harm and limitations of Alpaca models, and our thought process for releasing a reproducible model.

[1]: LLaMA: Open and Efficient Foundation Language Models. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. https://arxiv.org/abs/2302.13971v1

[2]: Self-Instruct: Aligning Language Model with Self Generated Instructions. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi. https://arxiv.org/abs/2212.10560

Data Release

alpaca_data.json contains 52K instruction-following data we used for fine-tuning the Alpaca model. This JSON file is a list of dictionaries, each dictionary contains the following fields:

instruction: str, describes the task the model should perform. Each of the 52K instructions is unique.
input: str, optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
output: str, the answer to the instruction as generated by text-davinci-003.

We used the following prompts for fine-tuning the Alpaca model:

for examples with a non-empty input field:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:

for examples with an empty input field:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:

During inference (eg for the web demo), we use the user instruction with an empty input field (second option).
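
The two templates above can be applied with a small helper like the following sketch. The template text is copied from this section; the function name `format_prompt` and the choice to treat a missing `input` key the same as an empty one are assumptions made for this example.

```python
from typing import Dict

PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)

PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def format_prompt(example: Dict[str, str]) -> str:
    """Pick the template based on whether the example has a non-empty input."""
    if example.get("input"):
        return PROMPT_WITH_INPUT.format(
            instruction=example["instruction"], input=example["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=example["instruction"])


if __name__ == "__main__":
    print(format_prompt({
        "instruction": "Identify the odd one out.",
        "input": "Twitter, Instagram, Telegram",
    }))
```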

Data Generation Process

Running the code

Set environment variables OPENAI_API_KEY to your OpenAI API key.
Install the dependencies with pip install -r requirements.txt.
Run python -m generate_instruction generate_instruction_following_data to generate the data.

We built on the data generation pipeline from self-instruct and made the following modifications:

We used text-davinci-003 to generate the instruction data instead of davinci.
We wrote a new prompt (prompt.txt) that explicitly gave the requirement of instruction generation to text-davinci-003. Note: there is a slight error in the prompt we used, and future users should incorporate the edit in tatsu-lab/stanford_alpaca#24
We adopted much more aggressive batch decoding, i.e., generating 20 instructions at once, which significantly reduced the cost of data generation.
We simplified the data generation pipeline by discarding the difference between classification and non-classification instructions.
We only generated a single instance for each instruction, instead of 2 to 3 instances as in [2].

This produced an instruction-following dataset with 52K examples obtained at a much lower cost (less than $500). In a preliminary study, we also find our 52K generated data to be much more diverse than the data released by self-instruct. We plot the below figure (in the style of Figure 2 in the self-instruct paper) to demonstrate the diversity of our data. The inner circle of the plot represents the root verb of the instructions, and the outer circle represents the direct objects.

Fine-tuning

We fine-tune our models using standard Hugging Face training code with the following hyperparameters:

Hyperparameter	Value
Batch size	128
Learning rate	2e-5
Epochs	3
Max length	512
Weight decay	0

Given Hugging Face hasn't officially supported the LLaMA models, we fine-tuned LLaMA with Hugging Face's transformers library by installing it from a particular fork (i.e. this PR to be merged). The hash of the specific commit we installed was 68d640f7c368bcaaaecfc678f11908ebbd3d6176.

To reproduce our fine-tuning runs for LLaMA, first install the requirements

pip install -r requirements.txt

Then, install the particular fork of Hugging Face's transformers library.

Below is a command that fine-tunes LLaMA-7B with our dataset on a machine with 4 A100 80G GPUs in FSDP full_shard mode. We were able to reproduce a model of similar quality as the one we hosted in our demo with the following command using Python 3.10. Replace <your_random_port> with a port of your own, <your_path_to_hf_converted_llama_ckpt_and_tokenizer> with the path to your converted checkpoint and tokenizer (following instructions in the PR), and <your_output_dir> with where you want to store your outputs.

torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LLaMADecoderLayer' \
    --tf32 True

Warning

fsdp_transformer_layer_cls_to_wrap must be set to the name of the specific decoder layer. The LLaMA Hugging Face PR is not stable. Earlier commits used the name LLaMADecoderLayer for their decoder layer (the commit hash our code is based on uses this name). More recent commits use LlamaDecoderLayer (notice the difference in letter case). Not setting fsdp_transformer_layer_cls_to_wrap to the correct name will lead to drastic slowdowns in training.

Side notes

The same script also works for OPT fine-tuning. Here's an example for fine-tuning OPT-6.7B

torchrun --nproc_per_node=4 --master_port=<your_random_port> train.py \
    --model_name_or_path "facebook/opt-6.7b" \
    --data_path ./alpaca_data.json \
    --bf16 True \
    --output_dir <your_output_dir> \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'OPTDecoderLayer' \
    --tf32 True

Note the given training script is meant to be simple and easy to use, and is not particularly optimized. To run on more gpus, you may prefer to turn down gradient_accumulation_steps to keep a global batch size of 128. Global batch size has not been tested for optimality.
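
The global batch size of 128 follows from the flags in the commands above: per-device batch size 4, gradient accumulation 8, and 4 GPUs. The sketch below spells out that arithmetic and computes the accumulation steps needed for a different GPU count; the helper names are made up for this example.

```python
def global_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Examples consumed per optimizer step across all GPUs."""
    return per_device * grad_accum * num_gpus


def grad_accum_for(target_global: int, per_device: int, num_gpus: int) -> int:
    """Accumulation steps needed to reach the target global batch size."""
    steps, remainder = divmod(target_global, per_device * num_gpus)
    if remainder or steps == 0:
        raise ValueError("target is not reachable with this per-device batch size and GPU count")
    return steps


print(global_batch_size(per_device=4, grad_accum=8, num_gpus=4))    # 4 * 8 * 4 = 128
print(grad_accum_for(target_global=128, per_device=4, num_gpus=8))  # 128 / (4 * 8) = 4
```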

Authors

All grad students below contributed equally and the order is determined by random draw.

All advised by Tatsunori B. Hashimoto. Yann is also advised by Percy Liang and Xuechen is also advised by Carlos Guestrin.

Citation

Please cite the repo if you use the data or code in this repo.

@misc{alpaca,
  author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto },
  title = {Stanford Alpaca: An Instruction-following LLaMA model},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
}

Naturally, you should also cite the original LLaMA paper [1] and the Self-Instruct paper [2].

Acknowledgements

We thank Yizhong Wang for his help in explaining the data generation pipeline in Self-Instruct and providing the code for the parse analysis plot. We thank Yifan Mai for helpful support, and members of the Stanford NLP Group as well as the Center for Research on Foundation Models (CRFM) for their helpful feedback.

koalpaca's Issues

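Hello! Sorry to bother you, but I have a question.

First of all, thank you for releasing KoAlpaca.

My question is about the resources language models need:
how did you estimate the required GPU specs from a model's parameter count, and how did you carry out training and inference?
I would appreciate any recommended resources, papers, or reference material.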

KoAlpaca (Polyglot) 4-bit quantization

I wrote code that applies the GPTQ algorithm using the AutoGPTQ library.
By default it is set to use bit = 4 and groupsize = 128, so it effectively runs at 4.15 bits.
https://github.com/qwopqwop200/GPTQ-for-KoAlpaca

Running KoAlpaca/nsmc_polyglot_and_koalpaca_by_few_shot.ipynb in Colab fails with "Your session crashed after using all available RAM."

Hello,

I tried to run https://github.com/Beomi/KoAlpaca/blob/main/nsmc_polyglot_and_koalpaca_by_few_shot.ipynb in Google Colab.

While loading Alpaca-polyglot,
the error "Your session crashed after using all available RAM." ("사용 가능한 RAM을 모두 사용한 후 세션이 다운되었습니다.")
occurs. How should I deal with it?

KoAlpaca-Polyglot-ko-5.8b-v1.1b inference code

After training the polyglot-ko-5.8b model, I am now looking for inference code to test it. Is there any inference code available for the polyglot-ko-5.8b model? I have tried using the ChatKoAlpaca inference code with the polyglot-ko-5.8b model, but the results are not accurate and appear to be random.

Thank you!

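Fix for ImportError: cannot import name 'DEFAULT_CIPHERS' from 'urllib3.util.ssl_' after running train_polyglot5.8b_singleA100.sh

Thank you for sharing this as open source!

The error in the title goes away after changing the urllib3 version as follows:

pip install urllib3==1.26

With that change it runs fine.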

Web service implementation

Hello.

First, thank you for sharing such great code.

Do you have any plans to build a service that requires no downloads other than Chrome?

Thank you!

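Answer tone and fine-tuning

Hello, and thank you very much for building a Korean language model, which is still rare.

I'm trying to change the speech style of the KoAlpaca model, which currently answers in polite speech (존댓말), for example to casual speech (반말) or to give it a character persona. In that case,

would it be more effective to change the tone of every answer in ko_alpaca_data.json and then fine-tune, or to train again from scratch?

I've only worked on computer vision and this is my first time with natural language processing, so I apologize if this is a basic question.

If it's not too much trouble, I'll wait for your answer.

Thank you.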

Question about polyglot-ko-5.8b training

Hello,

The README says a single A100 80GB was used to train the Polyglot-ko 5.8B model,

and your reply in #13 says the model was also loaded in fp16 from the start.

Does that mean the model weights were kept in fp16 during training?

If the model was trained in fp16, I think problems such as the following can occur:

The model weights can become nan
(https://stackoverflow.com/questions/74907860/torch-parameter-is-going-to-nan-when-after-first-optim-step-while-doing-gpts)
Training precision can degrade

Did you run into these problems during training,

and if so, how did you deal with them?

Thank you for sharing a great model!

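Advice on summarization prompts

Hello, I'm trying to use KoAlpaca for a summarization task.

I tried several prompts, but the outputs did not reflect the content of the source text, so I'm asking here.

I mostly used prompts of the following form ("Summarize the following text in about 20 words: Text: ...").
다음 원문을 20단어 내외로 요약: 원문: [text to summarize, e.g. an email]

Thank you.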

Question about the training data format

Hello! First, thank you for always replying quickly. It has helped a lot.
In the training data there is the
{"instruction": , "input":, "output":}
format and the
"### 질문: ### 맥락: ### 답변:"
format. How do they differ?
Also, which format did you use when training Polyglot-ko-12.8b-v1.1b?
Thank you!

LoRA-based training of polyglot-ko models

Have you ever tested LoRA for training polyglot-ko (5.8b, 12.8b)?
I don't have enough GPUs but wanted to train it myself, so I tried training with alpaca-lora. The results were poor, hence this question.

My experimental setup is as follows.
(In alpaca-lora/finetune.py I changed only the tokenizer and model to AutoTokenizer and AutoModelForCausalLM, and trained with the parameters below.)

python3.10 polyglot_finetune.py \
    --base_model '/mnt/hf/polyglot-ko-5.8b' \
    --data_path '/mnt/git/KoAlpaca/ko_alpaca_data.json' \
    --num_epochs=10 \
    --batch_size=128 \
    --micro_batch_size=8 \
    --learning_rate=3e-4 \
    --cutoff_len=512 \
    --val_set_size=2000 \
    --lora_r=16 \
    --lora_alpha=16 \
    --lora_dropout=0.05 \
    --group_by_length \
    --lora_target_modules='[query_key_value]'

Judging by the loss alone, training seems to have gone well... but the resulting quality is terrible.

Is there anything to watch out for when fine-tuning polyglot?

Question about parallel training with Polyglot

When using FSDP for parallel training with a Polyglot model,
which layer should be passed to fsdp_transformer_layer_cls_to_wrap?
Is it enough to pass GPTNeoXLayer, which appears to play the same role?

Pull requests

I made a cpp version. Could you grant me permission?

https://github.com/gyunggyung/KoAlpaca.cpp

Error when running the webui

When I follow the README at https://github.com/Beomi/KoAlpaca/tree/main/webui,

in sample next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1) RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
the error above occurs. If anyone has run into this error, could you share a solution? :)

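Is it possible to train the polyglot-ko-12.8b model on two A100 80G GPUs?

Releasing this as open source has been a great help.

Because of budget constraints, a system with more than two A100s would be difficult to set up.

Is it possible to train the polyglot-ko-12.8b model on two A100 80G GPUs?

Thank you.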

Issue: output is not trained in v1.1 training

I'm filing this issue after reading run_clm.py, the training code in train_v1.1.

From line 433 onward, you can see that every column other than column_names[0] is removed from the json dataset.

    # Preprocessing the datasets.
    # First we tokenize all the texts.
    if training_args.do_train:
        column_names = list(raw_datasets["train"].features)
    else:
        column_names = list(raw_datasets["validation"].features)
    text_column_name = "text" if "text" in column_names else column_names[0]

    # since this will be pickled to avoid _LazyModule error in Hasher force logger loading before tokenize_function
    tok_logger = transformers.utils.logging.get_logger("transformers.tokenization_utils_base")

    def tokenize_function(examples):
        with CaptureLogger(tok_logger) as cl:
            output = tokenizer(examples[text_column_name])
        # clm input could be much much longer than block_size
        if "Token indices sequence length is longer than the" in cl.out:
            tok_logger.warning(
                "^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits"
                " before being passed to the model."
            )
        return output

    with training_args.main_process_first(desc="dataset map tokenization"):
        if not data_args.streaming:
            tokenized_datasets = raw_datasets.map(
                tokenize_function,
                batched=True,
                num_proc=data_args.preprocessing_num_workers,
                remove_columns=column_names,
                load_from_cache_file=not data_args.overwrite_cache,
                desc="Running tokenizer on dataset",
            )
        else:
            tokenized_datasets = raw_datasets.map(
                tokenize_function,
                batched=True,
                remove_columns=column_names,
            )

The KoAlpaca_v1.1.jsonl dataset consists of "instruction" and "output" fields,

so with the code above only "instruction" is tokenized and kept in the dataset, and the group_texts function then concatenates these for training.

    def group_texts(examples):
        # Concatenate all texts.
        concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
        # customize this part to your needs.
        if total_length >= block_size:
            total_length = (total_length // block_size) * block_size
        # Split by chunks of max_len.
        result = {
            k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated_examples.items()
        }
        result["labels"] = result["input_ids"].copy()
        return result

    # Note that with `batched=True`, this map processes 1,000 texts together, so group_texts throws away a remainder
    # for each of those groups of 1,000 texts. You can adjust that batch_size here but a higher value might be slower
    # to preprocess.
    #
    # To speed up this part, we use multiprocessing. See the documentation of the map method for more information:
    # https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map

    with training_args.main_process_first(desc="grouping texts together"):
        if not data_args.streaming:
            lm_datasets = tokenized_datasets.map(
                group_texts,
                batched=True,
                num_proc=data_args.preprocessing_num_workers,
                load_from_cache_file=not data_args.overwrite_cache,
                desc=f"Grouping texts in chunks of {block_size}",
            )
        else:
            lm_datasets = tokenized_datasets.map(
                group_texts,
                batched=True,
            )

If so, the dataset's output is never used, and fine-tuning appears to run on a dataset that just repeats instructions.

This looks like a training bug, so I'm filing it as an issue.
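
One possible workaround for the behaviour described in this issue, offered as a sketch and not as the maintainer's confirmed fix: since run_clm.py uses a column named "text" when one exists, each record can be rewritten to a single "text" field that holds both the question and the answer. The "### 질문: / ### 답변:" layout follows the prompt format shown in the README; the `eos_token` argument is an assumption, to be filled with the tokenizer's end-of-sequence string if answers should be terminated explicitly.

```python
import json
from typing import Dict


def to_text_record(example: Dict[str, str], eos_token: str = "") -> Dict[str, str]:
    """Merge instruction and output into one 'text' field for causal LM training."""
    text = (
        f"### 질문: {example['instruction']}\n\n"
        f"### 답변: {example['output']}{eos_token}"
    )
    return {"text": text}


def convert_jsonl(src_path: str, dst_path: str, eos_token: str = "") -> None:
    """Rewrite a jsonl file so every line has only the merged 'text' column."""
    with open(src_path, encoding="utf-8") as src, open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                record = to_text_record(json.loads(line), eos_token)
                dst.write(json.dumps(record, ensure_ascii=False) + "\n")


# Usage (file names are placeholders):
# convert_jsonl("KoAlpaca_v1.1.jsonl", "KoAlpaca_v1.1_text.jsonl")
```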

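Training data file format for easy-lm-trainer

Hello! You mentioned that polyglot-ko-5.8b was trained with the code in the easy-lm-trainer repository.
When training polyglot with that code,
how should the alpaca data be formatted?
It seems to differ from a jsonl file of instruction-input-output triples, hence my question.
Thank you.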

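Custom answers to specific questions

Hello, I'm a beginner who started with LLMs not long ago...

I've been testing the recently opened https://chat.koalpaca.com/ by actually using it,

and I'm working with the same data, but what I'm wondering is this:

when asked something like "너는 누구니?" ("Who are you?"),

it answers "저는 ChatKoAlpaca 입니다." ("I am ChatKoAlpaca.").

To steer the model to give a specific answer to a specific message like this,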

{
"instruction": "너는 누구니?",
"output": "저는 ChatKoAlpaca 입니다"
}

would fine-tuning on content like this be enough?
Or does training have to be done from scratch?

Or is there some other way? I'd like to know!

Question about polyglot-ko-5.8b on A100 80G

Hello,

The README says polyglot-ko-5.8b was trained on a single A100 80G,
but I get a CUDA OOM error even when training on four A100 80G GPUs.
Could you tell me what might be wrong with my script?
(P.S. In case the script itself was the problem, I ran polyglot-ko-1.3b with it, and training completed normally.)

torchrun --nproc_per_node=4 train.py \
    --model_name_or_path "EleutherAI/polyglot-ko-5.8b" \
    --data_path ./ko_alpaca_data.json \
    --bf16 True \
    --output_dir ./models \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True
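
One thing worth checking in a setup like this is per-GPU memory. If no sharding option (such as FSDP or DeepSpeed ZeRO) is enabled, as appears to be the case in the command above, a `torchrun` launch is plain data parallelism: every GPU holds a full copy of the weights, gradients, and optimizer state, so adding GPUs does not shrink that footprint. The sketch below is a back-of-the-envelope estimate, assuming mixed-precision AdamW at roughly 16 bytes per parameter (a common rule of thumb, not a figure from this repository) and ignoring activations. The function name is hypothetical.

```python
def estimate_training_gb(num_params, bytes_per_param=16):
    """Rough memory for weights + gradients + Adam states, in GB (1e9 bytes).

    bytes_per_param=16 is an assumed rule of thumb for mixed-precision
    AdamW; activations and framework overhead are not included.
    """
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, n in [("polyglot-ko-1.3b", 1.3e9), ("polyglot-ko-5.8b", 5.8e9)]:
        print(f"{name}: ~{estimate_training_gb(n):.1f} GB per GPU before activations")
```

Under these assumptions the 1.3B model fits on an 80 GB card and the 5.8B model does not, which is consistent with the symptom described. How the single-GPU run mentioned in the README fit in memory is not stated here.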

ValueError when running python finetune.py --base_model 'beomi/KoAlpaca-Polyglot'

I followed the LoRA training instructions in README.md.

https://github.com/tloen/alpaca-lora
I wanted to run fine-tuning with the finetune.py script from that repo, using beomi/KoAlpaca-Polyglot as the base model.

ValueError: weight is on the meta device, we need a value to put in on 0

This is the error I get.

The environment is Google Colab with a 40GB GPU.

To describe the issue in more detail:

I prepared about 25 custom records and built the dataset by appending them to the end of ko_alpaca_data.json.

Then, using finetune.py from https://github.com/tloen/alpaca-lora, I tried to train with the command
python finetune.py --base_model '{model name to reference on Hugging Face}' --data_path '{path to the ko_alpaca_data.json file}'

The environment is as described above,
and I ran into several problems while changing the Hugging Face model name:

With beomi/KoAlpaca-13B-LoRA, I get an error saying config.json is missing from the Hugging Face repo.
With decapoda-research/llama-7b-hf or decapoda-research/llama-13b-hf, I get an error saying the json file cannot be read (probably because it is written in Korean).
With beomi/KoAlpaca-Polyglot, I get the ValueError mentioned above.

I judged that using train.sh for the fine-tuning approach in the https://github.com/Beomi/KoAlpaca README.md
would affect training speed significantly, so I want to train with the LoRA approach described below it.

When you trained the LLAMA 13B model, did you run into the problems above (especially the one about Korean data not being readable)?
If so, I would appreciate it if you could share how you solved them.

Question about padding in run_clm.py tokenize_function

Hello. Thank you for sharing such a great open-source project.

Following the train_v1.1b folder, I trained polyglot-ko-5.8b and global_step came out as 6, so I am asking about it.

Data used: KoAlpaca_v1.1.jsonl
GPU used: A100 x 1

That is the setup. Looking at run_clm.py, tokenize_function applies no padding,
so I think the examples are being concatenated up to max_length (block_size) in the tokenize_function -> group_texts step.

  def tokenize_function(examples):
      with CaptureLogger(tok_logger) as cl:
          output = tokenizer(examples[text_column_name])

Could you confirm whether training was done without any padding?

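Can KoAlpaca also produce embeddings for Korean text?

Thank you for building something so badly needed.
I am considering converting a demo I built on the ChatGPT API to run on KoAlpaca.
Would it be possible to use KoAlpaca or LLaMA to embed a piece of text and extract its vector?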

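Epochs during training

Did you train the polyglot-ko-5.8b model for 1 epoch?
It looks like 12.8b was trained for 2 epochs; is there a reason 5.8b was trained for 1 epoch?
I would also like to ask whether model performance depends heavily on the number of epochs.
Thank you :)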

Remembering previous conversation

Would you be interested in working on this together? I think it could be done more efficiently with DST.

Thank you :)

chat.koalpaca.com 500 internal error

koalpaca.com does not seem to be working because of a 500 error. Could you take a look?

Question about LLAMA-LoRA training

Hello.

The README notes that you referred to https://github.com/tloen/alpaca-lora for LLAMA-LoRA training.
The official hyperparameters (7B) in that repo are as follows. Did you train the Korean LLAMA 13B model with LoRA using these settings as they are (changing only the base model to 13b)?
python finetune.py \
    --base_model='decapoda-research/llama-7b-hf' \
    --num_epochs=10 \
    --cutoff_len=512 \
    --group_by_length \
    --output_dir='./lora-alpaca' \
    --lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
    --lora_r=16 \
    --micro_batch_size=8
You mentioned that two datasets were trained together for LLAMA-LoRA 13B. Did you translate tloen's alpaca_data_cleaned.json (English) with DeepL or a similar tool and merge it in?

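OpenQA Prompt

Hello! Thank you for making a well-performing Korean generative model available :)
I would like to use this model for an OpenQA task.
I am looking at the following entry in the seed task file: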
{"id": "seed_task_14", "name": "question_answering_generation_from_facts", "instruction": "Using the facts given, write a question-answer pair.", "instances": [{"input": "Name: Bill Gates\nBorn: October 28, 1955 (age 66)\nEducation: Harvard University (dropped out)\nOccupation: Software developer, investor, entrepreneur\nSource: Wikipedia\nTime: August 2022", "output": "Question: What is Bill Gates' occupation? \nAnswer: He is a software developer, investor, and entrepreneur."}], "is_classification": false}
Would it be fine to follow the format above?
Thank you.

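A question about the response format

I am running the Polyglot-ko 5.8B based model, and the responses seem to contain strings such as

### 응답:
### instruction:

in the generated text.
Is this something the front end should remove or handle specially when displaying the response? I hope it is all right to ask here.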
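
If the markers do need to be removed on the application side, one option is to cut the generated text at the first marker that follows the answer. A minimal sketch, assuming the two marker strings quoted above are the only ones to handle; the function name `clean_response` is hypothetical, and this is not the project's own post-processing.

```python
RESPONSE_MARKER = "### 응답:"
STOP_MARKERS = ("### instruction:", "### 응답:")


def clean_response(generated):
    """Keep the text after the first response marker, cut at the next marker."""
    # Drop everything up to and including the first "### 응답:" if present.
    _, found, tail = generated.partition(RESPONSE_MARKER)
    text = tail if found else generated
    # Truncate at the earliest remaining marker, if any.
    cut = len(text)
    for marker in STOP_MARKERS:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()
```

The same marker strings can often be passed to the generation call as stop sequences, which avoids generating the extra text in the first place; whether that is available depends on the serving stack.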

After referring to the deep-diver/Alpaca-LoRA-Serve repo by Chansung (@deep-diver)

After installing according to that repo's guide,

BASE_URL=decapoda-research/llama-13b-hf # for the 30B model use llama-30b-hf, for the 65B model use llama-65b-hf. Note: the 65B base model is over 100GB.
FINETUNED_CKPT_URL=beomi/KoAlpaca-13B-LoRA # for the 30B model use KoAlpaca-30B-LoRA, for the 65B model use KoAlpaca-65B-LoRA. Note: the 65B model only runs on GPUs with very large VRAM, such as an A100 80G.

python app.py --base_url $BASE_URL --ft_ckpt_url $FINETUNED_CKPT_URL --port 6006

I run it like this, but I get the following error:

unsupported model type. only llamastack, alpaca, flan, and baize are supported

What did I do wrong?

Training the polyglot-ko-5.8b model on two A100-40G GPUs

Hello!
When I try to train polyglot-ko-5.8b on two A100-40G GPUs with the train.sh from train_v1.1b, I get this error:
NCCL WARN Call to posix_fallocate failed : No space left on device
Monitoring GPU usage while running the code,
I see that each GPU is already using about 15G from the "Loading checkpoint shards" stage.
Is this expected?

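Which base model version does the KakaoTalk chatbot use?

Hello, I have been running inference on the uploaded models one by one.
So far they do not seem to perform as well as the KakaoTalk chatbot, so could you tell me which model the KakaoTalk chatbot is based on?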

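License question

Hello, Beomi.

I came across this project while looking for a pretrained Korean model to use for research at a company.

You said the backbone of KoAlpaca-Polyglot-ko-5.8b-v1.1b is polyglot.

I would like to know whether it is unrelated to LLAMA and is Alpaca in name only.

We have no plans at all for an external service, but in a corporate setting our use would be regarded as commercial,

so we cannot use LLAMA.

Could I ask for an answer?

Thank you.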

Question about training polyglot-ko-5.8b

Hello,
First, thank you for sharing such good open source.

I am writing because a question came up while training polyglot-ko-5.8b.

I am using a cloud server with A100 80GB * 8 as the training environment,
and training with the provided train_v1.1b/train_polyglot5.8b_singleA100.sh script produced a CUDA out of memory error.

Because of this, I am trying to train 5.8b on multiple GPUs, the same way as the 12.8b model.

I ran the train_v1.1b/train.sh script (DeepSpeed) with only the model changed to EleutherAI/polyglot-ko-5.8b,
but the resulting model seems to respond much more slowly than the model you uploaded (beomi/KoAlpaca-Polyglot-5.8B).

I would like your advice on whether the script's parameters need to be adjusted when training the 5.8b model with train_v1.1b/train.sh.

Thank you.

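Can the trained models be used commercially?

Hello, thank you for releasing such a good model.

https://github.com/tatsu-lab/stanford_alpaca/blob/main/DATA_LICENSE states that the data may not be used commercially, and this repository's license document appears to have been taken from Stanford Alpaca's. So I assumed this model could not be used commercially either, but the models uploaded to Hugging Face are marked with the mit license.

Can the released models be used commercially?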

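Question about coding a telegram bot

Hello!
I tried to attach the ChatGPT API to a telegram bot but failed repeatedly.
I am still a beginner at coding; I tried to hook it in through a handler using telegram's context, but it did not work well.
Could you share the code you used to connect to the telegram bot? If that is difficult, even a hint would be greatly appreciated!

Below is a telegram bot that reports stock prices. I tried to attach it here, but it seems my skills still have a long way to go.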

import logging

from telegram import Update
from telegram.ext import ApplicationBuilder, ContextTypes, CommandHandler
from yahoo_fin import stock_info

logging.basicConfig(
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    level=logging.INFO
)

async def start(update: Update, context: ContextTypes.DEFAULT_TYPE):
    # Reply to /start with a greeting.
    await context.bot.send_message(chat_id=update.effective_chat.id, text="I'm a bot, please talk to me!")

async def ticker(update: Update, context: ContextTypes.DEFAULT_TYPE):
    # /ticker <symbol>: reply with the live price of the given symbol.
    if not context.args:
        await context.bot.send_message(chat_id=update.effective_chat.id, text="Please enter a ticker.")
        return

    stock = context.args[0]
    try:
        price = stock_info.get_live_price(stock)
        await update.message.reply_text(f"The current price of {stock} is ${price:.2f}.")
    except Exception:
        # Log the traceback and send a generic error message to the user.
        logging.exception("Failed to fetch price for %s", stock)
        await update.message.reply_text("An error occurred. Please try again.")

if __name__ == '__main__':
    application = ApplicationBuilder().token('token_address').build()

    start_handler = CommandHandler('start', start)
    stock_handler = CommandHandler('ticker', ticker)

    application.add_handler(start_handler)
    application.add_handler(stock_handler)

    application.run_polling()

Speed problem when running inference with the gradio webui

I am using four A100 (40GB) GPUs.
Because of OOM issues with the 12.8b model I am testing with the 5.8b model, but inference through the webui takes more than 1-2 minutes.
Compared with https://chat.koalpaca.com/ this is considerably slower.

The code I use, based on several issues that beomi replied to, is:
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    device_map="auto",
    load_in_8bit=True,
    revision="8bit",
    torch_dtype=torch.float16,
    # low_cpu_mem_usage=True,
)
I tested low_cpu_mem_usage with both False and True, and nvidia-smi shows a Volatile GPU-Util of about 12% on all four GPUs.

Issue referenced: #34

How did you fine-tune 65b?

I see that you successfully fine-tuned LLaMA 65B using the alpaca-lora repo on 8xA100 80GB, but what command did you use to train it across multiple GPUs?

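Question about training

Hello. First of all, thank you very much for sharing such a great open-source project.

While going through the repository I became interested in training and have been following along.

The result I get from running the commands in the train_v1.1b folder is very different from the model you released, so I am asking about it.

I ran train.py in the train_v1.1b folder.

I have some questions.

In an A100 80G x 8 environment, training finishes in about 1 hour 20 minutes. Is that normal?
When I run the finished model in the webui, it keeps repeating a wrong answer.
ex)
Question: Who is the president of South Korea?
Answer: The president of South Korea is Moon Jae-in.The president of South Korea is Moon Jae-in.The president of South Korea is Moon Jae-in... (repeats endlessly)
When I run the webui with the model you shared, I confirmed that it gives properly organized answers.
(https://huggingface.co/beomi/KoAlpaca-Polyglot-12.8B)
I tried both ko_alpaca_data.json and KoAlpaca_v1.1.jsonl as the train file, and the symptom is the same.

Is there something I missed?

I am a complete beginner, so I am not even sure I am asking the question correctly.

I am attaching the log in case you can take a look.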

train.log

Is it possible to speed up polyglot-12.8b-koalpaca-v1.1b?

Hi there,
I tried to use your newly provided model (polyglot-12.8b-koalpaca-v1.1b) on my local system (with one GPU),
but it is rather slow. Is there any way to speed it up?

Thank you!

Could you also provide a quantized model that can be used with alpaca.cpp?

Hello.
https://github.com/antimatter15/alpaca.cpp
As in the repo above, there are many recent examples of quantizing LLMs and running inference on CPU only.
However, the model files are all based on alpaca, so Korean performance is somewhat lacking.

Could you provide a model file compatible with alpaca.cpp?

Testing training with train_v1.1/train.sh, but the dataset has only 697 examples

KoAlpaca_v1.1.json contains about 21,000 training examples,
but when I run train.sh, only 697 examples are actually trained on. Tracing the code, a large share appears to be dropped during preprocessing. Is this normal?

with training_args.main_process_first(desc="grouping texts together"):
    if not data_args.streaming:
        lm_datasets = tokenized_datasets.map(
            group_texts,
            batched=True,
            num_proc=data_args.preprocessing_num_workers,
            load_from_cache_file=not data_args.overwrite_cache,
            desc=f"Grouping texts in chunks of {block_size}",
        )
    else:
        lm_datasets = tokenized_datasets.map(
            group_texts,
            batched=True,
        )

I confirmed that the number of examples drops from 21000 to 697 when the code above runs.

Thank you for sharing the great code.
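
A drop in row count of this kind is what concatenate-and-chunk preprocessing produces: after `group_texts`, each row is a block of `block_size` tokens, not one original example. The following is a simplified, dependency-free sketch modeled on the `group_texts` function in the Hugging Face `run_clm.py` example. The toy token IDs and the block size of 8 are assumptions for illustration, not values from this repository.

```python
from itertools import chain


def group_texts(examples, block_size):
    """Concatenate tokenized examples and split them into fixed-size blocks."""
    # Flatten every column (e.g. input_ids) into one long list.
    concatenated = {k: list(chain(*v)) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder so every block has exactly block_size tokens.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM training the labels are a copy of the inputs.
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result


if __name__ == "__main__":
    # Ten toy "examples" of five tokens each (hypothetical data).
    batch = {"input_ids": [[i] * 5 for i in range(10)]}
    blocks = group_texts(batch, block_size=8)
    print(len(batch["input_ids"]), "examples ->", len(blocks["input_ids"]), "blocks")
```

With short examples and a large block size, many examples are packed into each block, so the number of rows after grouping can be far smaller than the number of original records without any data being discarded beyond the final remainder.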

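Question about the dataset used for the KoAlpaca-Polyglot-12.8B-v1.1b fp16 model

Hello, first, thank you for sharing such good open source.

I understand that the KoAlpaca 12.8B model was trained on data v1.1b.
I am curious whether you trained the Polyglot model on data v1.0 and then additionally on data v1.1,
or only on data v1.1.

Compared with data v1.0, it was harder to tell which tasks the instructions in data v1.1 cover,
and yet the KoAlpaca 12.8B model performs well, so I am asking what the reason is.
Is it simply the larger parameter count? Or is the data used for instruction fine-tuning what matters?

Thank you.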

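Do you have plans for a more optimized model?

Thank you very much for training and sharing one of the few Korean models.
I ran it on Colab and then tried to load the 5.8 and 7 models on a computer lying around at home, but its specs were hopelessly insufficient...
It brings tears to my eyes.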

Question about koalpaca inference.ipynb

Hello,
I would like to load the LoRA weights.
tokenizer = LlamaTokenizer.from_pretrained("../KoAlpaca/")
model = LlamaForCausalLM.from_pretrained("../KoAlpaca/").to('cuda:0')
Do I need to use peft to load them?

https://huggingface.co/beomi/KoAlpaca-13B-LoRA
I want to use the files from this repo.
Is it correct to load them with peft
and to use decapoda-research/llama-13b-hf
as the backbone model?
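
Loading a LoRA adapter generally means loading the base model first and then attaching the adapter weights with `peft`. The following is a minimal sketch under assumptions: it uses the two Hugging Face repo names mentioned in the question (their current availability is not verified here), it requires `torch`, `transformers`, `peft`, and `accelerate` to be installed, and the function name `load_koalpaca_lora` is hypothetical. It is not the repository's own inference notebook.

```python
BASE_MODEL = "decapoda-research/llama-13b-hf"  # backbone named in the question
LORA_WEIGHTS = "beomi/KoAlpaca-13B-LoRA"       # adapter repo named in the question


def load_koalpaca_lora(base_model=BASE_MODEL, lora_weights=LORA_WEIGHTS):
    """Load the base LLaMA model, then attach the LoRA adapter on top of it."""
    # Imported lazily so this module can be imported without the heavy deps.
    import torch
    from peft import PeftModel
    from transformers import LlamaForCausalLM, LlamaTokenizer

    tokenizer = LlamaTokenizer.from_pretrained(base_model)
    model = LlamaForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.float16,
        device_map="auto",  # requires the accelerate package
    )
    # PeftModel.from_pretrained wraps the base model with the adapter weights.
    model = PeftModel.from_pretrained(model, lora_weights)
    model.eval()
    return tokenizer, model
```

The adapter repo holds only the LoRA weights and their config, which is why pointing `from_pretrained` of a base-model class directly at it does not work.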

Do you plan to measure performance on other public Korean datasets?

Hello. Thank you for releasing the model.
Do you plan to measure performance on Korean datasets other than NSMC?

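Question about ko_alpaca_data

Hello.
I have been following the code you released with great interest.

I have a question about the ko_alpaca_data you uploaded.
Some records have an input and some do not, and I do not understand why the data is structured this way.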

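Question about the preprocessing in run_clm.py when training polyglot-ko-12.8B

Thank you for sharing the great code. I have a question about running the train.sh that trains on KoAlpaca_v1.1.

When run_clm.py runs, after the part shown below executes, the other fields of the training data ('output', 'url') seem to be dropped,

and only 'instruction' remains and goes into the training dataset.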

# Preprocessing the datasets.
# First we tokenize all the texts.
if training_args.do_train:
    column_names = list(raw_datasets["train"].features)
else:
    column_names = list(raw_datasets["validation"].features)
text_column_name = "text" if "text" in column_names else column_names[0]

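When I decode and print lm_datasets during actual training, the result consists only of the 'question' parts of the training dataset, concatenated and regrouped to fit the block size.

As I understand it, fitting the training data to the block size without padding should produce something like Q+A [sep] Q+A [sep]....
Is there something I missed?

Thank you.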
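
The column selection quoted in this issue can be reproduced in isolation: when no `text` column exists, the first column is used, which for this dataset is `instruction`. The sketch below shows that rule and one possible workaround, merging each record into a single `text` field before tokenization. The prompt template, the `build_text` helper, and the `<eos>` placeholder are hypothetical; the source does not state the format the author actually trained with.

```python
def pick_text_column(column_names):
    """Same rule as the run_clm.py snippet: prefer "text", else the first column."""
    return "text" if "text" in column_names else column_names[0]


def build_text(record, eos_token="<eos>"):
    """Merge question and answer into one training string (hypothetical template)."""
    return (
        f"### Question: {record['instruction']}\n\n"
        f"### Answer: {record['output']}{eos_token}"
    )


if __name__ == "__main__":
    columns = ["instruction", "output", "url"]
    print(pick_text_column(columns))  # the column that would be tokenized

    row = {"instruction": "Q1", "output": "A1", "url": "u"}
    row["text"] = build_text(row)
    print(pick_text_column(list(row)))  # column chosen once "text" exists
```

Ending each merged string with the tokenizer's end-of-sequence token keeps record boundaries visible after the examples are concatenated into blocks.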

polyglot inference

Hello. Thank you for the great work.

With https://github.com/Beomi/KoAlpaca/blob/main/Inference%20Test.ipynb I could easily try the llama-based model. Could you also release inference code for the polyglot-based models?

I tried following https://github.com/Beomi/KoAlpaca/blob/main/nsmc_polyglot_and_koalpaca_by_few_shot.ipynb, but it did not work well, so I am opening this issue.

Thank you.

Using KoAlpaca in Django

Hello. After seeing KoAlpaca served on the web, I tried to port it to Django, but it only runs in an ipynb file and I am at a loss as to how to set it up for the web.

Could I ask how https://chat.koalpaca.com/ was implemented? Or is there a way to implement it in a py file?

Load KoAlpaca-65B-LoRA on a local system

Hi there,

Could you please guide me on how to use your model on Hugging Face, beomi/KoAlpaca-65B-LoRA, just to test how it works? I don't want to train it; I want to use your trained weights, if possible.
I have been trying to get it to work on my machine, but I am running into some issues and could use some guidance.

I tried using pipeline, and even torch with load_state_dict, but neither worked for loading this model on my system.

Could you please provide some guidance on how to get your model up and running?

Thank you so much for your time and expertise. I really appreciate your contributions to the open-source community, and I'm interested in testing your model on my local system.


