Provide, as a CLI, a feature that takes only the texts component from several kinds of Korpus, merges them, and cleans them into data that can be used to train a language model.


Provide a feature for generating a merged corpus for language model training about korpora HOT 2 CLOSED

ko-nlp commented on May 10, 2024

Provide a feature for generating a merged corpus for language model training

from korpora.

Comments (2)

lovit commented on May 10, 2024

The commit above is a development version. Because kcbert, kowikitext, and namuwikitext contain a large amount of data, it is hard-coded to build the data from only 1000 lines of each file.

@ratsgo Please review the interface and the basic code against the current version.

Below are two usage examples. Depending on --save_each, the corpora are either saved into a single file or into a separate file per corpus. With --multilingual, translation data also includes the language paired with Korean.

(script)

git checkout dev-lmdata#65
python setup.py install

korpora lmdata \
  --corpus all \
  --output_dir ~/local/train/ \
  --multilingual

(print message)

| Done | Corpus name               | Num sents  | File name |
| ---- | ------------------------- | ---------- | --------- |
|  x   | kcbert                    |       1000 | all.train |
|  x   | korean_chatbot_data       |      23646 | all.train |
|  x   | korean_hate_speech        |    2042260 | all.train |
|  x   | korean_parallel_koen_news |     194246 | all.train |
|  x   | korean_petitions          |     867262 | all.train |
|  x   | kornli                    |    1900708 | all.train |
|  x   | korsts                    |      17256 | all.train |
|  x   | kowikitext                |       1582 | all.train |
|  x   | namuwikitext              |       2081 | all.train |
|  x   | naver_changwon_ner        |      90000 | all.train |
|  x   | nsmc                      |     200000 | all.train |
|  x   | question_pair             |      13776 | all.train |

(script)

git checkout dev-lmdata#65
python setup.py install

korpora lmdata \
  --corpus all \
  --output_dir ~/local/train/ \
  --multilingual \
  --save_each

(print message)

| Done | Corpus name               | Num sents  | File name                       |
| ---- | ------------------------- | ---------- | ------------------------------- |
|  x   | kcbert                    |       1000 | kcbert.train                    |
|  x   | korean_chatbot_data       |      23646 | korean_chatbot_data.train       |
|  x   | korean_hate_speech        |    2042260 | korean_hate_speech.train        |
|  x   | korean_parallel_koen_news |     194246 | korean_parallel_koen_news.train |
|  x   | korean_petitions          |     867262 | korean_petitions.train          |
|  x   | kornli                    |    1900708 | kornli.train                    |
|  x   | korsts                    |      17256 | korsts.train                    |
|  x   | kowikitext                |       1582 | kowikitext.train                |
|  x   | namuwikitext              |       2081 | namuwikitext.train              |
|  x   | naver_changwon_ner        |      90000 | naver_changwon_ner.train        |
|  x   | nsmc                      |     200000 | nsmc.train                      |
|  x   | question_pair             |      13776 | question_pair.train             |
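
For readers comparing the two output layouts above, here is a minimal Python sketch of the difference between the merged and the per-corpus mode. It is not the Korpora implementation: the function name `write_lm_data`, its parameters, and the whitespace normalization are assumptions made for illustration, and only the `all.train` and `<corpus name>.train` file names come from the printed tables. The sketch does not cover --multilingual.

```python
from pathlib import Path
from typing import Dict, Iterable


def write_lm_data(
    corpora: Dict[str, Iterable[str]],
    output_dir: str,
    save_each: bool = False,
) -> Dict[str, str]:
    """Write texts one per line and return a corpus name -> file name mapping.

    Hypothetical helper: `corpora` maps a corpus name to its texts.
    """
    out = Path(output_dir).expanduser()
    out.mkdir(parents=True, exist_ok=True)
    written = {}
    if not save_each:
        # Start from an empty merged file so that a rerun does not append twice.
        (out / "all.train").write_text("", encoding="utf-8")
    for name, texts in corpora.items():
        file_name = f"{name}.train" if save_each else "all.train"
        mode = "w" if save_each else "a"
        with open(out / file_name, mode, encoding="utf-8") as f:
            for text in texts:
                # Assumption: collapse internal whitespace so each text stays on one line.
                line = " ".join(text.split())
                if line:
                    f.write(line + "\n")
        written[name] = file_name
    return written
```

Calling `write_lm_data(corpora, "~/local/train/")` would produce a single `all.train`, while passing `save_each=True` would produce one `.train` file per corpus, mirroring the "File name" column in the two tables.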

@ratsgo For the Modu Corpus (모두의 말뭉치), I plan to do the additional work after the Korpora.load() feature is implemented.


lovit commented on May 10, 2024

The --n_samples, --min_length, and --max_length options are provided by Korpora.
- If --n_samples is a float less than 1, it is treated as a sample ratio.
The --deduplicate option is provided by korpora-preprocessing.
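
One way to read the option semantics above is sketched below in Python. This is an illustration under assumptions, not the Korpora or korpora-preprocessing code: the function names `filter_and_sample` and `deduplicate` are hypothetical, length is assumed to be counted in characters, filtering is assumed to happen before sampling, and sampling is assumed to be random with a fixed seed. Only the rule that a float `n_samples` below 1 acts as a ratio comes from the comment.

```python
import random
from typing import List, Optional, Union


def filter_and_sample(
    texts: List[str],
    n_samples: Optional[Union[int, float]] = None,
    min_length: Optional[int] = None,
    max_length: Optional[int] = None,
    seed: int = 0,
) -> List[str]:
    """Apply length bounds, then sample a count or a ratio (hypothetical helper)."""
    # Assumption: length is measured in characters and both bounds are inclusive.
    kept = [
        t for t in texts
        if (min_length is None or len(t) >= min_length)
        and (max_length is None or len(t) <= max_length)
    ]
    if n_samples is None:
        return kept
    if isinstance(n_samples, float) and n_samples < 1:
        # A float below 1 is treated as a ratio of the remaining texts.
        k = int(len(kept) * n_samples)
    else:
        # Otherwise it is an absolute count, capped at what is available.
        k = min(int(n_samples), len(kept))
    rng = random.Random(seed)
    # Sample indices and sort them so the original order is preserved.
    chosen = sorted(rng.sample(range(len(kept)), k))
    return [kept[i] for i in chosen]


def deduplicate(texts: List[str]) -> List[str]:
    """Drop exact duplicates while keeping first occurrences (hypothetical helper)."""
    return list(dict.fromkeys(texts))
```

With ten input texts, `n_samples=0.5` would keep five of them and `n_samples=3` would keep three; an integer larger than the number of available texts simply returns all of them.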
