yoheikikuta / bert-japanese

BERT with SentencePiece for Japanese text.

License: Apache License 2.0

Dockerfile 0.26% Python 35.03% Shell 0.47% Jupyter Notebook 64.24%

bert-japanese's Introduction

Yohei Kikuta

Resume: https://github.com/yoheikikuta/resume
X: https://twitter.com/yohei_kikuta

bert-japanese's Issues

About the SentencePiece tokenizer

Sentences are segmented with the SentencePiece tokenizer, but compared with MeCab there are differences such as the following.

SentencePiece tokenizer:

/わずか/30/秒/の/短い/映像/ながらも/ス/リル/満/点/に/仕/上がって/いる/
絆/をつなぐ/大/切/さを/映画/を通して/伝える/

MeCab:

/わずか/30/秒/の/短い/映像/ながら/も/スリル/満点/に/仕上がっ/て/いる/
/絆/を/つなぐ/大切/さ/を/映画/を通して/伝える/

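The model can perhaps be expected to absorb differences such as 「大/切」 versus 「大切」 or 「満/点」 versus 「満点」, but I suspect that splits like 「さを」 versus 「さ/を」 or 「ス/リル」 versus 「スリル」 could critically affect model performance.

I would like to update the vocabulary (VOCAB) and customize the tokenizer (MeCab, JUMAN++, etc.) through additional training. Is there a good way to do this?

For example, I would like to reuse the existing model, change the tokenizer, customize the VOCAB, and then pretrain, but I cannot find any tokenization step in run_pretraining.py...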
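The differences discussed above can be listed mechanically. The following is a minimal sketch, not part of the repository: the helper names `token_spans` and `only_in_first` are hypothetical, and the input is assumed to be the slash-delimited format used in the examples above.

```python
def token_spans(segmented):
    """Map a slash-delimited segmentation to a list of (start, end, token)."""
    spans, pos = [], 0
    for token in segmented.strip().split("/"):
        if not token:  # skip empty pieces produced by leading/trailing slashes
            continue
        spans.append((pos, pos + len(token), token))
        pos += len(token)
    return spans


def only_in_first(first, second):
    """Return tokens of `first` whose character span is not a token of `second`."""
    second_spans = {(start, end) for start, end, _ in token_spans(second)}
    return [tok for start, end, tok in token_spans(first)
            if (start, end) not in second_spans]


# The second example sentence from the issue, in both segmentations.
sentencepiece_line = "絆/をつなぐ/大/切/さを/映画/を通して/伝える/"
mecab_line = "/絆/を/つなぐ/大切/さ/を/映画/を通して/伝える/"

print(only_in_first(sentencepiece_line, mecab_line))
print(only_in_first(mecab_line, sentencepiece_line))
```

Comparing character spans rather than token strings keeps a repeated token such as 「を」 from masking a real boundary difference.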

KeyError in the finetune-to-livedoor-corpus notebook

Thank you for sharing.
When running
bert-japanese/notebook/finetune-to-livedoor-corpus.ipynb
with

!python3 ../src/run_classifier.py \
  --task_name=livedoor \
  --do_train=true \
  --do_eval=true \
  --data_dir=../data/livedoor \
  --model_file=../model/wiki-ja.model \
  --vocab_file=../model/wiki-ja.vocab \
  --init_checkpoint={PRETRAINED_MODEL_PATH} \
  --max_seq_length=512 \
  --train_batch_size=4 \
  --learning_rate=2e-5 \
  --num_train_epochs=10 \
  --output_dir={FINETUNE_OUTPUT_DIR}

the following error sometimes occurs.

KeyError                                  Traceback (most recent call last)
<ipython-input-113-351daf650d70> in <module>
      2     train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
      3     file_based_convert_examples_to_features(
----> 4         train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
      5     tf.logging.info("***** Running training *****")
      6     tf.logging.info("  Num examples = %d", len(train_examples))

~/bert/bert-japanese/src/run_classifier.py in file_based_convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, output_file)
    362 
    363     feature = convert_single_example(ex_index, example, label_list,
--> 364                                      max_seq_length, tokenizer)
    365 
    366     def create_int_feature(values):

~/bert/bert-japanese/src/run_classifier.py in convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer)
    331   assert len(segment_ids) == max_seq_length
    332 
--> 333   label_id = label_map[example.label]
    334   if ex_index < 5:
    335     tf.logging.info("*** Example ***")

KeyError: '大島優子がここからどう

The cause is that even though the DataFrame is created with

df = pd.DataFrame({'text' : all_text, 'label' : all_label})

the column order ends up as label, text, so the examples are not generated correctly, as shown below.

{'guid': 'train-3',
 'label': 'ブラックマジックデザイン、...',
 'text_a': 'kaden-channel',
 'text_b': None}

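If you are in a hurry, swapping the indices used for text_a and label in run_classifier.py works as a stopgap, but I think it would be better to pin the column order in the notebook by adding something like

df = df[['text', 'label']]

right below

df = pd.DataFrame({'text' : all_text, 'label' : all_label})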
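Another way to make the reader side robust is to look columns up by header name instead of by position. This is a minimal sketch under two assumptions that the issue does not confirm: that the TSV files written by the notebook contain a header row, and that the reader can be changed. `read_examples` is a hypothetical helper, not a function in run_classifier.py.

```python
import csv
import io


def read_examples(tsv_text):
    """Return (text, label) pairs, locating columns by header name."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(row["text"], row["label"]) for row in reader]


# The same record written with the two possible column orders.
label_first = "label\ttext\nkaden-channel\tsome article body\n"
text_first = "text\tlabel\nsome article body\tkaden-channel\n"

print(read_examples(label_first))
print(read_examples(text_first))
```

With a name-based lookup, both column orders produce the same example, so the result no longer depends on how the DataFrame happened to order its columns.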

WikiExtractor.py - No such file or directory

@yoheikikuta thanks for creating this repository and sharing the instructions on how to train BERT with Japanese wiki data! I'm trying to reproduce everything from scratch, but I can't find the WikiExtractor.py file.

python3 src/data-download-and-extract.py

100.0% 2906087424 / 2906079739
python3: can't open file '/data/bert-japanese/src/../../wikiextractor/WikiExtractor.py': [Errno 2] No such file or directory

Thank you.
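The path in the error message suggests that the script looks for WikiExtractor.py in a `wikiextractor` directory sitting two levels above `src/`, that is, next to the repository checkout. The sketch below only resolves that path and checks whether the file exists; `expected_wikiextractor_path` is a hypothetical helper, and the directory layout is inferred from the error message rather than from the script itself.

```python
import os


def expected_wikiextractor_path(src_dir):
    """Resolve the path from the error message, relative to the src/ directory."""
    return os.path.normpath(
        os.path.join(src_dir, "..", "..", "wikiextractor", "WikiExtractor.py")
    )


path = expected_wikiextractor_path("/data/bert-japanese/src")
print(path)
if not os.path.isfile(path):
    print("WikiExtractor.py not found; a wikiextractor checkout is expected there")
```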

Which license type does your pre-trained model belong to?

Hello, @yoheikikuta

Thank you very much for your great work!
Could you specify the license type of your pre-trained model, such as "Apache License 2.0"?
Although you clearly indicate "Apache License 2.0" on GitHub, there seems to be no license tag on Hugging Face:
https://huggingface.co/ALINEAR/albert-japanese-v2

Thanks

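About the tokenizer's do_lower_case

I wanted to use a SentencePiece model with another model and was using this code as a reference. With do_lower_case set to True, the tags used for conditioning (such as [CLS]) were lowercased as well, and as a result they were processed as several separate tokens rather than as a single token.

I suspect the same thing is happening here, so I am opening this issue (although, since this is a classification task, I do not think the impact is large).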
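One way to avoid the effect described above is to lowercase only the text between special tokens. This is a minimal sketch, not the repository's tokenizer: `lower_except_special` is a hypothetical helper, and the list of special tokens is an assumption based on the standard BERT vocabulary.

```python
import re

# Assumed special tokens; adjust to the vocabulary actually in use.
SPECIAL_TOKENS = ("[CLS]", "[SEP]", "[MASK]", "[PAD]", "[UNK]")
_SPECIAL_RE = re.compile("(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")")


def lower_except_special(text):
    """Lowercase `text` while leaving the special tokens untouched."""
    # A capturing group makes re.split keep the matched special tokens.
    pieces = _SPECIAL_RE.split(text)
    return "".join(p if p in SPECIAL_TOKENS else p.lower() for p in pieces)


print(lower_except_special("[CLS] Hello BERT [SEP]"))
print("[CLS] Hello BERT [SEP]".lower())  # plain lowercasing, for comparison
```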

ValueError: test_size=3.001358 should be smaller than 1.0 or be an integer in the finetune-to-livedoor-corpus notebook

Thank you for sharing. I am using this as a reference for Japanese BERT.
When I run the notebook, an error occurs in the cell below.
My environment is:
Ubuntu 18.04
Python 3.7.3
TensorFlow 1.13.1
GPU GTX 1080 Ti

%%time

model = GradientBoostingClassifier(n_estimators=200,
                                   validation_fraction=len(train_df)/len(dev_df),
                                   n_iter_no_change=5,
                                   tol=0.01,
                                   random_state=23)

### 1/5 of full training data.
# model = GradientBoostingClassifier(n_estimators=200,
#                                    validation_fraction=len(dev_df)/len(train_df),
#                                    n_iter_no_change=5,
#                                    tol=0.01,
#                                    random_state=23)

model.fit(train_dev_xs_, train_dev_ys)

ValueError Traceback (most recent call last)
in

~/.pyenv/versions/anaconda3-5.2.0/lib/python3.7/site-packages/sklearn/ensemble/gradient_boosting.py in fit(self, X, y, sample_weight, monitor)
1408 train_test_split(X, y, sample_weight,
1409 random_state=self.random_state,
-> 1410 test_size=self.validation_fraction))
1411 else:
1412 X_val = y_val = sample_weight_val = None

~/.pyenv/versions/anaconda3-5.2.0/lib/python3.7/site-packages/sklearn/model_selection/_split.py in train_test_split(*arrays, **options)
2205 cv = CVClass(test_size=test_size,
2206 train_size=train_size,
-> 2207 random_state=random_state)
2208
2209 train, test = next(cv.split(X=arrays[0], y=stratify))

~/.pyenv/versions/anaconda3-5.2.0/lib/python3.7/site-packages/sklearn/model_selection/_split.py in __init__(self, n_splits, test_size, train_size, random_state)
1276 def __init__(self, n_splits=10, test_size="default", train_size=None,
1277 random_state=None):
-> 1278 _validate_shuffle_split_init(test_size, train_size)
1279 self.n_splits = n_splits
1280 self.test_size = test_size

~/.pyenv/versions/anaconda3-5.2.0/lib/python3.7/site-packages/sklearn/model_selection/_split.py in _validate_shuffle_split_init(test_size, train_size)
1797 raise ValueError(
1798 'test_size=%f should be smaller '
-> 1799 'than 1.0 or be an integer' % test_size)
1800 elif np.asarray(test_size).dtype.kind != 'i':
1801 # int values are checked during split based on the input

ValueError: test_size=3.001358 should be smaller than 1.0 or be an integer
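scikit-learn documents validation_fraction as a proportion between 0 and 1, and the traceback shows it being passed on as test_size, so a ratio of about 3 is rejected. Since the model is fit on the combined train and dev data, one plausible intent is the share of that combined data taken up by the dev set. The sketch below shows only the arithmetic; `dev_share` is a hypothetical helper and the sizes are hypothetical values chosen because their ratio matches the error message.

```python
def dev_share(n_train, n_dev):
    """Fraction of the combined train+dev data that the dev set accounts for."""
    return n_dev / (n_train + n_dev)


# Hypothetical sizes, chosen only because their ratio matches the error message.
n_train, n_dev = 4421, 1473

print("%f" % (n_train / n_dev))   # the kind of value that was rejected as test_size
print(dev_share(n_train, n_dev))  # a proportion strictly between 0 and 1
```

Whether this is the fraction the notebook intended is an assumption; any value strictly between 0 and 1 avoids the ValueError.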

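How to fine-tune for a text summarization task

Hello,
I have a question about the text summarization task.

Like Google's BERT model, the bert-japanese code is currently set up for text classification.

However, I want to switch from text classification to text summarization, so I tried editing run_classifier.py. While editing, several points were unclear to me.

The num_labels passed to create_model is originally the number of livedoor news genres. How should it be set for text summarization?
I did not fully understand how to specify text_a, text_b, and label for InputExample. For text classification, I understand that text_a holds the news article and label holds the encoded number of the article's genre. For text summarization, should text_a hold the body text to be summarized and text_b the summary, with label left unused?

I expect that other parts will also need to change before the switch to text summarization works, but the points above are all I can think of for now.

I have come to understand BERT by consulting various other sources, but I could not find any that describe how to edit the source code.

I apologize for asking in imperfect Japanese; I am writing in Japanese because I want to improve.
If you have any pointers on the text summarization task, I would be grateful for your help.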
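A classification head outputs one label per input, so it cannot generate a summary as free text. One framing that does fit the InputExample interface is extractive summarization treated as per-sentence binary classification, with num_labels set to 2. The sketch below illustrates only that framing and is not the author's method: `InputExample` here is a stand-in namedtuple, `sentence_examples` is a hypothetical helper, and labeling a sentence by exact membership in the reference summary is a deliberately naive assumption.

```python
from collections import namedtuple

# Stand-in for run_classifier.py's InputExample so that the sketch runs alone.
InputExample = namedtuple("InputExample", ["guid", "text_a", "text_b", "label"])


def sentence_examples(doc_id, sentences, summary_sentences):
    """Label each sentence "1" if it appears in the summary, otherwise "0"."""
    chosen = set(summary_sentences)
    return [
        InputExample("%s-%d" % (doc_id, i), sentence, None,
                     "1" if sentence in chosen else "0")
        for i, sentence in enumerate(sentences)
    ]


examples = sentence_examples(
    "doc0",
    ["first sentence", "second sentence", "third sentence"],
    ["second sentence"],
)
for example in examples:
    print(example.guid, example.label, example.text_a)
```

Under this framing text_b stays unused, and the summary is assembled afterwards from the sentences predicted as "1".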

Cannot load checkpoint for pretraining

Hi!

I am trying to fine-tune the masked-token prediction task on my own corpus by running run_pretraining.py, but I cannot load your model checkpoint.

This is what I ran:

%%time
!python3 '/content/bert-japanese/src/run_pretraining.py' \
    --input_file='/content/gdrive/My Drive/Japanese OCR/train.tfrecord' \
    --output_dir='/content/gdrive/My Drive/Japanese OCR/bert-wiki-ja' \
    --init_checkpoint='/content/gdrive/My Drive/Japanese OCR/bert-wiki-ja/model.ckpt-1400000' \
    --use_tpu=True \
    --tpu_name='grpc://10.50.229.186:8470' \
    --do_train=True \
    --do_eval=False \
    --max_eval_steps=50 \

And I got this result:

Unsuccessful TensorSliceReader constructor: Failed to get matching files on /content/gdrive/My Drive/Japanese OCR/bert-wiki-ja/model.ckpt-1400000: Unimplemented: File system scheme '[local]' not implemented (file: '/content/gdrive/My Drive/Japanese OCR/bert-wiki-ja/model.ckpt-1400000')
	 [[node checkpoint_initializer_204 (defined at tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
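This kind of error is commonly seen when a Cloud TPU job is given paths on the local file system: the TPU worker runs on a separate host and cannot read the Colab VM's disk or a mounted Google Drive, so such jobs typically need Cloud Storage (gs://) paths. Assuming that is the cause here, which the thread does not confirm, a pre-flight check could look like the sketch below. `non_gcs_paths` is a hypothetical helper.

```python
def non_gcs_paths(flags):
    """Return the names of flags whose values are not gs:// URLs."""
    return sorted(name for name, value in flags.items()
                  if not value.startswith("gs://"))


# Path-valued flags copied from the command in the issue.
flags = {
    "input_file": "/content/gdrive/My Drive/Japanese OCR/train.tfrecord",
    "output_dir": "/content/gdrive/My Drive/Japanese OCR/bert-wiki-ja",
    "init_checkpoint": "/content/gdrive/My Drive/Japanese OCR/bert-wiki-ja/model.ckpt-1400000",
}

print(non_gcs_paths(flags))
```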

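[Question] Extracting weight vectors from BERT's intermediate layers

Extracting the weight matrices of a deep learning model's intermediate layers should make it possible to use them as features, or to visualize them in order to explain the model's internal state.
For example, I would like to use BERT's intermediate layers
to visualize words, similar to the word2vec word mapping below:
https://qiita.com/Kosuke-Szk/items/d49e2127bf95a1a8e19f
In this case, which of BERT's intermediate layers specifically should I use
to produce a mapping like the one in the article above?
I would appreciate your opinion. Thank you in advance.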
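For a per-word map, the usual input is the hidden-state vector each layer produces for a token rather than the weight matrices themselves, and there is no single correct layer; the last few layers are a common starting point. Upstream BERT ships extract_features.py, which writes one JSON object per input line with per-token vectors for the requested layers. That script is not part of this repository's src/ directory and would need adapting to the SentencePiece tokenizer. The sketch below only parses one line in that output shape; `token_vectors` is a hypothetical helper and the sample values are made up.

```python
import json

# One line in the shape written by upstream BERT's extract_features.py.
# Values are shortened to 2 dimensions; real vectors are hidden_size long.
sample_line = json.dumps({
    "linex_index": 0,
    "features": [
        {"token": "[CLS]", "layers": [{"index": -1, "values": [0.1, 0.2]},
                                       {"index": -2, "values": [0.3, 0.4]}]},
        {"token": "▁映画", "layers": [{"index": -1, "values": [1.0, 2.0]},
                                      {"index": -2, "values": [3.0, 4.0]}]},
    ],
})


def token_vectors(line, layer_index=-2):
    """Return (token, vector) pairs for one layer of one JSON line."""
    record = json.loads(line)
    vectors = []
    for feature in record["features"]:
        for layer in feature["layers"]:
            if layer["index"] == layer_index:
                vectors.append((feature["token"], layer["values"]))
    return vectors


for token, vector in token_vectors(sample_line):
    print(token, vector)
```

The resulting vectors can then be passed to any dimensionality-reduction method for plotting, in the same way as word2vec vectors.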

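How to ensure reproducibility of the training program

Thank you for sharing the programs and documentation.
I would like to make training reproducible at the point where finetune-to-livedoor-corpus.ipynb runs src/run_classifier.py.
I would appreciate it if you could tell me which programs and steps need to be rewritten to ensure reproducibility, for example by fixing seed values.

At present I fix the seed for the modules imported by the following programs, but I have not been able to achieve reproducibility.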

run_classifier.py
- tensorflow
tokenization_sentencepiece.py
- tensorflow
modeling.py
- numpy
- tensorflow
optimization.py
- tensorflow

The NumPy and TensorFlow seeds are fixed as follows.

def fix_seed(seed):
    # Numpy
    np.random.seed(seed)
    # Tensorflow
    tf.compat.v1.set_random_seed(seed)

SEED = 42
fix_seed(SEED)

All of the code above is placed right after the initial imports, for example at:
https://github.com/yoheikikuta/bert-japanese/blob/master/src/run_classifier.py#L20

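If there are any other programs or modules that need seed fixing or similar measures for reproducible training, I would appreciate your guidance.
Any additional notes on ensuring training reproducibility would also be appreciated.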
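A few other possible sources of randomness are worth checking; none of them has been verified against this repository. An Estimator builds its own graph when training starts, so a graph-level seed set at import time may not apply to it, whereas `tf.estimator.RunConfig` accepts a `tf_random_seed` argument. `Dataset.shuffle` also takes its own `seed` argument, and some GPU kernels are nondeterministic regardless of seeds. On the pure-Python side, the `random` module and hash randomization can be pinned as in the sketch below; `fix_python_seeds` and `shuffled` are hypothetical helpers.

```python
import os
import random


def fix_python_seeds(seed):
    """Seed Python's own RNG and export PYTHONHASHSEED for child processes."""
    random.seed(seed)
    # Only affects interpreters started after this point (for example a
    # script launched from the notebook), not the current process.
    os.environ["PYTHONHASHSEED"] = str(seed)


def shuffled(items, seed):
    """Return a shuffled copy of `items` that depends only on `seed`."""
    fix_python_seeds(seed)
    items = list(items)
    random.shuffle(items)
    return items


SEED = 42
print(shuffled(range(10), SEED) == shuffled(range(10), SEED))
```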

Is there a reason not to use the do_predict flag?

In the "Predict using the finetuned model" section of finetune-to-livedoor-corpus.ipynb,

result = estimator.predict(input_fn=predict_input_fn)

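it appears that code duplicating run_classifier.py is written above this cell in order to run it.

However, if the goal is to look at the accuracy or the classification_report, it seems easier to run run_classifier.py with the do_predict flag and then look at the test_results.tsv it outputs.

I prepared an example here:
https://gist.github.com/miyamonz/0aa8c4aabca21822df685f439167dfc8
I ran it locally and it appears to work correctly.

I am a beginner at machine learning, so I apologize if this is a basic misunderstanding.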
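Upstream BERT's run_classifier.py writes test_results.tsv with one row per test example and one tab-separated probability per class. Assuming this repository's copy keeps that format, accuracy can be computed from the file as sketched below. `predicted_labels` and `accuracy` are hypothetical helpers, the label order must match the processor's label list, and "other-label" is a made-up label used only for illustration.

```python
import csv
import io


def predicted_labels(tsv_text, label_list):
    """Pick the highest-probability label for each row of class probabilities."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    predictions = []
    for row in rows:
        probs = [float(p) for p in row]
        predictions.append(label_list[probs.index(max(probs))])
    return predictions


def accuracy(predictions, gold):
    """Share of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


labels = ["kaden-channel", "other-label"]
tsv = "0.9\t0.1\n0.2\t0.8\n0.6\t0.4\n"   # stand-in for test_results.tsv contents
gold = ["kaden-channel", "other-label", "other-label"]

preds = predicted_labels(tsv, labels)
print(preds, accuracy(preds, gold))
```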

May I know the specs of the PC or cloud instance used for this pretraining?

The log shows that pre-training took 5 days. May I know the specs of the machine it was run on?

Is it OK to register the pretrained models on Kaggle?

Would it be all right to register the pretrained models hosted on Google Drive as a Kaggle public dataset and use them there?

Are there any experimental results showing that the SentencePiece Japanese BERT model outperforms a WordPiece one?

Do we have any results to support that SentencePiece is better?

BERT vs sklearn, both using SentencePiece

Thank you for sharing.
The sample at https://github.com/yoheikikuta/bert-japanese/blob/master/notebook/finetune-to-livedoor-corpus.ipynb compares

BERT with SentencePiece
sklearn GradientBoostingClassifier with MeCab

but is there no comparison against sklearn with SentencePiece?

Where to find your trained sentencepiece tokenizer model?

Update from upstream (BERT)

create_pretraining_data, run_pretraining, and run_classifier appear to be based on the BERT implementation. Do you plan to incorporate updates from upstream BERT when they are made?

diff -s -q bert/ src/ | sort | grep -v .git
Files bert/create_pretraining_data.py and src/create_pretraining_data.py differ
Files bert/run_classifier.py and src/run_classifier.py differ
Files bert/run_pretraining.py and src/run_pretraining.py differ
Only in bert/: CONTRIBUTING.md
Only in bert/: extract_features.py
Only in bert/: __init__.py
Only in bert/: LICENSE
Only in bert/: modeling.py
Only in bert/: modeling_test.py
Only in bert/: multilingual.md
Only in bert/: optimization.py
Only in bert/: optimization_test.py
Only in bert/: README.md
Only in bert/: requirements.txt
Only in bert/: run_squad.py
Only in bert/: sample_text.txt
Only in bert/: tokenization.py
Only in bert/: tokenization_test.py
Only in src/: data-download-and-extract.py
Only in src/: file-preprocessing.sh
Only in src/: tokenization_sentencepiece.py
Only in src/: train-sentencepiece.py
Only in src/: utils.py
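A comparison like the one above can also be scripted with Python's standard `filecmp` module, which is convenient for re-checking after each upstream update. This is a minimal sketch: `compare_dirs` and `write` are hypothetical helpers, and the temporary directories and file contents are stand-ins for real checkouts of `bert/` and `src/`.

```python
import filecmp
import os
import tempfile


def compare_dirs(left, right):
    """Summarize a top-level directory comparison, similar to `diff -q`."""
    cmp = filecmp.dircmp(left, right)
    return {
        # dircmp compares file signatures (type, size, mtime) first and
        # reads contents only when the signatures differ.
        "differ": sorted(cmp.diff_files),
        "only_left": sorted(cmp.left_only),
        "only_right": sorted(cmp.right_only),
    }


def write(path, text):
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


with tempfile.TemporaryDirectory() as root:
    bert, src = os.path.join(root, "bert"), os.path.join(root, "src")
    os.mkdir(bert)
    os.mkdir(src)
    write(os.path.join(bert, "run_classifier.py"), "upstream\n")
    write(os.path.join(src, "run_classifier.py"), "modified version\n")
    write(os.path.join(bert, "run_squad.py"), "upstream only\n")
    write(os.path.join(src, "utils.py"), "repository only\n")
    result = compare_dirs(bert, src)

print(result)
```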