gpt2-ml's People

Contributors

dependabot[bot], erjanmx, imcaspar, mymusise

gpt2-ml's Issues

[Bug] name your bug

Environment

  • Python version: Python 3.7.5 x64
  • OS: Win10 18363.476
  • (Optional) Other libraries and their versions:
    pandas==0.24.2
    regex==2019.4.14
    h5py==2.9.0
    numpy==1.16.2
    tensorboard==1.13.1
    tensorflow==1.13.1
    tensorflow-estimator==1.13.0
    tqdm==4.31.1
    requests==2.22.0

Error messages, stack traces, or logs

D:\gpt2-ml-master>python scripts/interactive_conditional_samples.py -model_config_fn configs/mega.json -model_ckpt models/mega/model.ckpt-100000 -eos_token 511 -min_len 200 -samples 10
Traceback (most recent call last):
File "scripts/interactive_conditional_samples.py", line 10, in <module>
from train.modeling import GroverModel, GroverConfig, sample
ModuleNotFoundError: No module named 'train'
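The script imports the local train package, which Python can only find if the repo root is on sys.path. A minimal workaround, assuming the standard repo layout (gpt2-ml/scripts/ and gpt2-ml/train/ side by side), is to add the root before the import, or simply run the script from the repo root with PYTHONPATH set to ".":

```python
import os
import sys

def add_repo_root(script_path):
    """Prepend the repo root (the parent of scripts/) to sys.path so that
    'from train.modeling import ...' resolves.  Assumes the standard layout:
    gpt2-ml/scripts/<script>.py next to gpt2-ml/train/."""
    root = os.path.dirname(os.path.dirname(os.path.abspath(script_path)))
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

Calling add_repo_root(__file__) at the top of the script (before the train import) should make the module resolvable regardless of the working directory.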

Additional context (optional)

This is the result when I run it locally. Why? Is it just because I don't have a dedicated GPU? Thanks.

Vocabulary question

Hello, where is the vocabulary file corresponding to the model you released? The word counts of the two vocabulary (tokenization) files provided in the repository do not match the vocabulary the model was trained on at all.
Could you let me know? Thanks!

[Discussion] Vocabulary file for fine-tuning

If I want to fine-tune the version trained on the 30 GB corpus, the vocabulary used by the current prepare_data.py does not seem to match. What adjustments does prepare_data.py need? Is it enough to swap the vocabulary file from BERT's to CLUE's, or does anything else need to change?
@imcaspar

[Discussion] GPT-3

Thank you for the great work. Appendix B of the GPT-3 paper mentions the following. I'm wondering whether the idea has been implemented in gpt2-ml. If not yet, what would you advise regarding how to implement it?

Appendix B.

....
During training we always train on sequences of the full nctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency. Sequences with multiple documents are not masked in any special way but instead documents within a sequence are delimited with a special end of text token, giving the language model the information necessary to infer that context separated by the end of text token is unrelated. This allows for efficient training without need for any special sequence-specific masking.
....
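The packing scheme described in the quoted appendix can be sketched as follows. This is an illustration of the idea, not this repo's implementation; eot_id and the context length are parameters:

```python
def pack_documents(docs, eot_id, ctx_len):
    """Concatenate tokenized documents into one stream, delimiting them
    with an end-of-text token, then slice the stream into full ctx_len
    windows (GPT-3 Appendix B style).  No special cross-document masking
    is applied; the delimiter alone marks the document boundary."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eot_id)
    # Keep only complete windows; a partial tail is dropped here for brevity.
    return [stream[i:i + ctx_len]
            for i in range(0, len(stream) - ctx_len + 1, ctx_len)]
```

With ctx_len = 2048 and the tokenizer's end-of-text id, short documents are packed together instead of padded, which is the computational-efficiency point the appendix makes.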

Has the author trained the model into something close to an identity function?

Looking at the generated results, here is a quote from question 1:
【想要学习更多内容,请关注微信号:b##ms##h##200##1】【文章来源:艾锐文化】点击下
方阅读原文查看更多内容。↓↓↓##↓##↓##↓##↓##↓##↓##↓##↓##↓##↓##↓##↓##↓##↓##↓##↓##↓点击"阅读原文"【查看更多内容】

I went back and checked the training parameters and the vocabulary: it is not word-segmented, so reproducing "文章来源" (article source) in the output is acceptable, but producing "艾锐文化" verbatim suggests identity-like output; it is a proper name and should not have such a strong association.
A ** user of Morizeyao/GPT2-Chinese ran into the same thing: the trained model always tends to reproduce the original text...

So my real question is: is this brute-force, OpenAI-scale giant model actually generating, or merely reproducing?

[Bug] Cannot download the file


fatal: destination path 'gpt2-ml' already exists and is not an empty directory.
/content/gpt2-ml
models/mega/model.c [ <=> ] 3.02K --.-KB/s in 0s
Couldn't download the file :-(

Expected: the file downloads successfully.

Environment

Colab

  • Python version:
  • OS:
  • (Optional) Other libraries and their versions:



[Discussion] GPU inference speed

Running inference on a V100 GPU directly with demo.py: with min_len set to 150, each sample takes 80-90 s to generate, and even with min_len reduced to 10 each sample still takes about 30 s. Is this normal? If so, are there settings or methods to speed up inference?

Thanks.

Generation quality

Hello, what is the generation quality of this pretrained model supposed to look like? After converting it to PyTorch, the generated content is very poor, and I'd like to know whether the conversion is the cause.
Thanks!

vocab_size 21130 vs 21128

In mega.json, vocab_size is 21130, but bert-base-chinese-vocab.txt has 21128 lines, which does not match the size in the model. Was the pretrained model trained with this vocab? Thanks.

Question about training

Hello! I'd like to ask:
Is pre-training on a huge corpus and then adapting on a small domain really better than first classifying and cleaning the data and training separate models?

  • A model extracts features from its input, but the features it can represent are still limited, especially for NLP problems. Can a single model really cover everything? As the data volume keeps growing, feature combinations will eventually conflict, reducing the model's accuracy.
  • Is endlessly adding layers, adding data, and growing the model really better than models trained specifically on higher-quality data? By analogy with people, practice makes a specialist: a person can master several skills but is most fluent in the one currently in use, and switching to another still takes some review.
  • In a model this shows up as a tendency to output whatever was trained most recently. So could the data instead be classified (news, translation, finance, and so on), each class trained into its own pretrained model and then fine-tuned, with an adapter layer in front that accepts all inputs and routes each to the appropriate back-end model?
  • Rather than keep increasing the data volume and layer count until the model is broad but imprecise: model parameters are finite, and the combinations obtained by weighted sums still cannot represent every possibility; the more data and layers, the more likely conflicts become, and the model's applicability drops.
    This is just my personal view; comments welcome!

Using GPT-2 for couplets, the results are impressive

The first lines (上联) below were written by people (you can look them up on Baidu); the second lines (下联) were all produced by the machine. The results are impressive; you have probably even forgotten the original matching lines:
【上联】悠悠柳岸落红霞 【下联】寂寂春窗落絮声
【上联】悠悠柳岸落红霞 【下联】袅潺松阴披晚节
【上联】悠悠柳岸落红霞 【下联】郁郁荷塘映彩虹
【上联】悠悠柳岸落红霞 【下联】冉冉荷塘生绿烟
【上联】悠悠柳岸落红霞 【下联】款款江堤荡碧波
【上联】漠漠水田飞白鹭 【下联】茫茫云路隐青山
【上联】悠悠柳岸落红霞 【下联】款款荷塘荡碧波
【上联】漠漠水田飞白鹭 【下联】滔滔波浪涌黄河
【上联】漠漠水田飞白鹭 【下联】悠悠柳岸荡轻舟
【上联】帝道真知,如今都成过去事 【下联】民心所向,自古都是往来人
【上联】旷古圣人才,能以逍遥通世法 【下联】平生名利事,自然淡泊得真如
【上联】公谊不妨私,平日政见分弛,肝胆至今推挚友 【下联】人生何足论?此时心怀坦荡,襟怀自古仰高风
【上联】公谊不妨私,平日政见分弛,肝胆至今推挚友 【下联】人间原是纸,吾意情犹激励,江山从古胜文章
【上联】公谊不妨私,平日政见分弛,肝胆至今推挚友 【下联】子孙何必羡?此时贤亲义士,风霜从古更哀师
【上联】英雄作事无它,只坚忍一心,能成世界能成我 【下联】壮志凌云有志,纵风流万里,不负春秋不负人
【上联】英雄作事无它,只坚忍一心,能成世界能成我 【下联】人间多情至此,在消磨半句,可笑天伦最是家
【上联】共和误民国?民国误共和?百世而后,再平是狱 【下联】以人为天下,天下为己任:万物之中,一物之间
【上联】共和误民国?民国误共和?百世而后,再平是狱 【下联】同心作主人,天地为公道:一尘不染,无欲则刚
【上联】共和误民国?民国误共和?百世而后,再平是狱 【下联】一语关己身,子孙成大孝!四德可先,既往以终
【上联】我以我血荐轩辕 【下联】自有英名垂宇宙
【上联】漠漠水田飞白鹭 【下联】盈盈竹坞醉秋风
PS: Based on Uber's PPLM (Plug and Play Language Models: a Simple Approach to Controlled Text Generation), a bag of words keeps the generated topic from drifting. For example, in the couplets below I supply the first line and the generated second line sticks closely to the theme of "spring"; the results look pretty good:

山 抹 微 云 , 天 粘 衰 草 , 画 角 声 断 谯 门 |水 流 明 月 , 风 送 残 花 , 诗 心 韵 动 江 楼
桃 花 也 解 愁 , 点 点 飘 红 玉 | 柳 絮 才 知 春 , 丝 丝 缠 绿 烟
新 年 都 未 有 芳 华 , 二 月 初 惊 见 草 芽 |旧 友 已 然 成 故 事 , 三 更 又 起 闻 花 香
国 破 山 河 在 , 城 春 草 木 深 | 人 逢 盛 世 来 , 民 乐 天 地 新
红豆生南国,春来发几枝 |青山在北疆,秋去又一年
我看青山多妩媚 | 谁知碧水不妖娆
天 阶 夜 色 凉 如 水 | 人 面 春 风 醉 若 泥

【上联】春山暖日和风,阑干楼阁帘栊 【下联】暮水朝云细雨,别院花木兰芳
【上联】京口瓜洲一水间,钟山只隔数重山 【下联】巴人竹叶千杯里,花雨不沾半缕尘
【上联】春风又绿江南岸,明月何时照我还 【下联】柳岸常依燕子楼,繁花不处有谁家
【上联】碧玉妆成一树高,万条垂下绿丝绦 【下联】红日照亮千畴艳,百鸟争鸣金缕机
【上联】碧玉妆成一树高,万条垂下绿丝绦 【下联】红云照亮千峰翠,百鸟唤来金雀巢
【上联】天阶夜色凉如水 【下联】曲径花声艳若霞
【上联】春风又绿江南岸 【下联】细雨还红陌上桃
【上联】春风又绿江南岸 【下联】旭日重临岭上林
【上联】春风又绿江南岸 【下联】细雨还红陌上桃
【上联】我看青山多妩媚 【下联】谁言梅水不芬芳
Rewritten second lines for the famous poem 《春江花月夜》; it has probably even forgotten the actual next lines, haha:
【上联】春江潮水连海平,海上明月共潮生 【下联】古塔风云绕山青,峰顶彩霞共霭晖
【上联】滟滟随波千万里,何处春江无月明 【下联】幽幽如幻五十年,此间尘世有风流
【上联】江流宛转绕芳甸,月照花林皆似霰 【下联】山势氤氲浮翠霭,风梳杨柳欲飞烟
【上联】斜月沉沉藏海雾,碣石潇湘无限路 【下联】春江漫卷漾烟雨,长河日夜有缘人
【上联】不知江月待何人,但见长江送流水 【下联】只向渔家寻旧迹,试问何处答耕桑
【上联】白云一片去悠悠,青枫浦上不胜愁 【下联】碧水千层横漠漠,赤雁南来正飞歌
【上联】鸿雁长飞光不度,鱼龙潜跃水成文 【下联】渔家小住月相随,鸥鹭往来船载歌
【上联】此时相望不相闻,愿逐月华流照君 【下联】何处觅人犹有约?欲邀仙子去还家
【上联】江畔何人初见月?江月何年初照人 【下联】天下有地可藏龙,海风哪处可观心
【上联】江天一色无纤尘,皎皎空中孤月轮 【下联】杨柳千丝犹有梦,蒙蒙地上满船灯

【上联】转朱阁,低绮户,照无眠,不应有恨 【下联】举大樽,邀清风,斟古月,何必无言
【上联】举杯邀明月,对影成三人 【下联】临屏赏白梅,随心咏一声
【上联】胸无点墨 【下联】口有余香
【上联】十口心思,思君思国思社稷 【下联】一腔热泪,恨己恨他恨人民
【上联】破釜沉舟,三千越甲可吞吴 【下联】乘风破浪,六十回头能啸秦
【上联】卧薪尝胆,三千越甲可吞吴 【下联】伏釜沉舟,四海扬波能缚蛟
【上联】三杯两盏淡酒,怎敌他晚来风急 【下联】一醉几巡闲言,便知是梦了诗成
【上联】燕草如碧丝,秦桑低绿枝 【下联】夕阳似金锦,柳枝悬红线
【上联】庭院深深深几许 【下联】春山隐隐隐孤身
【上联】庭院深深深几许 【下联】梅花点点点春香

【上联】漠漠水田飞白鹭,阴阴夏木啭黄鹂 【下联】绵绵秋雨润禾苗,处处山林映玉蝉
【上联】漠漠水田飞白鹭,阴阴夏木啭黄鹂 【下联】依依山径落斜阳,袅袅炊烟绕翠竹

【上联】五岭横云一片青,衡山积雪终年白 【下联】九州铺锦八方绿,林海堆红满眼春
【上联】五岭横云一片青,衡山积雪终年白 【下联】双峰映日十分碧,函谷春深几度红

【上联】洞庭浩渺通长江,春来水涨连天碧 【下联】华阁崔巍迎盛会,时过车奔遍地红
【上联】慷慨歌燕市,从容作楚囚 【下联】清音醉杏坛,寂寞成唐诗
【上联】赤橙黄绿青蓝紫,谁持彩练当空舞 【下联】柴米油盐酱醋茶,客为力军作秀言

【上联】茫茫九派流**,沉沉一线穿南北 【下联】滔滔三江溯太行,滚浩万年贯古今
【上联】茫茫九派流**,沉沉一线穿南北 【下联】浩浩千春耀上京,滚滚三江舞东西
【上联】茫茫九派流**,沉沉一线穿南北 【下联】荡荡东洋过大江,滚滚千流过古今
【上联】茫茫九派流**,沉沉一线穿南北 【下联】悠悠三湾锁西洋,滚滚千商过东西

【上联】横空出世,莽昆仑,阅尽人间春色 【下联】下笔挥毫,成巨匠,绘出生态蓝图
【上联】飞起玉龙三百万,搅得周天寒彻 【下联】勾起渭河数千年,汇来秦土风情
【上联】飞起玉龙三百万,搅得周天寒彻 【下联】携来佛法五千言,引来妙谛玄机
【上联】我失骄杨君失柳,杨柳轻扬直上重霄九 【下联】春回大地雪迎雷,地冰乍起先苏万物锦
【上联】夜长天色总难明,寂寞披衣起坐数寒星 【下联】人静心灰都是幻,相思枕席轻敲几尺丝
【上联】夜长天色总难明,寂寞披衣起坐数寒星 【下联】人在画图空有趣,从容拄杖谈玄一醉开
【上联】夜长天色总难明,寂寞披衣起坐数寒星 【下联】金风玉露悄然冷,相思倚枕听啼无归人
【上联】六月天兵征腐恶,万丈长缨要把鲲鹏缚 【下联】一支玉笔写春秋,千年遗恨永铭将士碑

【上联】百万工农齐踊跃,席卷江西直捣湘和鄂 【下联】五千兵戈俱喧阗,水归淮北空余屈子魂
【上联】百万工农齐踊跃,席卷江西直捣湘和鄂 【下联】千余老幼尽欢颜,联惊海外长怀赵与韩
【上联】百万工农齐踊跃,席卷江西直捣湘和鄂 【下联】三千志士竞登攀,功成世界全凭武与文
【上联】百万工农齐踊跃,席卷江西直捣湘和鄂 【下联】三千豪杰竞风流,舟行淮北犹歌宋与唐
【上联】百万工农齐踊跃,席卷江西直捣湘和鄂 【下联】七旬岁月竞风流,梦牵河北长扬楚与吴
【上联】此去泉台招旧部,旌旗十万斩阎罗 【下联】我来江畔谒崇陵,遗烈八年壮国威 (Note: Chongling 崇陵 is the joint tomb of the Guangxu Emperor, Aisin-Gioro Zaitian, and Empress Dowager Longyu, Yehe Nara Jingfen)
【上联】此去泉台招旧部,旌旗十万斩阎罗 【下联】回首江山换新颜,春花一样笑杨梅
【上联】此去泉台招旧部,旌旗十万斩阎罗 【下联】重归海岛护同盟,风雨千秋祭祖师

【上联】此去泉台招旧部,旌旗十万斩阎罗 【下联】且邀明月饮新醪,玉液千杯酹泰山 (Note: 醪 is newly brewed wine; 酹 means pouring wine on the ground as a libation)
【上联】此去泉台招旧部,旌旗十万斩阎罗 【下联】今观国粹谱新篇,联对千家颂党恩
【上联】此去泉台招旧部,旌旗十万斩阎罗 【下联】重归梓里吊忠魂,天地三千开后人
【上联】天命吾身踏黄泉,定起万军夺阴曹 【下联】帝无前事佐明主,却图一战扫烟云
【上联】天命吾身踏黄泉,定起万军夺阴曹 【下联】人生向上多自力,始终一战统山河
【上联】天命吾身踏黄泉,定起万军夺阴曹 【下联】死犹未免埋白骨,亦如一抔葬忠魂
【上联】天命吾身踏黄泉,定起万军夺阴曹 【下联】人心臣首怀赤胆,未能一死竟平生
【上联】天命吾身踏黄泉,定起万军夺阴曹 【下联】人思此处访遗庙,未经千载是英雄
【上联】天命吾身踏黄泉,定起万军夺阴曹 【下联】不由后死赴江汉,岂独一败论英雄
【上联】天命吾身踏黄泉,定起万军夺阴曹 【下联】谁似臣心报国门?至今百战识英雄
【上联】天命吾身踏黄泉,定起万军夺阴曹 【下联】自悲吾辈失青鬓,竟成千古痛元君
【上联】天命吾身踏黄泉,定起万军夺阴曹 【下联】我来仙界横紫气,定披七彩绣河山
【上联】天命吾身踏黄泉,定起万军夺阴曹 【下联】不为世界哭元老,且持三字表心胸
【上联】天命吾身踏黄泉,定起万军夺阴曹 【下联】世风余韵遗青史,岂惟五代识君臣
【上联】天命吾身踏黄泉,定起万军夺阴曹 【下联】世间此地留忠骨,长教千载祀英雄

Error loading the model

I downloaded the released model and get the following error:
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file ./model.ckpt-100000: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
What is the problem? Thanks.
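This "not an sstable (bad magic number)" error typically appears when TF's Saver is given a shard file (e.g. model.ckpt-100000.data-00000-of-00001) or an incomplete/HTML-error-page download instead of the checkpoint prefix model.ckpt-100000. A small helper to derive the prefix, a sketch assuming standard TensorFlow checkpoint naming rather than a fix confirmed by the maintainers:

```python
def ckpt_prefix(path):
    """Strip TF checkpoint shard suffixes so Saver.restore() receives the
    checkpoint *prefix* (model.ckpt-100000), not a shard file.  Assumes
    standard TF naming: <prefix>.index, <prefix>.meta,
    <prefix>.data-XXXXX-of-XXXXX (or a bare <prefix>.data)."""
    for suffix in (".index", ".meta"):
        if path.endswith(suffix):
            return path[: -len(suffix)]
    i = path.find(".data-")
    if i == -1 and path.endswith(".data"):
        i = len(path) - len(".data")
    return path[:i] if i != -1 else path
```

If the prefix is already correct, a mismatched download (compare the file size against the published checkpoint size) is the other common cause.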

No module named 'train'

from train.modeling import GroverModel, GroverConfig, sample

ModuleNotFoundError: No module named 'train'

I'm using conda create -n ml2 python=3.7. Here is pip list:

Package Version

absl-py 0.9.0
astor 0.8.1
attrs 19.3.0
backcall 0.1.0
bleach 3.1.0
certifi 2019.11.28
chardet 3.0.4
colorama 0.4.3
decorator 4.4.1
defusedxml 0.6.0
entrypoints 0.3
gast 0.3.3
grpcio 1.27.2
h5py 2.10.0
idna 2.8
importlib-metadata 1.5.0
ipykernel 5.1.4
ipython 7.12.0
ipython-genutils 0.2.0
ipywidgets 7.5.1
jedi 0.16.0
Jinja2 2.11.1
joblib 0.14.1
jsonschema 3.2.0
jupyter 1.0.0
jupyter-client 5.3.4
jupyter-console 6.1.0
jupyter-core 4.6.1
Keras-Applications 1.0.8
Keras-Preprocessing 1.1.0
Markdown 3.2.1
MarkupSafe 1.1.1
mistune 0.8.4
mkl-fft 1.0.15
mkl-random 1.1.0
mkl-service 2.3.0
mock 4.0.1
nbconvert 5.6.1
nbformat 5.0.4
notebook 6.0.3
numpy 1.18.1
pandas 0.24.2
pandocfilters 1.4.2
parso 0.6.1
pickleshare 0.7.5
pip 20.0.2
prometheus-client 0.7.1
prompt-toolkit 3.0.3
protobuf 3.11.4
Pygments 2.5.2
pyreadline 2.1
pyrsistent 0.15.7
python-dateutil 2.8.1
pytz 2019.3
pywin32 227
pywinpty 0.5.7
pyzmq 18.1.1
qtconsole 4.6.0
regex 2019.4.14
requests 2.22.0
scikit-learn 0.22.1
scipy 1.4.1
Send2Trash 1.5.0
setuptools 45.2.0.post20200210
six 1.14.0
tensorboard 1.13.1
tensorflow 1.13.1
tensorflow-estimator 1.13.0
termcolor 1.1.0
terminado 0.8.3
testpath 0.4.4
tornado 6.0.3
tqdm 4.31.1
traitlets 4.3.3
urllib3 1.25.8
wcwidth 0.1.8
webencodings 0.5.1
Werkzeug 1.0.0
wheel 0.34.2
widgetsnbextension 3.5.1
wincertstore 0.2
zipp 2.2.0

Usage

Thanks for sharing.
@imcaspar Could you write up how to use it?

  1. Steps for continuing pre-training from this model
  2. Steps for running inference with this model

Thanks

How to convert to a PyTorch model

I can't convert it with the gpt2 converter in transformers. The command:
transformers gpt2 ./gpt-ml/mega/model.ckpt-10000 ./output ./mega.json
It fails with GPT2Model has no attribute "_step", then "shape".

If anyone has converted it successfully, could you share how? Thanks!

[Discussion] models/mega/model.ckpt-100000.data的问题

Could not open models/mega/model.ckpt-100000.data: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
Has anyone seen a similar problem? The SHA is correct after downloading.
...
...
mlp_ln1
mlp_ln0
mlp_ln1
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2019-12-04 11:03:11.980040: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open models/mega/model.ckpt-100000.data: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
2019-12-04 11:03:11.980242: W tensorflow/core/util/tensor_slice_reader.cc:95] Could not open models/mega/model.ckpt-100000.data: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
2019-12-04 11:03:11.980286: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at save_restore_tensor.cc:175 : Data loss: Unable to open table file models/mega/model.ckpt-100000.data: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file models/mega/model.ckpt-100000.data: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
[[{{node save/RestoreV2}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "scripts/interactive_conditional_samples.py", line 171, in <module>
saver.restore(sess, args.model_ckpt)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1276, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file models/mega/model.ckpt-100000.data: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
[[node save/RestoreV2 (defined at scripts/interactive_conditional_samples.py:170) ]]

Caused by op 'save/RestoreV2', defined at:
File "scripts/interactive_conditional_samples.py", line 170, in <module>
saver = tf.train.Saver()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 832, in __init__
self.build()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 844, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 881, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 513, in _build_internal
restore_sequentially, reshape)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 332, in _AddRestoreOps
restore_sequentially)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 580, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1572, in restore_v2
name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()

DataLossError (see above for traceback): Unable to open table file models/mega/model.ckpt-100000.data: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
[[node save/RestoreV2 (defined at scripts/interactive_conditional_samples.py:170) ]]

[Bug] bug of sequence length

    for article in buffered_and_sliding_window_article_iterator(tokenizer,
                           final_desired_size=max(args.max_seq_length + 1, 1025)):
        writer2use = train_writer
        assert len(article['input_ids']) == (args.max_seq_length + 1)

The above is lines 188-191 of prepare.py. Aren't the sequence lengths assumed on lines 189 and 191 contradictory? On line 189, whenever max_seq_length < 1024 the sequence length is always taken to be 1025.
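The mismatch can be demonstrated by comparing the two expressions directly (a sketch of the quoted logic, not a maintainer-confirmed patch):

```python
def window_vs_assert(max_seq_length):
    """Reproduce the two lengths used in prepare.py: the iterator's
    window size (final_desired_size) and the length the assert expects."""
    window = max(max_seq_length + 1, 1025)   # final_desired_size argument
    expected = max_seq_length + 1            # length asserted on input_ids
    return window, expected, window == expected
```

For max_seq_length >= 1024 the two agree; below that, the assert must fail, unless final_desired_size is changed to max_seq_length + 1 (or the 1025 floor is what's intended and the assert should be relaxed).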

[Discussion] Vocabulary file

Hi all,

Thanks for the cool contribution :) I'm trying to use your repo to pre-train GPT-2 from scratch on other languages (not English nor Chinese). Could you say a bit more on how you generated your vocabulary (which library you used? sentencepiece?)? Also, what kind of vocabulary is it (BPE, etc...)?

If I understood well, when I have my vocab file for my language, I can then create the tfrecords for my data and run the train.py file?

Thanks a lot in advance!

The release time of 1.5B model trained on 50G corpus

First of all, GREAT THANKS for the release of the big Chinese GPT-2 model! I have tested with some primes and the results look very good.

So I am really looking forward to the release of the model trained with even more data and more epochs. Could I know the expected release time, since the author said it should be due in early December?

Thanks very much!

Smaller GPT

Is there a smaller model that can be fine-tuned on a GPU?
