xlxwalex / fcgec

The Corpus & Code for EMNLP 2022 paper "FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction" | FCGEC Chinese grammatical error correction corpus and STG model

Home Page: https://aclanthology.org/2022.findings-emnlp.137

License: Apache License 2.0

Python 97.65% Shell 2.35%
corpus dataset emnlp2022 gec grammatical-error-correction emnlp

Contributors: pjwjavier, wujeevan, xlxwalex

fcgec's Issues

Bug at line 106 of indep_evaluate.py

You changed the reconstruct_tagger_V2 function in [data_utils.py] to return three values, but the call at line 106 of indep_evaluate.py still receives the result in two variables.

Incorrect Labeling in Dataset: "SC" replaced with "SD" in Two Instances of train dataset

Issue Details:
Incorrect Labeling: On careful inspection of the dataset FCGEC_train.json, I noticed that in two separate entries the intended label "SC" has been written as "SD". This inconsistency may confuse users who rely on the dataset for training models or conducting research.
I have checked these entries thoroughly, and I am confident that they should be labeled "SC", not "SD".

Affected Entries: The specific instances of this mislabeling can be found in the following data points:

  • Entry 1:
"1544600e0a7c45bdcba6ce5525855ee2": {
        "sentence": "青年作家残雪认为鲁迅的作品实现了一种“突破”,而《故事新编》中的《铸剑》则将这种创造达到了登峰造极。",
        "error_flag": 1,
        "error_type": "CM;SD",
        "operation": "[{\"Insert\":[{\"pos\":48,\"tag\":\"INS_3\",\"label\":\"的境界\"}]},{\"Delete\":[42,43,44],\"Modify\":[{\"pos\":37,\"tag\":\"MOD_1\",\"label\":\"使\"}]}]",
        "version": "FCGEC EMNLP 2022"
    },
  • Entry 2:
"aad7a950e2853b271eba684c84f81b55": {
        "sentence": "“非典”期间,在白衣天使们身上,都无不闪耀着舍身忘我、奋不顾身的光辉。",
        "error_flag": 1,
        "error_type": "CM;SD",
        "operation": "[{\"Delete\":[7,16],\"Insert\":[{\"pos\":12,\"tag\":\"INS_1\",\"label\":\"的\"}]},{\"Delete\":[7,17,18],\"Insert\":[{\"pos\":12,\"tag\":\"INS_1\",\"label\":\"的\"}]}]",
        "version": "FCGEC EMNLP 2022"
    }, 
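A check like the one described in this issue can be automated. The sketch below scans entries for error_type labels outside an allowed set. The set of seven labels is an assumption taken from the FCGEC paper rather than from this thread, and the second sample entry is hypothetical.

```python
# Assumed label set (from the FCGEC paper, not confirmed in this thread).
VALID_ERROR_TYPES = {"IWO", "IWC", "CM", "CR", "SC", "ILL", "AM"}


def find_unknown_error_types(corpus, valid_types=VALID_ERROR_TYPES):
    """Map each erroneous entry id to the labels in its error_type that are not valid."""
    report = {}
    for uid, entry in corpus.items():
        # Only entries flagged as erroneous are checked; how correct
        # sentences encode error_type is not shown in this thread.
        if entry.get("error_flag") != 1:
            continue
        # Multiple labels are joined with ';', as in "CM;SD".
        labels = [label for label in entry.get("error_type", "").split(";") if label]
        unknown = [label for label in labels if label not in valid_types]
        if unknown:
            report[uid] = unknown
    return report


# Tiny in-memory stand-in for FCGEC_train.json (sentences omitted).
sample = {
    "1544600e0a7c45bdcba6ce5525855ee2": {"error_flag": 1, "error_type": "CM;SD"},
    "hypothetical-clean-entry": {"error_flag": 1, "error_type": "CM;SC"},
}
print(find_unknown_error_types(sample))
```

Running the same function over the full file (loaded with json.load) would list every entry carrying a label outside the assumed set.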

The model's correction ability

When I use inference_singleline.py and type in some arbitrary words, the model almost always returns the original sentence.
Input the incorrect sentence (q for quit):我爱李

corrected sentence: 我爱李

Input the incorrect sentence (q for quit):我爱北京天是安门

corrected sentence: 我爱北京天是安门。

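Assigning the checkp variable

Hello! When I used run_stg_joint.sh to run the pretrained model you provide, I found that the checkp variable is not defined. I am not sure what checkp means or how it should be set. I would appreciate your guidance. Thank you!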

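Handling of Chinese quotation marks and other special symbols

Hello, are symbols that are missing from the BERT vocab, such as double and single quotation marks or uppercase English letters, handled in any way? The output obtained directly from joint_evaluate turns the Chinese punctuation of the original text into English punctuation and uppercase English letters into lowercase.
For example: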
去年5月,阿里巴巴宣布将旗下的“一达通”平台,向我国外贸出口企业发放“出口补贴”,进一步推进整个外贸生态系统的可持续发展。->去年5月,阿里巴巴宣布将用旗下的"一达通"平台,向我国外贸出口企业发放"出口补贴",进一步推进整个外贸生态系统的可持续发展。

Unexpected key(s) in state_dict: "XXX._bert.embeddings.position_ids"

Running inference with the checkpoints.pt file provided by the authors raises an error. The error message is as follows:
Traceback (most recent call last):
  File "/data/FCGEC/model/STG-correction/joint_evaluate.py", line 148, in <module>
    evaluate(args)
  File "/data/FCGEC/model/STG-correction/joint_evaluate.py", line 48, in evaluate
    model.load_state_dict(params)
  File "/data/miniconda3/envs/bert/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for JointModel:
    Unexpected key(s) in state_dict: "switch._encoder._bert.embeddings.position_ids", "tagger._encoder._bert.embeddings.position_ids", "generator._lmodel.bert.embeddings.position_ids".
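One possible workaround, offered as a sketch rather than the maintainers' fix, is to drop the position_ids entries from the checkpoint dictionary before loading it. The likely cause (an assumption, not confirmed in this thread) is that the checkpoint was saved with an older transformers release that stored position_ids as a persistent buffer. The dictionary below is a stand-in whose key names come from the error message; real values would be tensors.

```python
def drop_position_ids(state_dict):
    """Return a copy of state_dict without '*.embeddings.position_ids' entries."""
    return {
        key: value
        for key, value in state_dict.items()
        if not key.endswith("embeddings.position_ids")
    }


# Stand-in for the loaded checkpoint; real values would be tensors.
params = {
    "switch._encoder._bert.embeddings.position_ids": "buffer",
    "tagger._encoder._bert.embeddings.position_ids": "buffer",
    "generator._lmodel.bert.embeddings.position_ids": "buffer",
    "tagger._hidden2tag.linear.weight": "weight",
}
cleaned = drop_position_ids(params)
print(sorted(cleaned))

# With a real model this would be followed by:
#     model.load_state_dict(cleaned)
```

Passing strict=False to load_state_dict also skips unexpected keys, but it silences missing keys as well, so filtering explicitly keeps other mismatches visible.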

Single-sentence text inference directly using FCGEC-Joint

Hello, I have finished predicting the test set by following the steps you provided, but now I want to run prediction on a single line of text. How can I do that? The prediction steps you provided are rather complicated for this purpose. Can I simply call the forward method of FCGEC-Joint to correct a single line of text? If so, could you please give me an example?

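Data question

One sentence in train.json (shown in the attached screenshot) has error_flag=1, but its operation field is empty. Does that mean the sentence has no error?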
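Entries of this kind can be listed with a few lines of Python. The sketch below assumes that operation is stored as a JSON-encoded string, as in the entries quoted elsewhere in these issues; the ids and values in the sample are hypothetical.

```python
import json


def flagged_without_operation(corpus):
    """Return ids of entries with error_flag == 1 whose operation list is empty."""
    suspicious = []
    for uid, entry in corpus.items():
        if entry.get("error_flag") != 1:
            continue
        # Treat a missing or empty field like an empty JSON list.
        operations = json.loads(entry.get("operation") or "[]")
        # An empty list, or a list of empty dicts, carries no edit at all.
        if not any(operations):
            suspicious.append(uid)
    return suspicious


# Hypothetical entries, reduced to the two fields that matter here.
sample = {
    "hypothetical-id-1": {"error_flag": 1, "operation": "[]"},
    "hypothetical-id-2": {"error_flag": 1, "operation": '[{"Delete": [7, 16]}]'},
    "hypothetical-id-3": {"error_flag": 0, "operation": "[]"},
}
print(flagged_without_operation(sample))
```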

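Training with a custom dataset

I built some datasets from my own data in the FCGEC format, but loading them raises errors, so some sentences cannot be loaded. Looking at the code, I found that tokens and tagger have different lengths inside convert_tagger2generator. Tracing further up, I found that errors occur for some sentences when TaggerConverter is constructed. Printing the error shows only a "pos" error, and I do not know what a pos error refers to. I therefore truncated tagger, but this makes the tagger loss explode. How should I solve this?
try:
    tagger = TaggerConverter(self.args, auto=True, **kwargs)
except Exception as e:
    print("An error occurred: {}".format(e))
    print("Sentence that caused the error: {}".format(sentences[idx]))
tagger = tagger[:len(tokens)]  # truncate tagger
I would also like to know how to check the model's performance. The joint model seems to log only "The best performances of three modules in STG".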

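Data conversion

Hello, the paper mentions assigning labels via minimum edit distance. Is this available now? That is, converting parallel sentences into the annotated sentences needed for model training.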

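How to obtain the output format required by CodaLab?

The output of inference_singleline.py contains only the corrected sentence, without error_flag and error_type. How can I obtain these two outputs?
Also, could you provide the model's final metrics on the validation set?
Thank you very much!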

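Training set data appears in the validation and test sets

Hello, I ran some statistics. In the 2,000-sentence validation set, 37 sentences have an original (uncorrected) erroneous sentence that also appears in the training set, and 170 sentences have a corrected answer that also appears in the training set. In the 3,000-sentence test set, 48 sentences have an original erroneous sentence that appears in the training set (because the test answers are not public, it is unknown how many corrected answers appear in the training set). This may make test results for few-shot models inaccurate.

Could you provide a filter set containing every sentence that should be removed from the training set because it appears in the validation or test set (including sentences derived from the same source), so that a cleaner training set can be obtained? Thank you very much!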
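The exact-match part of such a check is simple to reproduce. The following is a minimal sketch with hypothetical function names and placeholder sentences. It compares raw sentences only: detecting same-source sentences would need fuzzy matching, and comparing corrected answers would first require reconstructing the targets.

```python
def find_leaked(train_sentences, eval_sentences):
    """Return the eval sentences that also occur verbatim in the training set."""
    train_set = {s.strip() for s in train_sentences}
    return [s for s in eval_sentences if s.strip() in train_set]


def filter_train(train_sentences, eval_sentences):
    """Return the training sentences that do not occur verbatim in the eval set."""
    eval_set = {s.strip() for s in eval_sentences}
    return [s for s in train_sentences if s.strip() not in eval_set]


# Placeholder sentences standing in for the real splits.
train = ["句子甲。", "句子乙。", "句子丙。"]
valid = ["句子乙。", "句子丁。"]

print(len(find_leaked(train, valid)))
print(filter_train(train, valid))
```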

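Current SOTA results on this dataset in published academic work

Hello, I think your dataset is excellent and I would like to do some related work on it. Is the method in your paper still the academic SOTA on this dataset? I searched the citing papers and have not found a better method, so I wanted to confirm with you. Thanks.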

run Reporter app with error : r._hidden2tag.linear.weight: param with shape torch.Size from checkpoint does not match the shape in current model is torch.Size

Hello,

I trained the model with run_stg_joint.sh. When I then ran demo_pipeline.py, I received an error. The full error is pasted here:

[jupyter@jupyter-d134f2d8-ead8-4a47-92a0-8dcb5293de93-54b4cfd585-jh5dl STG-correction]$ python demo_pipeline.py
jieba are not installed, use default mode.
Some weights of the model checkpoint at ../pretrained-models/hflchinese-roberta-wwm-ext/ were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']

  • This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Traceback (most recent call last):
  File "/jupyter/workspace/bert/FCGEC-main/model/STG-correction/demo_pipeline.py", line 27, in <module>
    pipecls = Pipeline(args_binary, args_demo)
  File "/jupyter/workspace/bert/FCGEC-main/model/STG-correction/app/Pipeline.py", line 18, in __init__
    self.model_bucket = ModelBucketV1(args_demo, self.device, binary=True, switch=True, taggen=True, checkpoints_name='checkpoint.pt')
  File "/jupyter/workspace/bert/FCGEC-main/model/STG-correction/app/ModelBucket.py", line 35, in __init__
    joit_model.load_state_dict(joit_model_params)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for JointModel:
    size mismatch for tagger._hidden2tag.linear.weight: copying a param with shape torch.Size([7, 768]) from checkpoint, the shape in current model is torch.Size([6, 768]).
    size mismatch for tagger._hidden2tag.linear.bias: copying a param with shape torch.Size([7]) from checkpoint, the shape in current model is torch.Size([6]).

Please give me some hints on how to solve it. Thanks.

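Bug in convert_fcgec_to_seq2seq.py

The script that converts the data to seq2seq format has a bug: an insert operation at the start of a sentence ("Insert":"pos":-1) is applied at the end of the sentence instead.
For example: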
source:天空和土地日益被拥挤的高楼遮蔽的时代,他们怀着忧虑之心仰望天空,守卫土地。
target: 天空和土地日益被拥挤的高楼遮蔽的时代,他们怀着忧虑之心仰望天空,守卫土地。在
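The intended behaviour can be sketched as follows. This is not the repository's conversion code; apply_insert is a hypothetical helper, and it assumes that "pos" is the index of the character after which the label is inserted, so that -1 means "before the first character". Python's negative indexing, where -1 addresses the last element, is one plausible way such a bug can arise, though the script itself is not quoted here.

```python
def apply_insert(sentence, pos, label):
    """Insert label right after the character at index pos (-1 means sentence start)."""
    if not -1 <= pos < len(sentence):
        raise ValueError("pos out of range: {}".format(pos))
    cut = pos + 1  # index of the first character that follows the inserted text
    return sentence[:cut] + label + sentence[cut:]


# pos == -1 puts the label at the very beginning, not at the end.
print(apply_insert("天空和土地", -1, "在"))
# A non-negative pos inserts after that character.
print(apply_insert("登峰造极。", 3, "的境界"))
```

Computing the cut point as pos + 1 keeps the slice boundaries non-negative, so the start-of-sentence case needs no special branch.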

Data loading problem with run_stg_joint.sh

As the title says, loading the training data shipped with the project produces the following problem:

以“都市桃花源”为主 题,用两个巨大而奇特的魔比斯环为载体,世博湖南馆将吸引众多参观者的眼球而流连忘返。
Processing train Dataset[Tagger/Gen Part]: 13%|████████▍ | 2651/19758 [00:01<00:10, 1683.82it/s]
Traceback (most recent call last):
  File "joint_stg.py", line 76, in <module>
    train(args, checkp)
  File "joint_stg.py", line 32, in train
    Trainset = JointDataset(args, train_dir, 'train')
  File "/home/jydong/FCGEC-main/model/STG-correction/DataProcessor/JointDataset.py", line 32, in __init__
    self.gen_token, self.genwd_idx, self.tgt_mlm = self._process_tagger(self.sentences, self.operates)
  File "/home/jydong/FCGEC-main/model/STG-correction/DataProcessor/JointDataset.py", line 196, in _process_tagger
    gen_token, gen_label = tagger2generator(tokens, label_comb['tagger'], label_comb['mask_label'])
  File "/home/jydong/FCGEC-main/model/STG-correction/utils/mask.py", line 44, in convert_tagger2generator
    post_sequence.append(tokens[index])
IndexError: list index out of range

How can I solve this?

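Converting the data into parallel sentence pairs

How can the provided dataset be converted into parallel sentence pairs of the form (erroneous sentence, corrected sentence)?
I tried to parse it and found it rather troublesome because of some inconsistencies. For example, the label of an Insert may be a list, or it may be a str (the two screenshots attached to the issue show one case of each).

I hope you can provide a conversion script, or the already converted data. Thank you very much.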
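Until an official script handles this, the str-or-list inconsistency can be absorbed at parse time. The sketch below uses a hypothetical helper, normalize_label, that always returns a list of strings so that downstream code sees one shape. What a list-valued label means (for example, alternative insertions) is not stated in this thread, so the helper makes no assumption about it.

```python
def normalize_label(label):
    """Return the label of an Insert or Modify operation as a list of strings."""
    if isinstance(label, str):
        return [label]
    if isinstance(label, (list, tuple)):
        return [str(item) for item in label]
    raise TypeError("unsupported label type: {}".format(type(label).__name__))


print(normalize_label("的境界"))
print(normalize_label(["的", "之"]))  # hypothetical list-valued label
```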

Reproducing the experiment results with the provided checkpoint

Hello!
I downloaded the trained checkpoint linked in the README and ran inference on the test set to reproduce the results.
The results given in the README are (EM / F0.5 : 34.10 / 45.48), but my results (using run_stg_joint.sh) are (EM / F0.5 : 50.5 / 37.9). This difference is too large to ignore.
I did adjust some code for inference:

  1. In line 16 of FCGEC/model/STG-correction/Model/tagger_model.py, I had to change self.max_token = args.max_generate + 1 to self.max_token = args.max_generate. Otherwise, the parameter shape of self._hidden2t in the checkpoint does not match the constructed model.
  2. In line 46 of FCGEC-main/model/STG-correction/preprocess_data.py, some additional code is needed because the "uid" of every sentence is essential in the test process. I therefore added a column for this key to test.csv and copied it into stg_joint_test.xlsx. I used this Excel file to form the final submission. My results are in row GMago on the CodaLab results page.

Many annotations in the dataset are wrong. What is going on?

"2c9026bd7a4e6deafeacbe37f4678b78": {
"sentence": "一些网民认为,涉黑案件中的被告人否认涉黑,是因为他们抱有侥幸心理,是助长其嚣张气焰的“保护伞”尚未打掉,法院应进一步加大查处力度。",
"error_flag": 1,
"error_type": "CM",
"operation": "[{"Modify":[{"pos":33,"tag":"MOD_18+INS_3","label":"是助长其嚣张气焰的“保护伞”尚未打掉的表现"}]}]",
"version": "FCGEC EMNLP 2022"
}
Shouldn't this simply be an insertion of “的表现”?
"0326b9713d05e155ec25eb17d50e67a8": {
"sentence": "由于法律意识淡薄,这些售假的摊主设置重重障碍,围攻、阻止工商管理人员正常执行公务。",
"error_flag": 1,
"error_type": "IWC",
"operation": "[{"Delete":[23,24,25]},{"Modify":[{"pos":23,"tag":"MOD_17","label":"围攻工商管理人员并组织他们执行公务"}]}]",
"version": "FCGEC EMNLP 2022"
}
Shouldn't the label be “围攻工商管理人员并阻止他们执行公务”?
"4c07f073554789af9afe6df8455103bb": {
"sentence": "温家宝在讲话中说,要建立孤儿国家保障制度,使这个最弱小、最困难的群体能够病有所医、住有所居、生有所养、学有所教。",
"error_flag": 1,
"error_type": "IWO",
"operation": "[{"Modify":[{"pos":36,"tag":"MOD_19","label":"生有所养,病有所医,住有所居,学有所教"}]}]",
"version": "FCGEC EMNLP 2022"
}
Shouldn't this be a reordering?
And so on.

IndexError: index 2992 is out of bounds for axis 0 with size 2992

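Hello, I replaced the dataset with my own (also 3,000 entries) and ran joint_evaluate.py with the previously trained model parameters. It fails with IndexError: index 2992 is out of bounds for axis 0 with size 2992.
With your test.csv, however, it runs fine. How can I solve this?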

Missing collate_fn_demo error when using the checkpoint

The error is as follows:
Traceback (most recent call last):
  File "E:\新建文件夹\FCGEC-main\model\STG-correction\preprocess_data.py", line 4, in <module>
    from utils.argument import ArgumentGroup
  File "E:\新建文件夹\FCGEC-main\model\STG-correction\utils\__init__.py", line 12, in <module>
    from utils.collate import collate_fn_base, collate_fn_tagger, collate_fn_joint, collate_fn_tagger_V2, collate_fn_bertbase_tti, collate_fn_tagger_V2TTI, collate_fn_jointV2, collate_fn_demo
ImportError: cannot import name 'collate_fn_demo' from 'utils.collate' (E:\新建文件夹\FCGEC-main\model\STG-correction\utils\collate.py)
Do I need to write this function myself?

Request for the model checkpoint file

Hello! The link to the model checkpoint trained on the FCGEC corpus seems to be dead. Could you provide the file again? Thank you!

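Need your help

Thank you very much for your work; I am very interested in it! After reading your paper I looked at the code. The paper says that training can be done jointly or independently, and the difference seems to be whether the modules are combined and share an encoder. In the code, however, I found that during joint training each module's data is fed to that module's own model, and each module instantiates its own model. The modules do not seem to be connected to one another; the three of them are simply placed under one larger model. So what exactly is shared in joint training? I am confused by this and hope you can explain when convenient. Thank you.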

approaches and evaluation

Hello,
I have some questions regarding GEC and your approach.
First, I think you used Seq2Edit and Seq2Seq models for the grammar correction part and adapted some models, such as GECToR, for Chinese. Then you proposed your own model based on Switch-Tagger-Generator. Is STG in some way a combination of Seq2Edit and Seq2Seq?
Also, are Seq2Edit models not useful for grammatical error classification and detection? And does this repository contain the classification part?
I also have some questions about evaluation.
You used ChERRANT. Does its performance differ from m2score?
Also, ChERRANT expects the M2 format, but I don't see why the M2 format is actually needed :). I mean, an example like:
S The cat sat at mat .
A 3 4|||Prep|||on|||REQUIRED|||-NONE-|||0
A 4 4|||ArtOrDet|||the||a|||REQUIRED|||-NONE-|||0
can be written as :
S The cat sat at mat .
A 3 4|||on|||
A 4 4|||the||a|||
because, if I'm not wrong, the only parts that matter for evaluation are the error indices and their corrections.
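To make the field layout concrete, here is a small parser for one M2 annotation line. It is a sketch of the standard six-field layout (span, error type, correction, two legacy fields, annotator id), not code from ChERRANT or the M2 scorer, and the names M2Edit and parse_m2_edit are hypothetical. The span and correction are what edit matching relies on, the error type allows per-type scores, and the annotator id groups alternative annotation sets; the other two fields are usually constant.

```python
from typing import List, NamedTuple


class M2Edit(NamedTuple):
    start: int              # token index where the edit begins
    end: int                # token index where the edit ends (exclusive)
    error_type: str         # e.g. "Prep"
    corrections: List[str]  # alternatives are separated by "||"
    annotator: int          # id of the annotation set


def parse_m2_edit(line):
    """Parse an 'A start end|||type|||correction|||...|||annotator' line."""
    if not line.startswith("A "):
        raise ValueError("not an annotation line: {!r}".format(line))
    fields = line[2:].split("|||")
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got {}".format(len(fields)))
    start, end = (int(x) for x in fields[0].split())
    return M2Edit(start, end, fields[1], fields[2].split("||"), int(fields[5]))


print(parse_m2_edit("A 3 4|||Prep|||on|||REQUIRED|||-NONE-|||0"))
print(parse_m2_edit("A 4 4|||ArtOrDet|||the||a|||REQUIRED|||-NONE-|||0").corrections)
```

The shortened lines proposed in the question would be rejected by this parser, since it checks for all six fields, which is how tools built around the standard layout tend to behave.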

Data

Hi, I'm new to GEC and have some questions; please help me if possible.
I saw that for GEC tasks the dataset usually comes as source/target files, and for evaluation as an M2 file. But I think you used a JSON format for both training and evaluation. What is the difference between them?
Also, for computing the metrics I saw that you used ChERRANT. Is it the same as ERRANT? I mean, does it take data in M2 format and calculate the metrics?

Structure of Dataset

Hi,
I'm a beginner in GEC and have a question about the structure of the dataset, because I want to create one myself for my work. Your data is in JSON, and elsewhere I sometimes see the M2 format or a parallel file format. How do they differ from each other, and when should each one be used? I would be thankful for your help.

Missing test.csv error when using the checkpoint

My command is as follows:
python joint_evaluate.py --mode test --gpu_id 0 --seed 2023 --checkpoints checkpoints --checkp joint_mode --export stg_joint_test.xlsx --data_base_dir dataset --max_generate 5 --lm_path model/pretrained-models/roberta-base-chinese --batch_size 32

The error is as follows:
Traceback (most recent call last):
  File "joint_evaluate.py", line 134, in <module>
    evaluate(args)
  File "joint_evaluate.py", line 44, in evaluate
    switch_test = SwitchDataset(args, test_dir, 'test')
  File "/workspace/ceph-rbd/mwz/FCGEC-main/model/STG-correction/DataProcessor/SwitchDataset.py", line 22, in __init__
    self.sentences, self.label = self._read_csv(path)
  File "/workspace/ceph-rbd/mwz/FCGEC-main/model/STG-correction/DataProcessor/SwitchDataset.py", line 29, in _read_csv
    data = np.array(pd.read_csv(path, encoding='ISO-8859-1'))
  File "/opt/conda/lib/python3.8/site-packages/pandas/io/parsers.py", line 688, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/conda/lib/python3.8/site-packages/pandas/io/parsers.py", line 454, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/opt/conda/lib/python3.8/site-packages/pandas/io/parsers.py", line 948, in __init__
    self._make_engine(self.engine)
  File "/opt/conda/lib/python3.8/site-packages/pandas/io/parsers.py", line 1180, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/opt/conda/lib/python3.8/site-packages/pandas/io/parsers.py", line 1993, in __init__
    src = open(src, "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/test.csv'

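Do I need to build this file myself? If so, what should its structure be? It looks like it has a sentences column and a label column. Is there a more detailed description, or could you provide the test.csv file? Thank you for your trouble.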

Dataset question

While randomly checking error_type values in the data, I found the following entry:
"9649376ce406de096f5c49a23177cf46": {
"sentence": "由于生产厂家众多,质量.服务不能与国际市场接轨的现象,使得**的小家电市场没有形成大名牌优势。",
"error_flag": 1,
"error_type": "CM",
"operation": "[{"Delete":[0,1,23,24,25]},{"Delete":[27,28]}]",
"version": "FCGEC EMNLP 2022"
}
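CM means a missing component, but that does not match the Delete operations used for the correction. Is there an explanation for this? Can it be understood as follows: although the error is semantically a missing component, the correction does not have to follow the error type strictly? Or is this noise in the dataset?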
