ckiplab / ckip-transformers
CKIP Transformers
Home Page: https://ckip-transformers.readthedocs.io
License: GNU General Public License v3.0
Hi @emfomy, thank you for your attention 🙏
ckip_transformers version: 0.2.7
Setting device=-1 still makes the model use the GPU.
Script:
from ckip_transformers.nlp import CkipNerChunker
ner_driver = CkipNerChunker(level=3, device=-1)
res = ner_driver(text_list)
It should not consume GPU resources.
To reproduce, run the script above in a GPU-enabled environment.
Ubuntu 20.04.2 LTS
I've checked the source code: self.device is set to "cpu", and both the model and the data tensors call to(self.device), so this behavior is strange.
If the environment has no GPU, the script still runs.
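One possible workaround, offered as a sketch and not as a confirmed fix for this issue, is to hide the GPUs from the process by setting the CUDA_VISIBLE_DEVICES environment variable before any CUDA-using library is imported. The helper name hide_gpus is made up for illustration.

```python
import os


def hide_gpus() -> None:
    """Hide all CUDA devices from libraries that initialise CUDA later on."""
    # An empty value means "no visible devices". It only takes effect if it
    # is set before CUDA is initialised in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


hide_gpus()

# Import torch / ckip_transformers only AFTER the call above, for example:
#   from ckip_transformers.nlp import CkipNerChunker
#   ner_driver = CkipNerChunker(level=3, device=-1)
```

The same effect can be had from the shell by exporting the variable before starting Python. This does not explain why device=-1 still touched the GPU; it only prevents the process from seeing one.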
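Hello, I would like to fine-tune the ckiplab/bert-base-chinese-ner model, but I see that it has 72 labels. Is there a way to pick only the 29 labels I will use out of the 72 and then fine-tune?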
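A minimal sketch of one way to prepare a smaller label set, assuming the usual Hugging Face approach of replacing the classification head. The label names below are hypothetical placeholders, not the model's actual 72 labels.

```python
def build_label_maps(labels):
    """Build the id2label / label2id dictionaries for a custom label set."""
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


# Hypothetical subset; replace with the 29 labels you actually need.
my_labels = ["O", "B-PERSON", "I-PERSON", "B-ORG", "I-ORG"]
id2label, label2id = build_label_maps(my_labels)
print(label2id["B-ORG"])  # 3

# With transformers installed, the maps could be passed like this. The new
# head is randomly initialised, so fine-tuning is required afterwards:
#   AutoModelForTokenClassification.from_pretrained(
#       "ckiplab/bert-base-chinese-ner",
#       num_labels=len(my_labels),
#       id2label=id2label,
#       label2id=label2id,
#       ignore_mismatched_sizes=True,  # drop the original 72-way head
#   )
```

Note that this discards the pretrained 72-way output layer and keeps only the encoder weights; it does not carry over what the original head learned about the selected labels.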
In the POS task, what does the tag Neu mean? Does it refer to a string of digits?
In README.rst, item "4. Show results" contains this line:
print(pack_ws_pos_sentece(sentence_ws, sentence_pos))
It raises an error because the function pack_ws_pos_sentece() is not defined in that block of code.
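For readers who hit the same error, here is a sketch of what such a helper could look like. It keeps the name exactly as it is spelled in the quoted line so that line runs unchanged; the output format (word followed by its tag in parentheses, joined by a full-width space) is an assumption.

```python
def pack_ws_pos_sentece(sentence_ws, sentence_pos):
    """Pair each segmented word with its POS tag, e.g. '你(Nh)'."""
    assert len(sentence_ws) == len(sentence_pos)
    pairs = [f"{word}({pos})" for word, pos in zip(sentence_ws, sentence_pos)]
    # U+3000 is the full-width (ideographic) space.
    return "\u3000".join(pairs)


print(pack_ws_pos_sentece(["你", "看看"], ["Nh", "VE"]))
```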
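I would like to ask:
was your BERT-base-chinese pretrained exactly the way the original BERT was,
with only the dataset switched to Traditional Chinese and the tokenizer changed?
Thanks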
When will an NER model that supports GPT-2 and BLOOM be released?
HuggingFace's team released a new major version of transformers (v4).
We should add support to this version.
I tried to use CKIP Transformers for a Chinese NER task with PyTorch, but when I loaded the level 3 model, the following error occurred:
Traceback (most recent call last):
File "ner.py", line 3, in
ner_driver = CkipNerChunker(level=3, device=0)
File "/home/nieyang/anaconda3/envs/huggingface/lib/python3.6/site-packages/ckip_transformers/nlp/driver.py", line 224, in __init__
super().__init__(model_name=model_name, **kwargs)
File "/home/nieyang/anaconda3/envs/huggingface/lib/python3.6/site-packages/ckip_transformers/nlp/util.py", line 64, in __init__
self.model = AutoModelForTokenClassification.from_pretrained(model_name)
File "/home/nieyang/anaconda3/envs/huggingface/lib/python3.6/site-packages/transformers/models/auto/auto_factory.py", line 360, in from_pretrained
pretrained_model_name_or_path, *model_args, config=config, **kwargs
File "/home/nieyang/anaconda3/envs/huggingface/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1066, in from_pretrained
f"Unable to load weights from pytorch checkpoint file for '{pretrained_model_name_or_path}' "
OSError: Unable to load weights from pytorch checkpoint file for 'ckiplab/bert-base-chinese-ner' at '/home/nieyang/.cache/huggingface/transformers/46785b95696d8e6a5004a6a73fcee887d60745a5872af82ca7599b9470554ce3.bdaa5056a5c748eca59fe2c7eef8fa2d034f5092fc84ce6b008c27ddf6f0025c'If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
So I added the flag from_tf=True to self.model = AutoModelForTokenClassification.from_pretrained(model_name) in ckip_transformers/nlp/util.py, but then it reported that the model name is wrong.
Can you help me with this?
HuggingFace's tokenizer can also return the original indices.
We may rewrite the tokenization step using this feature instead of tokenizing character by character.
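A small sketch of how the returned indices could be used. The offsets here are written by hand so the snippet runs without downloading a tokenizer; the function name surface_tokens is hypothetical.

```python
def surface_tokens(text, offsets):
    """Recover each token's original substring from (start, end) offsets.

    Zero-width pairs such as (0, 0) are skipped; fast tokenizers report
    those for special tokens like [CLS] and [SEP].
    """
    return [text[start:end] for start, end in offsets if end > start]


# Hand-written offsets for illustration. With transformers installed they
# could come from a fast tokenizer:
#   enc = tokenizer(text, return_offsets_mapping=True)
#   offsets = enc["offset_mapping"]
text = "中文 NLP"
offsets = [(0, 0), (0, 1), (1, 2), (3, 6), (0, 0)]
print(surface_tokens(text, offsets))  # ['中', '文', 'NLP']
```

Because the offsets point back into the untouched input string, the original characters can be recovered even when the tokenizer normalizes or replaces them.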
If I use your bert-base-chinese, which reference should I cite?
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings.langchain import LangchainEmbedding
lc_embed_model = HuggingFaceEmbeddings(model_name="GanymedeNil/text2vec-large-chinese")
embed_model = LangchainEmbedding(lc_embed_model)
The embedding model above works for Chinese.
Is there a CKIP embedding model for Traditional Chinese?
Hello!
Is there any chance that a dependency parsing tool will be developed in the future?
Thanks for your answer
Thanks so much for this excellent model and for making it accessible on Hugging Face.
I would like to know why ckiplab/bert-base-chinese behaves differently from the usual bert-base-chinese, which I think was mainly trained on Simplified Chinese. For instance, when I masked the character 風 in the phrase 颱風預測。, the usual bert-base-chinese gave back 風 with a high probability of 0.992. In contrast, ckiplab/bert-base-chinese did not return the masked character 風 in its top 5; instead it gave 的 the highest probability, though only around 0.3.
Are we supposed to fine-tune this MLM first? Or perhaps I interpreted it wrongly (I'm very new to this field). Would you mind sharing your thoughts? Thanks very much in advance.
Thanks for the great library! Not sure if this is the correct place to ask, but I think I was using your tokenizer through Hugging Face transformers. I found that some Traditional Chinese characters are mapped to [UNK]; see the screenshot below.
The code I used was:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
input_ids = tokenizer.encode("重刋道藏輯要高上玉皇本行集經天樞上相(臣)張良校正三淸勅門下湛寂常道信擬議之至難恢漠神通豈形容之可盡", return_tensors='pt')
print('encoded ids: ', input_ids)
print('map encoded ids back to words: ', tokenizer.decode(input_ids[0]))
Thanks in advance!
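To list which characters are affected, a sketch like the following may help. It assumes the tokenizer splits Chinese text into single characters and maps any character missing from its vocabulary to [UNK]; the toy vocabulary is made up so the snippet runs on its own.

```python
def find_unknown_chars(text, vocab):
    """Return the distinct non-space characters of `text` missing from `vocab`."""
    missing = []
    for ch in text:
        if not ch.isspace() and ch not in vocab and ch not in missing:
            missing.append(ch)
    return missing


# Toy vocabulary for illustration. With transformers installed, the real
# one could be obtained with: vocab = tokenizer.get_vocab()
toy_vocab = {"重", "道", "藏"}
print(find_unknown_chars("重刋道藏", toy_vocab))  # ['刋']
```

This check ignores any normalization the tokenizer applies before lookup, so treat the result as a first approximation.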
When running this:
from transformers import AutoModel, BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModel.from_pretrained('ckiplab/albert-tiny-chinese-pos')
I get this error:
f"Unable to load weights from pytorch checkpoint file for '{pretrained_model_name_or_path}' "
OSError: Unable to load weights from pytorch checkpoint file for 'ckiplab/albert-tiny-chinese-pos' at
I have
transformers==4.2.2
ckip-transformers==0.2.1
torch==1.4.0
Originally posted by @WachaIPSOS in #3 (comment)
Hello,
Could you please share an example of how to use your model to split Chinese text into separate words?
At the moment, this code:
from transformers import (
BertTokenizerFast,
AutoModelForMaskedLM,
AutoModelForCausalLM,
AutoModelForTokenClassification,
)
# causal language model (GPT2)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
model = AutoModelForCausalLM.from_pretrained('ckiplab/gpt2-base-chinese') # or other models above
encoded_input = tokenizer.encode(sample_input, return_tensors="pt")
# batch = []
# batch.append(encoded_input)
predictions = model.generate(encoded_input)
tokenizer.batch_decode(predictions)
gives ['[CLS] 之 后 你 看 看 了 我 的 出 版 请 告 诉 我 你 认 为 什 么 [SEP] 我']
for the input 之后你看看了我的出版请告诉我你认为什么.
At the same time, your example in example.py in the repo gives the correct output for my input:
之后你看看了我的出版请告诉我你认为什么
之后(Nd) 你(Nh) 看看(VE) 了(Di) 我(Nh) 的(DE) 出版(Nv) 请(VF) 告诉(VE) 我(Nh) 你(Nh) 认为(VE) 什么(Nep)
(In case you are interested in the context of this issue, here is a Google Doc with my R&D notes on this task.)
Hi, I am new in this field. Is it possible to provide a demo code for bert-base-chinese-qa?
I tried the following code, following the book "Getting Started with Google BERT":
import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering
tokenizer = BertTokenizerFast.from_pretrained("ckiplab/bert-base-chinese")
model = BertForQuestionAnswering.from_pretrained("ckiplab/bert-base-chinese-qa")
paragraph = "李同 也 沒有 在意 , 大廈 中 , 几乎 每 天 都 有 人 搬進 搬出 , 原 不足為奇 。 \
可是 , 當 李同 走進 大廈 時 , 卻 看見 了 那 個 老者 , 那 老者 是 倒退 著 身子 走出來 的 , \
在 那 老者 的 面前 , 兩 個 搬運 工人 , 正 抬 著 一 只 箱子 。 那 是 一 只 木 箱子 , \
很 殘舊 了 , 箱子 并 不 大 , 但是 兩 個 搬運 工人 抬 著 , 看來 十分 吃力 。[SEP]".strip(" ")
question = "[CLS]老者怎麼走出來的?[SEP]"
question_tokens = tokenizer.tokenize(question)
paragraph_tokens = tokenizer.tokenize(paragraph)
tokens = question_tokens + paragraph_tokens
input_ids = tokenizer.convert_tokens_to_ids(tokens)
segment_ids = [0] * len(question_tokens)
segment_ids += [1] * len(paragraph_tokens)
input_ids = torch.tensor([input_ids])
segment_ids = torch.tensor([segment_ids])
# Getting the answer
res = model(input_ids, token_type_ids=segment_ids)
start_scores, end_scores = res['start_logits'], res['end_logits']
start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores)
print(" ".join(tokens[start_index:end_index+1]))
But I got [CLS]. Could you provide sample code showing how this Chinese QA model can work properly?
Thank you!
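One thing worth checking in code like the above: taking the argmax of the start and end scores independently can produce an end index that comes before the start index. The sketch below picks the best valid span instead. It is a generic post-processing step, not something specific to this model, and it may not by itself explain the [CLS] output.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) pair with the highest summed score, end >= start."""
    best, best_score = (0, 0), float("-inf")
    for start, s_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


# Toy scores: independent argmax would give start=2, end=1 (invalid).
start_scores = [0.1, 0.2, 3.0, 0.5]
end_scores = [0.3, 2.5, 0.4, 2.0]
print(best_span(start_scores, end_scores))  # (2, 3)
```

With torch tensors, the inputs could be passed as start_scores[0].tolist() and end_scores[0].tolist(). It may also be worth passing the raw question and paragraph as a pair, tokenizer(question, paragraph, return_tensors="pt"), so the tokenizer inserts [CLS] and [SEP] itself instead of having them typed into the strings.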
We may implement our own tokenizer rather than using BertTokenizerFast.
Our own tokenizer should have the following features:
tokenizer.convert_tokens_to_ids(list(input_text))
clean_up_tokenization method. The default method is implemented for English only. Our method may remove whitespaces and convert half-width punctuations to full-width ones.
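A rough sketch of the clean-up step described above, assuming "half-width punctuation" means the ASCII punctuation characters. The function name clean_up_chinese is hypothetical and this is not the library's implementation.

```python
import string

# Full-width forms U+FF01..U+FF5E mirror ASCII 0x21..0x7E at a fixed offset.
_FULLWIDTH_OFFSET = 0xFEE0
_PUNCT_TABLE = {ord(ch): ord(ch) + _FULLWIDTH_OFFSET for ch in string.punctuation}


def clean_up_chinese(text: str) -> str:
    """Remove all whitespace and convert ASCII punctuation to full-width."""
    without_spaces = "".join(text.split())
    return without_spaces.translate(_PUNCT_TABLE)


print(clean_up_chinese("你 好 , 世界 !"))  # 你好,世界!
```

Removing every space also glues together Latin words and numbers, so a real implementation would probably need to treat mixed-script text more carefully.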
Hello,
I used the run_ner.py example mentioned in your documentation to fine-tune the model on my own dataset:
https://github.com/huggingface/transformers/tree/main/examples/pytorch/token-classification
It produced two files, config.json and tf_model.h5.
But when I try to use my fine-tuned model, this line
ws_driver = CkipNerChunker(model="tmp/tf_model.h5")
raises the following error:
Traceback (most recent call last):
File "/root/miniconda3/envs/chatbot/lib/python3.6/site-packages/ckip_transformers/nlp/util.py", line 89, in _get_model_name
model_name = self._model_names[model]
KeyError: './tmp/tf_model.h5'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "Transformers_pretrained.py", line 12, in
ws_driver = CkipWordSegmenter(model="./tmp/tf_model.h5")
File "/root/miniconda3/envs/chatbot/lib/python3.6/site-packages/ckip_transformers/nlp/driver.py", line 52, in __init__
model_name = kwargs.pop("model_name", self._get_model_name(model))
File "/root/miniconda3/envs/chatbot/lib/python3.6/site-packages/ckip_transformers/nlp/util.py", line 91, in _get_model_name
raise KeyError(f"Invalid model {model}") from exc
KeyError: 'Invalid model ./tmp/tf_model.h5'
How can I correctly use my own fine-tuned model with CkipWordSegmenter, CkipPosTagger, and CkipNerChunker?
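The traceback shows that the driver reads a model_name keyword before it falls back to looking up model in its table of built-in names. One thing to try, as an untested sketch, is passing the directory that contains the fine-tuned files through model_name. The helper below is hypothetical and only converts a weights-file path into its directory.

```python
from pathlib import Path

_WEIGHT_SUFFIXES = {".h5", ".bin", ".safetensors"}


def model_dir_from_weights(path: str) -> str:
    """Return the directory holding a weights file, or the path itself."""
    p = Path(path)
    return str(p.parent if p.suffix in _WEIGHT_SUFFIXES else p)


print(model_dir_from_weights("./tmp/tf_model.h5"))  # tmp

# Possible usage with ckip_transformers installed (not verified here):
#   from ckip_transformers.nlp import CkipWordSegmenter
#   ws_driver = CkipWordSegmenter(model_name=model_dir_from_weights("./tmp/tf_model.h5"))
```

Whether a directory that contains only TensorFlow weights (tf_model.h5) loads through the PyTorch-based driver is a separate question; saving PyTorch weights from the fine-tuning run may be needed.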
Is there a way to get an equivalent albert-tiny English language model for downstream tasks like intent and entity classification? I'm afraid no such albert-tiny model is available, so any lead on this, or a guide to creating one from scratch, would be highly appreciated.
Thanks
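Hello,
I would like to ask about fine-tuning the following ws, pos, and ner models:
ckiplab/bert-base-chinese-ws
ckiplab/bert-base-chinese-pos
ckiplab/bert-base-chinese-ner
Following the example, I run Hugging Face's run_ner.py and set model_name_or_path to each of the three models above for training.
When I fine-tune these three models, can my training data be labeled only with B and I? Can I not add entity types, for example "B-PRODUCT" and "I-PRODUCT"? Is O also not allowed? I ask because an earlier issue said that B and I are used.
Thanks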
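For the word segmentation case, a B/I scheme would mark the first character of each word with B and the remaining characters with I. The sketch below builds such labels from already segmented words. It assumes that scheme applies to the ws model, as the earlier issue mentioned above suggests, and says nothing about the pos or ner label sets.

```python
def words_to_bi(words):
    """Turn segmented words into one B/I label per character."""
    labels = []
    for word in words:
        if not word:
            continue  # skip empty strings
        labels.extend(["B"] + ["I"] * (len(word) - 1))
    return labels


print(words_to_bi(["傅達仁", "今", "將"]))  # ['B', 'I', 'I', 'B', 'B']
```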
Hi, thanks for your great work.
I found a small error in your example. Executing
ws = ws_driver(text, batch_size=256, max_length=512)
raises this error:
"AssertionError: Sequence length is longer than the maximum sequence length for this model (512 > 510)."
Setting max_length to 510 or lower fixes this.
Apart from that, everything is fine. It's an excellent and convenient tool for extracting information from data.
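The limit of 510 in the message is consistent with a 512-position model that reserves two positions for special tokens such as [CLS] and [SEP]; that reading is an assumption, not something the error states. A small helper (hypothetical name) that clamps a requested length accordingly:

```python
def clamp_max_length(requested: int, model_max: int = 512, num_special: int = 2) -> int:
    """Clamp a requested max_length to what fits alongside the special tokens."""
    return min(requested, model_max - num_special)


print(clamp_max_length(512))  # 510
print(clamp_max_length(128))  # 128

# Possible usage with the call from the example:
#   ws = ws_driver(text, batch_size=256, max_length=clamp_max_length(512))
```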
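Hello,
When using CkipWordSegmenter, CkipPosTagger, or CkipNerChunker,
is it possible to get the embedding of each output from the result?
For example, for the 45-character sentence in the example,
傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。
can a 45x768 result be obtained from somewhere in the final output? Thanks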
Is this model suitable for Simplified Chinese? Is there any experimental data for Simplified Chinese?
Hi,
I'm currently using ckip-transformers-ws as a preprocessing tool in my project, and I noticed that the DataLoader's pin_memory flag is hard-coded to True in util.py.
Pinning memory is incompatible with multiprocessing (multiple workers) [1], so when users call ckip-transformers inside the collate_fn of a DataLoader with multiple workers, a CUDA error occurs as shown in [1], even when only the CPU is used for inference.
Therefore, I think it would be better that:
Regards.
[1] https://discuss.pytorch.org/t/pin-memory-vs-sending-direct-to-gpu-from-dataset/33891/2
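As an illustration of the kind of change this issue is about, the sketch below enables pin_memory only when a GPU device is requested, following the device convention used elsewhere on this page (-1 for CPU, 0 and up for a GPU index). The function name is hypothetical and this is not the library's code.

```python
def dataloader_kwargs(device: int, batch_size: int = 256) -> dict:
    """Build DataLoader keyword arguments, pinning memory only for GPU use."""
    return {
        "batch_size": batch_size,
        "shuffle": False,
        # Pinned host memory only helps host-to-GPU copies.
        "pin_memory": device >= 0,
    }


print(dataloader_kwargs(device=-1)["pin_memory"])  # False
print(dataloader_kwargs(device=0)["pin_memory"])   # True

# Possible usage with torch installed:
#   loader = torch.utils.data.DataLoader(dataset, **dataloader_kwargs(device=-1))
```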