sertiscorp / thai-word-segmentation
Thai word segmentation with bi-directional RNN
License: MIT License
I am using your training code with English data and got the following error while saving the model. Kindly help:
AttributeError: module 'tensorflow._api.v2.saved_model' has no attribute 'signature_constants'
python 3.8.12
tensorflow 2.2.0
scikit-learn 1.0.2
numpy 1.22.2
Thank you
Hi,
when simply using pip install -r requirements.txt, it fails for me because scikit-learn depends on scipy.
Would it help if the ordering were different (or if you removed scipy)? A manual install works fine if I run pip install scikit-learn after the other packages are installed...
Excerpt from the log (using the requirements.txt file):
...
ImportError: Scientific Python (SciPy) is not installed.
scikit-learn requires SciPy >= 0.9.
...
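Older pip versions process setup.py-based packages roughly in file order, and scikit-learn's build used to import scipy, so the build can fail if scipy is not installed yet. One possible workaround (untested here; the exact package set is an assumption based on the versions quoted above) is to order requirements.txt so the build dependencies come first:

```
numpy
scipy
scikit-learn
tensorflow
```

Alternatively, running pip install scipy in a separate invocation before pip install -r requirements.txt sidesteps the ordering question entirely.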
I have finished deploying the saved model with TensorFlow Serving on the server side, but I have a problem when the client tries to match the input tensor format with the saved model.
The save_model function in model.py:
inputs = {
    'inputs': tf.saved_model.utils.build_tensor_info(tf.identity(self.tf_tokens_batch, 'inputs')),
    'lengths': tf.saved_model.utils.build_tensor_info(tf.identity(self.tf_lengths_batch, 'lengths')),
    'training': tf.saved_model.utils.build_tensor_info(tf.identity(self.tf_training, 'training'))
}
outputs = {
    'outputs': tf.saved_model.utils.build_tensor_info(tf.identity(self.tf_masked_prediction, 'outputs'))
}
client.py (my implementation)
text = "ทดสอบ"
inputs = [ThaiWordSegmentLabeller.get_input_labels(text)]
lengths = [len(text)]
request.model_spec.name = 'word'
request.model_spec.signature_name = 'word_segmentation'
request.inputs['inputs'].CopyFrom(tf.contrib.util.make_tensor_proto(values=inputs, dtype=tf.int64))
request.inputs['lengths'].CopyFrom(tf.contrib.util.make_tensor_proto(values=lengths, dtype=tf.int64))
request.inputs['training'].CopyFrom(tf.contrib.util.make_tensor_proto(values=False, dtype=tf.bool))
Output
grpc.framework.interfaces.face.face.AbortionError: AbortionError(code=StatusCode.INVALID_ARGUMENT, details="You must feed a value for placeholder tensor 'Placeholder_1' with dtype bool
[[Node: Placeholder_1 = Placeholder[_output_shapes=[[]], dtype=DT_BOOL, shape=[], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]")
Best Regards,
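One thing worth double-checking in a client like the one above (a hedged aside, since the error may have other causes): the 'inputs' tensor must be rectangular, so variable-length samples need padding before make_tensor_proto when batching more than one text. A hypothetical helper (the function name and pad value of 0 are my own assumptions, not this repo's API):

```python
def pad_batch(sequences, pad_value=0):
    """Pad variable-length label sequences to a rectangular batch and
    return the original lengths alongside it (illustrative sketch)."""
    max_len = max(len(seq) for seq in sequences)
    batch = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return batch, lengths
```

The returned batch and lengths would then feed the 'inputs' and 'lengths' tensors of the request.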
I saw your report with such high accuracy. I am a Lao student studying for a master's degree in computer science in China. Your project is very interesting, and I have an idea to do the same for Lao word segmentation and POS tagging. What should the Lao corpus I prepare look like? I want to try to code such a thing for the Lao language.
Please give me any ideas.
Hi,
I was looking for a tool to do Thai sentence segmentation, but there seems to be no readily available tool. Papers propose methods for use cases like Thai-English translation [1] or disambiguation of space characters as sentence markers [2] according to the number of verbs (morphemes) / rule-based approaches / discourse analysis.
Do you know a tool for this task or is it possible to use your word segmentation tool as part of a toolchain to do sentence segmentation?
Thank you in advance for any reply. Best regards.
PS: Using your tool to insert space characters between each "word" seems to improve the result of Google Translate (for single sentences). :-) Well, my sample size was not that large...
[1] http://www.aclweb.org/anthology/W10-3602
[2] http://pioneer.netserv.chula.ac.th/~awirote/ling/snlp2007-wirote.pdf
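The trick in the PS above (spacing out the words before feeding Google Translate) can be sketched with a tiny helper. The function name and the label convention (a truthy value at position i means character i starts a new word) are my own assumptions, not this repository's output format:

```python
def insert_spaces(text, word_starts):
    """Rebuild a string with a space before each word boundary.
    word_starts[i] is truthy when character i begins a new word
    (assumed label convention; illustrative sketch only)."""
    out = []
    for ch, starts_word in zip(text, word_starts):
        if starts_word and out:
            out.append(" ")
        out.append(ch)
    return "".join(out)
```

Applying this to the segmenter's per-character labels would produce the space-separated text to pass downstream.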
Hello, after training finishes, it does not save the model for me. What should I do?
Hi, just wondering (without trying): what format and what normalization do you expect? UTF-8? NFKD? Something else?
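For context on why normalization matters for Thai: under NFKD, THAI CHARACTER SARA AM (U+0E33) decomposes into NIKHAHIT + SARA AA, which changes the character count a character-level segmenter sees. A quick illustration (the sample word is my own choice):

```python
import unicodedata

word = "\u0e19\u0e49\u0e33"  # Thai "water": NO NU + MAI THO + SARA AM
nfc = unicodedata.normalize("NFC", word)
nfkd = unicodedata.normalize("NFKD", word)
# NFC leaves SARA AM intact (3 chars); NFKD splits it (4 chars)
```

So a model trained on one normalization form may see differently shaped inputs if clients send another.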
Hi,
While working on #8, it seems to me that the evaluation of the f-score is based on flattened true and predicted labels. For example, given 2 samples whose lengths are 7 and 20, the current code flattens the labels to shape (27,) and computes the score. However, I think this could overestimate the value.
To illustrate, I've made a notebook using random data. You can see there that the average f-score is slightly lower than the f-score from the flattened data.
Looking forward to your thoughts on this.
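A self-contained illustration of the effect (the numbers are chosen for clarity and are not from the notebook): with one short, imperfect sample and one long, perfect sample, the flattened score exceeds the per-sample average because the long sample dominates the pooled counts.

```python
def f1(true, pred):
    # Binary F1 from scratch so the example needs no dependencies
    tp = sum(t == 1 and p == 1 for t, p in zip(true, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(true, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(true, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

short_true = [1, 0, 1, 0, 1, 0, 1]  # length 7, 4 positives
short_pred = [1, 0, 0, 0, 1, 0, 0]  # misses 2 positives -> F1 = 2/3
long_true = [1, 0] * 10             # length 20
long_pred = [1, 0] * 10             # predicted perfectly -> F1 = 1

flat = f1(short_true + long_true, short_pred + long_pred)          # 12/13
avg = (f1(short_true, short_pred) + f1(long_true, long_pred)) / 2  # 5/6
```

Here flat ≈ 0.923 while the per-sample average is ≈ 0.833, matching the direction of the discrepancy described above.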
thai-word-segmentation/preprocess.py
Lines 44 to 51 in 5c77e02
At line 47, it looks like the data is not split correctly. I am not sure whether this line should be removed, since we should write each record only once, based on the random variable p.
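For comparison, here is a hypothetical sketch of the single-write behavior described above (the function and parameter names are my own, not those in preprocess.py): each record gets exactly one random draw and lands in exactly one split.

```python
import random

def split_records(records, p=0.9, seed=42):
    """Assign each record to exactly one split with one random draw per
    record (illustrative sketch; p and seed are assumed defaults)."""
    rng = random.Random(seed)
    train, test = [], []
    for record in records:
        if rng.random() < p:
            train.append(record)
        else:
            test.append(record)
    return train, test
```

With this structure no record can be written twice, which is the invariant the issue is asking about.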