prosusai / finbert
Financial Sentiment Analysis with BERT
License: Apache License 2.0
Hello, you mentioned in one of your responses that you used 50 finance keywords to extract the finance-related text. Do you mind sharing the keywords you used?
Thanks
When trying to load the model I get the following error. Do I need to download any additional model files? Thank you.
---------------------------------------------------------------------------
UnpicklingError Traceback (most recent call last)
<ipython-input-8-19ad01fc2649> in <module>
5 pass
6
----> 7 bertmodel = BertForSequenceClassification.from_pretrained(lm_path,cache_dir=None, num_labels=3)
8
9
/anaconda/envs/bert10k/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
601 if state_dict is None and not from_tf:
602 weights_path = os.path.join(serialization_dir, WEIGHTS_NAME)
--> 603 state_dict = torch.load(weights_path, map_location='cpu')
604 if tempdir:
605 # Clean up temp dir
/anaconda/envs/bert10k/lib/python3.7/site-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
424 if sys.version_info >= (3, 0) and 'encoding' not in pickle_load_args.keys():
425 pickle_load_args['encoding'] = 'utf-8'
--> 426 return _load(f, map_location, pickle_module, **pickle_load_args)
427 finally:
428 if new_fd:
/anaconda/envs/bert10k/lib/python3.7/site-packages/torch/serialization.py in _load(f, map_location, pickle_module, **pickle_load_args)
601 f.seek(0)
602
--> 603 magic_number = pickle_module.load(f, **pickle_load_args)
604 if magic_number != MAGIC_NUMBER:
605 raise RuntimeError("Invalid magic number; corrupt file?")
UnpicklingError: invalid load key, 'v'.
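One cause worth ruling out (an assumption, not confirmed in this thread): pickle reports the load key 'v' when the file begins with the letter v, which is how a Git LFS pointer file starts ("version https://git-lfs.github.com/spec/v1"). A minimal sketch to check whether a downloaded pytorch_model.bin is a small text pointer rather than the real weights; the function name and the path are hypothetical:

```python
from pathlib import Path

# Git LFS pointer files are small text files that begin with this line.
LFS_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def looks_like_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_PREFIX))
    return head == LFS_PREFIX


if __name__ == "__main__":
    weights = Path("models/sentiment/pytorch_model.bin")  # hypothetical path
    if weights.exists():
        print(weights, "is an LFS pointer:", looks_like_lfs_pointer(weights))
```

If the check returns True, the weights were never actually downloaded and the file would need to be fetched again.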
I have seen that your training implementation of FinBERT supports distributed training, but how can we launch it on multiple servers?
Hi, I am on Ubuntu 16 and have cloned finBERT on my machine. I created a folder and put the required files from the Hugging Face website in it. Everything was working fine, and I created the environment from the environment.yml file provided on GitHub.
But when I call AutoModelForSequenceClassification.from_pretrained(file_path, cache_dir=None, num_labels=3):
1. I get an "Unable to open file (file signature not found)" error.
2. Also, when I check the torch version, it is 1.1.0 in my terminal but 1.6.0 in the Jupyter notebook.
All of this code is being executed in a Jupyter notebook.
I tried this code
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")
but get this error.
OSError: Can't load 'ProsusAI/finbert'. Make sure that:
- 'ProsusAI/finbert' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'ProsusAI/finbert' is the correct path to a directory containing a 'config.json' file
Please provide information on where to download the config.json for the sentiment analysis model trained on Financial PhraseBank.
The instructions say to place a copy of config.json, but the link to download the sentiment analysis model trained on Financial PhraseBank gives the .bin file only and does not include config.json.
For both of these models, the workflow should be like this:
Create a directory for the model. For example: models/sentiment/
Download the model and put it into the directory you just created.
Put a copy of config.json in this same directory. <------------------------------
Call the model with .from_pretrained()
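The workflow above can be sketched as a quick pre-flight check before calling .from_pretrained(). This is a minimal sketch, assuming the weights file is named pytorch_model.bin (the name used elsewhere on this page); the helper name and the example directory are hypothetical:

```python
import os

# File names assumed from this page: config.json plus pytorch_model.bin.
REQUIRED_FILES = ("config.json", "pytorch_model.bin")


def missing_model_files(model_dir):
    """Return the required files that are absent from model_dir."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(model_dir, name))]


if __name__ == "__main__":
    model_dir = "models/sentiment"  # example directory from the steps above
    missing = missing_model_files(model_dir)
    if missing:
        print("Missing from", model_dir, ":", ", ".join(missing))
    else:
        print("Directory looks complete; pass it to .from_pretrained()")
```

Note that the argument handed to .from_pretrained() in these steps is the directory, not the path of the .bin file itself.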
The config.json on Hugging Face looks to be for TRC2.
{
"_name_or_path": "/home/ubuntu/finbert/models/language_model/finbertTRC2", <-----
"architectures": [
"BertForSequenceClassification"
],
"attention_probs_dropout_prob": 0.1,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "positive",
"1": "negative",
"2": "neutral"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"label2id": {
"positive": 0,
"negative": 1,
"neutral": 2
},
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"type_vocab_size": 2,
"vocab_size": 30522
}
Can you please attach the train, validation and test data which you used for training and testing?
Thanks in advance.
I'm facing an issue with this code:
trained_model = finbert.train(train_examples = train_data, model = model)
Error is
TypeError Traceback (most recent call last)
<ipython-input-11-2ebf0cb3d4e8> in <module>
----> 1 trained_model = finbert.train(train_examples = train_data, model = model)
~\finBERT-master\finbert\finbert.py in train(self, train_examples, model)
482 print('No best model found')
483 torch.save({'epoch': str(i), 'state_dict': model.state_dict()},
--> 484 self.config.model_dir / ('temporary' + str(i)))
485 best_model = i
486
TypeError: unsupported operand type(s) for /: 'str' and 'str'
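The traceback suggests that config.model_dir is a plain string here, so the / operator fails. A minimal sketch of one possible workaround, assuming the directory can be wrapped in pathlib.Path before it is handed to the config; the directory name is hypothetical:

```python
from pathlib import Path

# A plain string does not support "/" with another string ...
model_dir = "models/classifier_model"  # hypothetical directory

# ... but a Path does, which is what the failing line in train() relies on.
model_dir = Path(model_dir)
checkpoint = model_dir / ("temporary" + str(0))

print(checkpoint.as_posix())
```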
I downloaded the language model (pytorch_model.bin) and config.json and placed them in the same directory. When I run the "configuring training parameters" cell, I get the error below:
INFO - pytorch_pretrained_bert.modeling - loading archive file /directory/pytorch_model.bin
INFO - pytorch_pretrained_bert.modeling - extracting archive file /directory/pytorch_model.bin to temp dir /tmp/tmppjpupb
ReadError: not a gzip file
Looking in more detail:
/usr/lib/python3.6/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
1645 fileobj.close()
1646 if mode == 'r':
-> 1647 raise ReadError("not a gzip file")
1648 raise
1649 except:
Hi,
Thanks for sharing the work.
I want to run the notebook "finbert_training.ipynb" to fine-tune a sentiment classifier, but I could not work out the format of train.csv (and test.csv, validation.csv).
The Financial PhraseBank dataset has files in which the sentences and labels are separated by @.
Can you tell me the format in which these csv files should be prepared?
I see the following code for reading .csv files in utils.py, but I could not follow it.
def _read_tsv(cls, input_file):
    """Reads a tab separated value file."""
    with open(input_file, "r") as f:
        reader = csv.reader(f, delimiter="\t")
        lines = []
        for line in reader:
            if sys.version_info[0] == 2:
                line = list(unicode(cell, 'utf-8') for cell in line)
            lines.append(line)
        return lines
Thanks.
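A minimal sketch of one way to prepare such a file, assuming each Financial PhraseBank line has the form sentence@label and that the reader quoted above expects tab-separated rows. The column layout (an index, then text, then label, under a header row) is an assumption and is not confirmed by the authors in this thread; the function names and the sample sentence are hypothetical:

```python
import csv


def parse_phrasebank_line(line):
    """Split 'sentence@label' on the last '@' so sentences may contain '@'."""
    text, label = line.rstrip("\n").rsplit("@", 1)
    return text.strip(), label.strip()


def write_tsv(lines, out_path):
    """Write tab-separated rows with an assumed (index, text, label) layout."""
    with open(out_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["", "text", "label"])  # assumed header row
        for i, line in enumerate(lines):
            text, label = parse_phrasebank_line(line)
            writer.writerow([i, text, label])


if __name__ == "__main__":
    # Illustrative line, not taken from the dataset.
    sample = "Sales increased in the second quarter .@positive"
    print(parse_phrasebank_line(sample))
```

Calling write_tsv once per split (train, validation, test) would produce the three files; how the data is split is left open here.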
I was playing around with the finBERT model a bit and I noticed that for short sentences having a period at the end makes a big difference on the model's predictions (see Figures 1-2 below).
Any idea why that is the case? Could it be that the model was fine-tuned on the sentences with a dot at the end and that's why it makes such a difference? Or does it have to do with BERT embeddings, i.e. is there a special embedding for a dot?
When I call predict, I get the error "only one element tensors can be converted to Python scalars" on line 618 of finbert.py.
When I modify the line from:
logits = softmax(np.array(logits))
to
logits = softmax(np.array(logits[0]))
I get no error, but the predictions and sentiment scores do not seem right when I tested it on examples.csv. The logit looks like [small number, .99..., small number], so the labels are all negative and the scores are all around -.99...
For reference, I copied predict.py, finbert.py, and util.py into a jupyter notebook and used the following as my model
model = BertForSequenceClassification.from_pretrained("ipuneetrathore/bert-base-cased-finetuned finBERT",num_labels=3,cache_dir=None)
Good morning,
I am running the configuring parameters cell and I am getting the below error:
UnpicklingError Traceback (most recent call last)
in
5 pass
6
----> 7 bertmodel = BertForSequenceClassification.from_pretrained(lm_path,cache_dir=None, num_labels=3)
8
9
~/anaconda3/envs/finbert/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
601 if state_dict is None and not from_tf:
602 weights_path = os.path.join(serialization_dir, WEIGHTS_NAME)
--> 603 state_dict = torch.load(weights_path, map_location='cpu')
604 if tempdir:
605 # Clean up temp dir
~/anaconda3/envs/finbert/lib/python3.7/site-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
385 f = f.open('rb')
386 try:
--> 387 return _load(f, map_location, pickle_module, **pickle_load_args)
388 finally:
389 if new_fd:
~/anaconda3/envs/finbert/lib/python3.7/site-packages/torch/serialization.py in _load(f, map_location, pickle_module, **pickle_load_args)
562 f.seek(0)
563
--> 564 magic_number = pickle_module.load(f, **pickle_load_args)
565 if magic_number != MAGIC_NUMBER:
566 raise RuntimeError("Invalid magic number; corrupt file?")
UnpicklingError: invalid load key, 'v'.
Moreover, can you kindly explain how I can construct the files train.csv, validation.csv, test.csv?
Regards,
Bernard
Hi there, great work and thanks for sharing!
I am currently trying to reproduce the pre-training of BERT on the TRC2 corpus for research purposes. I have access to the corpus, so this is not a request for data. Instead, could you please share the pre-processing code you used for pre-training BERT to produce finBERT, how you fed the data into BERT pre-training, etc.?
Best regards-
Hello, I trained the model with my own parameters, and saved it.
However, whenever I try to use it, I get the following error:
UnicodeDecodeError Traceback (most recent call last)
in
4 tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
5
----> 6 model = AutoModelForSequenceClassification.from_pretrained("C:/Users/Verena/Documents/finbert_new/models/classifier_model/finbert-sentiment.bin")
7 label_list = label_list=['positive','negative','neutral']
~\anaconda3\envs\finbert\lib\site-packages\transformers-4.0.1-py3.8.egg\transformers\models\auto\modeling_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
1237 if not isinstance(config, PretrainedConfig):
1238 config, kwargs = AutoConfig.from_pretrained(
-> 1239 pretrained_model_name_or_path, return_unused_kwargs=True, **kwargs
1240 )
1241
~\anaconda3\envs\finbert\lib\site-packages\transformers-4.0.1-py3.8.egg\transformers\models\auto\configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
339 {'foo': False}
340 """
--> 341 config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
342
343 if "model_type" in config_dict:
~\anaconda3\envs\finbert\lib\site-packages\transformers-4.0.1-py3.8.egg\transformers\configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
387 )
388 # Load config dict
--> 389 config_dict = cls._dict_from_json_file(resolved_config_file)
390
391 except EnvironmentError as err:
~\anaconda3\envs\finbert\lib\site-packages\transformers-4.0.1-py3.8.egg\transformers\configuration_utils.py in _dict_from_json_file(cls, json_file)
470 def _dict_from_json_file(cls, json_file: str):
471 with open(json_file, "r", encoding="utf-8") as reader:
--> 472 text = reader.read()
473 return json.loads(text)
474
~\anaconda3\envs\finbert\lib\codecs.py in decode(self, input, final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
The same happens when I try to load the language model, even though both models are downloaded locally. I was only able to use finbert through transformers.
Can you please help me? Thanks!
Hey there,
How much improvement do you see with "discriminative fine-tuning" and "Gradual Unfreezing"?
Could you quantify the improvement?
i.e.:
1%
5%
thanks!
Hi, I'm trying to use predict.py. Could it predict one final result for an entire article (maybe 17 lines), rather than 17 results, one for each sentence?
The images show 17 lines with a green background, each of which gets its own result. Could it instead predict for the entire article (yellow background)?
Thanks for your help.
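Since predict.py scores sentence by sentence, one option is to aggregate the per-sentence outputs afterwards. A minimal sketch, assuming each sentence yields three probabilities in the order positive, negative, neutral (the label order quoted elsewhere on this page) and that a plain average is an acceptable way to combine them; the function name and the numbers are illustrative only:

```python
LABELS = ["positive", "negative", "neutral"]  # order taken from label_list


def aggregate_article(sentence_probs):
    """Average per-sentence probabilities and pick the most likely label."""
    n = len(sentence_probs)
    if n == 0:
        raise ValueError("need at least one sentence")
    mean = [sum(p[i] for p in sentence_probs) / n for i in range(len(LABELS))]
    best = max(range(len(LABELS)), key=lambda i: mean[i])
    return LABELS[best], mean


if __name__ == "__main__":
    # Illustrative numbers, not real model output.
    probs = [[0.7, 0.1, 0.2], [0.2, 0.5, 0.3], [0.6, 0.1, 0.3]]
    label, mean = aggregate_article(probs)
    print(label, [round(m, 2) for m in mean])
```

A plain average gives every sentence equal weight; weighting by sentence length would be another reasonable choice.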
Hi!
I just downloaded this .bin file and was wondering how to get the following two files (the language model trained on TRC2 and the sentiment analysis model trained on Financial PhraseBank) from it?
Thanks!!
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.cuda(device)
gives this error!
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 989, in forward
_, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 730, in forward
embedding_output = self.embeddings(input_ids, token_type_ids)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 267, in forward
words_embeddings = self.word_embeddings(input_ids)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/functional.py", line 1484, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select
Is it possible to acquire each sentence representation by removing your last few layers in the model?
If yes, which layer outputs the sentence representation?
Thanks!
Dear Sir,
I hope this email finds you well. I am new to data science, and for my course project I need to use finbert to get tweet sentiment scores. I am unable to install finbert in PyCharm. Kindly help me with this. Thank you.
Best Regards
Taha Ahmad
Hi,
Since the TRC2 dataset cannot be publicly made available, I was wondering if some similar format sample training dataset can be provided.
Thanks!
Thanks for sharing your code on this matter. I have used your trained model. However, I want to try to train the model on my own with the datasets you have mentioned.
But when running finbert_training.ipynb,
I hit an error at trained_model = finbert.train(train_examples = train_data, model = model)
I have tried to debug the code; for some reason step is always 0. I was wondering if you could give me some hints on how to fix this issue :)
Hello,
Could you include the vocab.txt for finBERT? I don't see it in the model's directory, and it seems like you are using the bert-base-uncased vocabulary for constructing the tokenizer (https://github.com/ProsusAI/finBERT/blob/master/finbert/finbert.py):
self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=self.config.do_lower_case)
Thank you,
Andrei
I have labels of dimension 240; it is a multi-label classification problem.
I downloaded the Finbert model:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import AutoTokenizer
tokenizer=AutoTokenizer.from_pretrained("/home/pratik/finbert", use_fast=True)
train_tokenizer_texts = list(map(lambda t: tokenizer.tokenize(t,add_special_tokens=True,max_length=512,padding='max_length'), tqdm(train_sentences)))
np.array(train_tokenizer_texts[0])
%%time
#Initiating a BERT model
model = AutoModelForSequenceClassification.from_pretrained("/home/pratik/finbert", num_labels = 240)
model.cuda()
This gives me the following error:
RuntimeError: Error(s) in loading state_dict for BertForSequenceClassification:
size mismatch for classifier.weight: copying a param with shape torch.Size([3, 768]) from checkpoint, the shape in current model is torch.Size([240, 768]).
size mismatch for classifier.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([240]).
I succeeded in training the model by executing finbert_training.ipynb. Then I ran a Docker container built from the Dockerfile and sent a POST request to the container (localhost:8080), which produced the following error.
08/27/2021 17:03:42 - INFO - pytorch_pretrained_bert.modeling - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
08/27/2021 17:03:42 - INFO - pytorch_pretrained_bert.modeling - loading archive file /src/models/classifier_model/finbert-sentiment
08/27/2021 17:03:42 - INFO - pytorch_pretrained_bert.modeling - Model config {
"_name_or_path": "c:\\Users\\user\\projects\\finbert\\finBERT\\models\\language_model\\finbertTRC2",
"attention_probs_dropout_prob": 0.1,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1",
"2": "LABEL_2"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1,
"LABEL_2": 2
},
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"type_vocab_size": 2,
"vocab_size": 30522
}
08/27/2021 17:03:45 - INFO - pytorch_pretrained_bert.modeling - Weights from pretrained model not used in BertForSequenceClassification: ['bert.embeddings.position_ids']
* Serving Flask app 'main' (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
08/27/2021 17:03:45 - WARNING - werkzeug - * Running on all addresses.
WARNING: This is a development server. Do not use it in a production deployment.
08/27/2021 17:03:45 - INFO - werkzeug - * Running on http://172.17.0.2:8080/ (Press CTRL+C to quit)
08/27/2021 17:10:41 - INFO - filelock - Lock 140200106040192 acquired on /root/.cache/huggingface/transformers/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock
Downloading: 100% 28.0/28.0 [00:00<00:00, 10.4kB/s]
08/27/2021 17:10:42 - INFO - filelock - Lock 140200106040192 released on /root/.cache/huggingface/transformers/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock
08/27/2021 17:10:43 - INFO - filelock - Lock 140200106041056 acquired on /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e.lock
Downloading: 100% 570/570 [00:00<00:00, 406kB/s]
08/27/2021 17:10:43 - INFO - filelock - Lock 140200106041056 released on /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e.lock
08/27/2021 17:10:44 - INFO - filelock - Lock 140200106040720 acquired on /root/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
Downloading: 100% 232k/232k [00:00<00:00, 490kB/s]
08/27/2021 17:10:45 - INFO - filelock - Lock 140200106040720 released on /root/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
08/27/2021 17:10:46 - INFO - filelock - Lock 140200106040768 acquired on /root/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
Downloading: 100% 466k/466k [00:00<00:00, 595kB/s]
08/27/2021 17:10:47 - INFO - filelock - Lock 140200106040768 released on /root/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
['The Federal Reserve is committed to using its full range of tools to support the U.S. economy in this challenging time, thereby promoting its maximum employment and price stability goals.']
08/27/2021 17:10:49 - INFO - finbert.utils - *** Example ***
08/27/2021 17:10:49 - INFO - finbert.utils - guid: 0
08/27/2021 17:10:49 - INFO - finbert.utils - tokens: [CLS] the federal reserve is committed to using its full range of tools to support the u . s . economy in this challenging time , thereby promoting its maximum employment and price stability goals . [SEP]
08/27/2021 17:10:49 - INFO - finbert.utils - input_ids: 101 1996 2976 3914 2003 5462 2000 2478 2049 2440 2846 1997 5906 2000 2490 1996 1057 1012 1055 1012 4610 1999 2023 10368 2051 1010 8558 7694 2049 4555 6107 1998 3976 9211 3289 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:10:49 - INFO - finbert.utils - attention_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:10:49 - INFO - finbert.utils - token_type_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:10:49 - INFO - finbert.utils - label: None (id = 9090)
[<finbert.utils.InputFeatures object at 0x7f82e1849d90>]
08/27/2021 17:10:49 - INFO - root - tensor([ 2.1882, -2.1247, -0.7895])
[ 2.1882384 -2.1246738 -0.78948754]
08/27/2021 17:10:49 - ERROR - main - Exception on / [POST]
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 2070, in wsgi_app
response = self.full_dispatch_request()
File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1515, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/opt/conda/lib/python3.8/site-packages/flask_cors/extension.py", line 165, in wrapped_function
return cors_after_request(app.make_response(f(*args, **kwargs)))
File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1513, in full_dispatch_request
rv = self.dispatch_request()
File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1499, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File "/src/main.py", line 21, in score
return(predict(text, model).to_json(orient='records'))
File "/src/finbert/finbert.py", line 615, in predict
logits = softmax(np.array(logits))
File "/src/finbert/utils.py", line 215, in softmax
e_x = np.exp(x - np.max(x, axis=1)[:, None])
File "<__array_function__ internals>", line 5, in amax
File "/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2705, in amax
return _wrapreduction(a, np.maximum, 'max', axis, None, out,
File "/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
numpy.AxisError: axis 1 is out of bounds for array of dimension 1
08/27/2021 17:10:49 - INFO - werkzeug - 172.17.0.1 - - [27/Aug/2021 17:10:49] "POST / HTTP/1.1" 500 -
['The Federal Reserve is committed to using its full range of tools to support the US economy in this challenging time, thereby promoting its maximum employment and price stability goals.']
08/27/2021 17:22:27 - INFO - finbert.utils - *** Example ***
08/27/2021 17:22:27 - INFO - finbert.utils - guid: 0
08/27/2021 17:22:27 - INFO - finbert.utils - tokens: [CLS] the federal reserve is committed to using its full range of tools to support the us economy in this challenging time , thereby promoting its maximum employment and price stability goals . [SEP]
08/27/2021 17:22:27 - INFO - finbert.utils - input_ids: 101 1996 2976 3914 2003 5462 2000 2478 2049 2440 2846 1997 5906 2000 2490 1996 2149 4610 1999 2023 10368 2051 1010 8558 7694 2049 4555 6107 1998 3976 9211 3289 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:22:27 - INFO - finbert.utils - attention_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:22:27 - INFO - finbert.utils - token_type_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:22:27 - INFO - finbert.utils - label: None (id = 9090)
[<finbert.utils.InputFeatures object at 0x7f82e1004d90>]
08/27/2021 17:22:27 - INFO - root - tensor([ 2.0840, -2.0827, -0.8532])
[ 2.0839536 -2.0827212 -0.85315543]
08/27/2021 17:22:27 - ERROR - main - Exception on / [POST]
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 2070, in wsgi_app
response = self.full_dispatch_request()
File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1515, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/opt/conda/lib/python3.8/site-packages/flask_cors/extension.py", line 165, in wrapped_function
return cors_after_request(app.make_response(f(*args, **kwargs)))
File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1513, in full_dispatch_request
rv = self.dispatch_request()
File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1499, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File "/src/main.py", line 21, in score
return(predict(text, model).to_json(orient='records'))
File "/src/finbert/finbert.py", line 615, in predict
logits = softmax(np.array(logits))
File "/src/finbert/utils.py", line 215, in softmax
e_x = np.exp(x - np.max(x, axis=1)[:, None])
File "<__array_function__ internals>", line 5, in amax
File "/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2705, in amax
return _wrapreduction(a, np.maximum, 'max', axis, None, out,
File "/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
numpy.AxisError: axis 1 is out of bounds for array of dimension 1
08/27/2021 17:22:27 - INFO - werkzeug - 172.17.0.1 - - [27/Aug/2021 17:22:27] "POST / HTTP/1.1" 500 -
I have solved this issue by changing the code in finbert.py.
Before:
with torch.no_grad():
    logits = model(all_input_ids, all_attention_mask, all_token_type_ids)[0]
After:
with torch.no_grad():
    logits = model(all_input_ids, all_attention_mask, all_token_type_ids)
I would appreciate it if you could check this out.
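The AxisError in the traceback above arises because softmax reduces over axis=1 while the logits array has only one dimension. Independently of which line in finbert.py is changed, a defensive softmax could accept both shapes. A minimal sketch in plain Python (the repository's version uses NumPy; softmax_rows is a hypothetical name, and the logits are the ones printed in the log above):

```python
import math


def softmax_rows(logits):
    """Row-wise softmax that also accepts a single flat row of logits."""
    # Promote a flat list such as [2.19, -2.12, -0.79] to one row,
    # which is the shape that triggered the AxisError.
    if logits and not isinstance(logits[0], (list, tuple)):
        logits = [logits]
    result = []
    for row in logits:
        shift = max(row)  # subtract the max for numerical stability
        exps = [math.exp(x - shift) for x in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


if __name__ == "__main__":
    probs = softmax_rows([2.1882, -2.1247, -0.7895])
    print([round(p, 3) for p in probs[0]])
```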
Hi, how do I set up and create train.csv, validation.csv, and test.csv from the Financial PhraseBank data?
Hi! Thanks for developing and sharing the codes.
I wonder which vanilla BERT model you used for post-training on financial-domain text.
To be specific, I wonder whether this FinBERT model can tell the difference between uppercase and lowercase.
Hello. I would like to report that running a container created from the image built from the Dockerfile in the current master branch causes a ModuleNotFoundError.
I solved this issue by changing the code in the Dockerfile.
Before:
RUN pip install pytorch_pretrained_bert numpy pandas nltk Flask flask-cors
After:
RUN pip install pytorch_pretrained_bert numpy pandas nltk Flask flask-cors transformers
I would appreciate it if you could check it out.
The first model to download "Language model trained on TRC2" is not valid. I get wrong predictions with it.
The second one seems to be fine.
Thank you for your great work!
When I run the example predict.py, I get the errors below. Should pytorch_pretrained_bert be added to your environment.yml?
Traceback (most recent call last):
File "predict.py", line 1, in
from finbert.finbert import predict
File "C:\Projects\Python\GitHub\finBERT\finbert\finbert.py", line 6, in
from pytorch_pretrained_bert.tokenization import BertTokenizer
ModuleNotFoundError: No module named 'pytorch_pretrained_bert'
Hi Sir,
Thank you so much for sharing the code.
I notice that the vocab.txt file is missing from the finbertTRC2 folder. Could you tell me where I can find this file?
Thanks!
Hi,
I downloaded the Financial PhraseBank dataset from Malo et al. (2014) and created train.csv from the data.
train_data = finbert.get_data('train')
But the above code snippet in the "finbert_training" notebook generated the following error message.
Is there any way to resolve this issue?
Thanks.
UnicodeDecodeError Traceback (most recent call last)
in
7 #print(cl_data_path)
8 # Get the training examples
----> 9 train_data = finbert.get_data('train')
~\Documents\FIN_BERT\finBERT-master\finbert\finbert.py in get_data(self, phase)
192 self.num_train_optimization_steps = None
193 examples = None
--> 194 examples = self.processor.get_examples(self.config.data_dir, phase)
195 self.num_train_optimization_steps = int(
196 len(
~\Documents\FIN_BERT\finBERT-master\finbert\utils.py in get_examples(self, data_dir, phase)
89 Name of the .csv file to be loaded.
90 """
---> 91 return self._create_examples(self._read_tsv(os.path.join(data_dir, (phase + ".csv"))), phase)
92
93 def get_labels(self):
~\Documents\FIN_BERT\finBERT-master\finbert\utils.py in _read_tsv(cls, input_file)
66 reader = csv.reader(f, delimiter="\t")
67 lines = []
---> 68 for line in reader:
69 if sys.version_info[0] == 2:
70 line = list(unicode(cell, 'utf-8') for cell in line)
D:\Python\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7919: character maps to <undefined>
On line 628 of finbert.py you use result = pd.concat([result,batch_result]) when it should be result = pd.concat([result,batch_result], ignore_index=True).
When you concatenate multiple batches together in your result DataFrame, you will have ids that are the same; e.g. with 2 batches of 3 items, the indexes in result will be 0,1,2,0,1,2.
If you were to convert the DataFrame to a dictionary, the results would override each other, as multiple keys with the same value exist.
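The effect described above can be illustrated without pandas. A minimal sketch, with made-up rows, showing why repeated index labels collapse once they become dictionary keys, and what renumbering (the effect of ignore_index=True) preserves:

```python
# Two batches of three items, each batch indexed from 0 (made-up rows).
batch_indexes = [0, 1, 2, 0, 1, 2]
rows = ["a", "b", "c", "d", "e", "f"]

# Keyed by the repeated index, later rows overwrite earlier ones.
collided = dict(zip(batch_indexes, rows))

# Renumbering 0..n-1, as ignore_index=True does, keeps every row.
renumbered = dict(enumerate(rows))

print(len(collided), len(renumbered))
```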
I'm trying to run FinBERT for stock market prediction based on SEC filings. I am using the finbert_training notebook as a reference.
When running:
trained_model = finbert.train(train_examples = train_data, model = model)
I get the following error:
I ran the dataset.py script, and my dataset looks like this:
Could you please guide me as for what I could do?
Thank you!
I am unable to download files through git-lfs because of this error:
"This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access."
Could you please provide the models through an alternate resource or upgrade the LFS data packs? If that is not possible could you provide the train/test/validation sets used to train the classifier? Thank you!
I am predicting on paragraphs with this model.
The output separates sentences seemingly at random.
It certainly has something to do with the fact that my text data is not clean.
I wonder whether others need a fix for this too.
Thank you.
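One thing that may help, assuming the stray splits come from hard line breaks and irregular spacing in the raw text rather than from the model itself, is to collapse whitespace before running prediction. A small sketch follows; normalize_whitespace is a hypothetical helper, not part of the finBERT code, and the sample text is made up:

```python
import re


def normalize_whitespace(raw):
    """Collapse every run of spaces, tabs and newlines into one space."""
    return re.sub(r"\s+", " ", raw).strip()


# Made-up paragraph with a hard line wrap, a tab and a blank line.
raw_paragraph = "Revenue rose\n 5% in\tQ2.\n\nCosts fell."
clean_paragraph = normalize_whitespace(raw_paragraph)
```

This only addresses layout noise. Text with other problems (missing punctuation, abbreviations, table fragments) would need different cleaning.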
Hi,
When installing the environment file the following error occurs:
Pip subprocess output:
Collecting joblib==0.13.2
Using cached joblib-0.13.2-py2.py3-none-any.whl (278 kB)
Collecting pytorch-pretrained-bert==0.6.2
Using cached pytorch_pretrained_bert-0.6.2-py3-none-any.whl (123 kB)
Collecting scikit-learn==0.21.2
Using cached scikit_learn-0.21.2-cp37-cp37m-win_amd64.whl (5.9 MB)
Collecting spacy==2.1.4
Using cached spacy-2.1.4-cp37-cp37m-win_amd64.whl (29.0 MB)
Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement torch==1.1.0 (from -r C:\Users\Matth\condaenv.xiuu97lv.requirements.txt (line 5)) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch==1.1.0 (from -r C:\Users\Matth\condaenv.xiuu97lv.requirements.txt (line 5))
CondaEnvException: Pip failed
As a consequence, packages cannot be loaded in the created environment.
The error seems common for Windows users (#21). However, updating Python did not do the trick for me.
I use Windows 10, Python 3.7.7, torch 1.5.1 and conda 4.8.3.
Hopefully you can help me get FinBERT up and running on Windows.
Hey there!
I found that training the model failed because the call to open in _read_tsv, on line 63 of utils.py, did not specify an encoding.
To fix it I changed it from:
def _read_tsv(cls, input_file):
    with open(input_file, "r") as f:
To:
def _read_tsv(cls, input_file):
    with open(input_file, "r", encoding='utf-8') as f:
Now it works great! Thanks!
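For anyone who wants to try the change in isolation, here is a self-contained sketch of a TSV reader with the encoding set explicitly. read_tsv is a standalone stand-in for the repository's _read_tsv classmethod, and the sample row is made up:

```python
import csv
import os
import tempfile


def read_tsv(input_file):
    """Read a tab-separated file as UTF-8 and return its rows as lists."""
    # An explicit encoding avoids falling back to the platform default
    # (cp1252 on many Windows installations).
    with open(input_file, "r", encoding="utf-8", newline="") as f:
        return list(csv.reader(f, delimiter="\t"))


# Write a one-line sample file containing a non-ASCII character, then read it.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.tsv")
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write("Álvarez Group raised its profit forecast\tpositive\n")
    rows = read_tsv(path)
```

This assumes the data files are UTF-8. A file saved in another encoding would need that encoding passed instead.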
I did not find code for the FiQA aspect-based sentiment task, and I wonder how the RoBERTa model handles aspect-based sentiment, which differs from vanilla sentiment classification. Thanks a lot.
I tried this code:
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
but got an error:
OSError: Can't load 'ProsusAI/finbert'. Make sure that:
'ProsusAI/finbert' is a correct model identifier listed on 'https://huggingface.co/models'
or 'ProsusAI/finbert' is the correct path to a directory containing a 'config.json' file
Hi, I tried to run predict.py as described in the README:
!python predict.py --text_path="test.txt" --output_dir="output/" --model_path="pytorch_model.bin"
but I got the following error:
usage: predict.py [-h] [--data_path DATA_PATH]
predict.py: error: unrecognized arguments: --text_path=test.txt --output_dir=output/ --model_path=pytorch_model.bin
What is the problem? It looks like a different file is being run, but I can't figure out which one or why.
Could you please help me?
Regards
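For what it's worth, argparse builds that usage line from the options the running script defines, so it suggests the predict.py being executed declares only --data_path. Here is a small sketch reproducing the mismatch with a stand-in parser. The parser is an assumption built from the usage line in the error, not the repository's code:

```python
import argparse

# Stand-in for a script that only knows about --data_path.
parser = argparse.ArgumentParser(prog="predict.py")
parser.add_argument("--data_path")

# The flags from the command in the question.
cli_args = [
    "--text_path=test.txt",
    "--output_dir=output/",
    "--model_path=pytorch_model.bin",
]

# parse_known_args collects unrecognized flags instead of exiting,
# which makes the mismatch easy to inspect.
known, unrecognized = parser.parse_known_args(cli_args)
```

All three flags end up in unrecognized. Running the real script with -h and comparing its options with the README would show whether the checked-out predict.py matches the documented one.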
Hi,
I am trying to set up the environment. It always fails with a pip and torch version error. Could you please update the yml file?
I would like to try your solution.
Hi,
Thank you for making the code available.
As per the readme file, I understand that there are two models:
finbert_training.ipynb is used to load the language_model and fine-tune it on Financial Phrasebank.
I was wondering if you could make the script used for further pre-training the language_model available too.
Thanks!