prosusai / finbert

Financial Sentiment Analysis with BERT

License: Apache License 2.0

Python 42.61% Jupyter Notebook 57.13% Dockerfile 0.26%

finbert's Issues

Preprocessing using TRC2

Hello, you mentioned in one of your responses that you used 50 finance keywords to extract the finance-related text. Would you mind sharing the keywords you used?

Thanks

Cannot load the model

When trying to load the model I get the following error. Do I need to download any additional model files? Thank you.

---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-8-19ad01fc2649> in <module>
      5     pass
      6 
----> 7 bertmodel = BertForSequenceClassification.from_pretrained(lm_path,cache_dir=None, num_labels=3)
      8 
      9 

/anaconda/envs/bert10k/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    601         if state_dict is None and not from_tf:
    602             weights_path = os.path.join(serialization_dir, WEIGHTS_NAME)
--> 603             state_dict = torch.load(weights_path, map_location='cpu')
    604         if tempdir:
    605             # Clean up temp dir

/anaconda/envs/bert10k/lib/python3.7/site-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
    424         if sys.version_info >= (3, 0) and 'encoding' not in pickle_load_args.keys():
    425             pickle_load_args['encoding'] = 'utf-8'
--> 426         return _load(f, map_location, pickle_module, **pickle_load_args)
    427     finally:
    428         if new_fd:

/anaconda/envs/bert10k/lib/python3.7/site-packages/torch/serialization.py in _load(f, map_location, pickle_module, **pickle_load_args)
    601             f.seek(0)
    602 
--> 603     magic_number = pickle_module.load(f, **pickle_load_args)
    604     if magic_number != MAGIC_NUMBER:
    605         raise RuntimeError("Invalid magic number; corrupt file?")

UnpicklingError: invalid load key, 'v'.
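
A note on this error: `invalid load key, 'v'` often means the file on disk is not the real weights but a small text file, for example a Git LFS pointer (which starts with the word `version`) or a saved HTML page. The sketch below is only a diagnostic under that assumption; `describe_weights_file` is a hypothetical helper, not part of finBERT or PyTorch.

```python
# Hypothetical diagnostic: peek at the first bytes of a downloaded weights file.
LFS_PREFIX = b"version https://git-lfs"


def describe_weights_file(path):
    """Guess what kind of file sits at `path` from its first 64 bytes."""
    with open(path, "rb") as f:
        head = f.read(64)
    if head.startswith(LFS_PREFIX):
        # A Git LFS pointer is plain text; the real weights were never downloaded.
        return "git-lfs pointer"
    if head.lstrip().lower().startswith((b"<!doctype", b"<html")):
        # A web page was saved instead of the binary file.
        return "html page"
    if head.startswith(b"PK") or head.startswith(b"\x80"):
        # Zip archive (newer torch.save) or raw pickle stream (older torch.save).
        return "binary (zip or pickle)"
    return "unknown"
```

If the result is anything other than the binary case, downloading the file again is probably the first thing to try.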

How can we launch finbert_training on multiple servers

I have seen that your training implementation of FinBERT supports distributed training. How can we launch it on multiple servers?

finBERT pretrained model giving issues while calling it

Hi, I am on Ubuntu 16. I have cloned finBERT on my machine, created a folder, and put the required files from the Hugging Face website in it. I created the environment from the environment.yml file provided on GitHub, and everything was working fine up to that point.

But when I call AutoModelForSequenceClassification.from_pretrained(file_path, cache_dir=None, num_labels=3):

1. I get an "Unable to open file (file signature not found)" error.

2. Also, when I check the version of torch, it is 1.1.0 in my terminal but 1.6.0 in the Jupyter notebook.

All of this code is being executed in a Jupyter notebook.

Here is an image of the error I am getting
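
A note on the version mismatch: two different torch versions usually mean the terminal and the notebook kernel run different Python interpreters. The sketch below prints which interpreter is active and which package versions it sees; running it in both places shows whether they match. `package_version` and `environment_report` are hypothetical names, and the package list is only an example.

```python
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of `name` for this interpreter, or None."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("torch", "transformers")):
    """Collect the interpreter path and the versions of the given packages."""
    report = {"executable": sys.executable}
    for name in packages:
        report[name] = package_version(name)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```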

How to use finBERT using Hugging Face model?

I tried this code

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

but get this error.

OSError: Can't load 'ProsusAI/finbert'. Make sure that:

- 'ProsusAI/finbert' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'ProsusAI/finbert' is the correct path to a directory containing a 'config.json' file

Where is the config.json for Sentiment analysis model trained on Financial PhraseBank

Please tell me where to download the config.json for the sentiment analysis model trained on Financial PhraseBank.

The instructions say to place a copy of config.json in the model directory, but the download link for the sentiment analysis model trained on Financial PhraseBank provides only the .bin file, not the config.json.

For both of these models, the workflow should be like this:

Create a directory for the model. For example: models/sentiment/
Download the model and put it into the directory you just created.
Put a copy of config.json in this same directory. <------------------------------
Call the model with .from_pretrained()

The config.json on Hugging Face looks to be for TRC2.

{
  "_name_or_path": "/home/ubuntu/finbert/models/language_model/finbertTRC2",   <-----
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "positive",
    "1": "negative",
    "2": "neutral"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "positive": 0,
    "negative": 1,
    "neutral": 2
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "type_vocab_size": 2,
  "vocab_size": 30522
}
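
The four-step workflow quoted above can be checked before calling .from_pretrained(). This is a minimal sketch, assuming the two file names used elsewhere in these issues (config.json and pytorch_model.bin); `missing_model_files` is a hypothetical helper, not part of finBERT.

```python
from pathlib import Path

# Assumed file names; adjust if your download uses different ones.
REQUIRED_FILES = ("config.json", "pytorch_model.bin")


def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the names from `required` that are not present in `model_dir`."""
    model_dir = Path(model_dir)
    return [name for name in required if not (model_dir / name).is_file()]
```

For example, `missing_model_files("models/sentiment/")` returning `["config.json"]` would mean step 3 of the workflow has not been done yet.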

Please attach the train, validation and test data which you used for training and testing.

Can you please attach the train, validation and test data which you used for training and testing?

Thanks in advance.

TypeError: unsupported operand type(s) for /: 'str' and 'str' at trained_model = finbert.train(train_examples = train_data, model = model)

I'm facing an issue with this code:

trained_model = finbert.train(train_examples = train_data, model = model)

The error is:

TypeError                                 Traceback (most recent call last)
<ipython-input-11-2ebf0cb3d4e8> in <module>
----> 1 trained_model = finbert.train(train_examples = train_data, model = model)

~\finBERT-master\finbert\finbert.py in train(self, train_examples, model)
    482                     print('No best model found')
    483                 torch.save({'epoch': str(i), 'state_dict': model.state_dict()},
--> 484                            self.config.model_dir / ('temporary' + str(i)))
    485                 best_model = i
    486 

TypeError: unsupported operand type(s) for /: 'str' and 'str'
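
A note on this error: the `/` operator joins paths only when the left operand is a `pathlib.Path`. The traceback suggests that `config.model_dir` was set to a plain string. A minimal sketch of the difference, assuming that is the cause; `checkpoint_path` is a hypothetical helper that mirrors the expression in the traceback.

```python
from pathlib import Path


def checkpoint_path(model_dir, epoch):
    """Build the temporary checkpoint path; accepts a str or a Path."""
    # Wrapping in Path() makes the "/" operator valid even for string input.
    return Path(model_dir) / ("temporary" + str(epoch))
```

Under the same assumption, wrapping the model directory in `Path(...)` when building the config in the notebook should have the same effect without touching finbert.py.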

lm_fine_tuning

Hi,

With reference to my previous issue posted here, I followed what you suggested, but I still couldn't fine-tune on my sample domain. I was wondering if you could take a look at this.

Thanks!

loading pytorch_model.bin

I downloaded the language model (pytorch_model.bin) and config.json and placed them in the same directory. When I run the "configuring training parameters" cell, I get the error below:

INFO - pytorch_pretrained_bert.modeling - loading archive file /directory/pytorch_model.bin
INFO - pytorch_pretrained_bert.modeling - extracting archive file /directory/pytorch_model.bin to temp dir /tmp/tmppjpupb
ReadError: not a gzip file

Looking in more detail:
/usr/lib/python3.6/tarfile.py in gzopen(cls, name, mode, fileobj, compresslevel, **kwargs)
1645 fileobj.close()
1646 if mode == 'r':
-> 1647 raise ReadError("not a gzip file")
1648 raise
1649 except:

Sentiment classifier finetuning Input Format

Hi,

Thanks for sharing the work.
In order to run the notebook "finbert_training.ipynb" for fine-tuning a sentiment classifier, I need to prepare train.csv (and the test and validation files), but I could not understand their format.
The Financial PhraseBank dataset has files where the sentences and labels are separated by @.

Can you tell me the format in which these csv files should be prepared?

I see the following code for reading .csv in utils.py but could not follow it.

    def _read_tsv(cls, input_file):
        """Reads a tab separated value file."""
        with open(input_file, "r") as f:
            reader = csv.reader(f, delimiter="\t")
            lines = []
            for line in reader:
                if sys.version_info[0] == 2:
                    line = list(unicode(cell, 'utf-8') for cell in line)
                lines.append(line)
        return lines

Thanks.
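
A note on the format: the reader quoted above splits each line on tabs, so the .csv files are really tab-separated. The sketch below converts a Financial PhraseBank file (`sentence@label` per line) into such a file. Several things here are assumptions rather than confirmed details of finBERT: the output layout (a header row, then index, text, label columns), the latin-1 encoding of the source file, and the helper name `phrasebank_to_tsv`.

```python
import csv


def phrasebank_to_tsv(src_path, dst_path, src_encoding="latin-1"):
    """Convert 'sentence@label' lines into a tab-separated file.

    Assumed output layout: header row, then index, text, label.
    Returns the number of examples written.
    """
    rows = []
    with open(src_path, encoding=src_encoding) as src:
        for line in src:
            line = line.strip()
            if not line or "@" not in line:
                continue  # skip blank or malformed lines
            # Split on the LAST "@" so sentences containing "@" survive.
            text, label = line.rsplit("@", 1)
            rows.append((text.strip(), label.strip()))

    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t")
        writer.writerow(["", "text", "label"])
        for i, (text, label) in enumerate(rows):
            writer.writerow([i, text, label])
    return len(rows)
```

Splitting the converted rows into train, validation and test files is left out here; any shuffled split of the rows would follow the same writing step.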

Dot at the end of short sentences

I was playing around with the finBERT model a bit and I noticed that for short sentences having a period at the end makes a big difference on the model's predictions (see Figures 1-2 below).

Any idea why that is the case? Could it be that the model was fine-tuned on the sentences with a dot at the end and that's why it makes such a difference? Or does it have to do with BERT embeddings, i.e. is there a special embedding for a dot?

Figure 1 (short sentence, no period at the end):

Figure 2 (short sentence, a period at the end):

Bad Gateway for url: https://huggingface.co/bert-base-uncased/resolve/main/config.json

predict() in Predict.py function issue "only one element tensors can be converted to Python scalars"

When I call predict, I get the error "only one element tensors can be converted to Python scalars" on line 618 of finbert.py.
When I modify the line from:
logits = softmax(np.array(logits))
to
logits = softmax(np.array(logits[0]))

I get no error, but the predictions and sentiment scores do not seem right when I tested it on examples.csv. The logit looks like [small number, .99..., small number], so the labels are all negative and the scores are all around -.99...

For reference, I copied predict.py, finbert.py, and util.py into a jupyter notebook and used the following as my model
model = BertForSequenceClassification.from_pretrained("ipuneetrathore/bert-base-cased-finetuned finBERT",num_labels=3,cache_dir=None)
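
A note on the shapes involved: a row-wise softmax expects one row of logits per input sentence, so indexing with `logits[0]` keeps only the first sentence of the batch. The plain-Python sketch below illustrates the shape handling only. `as_batch` and `softmax_rows` are hypothetical names, and finBERT itself uses NumPy for this step.

```python
import math


def as_batch(logits):
    """Wrap a single row of logits in a list so the result is always 2-D."""
    if logits and not isinstance(logits[0], (list, tuple)):
        return [list(logits)]
    return [list(row) for row in logits]


def softmax_rows(logits):
    """Apply a numerically stable softmax to every row of a batch."""
    result = []
    for row in as_batch(logits):
        largest = max(row)  # subtract the row maximum to avoid overflow
        exps = [math.exp(x - largest) for x in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result
```

With this shape handling, a single sentence and a batch of sentences both produce a list of probability rows, one per sentence.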

Error running the configuring parameters cell

Good morning,

I am running the configuring parameters cell and I am getting the below error:

UnpicklingError Traceback (most recent call last)
in
5 pass
6
----> 7 bertmodel = BertForSequenceClassification.from_pretrained(lm_path,cache_dir=None, num_labels=3)
8
9

~/anaconda3/envs/finbert/lib/python3.7/site-packages/pytorch_pretrained_bert/modeling.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
601 if state_dict is None and not from_tf:
602 weights_path = os.path.join(serialization_dir, WEIGHTS_NAME)
--> 603 state_dict = torch.load(weights_path, map_location='cpu')
604 if tempdir:
605 # Clean up temp dir

~/anaconda3/envs/finbert/lib/python3.7/site-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
385 f = f.open('rb')
386 try:
--> 387 return _load(f, map_location, pickle_module, **pickle_load_args)
388 finally:
389 if new_fd:

~/anaconda3/envs/finbert/lib/python3.7/site-packages/torch/serialization.py in _load(f, map_location, pickle_module, **pickle_load_args)
562 f.seek(0)
563
--> 564 magic_number = pickle_module.load(f, **pickle_load_args)
565 if magic_number != MAGIC_NUMBER:
566 raise RuntimeError("Invalid magic number; corrupt file?")

UnpicklingError: invalid load key, 'v'.

Moreover, can you kindly explain how I can construct the files train.csv, validation.csv, test.csv?

Regards,
Bernard

pre-training script on trc2

Hi there, great work and thanks for sharing!

I am currently trying to reproduce the pre-training of BERT on the TRC2 corpus for research purposes. I already have access to the corpus, so this is not a request for data. Instead, could you please share the pre-processing code you used for pre-training BERT to produce finBERT, how you ingested the data into BERT pre-training, etc.?

Best regards

UnicodeDecodeError when trying to load model saved locally

Hello, I trained the model with my own parameters, and saved it.
However, whenever I try to use it, I get the following error:

UnicodeDecodeError Traceback (most recent call last)
in
4 tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
5
----> 6 model = AutoModelForSequenceClassification.from_pretrained("C:/Users/Verena/Documents/finbert_new/models/classifier_model/finbert-sentiment.bin")
7 label_list = label_list=['positive','negative','neutral']

~\anaconda3\envs\finbert\lib\site-packages\transformers-4.0.1-py3.8.egg\transformers\models\auto\modeling_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
1237 if not isinstance(config, PretrainedConfig):
1238 config, kwargs = AutoConfig.from_pretrained(
-> 1239 pretrained_model_name_or_path, return_unused_kwargs=True, **kwargs
1240 )
1241

~\anaconda3\envs\finbert\lib\site-packages\transformers-4.0.1-py3.8.egg\transformers\models\auto\configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
339 {'foo': False}
340 """
--> 341 config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
342
343 if "model_type" in config_dict:

~\anaconda3\envs\finbert\lib\site-packages\transformers-4.0.1-py3.8.egg\transformers\configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, **kwargs)
387 )
388 # Load config dict
--> 389 config_dict = cls._dict_from_json_file(resolved_config_file)
390
391 except EnvironmentError as err:

~\anaconda3\envs\finbert\lib\site-packages\transformers-4.0.1-py3.8.egg\transformers\configuration_utils.py in _dict_from_json_file(cls, json_file)
470 def _dict_from_json_file(cls, json_file: str):
471 with open(json_file, "r", encoding="utf-8") as reader:
--> 472 text = reader.read()
473 return json.loads(text)
474

~\anaconda3\envs\finbert\lib\codecs.py in decode(self, input, final)
320 # decode input (taking the buffer into account)
321 data = self.buffer + input
--> 322 (result, consumed) = self._buffer_decode(data, self.errors, final)
323 # keep undecoded input until the next call
324 self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

The same happens when I try to load the language model, even though both models are downloaded locally. I was only able to use finbert through transformers.

Can you please help me? Thanks!
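
A note on the traceback above: `from_pretrained` receives the path of the .bin file itself and then tries to read it as the JSON configuration, which fails on the first binary byte (0x80). Passing the directory that holds the model is the usual form. The sketch below assumes that directory also contains a config.json and that the weights file carries the name transformers looks for (pytorch_model.bin); `model_dir_from` is a hypothetical helper.

```python
from pathlib import Path


def model_dir_from(path):
    """Return the directory to hand to from_pretrained().

    If `path` points at a file (for example the .bin weights),
    use its parent directory; otherwise return the path unchanged.
    """
    p = Path(path)
    return p.parent if p.is_file() else p


# Intended use (not executed here, requires transformers):
# model = AutoModelForSequenceClassification.from_pretrained(
#     str(model_dir_from(".../classifier_model/pytorch_model.bin")))
```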

How much improvement do you see with "discriminative fine-tuning" and "Gradual Unfreezing"?

Hey there,

How much improvement do you see with "discriminative fine-tuning" and "Gradual Unfreezing"?

Could you quantify the improvement?

i.e.:

1%
5%

thanks!

Could finBERT only give one result (logit / prediction / sentiment_score) for entire article ?

Hi, I'm trying to use predict.py. Is it possible to predict one final result for an entire article (maybe 17 lines), rather than 17 results, one for every sentence?

The images show 17 lines with a green background, each of which gets its own result. Could it just predict the entire article (yellow background)?

Thanks for your help.
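
A note on one simple option: the per-sentence probabilities can be averaged into a single article-level result. This is only a sketch of that idea, not something finBERT provides. The label order (positive, negative, neutral) follows the config.json shown in another issue on this page, the score definition (positive minus negative) is an assumption, and `aggregate_article` is a hypothetical name.

```python
def aggregate_article(sentence_probs, labels=("positive", "negative", "neutral")):
    """Average per-sentence probability rows into one article-level result."""
    if not sentence_probs:
        raise ValueError("need at least one sentence")
    count = len(sentence_probs)
    mean = [sum(row[i] for row in sentence_probs) / count
            for i in range(len(labels))]
    best = max(range(len(labels)), key=mean.__getitem__)
    return {
        "probabilities": mean,
        "prediction": labels[best],
        # Assumed score definition: P(positive) - P(negative).
        "sentiment_score": mean[0] - mean[1],
    }
```

A plain mean treats every sentence equally; weighting by sentence length or taking a majority vote are alternative design choices with the same input.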

pytorch_model.bin file

Hi!

I just downloaded this .bin file and was wondering how to get the following two files (Language model trained on TRC2 & Sentiment analysis model trained on Financial PhraseBank) from it?

Thanks!!

How to force Predict.py to use GPU ? Please help.

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = model.cuda(device)

gives this error!

File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 989, in forward
_, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 730, in forward
embedding_output = self.embeddings(input_ids, token_type_ids)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 267, in forward
words_embeddings = self.word_embeddings(input_ids)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in call
result = self.forward(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/functional.py", line 1484, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'index' in call to _th_index_select

Sentence Representation Layer

Is it possible to obtain a representation of each sentence by removing the last few layers of your model?
If yes, which layer outputs the sentence representation?

Thanks!

How to install and get the financial tweet sentiment score using finbert

Dear Sir,
I hope this email finds you well. I am new to data science, and for my course project I need to use finBERT to get tweet sentiment scores. I am unable to install finBERT in PyCharm. Kindly help me with this. Thank you.

Best Regards
Taha Ahmad

sample set of TRC2

Hi,

Since the TRC2 dataset cannot be publicly made available, I was wondering if some similar format sample training dataset can be provided.

Thanks!

Training division by 0 error

Thanks for sharing your code on this matter. I have used your trained model. However, I want to try to train the model on my own with the datasets you mentioned.
But when running finbert_training.ipynb I hit an error at trained_model = finbert.train(train_examples = train_data, model = model)

I have tried to debug the code, and for some reason step is always 0. I was wondering if you could give me some hints on how to fix this issue :)

Vocabulary

Hello,

Could you include the vocab.txt for finBERT? I don't see it in the model's directory and it seems like you are using bert-base-uncased vocabulary for constructing the tokenizer ( https://github.com/ProsusAI/finBERT/blob/master/finbert/finbert.py)

self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=self.config.do_lower_case)

Thank you,
Andrei

Using Finbert for 240 multilabel multiclass classification

My labels have dimension 240; it is a multi-label classification problem.

I downloaded the Finbert model:

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained("/home/pratik/finbert", use_fast=True)

train_tokenizer_texts = list(map(lambda t: tokenizer.tokenize(t,add_special_tokens=True,max_length=512,padding='max_length'), tqdm(train_sentences)))

np.array(train_tokenizer_texts[0])

%%time
# Initiating a BERT model
model = AutoModelForSequenceClassification.from_pretrained("/home/pratik/finbert", num_labels = 240)
model.cuda()

This gives me the following error:

RuntimeError: Error(s) in loading state_dict for BertForSequenceClassification:
	size mismatch for classifier.weight: copying a param with shape torch.Size([3, 768]) from checkpoint, the shape in current model is torch.Size([240, 768]).
	size mismatch for classifier.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([240]).

AxisError when call predict via REST API on Flask

I succeeded in training the model by executing finbert_training.ipynb. Then I ran a Docker container built from the Dockerfile and sent a POST request to the container (localhost:8080), which showed me the following error.

08/27/2021 17:03:42 - INFO - pytorch_pretrained_bert.modeling -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
08/27/2021 17:03:42 - INFO - pytorch_pretrained_bert.modeling -   loading archive file /src/models/classifier_model/finbert-sentiment
08/27/2021 17:03:42 - INFO - pytorch_pretrained_bert.modeling -   Model config {
  "_name_or_path": "c:\\Users\\user\\projects\\finbert\\finBERT\\models\\language_model\\finbertTRC2",
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "type_vocab_size": 2,
  "vocab_size": 30522
}

08/27/2021 17:03:45 - INFO - pytorch_pretrained_bert.modeling -   Weights from pretrained model not used in BertForSequenceClassification: ['bert.embeddings.position_ids']
 * Serving Flask app 'main' (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
08/27/2021 17:03:45 - WARNING - werkzeug -    * Running on all addresses.
   WARNING: This is a development server. Do not use it in a production deployment.
08/27/2021 17:03:45 - INFO - werkzeug -    * Running on http://172.17.0.2:8080/ (Press CTRL+C to quit)
08/27/2021 17:10:41 - INFO - filelock -   Lock 140200106040192 acquired on /root/.cache/huggingface/transformers/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock
Downloading: 100% 28.0/28.0 [00:00<00:00, 10.4kB/s]
08/27/2021 17:10:42 - INFO - filelock -   Lock 140200106040192 released on /root/.cache/huggingface/transformers/c1d7f0a763fb63861cc08553866f1fc3e5a6f4f07621be277452d26d71303b7e.20430bd8e10ef77a7d2977accefe796051e01bc2fc4aa146bc862997a1a15e79.lock
08/27/2021 17:10:43 - INFO - filelock -   Lock 140200106041056 acquired on /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e.lock
Downloading: 100% 570/570 [00:00<00:00, 406kB/s]
08/27/2021 17:10:43 - INFO - filelock -   Lock 140200106041056 released on /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e.lock
08/27/2021 17:10:44 - INFO - filelock -   Lock 140200106040720 acquired on /root/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
Downloading: 100% 232k/232k [00:00<00:00, 490kB/s]
08/27/2021 17:10:45 - INFO - filelock -   Lock 140200106040720 released on /root/.cache/huggingface/transformers/45c3f7a79a80e1cf0a489e5c62b43f173c15db47864303a55d623bb3c96f72a5.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99.lock
08/27/2021 17:10:46 - INFO - filelock -   Lock 140200106040768 acquired on /root/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
Downloading: 100% 466k/466k [00:00<00:00, 595kB/s]
08/27/2021 17:10:47 - INFO - filelock -   Lock 140200106040768 released on /root/.cache/huggingface/transformers/534479488c54aeaf9c3406f647aa2ec13648c06771ffe269edabebd4c412da1d.7f2721073f19841be16f41b0a70b600ca6b880c8f3df6f3535cbc704371bdfa4.lock
['The Federal Reserve is committed to using its full range of tools to support the U.S. economy in this challenging time, thereby promoting its maximum employment and price stability goals.']
08/27/2021 17:10:49 - INFO - finbert.utils -   *** Example ***
08/27/2021 17:10:49 - INFO - finbert.utils -   guid: 0
08/27/2021 17:10:49 - INFO - finbert.utils -   tokens: [CLS] the federal reserve is committed to using its full range of tools to support the u . s . economy in this challenging time , thereby promoting its maximum employment and price stability goals . [SEP]
08/27/2021 17:10:49 - INFO - finbert.utils -   input_ids: 101 1996 2976 3914 2003 5462 2000 2478 2049 2440 2846 1997 5906 2000 2490 1996 1057 1012 1055 1012 4610 1999 2023 10368 2051 1010 8558 7694 2049 4555 6107 1998 3976 9211 3289 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:10:49 - INFO - finbert.utils -   attention_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:10:49 - INFO - finbert.utils -   token_type_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:10:49 - INFO - finbert.utils -   label: None (id = 9090)
[<finbert.utils.InputFeatures object at 0x7f82e1849d90>]
08/27/2021 17:10:49 - INFO - root -   tensor([ 2.1882, -2.1247, -0.7895])
[ 2.1882384  -2.1246738  -0.78948754]
08/27/2021 17:10:49 - ERROR - main -   Exception on / [POST]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1515, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/conda/lib/python3.8/site-packages/flask_cors/extension.py", line 165, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1513, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1499, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/src/main.py", line 21, in score
    return(predict(text, model).to_json(orient='records'))
  File "/src/finbert/finbert.py", line 615, in predict
    logits = softmax(np.array(logits))
  File "/src/finbert/utils.py", line 215, in softmax
    e_x = np.exp(x - np.max(x, axis=1)[:, None])
  File "<__array_function__ internals>", line 5, in amax
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2705, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
numpy.AxisError: axis 1 is out of bounds for array of dimension 1
08/27/2021 17:10:49 - INFO - werkzeug -   172.17.0.1 - - [27/Aug/2021 17:10:49] "POST / HTTP/1.1" 500 -
['The Federal Reserve is committed to using its full range of tools to support the US economy in this challenging time, thereby promoting its maximum employment and price stability goals.']
08/27/2021 17:22:27 - INFO - finbert.utils -   *** Example ***
08/27/2021 17:22:27 - INFO - finbert.utils -   guid: 0
08/27/2021 17:22:27 - INFO - finbert.utils -   tokens: [CLS] the federal reserve is committed to using its full range of tools to support the us economy in this challenging time , thereby promoting its maximum employment and price stability goals . [SEP]
08/27/2021 17:22:27 - INFO - finbert.utils -   input_ids: 101 1996 2976 3914 2003 5462 2000 2478 2049 2440 2846 1997 5906 2000 2490 1996 2149 4610 1999 2023 10368 2051 1010 8558 7694 2049 4555 6107 1998 3976 9211 3289 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:22:27 - INFO - finbert.utils -   attention_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:22:27 - INFO - finbert.utils -   token_type_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
08/27/2021 17:22:27 - INFO - finbert.utils -   label: None (id = 9090)
[<finbert.utils.InputFeatures object at 0x7f82e1004d90>]
08/27/2021 17:22:27 - INFO - root -   tensor([ 2.0840, -2.0827, -0.8532])
[ 2.0839536  -2.0827212  -0.85315543]
08/27/2021 17:22:27 - ERROR - main -   Exception on / [POST]
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1515, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/conda/lib/python3.8/site-packages/flask_cors/extension.py", line 165, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1513, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/conda/lib/python3.8/site-packages/flask/app.py", line 1499, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/src/main.py", line 21, in score
    return(predict(text, model).to_json(orient='records'))
  File "/src/finbert/finbert.py", line 615, in predict
    logits = softmax(np.array(logits))
  File "/src/finbert/utils.py", line 215, in softmax
    e_x = np.exp(x - np.max(x, axis=1)[:, None])
  File "<__array_function__ internals>", line 5, in amax
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2705, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
numpy.AxisError: axis 1 is out of bounds for array of dimension 1
08/27/2021 17:22:27 - INFO - werkzeug -   172.17.0.1 - - [27/Aug/2021 17:22:27] "POST / HTTP/1.1" 500 -

I have solved this issue by changing code in finbert.py
Before:

with torch.no_grad():
            logits = model(all_input_ids, all_attention_mask, all_token_type_ids)[0]

After:

with torch.no_grad():
            logits = model(all_input_ids, all_attention_mask, all_token_type_ids)

I would appreciate it if you could check this out.

how to create train.csv, validation.csv, test.csv

Hi, how do I set up and create train.csv, validation.csv and test.csv from the Financial PhraseBank data?

Is finBert cased or uncased?

Hi! Thanks for developing and sharing the codes.

I wonder which vanilla BERT model you used as the starting point for further pre-training on financial domain text.

To be specific, I wonder whether this FinBERT model can tell the difference between uppercase and lowercase.

pip install transformers needs to be added to the Dockerfile

Hello. I would like to report that running a container created from the image built from the Dockerfile in the current master branch causes a ModuleNotFoundError.

I solved this issue by changing the Dockerfile.

Before:
RUN pip install pytorch_pretrained_bert numpy pandas nltk Flask flask-cors

After:
RUN pip install pytorch_pretrained_bert numpy pandas nltk Flask flask-cors transformers

I would appreciate it if you could check this out.

model not valid

The first model to download "Language model trained on TRC2" is not valid. I get wrong predictions with it.
The second one seems to be fine.

What is the news_id?

Could anyone explain what is the news_id in predict()?

pytorch_pretrained_bert not found

Thank you for your great work!

When I run the example predict.py I get the errors below. Should pytorch_pretrained_bert be added to your environment.yml?

Traceback (most recent call last):
File "predict.py", line 1, in
from finbert.finbert import predict
File "C:\Projects\Python\GitHub\finBERT\finbert\finbert.py", line 6, in
from pytorch_pretrained_bert.tokenization import BertTokenizer
ModuleNotFoundError: No module named 'pytorch_pretrained_bert'

The vocab.txt for finbertTRC2 model

Hi Sir,
Thank you so much for sharing the code.
I noticed that the vocab.txt file is missing from the finbertTRC2 folder. Could you tell me where I can find this file?

Thanks!

Unable to train the model

Hi,

I downloaded the Financial Phrase Bank dataset from Malo et al. (2014) and created train.csv from it.
train_data = finbert.get_data('train')
When I run the line above from the "finbert_training" notebook, it fails with the error below.

Is there any way to resolve this issue?

Thanks.

UnicodeDecodeError Traceback (most recent call last)
in
7 #print(cl_data_path)
8 # Get the training examples
----> 9 train_data = finbert.get_data('train')

~\Documents\FIN_BERT\finBERT-master\finbert\finbert.py in get_data(self, phase)
192 self.num_train_optimization_steps = None
193 examples = None
--> 194 examples = self.processor.get_examples(self.config.data_dir, phase)
195 self.num_train_optimization_steps = int(
196 len(

~\Documents\FIN_BERT\finBERT-master\finbert\utils.py in get_examples(self, data_dir, phase)
89 Name of the .csv file to be loaded.
90 """
---> 91 return self._create_examples(self._read_tsv(os.path.join(data_dir, (phase + ".csv"))), phase)
92
93 def get_labels(self):

~\Documents\FIN_BERT\finBERT-master\finbert\utils.py in _read_tsv(cls, input_file)
66 reader = csv.reader(f, delimiter="\t")
67 lines = []
---> 68 for line in reader:
69 if sys.version_info[0] == 2:
70 line = list(unicode(cell, 'utf-8') for cell in line)

D:\Python\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7919: character maps to <undefined>
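
A small reproduction of this kind of failure, as a sketch under one assumption: that train.csv was saved as UTF-8 and then opened with the Windows default codec cp1252, as the cp1252.py frame in the traceback suggests. The sample character and the helper `try_decode` are illustrative only.

```python
# "\u00c1" (capital A with acute) is stored in UTF-8 as the two bytes 0xC3 0x81.
utf8_bytes = "\u00c1".encode("utf-8")


def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed: {exc}"


# Byte 0x81 has no mapping in Python's cp1252 codec, so decoding fails.
cp1252_result = try_decode(utf8_bytes, "cp1252")
# Decoding with the encoding the bytes were written in round-trips cleanly.
utf8_result = try_decode(utf8_bytes, "utf-8")

print(cp1252_result)
print(utf8_result)
```

If the file really is UTF-8, opening it with an explicit encoding="utf-8" avoids relying on the platform default.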

DataFrame id overlap

On line 628 of finbert.py you use result = pd.concat([result,batch_result]) when it should be result = pd.concat([result,batch_result], ignore_index=True).

When you concatenate multiple batches, the result DataFrame ends up with duplicate index values. For example, with 2 batches of 3 items, the index of result will be 0,1,2,0,1,2.

If you then convert the DataFrame to a dictionary, the results overwrite each other because multiple keys have the same value.
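
The effect described above can be seen in a minimal sketch with two made-up batches. The column name and values are placeholders, not the actual output of finbert.py.

```python
import pandas as pd

# Two hypothetical batches of three predictions each.
batch_1 = pd.DataFrame({"sentence": ["a", "b", "c"]})
batch_2 = pd.DataFrame({"sentence": ["d", "e", "f"]})

# Default concat keeps each batch's own index, so the labels repeat.
overlapping = pd.concat([batch_1, batch_2])
# ignore_index=True renumbers the rows from 0 to n-1.
renumbered = pd.concat([batch_1, batch_2], ignore_index=True)

print(list(overlapping.index))
print(list(renumbered.index))

# A dict keyed by index keeps only the last row for each repeated label.
print(len(overlapping["sentence"].to_dict()))
print(len(renumbered["sentence"].to_dict()))
```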

Error when calling finbert.train()

I'm trying to run FinBERT for stock market prediction based on SEC filings. I am using the finbert_training notebook as a reference.

When running:

trained_model = finbert.train(train_examples = train_data, model = model) I get the following error:

I ran the dataset.py script, and my dataset looks like this:

Could you please guide me as for what I could do?

Thank you!

Incorrect prediction

Hi,

I used the pre-trained models to classify financial news about a company. Below are three sentences that should have been predicted as positive. Is there a more refined model available, or do we need to fine-tune it further ourselves?

Thanks,

Unable to download models

I am unable to download files through git-lfs because of this error:
"This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access."

Could you please provide the models through an alternate resource or upgrade the LFS data packs? If that is not possible could you provide the train/test/validation sets used to train the classifier? Thank you!

Does it support paragraph prediction?

I am running predictions on paragraphs with this model.
The output splits the sentences at seemingly random points.
It certainly has something to do with the fact that my text data is not clean.
I wonder whether others need this too.
Thank you.

torch error / environment file on windows

Hi,

When installing the environment file the following error occurs:

Pip subprocess output:
Collecting joblib==0.13.2
  Using cached joblib-0.13.2-py2.py3-none-any.whl (278 kB)
Collecting pytorch-pretrained-bert==0.6.2
  Using cached pytorch_pretrained_bert-0.6.2-py3-none-any.whl (123 kB)
Collecting scikit-learn==0.21.2
  Using cached scikit_learn-0.21.2-cp37-cp37m-win_amd64.whl (5.9 MB)
Collecting spacy==2.1.4
  Using cached spacy-2.1.4-cp37-cp37m-win_amd64.whl (29.0 MB)

Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement torch==1.1.0 (from -r C:\Users\Matth\condaenv.xiuu97lv.requirements.txt (line 5)) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch==1.1.0 (from -r C:\Users\Matth\condaenv.xiuu97lv.requirements.txt (line 5))


CondaEnvException: Pip failed

As a consequence, packages cannot be loaded successfully in the created environment.

The error seems common for Windows users ( #21 ). However, updating Python did not do the trick for me.

I use Windows 10, Python 3.7.7, torch 1.5.1, and conda 4.8.3.

Hopefully you can provide some assistance in getting FinBERT up and running on Windows.

Are my predictions correct?

These are the test predictions. Does anyone know if they are right?

Error in _read_tsv when trying to read in the data

Hey there!
I found that training the model failed with an error because the call to open in _read_tsv, on line 63 of utils.py, did not specify an encoding.

To fix it I changed it from:
def _read_tsv(cls, input_file):
    with open(input_file, "r") as f:
To:
def _read_tsv(cls, input_file):
    with open(input_file, "r", encoding='utf-8') as f:
Now it works great! Thanks!
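
For reference, a standalone sketch of the same idea outside the repository. `read_tsv` is a hypothetical helper, not the actual `_read_tsv` method, and the sample row is made up. It writes a small tab-separated file containing a non-ASCII character and reads it back with an explicit encoding.

```python
import csv
import os
import tempfile


def read_tsv(input_file, encoding="utf-8"):
    """Read a tab-separated file into a list of rows using an explicit encoding."""
    with open(input_file, "r", encoding=encoding, newline="") as f:
        return list(csv.reader(f, delimiter="\t"))


# Write a small sample containing a euro sign, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "train.csv")
    with open(sample_path, "w", encoding="utf-8", newline="") as f:
        f.write("\ttext\tlabel\n0\tOperating profit rose to \u20ac5 million\tpositive\n")
    rows = read_tsv(sample_path)

print(rows)
```

Passing the encoding explicitly makes the result independent of the platform default, which is cp1252 on many Windows installations.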

Train model

Dear all,

When I train the model, the following error occurs. Is there any way to solve this issue?

Thanks.

No code for the FiQA sentiment classification task?

I could not find code for the FiQA aspect-based sentiment task, and I wonder how the roberta model handles aspect-based sentiment, which differs from vanilla sentiment classification. Thanks a lot.

How to use finbert from the Hugging Face model hub?

I tried this code:

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")

but got an error:

OSError: Can't load 'ProsusAI/finbert'. Make sure that:

'ProsusAI/finbert' is a correct model identifier listed on 'https://huggingface.co/models'
or 'ProsusAI/finbert' is the correct path to a directory containing a 'config.json' file

error using predict.py

Hi, I tried to run predict.py as described in the README:
!python predict.py --text_path="test.txt" --output_dir="output/" --model_path="pytorch_model.bin"

But I got the following error:
usage: predict.py [-h] [--data_path DATA_PATH]
predict.py: error: unrecognized arguments: --text_path=test.txt --output_dir=output/ --model_path=pytorch_model.bin

What is the problem? It looks like a different file is being run, but I can't figure out which one or why.
Could you please help me?
Regards
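
The usage line in the error suggests that the script actually being executed defines only --data_path. A minimal sketch of how argparse arrives at that message follows. The parser here is a hypothetical stand-in built from the usage line, not the real predict.py.

```python
import argparse

# Hypothetical parser that, like the usage line in the error, defines only --data_path.
parser = argparse.ArgumentParser(prog="predict.py")
parser.add_argument("--data_path")

argv = [
    "--text_path=test.txt",
    "--output_dir=output/",
    "--model_path=pytorch_model.bin",
]

# parse_known_args returns the unrecognized arguments instead of exiting,
# which makes it easy to see what the parser rejected.
known, unknown = parser.parse_known_args(argv)

print(known.data_path)
print(unknown)
```

Printing `__file__` at the top of the script that runs, or running `python predict.py -h`, may help confirm which predict.py is being picked up.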

Update the environment file

Hi,

I am trying to set up the environment, but it always fails with a pip and torch version error. Could you please update the yml file?
I wanted to check your solution.

script for further pre-training on Reuters TRC2

Hi,

Thank you for making the code available.

As per the readme file, I understand that there are two models:

language_model that has been further pre-trained on Reuters TRC2
classifier_model that has been fine-tuned on Financial Phrasebank.

finbert_training.ipynb is used to load the language_model and fine-tune it on Financial Phrasebank.

I was wondering if you could make the script used for further pre-training the language_model available too.

Thanks!


Figure 1 (short sentence, no period at the end):

Figure 2 (short sentence, a period at the end):
