
Comments (13)

Me1oyy commented on August 29, 2024

Thank you! We will upload the mentioned checkpoints soon.

aloy99 commented on August 29, 2024

The links in the readme were changed, but the new repos are missing the model checkpoints as you said.
Old repos with checkpoints are still up at:
ChanceFocus/finma-7b-full
ChanceFocus/finma-7b-nlp

aloy99 commented on August 29, 2024

No worries!
I'm not a contributor; I think someone on the team will update the new repositories soon. It would be good to leave this open so the team is aware and so users who encounter the same issue can figure out the problem.

aloy99 commented on August 29, 2024

I encountered the same issue last night with --model hf-causal-llama.

I got it to run by loading the model as a vLLM model with the arguments below, but the benchmark results were much worse than those presented in the PIXIU paper. This might be expected, as vLLM output is supposedly inconsistent with what the same model would produce when loaded as AutoLlamaCausalLM.

python3 src/eval.py \
    --model hf-causal-vllm \
    --tasks flare_fpb,flare_fiqasa,flare_headlines,flare_ner,flare_finqa,flare_convfinqa,flare_sm_bigdata,flare_sm_acl,flare_sm_cikm \
    --model_args use_accelerate=True,pretrained=ChanceFocus/finma-7b-full,tokenizer=ChanceFocus/finma-7b-full,max_gen_toks=2048,use_fast=False \
    --no_cache \
    --batch_size 8 \
    --model_prompt 'finma_prompt' \
    --num_fewshot 0 \
    --write_out
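
For what it's worth, the inconsistency claim can be sanity-checked outside the harness with a minimal side-by-side like the sketch below. This uses the public vLLM and transformers APIs rather than the harness's own loader, and the prompt is a placeholder, so treat it as a rough check, not the harness's behaviour.

# Sketch: compare greedy output from vLLM vs plain transformers for the same
# checkpoint, to eyeball whether the two loaders really disagree.
# Run one half at a time if GPU memory is tight.
import torch
from vllm import LLM, SamplingParams
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "ChanceFocus/finma-7b-full"
prompt = "Human: What is the sentiment of this headline?\nAssistant:"  # placeholder, not the finma_prompt template

# vLLM path
llm = LLM(model=repo, tokenizer=repo)
vllm_text = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=64))[0].outputs[0].text

# plain transformers path
tok = AutoTokenizer.from_pretrained(repo, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.float16, device_map="auto")
inputs = tok(prompt, return_tensors="pt").to(model.device)
gen = model.generate(**inputs, max_new_tokens=64, do_sample=False)
hf_text = tok.decode(gen[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print("vllm:", vllm_text)
print("hf:  ", hf_text)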

jiminHuang commented on August 29, 2024

Hi @aloy99 and @stas00,

I really appreciate your interest and efforts in reproducing our results. We are grateful for your engagement with our work.

I must apologize for the delayed response. Our team is currently immersed in preparing a rebuttal for the ARR of the FinBen paper, which has demanded our full attention.

We recognize the importance of transparency and reproducibility in research. To address this, we are committed to providing detailed instructions on how to reproduce all our results, including all necessary parameters and inference choices. This will ensure that anyone interested can accurately benchmark and understand the nuances of our models.

Please accept our apologies for not being able to respond sooner. We are dedicated to addressing all your issues and questions as promptly as possible. Thank you again for your patience and for bringing these matters to our attention.

stas00 commented on August 29, 2024

thank you for that hint, aloy99 - found the old ones now

let me know if you want me to close this issue or perhaps it's better to do so once the README.md has been matched to the hub contents?

stas00 commented on August 29, 2024

Are you able to use these older models though? If I try the eval as explained here:
https://github.com/The-FinAI/PIXIU?tab=readme-ov-file#automated-task-assessment
it works with the llama2-7b model, but if I try either of ChanceFocus/finma-7b-full or ChanceFocus/finma-7b-nlp I get a bunch of:

../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [82,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [82,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [82,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

which usually indicates that the tokenizer has higher indices than the embedding (not the same vocab) - i.e. the model or the tokenizer files are possibly borked?
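
A quick way to check that hypothesis without running the eval (a sketch, not part of the PIXIU tooling):

# Compare the tokenizer's id range with the model's declared embedding size.
# If len(tokenizer) > config.vocab_size, token ids can index past the embedding
# table, which is exactly what the indexSelectLargeIndex assert complains about.
from transformers import AutoConfig, AutoTokenizer

for repo in ("ChanceFocus/finma-7b-full", "ChanceFocus/finma-7b-nlp"):
    config = AutoConfig.from_pretrained(repo)
    tokenizer = AutoTokenizer.from_pretrained(repo, use_fast=False)
    status = "MISMATCH" if len(tokenizer) > config.vocab_size else "ok"
    print(repo, "embedding rows:", config.vocab_size, "tokenizer size:", len(tokenizer), status)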

stas00 commented on August 29, 2024

@Me1oyy, how can we reproduce the numbers in the paper - as you can see from the comments above it doesn't work.

stas00 commented on August 29, 2024

Thank you for sharing the working code, @aloy99! I was able to reproduce it.

So you're saying it's lower than the paper? I ran just the flare_finqa task and got acc=0.0785, whereas in the paper it's actually lower, at 0.04 - unless FinMA in the paper refers to ChanceFocus/finma-7b-nlp perhaps?

I get 0.061 w/ finma-7b-nlp

Also, it's bound not to produce very good results because many of the prompts are cut off:

 Input prompt (2445 tokens) is too long and exceeds limit of 2048           | 0/16 [00:00<?, ?it/s]

Also, any idea why I get acc=0 if I use normal llama? Surely it should get at least a bit of it right, no?

eval.py --model hf-causal-llama \
    --model_args use_accelerate=True,pretrained=meta-llama/Llama-2-7b-hf,tokenizer=meta-llama/Llama-2-7b-hf,use_fast=False \
    --tasks flare_finqa --batch_size 16 --model_prompt finma_prompt \
    --num_fewshot 0 --write_out

It also takes much longer to run - perhaps because it's a 4k context window vs finma's 2k?

aloy99 commented on August 29, 2024

@stas00 You're right - I think the poor performance is largely because prompts are truncated to the maximum context length, and there is no response as a result. This shows up as 'missing' in the results report of the evaluation script.

I'm comparing my results to this paper.
It mentions on page 7 that:

The maximum length of input texts is 2048.

which does not seem to be the case with the versions of the evaluation script we can access at the moment.
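
If anyone wants to gauge how many examples are affected, something along these lines should do it (a sketch, not part of the eval harness; the dataset repo id and the "query" column name are assumptions - check the dataset card for the actual fields):

# Count how many flare_finqa prompts already exceed 2048 tokens before any
# few-shot context or instruction template is added. The dataset id and the
# "query" field are assumptions, not confirmed against the PIXIU code.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ChanceFocus/finma-7b-full", use_fast=False)
ds = load_dataset("ChanceFocus/flare-finqa", split="test")  # assumed repo id

too_long = sum(len(tokenizer(ex["query"]).input_ids) > 2048 for ex in ds)
print(f"{too_long}/{len(ds)} prompts exceed 2048 tokens")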

The paper also contains the following results:
[screenshot of the results table from the paper]

Here are the results I got:

Task              Version  Metric     Value    Stderr
flare_convfinqa   1        acc        0.0007   ± 0.0007
flare_finqa       1        acc        0.0000   ± 0.0000
flare_fiqasa      1        acc        0.0000   ± 0.0000
                           missing    1.0000   ± 0.0000
                           f1         0.0000
                           macro_f1   0.0000
                           mcc        0.0000
flare_fpb         1        acc        0.0474   ± 0.0068
                           missing    0.8598   ± 0.0112
                           f1         0.0765
                           macro_f1   0.1025
                           mcc        0.0281
flare_headlines   1        avg_f1     0.6201
flare_ner         1        entity_f1  0.0000
flare_sm_acl      1        acc        0.5024   ± 0.0082
                           missing    0.0000   ± 0.0000
                           f1         0.4993
                           macro_f1   0.5002
                           mcc        0.0091
flare_sm_bigdata  1        acc        0.5041   ± 0.0130
                           missing    0.0000   ± 0.0000
                           f1         0.5053
                           macro_f1   0.5034
                           mcc        0.0115
flare_sm_cikm     1        acc        0.5074   ± 0.0148
                           missing    0.0000   ± 0.0000
                           f1         0.5097
                           macro_f1   0.5068
                           mcc        0.0297

In a previous run I was able to reproduce the F1 score and accuracy when running the FPB task on its own; I'll find the arguments for that run again and update you when I'm able to get compute again.

stas00 commented on August 29, 2024

Shouldn't you compare it against their latest paper, which in theory should match the source code in this repo? https://arxiv.org/abs/2402.12659 Or perhaps the original paper you linked to should match finma's performance, since the model was trained back then.

But please clarify - was the table you shared eval'ed with llama2-7b, finma-7b-nlp, or finma-7b-full?

So you too are getting acc=0 on flare_finqa and flare_fiqasa, as I did with llama2-7b, which is suspicious.

aloy99 commented on August 29, 2024

@stas00
The latest paper expands the set of tasks that models are benchmarked on; I wanted a subset of tasks that covered most of the categories and that I could run within a reasonable amount of time (hence I referred to the earlier paper's selection of tasks). But yes, I think it would make sense to compare to the results from the newer paper instead.

The table I shared above was evaluated with finma-7b-full, with the snippet I shared previously:

    --model hf-causal-vllm \
    --tasks flare_fpb,flare_fiqasa,flare_headlines,flare_ner,flare_finqa,flare_convfinqa,flare_sm_bigdata,flare_sm_acl,flare_sm_cikm \
    --model_args use_accelerate=True,pretrained=ChanceFocus/finma-7b-full,tokenizer=ChanceFocus/finma-7b-full,max_gen_toks=2048,use_fast=False \
    --no_cache \
    --batch_size 8 \
    --model_prompt 'finma_prompt' \
    --num_fewshot 0 \
    --write_out

I spent last night re-running the benchmarks with a shell script that runs one task at a time, and the results I got are very close to those presented in both papers. I'm puzzled by why I no longer have the issue with missing responses from the model, given that the only change I've made is to run the tasks separately (I did also change hardware from an RTX 4090 to an A6000, but I don't think it would be a VRAM issue).

declare -a tasks=("flare_fpb" "flare_fiqasa" "flare_headlines" "flare_ner" "flare_finqa" "flare_convfinqa" "flare_sm_bigdata" "flare_sm_acl" "flare_sm_cikm")

now="$(date +'%Y-%m-%d-%T')"
start=$(date +%s)

for TASK in "${tasks[@]}"
do
    python3 src/eval.py \
        --model hf-causal-vllm \
        --tasks $TASK \
        --model_args use_accelerate=True,pretrained=chancefocus/finma-7b-full,tokenizer=chancefocus/finma-7b-full,use_fast=False \
        --no_cache \
        --batch_size 8 \
        --model_prompt 'finma_prompt' \
        --num_fewshot 0  >> output_"$now".txt
done
end=$(date +%s)
seconds=$(echo "$end - $start" | bc)
echo $seconds' sec'
awk -v t=$seconds 'BEGIN{t=int(t*1000); printf "%d:%02d:%02d\n", t/3600000, t/60000%60, t/1000%60}' >> output_"$now".txt

Keep in mind that if you're using any non-finma model, you should run flare_headlines and flare_ner with --num_fewshot 5 as mentioned in the PIXIU paper (the older one):

[screenshot of the relevant excerpt from the PIXIU paper]

Unfortunately, the newer FinBen paper mentions that some tasks are zero-shot and some tasks are few-shot, but does not mention which tasks fall into which category.

I benchmarked llama-2-7b-chat (to match the FinBen paper), and got an accuracy of 0.766 on flare_fiqasa and 0 on flare_finqa. I think it's not surprising to get a 0 for flare_finqa, given that the metric used is exact match accuracy. flare_fiqasa, on the other hand, is a normal sentiment analysis task, so your result of 0 accuracy on it is surprising.
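
To illustrate why exact match is so punishing for a chatty model (a toy sketch, not the metric implementation used by the harness):

# Toy exact-match check: any extra wording around a correct value scores 0
# unless the prediction is normalized to the gold format first.
def exact_match(prediction: str, gold: str) -> int:
    return int(prediction.strip().lower() == gold.strip().lower())

print(exact_match("21.6%", "21.6%"))                 # 1
print(exact_match("The answer is 21.6%.", "21.6%"))  # 0 - right value, wrong format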

I've also dropped the authors an email requesting the arguments/scripts that were used for the benchmark results presented in the papers.

aloy99 commented on August 29, 2024

@jiminHuang
Your response is much appreciated; I hope to hear from the team regarding the release of instructions for reproducing your results when possible. I have also reached out via email regarding a paper I am working on and the information I may need.

All the best for your rebuttal!
