zjunlp / trice

[NAACL 2024] Making Language Models Better Tool Learners with Execution Feedback

Home Page: https://zjunlp.github.io/project/TRICE/

License: MIT License

Python 100.00%

execution feedback large-language-models reasoning reinforcement-learning tools tool-learning trice agent natur

trice's Issues

Experimental Setup in HotpotQA

Hi,

I came across your blog post (https://www.zjukg.org/project/TRICE/) and noticed that you conducted experiments using the HotpotQA dataset. I have a question regarding the experimental setup.

In the results table, it mentions "Unseen Tool". Does this mean that a different search engine was used compared to the WikiSearch used for datasets like WebQuestion and NaturalQuestion? If so, could you please specify which search engine was used?

Thank you!

Ryoma Obara

I think the paper is really good. May I know which conference you are currently submitting it to?

bugs in math evaluation

There are some bugs in the math evaluation.

This is the code that postprocesses answers when no calculator is used:

            sentences = response.split(".")
            sentences = [s for s in sentences if s != ""]
            pred_sentence = sentences[-1] if len(sentences) > 0 else ""
            pattern = re.compile(r"-?[1-9]\d*")
            pred = pattern.findall(pred_sentence)
            pred = int(pred[-1].replace(",", "")) if len(pred) > 0 else None

The regex parses large numbers containing a comma incorrectly, e.g. 100,000 is read as 100. So even if the model produces the correct answer, it does not match the gold answer. Out of the 1785 dev instances, 14 have a comma in them.
The "pred" number is cast to int, so the model without a calculator can never be correct when the gold answer is a float (9/1785).

If the numbers in Table 1 for the Alpaca-LoRA-7B baseline without calculator were computed with this script, they are not accurate and should be higher.

EDIT:
Just saw that most of the gold answers that are floats have a 0 decimal part. Only 9/1785 are affected by the int cast, so the numbers in Table 1 would change only a little.
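
One possible way to address both problems reported in this issue is sketched below. It is not the repository's code: the names `NUMBER_PATTERN`, `extract_prediction`, and `is_correct` are hypothetical, and the sketch assumes the answer is the last number in the last sentence of the response, as in the original snippet. The pattern accepts thousands separators and decimal parts, and sentences are split only on a period followed by whitespace or the end of the string, so that a value like 12.5 is not cut in half.

```python
import re
from typing import Optional

# Hypothetical pattern: a number with thousands separators (e.g. 100,000 or 1,234.5)
# or a plain integer/decimal (e.g. 42 or 12.5), with an optional leading minus sign.
NUMBER_PATTERN = re.compile(r"-?\d{1,3}(?:,\d{3})+(?:\.\d+)?|-?\d+(?:\.\d+)?")


def extract_prediction(response: str) -> Optional[float]:
    """Return the last number in the last sentence of `response`, or None."""
    # Split only on periods that end a sentence, not on decimal points.
    sentences = [s for s in re.split(r"\.(?:\s+|$)", response.strip()) if s != ""]
    last_sentence = sentences[-1] if sentences else ""
    matches = NUMBER_PATTERN.findall(last_sentence)
    if not matches:
        return None
    # Drop thousands separators and keep the value as a float instead of an int.
    return float(matches[-1].replace(",", ""))


def is_correct(response: str, gold: float, tol: float = 1e-6) -> bool:
    """Compare the extracted prediction with the gold answer up to a tolerance."""
    pred = extract_prediction(response)
    return pred is not None and abs(pred - gold) <= tol
```

With this sketch, a response ending in "The answer is 100,000." is compared against the gold answer as 100000 rather than 100, and a float gold answer can be matched because the prediction is no longer cast to int.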

What version of vicuna-7b

Which version of vicuna-7b exactly do you provide the LoRA weights for? The adapter config only says "/data/qiaoshuofei/PLMs/vicuna-7b", so it is not clear whether this is v1.1, v1.3, etc.
