
Comments (7)

Xqq2620xx commented on August 11, 2024

> Maybe you are right, there is a possibility that something in the IU-Xray dataset itself is causing this to happen. In response to your answer, I can agree with most of your points. By the way, I also appreciate your patient reply and wonderful work contribution. Thank you.

Hello! I've encountered the same problem as you! I also achieved very good results in the 1st epoch, but the generated sentences are all repetitive. I would like to share my thoughts and discuss them with you:

I have tried R2Gen, R2GenCMN, and XProNet, and their results on IU-Xray were very unstable. (You mentioned that R2GenCMN reached its highest value at the 25th epoch, but I have also seen cases where the highest value occurred in the first five epochs.) I have also modified my own model and run into situations where the 1st epoch gave very high results.

Currently, everyone (i.e., the previous papers) reports the best validation result, and the evaluation metrics do not include any measure of diversity. I think there may be no good solution to this problem at the moment: taking the average of the results over all epochs, or just using the result from the final epoch, doesn't seem appropriate either.
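One way to surface this issue alongside the standard metrics is to also track simple diversity statistics over the generated reports. Here is a minimal pure-Python sketch (these helpers and the toy report strings are my own illustration, not part of XProNet or the other codebases):

```python
from collections import Counter

def distinct_n(reports, n):
    """Fraction of unique n-grams among all n-grams in the generated reports."""
    ngrams = Counter()
    for report in reports:
        tokens = report.split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

def unique_report_ratio(reports):
    """Fraction of generated reports that are distinct strings."""
    return len(set(reports)) / len(reports) if reports else 0.0

# Toy example: a collapsed model repeats one report verbatim.
collapsed = ["no acute cardiopulmonary abnormality ."] * 4
varied = [
    "no acute cardiopulmonary abnormality .",
    "the heart size is normal .",
    "there is a small left pleural effusion .",
    "the lungs are clear .",
]
print(unique_report_ratio(collapsed))  # 0.25
print(unique_report_ratio(varied))     # 1.0
```

Logging `unique_report_ratio` and Distinct-1/Distinct-2 per epoch next to BLEU would at least make the "high score, one repeated report" failure visible in the training log.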

In addition, I found that using an LSTM as the decoder, compared to a Transformer, can yield better diversity, but I don't understand the specific reason behind this.

However, on MIMIC-CXR, the above situation was largely alleviated, and the results were relatively stable. At least in the experiments I conducted, I did not encounter cases where the first five epochs had very high results. Perhaps we can explore more on MIMIC-CXR.

I think that we need better and more reasonable metrics to evaluate the ability of radiology report generation models😂~

from xpronet.

Markin-Wang commented on August 11, 2024

Hi, thanks for your interest in our work. Note that for radiology report generation, precision is more important than the diversity of the reports. To validate our method, our work follows the most notable works in this area: we utilized six widely used evaluation metrics to gauge the performance of our model. We also observed the same phenomenon in our experiments with different models, e.g., R2Gen (see this issue) and R2GenCMN, on the IU-Xray dataset. A possible reason is that IU-Xray contains both frontal and lateral views, so it is difficult for the visual extractor to capture the differences between samples, and the model is therefore likely to generate similar reports. Besides, IU-Xray is a small dataset, so the diversity of its reports is lower than in the MIMIC-CXR dataset. I hope this helps you figure out the problem.


Leepoet commented on August 11, 2024

Hi, thanks for your reply.
I have tried to reproduce the work of R2GenCMN, and finally found that its best model and best results are generated around the 25th epoch, which is within the acceptable range in my opinion.
As I said earlier, a best model obtained in the first few epochs is usually of little reference value. In my recent reproduction experiments, I got pretty good results in the first epoch, but eventually found that only one kind of report was being generated. Does this mean the model was not well trained in those early epochs, so the diversity of the generated reports is poor even though the six evaluation metrics look good? Further, if this is the case, please forgive my bold doubts, but the validity of the method you propose may be less convincing.
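The "only one kind of report" failure described above can be caught automatically during validation. As a sketch (a hypothetical sanity check, not code from any of the repos discussed here):

```python
from collections import Counter

def mode_collapse_score(reports):
    """Fraction of generations identical to the single most frequent report.
    A value near 1.0 means the model is emitting essentially one report."""
    if not reports:
        return 0.0
    _, top_count = Counter(reports).most_common(1)[0]
    return top_count / len(reports)

def flag_collapsed_epoch(reports, threshold=0.5):
    """True if more than `threshold` of the generated reports are identical."""
    return mode_collapse_score(reports) > threshold

# An epoch whose high BLEU comes from one repeated report gets flagged.
reports = ["no acute findings ."] * 9 + ["heart size is normal ."]
print(mode_collapse_score(reports))   # 0.9
print(flag_collapsed_epoch(reports))  # True
```

Running such a check at each validation step would let you discard early "best" checkpoints whose scores come from a single repeated report.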


Markin-Wang commented on August 11, 2024

> Hi, thanks for your reply. I have tried to reproduce the work of R2GenCMN, and finally found that its best model and best results are generated around the 25th epoch, which is within the acceptable range in my opinion. As I said earlier, the best model obtained in the first few epochs is usually not of reference value. In my recent repro experiments, I got pretty good results in the first epoch, but in the end found that only one kind of report was generated. Does this mean that the model was not well trained in the first few epochs so that the diversity of the generated reports is poor but the six evaluation indicators of the results are good? Further, if this is the case, please forgive my bold doubts, then the validity of the method you propose may lack some convincing.

Hi, I guess the epoch at which the best performance occurs is also influenced by hyper-parameters such as the learning rate, and by the working environment, in addition to the method itself. As we mentioned earlier, our work follows the most notable works in this area, such as R2Gen and R2GenCMN: we utilized six widely used evaluation metrics to gauge the performance of our model. In addition, from my perspective, the real problem is that NLP evaluation metrics may not reflect the true performance of the model, which is a common problem in text generation tasks. This is why we normally focus more on larger datasets such as MIMIC-CXR to mitigate this issue. Moreover, higher diversity does not always come with higher precision; SME (subject-matter expert) involvement is required to truly gauge this.
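One pragmatic compromise between "take the best validation result" and "ignore suspicious early epochs" would be to select the best-scoring checkpoint only among epochs that pass a diversity filter. A minimal sketch (the function, dict keys, and the numbers in the toy log are all hypothetical illustrations, not values from any of these papers):

```python
def select_best_epoch(epoch_results, min_unique_ratio=0.5):
    """Pick the best-scoring epoch among those whose generations are diverse enough.

    epoch_results: list of dicts with hypothetical keys 'epoch', 'bleu4',
    and 'unique_ratio' (fraction of distinct generated reports).
    Returns the chosen dict, or None if no epoch passes the filter."""
    candidates = [r for r in epoch_results if r["unique_ratio"] >= min_unique_ratio]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["bleu4"])

# Toy log: epoch 1 has the highest BLEU-4 but near-identical generations.
log = [
    {"epoch": 1, "bleu4": 0.19, "unique_ratio": 0.03},
    {"epoch": 12, "bleu4": 0.16, "unique_ratio": 0.62},
    {"epoch": 25, "bleu4": 0.17, "unique_ratio": 0.71},
]
print(select_best_epoch(log)["epoch"])  # 25
```

This doesn't solve the deeper metric problem, but it would keep a collapsed first-epoch checkpoint from being reported as the "best" model.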


Leepoet commented on August 11, 2024

Maybe you are right; there is a possibility that something in the IU-Xray dataset itself is causing this to happen. I can agree with most of the points in your answer. By the way, I also appreciate your patient replies and your excellent work. Thank you.


Markin-Wang commented on August 11, 2024

> Maybe you are right, there is a possibility that something in the iu-xray dataset itself is causing this to happen. In response to your answer, I can agree with most of your points. By the way, I also appreciate your patient reply and wonderful work contribution. Thank you.

Never mind, and thank you for your interest in our work and the concrete discussion. Please feel free to reach out again if you have any other questions.

from xpronet.

ThatNight commented on August 11, 2024

@Leepoet Hello Leepoet, I have repeated the experiment many times and it is difficult to reproduce the results on the IU-Xray dataset. Could you share the parameters in utils.py for the IU-Xray dataset, or the random seed?
