In fashion-iq dataset, there are two captions available, in paper "The fashion iq data

Is FashionIQ evaluation comparable? about val HOT 4 CLOSED

yanbeic commented on September 13, 2024

Is FashionIQ evaluation comparable?


Comments (4)

yanbeic commented on September 13, 2024

Hi,
In this paper, one caption is used as the text input in both training and evaluation. This follows the same spirit as the previous CVPR19 paper and the FashionIQ paper. Since training and evaluation are consistent, this setup should be reasonable.

Note that the original FashionIQ paper reports no results for the exact task considered in this paper. Our reported results were obtained by re-implementing prior methods under the same setup, so the evaluation in our paper is comparable.


helson73 commented on September 13, 2024

Sorry, I confused the task in their paper ("The fashionIQ dataset : ...") with their FashionIQ challenge.

If all the results in your results table were obtained by you and evaluated under the same protocol, then they are indeed comparable. My mistake for confusing the two.

However, the people who released the FashionIQ dataset also organized a contest called the "FashionIQ challenge", and I think you mention results from it in your paper as "un-published SOTA". (Although they were not published in major journals or conferences, the reports are available.)

The "FashionIQ challenge" is exactly the same task as the one in your work, except for one thing: the evaluation protocol.

The "FashionIQ challenge" treats the two captions together as one text input, and all participants in the 2019 and 2020 FashionIQ challenges follow the same rule. Along with the official FashionIQ dataset, the organizers released a "starter-kit" for evaluating the baseline (TIRG) on FashionIQ, and this "starter-kit" follows the same rule: it treats the two captions together as one input.
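To make the difference concrete, here is a minimal sketch of the two ways a text query could be built from one annotation. The record layout (`candidate`, `target`, `captions`), the function names, and the `" and "` separator are assumptions for illustration only. They are not taken from the starter-kit or from the paper's code, and "one query per caption" is only one possible reading of the single-caption setup.

```python
from typing import Dict, List


def joint_caption_queries(annotations: List[Dict], sep: str = " and ") -> List[Dict]:
    """Challenge-style protocol: both captions are joined into one text input.

    The separator is a hypothetical choice; the real one may differ.
    """
    return [
        {
            "candidate": a["candidate"],
            "target": a["target"],
            "text": sep.join(a["captions"]),
        }
        for a in annotations
    ]


def single_caption_queries(annotations: List[Dict]) -> List[Dict]:
    """Single-caption protocol (assumed reading): each caption is its own query."""
    return [
        {"candidate": a["candidate"], "target": a["target"], "text": caption}
        for a in annotations
        for caption in a["captions"]
    ]


# Toy annotation with made-up image ids and captions.
annotations = [
    {
        "candidate": "img_a",
        "target": "img_b",
        "captions": ["is shorter", "has more red"],
    }
]

joint = joint_caption_queries(annotations)
single = single_caption_queries(annotations)
print(len(joint), len(single))
```

Under these assumptions the same annotation yields one query in the first protocol and two in the second, so the text a model sees at evaluation time differs between them, and retrieval numbers from the two setups should not be placed side by side.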

From the perspective of a newcomer to this task, one usually visits the official website first and is then very likely to use the "starter-kit" as a starting point. Someone who has read your paper might well assume that the task in your paper is the same as the one in the FashionIQ challenge.

I understand your concern about consistency with other works (the CVPR19 paper, for instance), but in this case an official challenge already exists for this specific dataset. You did the same task and mentioned the challenge in your paper (not in the table, but mentioned anyway), yet used a different evaluation protocol without stating so directly, which may cause confusion.

If possible, I suggest you add a note about this difference, at least in the README file :)


helson73 commented on September 13, 2024

I have another question about the evaluation results table in your paper.

The paper shows a large gap between the results of the original TIRG implementation and those of the re-implementation on the FashionIQ dataset. I am wondering what causes this. More specifically, what is the difference between the original TIRG and your re-implemented version?


yanbeic commented on September 13, 2024

The difference is the backbone network. ResNet-50 is used on FashionIQ in this paper.
