
fsrs-vs-sm17's Introduction

FSRS vs SM-17


This repository is a simple comparison between FSRS and SM-17. FSRS-v-SM16-v-SM17.ipynb is the notebook for the comparison.

Due to the differences between the workflows of SuperMemo and Anki, it is not easy to compare the two algorithms. I tried to make the comparison as fair as possible. Here are some notes:

  • The first interval in SuperMemo is the duration between creating the card and the first review, while in Anki it is the duration between the first and second reviews. So I removed the first record of each card in the SM-17 data.
  • SuperMemo has six grades, but Anki has only four. So I merged grades 0, 1, and 2 in SuperMemo into 1 in Anki, and mapped 3, 4, and 5 in SuperMemo to 2, 3, and 4 in Anki.
  • I use the R (SM17)(exp) recorded in sm18/systems/{collection_name}/stats/SM16-v-SM17.csv as the prediction of SM-17. Reference: Confusion among R(SM16), R(SM17)(exp), R(SM17), R est. and expFI.
  • To ensure FSRS has the same information as SM-17, I implemented an online-learning version of FSRS, where FSRS has zero knowledge of future reviews, just as SM-17 does.
  • The results are based on data from a small group of people and may differ for other SuperMemo users.
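As a sketch, the first two adjustments above might look like this in pandas. The column names `card_id`, `grade`, and `rating` are hypothetical; the actual notebook may organize the data differently.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the two adjustments described above (hypothetical column names)."""
    # Drop the first record of each card: in SuperMemo the first interval is
    # measured from card creation, which has no counterpart in Anki.
    df = df[df.groupby("card_id").cumcount() > 0].copy()
    # Collapse SuperMemo grades 0-2 into Anki's Again (1) and map
    # grades 3, 4, 5 to Hard (2), Good (3), Easy (4).
    df["rating"] = df["grade"].map({0: 1, 1: 1, 2: 1, 3: 2, 4: 3, 5: 4})
    return df
```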

Metrics

We use two metrics in the FSRS benchmark to evaluate how well these algorithms work: log loss and a custom RMSE that we call RMSE (bins).

  • Log Loss (also known as Binary Cross Entropy): Utilized primarily for its applicability in binary classification problems, log loss serves as a measure of the discrepancies between predicted probabilities of recall and review outcomes (1 or 0). It quantifies how well the algorithm approximates the true recall probabilities, making it an important metric for model evaluation in spaced repetition systems.
  • Weighted Root Mean Square Error in Bins (RMSE (bins)): This is a metric engineered for the FSRS benchmark. In this approach, predictions and review outcomes are grouped into bins according to the predicted probabilities of recall. Within each bin, the squared difference between the average predicted probability of recall and the average recall rate is calculated. These values are then weighted according to the sample size in each bin, and then the final weighted root mean square error is calculated. This metric provides a nuanced understanding of model performance across different probability ranges.

Smaller is better. If you are unsure which metric to look at, look at RMSE (bins). That value can be interpreted as "the average difference between the predicted probability of recalling a card and the measured probability". For example, if RMSE (bins) = 0.05, the algorithm is, on average, wrong by 5% when predicting the probability of recall.
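As an illustrative sketch of the two metrics (not the benchmark's exact code; the 20-bin scheme and equal-width binning are assumptions), with `p` an array of predicted recall probabilities and `y` the binary review outcomes:

```python
import numpy as np

def log_loss(p, y, eps=1e-15):
    """Binary cross-entropy between predictions p and outcomes y."""
    p = np.clip(np.asarray(p, float), eps, 1 - eps)
    y = np.asarray(y, float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def rmse_bins(p, y, n_bins=20):
    """Weighted RMSE between mean prediction and mean outcome per bin."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    # Assign each prediction to a bin by its predicted probability.
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    sq_err, total = 0.0, 0
    for b in np.unique(idx):
        mask = idx == b
        n = int(mask.sum())  # bin size = weight
        sq_err += n * (p[mask].mean() - y[mask].mean()) ** 2
        total += n
    return float(np.sqrt(sq_err / total))
```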

Result

Total users: 16

Total repetitions: 194,281

The following tables represent the weighted means and the 99% confidence intervals.

Weighted by number of repetitions

| Algorithm | Log Loss | RMSE (bins) |
| --------- | -------- | ----------- |
| FSRS-4.5  | 0.4±0.08 | 0.06±0.021  |
| FSRSv4    | 0.4±0.09 | 0.07±0.025  |
| FSRSv3    | 0.4±0.09 | 0.08±0.021  |
| SM-17     | 0.4±0.10 | 0.08±0.020  |
| SM-16     | 0.4±0.09 | 0.11±0.026  |

Weighted by ln(number of repetitions)

| Algorithm | Log Loss | RMSE (bins) |
| --------- | -------- | ----------- |
| FSRS-4.5  | 0.4±0.08 | 0.09±0.030  |
| SM-17     | 0.5±0.10 | 0.10±0.029  |
| FSRSv4    | 0.4±0.09 | 0.11±0.043  |
| FSRSv3    | 0.5±0.10 | 0.11±0.035  |
| SM-16     | 0.5±0.11 | 0.12±0.033  |
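A sketch of how one table cell could be produced from per-user metric values. The exact interval procedure used by the repo is not stated here, so the normal-approximation confidence interval below is an assumption:

```python
import numpy as np

def weighted_mean_ci(values, n_reps, log_weights=False, z=2.576):
    """Weighted mean and half-width of an approximate 99% CI.

    values: per-user metric (e.g. RMSE (bins)); n_reps: per-user review
    counts. z = 2.576 corresponds to 99% coverage under a normal
    approximation of the weighted mean.
    """
    x = np.asarray(values, float)
    w = np.asarray(n_reps, float)
    if log_weights:                       # the ln(n) weighting of the 2nd table
        w = np.log(w)
    w = w / w.sum()
    mean = float(np.sum(w * x))
    var = float(np.sum(w * (x - mean) ** 2))   # weighted variance across users
    half = z * np.sqrt(var * np.sum(w ** 2))   # s.e. of the weighted mean
    return mean, float(half)
```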

The image below shows the p-values obtained by running the Wilcoxon signed-rank test on the RMSE (bins) of all pairs of algorithms. Red means that the row algorithm performs worse than the corresponding column algorithm, and green means that the row algorithm performs better than the corresponding column algorithm. Grey means that the p-value is >0.05, and we cannot conclude that one algorithm performs better than the other.

It's worth mentioning that this test is not weighted, and therefore doesn't take into account that RMSE (bins) depends on the number of reviews.
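A minimal sketch of this pairwise test, assuming SciPy is available and `rmse_a`, `rmse_b` hold per-user RMSE (bins) values for two algorithms. The 0.05 threshold matches the description above; the tie-breaking rule (compare totals) is an assumption.

```python
from scipy.stats import wilcoxon

def compare_rmse(rmse_a, rmse_b, alpha=0.05):
    """Unweighted Wilcoxon signed-rank test on paired per-user RMSE values.

    Returns which algorithm looks better, or "inconclusive" when p > alpha
    (the grey cells in the figure).
    """
    stat, p = wilcoxon(rmse_a, rmse_b)
    if p > alpha:
        return "inconclusive"
    # Smaller RMSE is better.
    return "A" if sum(rmse_a) < sum(rmse_b) else "B"
```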

Wilcoxon-16-collections

Share your data

If you would like to support this project, please consider sharing your data with us. The shared data will be stored in the ./dataset/ folder.

You can open an issue to submit it: https://github.com/open-spaced-repetition/fsrs-vs-sm17/issues/new/choose

Contributors

All of the following contributed data (🔣):

leee_ · Jarrett Ye · 天空守望者 · reallyyy · shisuu · Winston · Spade7 · John Qing · WolfSlytherin · HyFran · Hansel221 · 曾经沧海难为水 · Pariance · github-gracefeng

fsrs-vs-sm17's People

Contributors

allcontributors[bot] · expertium · l-m-sherlock


fsrs-vs-sm17's Issues

[Data]

Data file

You can find SM16-v-SM17.csv in sm18/systems/{collection_name}/stats folder. The private content is stored in column Title. I recommend removing it before uploading to GitHub and sharing. Don't forget to make a copy before you delete it. Then you can drag and drop the file here.

SM16-v-SM17.csv

[Data]

Data file

SM16-v-SM17.csv

Gather more data

I know that this isn't very helpful, and I can't contribute anything substantial or help in any way, so I'm just opening this issue to remind you that right now this benchmark is based on very limited data; ideally we need 1,000,000+ reviews. So if you have any ideas about where to find more SuperMemo users who are willing to share their data, that would be great.

NoHeartPen's [Data]

Data file

SM16-v-SM17.csv

A note: although the file contains data from the past year, I don't use SuperMemo all that much — it's mostly been a "fish for three days, dry the nets for thirty" situation. I hope my "dirty" data can still be of some reference value :)

[Data]

SM16-v-SM17.csv


[Data]

Data file
SM16-v-SM17.csv


Something is wrong with the comparison

FSRS is indeed a good SRS model, but I find it hard to believe that it would be better than SM-17. However, the results of the comparison made in this repo suggest that FSRS is better than SM-17.

This makes me think that there is something wrong with the comparison. However, I am unable to come up with reasonable causes for the poor performance of SM-17 against FSRS in this comparison.

[Data]

Data file


SM16-v-SM17.csv

Add FSRS v4 and FSRS v3 to the comparison

Currently the table just says "FSRS".
Ideally, both FSRS v3 and FSRS v4 should be added. If you don't want to change v3 code just for benchmarking that's fine, although I think it would be interesting to see the difference. Also, please specify which version of FSRS is used in the benchmark. I know it's v4, but I still think it would be better to specify that explicitly.

[Data]

Data file

SM16-v-SM17.csv
原神.csv
上古合集.csv


T-test between FSRS and SM-17

Total number of users: 16
Total size: 194281

Scale: reviews

| Algorithm | Log Loss | Log Loss (mean±std) | RMSE (bins) | RMSE (bins) (mean±std) |
| --------- | -------- | ------------------- | ----------- | ---------------------- |
| FSRSv4    | 0.4±0.08 | 0.370±0.101         | 0.06±0.027  | 0.061±0.034            |
| FSRSv3    | 0.4±0.09 | 0.401±0.116         | 0.10±0.028  | 0.098±0.035            |
| SM17      | 0.4±0.10 | 0.414±0.121         | 0.10±0.039  | 0.096±0.047            |
| SM16      | 0.4±0.09 | 0.421±0.118         | 0.12±0.027  | 0.117±0.035            |

https://www.evanmiller.org/ab-testing/t-test.html

@Expertium, I find that the difference is statistically significant when we compare FSRS and SM-17 using their weighted means and weighted standard deviations in a t-test.
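For reference, this kind of two-sample t-test can be reproduced from the reported mean±std pairs with n = 16 users per group. Welch's formula is used in the sketch below, which may differ slightly from the linked calculator's assumptions:

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic and degrees of freedom from summary statistics."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# FSRSv4 vs SM-17 RMSE (bins), using the mean±std values reported above:
t, df = welch_t(0.061, 0.034, 16, 0.096, 0.047, 16)
```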

[Data] SM18 Data Share

Data file

SM16-v-SM17.csv
SM16-v-SM17.csv
