coder-yu / selfrec

463.0 463.0 67.0 32.82 MB

An open-source framework for self-supervised recommender systems.

Python 100.00%

selfrec's Introduction

Hi there 👋

This is Junliang Yu. I am currently a postdoctoral research fellow working on data science. [Homepage] [Google Scholar]

I’m working with A/Prof. Hongzhi Yin and Prof. Shazia Sadiq at the University of Queensland.
My research interests include recommender systems, tiny machine learning, self-supervised learning and graph learning.
Feel free to drop me an email if you have any questions. 📧

Featured Project 🍊

You can find all the implementations of my papers in QRec 😜.

SELFRec is a Python framework for self-supervised recommendation (SSR) which integrates commonly used datasets and metrics, and implements many state-of-the-art SSR models. SELFRec has a lightweight architecture and provides user-friendly interfaces. It can facilitate model implementation and evaluation.

selfrec's Issues

Requesting an update...

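Uniformity in XSimGCL

When reproducing the uniformity loss, I found that its trend is similar to the one shown in the paper, but the values differ. I am not sure whether this is because my uniformity loss implementation differs from yours or because I misunderstood your sampling strategy. Would you mind sharing the code?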

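[Help Wanted] A question about the last-layer details of the LightGCN backbone implementation in SGL/SimGCL

SGL/SimGCL use LightGCN as the backbone. After the final N special normalized graph convolution operations, N layers of embeddings are obtained, which your paper expresses as follows (from Eq. 3 of SimGCL):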

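Support for multiple test sets in SELFRec

Hello, does SELFRec support using multiple test sets during testing? For example, for the same dataset I have two test sets, A and B, and I would like both to be evaluated in the same test run.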

Code performance issue

GPU utilization seems low when using SELFRec. Are there any parameters I can set to speed up execution?

Seeking advice on data cleaning and optimizing for fast execution.

Dear Yu,

Thank you very much for your outstanding work. I have recently been trying to build a recommender system framework from scratch, but I ran into some issues. I looked for answers in SELFRec, but I am still a bit confused, so I would like to ask you two questions and hope to receive your guidance:

Firstly, I notice that you have already prepared preprocessing for large sparse datasets like Yelp. I want to know your cleaning rules. Currently, I've only done some basic processing:

# Filter out users with fewer than 5 interactions
user_counts = data_df['user_id'].value_counts()
data_df = data_df[data_df['user_id'].isin(user_counts[user_counts >= 5].index)].copy()

# Map each unique user and item ID to a sequential integer
# (the dict key must be the raw ID, not the builtin `id`)
user_id_map = {raw_id: i for i, raw_id in enumerate(data_df['user_id'].unique())}
item_id_map = {raw_id: i for i, raw_id in enumerate(data_df['item_id'].unique())}
data_df['user_id'] = data_df['user_id'].map(user_id_map)
data_df['item_id'] = data_df['item_id'].map(item_id_map)

# Keep only interactions with a rating of at least 3, then drop the rating column
data_df = data_df.loc[data_df['rating'] >= 3].copy()
data_df.drop(columns=['rating'], inplace=True)

Secondly, I want to know the strategies you use when conducting quick evaluation. When I use a full ranking for evaluation, the code runs relatively slowly.

Thank you for your valuable work again. I've been reading your article on self-supervised learning recently, and it's been absolutely fascinating.
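The cleaning steps listed in this issue can be sketched without pandas as follows. This is a minimal illustration, not SELFRec's actual preprocessing rule: the function name clean_interactions, the (user_id, item_id, rating) tuple layout, and the choice to filter ratings before counting users are all assumptions made for the example.

```python
from collections import Counter


def clean_interactions(rows, min_rating=3, min_user_count=5):
    """rows: iterable of (user_id, item_id, rating) tuples (hypothetical layout)."""
    # 1) Drop low-rated interactions first, so later counts reflect kept data.
    kept = [(u, i) for u, i, r in rows if r >= min_rating]
    # 2) Drop users with fewer than min_user_count remaining interactions.
    counts = Counter(u for u, _ in kept)
    kept = [(u, i) for u, i in kept if counts[u] >= min_user_count]
    # 3) Remap raw IDs to contiguous integers in first-seen order.
    user_map, item_map = {}, {}
    out = []
    for u, i in kept:
        uid = user_map.setdefault(u, len(user_map))
        iid = item_map.setdefault(i, len(item_map))
        out.append((uid, iid))
    return out, user_map, item_map
```

Filtering by rating before counting means the 5-interaction threshold applies to the interactions that are actually kept, and remapping last keeps the integer IDs contiguous.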

SASRec vs CL4SRec

Thank you very much for your great work.

I compared the accuracy of the two models (SASRec and CL4SRec in your code) on Amazon-beauty, and SASRec was superior.
Empirically, contrastive learning based on InfoNCE should work better, since it also shapes the representation space of the sequence embeddings.

Could you tell me what you think of this? Is it due to differences in datasets and a lack of hyperparameter tuning (or the all-ranking protocol)?

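Some questions about training the XSimGCL model

After training a model, if I obtain new data, how should I continue training from the existing model (the new data may contain items and users that did not appear in the previous dataset)? I see that model initialization builds a CSR matrix from the input data. Should I extend the original CSR matrix with the new data and then train, or should I train only on the new data and then add the embeddings of the new data to the old embeddings?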

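SSL4Rec results on Yelp are not very good

In the experiments of your SimGCL paper, DNN+SSL should correspond to the SSL4Rec model, right? However, when I run this model on the Yelp dataset, I get the following result:

Best Performance
Epoch: 22, Hit Ratio:0.0183126791239777 | Precision:0.009372236958443855 | Recall:0.02152239714180159 | NDCG:0.01612651549405071

The results in your two papers, SimGCL and XSimGCL, are much better than this, with recall reaching 0.0483. I am not sure whether I configured something incorrectly; the only change I made was setting the data location in the conf file to yelp.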

[Questions] Negative items are not taken as input of the batch_softmax loss

Hi Yu. Your work is very impressive; thank you for your open-source contribution. I have a question about using this project. As the figure below shows, the key point of this SSL implementation is to construct negative samples by masking or dropout. I noticed that you construct negative samples from the in-batch items that the user has not interacted with, but the '_neg' values are not passed to the loss calculation. Could you please explain this? Thanks a lot!

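A question about the implementation of Regulating Uniformity in SimGCL

I would like to ask about the implementation of Eq. (9) in SimGCL. In the paper, it takes the L2 norm of the difference between two features, but l2_reg_loss() in the loss code simply sums the norms. My understanding is that the formula enforces a pair-wise, relative consistency between features, whereas the loss implemented in the code just encourages small norms. Could you explain the reasoning behind this?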

About amazon-book dataset

Is the amazon-book dataset used in the SimGCL paper the content of /data/amazon-kindle in this repository?

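Does the rating in the dataset have any effect?

Hello, I am new to recommender systems. I recently read the BUIR code, which uses the yelp dataset. My understanding is that the final prediction is only whether a user will interact with an item, not the rating value. Is that right, and is it enough for the third column of the dataset txt file to be any number?
I ask because the variable training_set_u in ui_graph.py records the rating, but the test() function in graph_recommender.py does not use the contents of the variable li.
Also, in the __create_sparse_bipartite_adjacency() function in ui_graph.py, is the rating likewise unrelated to the rating in the dataset? It is generated with np.ones_like(), so does it only indicate whether two nodes are connected?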

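Handling long-tail distributed data in XSimGCL

Could you describe in detail how XSimGCL groups the long-tail data? My understanding is that the items in the training set are sorted by number of interactions and then assigned in order to groups so that the total number of interactions in each group is roughly equal. Does this effectively yield 10 new training sets, on which the model is then trained and tested?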
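The grouping described in this question can be sketched as below. This only illustrates the questioner's interpretation (sort items by interaction count, then cut the sorted list into groups with roughly equal total interactions); whether XSimGCL's experiment works this way is not confirmed here, and group_items_by_popularity and its arguments are hypothetical names.

```python
from collections import Counter


def group_items_by_popularity(interactions, n_groups=10):
    """interactions: iterable of (user, item) pairs (hypothetical layout)."""
    counts = Counter(item for _, item in interactions)
    # Least popular items first; ties are broken by the item's string form.
    ordered = sorted(counts, key=lambda it: (counts[it], str(it)))
    target = sum(counts.values()) / n_groups  # interactions per group
    groups = [[] for _ in range(n_groups)]
    running, g = 0, 0
    for item in ordered:
        # Move to the next group once the cumulative share has been reached.
        if g < n_groups - 1 and running >= target * (g + 1):
            g += 1
        groups[g].append(item)
        running += counts[item]
    return groups
```

Because items are indivisible, group totals are only approximately equal, and a very popular item can end up in a group by itself.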

About Amazon Book settings

Dear authors,

Thanks a lot for sharing the code and datasets for your model. Would you kindly share the settings you used to obtain the results for SimGCL on Amazon Book? That would be useful for my research.

Thank you again.

Best regards,
Daniele

[Bug] The program generates recommendation lists containing duplicate elements

While debugging, I found that the recommendation list contains duplicate elements.
At base/graph_recommender.py, line 87, rec_list = self.test().
I got a rec_list whose content is {'22': [('1190', 0.00026), ('325', 0.00023), ('325', 0.00023), ('166', 0.00017), ('166', 0.00017)],....}
'325' and '166' appear twice.

I checked your code and found the bug at util/algorithm.py, line 152, where the code is for iid, score in enumerate(candidates).
candidates[:K] has already been assigned to n_candidates, but line 152 still traverses all candidates and so fails to avoid repetition.

When I changed util/algorithm.py line 152 to for iid, score in enumerate(candidates[K:]): and line 153 to iid = iid + K,
the bug was fixed.
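One way to select the top-K items without repeats is to rank (index, score) pairs directly, so that each item index is visited exactly once. The sketch below uses heapq.nlargest from the standard library; top_k_items and the flat score list are hypothetical and do not reproduce util/algorithm.py.

```python
import heapq


def top_k_items(scores, k):
    """scores[iid] is the predicted score of item iid (hypothetical layout)."""
    # Each index appears once in enumerate(), so no item can be returned twice.
    best = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    ids = [iid for iid, _ in best]
    vals = [score for _, score in best]
    return ids, vals
```

Items with equal scores are both kept as long as they have different indices, which is the behaviour the duplicated '325' and '166' entries in this report were missing.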

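A question about SimGCL testing

Hello, how does SimGCL handle users who appear in the test set but not in the training set? During testing, I found that the number of users actually tested is slightly smaller than the number of users in the given test set. For example, the test set of this dataset contains 10,000 users, but recommendations are produced for only 9,740 users.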

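A question about the validation set

The SimGCL paper contains this passage: “We split the datasets into three parts (training set, validation set, and test set) with a ratio of 7:1:2.”
However, the datasets provide only a training set and a test set, with no validation set. Why is that?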
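A 7:1:2 split like the one quoted from the paper could be produced as sketched below. The random shuffle, the seed, and the function name split_interactions are assumptions; the paper's exact splitting procedure is not given in this issue.

```python
import random


def split_interactions(interactions, weights=(7, 1, 2), seed=0):
    """Shuffle and split into train/validation/test by integer weights."""
    rng = random.Random(seed)
    shuffled = list(interactions)
    rng.shuffle(shuffled)
    n, total = len(shuffled), sum(weights)
    n_train = n * weights[0] // total
    n_valid = n * weights[1] // total
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test
```

Integer weights avoid floating-point rounding when computing the split sizes.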

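Analysis code for the uniformity of the representation distribution

Hello, I am a beginner in debiasing for recommender systems. I recently read the SimGCL paper and would like to ask whether you would consider open-sourcing the plotting code for Fig. 2 in Section 2.3 and Fig. 4 in Section 3.2. I look forward to your reply.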

Is there a repo for an updated version?

I am getting the error 'tensorflow' has no attribute 'contrib' while running the MHCN algorithm.

Is there a repo that supports a newer version of TensorFlow?

AttributeError Traceback (most recent call last)
Input In [2], in <cell line: 4>()
29 exit(-1)
30 rec = SELFRec(conf)
---> 31 rec.execute()
32 e = time.time()
33 print("Running time: %f s" % (e - s))

File C:\SELFRec-main/SELFRec-main\SELFRec.py:28, in SELFRec.execute(self)
26 exec(import_str)
27 recommender = self.config['model.name'] + '(self.config,self.training_data,self.test_data,**self.kwargs)'
---> 28 eval(recommender).execute()

File C:\SELFRec-main/SELFRec-main\base\recommender.py:71, in Recommender.execute(self)
69 self.print_model_info()
70 print('Initializing and building model...')
---> 71 self.build()
72 print('Training Model...')
73 self.train()

File C:\SELFRec-main/SELFRec-main\model\graph\MHCN.py:62, in MHCN.build(self)
60 self.weights = {}
61 self.n_channel = 4
---> 62 initializer = tf.contrib.layers.xavier_initializer()
63 self.user_embeddings = tf.Variable(initializer([self.data.user_num, self.emb_size]))
64 self.item_embeddings = tf.Variable(initializer([self.data.item_num, self.emb_size]))

AttributeError: module 'tensorflow' has no attribute 'contrib'

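Problem reproducing LightGCN Recall@20

Hello. When I use your code to reproduce the papers, the metrics for SGL, NCL, SimGCL, and XSimGCL all match, with no problems.

However, LightGCN falls well short of the leaderboard results. I did not change your code or hyperparameters and ran the experiment several times, but I cannot reach the metrics reported on the leaderboard and in most papers (yelp2018). Could I have missed a step?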

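Will other models be added in the future?

I see that DuoRec is among the options in ssl_sequential_models. Will the implementation of this model be added later? In that model's issues, many people say they cannot reach the results reported in the paper, but the authors have not responded.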

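A question about datasets

Hello, thank you for the open-source framework and datasets. I noticed that the XSimGCL paper uses an Amazon-Electronics dataset that I could not find here. Could this dataset be released?

I look forward to your reply. Many thanks.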

Contrastive learning parameters

Hello, what are the best values of the temperature t in contrastive learning for the other datasets?

KeyError in NCL

Other models run on the same dataset without this problem, but NCL raises the following error. How can I fix it?
File "main.py", line 37, in <module>
rec.execute()
File "/root/SELFRec/SELFRec.py", line 25, in execute
eval(recommender).execute()
File "/root/SELFRec/base/recommender.py", line 73, in execute
self.train()
File "/root/SELFRec/model/graph/NCL.py", line 121, in train
self.fast_evaluation(epoch)
File "/root/SELFRec/base/graph_recommender.py", line 91, in fast_evaluation
rec_list = self.test()
File "/root/SELFRec/base/graph_recommender.py", line 57, in test
item_names = [self.data.id2item[iid] for iid in ids]
File "/root/SELFRec/base/graph_recommender.py", line 57, in <listcomp>
item_names = [self.data.id2item[iid] for iid in ids]
KeyError: 22225

Origin of the datasets

Hello,

I was wondering where I can find the original datasets you are using here. I understand the accompanying datasets have already been preprocessed (by the way, exactly how they were preprocessed is unclear to me). Is there a resource for accessing the raw datasets?

For instance I was expecting the amazon-book dataset to be something similar to this where you can visualize the data as done in this example. It's unclear to me whether there is one standard dataset being used (as I see multiple papers in this survey referencing) or whether there are different variations.

Thanks!

Request for settings of the Amazon-Electronics dataset

I am impressed by your work XSimGCL on self-supervised learning for GNN-based recommender systems. I have read your paper and code, and I would like to reproduce your results on the Amazon-Electronics dataset. However, I could not find the pre-processed version of this dataset or the detailed pre-processing pipeline. Could you please kindly share the pre-process settings of the Amazon-Electronics dataset, such as the train/test split file and the k-core setting? I would greatly appreciate your help. Thank you for your time and attention.

Are there plans to update the contrastive sequential recommendation models?

If so, I would love to see the update soon~

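A question about training SimGCL

Hello, thank you for providing a clean and efficient framework. While training SimGCL, I noticed that as cl_rate increases, the model seems to need more epochs early in training before recommendation performance on the validation set starts to improve. If an early-stopping strategy is applied to SimGCL, what threshold would you consider safe?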

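A question about how training samples are processed

Hello! I noticed how the data module in SELFRec processes training samples:

Suppose the training data consists of user 0 interacting with items 0, 1, and 2, and user 1 interacting with items 0 and 1, i.e. 5 interactions in total. To build the training set, SELFRec uses (0, 0, _), (0, 1, _), (0, 2, _), (1, 0, _), (1, 1, _).
The three elements of each triple are the user ID, the positive sample ID, and the negative sample ID, where the underscore denotes a randomly drawn negative sample. This guarantees that the training set covers every interaction exactly once. Is my understanding correct?

However, in LightGCN and SGL, I found that both first sample as many users as there are training samples and then obtain the positive and negative samples. Which of the two approaches is more reasonable? If I use LightGCN as a baseline, do I need to define a sampling method like theirs? And if I want to design my own recommendation algorithm, which sampling method would you recommend?

Thank you very much!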
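The first scheme described in this question (one triple per observed interaction, each with a randomly drawn negative) can be sketched as follows. This illustrates the questioner's description rather than SELFRec's actual sampler; build_triples, the dict layout, and the seed are assumptions, and the sketch assumes every user has at least one item they have not interacted with.

```python
import random


def build_triples(user_items, n_items, seed=0):
    """user_items: dict mapping user -> set of interacted item IDs in range(n_items)."""
    rng = random.Random(seed)
    triples = []
    for user, pos_items in user_items.items():
        for pos in sorted(pos_items):
            # Resample until the negative is an item the user has not interacted with.
            neg = rng.randrange(n_items)
            while neg in pos_items:
                neg = rng.randrange(n_items)
            triples.append((user, pos, neg))
    return triples
```

With the example above, build_triples({0: {0, 1, 2}, 1: {0, 1}}, n_items=5) yields exactly five triples, one per interaction.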

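[help wanted] A question about understanding SimGCL

Sorry to bother you.
The SimGCL work is very inspiring, but I have a question. The paper argues that the key is not the dropout augmentation but the more uniform distribution, which achieves a debiasing effect.
However, the title, "Are Graph Augmentations Necessary? Simple Graph Contrastive Learning for Recommendation", confuses me a little. Dropout-based augmentation does not work well on GBRSs; are other, non-dropout augmentation methods necessary on GBRSs?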

When will the code be added to this repo? Thanks!

About results of LightGCN

Thank you very much for your work.

The output of "top-20items.txt" is useful, so I want to output the results of LightGCN with this code.
However, the results were lower than expected. Could you please tell me the cause of this?

LightGCN - INFO - ### model configuration ###
LightGCN - INFO - training.set=./dataset/yelp2018/train.txt
LightGCN - INFO - test.set=./dataset/yelp2018/test.txt
LightGCN - INFO - model.name=LightGCN
LightGCN - INFO - model.type=graph
LightGCN - INFO - item.ranking=-topN 10,20
LightGCN - INFO - embbedding.size=64
LightGCN - INFO - num.max.epoch=500
LightGCN - INFO - batch_size=2048
LightGCN - INFO - learnRate=0.001
LightGCN - INFO - reg.lambda=0.0001
LightGCN - INFO - LightGCN=-n_layer 3
LightGCN - INFO - output.setup=-dir ./results/
LightGCN - INFO - ###Evaluation Results###
LightGCN - INFO - ['Top 10\n', 'Precision:0.03074712643678161\n', 'Recall:0.03410035637045833\n', 'F1:0.03233704450729225\n', 'NDCG:0.03980414986661795\n', 'Top 20\n', 'Precision:0.02679202980927119\n', 'Recall:0.05948183429247423\n', 'F1:0.03694372783847195\n', 'NDCG:0.049277584262262614\n']

The results by 1000 epochs were almost the same.
Is there anything I got wrong? Thanks.

about sparse_norm_adj

ego_embeddings = torch.sparse.mm(self.sparse_norm_adj, ego_embeddings)
Hello, what is the basis for generating the sparse matrix sparse_norm_adj in the code above? In my experiments, this matrix turned out to be very important.

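A question about understanding the code

Hello, how should I understand the following line of code in the XSimGCL model?
ego_embeddings += torch.sign(ego_embeddings) * F.normalize(random_noise, dim=-1) * self.eps
This should be the data augmentation part of the paper. Does it correspond to sign(e) multiplied element-wise by X? If so, I do not understand X; if not, what does the part F.normalize(random_noise, dim=-1) * self.eps mean?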
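Read literally, the quoted line adds a perturbation of length at most self.eps to each embedding, with the sign of each component matched to the sign of the embedding's own component. The pure-Python sketch below mirrors that arithmetic for a single vector. It assumes random_noise is drawn uniformly from [0, 1), which the quoted snippet does not show, and perturb is a hypothetical helper name.

```python
import math
import random


def perturb(embedding, eps=0.1, rng=None):
    """Add sign-aligned, L2-normalized noise scaled by eps to one vector."""
    rng = rng or random.Random(0)
    noise = [rng.random() for _ in embedding]       # assumed uniform in [0, 1)
    norm = math.sqrt(sum(x * x for x in noise)) or 1.0
    unit = [x / norm for x in noise]                # like F.normalize(noise, dim=-1)
    sign = lambda v: (v > 0) - (v < 0)              # like torch.sign
    return [e + sign(e) * u * eps for e, u in zip(embedding, unit)]
```

Because the noise is non-negative before the sign is applied, every component moves away from zero, and the total shift has L2 norm eps when no component of the embedding is exactly zero.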

A question about model testing

Is it normal for testing to be much slower than training? My monitoring shows the GPU is barely used.

About Sec. 3.2 Regulating Uniformity in SimGCL

Hello! Is the following the code that computes L-uniform in Eq. 9 of Sec. 3.2?

def uniform_loss(x, t=2):
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

# user_e and item_e are the final embeddings obtained from the convolution
uniform = (uniform_loss(user_e) + uniform_loss(item_e)) / 2
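For reference, the arithmetic of the uniform_loss function quoted above can be written out without torch. torch.pdist returns the distances between all pairs of rows with i < j, so the loss is the log of the mean of exp(-t * squared distance) over those pairs. The sketch assumes the rows are plain Python lists; this quantity is usually computed on L2-normalized embeddings, and whether it matches the paper's Eq. 9 is exactly what the question asks, so it is not asserted here.

```python
import math
from itertools import combinations


def uniform_loss(x, t=2.0):
    """log-mean-exp of -t * squared Euclidean distance over all row pairs."""
    vals = []
    for a, b in combinations(x, 2):          # every unordered pair, like pdist
        sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        vals.append(math.exp(-t * sq_dist))
    return math.log(sum(vals) / len(vals))
```

Lower values mean the points are spread further apart; identical points give the maximum value of 0.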

Questions with reg_loss~

Thank you for your excellent work. I noticed slightly different regularization loss in different models.
In LightGCN, the total loss is:

batch_loss = bpr_loss(user_emb, pos_item_emb, neg_item_emb) + l2_reg_loss(self.reg, user_emb,pos_item_emb,neg_item_emb)/self.batch_size

In SGL, the total loss is:

batch_loss =  rec_loss + l2_reg_loss(self.reg, user_emb, pos_item_emb,neg_item_emb) + cl_loss

In SimGCL, the total loss is:

batch_loss =  rec_loss + l2_reg_loss(self.reg, user_emb, pos_item_emb) + cl_loss

The l2_reg_loss() terms in these three losses are different. Is there something I missed? Looking forward to your reply.

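A question about how the distribution plots in SimGCL are drawn

Sorry to bother you. I recently came across your SimGCL work published at SIGIR'22 and am very interested in the distribution plots in it. Would it be possible to get the code for these plots? Thank you in advance.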

Dear authors, are there plans to add reproductions of new models to the repository, such as LightGCL?

(ICLR'23) LightGCL: Simple Yet Effective Graph Contrastive Learning for Recommendation

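A question about the InfoNCE used in SimGCL and XSimGCL

Hello, I have had the pleasure of reading your two excellent works, SimGCL and XSimGCL. XSimGCL in particular is so simple, elegant, and effective.

Looking at the InfoNCE function used by the two models, I found that it only considers the similarity between positive samples and does not care about the similarity between negative samples. What is the reason for this? (This may be a simple question; I would be very grateful for a reply.)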

def InfoNCE(view1, view2, temperature: float, b_cos: bool = True):
    if b_cos:
        view1, view2 = F.normalize(view1, dim=1), F.normalize(view2, dim=1)
    pos_score = (view1 @ view2.T) / temperature
    score = torch.diag(F.log_softmax(pos_score, dim=1))
    return -score.mean()

I am also confused because this is inconsistent with the loss function in the XSimGCL paper.
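Written out per row, the diagonal of log_softmax in the quoted function equals the positive score minus the log-sum-exp over all scores in that row, so the other in-batch samples enter the loss through the denominator. The pure-Python sketch below reproduces that arithmetic for small inputs; info_nce is a hypothetical name, and cosine normalization is always applied (the b_cos=True case).

```python
import math


def info_nce(view1, view2, temperature):
    """Mean over rows of logsumexp(row scores) - positive (diagonal) score."""
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]

    v1 = [normalize(v) for v in view1]
    v2 = [normalize(v) for v in view2]
    losses = []
    for i, a in enumerate(v1):
        # Scores of row i against every row of view2; index i is the positive.
        logits = [sum(x * y for x, y in zip(a, b)) / temperature for b in v2]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

Making a non-matching row of view2 more similar to a row of view1 raises the denominator and therefore the loss, which is how negatives are penalized even though only the diagonal is read out.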

[Bug] A small issue in the best performance code

SELFRec/base/graph_recommender.py

Lines 107 to 111 in 3fc66eb

 for m in measure[1:]:
     k, v = m.strip().split(':')
     performance[k] = float(v)
     self.bestPerformance.append(performance)
 self.save()

In this code, if no best result has been saved before, shouldn't the append be outside the loop?

 for m in measure[1:]: 
     k, v = m.strip().split(':') 
     performance[k] = float(v)
-    self.bestPerformance.append(performance) 
+self.bestPerformance.append(performance) 
 self.save()
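The effect of the proposed change can be seen in isolation with the sketch below: the metric lines are parsed into one dict, and the dict is appended once after the loop. parse_measures and the sample metric lines are illustrative assumptions, not the repository's code.

```python
def parse_measures(measure):
    """Parse 'Key:value' lines into a dict, skipping the header line."""
    performance = {}
    for m in measure[1:]:
        k, v = m.strip().split(':')
        performance[k] = float(v)
    return performance


best_performance = []
sample = ['Top 20\n', 'Recall:0.05\n', 'NDCG:0.04\n']  # made-up values
# Append once, after the loop has filled the dict.
best_performance.append(parse_measures(sample))
```

Appending inside the loop would instead add the same dict object once per metric line.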

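With the same seed set, results differ between runs after adding contrastive learning

The difference between the contrastive loss values of the two runs gradually increases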

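How exactly the SimGCL feature distribution plots are drawn

Hello, I would like to ask exactly how the feature distribution plots in SimGCL are drawn. I would be very grateful for an answer.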

Sequential models loss outputting nan after a few epochs

When training the sequential models such as CL4SRec, after a few epochs of training, I'm getting nans for the batch_loss and rec_loss. For instance see the output below:

## CL4SRec
Epoch: 2, Hit Ratio:0.02066  |  Precision:0.00103  |  Recall:0.02066  |  NDCG:0.00784
*Best Performance* 
Epoch: 2, Hit Ratio:0.02066  |  NDCG:0.00784
------------------------------------------------------------------------------------------------------------------------
training: 3 batch 50 batch_loss: 0.5157323479652405 rec_loss: 0.4582507908344269
Evaluating the model...
Progress: [++++++++++++++++++++++++++++++++++++++++++++++++++]100%
------------------------------------------------------------------------------------------------------------------------
Real-Time Ranking Performance  (Top-20 Item Recommendation)
*Current Performance*
Epoch: 3, Hit Ratio:0.03372  |  Precision:0.00169  |  Recall:0.03372  |  NDCG:0.01302
*Best Performance* 
Epoch: 3, Hit Ratio:0.03372  |  NDCG:0.01302
------------------------------------------------------------------------------------------------------------------------
training: 4 batch 50 batch_loss: 0.4821103513240814 rec_loss: 0.4299513101577759
Evaluating the model...
Progress: [++++++++++++++++++++++++++++++++++++++++++++++++++]100%
------------------------------------------------------------------------------------------------------------------------
Real-Time Ranking Performance  (Top-20 Item Recommendation)
*Current Performance*
Epoch: 4, Hit Ratio:0.04212  |  Precision:0.00211  |  Recall:0.04212  |  NDCG:0.01622
*Best Performance* 
Epoch: 4, Hit Ratio:0.04212  |  NDCG:0.01622
------------------------------------------------------------------------------------------------------------------------
training: 5 batch 50 batch_loss: nan rec_loss: nan

Any ideas what could be causing this?

P.S. This is with the amazon-beauty dataset; some of the other datasets don't load with this model.

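import pickle
import numpy as np

# Load the pickled user list and sample 2000 indices (with replacement)
with open('user', 'rb') as ue:
    user = pickle.load(ue)
udx = np.random.choice(len(user), 2000)
embs = ['user_lgcn.emb', 'user_sgl.emb', 'user_simgcl.emb']
models = ['LightGCN','SGL-ED','SimGCL']
data = {}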

I would like to ask whether the user embedding and item embedding used for the distribution plots in SimGCL are the optimized initial embeddings:

  def _init_model(self):
      initializer = nn.init.xavier_uniform_
      embedding_dict = nn.ParameterDict({
          'user_emb': nn.Parameter(initializer(torch.empty(self.data.user_num, self.emb_size))),
          'item_emb': nn.Parameter(initializer(torch.empty(self.data.item_num, self.emb_size))),
      })
      return embedding_dict

or the user embedding and item embedding obtained after the model's forward pass?

A question about running the SGL code

Hello, how are SGL-ND, SGL-ED, SGL-RW, and SGL-WA run in this project?
