shenweichen / deepctr
Easy-to-use, Modular and Extendible package of deep-learning based CTR models.
Home Page: https://deepctr-doc.readthedocs.io/en/latest/index.html
License: Apache License 2.0
Because each sparse feature has a corresponding value, e.g. field:feature_index:feature_value.
Also, after looking at the intermediate results for multi-value data input, it seems this is not supported at the moment.
Will sparse input formats such as libffm be supported in the future?
Describe the question(问题描述)
Hi, as a newcomer, thanks a lot for this contribution.
When running DIN on my own data, the preprocessed history-behavior input has variable length. For example, the input is (user id, user gender, ad id, ad category, ids of ads the user clicked, categories of ads the user clicked):
uid,ugender,iid,icate,hist_iid,hist_icate
13,1,24,3,[1,7],[2,5]
13,1,13,1,[1,7,24],[2,5,3]
Taking two consecutive records of one user after sorting the history by time, as above, the history columns of the two records have different lengths, and feeding them into the model raises an error.
I tried padding the records to equal-length sequences. The model then runs, but the input data blows up in size while the useful information stays the same. How should variable-length history inputs be handled?
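For reference, padding plus a length mask is the standard remedy; a minimal sketch of the padding step (the helper name is hypothetical; tf.keras.preprocessing.sequence.pad_sequences with padding='post' does the same job):

```python
import numpy as np

def pad_histories(histories, max_len, pad_value=0):
    """Pad (or truncate) each variable-length history to max_len.

    0 is reserved as the padding id, so real item ids should start
    from 1. Equivalent in spirit to Keras pad_sequences(padding='post').
    """
    out = np.full((len(histories), max_len), pad_value, dtype=np.int64)
    for i, hist in enumerate(histories):
        trimmed = hist[:max_len]            # truncate overly long histories
        out[i, :len(trimmed)] = trimmed     # left-align, pad the rest
    return out

hist_iid = [[1, 7], [1, 7, 24]]
padded = pad_histories(hist_iid, max_len=4)
# padded -> [[1, 7, 0, 0],
#            [1, 7, 24, 0]]
```

Attention-style models such as DIN typically mask out the padded positions using the true sequence length, so padding costs memory but does not add noise; padding only to each batch's maximum length (rather than the global maximum) limits the blow-up.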
Operating environment(运行环境):
Describe the question(问题描述)
I've wrapped the DeepFM model into multi_gpu_model
and I am trying to train on multiple GPUs. However, from the running time and GPU utilization monitoring, I can see that only one GPU is utilized (~50%) at a time while the other GPUs are idle (~2%). Any tips on handling this problem would be appreciated.
Additional context
Add any other context about the problem here.
Operating environment(运行环境):
The utils have SingleFeat and VarLenFeat objects,
but the documentation does not explain their init parameters;
there are just a few lines in the examples.
If I want to build a sparse or dense sequence feature, how should I set it?
for example:
examples/run_dien.py
behavior_feature_list = ["item", "item_gender"]
"item" and "item_gender" is not like a kind of seq_feature_list
make me confuse.
hope answer
During recommendation, if the input contains extracted image features, say a 2048-dimensional vector, can it be mapped directly to a single embedding feature instead of 2048 embedding features?
Describe the question(问题描述)
A clear and concise description of what the question is.
Additional context
Add any other context about the problem here.
Operating environment(运行环境):
As the title says.
Also, can the embedding vectors be extracted after training?
Describe the question(问题描述)
The feature dimension is 5000+, I'm using the PNN model, and all features are continuous. PNN is very slow to initialize and uses a huge amount of memory, close to 30 GB. Then during fit, with about 200k samples, even 60 GB of memory hits an out-of-memory error. Am I using it wrong somewhere?
I also tested other sample sets with about 1500 continuous features and 60k+ samples, and everything runs fine.
Operating environment(运行环境):
Describe the question(问题描述)
Hello Weichen!
I have a question like this.
For example, I have extracted a 2048-dimensional visual feature vector for every item. I need to feed it into your CTR model (like in the docs demo). How can I use these vectors as sparse or dense features to train the model and make predictions?
Thanks a lot!
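One common approach (a sketch under assumptions, not a DeepCTR API): treat the 2048-dim vector as a single dense input and project it to one low-dimensional embedding with a Dense layer, instead of registering 2048 separate features:

```python
import tensorflow as tf

# Hypothetical names; this sketches only the projection idea.
visual_input = tf.keras.Input(shape=(2048,), name="visual_feat")
# A single dense projection turns the whole 2048-dim vector into one
# 8-dim "embedding", rather than 2048 one-dimensional embedding features.
visual_emb = tf.keras.layers.Dense(8, use_bias=False, name="visual_emb")(visual_input)
model = tf.keras.Model(visual_input, visual_emb)
```

The projected vector can then be concatenated with the other feature embeddings before the deep part of the network.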
Additional context
Add any other context about the problem here.
Operating environment(运行环境):
DeepCTR requires tensorflow <= 1.12. The problem is that tensorflow-gpu 1.12 (via pip) has been compiled with CUDA 9. This means that we cannot upgrade to CUDA 10 (unless we compile tensorflow 1.12 from sources, which makes little sense since it has been superseded by a newer version).
Is it possible to upgrade DeepCTR to support tensorflow 1.13? This way we could pip install tensorflow-gpu 1.13, which has been compiled with CUDA 10, so we could upgrade from CUDA 9 to CUDA 10. Thanks.
Describe the question(问题描述)
In order to deal with the imbalanced-sample problem, I would like to change the loss function to tf.nn.weighted_cross_entropy_with_logits and set its pos_weight parameter. However, I don't know how to change it in your framework. I have noticed that you define the loss function through model.compile(), but I don't know whether that approach works for my problem, or how to pass the loss function's extra parameter through it.
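For reference, the formula documented for tf.nn.weighted_cross_entropy_with_logits can be reproduced in NumPy to see what pos_weight does; a sketch of the math only, assuming 0/1 labels and raw logits:

```python
import numpy as np

def weighted_ce(labels, logits, pos_weight):
    """Per-element weighted cross entropy on logits.

    Matches the documented formula of
    tf.nn.weighted_cross_entropy_with_logits:
      pos_weight * labels * -log(sigmoid(logits))
      + (1 - labels) * -log(1 - sigmoid(logits))
    """
    p = 1.0 / (1.0 + np.exp(-logits))
    return -(pos_weight * labels * np.log(p) + (1 - labels) * np.log(1 - p))
```

Note that Keras models in this library end in a sigmoid, so y_pred seen by a compiled loss is a probability rather than a logit; a pos_weight-style loss can be written directly on probabilities as a function def loss(y_true, y_pred) that closes over pos_weight, and passed via model.compile(loss=loss).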
Operating environment(运行环境):
pip installation now installs tensorflow, which overwrites a tensorflow-gpu installation. Please remove tensorflow from the requirements or allow tensorflow-gpu as a replacement.
In production, the training data is huge. Does the deepctr framework support the libsvm format?
I've installed the latest version of deepctr, and when I import SingleFeat I get an import error:
from deepctr import SingleFeat
ImportError: cannot import name 'SingleFeat'
Packages Version:
tensorflow:1.12.0
keras:2.2.2
deepctr:0.2.3
OS: Ubuntu "16.04.5 LTS (Xenial Xerus)"
What about adding a performance benchmark (by AUC and logloss) over datasets such as Criteo?
Describe the question(问题描述)
How to perform regularization on the CIN layer? There seems to be no regularizer parameter in the CIN function in deepctr.layers.interaction.
Additional context
Add any other context about the problem here.
Operating environment(运行环境):
What is the highest GPU utilization one can expect with this?
Question on encoding numerical sparse features:
How do we encode sparse features with non-binary values? Say we have frequency/strength values in X for the sparse features (normalized between 0 and 1). All my sparse features are already stored in separate columns (col2 : col11133).
Currently I do this:
sparse_features = ['col' + str(i) for i in range(2, n)]
dense_features = []
testing_dataframe[sparse_features] = testing_dataframe[sparse_features].fillna(0, )
testing_dataframe[dense_features] = testing_dataframe[dense_features].fillna(0, )
sparse_feature_list = [SingleFeat(feat, 0) for feat in sparse_features]
dense_feature_list = [SingleFeat(feat, 0) for feat in dense_features]
test_model_input = [df_test[feat.name].values for feat in sparse_feature_list] + \
[df_test[feat.name].values for feat in dense_feature_list]
This makes my network take much longer to initialize. I am using an embedding size of 50 with ~11K features.
The examples shown are mostly categorical.
Any assistance would be greatly appreciated!
Describe the bug(问题描述)
When the embedding size is set to "auto", the Concatenate layer can't merge the input embeddings with different sizes at axis=2:
def concat_fun(inputs, axis=-1):
    if len(inputs) == 1:
        return inputs[0]
    else:
        return Concatenate(axis=axis)(inputs)
To Reproduce(复现步骤)
Steps to reproduce the behavior:
The Concatenate layer requires inputs with matching shapes except for the concat axis. Got inputs shapes: [(None, 1, 36), (None, 1, 30), (None, 1, 6), (None, 1, 12), (None, 1, 12), (None, 1, 30), (None, 1, 12)]
Operating environment(运行环境):
Additional context
Add any other context about the problem here.
Describe the question(问题描述)
I am using the DeepFM module and have a lot of data. I would like to train the model in batches, using model.fit():
history = model.fit(train_model_input
, train[target].values
, batch_size=1024
, epochs=1
, verbose=2
, validation_split=0.05)
This seems to work iteratively: my loss does go down and my AUC does go up, but can you confirm this? I have read conflicting articles about it.
Also, I am wondering about the consistency of the embedding across batches. Given that the embedding representation depends on the composition of a dataset, does it make sense, and does it work, if I incrementally train the model with calls to model.fit() on a sequence of files?
Please let me know. Absolutely love this library so far I must say too 👍
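On the first point: Keras models keep their weights between fit() calls, so repeated calls do continue training from where the last one stopped. A tiny self-contained check with a toy model (not DeepFM):

```python
import numpy as np
import tensorflow as tf

# Toy stand-in: the point is only that Keras models retain their
# weights across fit() calls, so each call resumes training.
model = tf.keras.Sequential([tf.keras.Input(shape=(3,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

rng = np.random.RandomState(0)
before = model.get_weights()[0].copy()
for _ in range(2):                         # e.g. one fit() call per data file
    x, y = rng.rand(64, 3), rng.rand(64)
    model.fit(x, y, epochs=1, verbose=0)   # resumes from current weights
after = model.get_weights()[0]
```

Caveat on the embedding question: the embeddings are only consistent across files if every file's categorical values are encoded with one global vocabulary built up front; a value unseen when the vocabulary was built cannot be mapped later.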
Operating environment(运行环境):
The train_model_input here is the model input data from run_classification_criteo.py.
Describe the question(问题描述)
I'm testing the deepctr demo. However, it runs on the CPU. How can I modify the code to run it on my own GPU?
Thanks for your answer.
Additional context
Add any other context about the problem here.
Operating environment(运行环境):
When running the example from the official docs, the error message is:
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,0] = 104 is not in [0, 14) [[{{node sparse_emb_18-C14/embedding_lookup}} = ResourceGather[Tindices=DT_INT32, _class=["loc:@training/Adam/gradients/sparse_emb_18-C14/embedding_lookup_grad/Reshape"], dtype=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](sparse_emb_18-C14/embeddings, linear_emb_18-C14/Cast)]]
System versions:
tensorflow:1.11.0
keras:2.2.4
deepctr:0.2.2
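For reference (not from the library itself): "indices[0,0] = 104 is not in [0, 14)" means an integer id of 104 was fed to an embedding whose declared vocabulary size is 14. Building the encoding over the full dataset and declaring the matching dimension avoids this; a pure-Python sketch of the invariant (sklearn's LabelEncoder does the same job):

```python
def build_vocab(values):
    """Map raw categorical values to contiguous integer ids."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

# Encode over ALL data (train + test), then declare dimension = len(vocab);
# every encoded id is then guaranteed to fall inside [0, len(vocab)).
raw_c14 = ["a", "b", "c", "a", "d"]
vocab = build_vocab(raw_c14)
encoded = [vocab[v] for v in raw_c14]
assert max(encoded) < len(vocab)  # the invariant embedding_lookup needs
```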
model.fit() runs fine, but predicting raises an error. Is there an example that runs model.predict() or model.predict_on_batch() correctly?
'''
output = model.predict(model_output, batch_size=256, verbose=0, steps=None)
'''
The error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[139, 0] = 28 is not in [0, 27]
When running the demo, I get AttributeError: 'float' object has no attribute '_unconditional_loss'. How can this be resolved?
Describe the question(问题描述)
It's difficult for me to understand doing an embedding for a dense feature. Why do we need to do that? Thanks!
Additional context
Add any other context about the problem here.
Operating environment(运行环境):
Why is sparse not set to True in this line?
sparse_input[feat.name] = Input(shape=(1,), name=prefix+'sparse_' + str(i) + '-' + feat.name)
The signature of tensorflow.python.keras.layers.Input is:
@tf_export('keras.layers.Input', 'keras.Input')
def Input(  # pylint: disable=invalid-name
    shape=None, batch_size=None, name=None, dtype=None,
    sparse=False, tensor=None, **kwargs):
After the cross in the FM layer, the output in the code seems to be 1-dimensional. Shouldn't it be K (the embedding dimension)?
Describe the question(问题描述)
When I use pip install deepctr, I find the arguments of DeepFM() are not consistent with those in the source code. Which version should I use?
Operating environment(运行环境):
Training a model with DCN from the deepctr package raises no error. Saving the model:
save_model(model, outfile_model2)  # outfile_model2 = "./model/DCN.h5"
also raises no error, but loading it:
model = load_model(outfile_model2, custom_objects)
raises: TypeError: issubclass() arg 1 must be a class
@shenweichen Could you take a look at where the problem is and how to fix it? I get the same error when calling the package's other algorithms.
Describe the solution you'd like
new methods:
Hi, DIN and xDeepFM both involve embeddings of multivalent categorical variables. How should such features be fed into the model?
Describe the bug(问题描述)
custom_objects from deepctr.utils can't be imported, but models, SingleFeat etc. can be imported properly.
To Reproduce(复现步骤)
Steps to reproduce the behavior:
pipenv shell  # as I use pipenv
python
from deepctr.utils import custom_objects
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: cannot import name 'custom_objects'
Operating environment(运行环境):
Python: 3.6
tensorflow: 1.12.0
deepctr: 0.3.2
Additional context
Exact same code works well on my other PC.
Thanks for this great work.
Hi, I have a requirement now.
The input data for CTR prediction is time-sequential, and I'm not sure how to implement this.
Currently, based on a single time step, DeepFM gives a result, but I'd like to change the input to a time series to improve performance.
Describe the bug(问题描述)
I have a sparse feature with 1M items and 256-dimensional embeddings. That is 1 GB of memory. The "Adam" optimizer needs 2 more copies in GPU memory, so around 3 GB in total. But it still OOMs on a GPU with 8 GB of memory.
Is there any way to reduce GPU memory usage, like putting some tensors on the CPU instead?
Thanks.
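DeepCTR itself does not expose a placement option as far as this thread shows, but plain TensorFlow can pin a variable to host memory with tf.device; a tiny sketch (shape kept small for illustration):

```python
import tensorflow as tf

# Pin the (normally huge) embedding table to host RAM; only the rows
# gathered by a lookup then travel to the GPU. Optimizer slot variables
# (Adam's two extra copies) are colocated with the variable, so they
# stay on the CPU as well. Shape kept tiny here for illustration.
with tf.device("/cpu:0"):
    embedding_table = tf.Variable(tf.zeros((1000, 16)), name="big_emb")
```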
To Reproduce(复现步骤)
Steps to reproduce the behavior:
For example, a movie can have multiple tags. Is this kind of embedding supported?
Hello!
I'd like to try the yoyi dataset with this project. The version I have was one-hot processed by Weinan Zhang's team; both the training set and the test set look like this:
label market_price feature1 feature2 feature3 ... featureN
1 10 122:1 223:1 ... 2001:1
0 20 433:1 890:1 ... 8981:1
...
After dropping the market price it can be used for CTR prediction. But I found that deepctr encodes the dataset by processing numerical and categorical features directly on the raw data. For datasets like yoyi, where the raw data is unavailable and only the one-hot encoded data exists, how should it be handled? Thanks.
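A sketch of one way to consume such pre-encoded data (the helper name is hypothetical): parse each "index:value" token into a list of active indices, treating every line as a bag of one-hot ids:

```python
def parse_onehot_line(line):
    """Parse 'label price idx:1 idx:1 ...' into (label, price, indices)."""
    parts = line.split()
    label, price = int(parts[0]), float(parts[1])
    indices = [int(tok.split(":")[0]) for tok in parts[2:]]
    return label, price, indices

label, price, idx = parse_onehot_line("1 10 122:1 223:1 2001:1")
# label=1, price=10.0, idx=[122, 223, 2001]
```

Since only encoded ids are available, each field can be treated as a sparse feature whose vocabulary size is the overall maximum index + 1; the raw values are not needed for the embedding lookup itself.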
As the title says.
Describe the question(问题描述)
A question: suppose the training data is very large and has to be read and processed line by line. What changes are needed on the input side? Example code would be great.
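A sketch of the usual approach (parse_line is a hypothetical callable): stream the file with a Python generator so the full dataset never sits in memory, then train via fit_generator (TF 1.x-era Keras; plain model.fit accepts generators in later versions):

```python
import numpy as np

def batch_generator(path, batch_size, parse_line):
    """Yield (features, labels) batches by streaming the file line by line.

    parse_line is a hypothetical callable returning (feature_vector, label).
    Loops over the file forever, as Keras generators are expected to.
    """
    while True:
        xs, ys = [], []
        with open(path) as f:
            for line in f:
                x, y = parse_line(line)
                xs.append(x)
                ys.append(y)
                if len(xs) == batch_size:
                    yield np.array(xs), np.array(ys)
                    xs, ys = [], []
```

Usage would look like model.fit_generator(batch_generator(path, 1024, parse_line), steps_per_epoch=n_samples // 1024, epochs=...).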
Additional context
Add any other context about the problem here.
Operating environment(运行环境):
Describe the question(问题描述)
Hello, I would like to understand the difference between setting:
embed_size = 'auto'
vs.
setting embed_size = 1
ALL of my features are numerical and DENSE, binary; num features = 11190.
I have tried to print out tensor sizes and the model.summary() as follows:
print('AUTO')
model = DeepFM({"sparse": sparse_feature_list,
"dense": dense_feature_list}
, embedding_size='auto'
, use_fm=True
, hidden_size=(10,10)
, l2_reg_linear=0.0001
, l2_reg_embedding=0.00001
, l2_reg_deep=0.0001
, init_std=0.0001
, seed=1024
, keep_prob=0.8
, activation='relu'
, final_activation='sigmoid'
, use_bn=False)
print(model.summary())
print('EMBED 1')
model = DeepFM({"sparse": sparse_feature_list,
"dense": dense_feature_list}
, embedding_size=1
, use_fm=True
, hidden_size=(10,10)
, l2_reg_linear=0.0001
, l2_reg_embedding=0.00001
, l2_reg_deep=0.0001
, init_std=0.0001
, seed=1024
, keep_prob=0.8
, activation='relu'
, final_activation='sigmoid'
, use_bn=False)
print(model.summary())
Here's what I see:
FOR AUTO:
Total params: 123,221
Trainable params: 123,221
Non-trainable params: 0
for EMBED 1:
Total params: 123,222
Trainable params: 123,222
Non-trainable params: 0
I have printed out all the layers and we can see the shape of the fm_input:
FOR AUTO:
fm_inputshape - (?, 1, 11190) -- deepinputshape - (?, 11190)
for EMBED 1:
fm_inputshape - (?, 11190, 1) -- deepinputshape - (?, 11190)
Also, with EMBED 1 I run out of memory at a reasonable batch size, while with EMBED AUTO I do not, and the model compiles faster.
Could you please help explain the difference?
Thanks!
Leo
Operating environment(运行环境):
Describe the question(问题描述)
I want to get the embedding vectors of deepfm and use them as inputs of other models, like this:
https://github.com/jfpuget/LibFM_in_Keras/blob/master/keras_blog.ipynb
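In plain Keras, trained embedding tables can be read back by layer name via get_layer(...).get_weights(); a toy example (the layer name here is made up; in a real model the embedding layers can be located with model.summary(), and names like sparse_emb_18-C14 appear in an error log elsewhere in this thread):

```python
import tensorflow as tf

# Toy model standing in for DeepFM's embedding part.
inp = tf.keras.Input(shape=(1,))
emb = tf.keras.layers.Embedding(10, 4, name="sparse_emb_demo")(inp)
model = tf.keras.Model(inp, emb)

# One row per feature id, shape (vocab_size, embedding_dim).
table = model.get_layer("sparse_emb_demo").get_weights()[0]
```

The extracted table can then be saved and fed to another model as a lookup matrix.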
Additional context
Add any other context about the problem here.
TypeError: unsupported operand type(s) for /: 'Dimension' and 'float'
Describe the question(问题描述)
A clear and concise description of what the question is.
With the same parameters, different runs give different results, even though the DeepFM random seed is the same.
Additional context
Add any other context about the problem here.
Operating environment(运行环境):
Hello 浅梦, I'm a TensorFlow newcomer and one thing is unclear to me:
In the LocalActivationUnit module of the DIN model, the DNN is created inside call(), which means a new DNN is created for every user-item pair, right?
If so, that seems unreasonable to me, because as I understand it all user-item pairs should share the same DNN.
I don't know whether my understanding is off or the code logic has a problem. Please advise, thanks!
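For reference, the standard Keras pattern that shares weights across calls is to create sublayers once, in build() (or __init__), rather than in call(); a minimal generic illustration of the pattern (not DeepCTR's actual code):

```python
import tensorflow as tf

class SharedMLP(tf.keras.layers.Layer):
    """Sublayer created once in build(), so all calls share its weights."""

    def build(self, input_shape):
        self.dense = tf.keras.layers.Dense(8)
        super().build(input_shape)

    def call(self, inputs):
        # Reuses self.dense every time; creating the Dense here inside
        # call() instead would not share weights across invocations.
        return self.dense(inputs)
```

Calling the layer on several inputs keeps a single kernel/bias pair, which is exactly the sharing the question asks about.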