shenweichen / deepctr
Easy-to-use, Modular and Extendible package of deep-learning based CTR models.
Home Page: https://deepctr-doc.readthedocs.io/en/latest/index.html
License: Apache License 2.0
Because each sparse feature has a corresponding value, e.g. field:feature_index:feature_value.
Also, after looking at the intermediate results for multi-value data input, it seems this is not supported at the moment.
Will sparse input formats such as libffm be supported in the future?
Describe the question(问题描述)
Hi, as a newcomer, thanks a lot for this contribution.
When running DIN on my own data, the preprocessed history-behavior input has variable length. For example, the input is (user id, user gender, ad id, ad category, ids of ads the user clicked, categories of ads the user clicked):
uid,ugender,iid,icate,hist_iid,hist_icate
13,1,24,3,[1,7],[2,5]
13,1,13,1,[1,7,24],[2,5,3]
Taking two consecutive records of one user after sorting the history by time, as above, the history columns of the two records have different lengths, and feeding them into the model raises an error.
I tried padding the records to equal-length sequences. The model then runs, but the input data blows up in size while the useful information stays the same. How should variable-length history inputs be handled?
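For reference, padding plus a length mask is the standard remedy; a minimal sketch of the padding step (the helper name is hypothetical; tf.keras.preprocessing.sequence.pad_sequences with padding='post' does the same job):

```python
import numpy as np

def pad_histories(histories, max_len, pad_value=0):
    """Pad (or truncate) each variable-length history to max_len.

    0 is reserved as the padding id, so real item ids should start
    from 1. Equivalent in spirit to Keras pad_sequences(padding='post').
    """
    out = np.full((len(histories), max_len), pad_value, dtype=np.int64)
    for i, hist in enumerate(histories):
        trimmed = hist[:max_len]            # truncate overly long histories
        out[i, :len(trimmed)] = trimmed     # left-align, pad the rest
    return out

hist_iid = [[1, 7], [1, 7, 24]]
padded = pad_histories(hist_iid, max_len=4)
# padded -> [[1, 7, 0, 0],
#            [1, 7, 24, 0]]
```

Attention-style models such as DIN typically mask out the padded positions using the true sequence length, so padding costs memory but does not add noise; padding only to each batch's maximum length (rather than the global maximum) limits the blow-up.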
Operating environment(运行环境):
Describe the question(问题描述)
I've wrapped the DeepFM model into multi_gpu_model
and I am trying to train on multiple GPUs. However, from the running time and GPU utilization monitoring, I can see that only one GPU is utilized (~50%) at a time while the other GPUs are idle (~2%). Any tips on handling this problem would be appreciated.
Additional context
Add any other context about the problem here.
Operating environment(运行环境):
The utils have SingleFeat and VarLenFeat objects,
but the documentation does not explain their init parameters;
there are just a few lines in the examples.
If I want to build a sparse or dense sequence feature, how should I set it?
for example:
examples/run_dien.py
behavior_feature_list = ["item", "item_gender"]
"item" and "item_gender" is not like a kind of seq_feature_list
make me confuse.
hope answer
During recommendation, if the input contains extracted image features, say a 2048-dimensional vector, can it be mapped directly to a single embedding feature instead of 2048 embedding features?
Describe the question(问题描述)
A clear and concise description of what the question is.
Additional context
Add any other context about the problem here.
Operating environment(运行环境):
As the title says.
Also, can the embedding vectors be extracted after training?
Describe the question(问题描述)
The feature dimension is 5000+, I'm using the PNN model, and all features are continuous. PNN is very slow to initialize and uses a huge amount of memory, close to 30 GB. Then during fit, with about 200k samples, even 60 GB of memory hits an out-of-memory error. Am I using it wrong somewhere?
I also tested other sample sets with about 1500 continuous features and 60k+ samples, and everything runs fine.
Operating environment(运行环境):
Describe the question(问题描述)
Hello Weichen!
I have a question like this.
For example, I have extracted a 2048-dimensional visual feature vector for every item. I need to feed it into your CTR model (like in the docs demo). How can I use these vectors as sparse or dense features to train the model and make predictions?
Thanks a lot!
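One common approach (a sketch under assumptions, not a DeepCTR API): treat the 2048-dim vector as a single dense input and project it to one low-dimensional embedding with a Dense layer, instead of registering 2048 separate features:

```python
import tensorflow as tf

# Hypothetical names; this sketches only the projection idea.
visual_input = tf.keras.Input(shape=(2048,), name="visual_feat")
# A single dense projection turns the whole 2048-dim vector into one
# 8-dim "embedding", rather than 2048 one-dimensional embedding features.
visual_emb = tf.keras.layers.Dense(8, use_bias=False, name="visual_emb")(visual_input)
model = tf.keras.Model(visual_input, visual_emb)
```

The projected vector can then be concatenated with the other feature embeddings before the deep part of the network.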
Additional context
Add any other context about the problem here.
Operating environment(运行环境):
DeepCTR requires tensorflow <= 1.12. The problem is that tensorflow-gpu 1.12 (via pip) has been compiled with CUDA 9. This means that we cannot upgrade to CUDA 10 (unless we compile tensorflow 1.12 from sources, which makes little sense since it has been superseded by a newer version).
Is it possible to upgrade DeepCTR to support tensorflow 1.13? This way we could pip install tensorflow-gpu 1.13, which has been compiled with CUDA 10, so we could upgrade from CUDA 9 to CUDA 10. Thanks.
Describe the question(问题描述)
In order to deal with the imbalanced-sample problem, I would like to change the loss function to tf.nn.weighted_cross_entropy_with_logits and set its pos_weight parameter. However, I don't know how to change it in your framework. I have noticed that you define the loss function through model.compile(), but I don't know whether that approach works for my problem, or how to pass the loss function's extra parameter through it.
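For reference, the formula documented for tf.nn.weighted_cross_entropy_with_logits can be reproduced in NumPy to see what pos_weight does; a sketch of the math only, assuming 0/1 labels and raw logits:

```python
import numpy as np

def weighted_ce(labels, logits, pos_weight):
    """Per-element weighted cross entropy on logits.

    Matches the documented formula of
    tf.nn.weighted_cross_entropy_with_logits:
      pos_weight * labels * -log(sigmoid(logits))
      + (1 - labels) * -log(1 - sigmoid(logits))
    """
    p = 1.0 / (1.0 + np.exp(-logits))
    return -(pos_weight * labels * np.log(p) + (1 - labels) * np.log(1 - p))
```

Note that Keras models in this library end in a sigmoid, so y_pred seen by a compiled loss is a probability rather than a logit; a pos_weight-style loss can be written directly on probabilities as a function def loss(y_true, y_pred) that closes over pos_weight, and passed via model.compile(loss=loss).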
Operating environment(运行环境):
pip installation now installs tensorflow, which overwrites a tensorflow-gpu installation. Please remove tensorflow from the requirements or allow tensorflow-gpu as a replacement.
In production, the training data is huge. Does the deepctr framework support the libsvm format?
I've installed the latest version of deepctr, and when I import SingleFeat I get an import error:
from deepctr import SingleFeat
ImportError: cannot import name 'SingleFeat'
Packages Version:
tensorflow:1.12.0
keras:2.2.2
deepctr:0.2.3
OS: Ubuntu "16.04.5 LTS (Xenial Xerus)"
What about adding a performance benchmark (by AUC and logloss) over datasets such as Criteo?
Describe the question(问题描述)
How to perform regularization on the CIN layer? There seems to be no regularizer parameter in the CIN function in deepctr.layers.interaction.
Additional context
Add any other context about the problem here.
Operating environment(运行环境):
What is the highest GPU utilization one can expect with this?
Question on encoding numerical sparse features:
How do we encode sparse features with non-binary values? Say we have frequency/strength values in X for the sparse features (normalized between 0 and 1). All my sparse features are already stored in separate columns (col2 : col11133).
Currently I do this:
sparse_features = ['col' + str(i) for i in range(2, n)]
dense_features = []
testing_dataframe[sparse_features] = testing_dataframe[sparse_features].fillna(0, )
testing_dataframe[dense_features] = testing_dataframe[dense_features].fillna(0, )
sparse_feature_list = [SingleFeat(feat, 0) for feat in sparse_features]
dense_feature_list = [SingleFeat(feat, 0) for feat in dense_features]
test_model_input = [df_test[feat.name].values for feat in sparse_feature_list] + \
[df_test[feat.name].values for feat in dense_feature_list]
This makes my network take much longer to initialize. I am using an embedding size of 50 with ~11K features.
The examples shown are mostly categorical.
Any assistance would be greatly appreciated!
Describe the bug(问题描述)
When the embedding size is set to "auto", the Concatenate layer can't merge the input embeddings with different sizes at axis=2:
def concat_fun(inputs, axis=-1):
    if len(inputs) == 1:
        return inputs[0]
    else:
        return Concatenate(axis=axis)(inputs)
To Reproduce(复现步骤)
Steps to reproduce the behavior:
The Concatenate layer requires inputs with matching shapes except for the concat axis. Got inputs shapes: [(None, 1, 36), (None, 1, 30), (None, 1, 6), (None, 1, 12), (None, 1, 12), (None, 1, 30), (None, 1, 12)]
Operating environment(运行环境):
Additional context
Add any other context about the problem here.
Describe the question(问题描述)
I am using the DeepFM module and have a lot of data. I would like to train the model in batches, using model.fit():
history = model.fit(train_model_input
, train[target].values
, batch_size=1024
, epochs=1
, verbose=2
, validation_split=0.05)
This seems to work iteratively: my loss does go down and my AUC does go up, but can you confirm this? I have read conflicting articles about it.
Also, I am wondering about the consistency of the embedding across batches. Given that the embedding representation depends on the composition of a dataset, does it make sense, and does it work, if I incrementally train the model with calls to model.fit() on a sequence of files?
Please let me know. Absolutely love this library so far I must say too 👍
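On the first point: Keras models keep their weights between fit() calls, so repeated calls do continue training from where the last one stopped. A tiny self-contained check with a toy model (not DeepFM):

```python
import numpy as np
import tensorflow as tf

# Toy stand-in: the point is only that Keras models retain their
# weights across fit() calls, so each call resumes training.
model = tf.keras.Sequential([tf.keras.Input(shape=(3,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

rng = np.random.RandomState(0)
before = model.get_weights()[0].copy()
for _ in range(2):                         # e.g. one fit() call per data file
    x, y = rng.rand(64, 3), rng.rand(64)
    model.fit(x, y, epochs=1, verbose=0)   # resumes from current weights
after = model.get_weights()[0]
```

Caveat on the embedding question: the embeddings are only consistent across files if every file's categorical values are encoded with one global vocabulary built up front; a value unseen when the vocabulary was built cannot be mapped later.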
Operating environment(运行环境):
The train_model_input here is the model input data from run_classification_criteo.py.
Describe the question(问题描述)
I'm testing the deepctr demo. However, it runs on the CPU. How can I modify the code to run it on my own GPU?
Thanks for your answer.
Additional context
Add any other context about the problem here.
Operating environment(运行环境):
When running the example from the official docs, the error message is:
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[0,0] = 104 is not in [0, 14) [[{{node sparse_emb_18-C14/embedding_lookup}} = ResourceGather[Tindices=DT_INT32, _class=["loc:@training/Adam/gradients/sparse_emb_18-C14/embedding_lookup_grad/Reshape"], dtype=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](sparse_emb_18-C14/embeddings, linear_emb_18-C14/Cast)]]
System versions:
tensorflow:1.11.0
keras:2.2.4
deepctr:0.2.2
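For reference (not from the library itself): "indices[0,0] = 104 is not in [0, 14)" means an integer id of 104 was fed to an embedding whose declared vocabulary size is 14. Building the encoding over the full dataset and declaring the matching dimension avoids this; a pure-Python sketch of the invariant (sklearn's LabelEncoder does the same job):

```python
def build_vocab(values):
    """Map raw categorical values to contiguous integer ids."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

# Encode over ALL data (train + test), then declare dimension = len(vocab);
# every encoded id is then guaranteed to fall inside [0, len(vocab)).
raw_c14 = ["a", "b", "c", "a", "d"]
vocab = build_vocab(raw_c14)
encoded = [vocab[v] for v in raw_c14]
assert max(encoded) < len(vocab)  # the invariant embedding_lookup needs
```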
model.fit() runs fine, but predicting raises an error. Is there an example that runs model.predict() or model.predict_on_batch() correctly?
'''
output = model.predict(model_output, batch_size=256, verbose=0, steps=None)
'''
The error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[139, 0] = 28 is not in [0, 27]
When running the demo, I get AttributeError: 'float' object has no attribute '_unconditional_loss'. How can this be resolved?
Describe the question(问题描述)
It's difficult for me to understand doing an embedding for a dense feature. Why do we need to do that? Thanks!
Additional context
Add any other context about the problem here.
Operating environment(运行环境):
Why is sparse not set to True in this line?
sparse_input[feat.name] = Input(shape=(1,), name=prefix+'sparse_' + str(i) + '-' + feat.name)
The signature of tensorflow.python.keras.layers.Input is:
@tf_export('keras.layers.Input', 'keras.Input')
def Input(  # pylint: disable=invalid-name
    shape=None, batch_size=None, name=None, dtype=None,
    sparse=False, tensor=None, **kwargs):
After the cross in the FM layer, the output in the code seems to be 1-dimensional. Shouldn't it be K (the embedding dimension)?
Describe the question(问题描述)
When I use pip install deepctr, I find the arguments of DeepFM() are not consistent with those in the source code. Which version should I use?
Operating environment(运行环境):
Training a model with DCN from the deepctr package raises no error. Saving the model:
save_model(model, outfile_model2)  # outfile_model2 = "./model/DCN.h5"
also raises no error, but loading it:
model = load_model(outfile_model2, custom_objects)
raises: TypeError: issubclass() arg 1 must be a class
@shenweichen Could you take a look at where the problem is and how to fix it? I get the same error when calling the package's other algorithms.
Describe the solution you'd like
new methods:
Hi, DIN and xDeepFM both involve embeddings of multivalent categorical variables. How should such features be fed into the model?
Describe the bug(问题描述)
custom_objects from deepctr.utils can't be imported, but models, SingleFeat etc. can be imported properly.
To Reproduce(复现步骤)
Steps to reproduce the behavior:
pipenv shell  # as I use pipenv
python
from deepctr.utils import custom_objects
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: cannot import name 'custom_objects'
Operating environment(运行环境):
Python: 3.6
tensorflow: 1.12.0
deepctr: 0.3.2
Additional context
Exact same code works well on my other PC.
Thanks for this great work.
Hi, I have a requirement now.
The input data for CTR prediction is time-sequential, and I'm not sure how to implement this.
Currently, based on a single time step, DeepFM gives a result, but I'd like to change the input to a time series to improve performance.
Describe the bug(问题描述)
I have a sparse feature with 1M items and 256-dimensional embeddings. That is 1 GB of memory. The "Adam" optimizer needs 2 more copies in GPU memory, so around 3 GB in total. But it still OOMs on a GPU with 8 GB of memory.
Is there any way to reduce GPU memory usage, like putting some tensors on the CPU instead?
Thanks.
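DeepCTR itself does not expose a placement option as far as this thread shows, but plain TensorFlow can pin a variable to host memory with tf.device; a tiny sketch (shape kept small for illustration):

```python
import tensorflow as tf

# Pin the (normally huge) embedding table to host RAM; only the rows
# gathered by a lookup then travel to the GPU. Optimizer slot variables
# (Adam's two extra copies) are colocated with the variable, so they
# stay on the CPU as well. Shape kept tiny here for illustration.
with tf.device("/cpu:0"):
    embedding_table = tf.Variable(tf.zeros((1000, 16)), name="big_emb")
```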
To Reproduce(复现步骤)
Steps to reproduce the behavior:
For example, a movie can have multiple tags. Is this kind of embedding supported?
Hello!
I'd like to try the yoyi dataset with this project. The version I have was one-hot processed by Weinan Zhang's team; both the training set and the test set look like this:
label market_price feature1 feature2 feature3 ... featureN
1 10 122:1 223:1 ... 2001:1
0 20 433:1 890:1 ... 8981:1
...
After dropping the market price it can be used for CTR prediction. But I found that deepctr encodes the dataset by processing numerical and categorical features directly on the raw data. For datasets like yoyi, where the raw data is unavailable and only the one-hot encoded data exists, how should it be handled? Thanks.
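A sketch of one way to consume such pre-encoded data (the helper name is hypothetical): parse each "index:value" token into a list of active indices, treating every line as a bag of one-hot ids:

```python
def parse_onehot_line(line):
    """Parse 'label price idx:1 idx:1 ...' into (label, price, indices)."""
    parts = line.split()
    label, price = int(parts[0]), float(parts[1])
    indices = [int(tok.split(":")[0]) for tok in parts[2:]]
    return label, price, indices

label, price, idx = parse_onehot_line("1 10 122:1 223:1 2001:1")
# label=1, price=10.0, idx=[122, 223, 2001]
```

Since only encoded ids are available, each field can be treated as a sparse feature whose vocabulary size is the overall maximum index + 1; the raw values are not needed for the embedding lookup itself.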
As the title says.
Describe the question(问题描述)
A question: suppose the training data is very large and has to be read and processed line by line. What changes are needed on the input side? Example code would be great.
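A sketch of the usual approach (parse_line is a hypothetical callable): stream the file with a Python generator so the full dataset never sits in memory, then train via fit_generator (TF 1.x-era Keras; plain model.fit accepts generators in later versions):

```python
import numpy as np

def batch_generator(path, batch_size, parse_line):
    """Yield (features, labels) batches by streaming the file line by line.

    parse_line is a hypothetical callable returning (feature_vector, label).
    Loops over the file forever, as Keras generators are expected to.
    """
    while True:
        xs, ys = [], []
        with open(path) as f:
            for line in f:
                x, y = parse_line(line)
                xs.append(x)
                ys.append(y)
                if len(xs) == batch_size:
                    yield np.array(xs), np.array(ys)
                    xs, ys = [], []
```

Usage would look like model.fit_generator(batch_generator(path, 1024, parse_line), steps_per_epoch=n_samples // 1024, epochs=...).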
Additional context
Add any other context about the problem here.
Operating environment(运行环境):
Describe the question(问题描述)
Hello, I would like to understand the difference between setting:
embed_size = 'auto'
vs.
setting embed_size = 1
ALL of my features are numerical and DENSE, binary; num features = 11190.
I have tried to print out tensor sizes and the model.summary() as follows:
print('AUTO')
model = DeepFM({"sparse": sparse_feature_list,
"dense": dense_feature_list}
, embedding_size='auto'
, use_fm=True
, hidden_size=(10,10)
, l2_reg_linear=0.0001
, l2_reg_embedding=0.00001
, l2_reg_deep=0.0001
, init_std=0.0001
, seed=1024
, keep_prob=0.8
, activation='relu'
, final_activation='sigmoid'
, use_bn=False)
print(model.summary())
print('EMBED 1')
model = DeepFM({"sparse": sparse_feature_list,
"dense": dense_feature_list}
, embedding_size=1
, use_fm=True
, hidden_size=(10,10)
, l2_reg_linear=0.0001
, l2_reg_embedding=0.00001
, l2_reg_deep=0.0001
, init_std=0.0001
, seed=1024
, keep_prob=0.8
, activation='relu'
, final_activation='sigmoid'
, use_bn=False)
print(model.summary())
Here's what I see:
FOR AUTO:
Total params: 123,221
Trainable params: 123,221
Non-trainable params: 0
for EMBED 1:
Total params: 123,222
Trainable params: 123,222
Non-trainable params: 0
I have printed out all the layers and we can see the shape of the fm_input:
FOR AUTO:
fm_inputshape - (?, 1, 11190) -- deepinputshape - (?, 11190)
for EMBED 1:
fm_inputshape - (?, 11190, 1) -- deepinputshape - (?, 11190)
Also, with EMBED 1 I run out of memory at a reasonable batch size, while with EMBED AUTO I do not, and the model compiles faster.
Could you please help explain the difference?
Thanks!
Leo
Operating environment(运行环境):
Describe the question(问题描述)
I want to get the embedding vectors of deepfm and use them as inputs of other models, like this:
https://github.com/jfpuget/LibFM_in_Keras/blob/master/keras_blog.ipynb
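In plain Keras, trained embedding tables can be read back by layer name via get_layer(...).get_weights(); a toy example (the layer name here is made up; in a real model the embedding layers can be located with model.summary(), and names like sparse_emb_18-C14 appear in an error log elsewhere in this thread):

```python
import tensorflow as tf

# Toy model standing in for DeepFM's embedding part.
inp = tf.keras.Input(shape=(1,))
emb = tf.keras.layers.Embedding(10, 4, name="sparse_emb_demo")(inp)
model = tf.keras.Model(inp, emb)

# One row per feature id, shape (vocab_size, embedding_dim).
table = model.get_layer("sparse_emb_demo").get_weights()[0]
```

The extracted table can then be saved and fed to another model as a lookup matrix.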
Additional context
Add any other context about the problem here.
TypeError: unsupported operand type(s) for /: 'Dimension' and 'float'
Describe the question(问题描述)
A clear and concise description of what the question is.
With the same parameters, different runs give different results, even though the DeepFM random seed is the same.
Additional context
Add any other context about the problem here.
Operating environment(运行环境):
Hello 浅梦, I'm a TensorFlow newcomer and one thing is unclear to me:
In the LocalActivationUnit module of the DIN model, the DNN is created inside call(), which means a new DNN is created for every user-item pair, right?
If so, that seems unreasonable to me, because as I understand it all user-item pairs should share the same DNN.
I don't know whether my understanding is off or the code logic has a problem. Please advise, thanks!
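For reference, the standard Keras pattern that shares weights across calls is to create sublayers once, in build() (or __init__), rather than in call(); a minimal generic illustration of the pattern (not DeepCTR's actual code):

```python
import tensorflow as tf

class SharedMLP(tf.keras.layers.Layer):
    """Sublayer created once in build(), so all calls share its weights."""

    def build(self, input_shape):
        self.dense = tf.keras.layers.Dense(8)
        super().build(input_shape)

    def call(self, inputs):
        # Reuses self.dense every time; creating the Dense here inside
        # call() instead would not share weights across invocations.
        return self.dense(inputs)
```

Calling the layer on several inputs keeps a single kernel/bias pair, which is exactly the sharing the question asks about.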