
x-deeplearning's Introduction

Overview

X-DeepLearning (XDL) is a complete solution deeply optimized for high-dimensional sparse data scenarios such as advertising, recommendation, and search. XDL 1.2 was released recently; its main features include:

  • Performance optimization for large-batch / low-concurrency scenarios: 50-100% speedup in such scenarios
  • Storage and communication optimization: parameters are placed globally and automatically without manual intervention, requests are merged, and compute/storage/communication hotspots on the PS are eliminated
  • Complete streaming-training features, including feature admission, feature eviction, incremental model export, and feature counting statistics
  • Fixes for several minor bugs from 1.0

See the XDL 1.2 release note for the full description.

1. XDL training engine

2. XDL algorithm solutions

3. Blaze inference engine

4. Deep Tree Matching (TDM) retrieval engine

Contact us

  • Feel free to contact us via issues or the mailing list ([email protected])
  • We are looking for partners: companies or teams interested in an enterprise-grade XDL support program can contact [email protected] to discuss further.

FAQ

Frequently asked questions

License

XDL is licensed under Apache-2.0.

Acknowledgements

XDL is proudly produced by Alibaba's Alimama business unit. The core contributors include Alimama's engineering platform, algorithm platform, targeted-advertising technology, and search-advertising technology teams. The project has also received help from Alibaba's Computing Platform business unit (especially the PAI team).

x-deeplearning's People

Contributors

guoxinyang, icemoon1987, largestone1982, lovickie, qiaohaijun, shanshanpt, shengofsun, yiling-dc, zhougr1993


x-deeplearning's Issues

Does the "auto_rebalance" field only take effect with yarn submit?

I wrote my own Python script to launch distributed XDL from the command line. During the run I noticed large differences in bandwidth usage across the PS instances, so I tried enabling the auto_rebalance field in config.json, but it had no effect. Does this field only work with yarn submit, or should it also take effect when passing -c=config.json on the command line?
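For reference, a minimal config.json sketch with the field enabled. The exact schema is an assumption based on the xdl-yarn-scheduler config format; all values are illustrative only:

```json
{
  "job_name": "xdl_job",
  "docker_image": "your/image",
  "script": "train.py",
  "dependent_dirs": "/path/to/code",
  "auto_rebalance": { "enable": "true" },
  "ps":     { "instance_num": 2, "cpu_cores": 8, "memory_m": 4000, "gpu_cores": 0 },
  "worker": { "instance_num": 4, "cpu_cores": 8, "memory_m": 4000, "gpu_cores": 0 },
  "checkpoint": { "output_dir": "hdfs://host:9000/ckpt" }
}
```

The question above is whether the scheduler reads auto_rebalance at all when launched via -c=config.json rather than yarn submit; the sketch only shows where the field sits.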

Question: can YARN support large-scale distributed ML training?

Since Alimama uses YARN to schedule XDL, does that mean YARN can handle model training on high-dimensional sparse data, i.e. that at Alimama's scale YARN's own limitations do not cause problems (apart from that GPU patch)?

Why ask? Many frameworks have already moved to k8s with kubeflow or similar scheduling frameworks; Hadoop YARN is no longer really mainstream.

Is the Online-Learning code not released?

The official-account article mentions Online-Learning support (de-identified sparse feature representation, real-time feature frequency control, expired-feature eviction, etc.), but I can't find the corresponding implementation in the code. Has it not been released yet?
Also, some ops seem to have no corresponding implementation either.

No documentation for online learning

Hi, I read the official-account article introducing [Online-Learning: large-scale online learning], but I don't see corresponding documentation or code in this project. If they exist, please let me know.

resource-types.xml not found in the CrossMedia example's config

When running the CrossMedia example, conf.Configuration and resource.ResourceUtils cannot find resource-types.xml.

The full log:

2018-12-23 10:31:30,141 INFO xdl.Client: finish uploading files to hdfs
2018-12-23 10:31:30,141 INFO xdl.Client: Upload user files success.
2018-12-23 10:31:30,141 INFO xdl.Client: ApplicationMaster start command is: [$JAVA_HOME/bin/java -Xmx256M com.alibaba.xdl.AppMasterRunner -c=config.json -v=script.tar.gz -u=frankpeng -p=hdfs://master:9000/user/frankpeng/.xdl/application_1545531838842_0001/ 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr]
2018-12-23 10:31:30,231 INFO xdl.Client: local resources: {config.json=resource { scheme: "hdfs" host: "master" port: 9000 file: "/user/frankpeng/.xdl/application_1545531838842_0001/config.json" } size: 831 timestamp: 1545532286328 type: FILE visibility: PUBLIC, xdl-yarn-scheduler-1.0.0-SNAPSHOT-jar-with-dependencies.jar=resource { scheme: "hdfs" host: "master" port: 9000 file: "/user/frankpeng/.xdl/application_1545531838842_0001/xdl-yarn-scheduler-1.0.0-SNAPSHOT-jar-with-dependencies.jar" } size: 4784158 timestamp: 1545532286397 type: FILE visibility: PUBLIC}
2018-12-23 10:31:30,238 INFO xdl.Client: Master add CLASSPATH:$HADOOP_CONF_DIR
2018-12-23 10:31:30,239 INFO xdl.Client: Master add CLASSPATH:$HADOOP_COMMON_HOME/share/hadoop/common/*
2018-12-23 10:31:30,239 INFO xdl.Client: Master add CLASSPATH:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*
2018-12-23 10:31:30,239 INFO xdl.Client: Master add CLASSPATH:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*
2018-12-23 10:31:30,239 INFO xdl.Client: Master add CLASSPATH:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*
2018-12-23 10:31:30,239 INFO xdl.Client: Master add CLASSPATH:$HADOOP_YARN_HOME/share/hadoop/yarn/*
2018-12-23 10:31:30,239 INFO xdl.Client: Master add CLASSPATH:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*
2018-12-23 10:31:30,240 INFO xdl.Client: Setup ApplicationMaster container success.
2018-12-23 10:31:30,262 INFO conf.Configuration: resource-types.xml not found
2018-12-23 10:31:30,262 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2018-12-23 10:31:30,266 INFO xdl.Client: Setup application context success.
2018-12-23 10:31:30,266 INFO xdl.Client: Submitting application application_1545531838842_0001
2018-12-23 10:32:00,675 INFO impl.YarnClientImpl: Submitted application application_1545531838842_0001
2018-12-23 10:32:00,678 INFO xdl.Client: AppMaster host N/A Start waiting application: application_1545531838842_0001 ends.
2018-12-23 10:33:38,574 INFO xdl.Client: Application application_1545531838842_0001 finish with state FAILED
2018-12-23 10:33:38,575 INFO xdl.Utils: ================================FINAL STATUS==================================
2018-12-23 10:33:38,575 INFO xdl.Utils: application_1545531838842_0001 : FAILED
2018-12-23 10:33:38,575 INFO xdl.Utils: ================================FINAL STATUS==================================
2018-12-23 10:34:08,595 INFO xdl.Utils: Delete the hdfs dir:hdfs://master:9000/user/frankpeng/.xdl/application_1545531838842_0001/ success.

Add support for Kubernetes

Currently it seems that x-deeplearning can only run distributed training on YARN. As Kubernetes becomes more and more popular, it should support running on K8s as well.

Distributed training with Hadoop

For distributed training, should Hadoop be installed on the host machine or inside the Docker container?
Running distributed training on a cluster with Hadoop installed fails as follows:
java.lang.IllegalArgumentException: Wrong FS: hdfs://172.16.7.225:4007user/root/.xdl/application_1546239999977_0006, expected: hdfs://172.16.7.225:4007
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
at com.alibaba.xdl.Utils.genAppBasePath(Utils.java:58)
at com.alibaba.xdl.Client.run(Client.java:116)
at com.alibaba.xdl.Client.main(Client.java:354)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1546239999977_0006' doesn't exist in RM.
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:194)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy8.getApplicationReport(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:430)
at com.alibaba.xdl.Client.dealWithInterruption(Client.java:303)
at com.alibaba.xdl.Client$1.run(Client.java:347)

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException): Application with id 'application_1546239999977_0006' doesn't exist in RM.
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:327)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)

at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy7.getApplicationReport(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:191)
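The first exception hints at the cause: the path "hdfs://172.16.7.225:4007user/root/.xdl/…" is missing the "/" after the port, which is what happens when a base URI and a relative path are concatenated without a separator. A minimal Python sketch of the failure mode and a safe join (the helper name is hypothetical, not XDL code):

```python
def join_hdfs(base, *parts):
    """Join an HDFS base URI and path components, guaranteeing '/' separators."""
    return "/".join([base.rstrip("/")] + [p.strip("/") for p in parts])

# Naive concatenation reproduces the malformed path from the stack trace:
bad = "hdfs://172.16.7.225:4007" + "user/root/.xdl"
assert bad == "hdfs://172.16.7.225:4007user/root/.xdl"  # missing '/'

good = join_hdfs("hdfs://172.16.7.225:4007", "user/root/.xdl")
print(good)  # hdfs://172.16.7.225:4007/user/root/.xdl
```

If the defaultFS in your Hadoop config ends without a trailing slash, check how genAppBasePath builds the application directory; ensuring the separator is inserted avoids the "Wrong FS" check failure.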

About more detailed experiments

Have you done side-by-side comparisons with other open-source frameworks, for example on public datasets?

Looking forward to it, and many thanks.

Hoping XDL can export models in a TensorFlow-compatible format

Models exported by XDL currently appear to differ from the TensorFlow model format, and there is no example yet of how to conveniently load such a model online for inference (e.g. a C++ SDK). Suggestions:

  • provide a tool that conveniently converts XDL models to the TensorFlow format
  • or provide a public C++ model-loading API
    Either would let developers quickly deploy models for online inference.

Error when using MXNet

After MXNet built and installed successfully per the documented steps, running the test example still fails with:
AttributeError: 'module' object has no attribute 'mxnet_wrapper'

Why is that?

Add support for the CentOS distribution

As far as we know, many developers use the CentOS distribution in production, so providing a CentOS image and install guide would be welcome.

Typo in the Python implementation of xdl.merged_embedding

In the Python layer of xdl.merged_embedding there is a typo: initializer = initialier. Also, merge_sparse does not seem to be imported.

Separately, is there any efficiency difference between using several separate embeddings and a single merged_embedding?

nvidia-smi and related commands fail inside the GPU container

Testing the XDL Docker image registry.cn-hangzhou.aliyuncs.com/xdl/xdl:ubuntu-gpu-tf1.12 on our server, running nvidia-smi and similar commands inside the container fails with "Failed to initialize NVML: Unknown Error". When running the demo, the model trains on CPU only and cannot use the GPU.
Our server runs CentOS with Docker installed. Is this a driver incompatibility with the Ubuntu-based image, or is there another cause?

Error starting a single XDL instance via k8s

Error: failed to start container "kml-dtmachine-520": Error response from daemon:
OCI runtime create failed: container_linux.go:348:
starting container process caused "process_linux.go:402:
container init caused \"process_linux.go:385:
running prestart hook 0 caused \\\"error running hook:
exit status 1, stdout: , stderr: exec command:
[/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=GPU-73b2a28a-071f-dfe9-b5a4-8a648de3fdc4 --utility --pid=26740 /media/disk1/docker/overlay/c287354d7e8e0d349d8aa3eb92f7cecae7145b1ec19fbc42ec57c2ba9d8b7830/merged]
\\\\nnvidia-container-cli: mount error: file creation failed:
/media/disk1/docker/overlay/c287354d7e8e0d349d8aa3eb92f7cecae7145b1ec19fbc42ec57c2ba9d8b7830/merged/usr/bin/nvidia-smi:
file exists\\\\n\\\"\"": unknown

Starting it manually on the physical machine works fine; it only fails when scheduled through k8s.

k8s v1.11
driver 396.44
default runtime: nvidia-docker

Build error!

Has anyone managed to build this themselves? Following the documented build steps on a macOS Mojave host, I've spent two evenings without a successful build. Is a step missing from the docs? I've already hit a pile of pitfalls.

Error during training

Following the TDM wiki (https://github.com/alibaba/x-deeplearning/wiki/%E6%B7%B1%E5%BA%A6%E6%A0%91%E5%8C%B9%E9%85%8D%E6%A8%A1%E5%9E%8B(TDM)) and configuring Hadoop inside the Docker image, I worked through the "single-machine small dataset" steps up to training. A Hadoop exception then aborted the script before training could start. The error output is below; what could be the cause?

=========================================================
config: {u'ps': {u'instance_num': 16, u'memory_m': 64000, u'gpu_cores': 0, u'cpu_cores': 16}, u'dependent_dirs': u'/home/hcx/tdm_mock/tdm_ub_att_ubuntu', u'script': u'train.py', u'worker': {u'instance_num': 20, u'memory_m': 100000, u'gpu_cores': 2, u'cpu_cores': 46}, u'max_local_failover_times': 3, u'auto_rebalance': {u'enable': u'false'}, u'min_finish_worker_rate': 100, u'max_failover_times': 3, u'job_name': u'xdl_tdm', u'docker_image': u'trn:img', u'checkpoint': {u'output_dir': u'hdfs:/train_ckpt/checkpoint'}}
mv data/ub_tree.pb data/ub_tree.pb.bak
hadoop fs -get hdfs:/tree_data/data/userbehavoir_tree.pb data/ub_tree.pb
Load successfully, leaf node count:hello1
parsers.txt
fs.hdfs
2018-12-29 11:42:41,021 [main] WARN util.NativeCodeLoader (NativeCodeLoader.java:(60)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hello2
hello3
hello4
hdfsGetPathInfo(hdfs:/tree_data/data/userbehavoir_train_sample.dat_[\d]+): getFileInfo error:
IllegalArgumentException: Wrong FS: hdfs:/tree_data/data/userbehavoir_train_sample.dat_[\d]+, expected: hdfs://localhost:9000java.lang.IllegalArgumentException: Wrong FS: hdfs:/tree_data/data/userbehavoir_train_sample.dat_[\d]+, expected: hdfs://localhost:9000
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:781)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:240)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1579)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1734)
hdfsGetPathInfo(hdfs:/tree_data/data/userbehavoir_train_sample.dat_[\d]+): getFileInfo error:
IllegalArgumentException: Wrong FS: hdfs:/tree_data/data/userbehavoir_train_sample.dat_[\d]+, expected: hdfs://localhost:9000java.lang.IllegalArgumentException: Wrong FS: hdfs:/tree_data/data/userbehavoir_train_sample.dat_[\d]+, expected: hdfs://localhost:9000
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:781)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:240)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1579)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1734)
data parallel for re: hdfs:/tree_data/data/userbehavoir_train_sample.dat_[\d]+
hdfsListDirectory(hdfs:/tree_data/data): FileSystem#listStatus error:
IllegalArgumentException: Wrong FS: hdfs:/tree_data/data, expected: hdfs://localhost:9000java.lang.IllegalArgumentException: Wrong FS: hdfs:/tree_data/data, expected: hdfs://localhost:9000
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:781)
at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:240)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1052)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131)
at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1119)
at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1116)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1126)
2018-12-29 11:42:42.068194: /home/yue.song/x-deeplearning/xdl/xdl/data_io/fs/file_system_hdfs.cc:129] Check failed: info != nullptr can't open dir hdfs:/tree_data/data
Aborted (core dumped)

=========================================================

Here are the modifications made to train.py:

def train(is_training=True):
    #np.set_printoptions(threshold='nan')
    if is_training or xdl.get_task_index() == 0:
        init()
    else:
        return

    file_type = xdl.parsers.txt
    if is_training:
        print "hello1"
        print file_type
        print xdl.fs.hdfs
        data_io = xdl.DataIO("tdm", file_type=file_type, fs_type=xdl.fs.hdfs,
                             namenode="hdfs://localhost:9000", enable_state=False)
        print "hello2"

        feature_count = 69
        for i in xrange(1, feature_count + 1):
            data_io.feature(name=("item_%s" % i), type=xdl.features.sparse, table=1)
        data_io.feature(name="unit_id_expand", type=xdl.features.sparse, table=0)

        print "hello3"
        data_io.batch_size(intconf('train_batch_size'))
        data_io.epochs(intconf('train_epochs'))
        data_io.threads(intconf('train_threads'))
        data_io.label_count(2)
        base_path = '%s/%s/' % (conf('upload_url'), conf('data_dir'))
        data = base_path + conf('train_sample') + '_' + r'[\d]+'
        sharding = xdl.DataSharding(data_io.fs())
        print "hello4"
        sharding.add_path(data)
        print "hello5"
        paths = sharding.partition(rank=xdl.get_task_index(), size=xdl.get_task_num())
        print "hello6"
        print 'train: sharding.partition() =', paths

I couldn't quite follow the explanation of namenode (the original comment reads: "modify the DataIO namenode parameter in the training code, namenode=\"hdfs://your/namenode/hdfs/path:9000\"; this is the HDFS root path under which samples are read").

So I tried several namenode values and ended up with "hdfs://localhost:9000". I don't know whether this is what triggers the exception; my Hadoop knowledge is limited, so I can't tell what the problem is.

Any pointers to the likely cause would be appreciated; I just want to get the full TDM training flow running.
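For what it's worth, the "Wrong FS … expected: hdfs://localhost:9000" errors above say the data path "hdfs:/tree_data/data/…" is not fully qualified: it has a scheme but no authority (host:port), so it fails Hadoop's filesystem check. A small Python sketch of the idea, loosely mimicking Hadoop's FileSystem.checkPath (a simplification for illustration, not XDL or Hadoop code):

```python
from urllib.parse import urlparse

def qualify(namenode, path):
    """Fully qualify `path` against `namenode`; reject mismatched scheme/authority."""
    nn, p = urlparse(namenode), urlparse(path)
    if p.scheme and p.scheme != nn.scheme:
        raise ValueError("Wrong FS: %s, expected: %s" % (path, namenode))
    if p.netloc and p.netloc != nn.netloc:
        raise ValueError("Wrong FS: %s, expected: %s" % (path, namenode))
    return "%s://%s%s" % (nn.scheme, nn.netloc, p.path)

# 'hdfs:/tree_data/data' parses with an empty authority; qualifying it against
# the namenode yields the form Hadoop accepts:
print(qualify("hdfs://localhost:9000", "hdfs:/tree_data/data"))
# hdfs://localhost:9000/tree_data/data
```

So a likely fix is to make conf('upload_url') a full "hdfs://localhost:9000/tree_data" URI so that base_path carries the same authority as the namenode argument.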

cmake error during build

Running cmake .. -DTF_BACKEND=1 fails with:

cmake .. -DTF_BACKEND=1        
-- build on release
CMake Error at cmake/Dependencies.cmake:11 (add_subdirectory):
  The source directory

    /tmp/x-deeplearning/xdl/third_party/googletest/googletest

  does not contain a CMakeLists.txt file.
Call Stack (most recent call first):
  CMakeLists.txt:46 (include)


CMake Error at cmake/Dependencies.cmake:30 (add_subdirectory):
  The source directory

    /tmp/x-deeplearning/xdl/third_party/glog

  does not contain a CMakeLists.txt file.
Call Stack (most recent call first):
  CMakeLists.txt:46 (include)


CMake Error at cmake/Dependencies.cmake:38 (add_subdirectory):
  The source directory

    /tmp/x-deeplearning/xdl/third_party/librdkafka

  does not contain a CMakeLists.txt file.
Call Stack (most recent call first):
  CMakeLists.txt:46 (include)


/..: No such file or directory
/..: No such file or directory
: No such file or directory
build.sh: line 18: ./configure.py: No such file or directory
: No such file or directory=
: No such file or directorys=
ninja: error: loading 'build.ninja': No such file or directory
build.sh: line 22: popd: directory stack empty
" does not exist.source directory "/tmp/x-deeplearning/xdl/third_party/seastar/service/build_script/
Specify --help for usage, or press the help button on the CMake GUI.
build.sh: line 25: $'\r': command not found
make: *** No targets specified and no makefile found.  Stop.
OK
CMake Error at cmake/Dependencies.cmake:61 (add_subdirectory):
  The source directory

    /tmp/x-deeplearning/xdl/third_party/libevent

  does not contain a CMakeLists.txt file.
Call Stack (most recent call first):
  CMakeLists.txt:46 (include)


-- protofiles: /tmp/x-deeplearning/xdl/xdl/proto/feaconf.proto;/tmp/x-deeplearning/xdl/xdl/proto/io_state.proto;/tmp/x-deeplearning/xdl/xdl/proto/sample_v4.proto;/tmp/x-deeplearning/xdl/xdl/proto/sample.proto
-- protosrcs: proto//feaconf.pb.cc;proto//io_state.pb.cc;proto//sample_v4.pb.cc;proto//sample.pb.cc
-- include: /tmp/x-deeplearning/xdl/third_party/protobuf/src
-- Configuring incomplete, errors occurred!
See also "/tmp/x-deeplearning/xdl/build/CMakeFiles/CMakeOutput.log".
See also "/tmp/x-deeplearning/xdl/build/CMakeFiles/CMakeError.log".

The log in /tmp/x-deeplearning/xdl/build/CMakeFiles/CMakeError.log is:

Determining if the pthread_create exist failed with the following output:
Change Dir: /tmp/x-deeplearning/xdl/build/CMakeFiles/CMakeTmp

Run Build Command:"/usr/bin/make" "cmTC_bbbf0/fast"
/usr/bin/make -f CMakeFiles/cmTC_bbbf0.dir/build.make CMakeFiles/cmTC_bbbf0.dir/build
make[1]: Entering directory '/tmp/x-deeplearning/xdl/build/CMakeFiles/CMakeTmp'
Building C object CMakeFiles/cmTC_bbbf0.dir/CheckSymbolExists.c.o
/usr/bin/gcc-5    -fPIC    -o CMakeFiles/cmTC_bbbf0.dir/CheckSymbolExists.c.o   -c /tmp/x-deeplearning/xdl/build/CMakeFiles/CMakeTmp/CheckSymbolExists.c
Linking C executable cmTC_bbbf0
/usr/bin/cmake -E cmake_link_script CMakeFiles/cmTC_bbbf0.dir/link.txt --verbose=1
/usr/bin/gcc-5   -fPIC     CMakeFiles/cmTC_bbbf0.dir/CheckSymbolExists.c.o  -o cmTC_bbbf0
CMakeFiles/cmTC_bbbf0.dir/CheckSymbolExists.c.o: In function `main':
CheckSymbolExists.c:(.text+0x1b): undefined reference to `pthread_create'
collect2: error: ld returned 1 exit status
CMakeFiles/cmTC_bbbf0.dir/build.make:97: recipe for target 'cmTC_bbbf0' failed
make[1]: *** [cmTC_bbbf0] Error 1
make[1]: Leaving directory '/tmp/x-deeplearning/xdl/build/CMakeFiles/CMakeTmp'
Makefile:126: recipe for target 'cmTC_bbbf0/fast' failed
make: *** [cmTC_bbbf0/fast] Error 2

File /tmp/x-deeplearning/xdl/build/CMakeFiles/CMakeTmp/CheckSymbolExists.c:
/* */
#include <pthread.h>

int main(int argc, char** argv)
{
  (void)argv;
#ifndef pthread_create
  return ((int*)(&pthread_create))[argc];
#else
  (void)argc;
  return 0;
#endif
}

Determining if the function pthread_create exists in the pthreads failed with the following output:
Change Dir: /tmp/x-deeplearning/xdl/build/CMakeFiles/CMakeTmp

Run Build Command:"/usr/bin/make" "cmTC_e4952/fast"
/usr/bin/make -f CMakeFiles/cmTC_e4952.dir/build.make CMakeFiles/cmTC_e4952.dir/build
make[1]: Entering directory '/tmp/x-deeplearning/xdl/build/CMakeFiles/CMakeTmp'
Building C object CMakeFiles/cmTC_e4952.dir/CheckFunctionExists.c.o
/usr/bin/gcc-5    -fPIC -DCHECK_FUNCTION_EXISTS=pthread_create   -o CMakeFiles/cmTC_e4952.dir/CheckFunctionExists.c.o   -c /usr/share/cmake-3.5/Modules/CheckFunctionExists.c
Linking C executable cmTC_e4952
/usr/bin/cmake -E cmake_link_script CMakeFiles/cmTC_e4952.dir/link.txt --verbose=1
/usr/bin/gcc-5   -fPIC -DCHECK_FUNCTION_EXISTS=pthread_create    CMakeFiles/cmTC_e4952.dir/CheckFunctionExists.c.o  -o cmTC_e4952 -lpthreads
/usr/bin/ld: cannot find -lpthreads
collect2: error: ld returned 1 exit status
CMakeFiles/cmTC_e4952.dir/build.make:97: recipe for target 'cmTC_e4952' failed
make[1]: *** [cmTC_e4952] Error 1
make[1]: Leaving directory '/tmp/x-deeplearning/xdl/build/CMakeFiles/CMakeTmp'
Makefile:126: recipe for target 'cmTC_e4952/fast' failed
make: *** [cmTC_e4952/fast] Error 2
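Two observations on the log above, offered as assumptions rather than a confirmed diagnosis: the pthread probe failures in CMakeError.log are ordinary CMake feature tests (it deliberately tries -lpthreads to see if it exists) and are usually harmless; the real errors are the "does not contain a CMakeLists.txt file" messages, which typically mean the third_party/ directories are empty, e.g. because they are git submodules that were never fetched. A quick shell check for the directories named in the errors:

```shell
# Report which of the third_party trees named in the cmake errors are missing.
for d in googletest/googletest glog librdkafka libevent; do
  if [ ! -f "xdl/third_party/$d/CMakeLists.txt" ]; then
    echo "missing submodule: $d"
  fi
done
# If any are reported, fetch the submodules and re-run cmake:
#   git submodule update --init --recursive
```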

YARN DistributedCache is used incorrectly

private void setupResource(Path resourcePath, LocalResource localResource) throws IOException {
    FileStatus fileStatus;
    fileStatus = FileSystem.get(conf).getFileStatus(resourcePath);

    localResource.setResource(URL.fromPath(resourcePath));
    localResource.setSize(fileStatus.getLen());
    localResource.setTimestamp(fileStatus.getModificationTime());
    localResource.setType(LocalResourceType.FILE);
    localResource.setVisibility(LocalResourceVisibility.PUBLIC);
  }

LocalResourceVisibility.PUBLIC should be LocalResourceVisibility.PRIVATE: resources uploaded to /user/xxx/.xdl will not localize as PUBLIC unless every level of the path is 755.

The XDL tutorial and the image are far too inconsistent

The documentation quality is low; at a minimum there are these problems:

  1. The WORK_PATH description in the docs is misplaced/inconsistent
  2. After pulling the image, many basic libraries such as sklearn and mxnet error out; a complete beginner cannot install them smoothly
  3. The key places that need modification are described vaguely
  4. The downloaded image tag is tf12, but the train.py inside the image does not use tensorflow at all

Maybe I'm just not skilled enough; anyone who has gotten the scripts running, please share your experience in the issues.

How is failover handled when a PS server or AMS dies?

Many thanks for open-sourcing XDL; it has given the industry a lot of inspiration.
From the xdl yarn scheduler source, the fault-tolerance and retry policy is clear: if any process/container of any role (scheduler/ps/worker/extended_roles) dies, it is relaunched on a new YARN node.

Questions:

  • While a ps/ams is restarting, what role does the scheduler play?
  • How are the model parameters held in ps and ams memory restored?
  • Could a relaunched ps/ams end up with parameters whose global step lags far behind the other ps/ams instances?

Looking forward to your reply, thank you very much!

Questions about distributed training

  1. Is there sample code for distributed training?
  2. In the distributed-training section of the user docs, launching a distributed job requires passing parameters such as:
--task_name=scheduler --zk_addr=zfs://xxx --ps_num=2 --ps_cpu_cores=10 --ps_memory_m=4000 --ckpt_dir=hdfs://xxx/checkpoint

Which classes consume each of these parameters?
Also, does --zk_addr refer to ZooKeeper? What is the format of the path that follows, and which modules handle it?

Crash when reading from HDFS

Command: python deepctr.py --run_mode=local
File path: hdfs://some-server/aaa/bbb.txt

Environment: HADOOP_HOME, PATH, etc. are set; the hdfs command can list files.

Core dump:

#0  std::function<hdfs_internal* (char const*, unsigned short)>::operator()(char const*, unsigned short) const (__args#1=0,
    __args#0=0x1c20f48 "some-server", this=0x70) at /

What exactly does the group id in a sample mean?

If sample id is the per-sample id, is group id for grouping samples? What do you gain by specifying it?

Concretely: in advertising, the user-side features are often unchanged across many samples. The usual approach is to shuffle the data and, in the embedding lookup, unique the keys so each embedding is fetched only once, then reconstruct the batch tensors from the pulled results before further processing.

Now suppose we set group id to the user id. I can guess two benefits:

  1. Data can be shuffled by user id, so the number of unique keys per batch drops sharply, reducing communication volume.
  2. When reconstructing the batch tensors, there is no need for a full rebuild; the user-side features can be kept as a single copy and computed only once.
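The key-unique trick described above can be sketched in a few lines of NumPy (illustrative only; XDL's actual lookup is implemented in C++ and is not shown here):

```python
import numpy as np

# A batch of sparse feature ids; user-side ids repeat across samples.
batch_keys = np.array([1001, 42, 1001, 7, 42, 1001])

# Deduplicate before the PS pull: each unique key is fetched once...
unique_keys, inverse = np.unique(batch_keys, return_inverse=True)
embeddings = np.arange(len(unique_keys) * 2, dtype=np.float32).reshape(-1, 2)  # stand-in for the PS pull

# ...then the per-sample tensor is reconstructed with the inverse indices.
batch_embeddings = embeddings[inverse]

print(unique_keys)             # unique keys: 7, 42, 1001
print(batch_embeddings.shape)  # (6, 2)
```

Only three rows are pulled for six samples; grouping samples by user id would shrink the unique-key set further, which is the communication saving point 1 guesses at.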

So my questions are:

  1. Does group id exist for these two efficiency optimizations, and does it work roughly as described?
  2. If data is shuffled by group id, then in an online-learning setting user-side features would be updated less often. Have you measured whether AUC or online metrics suffer under online learning?

yarn.io/gpu error on distributed submit

org.apache.hadoop.yarn.exceptions.ResourceNotFoundException: Unknown resource 'yarn.io/gpu'. Known resources are [name: memory-mb, units: Mi, type: COUNTABLE, value: 15360, minimum allocation: 0, maximum allocation: 9223372036854775807, name: vcores, units: , type: COUNTABLE, value: 8, minimum allocation: 0, maximum allocation: 9223372036854775807]
at org.apache.hadoop.yarn.api.records.Resource.getResourceInformation(Resource.java:269)
at org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getResourceInformation(ResourcePBImpl.java:208)
at org.apache.hadoop.yarn.api.records.Resource.getResourceValue(Resource.java:306)
at org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getResourceValue(ResourcePBImpl.java:214)
at com.alibaba.xdl.AppMasterBase.initResource(AppMasterBase.java:831)
at com.alibaba.xdl.AppMasterBase.run(AppMasterBase.java:161)
at com.alibaba.xdl.AppMasterRunner.main(AppMasterRunner.java:87)

config.json already sets "gpu_cores": 0; we do not intend to use GPUs at all.
So my question is: is this resource a cluster-side configuration, or does submission require extra configuration?
Also, I see GPU-related JARs under the yarn_patch directory. How are they used?

Confused by your negative sampling method

Hi authors:
In process_data.py, the function manual_join() may have a problem. The part that confuses me is:
while True:
    asin_neg_index = random.randint(0, len(item_list) - 1)
    asin_neg = item_list[asin_neg_index]
    # if asin_neg != asin, but asin_neg is a positive example in our dataset???
    if asin_neg == asin:
        continue
    items[1] = asin_neg
    print>>fo, "0" + "\t" + "\t".join(items) + "\t" + meta_map[asin_neg]
    j += 1
    if j == 1:             # negative sampling frequency
        break
if asin in meta_map:
    print>>fo, "1" + "\t" + line + "\t" + meta_map[asin]
else:
    print>>fo, "1" + "\t" + line + "\t" + "default_cat"

There may be a problem with this negative sampling: only asin_neg == asin is rejected, so if asin_neg was also reviewed by the same user, a positive example could be emitted as a negative. I'm not sure my reasoning is right; if possible, I hope you can help clear up this confusion. Thank you very much!
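A hedged sketch of the fix this issue seems to be pointing at: reject a candidate negative if it appears anywhere in the user's positive set, not only when it equals the current asin. Function and variable names here are hypothetical, not taken from process_data.py:

```python
import random

def sample_negative(item_list, user_positives, rng=random):
    """Draw an item id that is not among any of the user's positive items."""
    positives = set(user_positives)
    while True:
        candidate = item_list[rng.randint(0, len(item_list) - 1)]
        if candidate not in positives:  # stricter than `candidate == asin`
            return candidate

items = ["a", "b", "c", "d"]
neg = sample_negative(items, user_positives=["a", "c"])
print(neg)  # always "b" or "d"
```

Whether the original code needs this depends on how often a user reviews the same item pool; with a large catalog the collision probability is small, which may be why the simpler check was considered acceptable.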

The mnist example fails

First, thanks for the great project.
I tried to run the MNIST example and hit a failure.

root@adeb1b227a39:~/x-deeplearning/xdl/examples/mnist# python mnist.py

The traceback is below.

Traceback (most recent call last):
  File "mnist.py", line 95, in <module>
    train()
  File "mnist.py", line 51, in train
    train_sess = xdl.TrainSession(hooks=[ckpt_hook])
  File "/usr/local/lib/python2.7/dist-packages/xdl-1.0-py2.7.egg/xdl/python/training/train_session.py", line 248, in __init__
    self._session = SimpleSession(hooks)
  File "/usr/local/lib/python2.7/dist-packages/xdl-1.0-py2.7.egg/xdl/python/training/train_session.py", line 238, in __init__
    self._session = Session(self._hooks)
  File "/usr/local/lib/python2.7/dist-packages/xdl-1.0-py2.7.egg/xdl/python/framework/session.py", line 36, in __init__
    self._create_session()
  File "/usr/local/lib/python2.7/dist-packages/xdl-1.0-py2.7.egg/xdl/python/framework/session.py", line 40, in _create_session
    hook.create_session()
  File "/usr/local/lib/python2.7/dist-packages/xdl-1.0-py2.7.egg/xdl/python/training/train_session.py", line 195, in create_session
    execute_with_retry(variable_registers())
  File "/usr/local/lib/python2.7/dist-packages/xdl-1.0-py2.7.egg/xdl/python/training/train_session.py", line 159, in execute_with_retry
    return execute(ops)
  File "/usr/local/lib/python2.7/dist-packages/xdl-1.0-py2.7.egg/xdl/python/lib/graph.py", line 242, in execute
    return current_graph().execute(outputs, run_option, run_statistic)
  File "/usr/local/lib/python2.7/dist-packages/xdl-1.0-py2.7.egg/xdl/python/lib/graph.py", line 178, in execute
    check_error(result.status)
  File "/usr/local/lib/python2.7/dist-packages/xdl-1.0-py2.7.egg/xdl/python/lib/error.py", line 35, in check_error
    raise InternalError(status.msg)
xdl.python.lib.error.InternalError: ErrorCode[3], ErrorMsg[Client is not initialized
Check Status [GetClient(&client)] at [/home/yue.song/x-deeplearning/xdl/xdl/core/ops/ps_ops/ps_register_variable_op.cc]Compute@43]

Running the XDL docker image fails

I tried to run the xdl docker image with:

sudo docker run -it registry.cn-hangzhou.aliyuncs.com/xdl/xdl:ubuntu-cpu-tf1.12

and this error came up:

root@77a9fd80b821:/# python
Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.

import xdl
2018-12-21 15:09:59.852108: F tensorflow/core/platform/cpu_feature_guard.cc:37] The TensorFlow library was compiled to use AVX2 instructions, but these aren't available on your machine.
Aborted (core dumped)
root@77a9fd80b821:/#

Could anyone help? Thanks.

Why does the DIN example re-number id features?

cat_voc.pkl, uid_voc.pkl, mid_voc.pkl 这三个文件,如果理解没错的话,应该是 cat id, userid itemid 对应的连续证整数编号。

In a real project there can be hundreds of millions of user ids, so having every worker load a user-id vocabulary is inefficient. Since XDL supports sparse features directly, why not train on the raw user id (or hash it into the int64 space)?
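The alternative the question suggests can be sketched as follows: instead of mapping each id through a vocabulary file, hash the raw id string into the non-negative int64 space and feed that as the sparse feature id. Collisions are possible but rare in a 63-bit space. This is illustrative only, not XDL's actual DIN pipeline; the function name is hypothetical.

```python
# Hash a raw string id into the non-negative int64 range, as an
# alternative to a vocabulary lookup. MD5 is used only for a stable,
# well-distributed hash, not for security.
import hashlib
import struct


def hash_id_to_int64(raw_id):
    """Stable hash of a string id into [0, 2**63)."""
    digest = hashlib.md5(raw_id.encode("utf-8")).digest()
    # Take the first 8 bytes as a big-endian signed int64, then clear
    # the sign bit so the result is non-negative.
    return struct.unpack(">q", digest[:8])[0] & 0x7FFFFFFFFFFFFFFF


print(hash_id_to_int64("user_123456789"))
```

The trade-off is that hashing loses the ability to map ids back and admits collisions, but it removes the vocabulary file entirely.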

Problem viewing failure logs

The job failed with the following message:
18/12/30 23:16:50 INFO xdl.Client: AppMaster host N/A Start waiting application: application_1546112727033_3813 ends.
18/12/30 23:16:51 INFO xdl.Client: Application application_1546112727033_3813 finish with state FAILED
18/12/30 23:16:51 INFO xdl.Utils: ================================FINAL STATUS==================================
18/12/30 23:16:51 INFO xdl.Utils: application_1546112727033_3813 : FAILED
18/12/30 23:16:51 INFO xdl.Utils: ================================FINAL STATUS==================================

But when I fetch logs via yarn logs, I only get the following message and cannot see any actual logs:
18/12/30 23:36:18 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2

What could be the cause?

The deepctr example fails

The deepctr example fails with the log below. Please help; thanks in advance.

root@adeb1b227a39:~/x-deeplearning/xdl/examples/deepctr# python deepctr.py
Traceback (most recent call last):
  File "deepctr.py", line 21, in <module>
    enable_state=False) # enable reader state
  File "/usr/local/lib/python2.7/dist-packages/xdl-1.0-py2.7.egg/xdl/python/io/data_reader.py", line 67, in __init__
    rank=xdl.get_task_index(), size=xdl.get_task_num())
  File "/usr/local/lib/python2.7/dist-packages/xdl-1.0-py2.7.egg/xdl/python/io/data_sharding.py", line 73, in partition
    assert size > 0 and rank >= 0 and rank < size
AssertionError
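The assertion that fires is `size > 0 and rank >= 0 and rank < size`, i.e. the data-sharding partitioner received an invalid task index/count from `xdl.get_task_index()` / `xdl.get_task_num()` (typically because the script was started without the task environment the launcher sets up). A minimal sketch of the same precondition with a readable error, in case it helps narrow down which value is wrong (the function name is hypothetical, not XDL API):

```python
# Mirror of the data_sharding.py precondition, with explicit error
# messages instead of a bare AssertionError.
def check_shard(rank, size):
    """Validate a (task_index, task_num) pair before partitioning data."""
    if size <= 0:
        raise ValueError(
            "task_num must be positive, got %d "
            "(was the job launched without a task environment?)" % size)
    if not (0 <= rank < size):
        raise ValueError("task_index %d out of range [0, %d)" % (rank, size))


check_shard(0, 1)  # a single local worker: rank 0 of 1
```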

memory leak in deepctr example

I ran the deepctr example with some minor modifications (changed epochs, batch_size, and embedding_size), and found that memory usage climbed to 10 GB within a minute.

Here is my diff:

diff --git a/xdl/examples/deepctr/deepctr.py b/xdl/examples/deepctr/deepctr.py
index 343f352..2ef9106 100644
--- a/xdl/examples/deepctr/deepctr.py
+++ b/xdl/examples/deepctr/deepctr.py
@@ -20,7 +20,7 @@ reader = xdl.DataReader("r1", # name of reader
                         paths=["./data.txt"], # file paths
                         enable_state=False) # enable reader state

-reader.epochs(1).threads(1).batch_size(10).label_count(1)
+reader.epochs(300).threads(1).batch_size(100).label_count(1)
 reader.feature(name='sparse0', type=xdl.features.sparse)\
     .feature(name='sparse1', type=xdl.features.sparse)\
     .feature(name='deep0', type=xdl.features.dense, nvec=256)
@@ -29,11 +29,11 @@ reader.startup()
 def train():
     batch = reader.read()
     sess = xdl.TrainSession()
-    emb1 = xdl.embedding('emb1', batch['sparse0'], xdl.TruncatedNormal(stddev=0.001), 8, 1024, vtype='hash')
-    emb2 = xdl.embedding('emb2', batch['sparse1'], xdl.TruncatedNormal(stddev=0.001), 8, 1024, vtype='hash')
+    emb1 = xdl.embedding('emb1', batch['sparse0'], xdl.TruncatedNormal(stddev=0.001), 128, 1024, vtype='hash')
+    emb2 = xdl.embedding('emb2', batch['sparse1'], xdl.TruncatedNormal(stddev=0.001), 128, 1024, vtype='hash')
     loss = model(batch['deep0'], [emb1, emb2], batch['label'])
     train_op = xdl.SGD(0.5).optimize()
-    log_hook = xdl.LoggerHook(loss, "loss:{0}", 10)
+    log_hook = xdl.LoggerHook(loss, "loss:{0}", 100)
     sess = xdl.TrainSession(hooks=[log_hook])
     while not sess.should_stop():
         sess.run(train_op)

Here is the output of the top command:

18142 root      20   0 45.101g 0.011t  61252 S 367.1  4.7  15:16.10 python deepctr.py --run=local

Memory seems to grow by about 100 MB per epoch.
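To quantify per-epoch growth more precisely than watching top, one can sample the process's resident set size between epochs. The sketch below reads `VmRSS` from Linux's `/proc/self/status` using only the stdlib; the training step is a placeholder comment standing in for `sess.run(train_op)`.

```python
# Diagnostic sketch: sample this process's resident set size (VmRSS)
# so memory growth can be attributed to individual epochs.
import os


def rss_kib():
    """Current resident set size in KiB, from /proc/self/status (0 if unavailable)."""
    if not os.path.exists("/proc/self/status"):
        return 0
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0


before = rss_kib()
# ... run one epoch of sess.run(train_op) here ...
after = rss_kib()
print("RSS grew by %d KiB this epoch" % (after - before))
```

Logging this once per epoch would confirm whether the growth is really linear in epochs (suggesting per-epoch state is retained) rather than in steps.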
