
kubefate's People

Contributors

asdfsx, dependabot[bot], dockerzhang, dylan-fan, eliaskousk, fedml-alex, gegose, gps949, hainingzhang, haozheng95, hyxxsfwy, jat001, jiahaoc1993, jingchen23, lalalapotter, laynepeng, liuhaichaogithub, n063h, owlet42, pengluhyd, penhsuanwang, sharelinux, stone-wlg, tanmc123, tomsypaul, tubaobao3, wfangchi, x007007007


kubefate's Issues

error when starting up docker container

If I bring down all the Docker containers and then restart them, I get the error below.

File "fate_flow_server.py", line 91, in <module>
python_1        | Traceback (most recent call last):
python_1        |   File "fate_flow_server.py", line 93, in <module>
python_1        |     session.init(mode=RuntimeConfig.WORK_MODE, backend=Backend.EGGROLL)
python_1        |   File "/data/projects/fate/python/arch/api/session.py", line 52, in init
python_1        |     session = build_session(job_id=job_id, work_mode=mode, backend=backend)
python_1        |   File "/data/projects/fate/python/arch/api/table/session.py", line 38, in build_session
python_1        |     session = session_impl.FateSessionImpl(eggroll_session, work_mode, persistent_engine)
python_1        |   File "/data/projects/fate/python/arch/api/table/eggroll/session_impl.py", line 33, in __init__
python_1        |     self._eggroll = eggroll_util.build_eggroll_runtime(work_mode=work_mode, eggroll_session=eggroll_session)
python_1        |   File "/data/projects/fate/python/arch/api/table/eggroll_util.py", line 44, in build_eggroll_runtime
python_1        |     return eggroll_init(eggroll_session)
python_1        |   File "/data/projects/fate/python/eggroll/api/cluster/eggroll.py", line 79, in eggroll_init
python_1        |     eggroll_runtime = _EggRoll(eggroll_session=eggroll_session)
python_1        |   File "/data/projects/fate/python/eggroll/api/cluster/eggroll.py", line 364, in __init__
python_1        |     self.session_stub.getOrCreateSession(self.eggroll_session.to_protobuf())
python_1        |   File "/data/projects/python/venv/lib/python3.6/site-packages/grpc/_channel.py", line 533, in __call__
python_1        |     return _end_unary_response_blocking(state, call, False, None)
python_1        |   File "/data/projects/python/venv/lib/python3.6/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
python_1        |     raise _Rendezvous(state, None, None, deadline)
python_1        | grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
python_1        | 	status = StatusCode.UNAVAILABLE
python_1        | 	details = "Connect Failed"
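A StatusCode.UNAVAILABLE / "Connect Failed" at this point usually indicates that the eggroll-side services fate_flow depends on are not reachable yet. A minimal debugging sketch (service and path names are assumptions based on the default docker-compose deployment):

cd confs-10000/confs            # assumed location of the generated compose file
docker-compose ps               # every service should be "Up" before python/fate_flow starts
docker-compose logs roll        # the roll/eggroll service fate_flow fails to reach
docker-compose restart python   # retry once the backend services are healthy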

docker-compose for fate-serving

Can you share the reason why fate-serving has a new docker-compose-serving.yml file instead of reusing the existing docker-compose.yml? The new docker-compose-serving.yml defines separate redis and serving-proxy services; why not reuse the existing redis and proxy? Otherwise we end up with two redis and two proxy instances for each party.

Got rpc error if setting up with non-root user

I changed the user in the parties.conf file from root to ubuntu, which is a sudo user on my cluster.
I manually created the /data/ dir and chmod'ed it to 777, then ran the deploy script as usual.
The output looks OK and the containers are all set up, but it seems the rpc call does not work correctly.

(venv) [root@383b62d82f80 toy_example]# python run_toy_example.py 10000 9999 1
stdout:{
"retcode": 100,
"retmsg": "rpc request error: <_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = "io exception"\n\tdebug_error_string = "{"created":"@1575627671.035273330","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"io exception","grpc_status":14}"\n>"
}

Traceback (most recent call last):
File "run_toy_example.py", line 196, in
exec_toy_example(runtime_config)
File "run_toy_example.py", line 161, in exec_toy_example
jobid = exec_task(dsl_path, runtime_config)
File "run_toy_example.py", line 91, in exec_task
"failed to exec task, status:{}, stderr is {} stdout:{}".format(status, stderr, stdout))
ValueError: failed to exec task, status:100, stderr is None stdout:{'retcode': 100, 'retmsg': 'rpc request error: <_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = "io exception"\n\tdebug_error_string = "{"created":"@1575627671.035273330","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"io exception","grpc_status":14}"\n>'}
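For reference, a sketch of the setup steps described above (values and paths are assumptions, not a verified fix):

sudo mkdir -p /data
sudo chmod 777 /data
# parties.conf changed from root to the sudo user:
#   user=ubuntu
#   dir=/data/projects/fate
bash generate_config.sh
bash docker_deploy.sh all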

Failed when helm installing for each namespace

When running:
helm install --name=fate-10000 --namespace=fate-10000 ./fate-10000/
got:

Error: release fate-10000 failed: namespaces "fate-10000" is forbidden: User "system:serviceaccount:kube-system:default" cannot get resource "namespaces" in API group "" in the namespace "fate-10000"

Is it because the tiller pod does not have permission to access resources in each namespace?
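If the cause is indeed missing RBAC for Tiller, a commonly used (coarse-grained) workaround is to bind Tiller's service account to cluster-admin. A sketch, assuming Helm 2 with Tiller in kube-system (adjust the service account name to your setup):

kubectl -n kube-system create serviceaccount tiller
kubectl create clusterrolebinding tiller-cluster-admin \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:tiller
helm init --service-account tiller --upgrade
helm install --name=fate-10000 --namespace=fate-10000 ./fate-10000/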

Should a note on accelerating the miniKube download be added to the docs?

**Which deployment mode are you using?**

  1. docker-compose;
  2. Kubernetes.

**Which KubeFATE and FATE versions are you using?**

**Which OS are you using for docker-compose or Kubernetes? Please also state the OS version.**

  • OS: [e.g. iOS]
  • Version [e.g. 22]

Desktop (please complete the following information):

  • OS: [e.g. iOS]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

To Reproduce
Describe how to reproduce your problem.

What happened?
Describe the unexpected response.

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
Add any other context about the problem here.
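One possible form of the acceleration note requested above, as a sketch: minikube can pull its Kubernetes images through a mirror when the default registries are slow to reach (these flags exist in recent minikube releases; verify against your version).

minikube start --image-mirror-country=cn \
  --image-repository=registry.cn-hangzhou.aliyuncs.com/google_containers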

Toy test got error after deploy

After deploying the K8S cluster and getting all pods running and ready, I followed the instructions to get into the python container and run run_toy_example.py.
When I run the following command
kubectl exec -it -c python svc/fateflow bash -n fate-10000
I get

(venv) [root@python-6dc44d6b98-95n6x toy_example]# python run_toy_example.py 10000 9999 1
stdout:{
"data": {
"board_url": "http://fateboard:8080/index.html#/dashboard?job_id=201911170614590079332&role=guest&party_id=10000",
"job_dsl_path": "/data/projects/fate/python/jobs/201911170614590079332/job_dsl.json",
"job_runtime_conf_path": "/data/projects/fate/python/jobs/201911170614590079332/job_runtime_conf.json",
"logs_directory": "/data/projects/fate/python/logs/201911170614590079332",
"model_info": {
"model_id": "guest-10000#host-9999#model",
"model_version": "201911170614590079332"
}
},
"jobId": "201911170614590079332",
"retcode": 0,
"retmsg": "success"
}

job status is running
job status is running
"2019-11-17 06:15:02,523 - task_executor.py[line:127] - ERROR: <_Rendezvous of RPC that terminated with:
status = StatusCode.INTERNAL
details = "10.244.2.33:8011: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at com.webank.ai.eggroll.framework.roll.api.grpc.client.StorageServiceClient.get(StorageServiceClient.java:223)
at com.webank.ai.eggroll.framework.roll.api.grpc.server.RollKvServiceImpl.lambda$get$5(RollKvServiceImpl.java:240)
at com.webank.ai.eggroll.core.api.grpc.server.GrpcServerWrapper.wrapGrpcServerRunnable(GrpcServerWrapper.java:52)
at com.webank.ai.eggroll.framework.roll.api.grpc.server.RollKvServiceImpl.get(RollKvServiceImpl.java:235)
at com.webank.ai.eggroll.api.storage.KVServiceGrpc$MethodHandlers.invoke(KVServiceGrpc.java:959)
at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171)
at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
at io.grpc.ForwardingServerCallListener$SimpleForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:40)
at io.grpc.Contexts$ContextualizedServerCallListener.onHalfClose(Contexts.java:86)
at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:283)
at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:710)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
at com.webank.ai.eggroll.core.api.grpc.client.GrpcStreamingClientTemplate.calleeStreamingRpcWithImmediateDelayedResult(GrpcStreamingClientTemplate.java:154)
at com.webank.ai.eggroll.framework.roll.api.grpc.client.StorageServiceClient.get(StorageServiceClient.java:219)
... 16 more
Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at io.grpc.Status.asRuntimeException(Status.java:526)
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:434)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:678)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584)
... 5 more
Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: egg/10.103.104.33:7778
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.grpc.netty.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
at io.grpc.netty.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632)
at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 more
Caused by: java.net.ConnectException: Connection refused
... 11 more

I checked the service list:

ubuntu@gpu01:~/KubeFATE/k8s-deploy$ kubectl get service --all-namespaces
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default kubernetes ClusterIP 10.96.0.1 443/TCP 6d21h
fate-10000 egg ClusterIP 10.103.104.33 7888/TCP,7778/TCP 33m
fate-10000 fateboard ClusterIP 10.96.218.69 8080/TCP 33m
fate-10000 fateflow ClusterIP 10.107.108.66 9360/TCP,9380/TCP 33m
fate-10000 federation ClusterIP 10.105.140.38 9394/TCP 33m
fate-10000 meta-service ClusterIP 10.98.124.197 8590/TCP 33m
fate-10000 mysql ClusterIP 10.110.194.54 3306/TCP 33m
fate-10000 proxy NodePort 10.108.227.49 9370:30010/TCP 33m
fate-10000 redis ClusterIP 10.101.248.97 6379/TCP 33m
fate-10000 roll ClusterIP 10.105.184.33 8011/TCP 33m
fate-9999 egg ClusterIP 10.110.125.195 7888/TCP,7778/TCP 33m

It seems that in fate-10000 there is some connection problem with the egg service.

I ran the same test in fate-9999 and got a similar error.
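A minimal checking sketch for the egg connection problem above (substitute the actual pod name; nothing here is a confirmed fix):

kubectl -n fate-10000 get endpoints egg          # should list a ready address on 7778
kubectl -n fate-10000 get pods | grep egg
kubectl -n fate-10000 logs <egg-pod-name>        # use the pod name from the previous command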

Error when running the modified bind-model configuration example

The error message is as follows:
(venv) [root@0c1e90c8bb50 fate_flow]# python fate_flow_client.py -f bind -c examples/bind_model_service.json
{
"retcode": 100,
"retmsg": "<_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILA BLE\n\tdetails = "Connect Failed"\n\tdebug_error_string = "{"created":"@1585048168 .113685948","description":"Failed to create subchannel","file":"src/core/ext/fil ters/client_channel/client_channel.cc","file_line":2721,"referenced_errors":[{"cre ated":"@1585048168.113675586","description":"Pick Cancelled","file":"src/core/ ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":241,"refe renced_errors":[{"created":"@1585048168.113456935","description":"Connect Failed ","file":"src/core/ext/filters/client_channel/subchannel.cc","file_line":689,"gr pc_status":14,"referenced_errors":[{"created":"@1585048168.113402662","descripti on":"Failed to connect to remote host: OS Error","errno":113,"file":"src/core/li b/iomgr/tcp_client_posix.cc","file_line":210,"os_error":"No route to host","sysc all":"getsockopt(SO_ERROR)","target_address":"ipv4:10.0.2.16:8000"}]}]}]}"\n>"
}
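The "No route to host" target in the error, ipv4:10.0.2.16:8000, is presumably the serving endpoint configured in examples/bind_model_service.json. A quick reachability check from inside the python container (a sketch, not a fix):

python -c "import socket; socket.create_connection(('10.0.2.16', 8000), timeout=3); print('reachable')"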

fate 1.4: training modules frequently time out and the predict module errors out

FATE 1.4 deployed with docker-compose.
Party 45 is guest + exchange.
Party 46 is host.
Training on my own data (train 30k + test 7k) with test_secureboost_train_dsl.json and the conf from examples/federatedml-1.x-examples/hetero_secureboost.

Model training frequently fails (timeout errors during training, in intersection, binning and secureboost, mostly on the host side).
(screenshots)

The predict module fails directly (data type conversion error).
The predict module fails with both the train data and the test data (on the guest side), so the dataset itself has been ruled out.
(screenshot)

generate_config minor error

For this line

sed -i.bak "/'host':./{x;s/^/./;/^.{2}$/{x;s/./ 'host': '${redis_ip}',/;x};x;}" ./confs-$party_id/confs/fate_flow/conf/settings.py

I get the error

sed: 1: "/'host':.*/{x;s/^/./;/^ ...": extra characters at the end of x command

on macOS.
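That error message is characteristic of BSD sed (the macOS default) parsing a GNU-style script. One common workaround, assuming Homebrew is available, is to run the generation script with GNU sed on the PATH:

brew install gnu-sed
PATH="$(brew --prefix)/opt/gnu-sed/libexec/gnubin:$PATH" bash generate_config.sh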

fate-serving 1.2

May I know why serving was removed from 1.2? Is there an estimate of when it will be available again? Thank you.

Question about FATECLOUD_REPO_URL

        - name: FATECLOUD_REPO_URL
          value: "http://docker-repo.sonic.com:443/chartrepo/chartrepo"

This is the URL the charts are downloaded from, right?
At the moment I am not using it; instead I upload the chart manually with kubefate chart upload.
After cluster.yaml has been prepared, what should the FATECLOUD_REPO_URL environment variable be set to?

Main question: I found that even after uploading the chart manually, KubeFATE still tries to access FATECLOUD_REPO_URL. Why is that?

Expose SSH service from the python container

For developers who design their own algorithms, it is inconvenient to code and debug inside the python container.
One solution is to set up an SSH service inside the python container and expose it, so that developers can view and debug code from an editor such as VS Code with its SSH plugin.
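A rough sketch of that idea, assuming the python image is CentOS-based and port 2222 is free on the host (package names, container name and ports are all assumptions):

docker exec -it confs-10000_python_1 bash -c \
  "yum install -y openssh-server && ssh-keygen -A && /usr/sbin/sshd"
# add an authorized key or password for the login user inside the container, then
# publish the port, e.g. in docker-compose.yml for the python service:
#   ports:
#     - "2222:22"
# and connect from VS Code Remote-SSH with: ssh -p 2222 root@<docker-host-ip>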

Job hangs

Deployed following https://github.com/FederatedAI/KubeFATE/blob/master/docker-deploy/README_zh.md; for some reason jobs always get stuck in intersection.
(screenshots)

Later I found that a job occasionally gets as far as binning before it dies.
(screenshot)

Below are the rollsite logs.
At the point where intersection_1 died:
(screenshot)

Then, about 20 minutes later, it moved again, from intersection_1 to intersection_0:
(screenshot)

After that the logs never moved again.
The fateboard log also stops here forever:
(screenshot)

Comparing with the run above:
(screenshot)
that one succeeds once intersection completes.

But jobs often hang at this point and I do not know why. If you suspect the machine configuration, feel free to ask.
The environment is CentOS 7, docker 1.8, docker-compose 1.24.
Additional information:
intersection_1 has about 7.2k rows and intersection_0 about 29.2k. intersection_1 still completes fairly often, but intersection_0 rarely does.

The toy_example test passes.
(screenshot)

toy example cannot run

I am testing the docker example. After executing python run_toy_example.py 10000 9999 1, the call in fate_flow_client.py, response = requests.post("/".join([server_url, "job", func.rstrip('_job')]), json=post_data), returns the error:
{'retcode': 100, 'retmsg': 'rpc request error: <_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = "io exception"\n\tdebug_error_string = "{"created":"@1574410850.912361635","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"io exception","grpc_status":14}"\n>'}
{ "retcode": 100, "retmsg": "rpc request error: <_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = \"io exception\"\n\tdebug_error_string = \"{\"created\":\"@1574410850.912361635\",\"description\":\"Error received from peer\",\"file\":\"src/core/lib/surface/call.cc\",\"file_line\":1017,\"grpc_message\":\"io exception\",\"grpc_status\":14}\"\n>" }

The log from federatedai/python:1.1-release shows:

[2019-11-22 08:20:50,912] ERROR in app: Exception on /submit [POST]
Traceback (most recent call last):
  File "/data/projects/fate/python/fate_flow/utils/api_utils.py", line 66, in remote_api
    #stat_logger.info("grpc api request: {}".format(_packet))
  File "/data/projects/python/venv/lib/python3.6/site-packages/grpc/_channel.py", line 533, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/data/projects/python/venv/lib/python3.6/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "io exception"
	debug_error_string = "{"created":"@1574410850.912361635","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"io exception","grpc_status":14}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/projects/python/venv/lib/python3.6/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/data/projects/python/venv/lib/python3.6/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/data/projects/python/venv/lib/python3.6/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/data/projects/python/venv/lib/python3.6/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/data/projects/python/venv/lib/python3.6/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/data/projects/python/venv/lib/python3.6/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/data/projects/fate/python/fate_flow/apps/job_app.py", line 45, in submit_job
    job_id, job_dsl_path, job_runtime_conf_path, logs_directory, model_info, board_url = JobController.submit_job(request.json)
  File "/data/projects/fate/python/fate_flow/driver/job_controller.py", line 92, in submit_job
    TaskScheduler.distribute_job(job=job, roles=job_runtime_conf['role'], job_initiator=job_initiator)
  File "/data/projects/fate/python/fate_flow/driver/task_scheduler.py", line 55, in distribute_job
    work_mode=job.f_work_mode)
  File "/data/projects/fate/python/fate_flow/utils/api_utils.py", line 53, in federated_api
    dest_party_id=dest_party_id, json_body=json_body, overall_timeout=overall_timeout)
  File "/data/projects/fate/python/fate_flow/utils/api_utils.py", line 72, in remote_api
    except grpc.RpcError as e:
Exception: rpc request error: <_Rendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "io exception"
	debug_error_string = "{"created":"@1574410850.912361635","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"io exception","grpc_status":14}"
>

Error when submitting a job: "description":"Error received from peer"

(venv) [root@1d9ab5e9f7b6 fate_flow]# python fate_flow_client.py -f submit_job -d examples/test_hetero_lr_job_dsl.json -c examples/test_hetero_lr_job_conf.json
{
"retcode": 100,
"retmsg": "rpc request error: <_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.INTERNAL\n\tdetails = "0.0.0.0:9370: java.util.concurrent.TimeoutException: [UNARYCALL][SERVER] unary call server error: overall process time exceeds timeout: 60000, metadata: {"task":{"taskId":"202005080940492786716"},"src":{"name":"202005080940492786716","partyId":"9999","role":"fateflow","callback":{"ip":"0.0.0.0","port":9360}},"dst":{"name":"202005080940492786716","partyId":"10000","role":"fateflow"},"command":{"name":"fateflow"},"operator":"POST","conf":{"overallTimeout":"60000"}}, lastPacketTimestamp: 1588930849341, loopEndTimestamp: 1588930909381\n\tat com.webank.ai.fate.networking.proxy.grpc.service.DataTransferPipedServerImpl.unaryCall(DataTransferPipedServerImpl.java:245)\n\tat com.webank.ai.fate.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:346)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171)\n\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:283)\n\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:710)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n"\n\tdebug_error_string = "{"created":"@1588930909.382965656","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"0.0.0.0:9370: java.util.concurrent.TimeoutException: [UNARYCALL][SERVER] unary call server error: overall process time exceeds timeout: 60000, metadata: {"task":{"taskId":"202005080940492786716"},"src":{"name":"202005080940492786716","partyId":"9999","role":"fateflow","callback":{"ip":"0.0.0.0","port":9360}},"dst":{"name":"202005080940492786716","partyId":"10000","role":"fateflow"},"command":{"name":"fateflow"},"operator":"POST","conf":{"overallTimeout":"60000"}}, lastPacketTimestamp: 1588930849341, loopEndTimestamp: 1588930909381\n\tat com.webank.ai.fate.networking.proxy.grpc.service.DataTransferPipedServerImpl.unaryCall(DataTransferPipedServerImpl.java:245)\n\tat com.webank.ai.fate.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:346)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171)\n\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:283)\n\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:710)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n","grpc_status":13}"\n>"
}

I have run into this "error received from peer" frequently over the past couple of days; occasionally it works fine. Any advice would be appreciated, thanks.
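The timeout means the request forwarded through the local proxy never got an answer from the other party within 60 s. A minimal narrowing-down sketch (IPs and container names are assumptions): check that each party's proxy port (9370 by default) is reachable from the other side, and watch the proxy logs while resubmitting.

nc -vz <other_party_ip> 9370
docker logs -f confs-9999_proxy_1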

Test example fails after deploying the docker version

Initially I deployed two FATE parties without docker and they ran fine. After the docker version came out, I switched one of the parties to the docker version and found that communication failed; I am not sure whether the two versions can interoperate.
To resolve this, I deployed the docker version on both parties and ran the test example on both sides. The results are as follows:

The two parties are 10000 and 10004.

Running python run_toy_example.py 10000 10004 1 on 10000:

(venv) [root@d94a571e5e8f toy_example]# python run_toy_example.py 10000 10004 1
stdout:{
    "data": {
        "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=2019103103130742954512&role=guest&party_id=10000",
        "job_dsl_path": "/data/projects/fate/python/jobs/2019103103130742954512/job_dsl.json",
        "job_runtime_conf_path": "/data/projects/fate/python/jobs/2019103103130742954512/job_runtime_conf.json",
        "model_info": {
            "model_id": "guest-10000#host-10004#model",
            "model_version": "2019103103130742954512"
        }
    },
    "jobId": "2019103103130742954512",
    "meta": null,
    "retcode": 0,
    "retmsg": "success"
}


toy example is running, jobid is 2019103103130742954512
job status is running
"2019-10-31 03:13:08,784 - task_executor.py[line:123] - ERROR: <_Rendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Name resolution failure"
debug_error_string = "{"created":"@1572491588.784572961","description":"Failed to create subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":2721,"referenced_errors":[{"created":"@1572491588.784570308","description":"Name resolution failure","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3026,"grpc_status":14}]}"
>"
Traceback (most recent call last):
File "/data/projects/fate/python/fate_flow/driver/task_executor.py", line 112, in run_task
run_object.run(parameters, task_run_args)
File "/data/projects/fate/python/federatedml/toy_example/secure_add_guest.py", line 106, in run
self.sync_share_to_host()
File "/data/projects/fate/python/federatedml/toy_example/secure_add_guest.py", line 81, in sync_share_to_host
idx=0)
File "/data/projects/fate/python/arch/api/federation.py", line 73, in remote
return RuntimeInstance.FEDERATION.remote(obj=obj, name=name, tag=tag, role=role, idx=idx)
File "/data/projects/fate/python/arch/api/cluster/federation.py", line 182, in remote
type=federation_pb2.SEND))
File "/data/projects/fate/venv/lib/python3.6/site-packages/grpc/_channel.py", line 533, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/data/projects/fate/venv/lib/python3.6/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Name resolution failure"
debug_error_string = "{"created":"@1572491588.784572961","description":"Failed to create subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":2721,"referenced_errors":[{"created":"@1572491588.784570308","description":"Name resolution failure","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3026,"grpc_status":14}]}"
>

Running python run_toy_example.py 10004 10000 1 on 10004:

(venv) [root@e64635054c61 toy_example]# python run_toy_example.py 10004 10000 1
stdout:{
    "data": {
        "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=201910310311108975682&role=guest&party_id=10004",
        "job_dsl_path": "/data/projects/fate/python/jobs/201910310311108975682/job_dsl.json",
        "job_runtime_conf_path": "/data/projects/fate/python/jobs/201910310311108975682/job_runtime_conf.json",
        "model_info": {
            "model_id": "guest-10004#host-10000#model",
            "model_version": "201910310311108975682"
        }
    },
    "jobId": "201910310311108975682",
    "meta": null,
    "retcode": 0,
    "retmsg": "success"
}


toy example is running, jobid is 201910310311108975682
job status is running
Traceback (most recent call last):
  File "run_toy_example.py", line 197, in <module>
    exec_toy_example(runtime_config)
  File "run_toy_example.py", line 171, in exec_toy_example
    show_log(jobid, "error")
  File "run_toy_example.py", line 149, in show_log
    with open(error_log, "r") as fin:
FileNotFoundError: [Errno 2] No such file or directory: '/data/projects/fate/python/examples/toy_example/test/log/job_201910310311108975682_log/guest/10004/secure_add_example_0/ERROR.log'

Both sides start the job successfully, but given the asymmetric errors I suspect one of them has failed; I just do not know the cause. The deployment procedure and configuration are identical on both sides.

If any other logs are needed, please leave a comment.

run toy example error

Hi, I am testing the docker version deployment. When I executed python run_toy_example.py 10000 9999 1, errors occurred:
stdout:{
"retcode": 100,
"retmsg": "rpc request error: <_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.INTERNAL\n\tdetails = "0.0.0.0:9370: java.util.concurrent.TimeoutException: [UNARYCALL][SERVER] unary call server error: overall process time exceeds timeout: 60000, metadata: {"task":{"taskId":"201911250719135674612"},"src":{"name":"201911250719135674612","partyId":"10000","role":"fateflow","callback":{"ip":"0.0.0.0","port":9360}},"dst":{"name":"201911250719135674612","partyId":"9999","role":"fateflow"},"command":{"name":"fateflow"},"operator":"POST","conf":{"overallTimeout":"60000"}}, lastPacketTimestamp: 1574666366182, loopEndTimestamp: 1574666426622\n\tat com.webank.ai.fate.networking.proxy.grpc.service.DataTransferPipedServerImpl.unaryCall(DataTransferPipedServerImpl.java:245)\n\tat com.webank.ai.fate.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:346)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171)\n\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:283)\n\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:710)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n"\n\tdebug_error_string = "{"created":"@1574666426.772088324","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"0.0.0.0:9370: java.util.concurrent.TimeoutException: [UNARYCALL][SERVER] unary call server error: overall process time exceeds timeout: 60000, metadata: {"task":{"taskId":"201911250719135674612"},"src":{"name":"201911250719135674612","partyId":"10000","role":"fateflow","callback":{"ip":"0.0.0.0","port":9360}},"dst":{"name":"201911250719135674612","partyId":"9999","role":"fateflow"},"command":{"name":"fateflow"},"operator":"POST","conf":{"overallTimeout":"60000"}}, lastPacketTimestamp: 1574666366182, loopEndTimestamp: 1574666426622\n\tat com.webank.ai.fate.networking.proxy.grpc.service.DataTransferPipedServerImpl.unaryCall(DataTransferPipedServerImpl.java:245)\n\tat com.webank.ai.fate.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:346)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171)\n\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:283)\n\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:710)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n","grpc_status":13}"\n>"
}

Traceback (most recent call last):
File "run_toy_example.py", line 196, in
exec_toy_example(runtime_config)
File "run_toy_example.py", line 161, in exec_toy_example
jobid = exec_task(dsl_path, runtime_config)
File "run_toy_example.py", line 91, in exec_task
"failed to exec task, status:{}, stderr is {} stdout:{}".format(status, stderr, stdout))
ValueError: failed to exec task, status:100, stderr is None stdout:{'retcode': 100, 'retmsg': 'rpc request error: <_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.INTERNAL\n\tdetails = "0.0.0.0:9370: java.util.concurrent.TimeoutException: [UNARYCALL][SERVER] unary call server error: overall process time exceeds timeout: 60000, metadata: {"task":{"taskId":"201911250719135674612"},"src":{"name":"201911250719135674612","partyId":"10000","role":"fateflow","callback":{"ip":"0.0.0.0","port":9360}},"dst":{"name":"201911250719135674612","partyId":"9999","role":"fateflow"},"command":{"name":"fateflow"},"operator":"POST","conf":{"overallTimeout":"60000"}}, lastPacketTimestamp: 1574666366182, loopEndTimestamp: 1574666426622\n\tat com.webank.ai.fate.networking.proxy.grpc.service.DataTransferPipedServerImpl.unaryCall(DataTransferPipedServerImpl.java:245)\n\tat com.webank.ai.fate.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:346)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171)\n\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:283)\n\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:710)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n"\n\tdebug_error_string = "{"created":"@1574666426.772088324","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"0.0.0.0:9370: java.util.concurrent.TimeoutException: [UNARYCALL][SERVER] unary call server error: overall process time exceeds timeout: 60000, metadata: {"task":{"taskId":"201911250719135674612"},"src":{"name":"201911250719135674612","partyId":"10000","role":"fateflow","callback":{"ip":"0.0.0.0","port":9360}},"dst":{"name":"201911250719135674612","partyId":"9999","role":"fateflow"},"command":{"name":"fateflow"},"operator":"POST","conf":{"overallTimeout":"60000"}}, lastPacketTimestamp: 1574666366182, loopEndTimestamp: 1574666426622\n\tat com.webank.ai.fate.networking.proxy.grpc.service.DataTransferPipedServerImpl.unaryCall(DataTransferPipedServerImpl.java:245)\n\tat com.webank.ai.fate.api.networking.proxy.DataTransferServiceGrpc$MethodHandlers.invoke(DataTransferServiceGrpc.java:346)\n\tat io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:171)\n\tat io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:283)\n\tat io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:710)\n\tat io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)\n\tat io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n","grpc_status":13}"\n>'}

I have already disabled the firewall with:
"systemctl disable firewalld.service"
"systemctl stop firewalld.service"

After deploying FATE (v1.3) on a k8s (v1.18.0) cluster, how do I configure a hostPath mount?

hostPath deployment reference: https://github.com/rancher/local-path-provisioner
It mainly involves the following three steps:
1. kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml
2. kubectl create -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/examples/pvc.yaml
3. kubectl create -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/examples/pod.yaml
The local-path, pvc and pod configurations can be viewed directly at the links above.
How should I modify these three configuration files so that a local directory is shared with the FATE deployment in the k8s cluster?
Or is there a better way to set up hostPath?

I deployed FATE on a k8s cluster and this is the last remaining step; once it is solved I will share a detailed deployment document for everyone's reference.
Looking forward to a reply, many thanks!
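For the question above, a minimal hostPath sketch (all names, sizes and paths are assumptions, not the official KubeFATE layout): declare a hostPath PersistentVolume plus a matching PVC in the fate-10000 namespace, then reference that claim from the pod or chart that should see the host directory.

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fate-10000-shared
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  hostPath:
    path: /data/projects/fate/shared      # directory on the node
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fate-10000-shared
  namespace: fate-10000
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ""                    # bind to the PV above, not to a StorageClass
  volumeName: fate-10000-shared
  resources:
    requests:
      storage: 10Gi
EOF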

What should I do when a third, fourth, or even more parties want to join the current environment deployed with docker-compose?

Hi, I have successfully deployed the FATE env with the following parties.conf, and it runs well when I use only these two nodes:
user=app
dir=/data/projects/fate
partylist=(10000 9999)
partyiplist=(172.40.20.3 172.40.20.5)
venvdir=/data/projects/fate/venv

I hit a problem when trying to add another host to the current env.
First I updated parties.conf:
user=app
dir=/data/projects/fate
partylist=(10000 9999 9998)
partyiplist=(172.40.20.3 172.40.20.5 172.40.20.7)
venvdir=/data/projects/fate/venv

and executed: bash generate_config.sh

then: bash docker_deploy.sh 9998

and got the successful deployment output,

but then I ran into the problem below (see screenshot)
when using party 10000 and party 9998 to run a job.

Can you tell me how to add a third, fourth, or more parties to the current docker env and how I should use them? Thanks.
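One reading of the deploy scripts' intent (an assumption, not a confirmed procedure): after adding party 9998 to parties.conf, the existing parties' configurations, including their proxy route tables, also need to be regenerated and redeployed so that 10000 and 9999 know how to reach 9998.

# partylist=(10000 9999 9998)
# partyiplist=(172.40.20.3 172.40.20.5 172.40.20.7)
bash generate_config.sh      # regenerates confs for all parties
bash docker_deploy.sh all    # redeploy every party, not only the new 9998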

Error when binding a model: retcode 100

I am using KubeFATE and followed
https://github.com/FederatedAI/KubeFATE/blob/master/docker-deploy/README_zh.md
step by step.
When I reach the bind-model step, this command fails:
$ python fate_flow_client.py -f bind -c examples/bind_model_service.json
The error output is:
(venv) [root@b8968ce974c3 fate_flow]# python fate_flow_client.py -f bind -c examples/bind_model_service.json
{
"retcode": 100,
"retmsg": "<_Rendezvous of RPC that terminated with:\n\tstatus = StatusCode.UNAVAILABLE\n\tdetails = "Connect Failed"\n\tdebug_error_string = "{"created":"@1588239041.208150875","description":"Failed to create subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":2721,"referenced_errors":[{"created":"@1588239041.208145968","description":"Pick Cancelled","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":241,"referenced_errors":[{"created":"@1588239041.207990948","description":"Connect Failed","file":"src/core/ext/filters/client_channel/subchannel.cc","file_line":689,"grpc_status":14,"referenced_errors":[{"created":"@1588239041.207960859","description":"Failed to connect to remote host: OS Error","errno":113,"file":"src/core/lib/iomgr/tcp_client_posix.cc","file_line":210,"os_error":"No route to host","syscall":"getsockopt(SO_ERROR)","target_address":"ipv4:192.168.216.130:8000"}]}]}]}"\n>"
}
Could you help me take a look? Thanks.

Problem in the docker-compose up -d stage of bash docker_deploy.sh all --training

During deployment I ran bash docker_deploy.sh all --training, and the docker-compose up -d stage failed with the following error:

> docker-compose down
....
> docker-compose up -d
Creating network "confs-10000_fate-network" with the default driver
Creating volume "confs-10000_shared_dir_examples" with local driver
Creating volume "confs-10000_shared_dir_federatedml" with local driver
Creating confs-10000_federation_1   ... done
Creating confs-10000_proxy_1        ... done
Creating confs-10000_egg_1          ... done
Creating confs-10000_redis_1      ... done
Creating confs-10000_mysql_1      ... done
Creating confs-10000_meta-service_1 ... done
Creating confs-10000_roll_1         ... done
Creating confs-10000_python_1       ... error  

ERROR: for confs-10000_python_1  Cannot create container for service python: failed to mount local volume: mount /path/to/host/dir/federatedml:/var/lib/docker/volumes/confs-10000_shared_dir_federatedml/_data, flags: 0x1000: no such file or directory  

ERROR: for python  Cannot create container for service python: failed to mount local volume: mount /path/to/host/dir/federatedml:/var/lib/docker/volumes/confs-10000_shared_dir_federatedml/_data, flags: 0x1000: no such file or directory  

ERROR: Encountered errors while bringing up the project.
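A sketch of the likely fix (paths are assumptions): the generated compose file still contains the unfilled placeholder /path/to/host/dir for the shared_dir_* bind mounts, so Docker cannot find the host directory. Point it at a real, existing directory and bring the services up again:

mkdir -p /data/projects/fate/shared/examples /data/projects/fate/shared/federatedml
sed -i 's#/path/to/host/dir#/data/projects/fate/shared#g' ./confs-10000/confs/docker-compose.yml
(cd ./confs-10000/confs && docker-compose up -d)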

Pod got stuck in CrashLoopBackOff due to FileNotFoundException

After helm installing fate-10000, fate-9999 and fate-exchange, and running
kubectl -n fate-10000 get pod
I found that many pods were stuck in CrashLoopBackOff;
for example, the egg-756d95b5b8-dhrjr pod using the federatedai/egg:1.1-release image.
Running
kubectl logs -n fate-10000 egg-756d95b5b8-dhrjr
gives:

Exception in thread "main" org.springframework.beans.factory.BeanDefinitionStoreException: IOException parsing XML document from class path resource [applicationContext-egg.xml]; nested exception is java.io.FileNotFoundException: class path resource [applicationContext-egg.xml] cannot be opened because it does not exist
at org.springframework.beans.factory.xml.XmlBeanDefinitionReader.loadBeanDefinitions(XmlBeanDefinitionReader.java:344)
at org.springframework.beans.factory.xml.XmlBeanDefinitionReader.loadBeanDefinitions(XmlBeanDefinitionReader.java:304)
at org.springframework.beans.factory.support.AbstractBeanDefinitionReader.loadBeanDefinitions(AbstractBeanDefinitionReader.java:188)
at org.springframework.beans.factory.support.AbstractBeanDefinitionReader.loadBeanDefinitions(AbstractBeanDefinitionReader.java:224)
at org.springframework.beans.factory.support.AbstractBeanDefinitionReader.loadBeanDefinitions(AbstractBeanDefinitionReader.java:195)
at org.springframework.beans.factory.support.AbstractBeanDefinitionReader.loadBeanDefinitions(AbstractBeanDefinitionReader.java:257)
at org.springframework.context.support.AbstractXmlApplicationContext.loadBeanDefinitions(AbstractXmlApplicationContext.java:128)
at org.springframework.context.support.AbstractXmlApplicationContext.loadBeanDefinitions(AbstractXmlApplicationContext.java:94)
at org.springframework.context.support.AbstractRefreshableApplicationContext.refreshBeanFactory(AbstractRefreshableApplicationContext.java:133)
at org.springframework.context.support.AbstractApplicationContext.obtainFreshBeanFactory(AbstractApplicationContext.java:636)
at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:521)
at org.springframework.context.support.ClassPathXmlApplicationContext.(ClassPathXmlApplicationContext.java:144)
at org.springframework.context.support.ClassPathXmlApplicationContext.(ClassPathXmlApplicationContext.java:85)
at com.webank.ai.eggroll.framework.egg.Egg.main(Egg.java:47)
Caused by: java.io.FileNotFoundException: class path resource [applicationContext-egg.xml] cannot be opened because it does not exist
at org.springframework.core.io.ClassPathResource.getInputStream(ClassPathResource.java:180)
at org.springframework.beans.factory.xml.XmlBeanDefinitionReader.loadBeanDefinitions(XmlBeanDefinitionReader.java:330)
... 13 more
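A minimal inspection sketch for a CrashLoopBackOff like this (the resource names are taken from the report; the checks only confirm which image is running and what the last crash logged):

kubectl -n fate-10000 describe pod egg-756d95b5b8-dhrjr
kubectl -n fate-10000 get pod egg-756d95b5b8-dhrjr -o jsonpath='{.spec.containers[*].image}'
kubectl -n fate-10000 logs egg-756d95b5b8-dhrjr --previous   # logs of the last crashed attempt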

A problem with a code snippet in docker-deploy/docker_deploy.sh

Hello, may I ask a question? While redeploying FATE today I found a problem with two lines of code in docker-deploy/docker_deploy.sh, specifically lines 156-157 of that file (shown below):

docker volume rm confs-${target_party_id}_shared_dir_examples
docker volume rm confs-${target_party_id}_shared_dir_federatedml

From what I see on my machine, there should be no "-" between confs and ${target_party_id}; when I redeployed FATE after an earlier deployment, the script failed saying that volume does not exist.

On my machine, the output of sudo docker volume ls is shown below; as you can see, there is no "-" between confs and ${target_party_id}.

(screenshot)
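A sketch that sidesteps the exact separator (the filter pattern is an assumption): list the shared_dir volumes that were actually created for the party and remove whatever matches, instead of hard-coding a "confs-<id>_" prefix.

docker volume ls --format '{{.Name}}' \
  | grep "confs.*${target_party_id}.*shared_dir" \
  | xargs -r docker volume rm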

Would you please check this problem (task_executor.py, line 118)? The error log is attached, thanks.

I followed the steps of the docker-compose deployment
(https://github.com/FederatedAI/KubeFATE/blob/master/docker-deploy/README.md)

and at the last step, Verify the Deployment,
when I execute the script:
python run_toy_example.py 10000 9999 1
I get this problem:

(part of the log)
Traceback (most recent call last):
File "/data/projects/fate/python/fate_flow/driver/task_executor.py", line 118, in run_task
run_object.run(parameters, task_run_args)
File "/data/projects/fate/python/federatedml/toy_example/secure_add_guest.py", line 113, in run
self._init_data()
File "/data/projects/fate/python/federatedml/toy_example/secure_add_guest.py", line 54, in _init_data
self.x = session.parallelize(kvs, include_key=True, partition=self.partition)
File "/data/projects/fate/python/arch/api/utils/profile_util.py", line 31, in _fn
rtn = func(*args, **kwargs)
File "/data/projects/fate/python/arch/api/session.py", line 76, in parallelize
error_if_exist=error_if_exist)
File "/data/projects/fate/python/arch/api/table/eggroll/session_impl.py", line 69, in parallelize
error_if_exist=error_if_exist)
File "/data/projects/fate/python/eggroll/api/cluster/eggroll.py", line 410, in parallelize
_table = self._create_table(create_table_info)
File "/data/projects/fate/python/eggroll/api/cluster/eggroll.py", line 444, in _create_table
count = self.eggroll_session._gc_table.get(info.storageLocator.name)
File "/data/projects/fate/python/eggroll/api/cluster/eggroll.py", line 214, in get
return _EggRoll.get_instance().get(self, k, use_serialize=use_serialize)
File "/data/projects/fate/python/eggroll/api/cluster/eggroll.py", line 571, in get
operand = self.kv_stub.get(kv_pb2.Operand(key=k), metadata=_get_meta(_table))
File "/data/projects/python/venv/lib/python3.6/site-packages/grpc/_channel.py", line 533, in call
return _end_unary_response_blocking(state, call, False, None)
File "/data/projects/python/venv/lib/python3.6/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.INTERNAL
details = "172.18.0.8:8011: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at

err_20191128.log

Documentation adjustment suggestions

Doc: https://github.com/FederatedAI/KubeFATE/blob/master/k8s-deploy/README_zh.md

Test procedure reference: https://github.com/FederatedAI/FATE/blob/master/cluster-deploy/doc/Fate-cluster%E9%83%A8%E7%BD%B2%E6%8C%87%E5%8D%97(install).md

1. The Toy_example test should be verified from the B side (party 9999):

# kubectl exec -it -c python svc/fateflow bash -n fate-9999
[root@python-6d84bd779-xnjnj python]# source /data/projects/python/venv/bin/activate
(venv) [root@python-6d84bd779-xnjnj python]# cd /data/projects/fate/python/examples/toy_example/
(venv) [root@python-6d84bd779-xnjnj toy_example]# python run_toy_example.py 9999 10000  1
stdout:{
    "data": {
        "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=201912261339220480571&role=guest&party_id=9999",
        "job_dsl_path": "/data/projects/fate/python/jobs/201912261339220480571/job_dsl.json",
        "job_runtime_conf_path": "/data/projects/fate/python/jobs/201912261339220480571/job_runtime_conf.json",
        "logs_directory": "/data/projects/fate/python/logs/201912261339220480571",
        "model_info": {
            "model_id": "guest-9999#host-10000#model",
            "model_version": "201912261339220480571"
        }
    },
    "jobId": "201912261339220480571",
    "retcode": 0,
    "retmsg": "success"
}


job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
job status is running
"2019-12-26 13:39:24,724 - secure_add_guest.py[line:109] - INFO: begin to init parameters of secure add example guest"
"2019-12-26 13:39:24,724 - secure_add_guest.py[line:112] - INFO: begin to make guest data"
"2019-12-26 13:39:26,121 - secure_add_guest.py[line:115] - INFO: split data into two random parts"
"2019-12-26 13:39:36,665 - secure_add_guest.py[line:118] - INFO: share one random part data to host"
"2019-12-26 13:39:37,026 - secure_add_guest.py[line:121] - INFO: get share of one random part data from host"
"2019-12-26 13:39:41,527 - secure_add_guest.py[line:124] - INFO: begin to get sum of guest and host"
"2019-12-26 13:39:42,315 - secure_add_guest.py[line:127] - INFO: receive host sum from guest"
"2019-12-26 13:39:42,439 - secure_add_guest.py[line:134] - INFO: success to calculate secure_sum, it is 2000.0000000000002"

2. Minimal test (suggested addition)
Run on the host side:

kubectl exec -it -c python svc/fateflow bash -n fate-10000
source /data/projects/python/venv/bin/activate
cd /data/projects/fate/python/examples/min_test_task
sh run.sh host fast
role is host
task is fast
Upload data config json: {'file': '/data/projects/fate/python/examples/min_test_task/../data/breast_a.csv', 'head': 1, 'partition': 10, 'work_mode': 1, 'table_name': 'host_table_name_1577367741_9875', 'namespace': 'host_table_namespace_1577367741_9875'}
stdout:{
    "data": {
        "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=201912261342213201742&role=local&party_id=0",
        "job_dsl_path": "/data/projects/fate/python/jobs/201912261342213201742/job_dsl.json",
        "job_runtime_conf_path": "/data/projects/fate/python/jobs/201912261342213201742/job_runtime_conf.json",
        "logs_directory": "/data/projects/fate/python/logs/201912261342213201742",
        "namespace": "host_table_namespace_1577367741_9875",
        "table_name": "host_table_name_1577367741_9875"
    },
    "jobId": "201912261342213201742",
    "retcode": 0,
    "retmsg": "success"
}


Upload output is {'data': {'board_url': 'http://fateboard:8080/index.html#/dashboard?job_id=201912261342213201742&role=local&party_id=0', 'job_dsl_path': '/data/projects/fate/python/jobs/201912261342213201742/job_dsl.json', 'job_runtime_conf_path': '/data/projects/fate/python/jobs/201912261342213201742/job_runtime_conf.json', 'logs_directory': '/data/projects/fate/python/logs/201912261342213201742', 'namespace': 'host_table_namespace_1577367741_9875', 'table_name': 'host_table_name_1577367741_9875'}, 'jobId': '201912261342213201742', 'retcode': 0, 'retmsg': 'success'}
table_name:host_table_name_1577367741_9875
namespace:host_table_namespace_1577367741_9875
process 697 thread 140574064924480 run __init__ init table name:__gc_get_intersect_output, namespace:get_intersect_output
created table: storage_type: LMDB, namespace: get_intersect_output, name: __gc_get_intersect_output, partitions: 1, in_place_computing: False
process 697 thread 140574064924480 run __init__ init table name:host_table_name_1577367741_9875, namespace:host_table_namespace_1577367741_9875
created table: storage_type: LMDB, namespace: host_table_namespace_1577367741_9875, name: host_table_name_1577367741_9875, partitions: 10, in_place_computing: False
table count:569
method:upload, count:569
The table name and namespace is needed by GUEST. To start a modeling task, please inform GUEST with the table name and namespace.
finish upload intersect data
*********************
*******finish!*******

Run on the guest side:

kubectl exec -it -c python svc/fateflow bash -n fate-9999
source /data/projects/python/venv/bin/activate
cd /data/projects/fate/python/examples/min_test_task
sh run.sh guest fast host_table_name_1577367741_9875 host_table_namespace_1577367741_9875

role is guest
task is fast
Start Upload Data
Upload data config json: {'file': '/data/projects/fate/python/examples/min_test_task/../data/breast_b.csv', 'head': 1, 'partition': 10, 'work_mode': 1, 'table_name': 'guest_table_name_1577367799_5279', 'namespace': 'guest_table_namespace_1577367799_5279'}
stdout:{
    "data": {
        "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=201912261343198264522&role=local&party_id=0",
        "job_dsl_path": "/data/projects/fate/python/jobs/201912261343198264522/job_dsl.json",
        "job_runtime_conf_path": "/data/projects/fate/python/jobs/201912261343198264522/job_runtime_conf.json",
        "logs_directory": "/data/projects/fate/python/logs/201912261343198264522",
        "namespace": "guest_table_namespace_1577367799_5279",
        "table_name": "guest_table_name_1577367799_5279"
    },
    "jobId": "201912261343198264522",
    "retcode": 0,
    "retmsg": "success"
}


Upload output is {'data': {'board_url': 'http://fateboard:8080/index.html#/dashboard?job_id=201912261343198264522&role=local&party_id=0', 'job_dsl_path': '/data/projects/fate/python/jobs/201912261343198264522/job_dsl.json', 'job_runtime_conf_path': '/data/projects/fate/python/jobs/201912261343198264522/job_runtime_conf.json', 'logs_directory': '/data/projects/fate/python/logs/201912261343198264522', 'namespace': 'guest_table_namespace_1577367799_5279', 'table_name': 'guest_table_name_1577367799_5279'}, 'jobId': '201912261343198264522', 'retcode': 0, 'retmsg': 'success'}
table_name:guest_table_name_1577367799_5279
namespace:guest_table_namespace_1577367799_5279
Data uploaded, expected table count: 569
process 292 thread 139864159831872 run __init__ init table name:__gc_get_intersect_output, namespace:get_intersect_output
created table: storage_type: LMDB, namespace: get_intersect_output, name: __gc_get_intersect_output, partitions: 1, in_place_computing: False
process 292 thread 139864159831872 run __init__ init table name:guest_table_name_1577367799_5279, namespace:guest_table_namespace_1577367799_5279
created table: storage_type: LMDB, namespace: guest_table_namespace_1577367799_5279, name: guest_table_name_1577367799_5279, partitions: 10, in_place_computing: False
table count:569
Test upload task success, upload count match DTable count
[Intersect] Start intersect task
stdout:{
    "data": {
        "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=201912261343266632783&role=guest&party_id=9999",
        "job_dsl_path": "/data/projects/fate/python/jobs/201912261343266632783/job_dsl.json",
        "job_runtime_conf_path": "/data/projects/fate/python/jobs/201912261343266632783/job_runtime_conf.json",
        "logs_directory": "/data/projects/fate/python/logs/201912261343266632783",
        "model_info": {
            "model_id": "guest-9999#host-10000#model",
            "model_version": "201912261343266632783"
        }
    },
    "jobId": "201912261343266632783",
    "retcode": 0,
    "retmsg": "success"
}


[Intersect] Start intersect job status checker, status counter: 0, jobid:201912261343266632783
[Intersect] cur job status:running, wait_time: 10.263408184051514
[Intersect] Start intersect job status checker, status counter: 1, jobid:201912261343266632783
Current task status: ['running', 'running']
[Intersect] cur job status:running, wait_time: 20.541152715682983
[Intersect] Start intersect job status checker, status counter: 2, jobid:201912261343266632783
Current task status: ['success', 'success']
[Intersect] cur job status:success, wait_time: 30.793477535247803
[Intersect] intersect task status is success
exec cmd: ['python', '/data/projects/fate/python/examples/min_test_task/../../fate_flow/fate_flow_client.py', '-f', 'component_output_data', '-j', '201912261343266632783', '-p', '9999', '-r', 'guest', '-cpn', 'intersect_0', '-o', '/data/projects/fate/python/examples/min_test_task/user_data']
task_type: component_output_data, jobid: 201912261343266632783, party_id: 9999, role: guest, component_name: intersect_0
intersect result:{'retcode': 0, 'directory': '/data/projects/fate/python/examples/min_test_task/user_data/job_201912261343266632783_intersect_0_guest_9999_output_data', 'retmsg': 'download successfully, please check /data/projects/fate/python/examples/min_test_task/user_data/job_201912261343266632783_intersect_0_guest_9999_output_data directory'}
Current subp status: 0
Job_status_checker Stdout is : 569
[Train] Start train task
stdout:{
    "data": {
        "board_url": "http://fateboard:8080/index.html#/dashboard?job_id=201912261343585998584&role=guest&party_id=9999",
        "job_dsl_path": "/data/projects/fate/python/jobs/201912261343585998584/job_dsl.json",
        "job_runtime_conf_path": "/data/projects/fate/python/jobs/201912261343585998584/job_runtime_conf.json",
        "logs_directory": "/data/projects/fate/python/logs/201912261343585998584",
        "model_info": {
            "model_id": "arbiter-10000#guest-9999#host-10000#model",
            "model_version": "201912261343585998584"
        }
    },
    "jobId": "201912261343585998584",
    "retcode": 0,
    "retmsg": "success"
}


[Train] cur job status:running, jobid:201912261343585998584, wait_time: 10.245079278945923
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 20.497931241989136
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 30.749210596084595
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 41.02689456939697
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 51.27661728858948
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 61.58478283882141
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 71.84794425964355
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 82.09212589263916
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 92.37399768829346
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 102.65567255020142
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 112.89676547050476
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 123.17980813980103
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 133.4493010044098
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 143.71157217025757
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 153.96755576133728
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 164.19690942764282
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 174.42942643165588
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 184.67726016044617
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 194.93053078651428
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 205.17497754096985
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 215.425639629364
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 225.69933485984802
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 235.9535937309265
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 246.17912769317627
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 256.4435749053955
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 266.7383930683136
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 277.03925704956055
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 287.29895973205566
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 297.53812742233276
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 307.7968189716339
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 318.084840297699
Current task status: ['running', 'success']
[Train] cur job status:running, jobid:201912261343585998584, wait_time: 328.3697247505188
Current task status: ['success', 'success']
[Train] cur job status:success, jobid:201912261343585998584, wait_time: 338.6157777309418
[Train] train task status is success
exec cmd: ['python', '/data/projects/fate/python/examples/min_test_task/../../fate_flow/fate_flow_client.py', '-f', 'component_metric_all', '-j', '201912261343585998584', '-p', '9999', '-r', 'guest', '-cpn', 'evaluation_0']
task_type: component_metric_all, jobid: 201912261343585998584, party_id: 9999, role: guest, component_name: evaluation_0
[Train] train eval:[['auc', 0.989562], ['ks', 0.92762]]
TEST_UPLOAD is success
TEST_INTERSECT is success
TEST_TRAIN is success
Test success:3, failed:0
*********************
*******finish!*******

mysql container stuck in a restarting state

(screenshot)
The mysql container keeps restarting.
Checking the logs gives the following.
The mysql container log reports the error below:
(screenshot of the mysql container error log)
I tried running docker-compose down and then docker-compose up.
The mysql container log is now as follows:
(screenshot of the mysql container log after the restart)
I installed docker and docker-compose with sudo as a regular (non-root) user, then added that user to the docker group.
The deployment was performed as that regular user.
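
Since the screenshots above did not survive as text, here is a minimal sketch (assuming the Docker SDK for Python is installed and the container is named confs-10000_mysql_1, which is only a guess for this deployment) that prints the mysql container's exit state and its last log lines so the actual error can be pasted as text:

# Minimal sketch, not part of KubeFATE: recover what the missing screenshots showed.
# The container name "confs-10000_mysql_1" is an assumption; use the name from `docker ps -a`.
import docker

client = docker.from_env()
mysql = client.containers.get("confs-10000_mysql_1")

state = mysql.attrs["State"]
print("status:", state["Status"], "exit code:", state["ExitCode"], "oom killed:", state["OOMKilled"])

# Tail the log so the mysqld error is visible as plain text.
print(mysql.logs(tail=50).decode("utf-8", errors="replace"))

If the log shows a permission error on the mysql data directory, it is most likely an ownership problem on the bind-mounted volume created by the non-root deployment user rather than a problem with the image itself.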

fate-network gets a random subnet at container startup; after restarting with down / up -d, old jobs cannot be deleted, exported, or inspected

When the containers are started with docker-compose, suppose fate-network is assigned the subnet 172.19.0.x on the first start.
Later, for operational reasons (for example, a job fails because of a fault and the containers have to be stopped to change the configuration), the container group is restarted with docker-compose down and docker-compose up -d. After the restart, fate-network is assigned the subnet 172.30.0.x.

After the restart, deleting a job that ran before the restart fails in fateboard with the error: please start execute server:172.19.0.4:9380
The detail pages of those old jobs in fateboard report similar errors.

There are two possible fixes: either FATE changes the way jobs are managed so that it does not look up the server by the address recorded with the job, or KubeFATE pins the container subnet, e.g.:

networks:
  fate-network:
    ipam:
      config:
      - subnet: 172.30.0.0/16
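
As a quick check that the pinning works, here is a minimal sketch (assuming the Docker SDK for Python and a compose project named confs-10000, so the network name below is only a guess; check `docker network ls` for the exact one) that prints the subnet fate-network actually received; run it before and after a down / up -d cycle and the subnet should stay at 172.30.0.0/16:

# Minimal sketch, not part of KubeFATE: verify the fate-network subnet across restarts.
import docker

client = docker.from_env()
# The network name is an assumption; docker-compose prefixes it with the project name.
net = client.networks.get("confs-10000_fate-network")
for cfg in net.attrs["IPAM"]["Config"]:
    print("subnet:", cfg.get("Subnet"), "gateway:", cfg.get("Gateway"))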

Starting up docker confs-9999_python_1 and confs-10000_python_1 gives errors

When I start the docker containers, I see the errors below in confs-10000_python_1 and confs-9999_python_1:

Traceback (most recent call last):
  File "fate_flow_server.py", line 91, in <module>
    session.init(mode=RuntimeConfig.WORK_MODE, backend=Backend.EGGROLL)
  File "/data/projects/fate/python/arch/api/session.py", line 50, in init
    session = build_session(job_id=job_id, work_mode=mode, backend=backend)
  File "/data/projects/fate/python/arch/api/table/session.py", line 34, in build_session
    session = session_impl.FateSessionImpl(eggroll_session, work_mode)
  File "/data/projects/fate/python/arch/api/table/eggroll/session_impl.py", line 32, in __init__
    self._eggroll = eggroll_util.build_eggroll_runtime(work_mode=work_mode, eggroll_session=eggroll_session)
  File "/data/projects/fate/python/arch/api/table/eggroll_util.py", line 44, in build_eggroll_runtime
    return eggroll_init(eggroll_session)
  File "/data/projects/fate/python/eggroll/api/cluster/eggroll.py", line 79, in eggroll_init
    eggroll_runtime = _EggRoll(eggroll_session=eggroll_session)
  File "/data/projects/fate/python/eggroll/api/cluster/eggroll.py", line 364, in __init__
    self.session_stub.getOrCreateSession(self.eggroll_session.to_protobuf())
  File "/data/projects/python/venv/lib/python3.6/site-packages/grpc/_channel.py", line 533, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/data/projects/python/venv/lib/python3.6/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
    raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Connect Failed"
	debug_error_string = "{"created":"@1574473852.541712846","description":"Failed to create subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":2721,"referenced_errors":[{"created":"@1574473852.541710746","description":"Pick Cancelled","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":241,"referenced_errors":[{"created":"@1574473852.541691945","description":"Connect Failed","file":"src/core/ext/filters/client_channel/subchannel.cc","file_line":689,"grpc_status":14,"referenced_errors":[{"created":"@1574473852.541670345","description":"Failed to connect to remote host: OS Error","errno":111,"file":"src/core/lib/iomgr/tcp_client_posix.cc","file_line":205,"os_error":"Connection refused","syscall":"connect","target_address":"ipv4:172.19.0.8:8011"}]}]}]}"
>
created table: storage_type: LMDB, namespace: b3844b54-0d93-11ea-8b9f-0242ac130009, name: __gc_b3844b54-0d93-11ea-8b9f-0242ac130009, partitions: 1, in_place_computing: False
 * Running on http://0.0.0.0:9380/ (Press CTRL+C to quit)

The roll container does not show any error:

[INFO ] 2019-11-23T01:50:52,103 [main] [ThreadPoolTaskExecutor:171] - Initializing ExecutorService
[INFO ] 2019-11-23T01:50:52,154 [main] [PostProcessorRegistrationDelegate$BeanPostProcessorChecker:330] - Bean 'asyncThreadPool' of type [org.springframework.scheduling.config.TaskExecutorFactoryBean] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
[INFO ] 2019-11-23T01:50:52,160 [main] [PostProcessorRegistrationDelegate$BeanPostProcessorChecker:330] - Bean 'asyncThreadPool' of type [org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
[INFO ] 2019-11-23T01:50:54,326 [main] [ThreadPoolTaskExecutor:171] - Initializing ExecutorService
[INFO ] 2019-11-23T01:50:54,346 [main] [ThreadPoolTaskExecutor:171] - Initializing ExecutorService
[INFO ] 2019-11-23T01:50:54,471 [main] [ThreadPoolTaskScheduler:171] - Initializing ExecutorService 'routineScheduler'
[INFO ] 2019-11-23T01:50:54,695 [main] [DefaultGrpcServerFactory:119] - final conf path: /data/projects/fate/roll/conf/roll.properties
[INFO ] 2019-11-23T01:50:55,130 [pool-3-thread-1] [GrpcChannelFactory:108] - [COMMON][CHANNEL][CREATE] creating insecure channel for endpoint: ip: meta-service, port: 8590, hostname:
[INFO ] 2019-11-23T01:50:55,429 [pool-3-thread-1] [GrpcChannelFactory:201] - [COMMON][CHANNEL][CREATE] creating channel to {"ip":"meta-service","port":8590,"hostname":""}, isSecure: false
[INFO ] 2019-11-23T01:50:55,818 [pool-3-thread-1] [GrpcChannelFactory:231] - [COMMON][CHANNEL][CREATE] created channel to {"ip":"meta-service","port":8590,"hostname":""}, isSecure: false
[INFO ] 2019-11-23T01:50:56,013 [main] [DefaultGrpcServerFactory:62] - server build on port only :8011
[INFO ] 2019-11-23T01:51:01,416 [grpcServiceExecutor-1] [RollSessionServiceImpl:33] - [ROLL][SESSION] getOrCreateSession. request: {"sessionId":"b3844b54-0d93-11ea-8b9f-0242ac130009","computingEngineConf":{"eggroll.roll.port":"8011","eggroll.server.conf.path":"eggroll/conf/server_conf.json","eggroll.roll.host":"roll"},"namingPolicy":"DEFAULT","tag":""}
[INFO ] 2019-11-23T01:51:01,465 [grpcServiceExecutor-1] [RollKvServiceImpl:119] - Kv.createIfAbsent request received. request: {"storageLocator":{"type":"LMDB","namespace":"b3844b54-0d93-11ea-8b9f-0242ac130009","name":"__gc_b3844b54-0d93-11ea-8b9f-0242ac130009","fragment":0},"fragmentCount":1}
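
The "Connect Failed" above is to ipv4:172.19.0.8:8011, i.e. the roll service, and the timestamps suggest a startup-ordering race rather than a broken image: fate_flow tried to connect at 01:50:52 while roll only bound port 8011 at 01:50:56, and the later "created table" and "Running on http://0.0.0.0:9380/" lines show fate_flow did come up once roll answered. Before restarting the python container, a minimal sketch (not part of FATE) to confirm from inside that container that roll:8011 is reachable:

# Minimal sketch, not part of FATE: wait for the roll service to become reachable.
# Host and port "roll:8011" are taken from the traceback above; adjust if your compose file differs.
import grpc

channel = grpc.insecure_channel("roll:8011")
try:
    # Blocks until the channel is READY or the timeout expires.
    grpc.channel_ready_future(channel).result(timeout=30)
    print("roll:8011 is reachable; restarting the python container should succeed now")
except grpc.FutureTimeoutError:
    print("roll:8011 is still unreachable; check the roll container and fate-network")
finally:
    channel.close()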

Kubernetes pod evicted state

I am using the Kubernetes deployment. During training, the egg service pod status becomes 'Evicted' after running for some time. I think this is due to low memory. Is anyone else experiencing this?
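
To confirm that the evictions are memory related, here is a minimal sketch (assuming the official kubernetes Python client and a namespace of fate-10000, which is only a guess for this deployment) that lists evicted pods together with the eviction message recorded by the kubelet:

# Minimal sketch, not part of KubeFATE: list evicted pods and their eviction reasons.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# The namespace is an assumption; use the one your FATE party is deployed in.
for pod in v1.list_namespaced_pod("fate-10000").items:
    if pod.status.reason == "Evicted":
        # For memory-pressure evictions the message names the container that exceeded its request.
        print(pod.metadata.name, "->", pod.status.message)

If the messages confirm memory pressure, raising the memory requests/limits of the egg pods or adding node capacity is the usual remedy.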

run demo error

demo: python run_toy_example.py 10000 10000 1
The error is as follows:
"2019-11-25 04:01:52,599 - task_executor.py[line:127] - ERROR: must be real number, not NoneType"
Traceback (most recent call last):
  File "/data/projects/fate/python/fate_flow/driver/task_executor.py", line 118, in run_task
    run_object.run(parameters, task_run_args)
  File "/data/projects/fate/python/federatedml/toy_example/secure_add_guest.py", line 130, in run
    secure_sum = self.reconstruct(guest_sum, host_sum)
  File "/data/projects/fate/python/federatedml/toy_example/secure_add_guest.py", line 71, in reconstruct
    print("host sum is %.4f" % host_sum)
TypeError: must be real number, not NoneType
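
The TypeError itself is only a symptom: old-style "%.4f" formatting raises exactly this error when given None, which suggests host_sum was never filled in, i.e. the host's partial sum did not make it to the guest over the federation/proxy route. A minimal sketch (not the FATE source) reproducing the mechanism:

# Minimal sketch, not the FATE source: reproduce the error mechanism.
host_sum = None  # what the guest ends up with when the remote value never arrives

try:
    print("host sum is %.4f" % host_sum)
except TypeError as exc:
    print("reproduced:", exc)  # -> must be real number, not NoneType

So the thing to investigate is the proxy/federation route between the two parties rather than the toy example code itself.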

Another error:
Nov 25, 2019 7:30:20 AM io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference cleanQueue
SEVERE: ~* Channel ManagedChannelImpl{logId=95, target=federation:9394} was not shutdown properly!!! *~
Make sure to call shutdown()/shutdownNow() and wait until awaitTermination() returns true.
java.lang.RuntimeException: ManagedChannel allocation site
	at io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference.<init>(ManagedChannelOrphanWrapper.java:103)
	at io.grpc.internal.ManagedChannelOrphanWrapper.<init>(ManagedChannelOrphanWrapper.java:53)
	at io.grpc.internal.ManagedChannelOrphanWrapper.<init>(ManagedChannelOrphanWrapper.java:44)
	at io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:411)
	at com.webank.ai.fate.networking.proxy.factory.GrpcStubFactory.createChannel(GrpcStubFactory.java:221)
	at com.webank.ai.fate.networking.proxy.factory.GrpcStubFactory.access$100(GrpcStubFactory.java:51)
	at com.webank.ai.fate.networking.proxy.factory.GrpcStubFactory$1.load(GrpcStubFactory.java:89)
	at com.webank.ai.fate.networking.proxy.factory.GrpcStubFactory$1.load(GrpcStubFactory.java:83)
	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3528)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2277)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2154)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2044)
	at com.google.common.cache.LocalCache.get(LocalCache.java:3952)
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3974)
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4958)
	at com.webank.ai.fate.networking.proxy.factory.GrpcStubFactory.getChannel(GrpcStubFactory.java:157)
	at com.webank.ai.fate.networking.proxy.factory.GrpcStubFactory.getStubBase(GrpcStubFactory.java:123)
	at com.webank.ai.fate.networking.proxy.factory.GrpcStubFactory.getAsyncStub(GrpcStubFactory.java:113)
	at com.webank.ai.fate.networking.proxy.factory.GrpcStubFactory.getAsyncStub(GrpcStubFactory.java:109)
	at com.webank.ai.fate.networking.proxy.grpc.client.DataTransferPipedClient.getStub(DataTransferPipedClient.java:215)
	at com.webank.ai.fate.networking.proxy.grpc.client.DataTransferPipedClient.unaryCall(DataTransferPipedClient.java:183)
	at com.webank.ai.fate.networking.proxy.service.CascadedCaller.run(CascadedCaller.java:64)
	at com.webank.ai.fate.networking.proxy.service.CascadedCaller$$FastClassBySpringCGLIB$$5248343b.invoke()
	at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:749)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
	at org.springframework.aop.interceptor.AsyncExecutionInterceptor.lambda$invoke$0(AsyncExecutionInterceptor.java:115)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

confs-10000.zip
