federatedai / eggroll
A Simple High Performance Computing Framework for [Federated] Machine Learning
License: Apache License 2.0
Cannot find linking libraries on CentOS.
Eggroll 1.x supports data import from memory only. Users have to process their data themselves and import it into Eggroll.
We should let users import data directly from a file.
Users can pass in their own split function, which returns a tuple of (key, value). The keys and values will be imported into Eggroll.
Implementing it in v1.x as there is a requirement from the FATE side. Porting it to v2.x later.
Labeled as v1.x.
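A minimal sketch of the idea, assuming a line-oriented text file. The names `iter_key_values` and `csv_split` are hypothetical and are not part of Eggroll's API; they only illustrate how a user-supplied split function could turn each line into a (key, value) tuple before the pairs are handed to an import call.

```python
from typing import Callable, Iterable, Iterator, Tuple


def iter_key_values(
    lines: Iterable[str],
    split_func: Callable[[str], Tuple[str, str]],
) -> Iterator[Tuple[str, str]]:
    """Lazily yield (key, value) tuples produced by the user's split function."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        yield split_func(line)


def csv_split(line: str) -> Tuple[str, str]:
    """Example split function: the first comma-separated field is the key."""
    key, _, value = line.partition(",")
    return key, value


# An open file object is an iterable of lines, so it can be passed directly:
#     with open(path) as f:
#         pairs = iter_key_values(f, csv_split)
pairs = list(iter_key_values(["a,1\n", "\n", "b,2,3\n"], csv_split))
```

Because the helper is a generator, the file is never loaded into memory as a whole; only the split function has to know the file format.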
Migrating core library from 1.x to 2.x with the following changes:
Some sensitive information needs to be removed.
Add a log web server for every node.
The package version in auto-packaging.sh is still 0.3.
Update the path in services.sh
Hello, please describe the role of RollFrame in Eggroll. I could not find the package eggroll-roll-pair-2.0.1.jar in the packaged lib.
Currently put_all is single threaded. This results in very low data input performance.
Suggest implementing a parallel mechanism, e.g. a multi-threaded or multi-process put_all.
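One possible shape for a multi-threaded variant is sketched below. It is an illustration under assumptions, not Eggroll's implementation: `parallel_put_all`, `chunked`, and the `put_batch` callable (whatever writes one batch of pairs to storage) are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(pairs, size):
    """Yield successive lists of at most `size` items from an iterable."""
    it = iter(pairs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def parallel_put_all(put_batch, pairs, batch_size=1000, workers=4):
    """Write (key, value) pairs in batches using a pool of worker threads.

    `put_batch` is a hypothetical callable that stores one batch; it must be
    safe to call from several threads at once. Returns the number of pairs
    submitted. Note that all batches are queued up front, so a bounded queue
    would be needed for inputs that do not fit in memory.
    """
    total = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        for batch in chunked(pairs, batch_size):
            futures.append(pool.submit(put_batch, batch))
            total += len(batch)
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return total
```

Threads help mainly when `put_batch` is I/O bound (network or disk); for CPU-bound serialization a process pool would be the closer fit.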
Add rocksdb and network support
No exception is thrown when querying a table that does not exist in a directory. For example, LevelDB is not supported yet, but if we query a LevelDB table, no exception is thrown.
Current PR template does not contain a space before #.
Adding a space allows referencing an issue without typing an extra one.
When session is null, no computing engine will be created.
Need to provide default computing engines when session is null.
The existing mechanism supports data import from CSV files and from memory. Databases and HDFS are also common data sources, so we need to support importing data directly from them.
Labeled as v1.x, but v2.x also needs this feature.
core/pom.xml version info change from 2.9.9 to 2.9.9.1
Currently, LMDB is the only supported storage engine. A disk-based engine is required for large-scale data computing.
Describe the bug
In Eggroll 1.x, the return value of flatMap or mapPartition2 is a list, which puts heavy pressure on memory. We hope that a later version of Eggroll can support returning a generator instead.
Such as
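A minimal sketch of the requested behaviour, using hypothetical helper names rather than Eggroll's actual flatMap API or the reporter's own example: `flat_map_list` materializes every output pair before returning, while `flat_map_lazy` yields pairs one at a time.

```python
def flat_map_list(pairs, func):
    """Eager variant: collects all results in a list before returning."""
    out = []
    for key, value in pairs:
        out.extend(func(key, value))
    return out


def flat_map_lazy(pairs, func):
    """Lazy variant: yields results as they are produced."""
    for key, value in pairs:
        yield from func(key, value)


def explode(key, value):
    """Example user function: one output pair per whitespace-separated token."""
    for i, token in enumerate(value.split()):
        yield (f"{key}_{i}", token)


data = [("a", "x y"), ("b", "z")]
lazy = flat_map_lazy(data, explode)
first = next(lazy)  # only the first output pair has been computed so far
```

With the lazy variant, peak memory is bounded by what the consumer holds on to, not by the total size of the output.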
FATE: 1.4.2
EGGROLL: 2.0.1
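Deployed with the Docker-Compose setup from KubeFATE, starting four containers (rollsite, clustermanager, nodemanager, mysql) as a cluster across two hosts.
1. Pulling the images with docker pull *** and then deploying with Docker-Compose works fine.
2. However, if the Docker images are built offline and then deployed with Docker-Compose, the roll_pair test in step 6.2,
python -m unittest test_roll_pair.TestRollPairCluster (cluster mode)
fails. What is going wrong?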
ERROR: setUpClass (test_roll_pair.TestRollPairCluster)
...
ValueError: processor in session meta is not valid:<ErSessionMeta(id=er_session_py_20200827.----_192.167.0.4, name=, status=ERROR, tag=, processors=[***, len=11], options=[{'eggroll.session.processors.per.node':'10'}]) at 0x...>
Exception in thread roll_pair-send_command-90f32f70-dba0-11ea-88c4-fa163e1070a0-py-job-93d9f2a0-dba0-11ea-8514-fa163e1070a0_putAll:
Traceback (most recent call last):
File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 71, in sync_send
response = _command_stub.call(request.to_proto())
File "/data/projects/fate/common/python/venv/lib/python3.6/site-packages/grpc/_channel.py", line 565, in call
return _end_unary_response_blocking(state, call, False, None)
File "/data/projects/fate/common/python/venv/lib/python3.6/site-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Connection reset by peer"
debug_error_string = "{"created":"@1597129331.341114809","description":"Error received from peer ipv4:xx.xx.xx.xx:32882","file":"src/core/lib/surface/call.cc","file_line":1052,"grpc_message":"Connection reset by peer","grpc_status":14}"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/projects/fate/common/miniconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/data/projects/fate/common/miniconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/data/projects/fate/eggroll/python/eggroll/roll_pair/roll_pair.py", line 568, in send_command
serdes_type=SerdesTypes.PROTOBUF)
File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 54, in simple_sync_send
results = self.sync_send(inputs=[input], output_types=[output_type], endpoint=endpoint, command_uri=command_uri, serdes_type=serdes_type)
File "/data/projects/fate/eggroll/python/eggroll/core/client.py", line 84, in sync_send
raise CommandCallError(command_uri, endpoint, e)
eggroll.core.client.CommandCallError: ('Failed to call command: CommandURI(_uri=v1/roll-pair/runJob) to endpoint: xx.xx.xx.xx:32882, caused by: ', <_Rendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Connection reset by peer"
debug_error_string = "{"created":"@1597129331.341114809","description":"Error received from peer ipv4:xx.xxx.xx.xx:32882","file":"src/core/lib/surface/call.cc","file_line":1052,"grpc_message":"Connection reset by peer","grpc_status":14}"
)
Framework POC, including cluster / node manager, storage format, data transfer, etc.
This work is mainly based on the RollFrame POC and the core library migration.
The RollPair / RollTensor POC will start soon.
If users choose LEVEL_DB (actually RocksDB) as their storage engine, a destroy() call will not delete the data file.
This looks like a bug and needs to be fixed.
To ease debugging, a call sequence number needs to be added to each call.
This sequence number should be unique.
Suggest adding it to the gRPC call's metadata to avoid a proto change.
Labeled as v1.x.
In 2.x, consider whether it should be added to the proto file.
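A minimal sketch of how such a sequence number could be generated and attached, assuming a per-client random prefix plus a monotonically increasing counter. The class name `CallSequence` and the metadata key `x-call-seq` are hypothetical; the only gRPC fact relied on is that Python stubs accept a `metadata` keyword argument holding a sequence of (key, value) string tuples with lowercase keys.

```python
import itertools
import threading
import uuid


class CallSequence:
    """Produces unique call sequence numbers as gRPC metadata tuples."""

    def __init__(self, client_id=None):
        # A random prefix keeps numbers unique across clients and restarts.
        self._client_id = client_id or uuid.uuid4().hex
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def next_metadata(self):
        """Return metadata for one call, e.g. (("x-call-seq", "<id>-1"),)."""
        with self._lock:
            n = next(self._counter)
        return (("x-call-seq", f"{self._client_id}-{n}"),)


seq = CallSequence(client_id="c1")
md = seq.next_metadata()
# With a real stub, the metadata would be passed alongside the request:
#     response = stub.call(request, metadata=seq.next_metadata())
```

Carrying the number in metadata keeps the request messages unchanged, which is the point of avoiding a proto change; the server side can read it from the call context and include it in its logs.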
When creating a new processor on a heavily loaded machine, roll might fail to connect to the processor, showing 'connection refused'.
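One common mitigation is to retry the connection with exponential backoff, giving a slow-starting processor time to begin listening. The sketch below is an assumption about how that could look, not the fix adopted in Eggroll: `connect_with_retry` and the `connect` callable are hypothetical, and `ConnectionRefusedError` stands in for whatever error the real client raises (over gRPC this would typically surface as an RPC error with status UNAVAILABLE).

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call `connect()` until it succeeds or `attempts` is exhausted.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last ConnectionRefusedError if every attempt fails.
    `sleep` is injectable so the backoff can be tested without waiting.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionRefusedError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

The attempt limit matters: without it, a processor that never starts would hang the caller instead of producing an error.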
Data import for 1 billion rows of data.
Need to improve performance too.
Implementing it in v1.x as there is a requirement from the FATE side. Porting it to 2.x later.
Labeled as v1.x.
Hi all, could you provide an API that exposes the number of threads currently in use by the thread pool, or print that number to the log?
Require a session mechanism for the following reasons:
Describe the bug
A stream error occurs when launching a new job in FATE.
What version of Eggroll and what programming language (including its version) are you using?
0.3, Python
What is the severity of this issue and why?
L0: the training job got stuck and rebooting does not resolve it.
How to reproduce this issue?
Steps to reproduce the behavior:
What did you expect to see?
Should be all good with no errors
What did you see instead?
Stream error
Could you offer us the error logs or error screenshot?
If applicable, add logs or screenshots to help explain your problem.
What is your environment information (please complete the following information)?
Anything else we should know about your project / environment?
Describe the bug
Roll reports 'connection refused' when launching a new job.
What version of Eggroll and what programming language (including its version) are you using?
Python, Eggroll 0.3
What is the severity of this issue and why?
L1 - System totally unavailable;
the training job got stuck and rebooting does not resolve it.
How to reproduce this issue?
Steps to reproduce the behavior:
1. Launch a new training job in FATE (it gets stuck).
2. Go to roll/logs/fate-roll.log.
3. See 'connection refused'.
What did you expect to see?
There should be no error.
What did you see instead?
'connection refused'
Could you offer us the error logs or error screenshot?
What is your environment information (please complete the following information)?
The POCs of RollPair and RollFrame are completed. Each has its own data structure and scheduling framework (though they are very similar).
A code merge of these 2 modules is required.
DTable objects support GC
Frame-based roll objects for computing, storage, and transfer.
Changes:
columnar frame format support
local threads first
concurrent computing in a partition
in memory computing
a = eggroll.table('name', 'ns')
b = a.get('key')
b is None
run_cleanup_task(func)
The parameter func is a function, not None.