Comments (27)
+1. For now, decommission can be done through the exclude-node file on the coordinator side.
The exclude-node file can also be stored in HDFS.
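As a rough illustration of the exclude-node-file approach, here is a minimal sketch of how a coordinator could filter servers against such a file. The class name `ExcludeNodeFilter`, the one-id-per-line format with `#` comments, and the server id strings are all assumptions made for this example, not Uniffle's actual implementation. Reading the file from HDFS is left out; the lines are passed in directly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not Uniffle's actual class: filters shuffle servers
// against the contents of an exclude-node file.
class ExcludeNodeFilter {

  // Assumed file format: one server id per line; blank lines and '#' comments are skipped.
  static Set<String> parse(List<String> lines) {
    Set<String> excluded = new HashSet<>();
    for (String line : lines) {
      String id = line.trim();
      if (!id.isEmpty() && !id.startsWith("#")) {
        excluded.add(id);
      }
    }
    return excluded;
  }

  // Returns the servers that may still receive new assignments, preserving input order.
  static List<String> assignable(List<String> registered, Set<String> excluded) {
    List<String> result = new ArrayList<>();
    for (String server : registered) {
      if (!excluded.contains(server)) {
        result.add(server);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Set<String> excluded = parse(Arrays.asList("# nodes being drained", "10.0.0.2-19999", ""));
    List<String> registered = Arrays.asList("10.0.0.1-19999", "10.0.0.2-19999");
    System.out.println(assignable(registered, excluded));
  }
}
```

A server listed in the file simply stops receiving new assignments; what happens to data it already holds is outside this sketch.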
from incubator-uniffle.
I understand that you need a rolling upgrade feature. In our plan, we want to accomplish this feature with a k8s operator. For standalone mode, we have no plans yet. It's also necessary to survey this feature first, and I think we should discuss this problem further.
from incubator-uniffle.
Deploying on k8s is a good choice, but having one more option is not a bad thing. Not all teams are willing to use k8s. I have created a PR.
from incubator-uniffle.
Could you write a design doc (using Google Docs)? This issue is a little complex.
from incubator-uniffle.
If we want to add an interface to control the shuffle server's behavior, we should have a complete design, and I think we need detailed discussions. We had a similar idea in issue #37.
from incubator-uniffle.
YARN node decommission: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/GracefulDecommission.html Maybe we should also look at how other systems implement decommission.
from incubator-uniffle.
Yes. In #85, I follow the YARN decommission mechanism, so I think it's better to control decommission through the coordinator. Feel free to discuss more.
I looked at HDFS datanode decommission; it is similar to YARN decommission. Refer to: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDataNodeAdminGuide.html
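To make the graceful-decommission idea concrete, here is a minimal sketch of the state transitions a shuffle server could follow: stop accepting new applications, then finish once the running ones are done. The class `DecommissionTracker` and its states are hypothetical names for illustration, not code from #85 or from YARN, and the timeout that a graceful decommission typically also enforces is omitted.

```java
// Hypothetical model of a gracefully decommissioned node; illustration only.
class DecommissionTracker {
  enum State { ACTIVE, DECOMMISSIONING, DECOMMISSIONED }

  private State state = State.ACTIVE;
  private int runningApps;

  DecommissionTracker(int runningApps) {
    this.runningApps = runningApps;
  }

  State state() {
    return state;
  }

  // A draining or drained server should not be assigned new applications.
  boolean acceptsNewApps() {
    return state == State.ACTIVE;
  }

  // Starts draining; completes immediately when nothing is running.
  void decommission() {
    if (state == State.ACTIVE) {
      state = State.DECOMMISSIONING;
      finishIfDrained();
    }
  }

  // Called when one application served by this node finishes.
  void appFinished() {
    if (runningApps > 0) {
      runningApps--;
    }
    finishIfDrained();
  }

  private void finishIfDrained() {
    if (state == State.DECOMMISSIONING && runningApps == 0) {
      state = State.DECOMMISSIONED;
    }
  }
}
```

Whether the coordinator or the shuffle server owns this state is exactly the design question under discussion; the sketch does not take a side.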
from incubator-uniffle.
I think we should consider more things, such as:
- Is it easy to use if we deploy on k8s and the IP is not fixed?
- Split-brain. If we pass commands through heartbeat, a shuffle server may receive different messages at the same time. How do we ensure correctness? The decommission case can be solved, but what about other functions in the future?
- Compatibility. If we pass commands through heartbeat, we will need to modify this interface frequently.
from incubator-uniffle.
cc @colinmjj. What do you think? I remember that you want to use the coordinator to dispatch configuration to shuffle servers. That is similar to using the coordinator to decommission.
from incubator-uniffle.
I think this feature is about a command line or some API to manage the behavior of the coordinator/shuffle server.
There should be an overall picture describing how to make this happen.
Besides decommission, how about updating some configuration in the shuffle server, clearing shuffle data (which may be useful for streaming jobs), etc.?
All of the above features are management related, so I prefer to have a framework that can cover all of them.
from incubator-uniffle.
@jerqi @colinmjj I want to know if you have any plans for this in the near term. We have some features that need to be built on top of decommission, such as auto scaling. We don't want to deviate too much from the community.
from incubator-uniffle.
We have no related plans at the moment. If you are interested in this topic, we can hold an offline meeting to discuss this issue.
from incubator-uniffle.
+1.
from incubator-uniffle.
I am looking forward to it.
from incubator-uniffle.
@zuston @xianjingfeng There are some other issues that we need to discuss, so I will send an email to our dev mailing list and pick a suitable date for the meeting.
from incubator-uniffle.
@xianjingfeng I have already sent an email: https://lists.apache.org/thread/2jlm3fswmsxy619ldyo4px700p3ybnvc. Do you have time at 11 am (UTC+8) on Thursday this week?
from incubator-uniffle.
Yes, I have time.
from incubator-uniffle.
Meeting link is https://meeting.tencent.com/dm/oR95wASCNe91
from incubator-uniffle.
Got it.
from incubator-uniffle.
Offline discussion result:
The coordinator provides an admin REST API. The coordinator is only used as a proxy: it forwards each request to the shuffle server via RPC.
Currently, we need the following APIs:
- Decommission
- UpdateConfiguration
- Upgrade
cc @zuston, please give us your advice.
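The proxy arrangement described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the interface `ShuffleServerAdmin`, the class `CoordinatorAdminProxy`, the action strings, and the status strings are all hypothetical, and the `handle` method stands in for a real REST handler. The Upgrade API is omitted because the thread does not describe its behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical RPC stub for one shuffle server; the real interface is not defined in this thread.
interface ShuffleServerAdmin {
  String decommission();

  String updateConfiguration(Map<String, String> conf);
}

// Hypothetical coordinator-side proxy: it holds no admin logic itself and only forwards requests.
class CoordinatorAdminProxy {
  private final Map<String, ShuffleServerAdmin> servers = new HashMap<>();

  // Associates a server id with the RPC client used to reach that server.
  void register(String serverId, ShuffleServerAdmin client) {
    servers.put(serverId, client);
  }

  // Stand-in for a REST handler: looks up the target server and forwards the call.
  String handle(String action, String serverId, Map<String, String> body) {
    ShuffleServerAdmin target = servers.get(serverId);
    if (target == null) {
      return "404 unknown server " + serverId;
    }
    switch (action) {
      case "decommission":
        return target.decommission();
      case "updateConfiguration":
        return target.updateConfiguration(body);
      default:
        return "400 unsupported action " + action;
    }
  }
}
```

Because the coordinator only routes by server id, adding a new admin operation in this sketch means extending the shuffle-server interface and one `case`, rather than changing the heartbeat protocol.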
from incubator-uniffle.
Design doc: https://docs.google.com/document/d/1p1PksBN2LJ-OtGEHvdyEuH9b1Mv1aD_exMPl4TNaTs0/edit?usp=sharing
PTAL @jerqi @zuston
from incubator-uniffle.
Thanks a lot for proposing this. I will take a look ASAP.
from incubator-uniffle.
Commented. @xianjingfeng PTAL
from incubator-uniffle.
As we discussed in the design doc, I will make the following adjustments:
- Add a concrete REST API list to the design doc.
- Remove the token from this design.
Any other suggestions? @jerqi @zuston @advancedxy
from incubator-uniffle.
I'm OK.
from incubator-uniffle.
+1. Thanks for your effort
from incubator-uniffle.
Thanks @xianjingfeng for working on this feature.
I'm closing this issue now. Please feel free to reopen it if there is more work.
from incubator-uniffle.