Killing the process is not graceful, so the shuffle server needs to support decommissioning.

[Feature Request] Support shuffle server decommissioned about incubator-uniffle HOT 27 CLOSED

apache commented on June 1, 2024

[Feature Request] Support shuffle server decommissioned

from incubator-uniffle.

Comments (27)

zuston commented on June 1, 2024

+1. For now, decommissioning can be done with the exclude-node file on the coordinator side.

Besides, the exclude-node-file could be stored in HDFS.
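
To illustrate the exclude-node-file idea, here is a minimal Python sketch (not Uniffle's actual Java code). It assumes the file lists one server ID per line. That file format, the function names, and the server IDs are all hypothetical.

```python
from pathlib import Path
import tempfile


def load_excluded_nodes(path):
    """Read one server ID per line, skipping blank lines and '#' comments (assumed format)."""
    excluded = set()
    for line in Path(path).read_text().splitlines():
        entry = line.strip()
        if entry and not entry.startswith("#"):
            excluded.add(entry)
    return excluded


def assignable_servers(registered, excluded):
    """Keep only servers that may receive new shuffle assignments, preserving order."""
    return [server for server in registered if server not in excluded]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        exclude_file = Path(tmp) / "exclude_nodes"
        exclude_file.write_text("# draining for upgrade\nserver-2:19999\n")
        servers = ["server-1:19999", "server-2:19999", "server-3:19999"]
        print(assignable_servers(servers, load_excluded_nodes(exclude_file)))
```

An excluded server stops receiving new assignments but keeps serving the data it already holds, which is what makes this usable as a manual form of decommission.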

jerqi commented on June 1, 2024

I understand that you need a rolling upgrade feature. Our plan is to implement this feature through the k8s operator. For standalone mode, we have no plans yet. This feature also needs some investigation first, so I think we should discuss this problem further.

xianjingfeng commented on June 1, 2024

> I understand that you need a rolling upgrade feature. Our plan is to implement this feature through the k8s operator. For standalone mode, we have no plans yet. This feature also needs some investigation first, so I think we should discuss this problem further.

Deploying on k8s is a good choice, but having one more option is not a bad thing. Not all teams are willing to use k8s. I have created a PR.

jerqi commented on June 1, 2024

Could you write a design doc (using Google Docs)? This issue is a little complex.

jerqi commented on June 1, 2024

If we want to add an interface to control the shuffle server's behavior, we should have a complete design, and I think we need detailed discussions. We had a similar idea in issue #37.

jerqi commented on June 1, 2024

See YARN's node decommission: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/GracefulDecommission.html Maybe we should also look at how other systems implement decommission.

zuston commented on June 1, 2024

> See YARN's node decommission: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/GracefulDecommission.html

Yes. In #85, I followed the YARN decommission mechanism, so I think it's better to control decommission from the coordinator. Feel free to discuss more.

> Maybe we should also look at how other systems implement decommission.

I looked at HDFS datanode decommission; it is also similar to YARN decommission. See: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDataNodeAdminGuide.html

xianjingfeng commented on June 1, 2024

I think we should consider more things, such as:

1. Is it easy to use if we deploy on k8s and the IP is not fixed?
2. Split-brain. If commands are passed through the heartbeat, a shuffle server may receive different messages at the same time. How do we ensure correctness? This can be solved for the decommission function, but what about other functions in the future?
3. Compatibility. If commands are passed through the heartbeat, we will need to modify this interface frequently.

jerqi commented on June 1, 2024

cc @colinmjj. What do you think? I remember that you wanted to use the coordinator to dispatch configuration to shuffle servers. That is similar to using the coordinator to decommission.

colinmjj commented on June 1, 2024

I think this feature is about a command line or some API to manage the behavior of the coordinator/shuffle server.
There should be an overall picture describing how to make this happen.
Besides decommission, how about updating some configuration in the shuffle server, clearing shuffle data (which may be useful for streaming jobs), etc.?
All of the above features are management related, so I prefer to have a framework that covers all of them.

xianjingfeng commented on June 1, 2024

@jerqi @colinmjj I want to know if you have any plans in the near term. We have some functions that need to be built on decommission, such as auto scaling. We don't want to deviate too much from the community.

jerqi commented on June 1, 2024

> @jerqi @colinmjj I want to know if you have any plans in the near term. We have some functions that need to be built on decommission, such as auto scaling. We don't want to deviate too much from the community.

We have no related plans at the moment. If you are interested in this topic, we can set up an offline meeting to discuss this issue.

zuston commented on June 1, 2024

> > @jerqi @colinmjj I want to know if you have any plans in the near term. We have some functions that need to be built on decommission, such as auto scaling. We don't want to deviate too much from the community.

> We have no related plans at the moment. If you are interested in this topic, we can set up an offline meeting to discuss this issue.

+1.

xianjingfeng commented on June 1, 2024

> we can set up an offline meeting to discuss this issue.

I am looking forward to it.

jerqi commented on June 1, 2024

@zuston @xianjingfeng There are some other issues that we need to discuss, so I will send an email to our dev mailing list and pick a suitable date for the meeting.

jerqi commented on June 1, 2024

@xianjingfeng I have already sent an email: https://lists.apache.org/thread/2jlm3fswmsxy619ldyo4px700p3ybnvc. Do you have time at 11 am (UTC+8) on Thursday this week?

xianjingfeng commented on June 1, 2024

> @xianjingfeng I have already sent an email: https://lists.apache.org/thread/2jlm3fswmsxy619ldyo4px700p3ybnvc. Do you have time at 11 am (UTC+8) on Thursday this week?

Yes, I have time.

jerqi commented on June 1, 2024

> > @xianjingfeng I have already sent an email: https://lists.apache.org/thread/2jlm3fswmsxy619ldyo4px700p3ybnvc. Do you have time at 11 am (UTC+8) on Thursday this week?

> Yes, I have time.

Meeting link is https://meeting.tencent.com/dm/oR95wASCNe91

xianjingfeng commented on June 1, 2024

> > > @xianjingfeng I have already sent an email: https://lists.apache.org/thread/2jlm3fswmsxy619ldyo4px700p3ybnvc. Do you have time at 11 am (UTC+8) on Thursday this week?

> > Yes, I have time.

> Meeting link is https://meeting.tencent.com/dm/oR95wASCNe91

Got it.

jerqi commented on June 1, 2024

Offline discussion result:
The coordinator provides an admin REST API. The coordinator acts only as a proxy: it redirects each request to the shuffle server over RPC.
Currently, we need the following APIs:

Decommission
UpdateConfiguration
Upgrade

cc @zuston, please share your advice with us.
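
The proxy arrangement described in this result can be sketched roughly as follows. This is a minimal Python illustration, not the agreed design or Uniffle's Java code: the class names, status values, action strings, and response shape are all hypothetical, plain method calls stand in for RPC, and Upgrade is left out.

```python
from enum import Enum


class ServerStatus(Enum):
    ACTIVE = "ACTIVE"
    DECOMMISSIONING = "DECOMMISSIONING"
    DECOMMISSIONED = "DECOMMISSIONED"


class ShuffleServer:
    """Stand-in for a shuffle server; method calls model RPCs (hypothetical)."""

    def __init__(self, server_id, running_apps=()):
        self.server_id = server_id
        self.running_apps = set(running_apps)
        self.status = ServerStatus.ACTIVE
        self.conf = {}

    def decommission(self):
        # Stop taking new work, but let running applications finish.
        self.status = ServerStatus.DECOMMISSIONING
        self._finish_if_drained()
        return self.status

    def app_finished(self, app_id):
        self.running_apps.discard(app_id)
        self._finish_if_drained()

    def update_configuration(self, updates):
        self.conf.update(updates)
        return dict(self.conf)

    def _finish_if_drained(self):
        # Only a draining server with no remaining applications may finish.
        if self.status is ServerStatus.DECOMMISSIONING and not self.running_apps:
            self.status = ServerStatus.DECOMMISSIONED


class Coordinator:
    """Holds no admin logic of its own; it only forwards requests."""

    def __init__(self, servers):
        self.servers = {s.server_id: s for s in servers}

    def handle_admin_request(self, server_id, action, payload=None):
        server = self.servers.get(server_id)
        if server is None:
            return {"code": 404, "msg": f"unknown server {server_id}"}
        if action == "decommission":
            return {"code": 200, "status": server.decommission().value}
        if action == "updateConfiguration":
            return {"code": 200, "conf": server.update_configuration(payload or {})}
        return {"code": 400, "msg": f"unsupported action {action}"}
```

Keeping the state transition inside the shuffle server means the coordinator stays stateless with respect to admin commands, which is the point of using it only as a proxy.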

xianjingfeng commented on June 1, 2024

Design doc: https://docs.google.com/document/d/1p1PksBN2LJ-OtGEHvdyEuH9b1Mv1aD_exMPl4TNaTs0/edit?usp=sharing
PTAL @jerqi @zuston

zuston commented on June 1, 2024

Thanks a lot for proposing this. I will take a look ASAP.

zuston commented on June 1, 2024

> Design doc: https://docs.google.com/document/d/1p1PksBN2LJ-OtGEHvdyEuH9b1Mv1aD_exMPl4TNaTs0/edit?usp=sharing
> PTAL @jerqi @zuston

Commented. @xianjingfeng PTAL

xianjingfeng commented on June 1, 2024

As we discussed in the design doc, I will make the following adjustments:

Add a concrete REST API list to the design doc.
Remove the token from this design.

Any other suggestions? @jerqi @zuston @advancedxy

jerqi commented on June 1, 2024

I'm OK.

zuston commented on June 1, 2024

> As we discussed in the design doc, I will make the following adjustments:
>
> Add a concrete REST API list to the design doc.
>
> Remove the token from this design.
>
> Any other suggestions? @jerqi @zuston @advancedxy

+1. Thanks for your effort

kaijchen commented on June 1, 2024

Thanks @xianjingfeng for working on this feature.
I'm closing this issue now. Please feel free to reopen it if there is more work.
