Comments (8)
There is a max-bytes limit in the shuffle server to protect the server; see https://github.com/uber/RemoteShuffleService/blob/master/src/main/java/com/uber/rss/execution/ShuffleExecutor.java#L81
You could raise that value if your shuffle data exceeds the limit.
from remoteshuffleservice.
Thanks, I'll try it.
Hi @Lobo2008, let us know whether the max app shuffle data size per server, which Bo mentioned, is the issue. If it is, you should see a RssTooMuchDataException in the stack trace.
If that's not the issue, please check:
- Are you using the latest master?
- What are the task time and the shuffle data written for the failing task?
Hi @mayurdb
- It's the latest version. I cloned and compiled the master branch in April 2022.
- No RssTooMuchDataException ever happened, just RssNetworkException.
- I have re-run the app without changing the size as Bo mentioned (I'll try that later), and so far it runs well. I'll post the details once the application finishes or fails.
- I wonder whether DEFAULT_APP_MAX_WRITE_BYTES=3TB is a limit on the shuffle size of a single stage or on the cumulative size of all the shuffle write(?) stages of one application. Stage-6 has 3TB but still works fine.
I think that DEFAULT_APP_MAX_WRITE_BYTES is actually per server, so if you write 3TB of data but distribute it evenly across multiple servers, you would not run into the issue.
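To make the per-server reading concrete, here is a rough sketch of the arithmetic in Java. It is not code from RSS: the class and method names are hypothetical, the 3 TB figure is taken from this thread and assumed to be in binary units, and the strict "greater than" comparison is an assumption about how the server applies the limit.

```java
public class PerServerWriteLimit {
    // Assumption: 3 TB as quoted in this thread, expressed in binary units.
    static final long ASSUMED_APP_MAX_WRITE_BYTES = 3L * 1024 * 1024 * 1024 * 1024;

    // Bytes each server receives if the total is spread evenly (rounded up).
    static long bytesPerServer(long totalShuffleBytes, int serverCount) {
        if (serverCount <= 0) {
            throw new IllegalArgumentException("serverCount must be positive");
        }
        return (totalShuffleBytes + serverCount - 1) / serverCount;
    }

    // True if a single server's share would go past the limit.
    // Assumption: the limit is exceeded only when the share is strictly larger.
    static boolean exceedsPerServerLimit(long totalShuffleBytes, int serverCount, long limitBytes) {
        return bytesPerServer(totalShuffleBytes, serverCount) > limitBytes;
    }

    public static void main(String[] args) {
        // Hypothetical application writing 4 TB of shuffle data in total.
        long total = 4L * 1024 * 1024 * 1024 * 1024;
        for (int servers : new int[] {1, 2, 10}) {
            System.out.printf("servers=%d perServerBytes=%d exceeds=%b%n",
                    servers,
                    bytesPerServer(total, servers),
                    exceedsPerServerLimit(total, servers, ASSUMED_APP_MAX_WRITE_BYTES));
        }
    }
}
```

Under these assumptions, only the single-server case goes past the limit. Real shuffle data is rarely distributed perfectly evenly, so a skewed partition could still push one server over.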
I guess so.
Finished successfully. But I found that the exception "hit exception writing heading bytes" is caused by one or more of the RSS servers running out of disk storage space.
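Since the root cause here was disk space on the RSS servers, a small check like the following could be run on each server host before a large job. This is only a sketch using the standard java.nio.file API. It is not part of RSS: the class name, the 10% threshold, and the directory argument are hypothetical, and you would pass whatever directory your servers actually write shuffle files to.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ShuffleDiskSpaceCheck {
    // Fraction of the volume that is still usable; 0.0 if the total is unknown.
    static double usableFraction(long usableBytes, long totalBytes) {
        if (totalBytes <= 0) {
            return 0.0;
        }
        return (double) usableBytes / totalBytes;
    }

    // True if the usable fraction is below the given minimum (hypothetical threshold).
    static boolean isLowOnSpace(long usableBytes, long totalBytes, double minFreeFraction) {
        return usableFraction(usableBytes, totalBytes) < minFreeFraction;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the first argument is the shuffle data directory;
        // falls back to the JVM temp directory purely for demonstration.
        Path dir = Paths.get(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        FileStore store = Files.getFileStore(dir);
        long usable = store.getUsableSpace();
        long total = store.getTotalSpace();
        System.out.printf("dir=%s usableBytes=%d totalBytes=%d low=%b%n",
                dir, usable, total, isLowOnSpace(usable, total, 0.10));
    }
}
```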
Cool, glad you found the cause, and thanks for the update!