Comments (12)
We recommend that users use the StorageType MEMORY_LOCALFILE_HDFS or MEMORY_LOCALFILE. With these storage types, the application won't commit data.
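As a rough illustration of the commit rule in this comment, the sketch below models it as a function of the storage type. The `StorageType` enum and `needs_commit` function are hypothetical stand-ins, not Uniffle's API. The rule itself (commit only when the type has no MEMORY_ component) is inferred from this thread and assumes that with a MEMORY_* type the reader also reads blocks still held in shuffle server memory.

```python
from enum import Enum


class StorageType(Enum):
    # Storage type names mentioned in this thread; this enum is a
    # hypothetical stand-in, not Uniffle's own StorageType class.
    LOCALFILE = "LOCALFILE"
    MEMORY_LOCALFILE = "MEMORY_LOCALFILE"
    MEMORY_LOCALFILE_HDFS = "MEMORY_LOCALFILE_HDFS"


def needs_commit(storage_type: StorageType) -> bool:
    """Hypothetical rule: commit only when readers cannot use server memory."""
    # Assumption: with a MEMORY_* type the reader can fetch unflushed blocks
    # from the shuffle server's memory, so the client skips the commit step.
    return not storage_type.value.startswith("MEMORY_")


for st in StorageType:
    print(f"{st.value}: commit={needs_commit(st)}")
```

The flip side of skipping the commit is that nothing forces a flush, which is the situation reported further down in this thread.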
from incubator-uniffle.
So, should we remove LOCALFILE?
It's useful for test code. Also, if we want to reduce shuffle server state, committing data is a good way to do it, even though the LOCALFILE storage type is not the recommended choice. So I think we shouldn't remove it for now.
I found that if we use MEMORY_LOCALFILE, finishShuffle is not called, so the buffer on the server side may not be flushed in time, and the reader then fails because it cannot read the index file, as follows:
Error happened when get shuffle index for appId[application_xxx], shuffleId[3], partitionId[1], Can't find folder /HDATA/2/rssdata/application_xxx/3/1-1
@xianjingfeng With the current implementation, writing shuffle data to N shuffle servers handles the case where a shuffle server fails, but it costs N times the storage. Users can make that choice.
We had set spark.rss.data.replica.write=2 and spark.rss.data.replica=3. But today we found that none of the shuffle servers of a partition had flushed in time, and we have seen this in two applications. It may be easy to encounter when our cluster is not under high load.
What storage type are you using in your case?
MEMORY_LOCALFILE
@frankliee can you clarify how to configure spark.rss.data.replica.write and spark.rss.data.replica.read?
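These configs come from the quorum protocol.
rss.data.replica is the default number of replicas of a partition.
rss.data.replica.write is the minimum number of replicas to which the writer must successfully write metadata and data.
rss.data.replica.read is the minimum number of replicas from which the reader must successfully read metadata (the data itself only needs to be read from one replica).
So the recommended values are (1,1,1) and (3,2,2). In both cases write + read > replica, so the replicas the reader checks always include at least one that the writer wrote successfully.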
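The quorum rule behind these recommendations can be checked mechanically. This is a minimal sketch, not Uniffle's own validation code; `ReplicaConf` and `is_valid` are hypothetical names, and the fields simply mirror the three configs described above.

```python
from typing import NamedTuple


class ReplicaConf(NamedTuple):
    replica: int  # rss.data.replica: replicas per partition
    write: int    # rss.data.replica.write: minimum successful writes
    read: int     # rss.data.replica.read: minimum successful metadata reads


def is_valid(conf: ReplicaConf) -> bool:
    """Quorum-style sanity check for a replica configuration (hypothetical helper)."""
    within_bounds = (1 <= conf.write <= conf.replica
                     and 1 <= conf.read <= conf.replica)
    # write + read > replica means any read set and any write set must share
    # at least one replica (pigeonhole), so the reader always sees the data.
    overlaps = conf.write + conf.read > conf.replica
    return within_bounds and overlaps


for conf in [ReplicaConf(1, 1, 1), ReplicaConf(3, 2, 2), ReplicaConf(3, 2, 1)]:
    print(conf, "valid" if is_valid(conf) else "unsafe")
```

With (3,2,1), a reader that consults only one replica may pick the one replica the writer skipped, which is why it is reported as unsafe here.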
These are client-side configs and do not change server-side state.
Flushing is controlled by server configs, such as memory capacity and watermarks.
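To see why a lightly loaded cluster can keep shuffle data in memory for a long time, here is a generic high/low watermark model. It is an assumption-labeled sketch, not the shuffle server's actual flush logic: `FlushPolicy` and its fields are hypothetical, the numbers are illustrative, and the real server may have other flush triggers.

```python
from dataclasses import dataclass


@dataclass
class FlushPolicy:
    capacity: int              # total shuffle buffer memory in bytes (illustrative)
    high_watermark_pct: float  # start flushing once usage exceeds this
    low_watermark_pct: float   # flush until usage falls back to this

    def bytes_to_flush(self, used: int) -> int:
        """Bytes to flush to reach the low watermark; 0 while at or below the high watermark."""
        high = self.capacity * self.high_watermark_pct / 100
        low = self.capacity * self.low_watermark_pct / 100
        if used <= high:
            # Lightly loaded server: nothing forces a flush, data stays in memory.
            return 0
        return int(used - low)


policy = FlushPolicy(capacity=1000, high_watermark_pct=75.0, low_watermark_pct=25.0)
print(policy.bytes_to_flush(500))  # below the high watermark, so no flush
print(policy.bytes_to_flush(800))  # above it, so flush down to the low watermark
```

Under this model, memory usage below the high watermark never triggers a flush on its own, which matches the observation that unflushed partitions showed up when the cluster was not under high load.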
So, is there still a problem?
Fixed by #213.