Comments (5)
Can you share the client's config?
from firestorm.
spark.properties
#Java properties built from Kubernetes config map with name: spark-drv-e832d57dfa6994bc-conf-map
#Mon Dec 27 13:42:43 CST 2021
spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager
spark.rss.coordinator.quorum=xxx\:19999,xxx\:19999,xxx\:19999
spark.rss.storage.type=LOCALFILE
spark.driver.port=7078
spark.kubernetes.resource.type=java
spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=bigdata-pv
spark.executor.cores=6
spark.history.fs.cleaner.enabled=true
spark.kubernetes.executor.request.cores=6
spark.submit.pyFiles=
spark.executor.memory=30g
spark.kubernetes.driverEnv.APP_TYPES=spark
spark.driver.memoryOverhead=4g
spark.kubernetes.container.image=127.0.0.1\:65001/xxx/service-spark\:staging
spark.master=k8s\://https\://kubernetes.default
spark.driver.memory=4g
spark.kubernetes.driver.request.cores=0.05
spark.kubernetes.driver.pod.name=bigdata-warehouseeditorfirestorm
spark.driver.host=bigdata-warehouseeditorfirestorm-d275ac7dfa6992c2-driver-svc.scrm.svc
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=bigdata-pv
spark.eventLog.compress=true
spark.submit.deployMode=cluster
spark.executor.extraJavaOptions=-DREQ_ID\=98cf7a8b-730a-4902-ab12-f213f4268156
spark.kubernetes.authenticate.driver.serviceAccountName=bigdata-api
spark.history.fs.logDirectory=file\:///data/spark-history
spark.kubernetes.submitInDriver=true
spark.kubernetes.pyspark.pythonVersion=3
spark.kubernetes.memoryOverheadFactor=0.2
spark.app.name=bigdata-warehouseeditorfirestorm
spark.eventLog.enabled=true
spark.driver.cores=1
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/data
spark.kubernetes.authenticate.submission.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
spark.driver.blockManager.port=7079
spark.kubernetes.driverEnv.SPRING_PROFILES_ACTIVE=spark,staging
spark.executor.memoryOverhead=5g
spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/data
spark.app.id=spark-cfbe5cf4715042cd82ebd6cab82d069c
spark.eventLog.dir=file\:///data/spark-history
spark.kubernetes.authenticate.submission.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token
spark.kubernetes.namespace=scrm
spark.executor.instances=5
spark.jars=local\:///opt/spark/jars/app.jar
from firestorm.
I can't tell the root cause yet. The shuffle server's memory consists of the write buffer, the read buffer, and metadata, and your configuration shouldn't cause an OOM.
Please check the shuffle server's log. Since the shuffle server has only one storage device, also update the following configuration on it:
rss.server.flush.thread.alive 5 -> 2
rss.server.flush.threadPool.size 20 -> 4
from firestorm.
rss.server.flush.thread.alive 5 -> 2
rss.server.flush.threadPool.size 20 -> 4
The OOM still happened after these changes.
I called the metrics API (/metrics/jvm, /metrics/server) to check the buffer usage and JVM metrics. The buffer-related metrics are all 0 or very small, but jvm_memory_bytes_used is 16512134504, about 16G. Apart from the read/write/in-flush/preallocated buffers, is it metadata that occupies the 16G?
shuffle server metrics:
{
"metrics": [
{
"name": "event_size_threshold_level4",
"labelNames": [],
"labelValues": [],
"value": 199,
"timestampMs": null
},
{
"name": "registered_shuffle",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "total_write_num",
"labelNames": [],
"labelValues": [],
"value": 199,
"timestampMs": null
},
{
"name": "event_size_threshold_level3",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "event_size_threshold_level2",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "total_read_memory_data",
"labelNames": [],
"labelValues": [],
"value": 976536477845,
"timestampMs": null
},
{
"name": "in_flush_buffer_size",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "total_write_data",
"labelNames": [],
"labelValues": [],
"value": 10785298196,
"timestampMs": null
},
{
"name": "used_buffer_size",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "app_num_with_node",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "total_read_time",
"labelNames": [],
"labelValues": [],
"value": 162889,
"timestampMs": null
},
{
"name": "registered_shuffle_engine",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "total_received_data",
"labelNames": [],
"labelValues": [],
"value": 26123582195,
"timestampMs": null
},
{
"name": "total_upload_time_s",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "total_write_exception",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "total_read_data",
"labelNames": [],
"labelValues": [],
"value": 1184709143754,
"timestampMs": null
},
{
"name": "allocated_buffer_size",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "total_read_local_data_file",
"labelNames": [],
"labelValues": [],
"value": 208101043509,
"timestampMs": null
},
{
"name": "total_upload_size",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "partition_num_with_node",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "buffered_data_size",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "total_write_slow",
"labelNames": [],
"labelValues": [],
"value": 13,
"timestampMs": null
},
{
"name": "total_dropped_event_num",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "total_read_local_index_file",
"labelNames": [],
"labelValues": [],
"value": 71622400,
"timestampMs": null
},
{
"name": "total_write_time",
"labelNames": [],
"labelValues": [],
"value": 468057,
"timestampMs": null
},
{
"name": "event_size_threshold_level1",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "total_write_block",
"labelNames": [],
"labelValues": [],
"value": 58669,
"timestampMs": null
},
{
"name": "event_queue_size",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "total_write_handler",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
}
],
"timeStamp": 1640933730956
}
JVM metrics:
{
"metrics": [
{
"name": "jvm_info",
"labelNames": [
"version",
"vendor",
"runtime"
],
"labelValues": [
"1.8.0_292-b10",
"AdoptOpenJDK",
"OpenJDK Runtime Environment"
],
"value": 1,
"timestampMs": null
},
{
"name": "jvm_gc_collection_seconds_count",
"labelNames": [
"gc"
],
"labelValues": [
"G1 Young Generation"
],
"value": 128,
"timestampMs": null
},
{
"name": "jvm_gc_collection_seconds_sum",
"labelNames": [
"gc"
],
"labelValues": [
"G1 Young Generation"
],
"value": 25.19,
"timestampMs": null
},
{
"name": "jvm_gc_collection_seconds_count",
"labelNames": [
"gc"
],
"labelValues": [
"G1 Old Generation"
],
"value": 0,
"timestampMs": null
},
{
"name": "jvm_gc_collection_seconds_sum",
"labelNames": [
"gc"
],
"labelValues": [
"G1 Old Generation"
],
"value": 0,
"timestampMs": null
},
{
"name": "jvm_threads_current",
"labelNames": [],
"labelValues": [],
"value": 1067,
"timestampMs": null
},
{
"name": "jvm_threads_daemon",
"labelNames": [],
"labelValues": [],
"value": 1051,
"timestampMs": null
},
{
"name": "jvm_threads_peak",
"labelNames": [],
"labelValues": [],
"value": 1067,
"timestampMs": null
},
{
"name": "jvm_threads_started_total",
"labelNames": [],
"labelValues": [],
"value": 1069,
"timestampMs": null
},
{
"name": "jvm_threads_deadlocked",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "jvm_threads_deadlocked_monitor",
"labelNames": [],
"labelValues": [],
"value": 0,
"timestampMs": null
},
{
"name": "jvm_threads_state",
"labelNames": [
"state"
],
"labelValues": [
"NEW"
],
"value": 0,
"timestampMs": null
},
{
"name": "jvm_threads_state",
"labelNames": [
"state"
],
"labelValues": [
"TIMED_WAITING"
],
"value": 6,
"timestampMs": null
},
{
"name": "jvm_threads_state",
"labelNames": [
"state"
],
"labelValues": [
"TERMINATED"
],
"value": 0,
"timestampMs": null
},
{
"name": "jvm_threads_state",
"labelNames": [
"state"
],
"labelValues": [
"RUNNABLE"
],
"value": 36,
"timestampMs": null
},
{
"name": "jvm_threads_state",
"labelNames": [
"state"
],
"labelValues": [
"WAITING"
],
"value": 1018,
"timestampMs": null
},
{
"name": "jvm_threads_state",
"labelNames": [
"state"
],
"labelValues": [
"BLOCKED"
],
"value": 7,
"timestampMs": null
},
{
"name": "jvm_classes_loaded",
"labelNames": [],
"labelValues": [],
"value": 4689,
"timestampMs": null
},
{
"name": "jvm_classes_loaded_total",
"labelNames": [],
"labelValues": [],
"value": 4692,
"timestampMs": null
},
{
"name": "jvm_classes_unloaded_total",
"labelNames": [],
"labelValues": [],
"value": 3,
"timestampMs": null
},
{
"name": "jvm_buffer_pool_used_bytes",
"labelNames": [
"pool"
],
"labelValues": [
"direct"
],
"value": 32769,
"timestampMs": null
},
{
"name": "jvm_buffer_pool_used_bytes",
"labelNames": [
"pool"
],
"labelValues": [
"mapped"
],
"value": 0,
"timestampMs": null
},
{
"name": "jvm_buffer_pool_capacity_bytes",
"labelNames": [
"pool"
],
"labelValues": [
"direct"
],
"value": 32768,
"timestampMs": null
},
{
"name": "jvm_buffer_pool_capacity_bytes",
"labelNames": [
"pool"
],
"labelValues": [
"mapped"
],
"value": 0,
"timestampMs": null
},
{
"name": "jvm_buffer_pool_used_buffers",
"labelNames": [
"pool"
],
"labelValues": [
"direct"
],
"value": 5,
"timestampMs": null
},
{
"name": "jvm_buffer_pool_used_buffers",
"labelNames": [
"pool"
],
"labelValues": [
"mapped"
],
"value": 0,
"timestampMs": null
},
{
"name": "jvm_memory_bytes_used",
"labelNames": [
"area"
],
"labelValues": [
"heap"
],
"value": 16512134504,
"timestampMs": null
},
{
"name": "jvm_memory_bytes_used",
"labelNames": [
"area"
],
"labelValues": [
"nonheap"
],
"value": 51022000,
"timestampMs": null
},
{
"name": "jvm_memory_bytes_committed",
"labelNames": [
"area"
],
"labelValues": [
"heap"
],
"value": 53687091200,
"timestampMs": null
},
{
"name": "jvm_memory_bytes_committed",
"labelNames": [
"area"
],
"labelValues": [
"nonheap"
],
"value": 52494336,
"timestampMs": null
},
{
"name": "jvm_memory_bytes_max",
"labelNames": [
"area"
],
"labelValues": [
"heap"
],
"value": 53687091200,
"timestampMs": null
},
{
"name": "jvm_memory_bytes_max",
"labelNames": [
"area"
],
"labelValues": [
"nonheap"
],
"value": -1,
"timestampMs": null
},
{
"name": "jvm_memory_bytes_init",
"labelNames": [
"area"
],
"labelValues": [
"heap"
],
"value": 53687091200,
"timestampMs": null
},
{
"name": "jvm_memory_bytes_init",
"labelNames": [
"area"
],
"labelValues": [
"nonheap"
],
"value": 2555904,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_used",
"labelNames": [
"pool"
],
"labelValues": [
"Code Cache"
],
"value": 19920576,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_used",
"labelNames": [
"pool"
],
"labelValues": [
"Metaspace"
],
"value": 31101424,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_used",
"labelNames": [
"pool"
],
"labelValues": [
"G1 Eden Space"
],
"value": 805306368,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_used",
"labelNames": [
"pool"
],
"labelValues": [
"G1 Survivor Space"
],
"value": 33554432,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_used",
"labelNames": [
"pool"
],
"labelValues": [
"G1 Old Gen"
],
"value": 15673273704,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_committed",
"labelNames": [
"pool"
],
"labelValues": [
"Code Cache"
],
"value": 20512768,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_committed",
"labelNames": [
"pool"
],
"labelValues": [
"Metaspace"
],
"value": 31981568,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_committed",
"labelNames": [
"pool"
],
"labelValues": [
"G1 Eden Space"
],
"value": 5603590144,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_committed",
"labelNames": [
"pool"
],
"labelValues": [
"G1 Survivor Space"
],
"value": 33554432,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_committed",
"labelNames": [
"pool"
],
"labelValues": [
"G1 Old Gen"
],
"value": 48049946624,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_max",
"labelNames": [
"pool"
],
"labelValues": [
"Code Cache"
],
"value": 251658240,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_max",
"labelNames": [
"pool"
],
"labelValues": [
"Metaspace"
],
"value": -1,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_max",
"labelNames": [
"pool"
],
"labelValues": [
"G1 Eden Space"
],
"value": -1,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_max",
"labelNames": [
"pool"
],
"labelValues": [
"G1 Survivor Space"
],
"value": -1,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_max",
"labelNames": [
"pool"
],
"labelValues": [
"G1 Old Gen"
],
"value": 53687091200,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_init",
"labelNames": [
"pool"
],
"labelValues": [
"Code Cache"
],
"value": 2555904,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_init",
"labelNames": [
"pool"
],
"labelValues": [
"Metaspace"
],
"value": 0,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_init",
"labelNames": [
"pool"
],
"labelValues": [
"G1 Eden Space"
],
"value": 5637144576,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_init",
"labelNames": [
"pool"
],
"labelValues": [
"G1 Survivor Space"
],
"value": 0,
"timestampMs": null
},
{
"name": "jvm_memory_pool_bytes_init",
"labelNames": [
"pool"
],
"labelValues": [
"G1 Old Gen"
],
"value": 48049946624,
"timestampMs": null
},
{
"name": "jvm_memory_pool_allocated_bytes_total",
"labelNames": [
"pool"
],
"labelValues": [
"G1 Old Gen"
],
"value": 1055979279032,
"timestampMs": null
},
{
"name": "jvm_memory_pool_allocated_bytes_total",
"labelNames": [
"pool"
],
"labelValues": [
"Code Cache"
],
"value": 23076416,
"timestampMs": null
},
{
"name": "jvm_memory_pool_allocated_bytes_total",
"labelNames": [
"pool"
],
"labelValues": [
"G1 Eden Space"
],
"value": 297090940928,
"timestampMs": null
},
{
"name": "jvm_memory_pool_allocated_bytes_total",
"labelNames": [
"pool"
],
"labelValues": [
"G1 Survivor Space"
],
"value": 9294577664,
"timestampMs": null
},
{
"name": "jvm_memory_pool_allocated_bytes_total",
"labelNames": [
"pool"
],
"labelValues": [
"Metaspace"
],
"value": 27300720,
"timestampMs": null
},
{
"name": "process_cpu_seconds_total",
"labelNames": [],
"labelValues": [],
"value": 2042.23,
"timestampMs": null
},
{
"name": "process_start_time_seconds",
"labelNames": [],
"labelValues": [],
"value": 1640913776.419,
"timestampMs": null
},
{
"name": "process_open_fds",
"labelNames": [],
"labelValues": [],
"value": 379,
"timestampMs": null
},
{
"name": "process_max_fds",
"labelNames": [],
"labelValues": [],
"value": 999999,
"timestampMs": null
},
{
"name": "process_virtual_memory_bytes",
"labelNames": [],
"labelValues": [],
"value": 66492104704,
"timestampMs": null
},
{
"name": "process_resident_memory_bytes",
"labelNames": [],
"labelValues": [],
"value": 58613448704,
"timestampMs": null
}
],
"timeStamp": 1640933954231
}
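For readers comparing the two dumps, here is a minimal Python sketch that flattens a payload of this shape into a name-to-value map, so the buffer metrics and heap usage can be read side by side. The helper name `flatten_metrics` and the abbreviated `sample` string are illustrative assumptions; the sample combines one entry from each dump above for brevity and is not a real response.

```python
import json


def flatten_metrics(payload: str) -> dict:
    """Map 'name{label=value,...}' (or bare 'name') to its value.

    Hypothetical helper: assumes the payload shape shown in the dumps
    above ("metrics" list with name, labelNames, labelValues, value).
    """
    flat = {}
    for m in json.loads(payload)["metrics"]:
        labels = ",".join(
            f"{k}={v}" for k, v in zip(m["labelNames"], m["labelValues"])
        )
        key = f'{m["name"]}{{{labels}}}' if labels else m["name"]
        flat[key] = m["value"]
    return flat


# Abbreviated sample: two entries copied from the dumps above.
sample = """{"metrics": [
 {"name": "used_buffer_size", "labelNames": [], "labelValues": [],
  "value": 0, "timestampMs": null},
 {"name": "jvm_memory_bytes_used", "labelNames": ["area"],
  "labelValues": ["heap"], "value": 16512134504, "timestampMs": null}
], "timeStamp": 1640933954231}"""

flat = flatten_metrics(sample)
print(flat["used_buffer_size"])
print(flat["jvm_memory_bytes_used{area=heap}"])
```

Label values are folded into the key because the same metric name (for example jvm_memory_bytes_used) appears once per label value in the JVM dump.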
from firestorm.
The process seems to have been killed by the kernel. You should use a virtual machine that has more memory.
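The posted JVM metrics appear consistent with this reading. The short Python sketch below converts three values copied from the dump to GiB. The variable names are illustrative, and the host or container memory limit is not given in the thread, so whether the resident size actually hit that limit is an assumption.

```python
GIB = 1024 ** 3

# Values copied from the JVM metrics dump above.
heap_used = 16_512_134_504       # jvm_memory_bytes_used{area=heap}
heap_committed = 53_687_091_200  # jvm_memory_bytes_committed{area=heap}
resident = 58_613_448_704        # process_resident_memory_bytes


def gib(n: int) -> float:
    """Convert a byte count to GiB."""
    return n / GIB


print(f"heap used:       {gib(heap_used):.1f} GiB")
print(f"heap committed:  {gib(heap_committed):.1f} GiB")
print(f"process RSS:     {gib(resident):.1f} GiB")
print(f"RSS beyond heap: {gib(resident - heap_committed):.1f} GiB")
```

The heap holds only about 15.4 GiB of live data, but the JVM has committed the full 50 GiB heap and the process occupies about 54.6 GiB of resident memory. If the machine's available memory is close to that figure, the kernel can kill the process even though the Java heap itself is far from full.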
from firestorm.