
Comments (5)

colinmjj commented on April 28, 2024

Can you share the client's config?

from firestorm.

packageman commented on April 28, 2024

spark.properties

#Java properties built from Kubernetes config map with name: spark-drv-e832d57dfa6994bc-conf-map
#Mon Dec 27 13:42:43 CST 2021
spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager
spark.rss.coordinator.quorum=xxx\:19999,xxx\:19999,xxx\:19999
spark.rss.storage.type=LOCALFILE

spark.driver.port=7078
spark.kubernetes.resource.type=java
spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=bigdata-pv
spark.executor.cores=6
spark.history.fs.cleaner.enabled=true
spark.kubernetes.executor.request.cores=6
spark.submit.pyFiles=
spark.executor.memory=30g
spark.kubernetes.driverEnv.APP_TYPES=spark
spark.driver.memoryOverhead=4g
spark.kubernetes.container.image=127.0.0.1\:65001/xxx/service-spark\:staging
spark.master=k8s\://https\://kubernetes.default
spark.driver.memory=4g
spark.kubernetes.driver.request.cores=0.05
spark.kubernetes.driver.pod.name=bigdata-warehouseeditorfirestorm
spark.driver.host=bigdata-warehouseeditorfirestorm-d275ac7dfa6992c2-driver-svc.scrm.svc
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=bigdata-pv
spark.eventLog.compress=true
spark.submit.deployMode=cluster
spark.executor.extraJavaOptions=-DREQ_ID\=98cf7a8b-730a-4902-ab12-f213f4268156
spark.kubernetes.authenticate.driver.serviceAccountName=bigdata-api
spark.history.fs.logDirectory=file\:///data/spark-history
spark.kubernetes.submitInDriver=true
spark.kubernetes.pyspark.pythonVersion=3
spark.kubernetes.memoryOverheadFactor=0.2
spark.app.name=bigdata-warehouseeditorfirestorm
spark.eventLog.enabled=true
spark.driver.cores=1
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/data
spark.kubernetes.authenticate.submission.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
spark.driver.blockManager.port=7079
spark.kubernetes.driverEnv.SPRING_PROFILES_ACTIVE=spark,staging
spark.executor.memoryOverhead=5g
spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/data
spark.app.id=spark-cfbe5cf4715042cd82ebd6cab82d069c
spark.eventLog.dir=file\:///data/spark-history
spark.kubernetes.authenticate.submission.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token
spark.kubernetes.namespace=scrm
spark.executor.instances=5
spark.jars=local\:///opt/spark/jars/app.jar

colinmjj commented on April 28, 2024

I can't tell the root cause yet. The shuffle server's memory consists of the write buffer, the read buffer, and metadata, so there shouldn't be an OOM with your configuration.
Please check the shuffle server's log, and update the following settings for a shuffle server with a single storage device:
rss.server.flush.thread.alive 5 -> 2
rss.server.flush.threadPool.size 20 -> 4

packageman commented on April 28, 2024

rss.server.flush.thread.alive 5 -> 2
rss.server.flush.threadPool.size 20 -> 4

The OOM still happened with these settings.

I called the metrics API (/metrics/jvm, /metrics/server) to check buffer usage and JVM metrics. The buffer-related metrics are all 0 or very small, but jvm_memory_bytes_used for the heap is 16512134504 bytes, about 16 GB. Apart from the read/write/in-flush/preallocated buffers, is it metadata that occupies those 16 GB?

shuffle server metrics:

{
    "metrics": [
        {
            "name": "event_size_threshold_level4",
            "labelNames": [],
            "labelValues": [],
            "value": 199,
            "timestampMs": null
        },
        {
            "name": "registered_shuffle",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_write_num",
            "labelNames": [],
            "labelValues": [],
            "value": 199,
            "timestampMs": null
        },
        {
            "name": "event_size_threshold_level3",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "event_size_threshold_level2",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_read_memory_data",
            "labelNames": [],
            "labelValues": [],
            "value": 976536477845,
            "timestampMs": null
        },
        {
            "name": "in_flush_buffer_size",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_write_data",
            "labelNames": [],
            "labelValues": [],
            "value": 10785298196,
            "timestampMs": null
        },
        {
            "name": "used_buffer_size",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "app_num_with_node",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_read_time",
            "labelNames": [],
            "labelValues": [],
            "value": 162889,
            "timestampMs": null
        },
        {
            "name": "registered_shuffle_engine",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_received_data",
            "labelNames": [],
            "labelValues": [],
            "value": 26123582195,
            "timestampMs": null
        },
        {
            "name": "total_upload_time_s",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_write_exception",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_read_data",
            "labelNames": [],
            "labelValues": [],
            "value": 1184709143754,
            "timestampMs": null
        },
        {
            "name": "allocated_buffer_size",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_read_local_data_file",
            "labelNames": [],
            "labelValues": [],
            "value": 208101043509,
            "timestampMs": null
        },
        {
            "name": "total_upload_size",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "partition_num_with_node",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "buffered_data_size",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_write_slow",
            "labelNames": [],
            "labelValues": [],
            "value": 13,
            "timestampMs": null
        },
        {
            "name": "total_dropped_event_num",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_read_local_index_file",
            "labelNames": [],
            "labelValues": [],
            "value": 71622400,
            "timestampMs": null
        },
        {
            "name": "total_write_time",
            "labelNames": [],
            "labelValues": [],
            "value": 468057,
            "timestampMs": null
        },
        {
            "name": "event_size_threshold_level1",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_write_block",
            "labelNames": [],
            "labelValues": [],
            "value": 58669,
            "timestampMs": null
        },
        {
            "name": "event_queue_size",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_write_handler",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        }
    ],
    "timeStamp": 1640933730956
}
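
A minimal sketch of how the buffer-related values could be pulled out of a `/metrics/server` response like the one above. The `metrics_by_name` helper is a hypothetical name used only for illustration, and the embedded JSON is a trimmed excerpt of the dump (names and values copied verbatim, other fields dropped).

```python
import json

# Trimmed excerpt of the /metrics/server response above; values copied verbatim.
response_text = """
{"metrics": [
  {"name": "in_flush_buffer_size", "labelNames": [], "value": 0},
  {"name": "used_buffer_size", "labelNames": [], "value": 0},
  {"name": "allocated_buffer_size", "labelNames": [], "value": 0},
  {"name": "buffered_data_size", "labelNames": [], "value": 0},
  {"name": "total_read_data", "labelNames": [], "value": 1184709143754},
  {"name": "total_read_memory_data", "labelNames": [], "value": 976536477845},
  {"name": "total_read_local_data_file", "labelNames": [], "value": 208101043509},
  {"name": "total_read_local_index_file", "labelNames": [], "value": 71622400}
], "timeStamp": 1640933730956}
"""


def metrics_by_name(payload):
    """Map metric name -> value for unlabeled metrics (hypothetical helper)."""
    return {m["name"]: m["value"] for m in payload["metrics"] if not m.get("labelNames")}


server = metrics_by_name(json.loads(response_text))

# Every gauge whose name mentions "buffer" is zero in this snapshot.
buffer_metrics = {name: value for name, value in server.items() if "buffer" in name}
print(buffer_metrics)

# In this dump the three read counters add up exactly to total_read_data.
read_parts = (
    server["total_read_memory_data"]
    + server["total_read_local_data_file"]
    + server["total_read_local_index_file"]
)
print(read_parts == server["total_read_data"])
```

The zero gauges match the observation that the buffers themselves are not holding memory when the snapshot was taken; the snapshot says nothing about their peak values between scrapes.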

JVM metrics:

{
    "metrics": [
        {
            "name": "jvm_info",
            "labelNames": [
                "version",
                "vendor",
                "runtime"
            ],
            "labelValues": [
                "1.8.0_292-b10",
                "AdoptOpenJDK",
                "OpenJDK Runtime Environment"
            ],
            "value": 1,
            "timestampMs": null
        },
        {
            "name": "jvm_gc_collection_seconds_count",
            "labelNames": [
                "gc"
            ],
            "labelValues": [
                "G1 Young Generation"
            ],
            "value": 128,
            "timestampMs": null
        },
        {
            "name": "jvm_gc_collection_seconds_sum",
            "labelNames": [
                "gc"
            ],
            "labelValues": [
                "G1 Young Generation"
            ],
            "value": 25.19,
            "timestampMs": null
        },
        {
            "name": "jvm_gc_collection_seconds_count",
            "labelNames": [
                "gc"
            ],
            "labelValues": [
                "G1 Old Generation"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_gc_collection_seconds_sum",
            "labelNames": [
                "gc"
            ],
            "labelValues": [
                "G1 Old Generation"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_current",
            "labelNames": [],
            "labelValues": [],
            "value": 1067,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_daemon",
            "labelNames": [],
            "labelValues": [],
            "value": 1051,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_peak",
            "labelNames": [],
            "labelValues": [],
            "value": 1067,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_started_total",
            "labelNames": [],
            "labelValues": [],
            "value": 1069,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_deadlocked",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_deadlocked_monitor",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_state",
            "labelNames": [
                "state"
            ],
            "labelValues": [
                "NEW"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_state",
            "labelNames": [
                "state"
            ],
            "labelValues": [
                "TIMED_WAITING"
            ],
            "value": 6,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_state",
            "labelNames": [
                "state"
            ],
            "labelValues": [
                "TERMINATED"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_state",
            "labelNames": [
                "state"
            ],
            "labelValues": [
                "RUNNABLE"
            ],
            "value": 36,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_state",
            "labelNames": [
                "state"
            ],
            "labelValues": [
                "WAITING"
            ],
            "value": 1018,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_state",
            "labelNames": [
                "state"
            ],
            "labelValues": [
                "BLOCKED"
            ],
            "value": 7,
            "timestampMs": null
        },
        {
            "name": "jvm_classes_loaded",
            "labelNames": [],
            "labelValues": [],
            "value": 4689,
            "timestampMs": null
        },
        {
            "name": "jvm_classes_loaded_total",
            "labelNames": [],
            "labelValues": [],
            "value": 4692,
            "timestampMs": null
        },
        {
            "name": "jvm_classes_unloaded_total",
            "labelNames": [],
            "labelValues": [],
            "value": 3,
            "timestampMs": null
        },
        {
            "name": "jvm_buffer_pool_used_bytes",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "direct"
            ],
            "value": 32769,
            "timestampMs": null
        },
        {
            "name": "jvm_buffer_pool_used_bytes",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "mapped"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_buffer_pool_capacity_bytes",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "direct"
            ],
            "value": 32768,
            "timestampMs": null
        },
        {
            "name": "jvm_buffer_pool_capacity_bytes",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "mapped"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_buffer_pool_used_buffers",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "direct"
            ],
            "value": 5,
            "timestampMs": null
        },
        {
            "name": "jvm_buffer_pool_used_buffers",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "mapped"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_bytes_used",
            "labelNames": [
                "area"
            ],
            "labelValues": [
                "heap"
            ],
            "value": 16512134504,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_bytes_used",
            "labelNames": [
                "area"
            ],
            "labelValues": [
                "nonheap"
            ],
            "value": 51022000,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_bytes_committed",
            "labelNames": [
                "area"
            ],
            "labelValues": [
                "heap"
            ],
            "value": 53687091200,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_bytes_committed",
            "labelNames": [
                "area"
            ],
            "labelValues": [
                "nonheap"
            ],
            "value": 52494336,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_bytes_max",
            "labelNames": [
                "area"
            ],
            "labelValues": [
                "heap"
            ],
            "value": 53687091200,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_bytes_max",
            "labelNames": [
                "area"
            ],
            "labelValues": [
                "nonheap"
            ],
            "value": -1,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_bytes_init",
            "labelNames": [
                "area"
            ],
            "labelValues": [
                "heap"
            ],
            "value": 53687091200,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_bytes_init",
            "labelNames": [
                "area"
            ],
            "labelValues": [
                "nonheap"
            ],
            "value": 2555904,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_used",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Code Cache"
            ],
            "value": 19920576,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_used",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Metaspace"
            ],
            "value": 31101424,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_used",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Eden Space"
            ],
            "value": 805306368,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_used",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Survivor Space"
            ],
            "value": 33554432,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_used",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Old Gen"
            ],
            "value": 15673273704,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_committed",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Code Cache"
            ],
            "value": 20512768,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_committed",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Metaspace"
            ],
            "value": 31981568,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_committed",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Eden Space"
            ],
            "value": 5603590144,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_committed",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Survivor Space"
            ],
            "value": 33554432,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_committed",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Old Gen"
            ],
            "value": 48049946624,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_max",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Code Cache"
            ],
            "value": 251658240,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_max",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Metaspace"
            ],
            "value": -1,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_max",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Eden Space"
            ],
            "value": -1,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_max",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Survivor Space"
            ],
            "value": -1,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_max",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Old Gen"
            ],
            "value": 53687091200,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_init",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Code Cache"
            ],
            "value": 2555904,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_init",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Metaspace"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_init",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Eden Space"
            ],
            "value": 5637144576,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_init",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Survivor Space"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_init",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Old Gen"
            ],
            "value": 48049946624,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_allocated_bytes_total",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Old Gen"
            ],
            "value": 1055979279032,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_allocated_bytes_total",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Code Cache"
            ],
            "value": 23076416,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_allocated_bytes_total",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Eden Space"
            ],
            "value": 297090940928,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_allocated_bytes_total",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Survivor Space"
            ],
            "value": 9294577664,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_allocated_bytes_total",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Metaspace"
            ],
            "value": 27300720,
            "timestampMs": null
        },
        {
            "name": "process_cpu_seconds_total",
            "labelNames": [],
            "labelValues": [],
            "value": 2042.23,
            "timestampMs": null
        },
        {
            "name": "process_start_time_seconds",
            "labelNames": [],
            "labelValues": [],
            "value": 1640913776.419,
            "timestampMs": null
        },
        {
            "name": "process_open_fds",
            "labelNames": [],
            "labelValues": [],
            "value": 379,
            "timestampMs": null
        },
        {
            "name": "process_max_fds",
            "labelNames": [],
            "labelValues": [],
            "value": 999999,
            "timestampMs": null
        },
        {
            "name": "process_virtual_memory_bytes",
            "labelNames": [],
            "labelValues": [],
            "value": 66492104704,
            "timestampMs": null
        },
        {
            "name": "process_resident_memory_bytes",
            "labelNames": [],
            "labelValues": [],
            "value": 58613448704,
            "timestampMs": null
        }
    ],
    "timeStamp": 1640933954231
}
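
As a rough check on where the 16512134504 bytes of used heap sit, the sketch below adds up the G1 pool values from the JVM dump above. All numbers are copied from that dump; the interpretation in the comments is an assumption, not something the metrics prove.

```python
GIB = 1024 ** 3

# Values copied from the /metrics/jvm dump above.
heap_used = 16512134504  # jvm_memory_bytes_used{area="heap"}
pools_used = {
    "G1 Eden Space": 805306368,
    "G1 Survivor Space": 33554432,
    "G1 Old Gen": 15673273704,
}
old_gen_collections = 0  # jvm_gc_collection_seconds_count{gc="G1 Old Generation"}

# The three G1 pools account for the whole heap figure.
pools_total = sum(pools_used.values())
print(pools_total == heap_used)

for pool, used in pools_used.items():
    share = used / heap_used
    print(f"{pool}: {used / GIB:.2f} GiB ({share:.1%} of used heap)")

# Assumption: with no old-generation collection recorded, part of the old gen
# may be unreachable objects that have not been reclaimed yet, so "used" is an
# upper bound on live data rather than a measure of metadata size.
may_include_garbage = old_gen_collections == 0
print(may_include_garbage)
```

Almost all of the used heap is in the old generation, so these numbers alone cannot say whether it is metadata. A live-object histogram of the shuffle server process (for example `jmap -histo:live <pid>`) would be one way to attribute it.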

jerqi commented on April 28, 2024

It seems the process was killed by the kernel. You should use a virtual machine that has more memory.
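
To put numbers on this, here is a rough sketch that compares the process's resident memory with the JVM heap, using values from the JVM metrics posted earlier in the thread. The machine's memory size is not given anywhere in the thread, so `headroom` takes a hypothetical `limit_bytes` argument, and the 64 GiB in the example is purely illustrative.

```python
GIB = 1024 ** 3

# Values copied from the /metrics/jvm dump earlier in the thread.
heap_max = 53687091200        # jvm_memory_bytes_max{area="heap"}
heap_committed = 53687091200  # jvm_memory_bytes_committed{area="heap"}
heap_used = 16512134504       # jvm_memory_bytes_used{area="heap"}
resident = 58613448704        # process_resident_memory_bytes

# Resident memory beyond the committed heap: thread stacks, metaspace,
# code cache, GC structures and other native allocations.
outside_heap = resident - heap_committed

print(f"heap committed: {heap_committed / GIB:.2f} GiB")
print(f"heap used:      {heap_used / GIB:.2f} GiB")
print(f"resident (RSS): {resident / GIB:.2f} GiB")
print(f"RSS beyond heap: {outside_heap / GIB:.2f} GiB")


def headroom(limit_bytes):
    """Bytes left before RSS reaches a memory limit (hypothetical limit)."""
    return limit_bytes - resident


# Hypothetical 64 GiB machine or container limit, for illustration only.
print(f"headroom at 64 GiB: {headroom(64 * GIB) / GIB:.2f} GiB")
```

The kernel acts on resident memory, not on how much of the heap the JVM reports as used. Since the heap is committed at its maximum from the start, the process can sit close to a machine or container limit even while only about a third of the heap is in use, which would be consistent with a kernel kill rather than a Java `OutOfMemoryError`.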
