harvester's People

Contributors

a6350202, danilaoleynik, davidgcameron, dougbenjamin, fbarreir, jtchilders, lincolnbryant, mightqxc, mweinberg2718, nikmagini, silas1704, tmaeno, tsulaiav, wguanicedew, wyang007

harvester's Issues

Implement minNewWorkersPerCycle?

From Peter Love:
"we need to have a continuous stream (1/cycle) submitted regardless of how much work is in panda. This avoids the problem we see now where no workers are submitted for whatever reason."

Implement a setting to force submission of a minimum number of new workers per cycle, regardless of assigned work.
Keep the default at 0; it would be set only for special queues used to test pilots.
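
A minimal sketch of how such a floor could be applied in the worker adjuster; the setting name and helper function are illustrative, not the actual implementation:

    # Hypothetical sketch: apply a per-cycle floor on new worker submission.
    # "min_new_workers_per_cycle" is the proposed queue-level setting (default 0).
    def apply_min_new_workers(n_new_workers, min_new_workers_per_cycle=0):
        # Never lower what was already requested; only raise it up to the floor.
        return max(n_new_workers, min_new_workers_per_cycle)

    # Example: no work assigned, but the queue keeps a trickle of 1 worker per cycle.
    apply_min_new_workers(0, min_new_workers_per_cycle=1)  # -> 1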

Is this feasible?
Cheers
N

check_credential_lifetime not in ArcproxyCredManager

I see the method in no_voms_cred_manager.py and base_cred_manager.py

lifetime = exeCore.check_credential_lifetime()

AttributeError: 'ArcproxyCredManager' object has no attribute 'check_credential_lifetime'
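
Until the method is added to ArcproxyCredManager, a defensive fallback in the caller could avoid the crash (a sketch; the helper name and the None convention are assumptions):

    # Sketch of a defensive fallback in the caller, so a plugin without the
    # method (e.g. ArcproxyCredManager) does not raise AttributeError.
    def get_credential_lifetime(cred_manager):
        check = getattr(cred_manager, 'check_credential_lifetime', None)
        # None means "unknown lifetime"; the caller can then force a renewal check.
        return check() if callable(check) else None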

Cheers,
Rod.

enforcing memory limits with safety factor for k8s jobs

Hi @fbarreir
I have tried the memory enforcement setting in AGIS for the dev cluster. It protects the nodes from crashing and OOM, but it is a bit too aggressive in killing jobs: the limit is currently set equal to the request for each job type.

    resources:
      limits:
        cpu: "8"
        memory: 16G
      requests:
        cpu: 7200m
        memory: 16G

In order to reasonably protect the nodes from misconfigured user jobs, but also avoid killing jobs too aggressively, we need a safety factor so that the limit exceeds the request. This way jobs will be free to use extra memory (as available) beyond what they requested, up to the limit.

Currently the memory requests for all the different job types are 2G, 4G, 16G, 32G.
I propose something like
limit = request + max(2G, 0.25 * request)
That way the limits would be 4G, 6G, 20G, 40G. What do you think?
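
A sketch of the proposed calculation (values in GB; the 2 GB headroom and the 0.25 factor are the numbers proposed above and would presumably be configurable):

    def memory_limit_gb(request_gb, headroom_gb=2, safety_factor=0.25):
        # The limit exceeds the request by a fixed headroom or a fraction of the
        # request, whichever is larger, so jobs can burst beyond what they asked for.
        return request_gb + max(headroom_gb, safety_factor * request_gb)

    # Current request sizes -> proposed limits: 2->4, 4->6, 16->20, 32->40
    [memory_limit_gb(r) for r in (2, 4, 16, 32)]  # [4.0, 6.0, 20.0, 40.0]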

Perhaps the safety factor of 0.25 could be a configurable parameter in AGIS. Either way this would be a big help, since we need to protect the nodes from crashing/OOM in production (this happened to ~6 nodes today).

Optional pip dependencies

Investigate the possibility of declaring optional pip dependencies, e.g. the pyyaml and kubernetes packages that are only needed when setting up a K8S harvester.
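
One common way to do this is a setuptools extra, so the K8S-specific packages are only installed on demand (a sketch; the extra name and package list are assumptions):

    # setup.py excerpt -- a hypothetical optional dependency group ("extra")
    from setuptools import setup

    setup(
        name='pandaharvester',
        # ... the existing arguments ...
        extras_require={
            'kubernetes': ['kubernetes', 'pyyaml'],
        },
    )
    # A K8S harvester would then install with: pip install pandaharvester[kubernetes]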

K8s: Connection errors to cluster lead to cancelled workers

A burst of connection failures led to many cancelled workers, although the corresponding jobs were actually still running.

[root@aipanda169 harvester]# grep "Failed to establish a new connection" panda-k8s_utils.log-20210629
2021-06-28 16:30:43,457 panda.log.k8s_utils: ERROR    get_pods_info : Failed call to list_namespaced_pod with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /api/v1/namespaces/default/pods?labelSelector=job-name+in+%28grid-job-7541357%2Cgrid-job-7541188%2Cgrid-job-7541384%2Cgrid-job-7541771%2Cgrid-job-7541382%2Cgrid-job-7540599%2Cgrid-job-7541775%2Cgrid-job-7541773%2Cgrid-job-7541348%2Cgrid-job-7541377%2Cgrid-job-7541326%2Cgrid-job-7541330%2Cgrid-job-7541182%2Cgrid-job-7541385%2Cgrid-job-7541378%2Cgrid-job-7541323%2Cgrid-job-7541344%2Cgrid-job-7541386%2Cgrid-job-7541342%2Cgrid-job-7541772%2Cgrid-job-7541327%2Cgrid-job-7541779%2Cgrid-job-7541185%2Cgrid-job-7541352%2Cgrid-job-7541337%2Cgrid-job-7541373%2Cgrid-job-7541350%2Cgrid-job-7541335%2Cgrid-job-7541758%2Cgrid-job-7541193%2Cgrid-job-7541743%2Cgrid-job-7541360%2Cgrid-job-7541338%2Cgrid-job-7541191%2Cgrid-job-7541768%2Cgrid-job-7541364%2Cgrid-job-7541333%2Cgrid-job-7541346%2Cgrid-job-7541776%2Cgrid-job-7541761%2Cgrid-job-7541754%2Cgrid-job-7541372%2Cgrid-job-7541780%2Cgrid-job-7541361%2Cgrid-job-7541354%2Cgrid-job-7541762%2Cgrid-job-7541358%2Cgrid-job-7541195%2Cgrid-job-7540594%2Cgrid-job-7541777%2Cgrid-job-7541770%2Cgrid-job-7541328%2Cgrid-job-7541765%2Cgrid-job-7541774%2Cgrid-job-7541355%2Cgrid-job-7541778%2Cgrid-job-7541376%2Cgrid-job-7541757%2Cgrid-job-7540591%2Cgrid-job-7541744%2Cgrid-job-7541782%2Cgrid-job-7541444%2Cgrid-job-7541369%2Cgrid-job-7541748%2Cgrid-job-7541322%2Cgrid-job-7541367%2Cgrid-job-7541032%2Cgrid-job-7541756%2Cgrid-job-7541375%2Cgrid-job-7541030%2Cgrid-job-7541745%2Cgrid-job-7541363%2Cgrid-job-7541764%2Cgrid-job-7541383%2Cgrid-job-7541742%2Cgrid-job-7541469%2Cgrid-job-7541752%2Cgrid-job-7541781%2Cgrid-job-7541598%2Cgrid-job-7541371%2Cgrid-job-7541026%2Cgrid-job-7541325%2Cgrid-job-7541750%2Cgrid-job-7541183%2Cgrid-job-7541760%2Cgrid-job-7540582%2Cgrid-job-7541379%2Cgrid-job-7541591%2Cgrid-job-7540592%2Cgrid-job-7541594%2Cgrid-job-7541387%2Cgrid-job-7541753%2Cgrid-job-7541343%2Cgrid-job-7540998%2Cgrid-job-7540596%2Cgrid-job-7541194%2Cgrid-job-7541751%2Cgrid-job-7541324%2Cgrid-job-7540585%2Cgrid-job-7541759%2Cgrid-job-7541181%2Cgrid-job-7540593%2Cgrid-job-7541749%2Cgrid-job-7541747%2Cgrid-job-7541767%2Cgrid-job-7541320%2Cgrid-job-7541349%2Cgrid-job-7540590%2Cgrid-job-7541362%2Cgrid-job-7541009%2Cgrid-job-7541002%2Cgrid-job-7541755%2Cgrid-job-7541388%2Cgrid-job-7540598%2Cgrid-job-7540587%2Cgrid-job-7541180%2Cgrid-job-7541763%2Cgrid-job-7541332%2Cgrid-job-7541374%2Cgrid-job-7541321%2Cgrid-job-7541359%29 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5c7f836a0>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:49,228 panda.log.k8s_utils: ERROR    create_or_patch_configmap_starter : Could not create configmap with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /api/v1/namespaces/default/configmaps/pilots-starter (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd61c0f60b8>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:53,825 panda.log.k8s_utils: ERROR    get_pods_info : Failed call to list_namespaced_pod with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /api/v1/namespaces/default/pods?labelSelector=job-name+in+%28grid-job-7539100%2Cgrid-job-7538950%2Cgrid-job-7538951%2Cgrid-job-7538952%2Cgrid-job-7538953%2Cgrid-job-7538958%2Cgrid-job-7538960%2Cgrid-job-7538961%2Cgrid-job-7538966%2Cgrid-job-7538969%2Cgrid-job-7539041%2Cgrid-job-7539045%2Cgrid-job-7539050%2Cgrid-job-7539097%2Cgrid-job-7538933%2Cgrid-job-7538925%2Cgrid-job-7538921%2Cgrid-job-7538881%2Cgrid-job-7538883%2Cgrid-job-7538889%2Cgrid-job-7538890%2Cgrid-job-7538891%2Cgrid-job-7538898%2Cgrid-job-7538907%2Cgrid-job-7538909%2Cgrid-job-7538910%2Cgrid-job-7538914%2Cgrid-job-7538916%2Cgrid-job-7538917%29 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5d8ca5668>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:53,904 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7539100> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7539100?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5d84612b0>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:53,983 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538950> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538950?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd62d14b978>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:54,062 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538951> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538951?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5f8692eb8>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:54,142 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538952> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538952?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5c777fdd8>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:54,224 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538953> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538953?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5c7d70438>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:54,303 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538958> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538958?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5f8692c18>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:54,383 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538960> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538960?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd62d14bd68>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:54,458 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538961> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538961?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd61c287048>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:54,537 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538966> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538966?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5c74aafd0>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:54,615 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538969> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538969?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5f860db38>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:54,696 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7539041> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7539041?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5d900c978>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:54,774 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7539045> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7539045?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5d97a3208>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:54,852 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7539050> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7539050?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5c7d0b128>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:54,930 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7539097> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7539097?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd62d14b048>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:55,009 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538933> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538933?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5f8692c18>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:55,089 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538925> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538925?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5c7d704a8>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:55,167 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538921> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538921?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5c777a5f8>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:55,245 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538881> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538881?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5c7d70f60>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:55,325 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538883> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538883?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd62f354908>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:55,404 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538889> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538889?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd62d14b080>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:55,483 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538890> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538890?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5d84612e8>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:55,561 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538891> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538891?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5c79944a8>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:55,639 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538898> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538898?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5c777a358>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:55,719 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538907> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538907?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5f860d358>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:55,798 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538909> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538909?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5d8461fd0>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:55,876 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538910> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538910?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5d8e29c18>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:55,954 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538914> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538914?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd62c03f828>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:56,034 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538916> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538916?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5c7b1f048>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:30:56,114 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538917> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538917?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5c777add8>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:31:35,332 panda.log.k8s_utils: ERROR    get_pods_info : Failed call to list_namespaced_pod with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /api/v1/namespaces/default/pods?labelSelector=job-name+in+%28grid-job-7541917%2Cgrid-job-7541920%2Cgrid-job-7541927%2Cgrid-job-7541930%2Cgrid-job-7541925%2Cgrid-job-7541953%2Cgrid-job-7541932%2Cgrid-job-7541939%2Cgrid-job-7541918%2Cgrid-job-7541921%2Cgrid-job-7541929%2Cgrid-job-7541944%2Cgrid-job-7541942%2Cgrid-job-7541950%2Cgrid-job-7541946%2Cgrid-job-7541941%2Cgrid-job-7541931%2Cgrid-job-7541933%2Cgrid-job-7541943%2Cgrid-job-7541951%2Cgrid-job-7541936%2Cgrid-job-7541919%2Cgrid-job-7541948%2Cgrid-job-7541938%2Cgrid-job-7541945%2Cgrid-job-7541934%2Cgrid-job-7541916%2Cgrid-job-7541924%2Cgrid-job-7541947%2Cgrid-job-7541949%2Cgrid-job-7541937%2Cgrid-job-7541915%2Cgrid-job-7541923%2Cgrid-job-7541952%2Cgrid-job-7541935%2Cgrid-job-7541954%2Cgrid-job-7541928%2Cgrid-job-7541926%2Cgrid-job-7541940%2Cgrid-job-7541922%29 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd62f11a400>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:31:50,066 panda.log.k8s_utils: ERROR    create_or_patch_configmap_starter : Could not create configmap with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /api/v1/namespaces/default/configmaps/pilots-starter (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5c790abe0>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:15,522 panda.log.k8s_utils: ERROR    get_pods_info : Failed call to list_namespaced_pod with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /api/v1/namespaces/default/pods?labelSelector=job-name+in+%28grid-job-7538935%2Cgrid-job-7538939%2Cgrid-job-7538940%2Cgrid-job-7538941%29 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5b916ada0>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:15,602 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538935> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538935?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd62f374fd0>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:15,681 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538939> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538939?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd62f374828>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:15,759 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538940> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538940?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5bb6abeb8>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:15,839 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538941> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538941?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5bb6abeb8>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:21,186 panda.log.k8s_utils: ERROR    get_pods_info : Failed call to list_namespaced_pod with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /api/v1/namespaces/default/pods?labelSelector=job-name+in+%28grid-job-7539208%2Cgrid-job-7539209%2Cgrid-job-7538999%29 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5c735c940>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:24,545 panda.log.k8s_utils: ERROR    get_pods_info : Failed call to list_namespaced_pod with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /api/v1/namespaces/default/pods?labelSelector=job-name+in+%28grid-job-7538869%2Cgrid-job-7538742%2Cgrid-job-7538902%2Cgrid-job-7538911%2Cgrid-job-7538919%2Cgrid-job-7538926%2Cgrid-job-7538930%2Cgrid-job-7538740%2Cgrid-job-7538934%2Cgrid-job-7538887%2Cgrid-job-7538906%29 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd62f6e4828>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:24,623 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538869> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538869?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5d96993c8>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:24,699 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538742> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538742?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5d9699358>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:24,776 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538902> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538902?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5c736c518>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:24,853 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538911> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538911?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd62d34b748>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:24,933 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538919> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538919?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd62d34b908>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:25,010 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538926> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538926?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5d9699f60>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:25,087 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538930> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538930?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5d9699a90>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:25,167 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538740> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538740?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd62f6e4b00>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:25,244 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538934> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538934?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5c76fd7b8>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:25,324 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538887> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538887?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5d91d1fd0>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:25,402 panda.log.k8s_utils: ERROR    delete_job <job_name=grid-job-7538906> Failed call to delete_namespaced_job with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/default/jobs/grid-job-7538906?gracePeriodSeconds=0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd62f2d2ef0>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:33:52,876 panda.log.k8s_utils: ERROR    get_pods_info : Failed call to list_namespaced_pod with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /api/v1/namespaces/default/pods?labelSelector=job-name+in+%28grid-job-7540676%2Cgrid-job-7539070%2Cgrid-job-7539077%2Cgrid-job-7540678%2Cgrid-job-7540677%2Cgrid-job-7537923%2Cgrid-job-7539987%2Cgrid-job-7540082%29 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5d8786048>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:34:06,384 panda.log.k8s_utils: ERROR    get_pods_info : Failed call to list_namespaced_pod with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /api/v1/namespaces/default/pods?labelSelector=job-name+in+%28grid-job-7541596%2Cgrid-job-7541033%2Cgrid-job-7541599%2Cgrid-job-7540583%2Cgrid-job-7541440%2Cgrid-job-7541448%2Cgrid-job-7541446%2Cgrid-job-7541593%2Cgrid-job-7541460%2Cgrid-job-7541442%2Cgrid-job-7541445%2Cgrid-job-7541443%2Cgrid-job-7540597%2Cgrid-job-7541450%2Cgrid-job-7541595%2Cgrid-job-7541597%2Cgrid-job-7541452%2Cgrid-job-7541600%2Cgrid-job-7541441%2Cgrid-job-7541449%2Cgrid-job-7541447%2Cgrid-job-7541602%2Cgrid-job-7541461%2Cgrid-job-7541459%2Cgrid-job-7541464%2Cgrid-job-7541462%2Cgrid-job-7541453%2Cgrid-job-7541601%2Cgrid-job-7541458%2Cgrid-job-7541451%2Cgrid-job-7541463%2Cgrid-job-7541592%2Cgrid-job-7541590%2Cgrid-job-7541439%2Cgrid-job-7541029%2Cgrid-job-7541468%29 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5d89dfef0>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:34:17,647 panda.log.k8s_utils: ERROR    get_pods_info : Failed call to list_namespaced_pod with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /api/v1/namespaces/default/pods?labelSelector=job-name+in+%28grid-job-7541259%2Cgrid-job-7539215%2Cgrid-job-7541261%2Cgrid-job-7541263%2Cgrid-job-7539223%2Cgrid-job-7540866%2Cgrid-job-7541260%2Cgrid-job-7541267%2Cgrid-job-7540869%2Cgrid-job-7541256%2Cgrid-job-7541087%2Cgrid-job-7541254%2Cgrid-job-7541264%2Cgrid-job-7541268%2Cgrid-job-7541257%2Cgrid-job-7540868%2Cgrid-job-7541093%2Cgrid-job-7541085%2Cgrid-job-7541265%2Cgrid-job-7541258%2Cgrid-job-7541262%2Cgrid-job-7541266%2Cgrid-job-7539222%2Cgrid-job-7541471%2Cgrid-job-7541092%2Cgrid-job-7540651%2Cgrid-job-7541122%2Cgrid-job-7539226%2Cgrid-job-7539933%2Cgrid-job-7541086%2Cgrid-job-7540867%2Cgrid-job-7541089%2Cgrid-job-7541090%2Cgrid-job-7541255%2Cgrid-job-7540650%29 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5ba037278>: Failed to establish a new connection: [Errno 111] Connection refused',))
2021-06-28 16:34:18,865 panda.log.k8s_utils: ERROR    get_pods_info : Failed call to list_namespaced_pod with: HTTPSConnectionPool(host='35.195.9.85', port=443): Max retries exceeded with url: /api/v1/namespaces/default/pods?labelSelector=job-name+in+%28grid-job-7541978%2Cgrid-job-7541973%2Cgrid-job-7541971%2Cgrid-job-7541968%2Cgrid-job-7541980%2Cgrid-job-7541962%2Cgrid-job-7540752%2Cgrid-job-7541976%2Cgrid-job-7541974%2Cgrid-job-7541965%2Cgrid-job-7541983%2Cgrid-job-7541964%2Cgrid-job-7539088%2Cgrid-job-7541979%2Cgrid-job-7541981%2Cgrid-job-7541972%2Cgrid-job-7541969%2Cgrid-job-7541966%2Cgrid-job-7541970%2Cgrid-job-7541963%2Cgrid-job-7541982%2Cgrid-job-7541967%2Cgrid-job-7541977%2Cgrid-job-7541975%29 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd5f861fe80>: Failed to establish a new connection: [Errno 111] Connection refused',))
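
For the connection failures above, one possible mitigation (a sketch, not the project's actual fix) would be for the monitor to treat connection-level errors as "status unknown" and retry later, rather than cancelling workers it temporarily cannot see:

    from kubernetes.client.rest import ApiException
    from urllib3.exceptions import MaxRetryError, NewConnectionError

    def safe_list_pods(core_v1, namespace, label_selector):
        # Return None on connection-level failures so the caller can keep the
        # previous worker status instead of cancelling workers it cannot see.
        try:
            return core_v1.list_namespaced_pod(namespace, label_selector=label_selector)
        except (MaxRetryError, NewConnectionError, ApiException):
            return None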

K8s: improve ephemeral storage management

I found more information about this. I had not realized it could be done this way, but I think it would be better to set pod requests and limits for ephemeral storage: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#setting-requests-and-limits-for-local-ephemeral-storage

instead of the fixed sizeLimit applied directly to the emptyDir volume (which effectively means request = limit, so less flexibility to accommodate disk oversubscription).

We can base it on maxwdir. However, since this accounting is all-inclusive (emptyDir scratch, pod images, and pod stdout logs), it may need a little more logic or some configurable parameters to come up with the right numbers; I will think about it.
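
A sketch of the per-container resources dict this could produce, using the standard ephemeral-storage resource name (how the numbers are derived from maxwdir is still open, and the factor below is only illustrative):

    def ephemeral_storage_resources(maxwdir_gb, oversubscription=1.5):
        # Request roughly what the payload nominally needs (maxwdir) and allow the
        # pod to grow up to a larger limit, covering images and stdout logs too.
        # The oversubscription factor of 1.5 is purely illustrative.
        return {
            'requests': {'ephemeral-storage': '%dG' % maxwdir_gb},
            'limits': {'ephemeral-storage': '%dG' % int(maxwdir_gb * oversubscription)},
        }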

state to consider failed in k8s monitor

[root@aipanda169 ]# kubectl describe pods grid-job-7995388-5t86q
Name: grid-job-7995388-5t86q
Namespace: default
Priority: 0
Node: gke-panda-gke-bulk-preemptible-08fded9c-ng76/10.132.0.74
Start Time: Thu, 08 Jul 2021 12:32:49 +0000
Labels: controller-uid=419158eb-ca64-4168-9094-42978215c340
job-name=grid-job-7995388
pq=GOOGLE_BULK
prodSourceLabel=managed
resourceType=MCORE
Annotations:
Status: Failed
IP: 10.44.74.46
IPs:
IP: 10.44.74.46
Controlled By: Job/grid-job-7995388
Containers:
atlas-grid-centos7:
Container ID: containerd://39d285c276b7a2abd051e4ca26eacbdf31ce90b4f4cc55cd62367ccda80c37de
Image: gitlab-registry.cern.ch/panda/harvester-k8s-images/adc-centos7-singularity:work
Image ID: gitlab-registry.cern.ch/panda/harvester-k8s-images/adc-centos7-singularity@sha256:a914076d2f890ef695645b47d1b8ed040f0770e84258517e2fd185f8f5412d2a
Port:
Host Port:
Command:
/usr/bin/bash
Args:
-c
cd; python $EXEC_DIR/pilots_starter.py || true
State: Terminated
Reason: StartError
Message: failed to create containerd task: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "rootfs_linux.go:58: mounting \"/var/lib/kubelet/pods/2e2a4885-8ca1-41f2-a9a3-7f0153410ad2/volumes/kubernetes.io~local-volume/cvmfs-config-atlas\" to rootfs \"/run/containerd/io.containerd.runtime.v2.task/k8s.io/39d285c276b7a2abd051e4ca26eacbdf31ce90b4f4cc55cd62367ccda80c37de/rootfs\" at \"/run/containerd/io.containerd.runtime.v2.task/k8s.io/39d285c276b7a2abd051e4ca26eacbdf31ce90b4f4cc55cd62367ccda80c37de/rootfs/cvmfs/atlas.cern.ch\" caused \"no such file or directory\""": unknown
Exit Code: 128
Started: Thu, 01 Jan 1970 00:00:00 +0000
Finished: Thu, 08 Jul 2021 12:35:33 +0000
Ready: False
Restart Count: 0
Limits:
cpu: 8
Requests:
cpu: 7200m
memory: 16G
Liveness: exec [/bin/sh -c [ df -h | grep cvmfs2 | wc -l -eq 7 ] && find /tmp/wrapper-wid.log -mmin -10 | egrep '.*'] delay=60s timeout=1s period=120s #success=1 #failure=3
Environment:
computingSite: $computingSite
pandaQueueName: $pandaQueueName
proxySecretPath: $proxySecretPath
proxyContent: $proxyContent
workerID: $workerID
logs_frontend_w: $logs_frontend_w
logs_frontend_r: $logs_frontend_r
resourceType: $resourceType
HARVESTER_WORKER_ID: $HARVESTER_WORKER_ID
HARVESTER_ID: $HARVESTER_ID
PANDA_JSID: $PANDA_JSID
TMPDIR: /pilotdir
PILOT_NOKILL: True
computingSite: GOOGLE_BULK
pandaQueueName: GOOGLE_BULK
resourceType: MCORE
prodSourceLabel: managed
pilotType: PR
pilotUrlOpt:
pythonOption: --pythonversion 3
jobType: managed
proxySecretPath: /proxy/x509up_u25606_prod
workerID: 7995388
logs_frontend_w: https://aipanda047.cern.ch:25443/server/panda
logs_frontend_r: https://aipanda047.cern.ch:25443/cache
stdout_name:
PANDA_JSID: harvester-CERN_central_k8s
HARVESTER_WORKER_ID: 7995388
HARVESTER_ID: CERN_central_k8s
submit_mode: PULL
EXEC_DIR: /scratch/executables
Mounts:
/cvmfs/atlas-condb.cern.ch from atlas-condb (rw)
/cvmfs/atlas-nightlies.cern.ch from atlas-nightlies (rw)
/cvmfs/atlas.cern.ch from atlas (rw)
/cvmfs/grid.cern.ch from grid (rw)
/cvmfs/sft-nightlies.cern.ch from sft-nightlies (rw)
/cvmfs/sft.cern.ch from sft (rw)
/cvmfs/unpacked.cern.ch from unpacked (rw)
/pilotdir from pilot-dir (rw)
/proxy from proxy-secret (rw)
/scratch/executables from pilots-starter (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-rlmxk (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
atlas:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: cvmfs-config-atlas
ReadOnly: true
atlas-condb:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: cvmfs-config-atlas-condb
ReadOnly: true
atlas-nightlies:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: cvmfs-config-atlas-nightlies
ReadOnly: true
sft:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: cvmfs-config-sft
ReadOnly: true
grid:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: cvmfs-config-grid
ReadOnly: true
sft-nightlies:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: cvmfs-config-sft-nightlies
ReadOnly: true
unpacked:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: cvmfs-config-unpacked
ReadOnly: true
proxy-secret:
Type: Secret (a volume populated by a Secret)
SecretName: proxy-secret
Optional: false
pilot-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: 170G
pilots-starter:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: pilots-starter
Optional: false
default-token-rlmxk:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-rlmxk
Optional: false
QoS Class: Burstable
Node-Selectors: bulk=True
Tolerations: bulk=True:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message


Normal Scheduled 3m26s default-scheduler Successfully assigned default/grid-job-7995388-5t86q to gke-panda-gke-bulk-preemptible-08fded9c-ng76
Warning FailedMount 2m50s kubelet, gke-panda-gke-bulk-preemptible-08fded9c-ng76 MountVolume.NewMounter initialization failed for volume "cvmfs-config-sft-nightlies" : path "/mnt/disks/cvmfs-k8s/sft-nightlies.cern.ch" does not exist
Warning FailedMount 2m35s (x2 over 2m47s) kubelet, gke-panda-gke-bulk-preemptible-08fded9c-ng76 MountVolume.NewMounter initialization failed for volume "cvmfs-config-sft" : path "/mnt/disks/cvmfs-k8s/sft.cern.ch" does not exist
Warning FailedMount 106s (x7 over 2m52s) kubelet, gke-panda-gke-bulk-preemptible-08fded9c-ng76 MountVolume.NewMounter initialization failed for volume "cvmfs-config-unpacked" : path "/mnt/disks/cvmfs-k8s/unpacked.cern.ch" does not exist
Warning FailedMount 83s kubelet, gke-panda-gke-bulk-preemptible-08fded9c-ng76 Unable to attach or mount volumes: unmounted volumes=[unpacked], unattached volumes=[unpacked grid proxy-secret pilot-dir pilots-starter default-token-rlmxk atlas atlas-condb atlas-nightlies sft sft-nightlies]: timed out waiting for the condition
Normal Pulled 42s kubelet, gke-panda-gke-bulk-preemptible-08fded9c-ng76 Container image "gitlab-registry.cern.ch/panda/harvester-k8s-images/adc-centos7-singularity:work" already present on machine
Normal Created 42s kubelet, gke-panda-gke-bulk-preemptible-08fded9c-ng76 Created container atlas-grid-centos7
Warning Failed 42s kubelet, gke-panda-gke-bulk-preemptible-08fded9c-ng76 Error: failed to create containerd task: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "rootfs_linux.go:58: mounting \"/var/lib/kubelet/pods/2e2a4885-8ca1-41f2-a9a3-7f0153410ad2/volumes/kubernetes.io~local-volume/cvmfs-config-atlas\" to rootfs \"/run/containerd/io.containerd.runtime.v2.task/k8s.io/atlas-grid-centos7/rootfs\" at \"/run/containerd/io.containerd.runtime.v2.task/k8s.io/atlas-grid-centos7/rootfs/cvmfs/atlas.cern.ch\" caused \"no such file or directory\""": unknown
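
For the monitor, a pod in this situation could be classified as failed directly. A minimal sketch (the attribute access follows the kubernetes Python client, but the policy itself is an assumption):

    def pod_failed_at_startup(pod):
        # Sketch: a pod whose container terminated with reason StartError (e.g. the
        # CVMFS mount failure above) never ran a payload and can be marked failed.
        for cs in (pod.status.container_statuses or []):
            term = cs.state.terminated
            if term and term.reason == 'StartError':
                return True
        return False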

Job fetcher does not pick up new queues correctly

I was trying to figure out why new unified queues in push mode are never picked up correctly. The prodSourceLabel is always set to "managed" when it should be "unified". Harvester has to be manually restarted every time a new queue is added to make it use the correct label.

I think this is because the panda queue info used by the JobFetcher is never refreshed. As a result, the check for the grand-unified state of a queue that was created after Harvester started always returns false here: https://github.com/HSF/harvester/blob/master/pandaharvester/harvesterbody/job_fetcher.py#L55

So the solution would be to update the queue info periodically.
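
A sketch of that kind of periodic refresh (the class name, the ttl, and where it would hook into the JobFetcher are all assumptions):

    import time

    class QueueInfoCache(object):
        # Sketch: re-fetch the panda queue info when the cached copy is older than
        # ttl seconds, so queues added after startup get the correct flags.
        def __init__(self, fetch_func, ttl=3600):
            self.fetch_func = fetch_func
            self.ttl = ttl
            self.data = None
            self.timestamp = 0

        def get(self):
            if self.data is None or time.time() - self.timestamp > self.ttl:
                self.data = self.fetch_func()
                self.timestamp = time.time()
            return self.data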

re-enable proxy checks for harvester k8s jobs

Reminder for us to follow up on the issue with the '-t' flag and doing proxy checks.
It is useful to have the pilot or wrapper do a proxy check and have the output printed in the logs.

There was an issue related to arcproxy and the way the proxy secret is mounted in pods, but this should be resolved in a new arcproxy version:
https://source.coderefinery.org/nordugrid/arc/-/merge_requests/1214/diffs
https://ggus.eu/index.php?mode=ticket_info&ticket_id=152002

optional priorityClasses for score / mcore jobs

Hi @fbarreir many thanks for the memory and disk related improvements!

Following up from the "CA-VICTORIA-K8S-T2 MCORE" email, I suggest enabling PriorityClasses for Harvester jobs.
It is just a string in the pod spec of the job:
https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#pod-priority

We could have e.g. k8s.scheduling.priorityclass.score and k8s.scheduling.priorityclass.mcore. If a string is defined in CRIC for one of those parameters, it is used as the priorityClassName in the pod spec of the job, otherwise priorityClassName is omitted.
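
A sketch of how the k8s plugin could apply it when building the job's pod spec (the parameter names are the ones proposed above; catchall_params stands for wherever the CRIC parameters end up):

    def apply_priority_class(pod_spec, catchall_params, resource_type):
        # Hypothetical CRIC parameters: k8s.scheduling.priorityclass.score / .mcore
        key = 'k8s.scheduling.priorityclass.%s' % resource_type.lower()
        priority_class = catchall_params.get(key)
        if priority_class:
            pod_spec['priorityClassName'] = priority_class
        # if the parameter is not defined, priorityClassName is simply omitted
        return pod_spec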

At least this could allow us to start testing an elevated priority for mcore to see if it is a suitable solution.

K8S: nodes to remove from GKE

Events:
Type Reason Age From Message


Warning SystemOOM 47m kubelet, gke-panda-gke-bulk-preemptible-08fded9c-31pw System OOM encountered, victim process: otelsvc, pid: 1147056
Warning OOMKilling 47m kernel-monitor, gke-panda-gke-bulk-preemptible-08fded9c-31pw Memory cgroup out of memory: Killed process 1147056 (otelsvc) total-vm:757036kB, anon-rss:33400kB, file-rss:34244kB, shmem-rss:0kB, UID:1000 pgtables:260kB oom_score_adj:-998
Warning SystemOOM 2m8s kubelet, gke-panda-gke-bulk-preemptible-08fded9c-31pw System OOM encountered, victim process: otelsvc, pid: 1154801
Warning OOMKilling 2m8s kernel-monitor, gke-panda-gke-bulk-preemptible-08fded9c-31pw Memory cgroup out of memory: Killed process 1154801 (otelsvc) total-vm:756780kB, anon-rss:32396kB, file-rss:34564kB, shmem-rss:0kB, UID:1000 pgtables:264kB oom_score_adj:-998

K8S: state that can be considered failed, but is treated as pending

{'api_version': 'v1',
'items': [{'api_version': None,
'kind': None,
'metadata': {'annotations': None,
'cluster_name': None,
'creation_timestamp': datetime.datetime(2021, 6, 28, 10, 39, 1, tzinfo=tzlocal()),
'deletion_grace_period_seconds': None,
'deletion_timestamp': None,
'finalizers': None,
'generate_name': 'grid-job-7535669-',
'generation': None,
'initializers': None,
'labels': {'controller-uid': '154b9624-8b14-43b7-95dc-57e921e99efc',
'job-name': 'grid-job-7535669',
'pq': 'GOOGLE100',
'prodSourceLabel': 'user',
'resourceType': 'SCORE'},
'managed_fields': None,
'name': 'grid-job-7535669-xzkx5',
'namespace': 'default',
'owner_references': [{'api_version': 'batch/v1',
'block_owner_deletion': True,
'controller': True,
'kind': 'Job',
'name': 'grid-job-7535669',
'uid': '154b9624-8b14-43b7-95dc-57e921e99efc'}],
'resource_version': '39039893',
'self_link': '/api/v1/namespaces/default/pods/grid-job-7535669-xzkx5',
'uid': 'b9167a6c-9880-4d5d-8914-0066b93e4d23'},
...
'status': {'conditions': [{'last_probe_time': None,
'last_transition_time': datetime.datetime(2021, 6, 28, 10, 39, 1, tzinfo=tzlocal()),
'message': None,
'reason': None,
'status': 'True',
'type': 'Initialized'},
{'last_probe_time': None,
'last_transition_time': datetime.datetime(2021, 6, 28, 10, 39, 1, tzinfo=tzlocal()),
'message': 'containers with unready '
'status: '
'[atlas-grid-centos7]',
'reason': 'ContainersNotReady',
'status': 'False',
'type': 'Ready'},
{'last_probe_time': None,
'last_transition_time': datetime.datetime(2021, 6, 28, 10, 39, 1, tzinfo=tzlocal()),
'message': 'containers with unready '
'status: '
'[atlas-grid-centos7]',
'reason': 'ContainersNotReady',
'status': 'False',
'type': 'ContainersReady'},
{'last_probe_time': None,
'last_transition_time': datetime.datetime(2021, 6, 28, 10, 39, 1, tzinfo=tzlocal()),
'message': None,
'reason': None,
'status': 'True',
'type': 'PodScheduled'}],
'container_statuses': [{'container_id': None,
'image': 'gitlab-registry.cern.ch/panda/harvester-k8s-images/adc-centos7-singularity:work',
'image_id': '',
'last_state': {'running': None,
'terminated': None,
'waiting': None},
'name': 'atlas-grid-centos7',
'ready': False,
'restart_count': 0,
'state': {'running': None,
'terminated': None,
'waiting': {'message': 'failed '
'to '
'generate '
'container '
'"bf07ac8f6042d4b3f875ec75dd0b00ad0a89843b7055c2db6258b82818268084" '
'spec: '
'failed '
'to '
'generate '
'spec: '
'failed '
'to '
'stat '
'"/var/lib/kubelet/pods/b9167a6c-9880-4d5d-8914-0066b93e4d23/volumes/kubernetes.iolocal-volume/cvmfs-config-atlas": '
'stat '
'/var/lib/kubelet/pods/b9167a6c-9880-4d5d-8914-0066b93e4d23/volumes/kubernetes.io
local-volume/cvmfs-config-atlas: '
'transport '
'endpoint '
'is '
'not '
'connected',
'reason': 'CreateContainerError'}}}],
'host_ip': '10.132.15.225',
'init_container_statuses': None,
'message': None,
'nominated_node_name': None,
'phase': 'Pending',
'pod_ip': '10.44.19.8',
'qos_class': 'Burstable',
'reason': None,
'start_time': datetime.datetime(2021, 6, 28, 10, 39, 1, tzinfo=tzlocal())}}],
'kind': 'PodList',
'metadata': {'_continue': None,
'remaining_item_count': None,
'resource_version': '39040043',
'self_link': '/api/v1/namespaces/default/pods'}}
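
A sketch of how the monitor could treat such waiting reasons as fatal instead of leaving the worker pending (the set of fatal reasons is an assumption):

    FATAL_WAITING_REASONS = {'CreateContainerError', 'ErrImagePull', 'ImagePullBackOff'}

    def pod_stuck_fatal(pod):
        # A pod still in phase Pending can be unrecoverable if a container is
        # waiting with one of these reasons (here: a broken CVMFS local volume).
        for cs in (pod.status.container_statuses or []):
            waiting = cs.state.waiting
            if waiting and waiting.reason in FATAL_WAITING_REASONS:
                return True
        return False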

RT aggregations not respected for GrandUnified queues

I think that we have a bug in RT throttling for GU queues, see log excerpt below.

Here Harvester was instructed by PanDA to submit 70 new managed MCORE + 182 new user SCORE, which is more than maxNewWorkersPerCycle=200

So it throttled submission to respect the resource-type aggregation, but the result was unexpectedly 70 managed MCORE, 144 user SCORE, and even 1 user MCORE (which wasn't even requested).

I would have expected something like 56 managed MCORE + 144 user SCORE instead (200 in total, and proportional to what was requested).

So it seems that Harvester is throttling only user, not managed, when both are requested.
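
The proportional throttling described above could look like this (a sketch of the arithmetic, not the current code):

    def throttle_proportionally(requested, max_new_workers):
        # requested: {(job_type, resource_type): n_new_workers asked for by PanDA}
        total = sum(requested.values())
        if total <= max_new_workers:
            return dict(requested)
        # Scale every requested count by the same factor (rounding may need a final
        # adjustment so that the sum never exceeds the cap).
        factor = float(max_new_workers) / total
        return {key: int(round(n * factor)) for key, n in requested.items()}

    # Example from the log: 70 managed MCORE + 182 user SCORE, cap 200
    throttle_proportionally({('managed', 'MCORE'): 70, ('user', 'SCORE'): 182}, 200)
    # -> {('managed', 'MCORE'): 56, ('user', 'SCORE'): 144}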


2020-02-06 03:36:36,722 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> start
2020-02-06 03:36:36,722 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> static_num_workers: {'CERN': {'ANY': {'ANY': {'nReady': 0, 'nRunning': 5388, 'nQueue': 1401, 'nNew
Workers': 0}}, 'managed': {'SCORE': {'nReady': 0, 'nRunning': 182, 'nQueue': 31, 'nNewWorkers': 0}, 'MCORE': {'nReady': 0, 'nRunning': 1689, 'nQueue': 994, 'nNewWorkers': 70}, 'MCORE_HIMEM'
: {'nReady': 0, 'nRunning': 3, 'nQueue': 10, 'nNewWorkers': 0}, 'SCORE_HIMEM': {'nReady': 0, 'nRunning': 134, 'nQueue': 18, 'nNewWorkers': 0}}, 'user': {'SCORE': {'nReady': 0, 'nRunning': 1
940, 'nQueue': 339, 'nNewWorkers': 182}, 'MCORE': {'nReady': 0, 'nRunning': 63, 'nQueue': 0, 'nNewWorkers': 0}, 'SCORE_HIMEM': {'nReady': 0, 'nRunning': 1377, 'nQueue': 9, 'nNewWorkers': 0}
, 'MCORE_HIMEM': {'nReady': 0, 'nRunning': 0, 'nQueue': 0, 'nNewWorkers': 0}}}}
2020-02-06 03:36:36,784 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> Processing queue CERN job_type ANY resource_type ANY with static_num_workers {'nReady': 0, 'nRunni
ng': 5388, 'nQueue': 1401, 'nNewWorkers': 0}
2020-02-06 03:36:36,784 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> set n_new_workers=0 by panda in slave mode
2020-02-06 03:36:36,785 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> Processing queue CERN job_type managed resource_type SCORE with static_num_workers {'nReady': 0, '
nRunning': 182, 'nQueue': 31, 'nNewWorkers': 0}
2020-02-06 03:36:36,785 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> set n_new_workers=0 by panda in slave mode
2020-02-06 03:36:36,785 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> Processing queue CERN job_type managed resource_type MCORE with static_num_workers {'nReady': 0, '
nRunning': 1689, 'nQueue': 994, 'nNewWorkers': 70}
2020-02-06 03:36:36,785 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> setting n_new_workers to 70 in max_queued_workers calculation
2020-02-06 03:36:36,785 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> setting n_new_workers to 70 to respect max_workers
2020-02-06 03:36:36,785 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> setting n_new_workers to 70 in order to respect maxNewWorkersPerCycle
2020-02-06 03:36:36,785 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> Processing queue CERN job_type managed resource_type MCORE_HIMEM with static_num_workers {'nReady'
: 0, 'nRunning': 3, 'nQueue': 10, 'nNewWorkers': 0}
2020-02-06 03:36:36,785 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> set n_new_workers=0 by panda in slave mode
2020-02-06 03:36:36,785 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> Processing queue CERN job_type managed resource_type SCORE_HIMEM with static_num_workers {'nReady'
: 0, 'nRunning': 134, 'nQueue': 18, 'nNewWorkers': 0}
2020-02-06 03:36:36,785 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> set n_new_workers=0 by panda in slave mode
2020-02-06 03:36:36,785 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> Processing queue CERN job_type user resource_type SCORE with static_num_workers {'nReady': 0, 'nRu
nning': 1940, 'nQueue': 339, 'nNewWorkers': 182}
2020-02-06 03:36:36,785 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> setting n_new_workers to 182 in max_queued_workers calculation
2020-02-06 03:36:36,785 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> setting n_new_workers to 182 to respect max_workers
2020-02-06 03:36:36,785 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> setting n_new_workers to 182 in order to respect maxNewWorkersPerCycle
2020-02-06 03:36:36,785 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> Processing queue CERN job_type user resource_type MCORE with static_num_workers {'nReady': 0, 'nRu
nning': 63, 'nQueue': 0, 'nNewWorkers': 0}
2020-02-06 03:36:36,785 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> set n_new_workers=0 by panda in slave mode
2020-02-06 03:36:36,786 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> Processing queue CERN job_type user resource_type SCORE_HIMEM with static_num_workers {'nReady': 0
, 'nRunning': 1377, 'nQueue': 9, 'nNewWorkers': 0}
2020-02-06 03:36:36,786 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> set n_new_workers=0 by panda in slave mode
2020-02-06 03:36:36,786 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> Processing queue CERN job_type user resource_type MCORE_HIMEM with static_num_workers {'nReady': 0
, 'nRunning': 0, 'nQueue': 0, 'nNewWorkers': 0}
2020-02-06 03:36:36,786 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> set n_new_workers=0 by panda in slave mode
2020-02-06 03:36:36,786 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> n_new_workers_max_agg=200 for UCORE
2020-02-06 03:36:36,786 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> setting n_new_workers to 0 of job_type managed resource_type SCORE in order to respect RT aggregations for UCORE
2020-02-06 03:36:36,786 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> setting n_new_workers to 70 of job_type managed resource_type MCORE in order to respect RT aggregations for UCORE
2020-02-06 03:36:36,786 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> setting n_new_workers to 0 of job_type managed resource_type MCORE_HIMEM in order to respect RT aggregations for UCORE
2020-02-06 03:36:36,786 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> setting n_new_workers to 0 of job_type managed resource_type SCORE_HIMEM in order to respect RT aggregations for UCORE
2020-02-06 03:36:36,786 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> setting n_new_workers to 144 of job_type user resource_type SCORE in order to respect RT aggregations for UCORE
2020-02-06 03:36:36,786 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> setting n_new_workers to 1 of job_type user resource_type MCORE in order to respect RT aggregations for UCORE
2020-02-06 03:36:36,786 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> setting n_new_workers to 0 of job_type user resource_type SCORE_HIMEM in order to respect RT aggregations for UCORE
2020-02-06 03:36:36,786 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> setting n_new_workers to 0 of job_type user resource_type MCORE_HIMEM in order to respect RT aggregations for UCORE
2020-02-06 03:36:38,287 panda.log.worker_adjuster: DEBUG define_num_workers <site=CERN> defined {'CERN': {'ANY': {'ANY': {'nReady': 0, 'nRunning': 5388, 'nQueue': 1401, 'nNewWorkers': 0}}, 'managed': {'SCORE': {'nReady': 0, 'nRunning': 182, 'nQueue': 31, 'nNewWorkers': 0}, 'MCORE': {'nReady': 0, 'nRunning': 1689, 'nQueue': 994, 'nNewWorkers': 70}, 'MCORE_HIMEM': {'nReady': 0, 'nRunning': 3, 'nQueue': 10, 'nNewWorkers': 0}, 'SCORE_HIMEM': {'nReady': 0, 'nRunning': 134, 'nQueue': 18, 'nNewWorkers': 0}}, 'user': {'SCORE': {'nReady': 0, 'nRunning': 1940, 'nQueue': 339, 'nNewWorkers': 144}, 'MCORE': {'nReady': 0, 'nRunning': 63, 'nQueue': 0, 'nNewWorkers': 1}, 'SCORE_HIMEM': {'nReady': 0, 'nRunning': 1377, 'nQueue': 9, 'nNewWorkers': 0}, 'MCORE_HIMEM': {'nReady': 0, 'nRunning': 0, 'nQueue': 0, 'nNewWorkers': 0}}}}

k8s queue customization through CRIC

Our k8s clusters are in AGIS as CEs:
https://atlas-agis.cern.ch/agis/pandaqueue/detail/CA-VICTORIA-K8S-T2/full/
https://atlas-agis.cern.ch/agis/pandaqueue/detail/CA-VICTORIA-K8S-TEST-T2/full/

update: now in CRIC
https://atlas-cric.cern.ch/core/ce/detail/1825/
https://atlas-cric.cern.ch/core/ce/detail/1826/

pandaqueues too:
https://atlas-cric.cern.ch/atlas/pandaqueue/detail/CA-VICTORIA-K8S-TEST-T2/
https://atlas-cric.cern.ch/atlas/pandaqueue/detail/CA-VICTORIA-K8S-T2/

Currently Harvester has the k8s namespace (named "harvester") embedded in some static bits of config on the server.
For k8s, the namespace is roughly analogous to the queue name of a CE. When Rod put the k8s clusters into AGIS, he filled the queue field with the arbitrary text "atlas", but we could change it to "harvester" to represent the namespace.

If Harvester read the CE queue name from AGIS/CRIC and used it as the k8s namespace, instead of taking it from static config files on the k8s harvester server, this would eliminate a logical point of failure, keep all the important configuration in AGIS/CRIC, and tidy up the config on the harvester server.
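
A sketch of the intended lookup (the field name ce_queue_name is purely illustrative; the actual AGIS/CRIC attribute would need to be confirmed):

    def get_k8s_namespace(queue_config, default='harvester'):
        # Illustrative only: "ce_queue_name" stands for whatever attribute exposes
        # the CE queue field from AGIS/CRIC; fall back to the current static value.
        return queue_config.get('ce_queue_name') or default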

set job and node environment variables for k8s

Hi @fbarreir
To address PanDAWMS/pilot3#51 and PanDAWMS/pilot3#50 we can include this in the pod spec of a job:

      env:
        - name: PANDA_HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: K8S_JOB_ID
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name

I would suggest taking care of this in the Harvester k8s plugin: it is coordinated with the pilot, which makes it somewhat of a standard, and it will apply to all queues. What do you think? We can go ahead and try it even if the pilot is not yet ready to read those environment variables. I have already tested this YAML in a pod and confirmed that it works.
