Performing a change that requires an EKS cluster to be replaced fails, and leaves the

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Here's a minimum repro Pulumi program: <a href="https://gist.g

For a little more insight into this - with <a class="user-mention notranslate" data-ho

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Unable to replace EKS cluster: Error revoking security group ingress rules,about pulumi/pulumi-eks

Comments (16)

lukehoban commented on June 27, 2024

Curious - did you use any non-default settings - in particular for the node securitygroup?

from pulumi-eks.

colinbankier commented on June 27, 2024

Nothing non-default - I reproduced the issue with an extremely minimal config, just

const cluster = new eks.Cluster("cluster", {
    vpcId: "vpc-xxxx",
    subnetIds: ["subnet-xxxx", "subnet-xxxx"],
    instanceType: "r5.large",
    desiredCapacity: 2,
    minSize: 2,
    maxSize: 2,
    storageClasses: "gp2",
    deployDashboard: true,
});

And then added some subnets to the list and the same issue occurred.
(The initial subnets were public ones, I added extra private ones, if that makes any difference).

from pulumi-eks.

ggilmore commented on June 27, 2024

Any updates for this issue? I just ran into the same situation.

from pulumi-eks.

metral commented on June 27, 2024

This appears to be related to a long-standing TF bug when using descriptions in the security group rules: hashicorp/terraform-provider-aws#2879, and a fix supposedly came in terraform-provider-aws 1.19.0 but doesn't appear to be a remedy in this scenario.

Per user comments, removing the descriptions from the security group rules seems to skate around this issue. thoughts @jen20 @lukehoban

from pulumi-eks.

metral commented on June 27, 2024

Update: this error seems related to various other TF user issues when mixing & matching secgroup in-line rules vs standalone secgroup rules, and all issues have been bubbled up into hashicorp/terraform-provider-aws#2069 by TF staff. There's a fix that's supposed to correct this in this comment.

However, the merged fix does not actually fix the problem, and a comment in that issue calls out the identical behavior we're experiencing:

In the result of the second apply action above, note the ingress rule with number 2396394866 Terraform, somehow and from somewhere sees an ingress rule that tries to delete and which doesn't exist anywhere in my state file or my security group in AWS. And that's why it complains that the security group ingress rule does not exist. Why and how is Terraform seeing something that doesn't exist in the .tf or the state file is the question.

This crossing of ingress rules listed above matches similar experiences we're seeing in CloudTrail events denoting the attempt to revoke an ingress rule without permission, and the rules in question for the nodeSecurityGroup can at times show another rule that doesn't exist anywhere, often times using the eksClusterSecurityGroup id in lieu of the desired nodeSecurityGroup id or some other variation, while none of this matches the actual state in AWS. There definitely seems to be some crossing of wires happening in the request.

Per the linked issue and others, all signs point to avoiding the use of defining in-line rules in SecurityGroups in TF due to issues in how ingress rules are gathered within SecurityGroups, and instead opting for separate SecurityGroupRules seems to resolve most folks in general. Attempts at using SecurityGroupRules seem like the right path, but this does not seem to play well with existing clusters.

from pulumi-eks.

ggilmore commented on June 27, 2024

@metral Do you have a repro for this behavior so that an issue can be opened on https://github.com/terraform-providers/terraform-provider-aws ?

from pulumi-eks.

metral commented on June 27, 2024

Here's a minimum repro Pulumi program:

Steps:

Run pulumi up in the unzipped dir
After initial deployment is complete, comment out line #74 subnetIds.pop(), and run another update.
- This simulates having a VPC with existing subnets, and increasing up from using 2 subnets in the cluster to 3 subnets (as noted in the earlier comment).
After about ~10 min it'll error out with Error revoking security group ingress rules: InvalidPermission.NotFound: The specified rule does not exist in this security group.

from pulumi-eks.

lukehoban commented on June 27, 2024

For a little more insight into this - with @metral's repro, here's what the changes are that are happening as part of the second update:

Previewing update (dev):

     Type                                          Name                           Plan        Info
     pulumi:pulumi:Stack                           eks69-dev
     └─ eks:index:Cluster                          eks69
 +-     ├─ aws:eks:Cluster                         eks69-eksCluster               replace     [diff: ~name,vpcConfig]
 ~      ├─ pulumi:providers:kubernetes             eks69-eks-k8s                  update      [diff: ~kubeconfig]
 ~      ├─ kubernetes:storage.k8s.io:StorageClass  eks69-gp2                      update      [diff: ~metadata]; 1 warning
 ~      ├─ pulumi-nodejs:dynamic:Resource          eks69-vpc-cni                  update      [diff: ~kubeconfig]
 ~      ├─ aws:ec2:SecurityGroup                   eks69-nodeSecurityGroup        update      [diff: ~tags]
 +-     ├─ aws:ec2:LaunchConfiguration             eks69-nodeLaunchConfiguration  replace     [diff: ~name,userData]
 ~      ├─ aws:cloudformation:Stack                eks69-nodes                    update      [diff: ~templateBody]
 ~      └─ pulumi:providers:kubernetes             eks69-provider                 update      [diff: ~kubeconfig]

Note in particular that the SecurityGroup is actually only even being updated due to a change in it's tags.

from pulumi-eks.

lukehoban commented on June 27, 2024

Turning on verbose logging from the Terraform Provider (export TF_LOG=TRACE) and then logging the pulumi update (e.g. pulumi up --logtostderr -v=9 --debug 2> out.txt) shows this:

    debug: Revoking security group {
    debug:   Description: "Managed by Pulumi",
    debug:   GroupId: "sg-00ed8f2d8501ba93f",
    debug:   GroupName: "eks69-nodeSecurityGroup-dc62a71",
    debug:   IpPermissions: [{
    debug:       IpProtocol: "-1",
    debug:       UserIdGroupPairs: [{
    debug:           Description: "Allow nodes to communicate with each other",
    debug:           GroupId: "sg-00ed8f2d8501ba93f",
    debug:           UserId: "153052954103"
    debug:         }]
    debug:     },{
    debug:       FromPort: 1025,
    debug:       IpProtocol: "tcp",
    debug:       ToPort: 65535,
    debug:       UserIdGroupPairs: [{
    debug:           Description: "Allow worker Kubelets and pods to receive communication from the cluster control plane",
    debug:           GroupId: "sg-01fb5f6f2cd802f8f",
    debug:           UserId: "153052954103"
    debug:         }]
    debug:     },{
    debug:       FromPort: 443,
    debug:       IpProtocol: "tcp",
    debug:       ToPort: 443,
    debug:       UserIdGroupPairs: [{
    debug:           Description: "Allow pods running extension API servers on port 443 to receive communication from cluster control plane",
    debug:           GroupId: "sg-01fb5f6f2cd802f8f",
    debug:           UserId: "153052954103"
    debug:         }]
    debug:     }],
    debug:   IpPermissionsEgress: [{
    debug:       IpProtocol: "-1",
    debug:       IpRanges: [{
    debug:           CidrIp: "0.0.0.0/0",
    debug:           Description: "Allow internet access."
    debug:         }]
    debug:     }],
    debug:   OwnerId: "153052954103",
    debug:   Tags: [{
    debug:       Key: "kubernetes.io/cluster/eks69-eksCluster-254e5bc",
    debug:       Value: "owned"
    debug:     }],
    debug:   VpcId: "vpc-0eab3666fe9e0ab7b"
    debug: } ingress rule: []*ec2.IpPermission{{
    debug:   FromPort: 0,
    debug:   IpProtocol: "-1",
    debug:   ToPort: 0,
    debug:   UserIdGroupPairs: [{
    debug:       Description: "Allow nodes to communicate with each other",
    debug:       GroupId: "sg-01fb5f6f2cd802f8f"
    debug:     }]
    debug: }}
    debug: [aws-sdk-go] DEBUG: Request ec2/RevokeSecurityGroupIngress Details:
    debug: ---[ REQUEST POST-SIGN ]-----------------------------
    debug: POST / HTTP/1.1
    debug: Host: ec2.us-west-2.amazonaws.com
    debug: User-Agent: aws-sdk-go/1.19.4 (go1.12.3; darwin; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.11.12
    debug: Content-Length: 297
    debug: Authorization: AWS4-HMAC-SHA256 Credential=redacted/20190420/us-west-2/ec2/aws4_request, SignedHeaders=content-length;content-type;host;x-amz-date, Signature=7b1037d29442740b6c5d3cd1b8b86c97304eb2dd84a49b250c6fcf06a920baf9
    debug: Content-Type: application/x-www-form-urlencoded; charset=utf-8
    debug: X-Amz-Date: 20190420T191141Z
    debug: Accept-Encoding: gzip
    debug: 
    debug: Action=RevokeSecurityGroupIngress&GroupId=sg-00ed8f2d8501ba93f&IpPermissions.1.FromPort=0&IpPermissions.1.Groups.1.Description=Allow+nodes+to+communicate+with+each+other&IpPermissions.1.Groups.1.GroupId=sg-01fb5f6f2cd802f8f&IpPermissions.1.IpProtocol=-1&IpPermissions.1.ToPort=0&Version=2016-11-15
    debug: -----------------------------------------------------
    debug: RegisterResource RPC finished: resource:eks69-gp2[kubernetes:storage.k8s.io/v1:StorageClass]; err: null, resp: urn:pulumi:dev::eks69::eks:index:Cluster$kubernetes:storage.k8s.io/v1:StorageClass::eks69-gp2,eks69-gp2-ffhxrwou,__inputs,,,,,apiVersion,,,storage.k8s.io/v1,kind,,,StorageClass,metadata,,,,,annotations,,,,,pulumi.com/autonamed,,,true,storageclass.kubernetes.io/is-default-class,,,true,name,,,eks69-gp2-ffhxrwou,parameters,,,,,type,,,gp2,provisioner,,,kubernetes.io/aws-ebs,apiVersion,,,storage.k8s.io/v1,kind,,,StorageClass,metadata,,,,,annotations,,,,,pulumi.com/autonamed,,,true,storageclass.kubernetes.io/is-default-class,,,true,creationTimestamp,,,2019-04-20T18:57:19Z,name,,,eks69-gp2-ffhxrwou,resourceVersion,,,370,selfLink,,,/apis/storage.k8s.io/v1/storageclasses/eks69-gp2-ffhxrwou,uid,,,1f9c24b2-639e-11e9-b4bc-02ace6ef69c6,parameters,,,,,type,,,gp2,provisioner,,,kubernetes.io/aws-ebs,reclaimPolicy,,,Delete,volumeBindingMode,,,Immediate,true,
    debug: [aws-sdk-go] DEBUG: Response ec2/RevokeSecurityGroupIngress Details:
    debug: ---[ RESPONSE ]--------------------------------------
    debug: HTTP/1.1 400 Bad Request
    debug: Connection: close
    debug: Transfer-Encoding: chunked
    debug: Date: Sat, 20 Apr 2019 19:11:41 GMT
    debug: Server: AmazonEC2
    debug: 
    debug: 
    debug: -----------------------------------------------------
    debug: [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
    debug: <Response><Errors><Error><Code>InvalidPermission.NotFound</Code><Message>The specified rule does not exist in this security group.</Message></Error></Errors><RequestID>994d0790-1dc9-487c-969e-993b4e89f8f4</RequestID></Response>
    debug: [aws-sdk-go] DEBUG: Validate Response ec2/RevokeSecurityGroupIngress failed, not retrying, error InvalidPermission.NotFound: The specified rule does not exist in this security group.
    debug:      status code: 400, request id: 994d0790-1dc9-487c-969e-993b4e89f8f4
    error: update failed

This corresponds to the code at https://github.com/terraform-providers/terraform-provider-aws/blob/7beffd7ce24efe62af769e581b3c354ddb724a6d/aws/resource_aws_security_group.go#L725.

What's really surprising here is that it says we are revoking this:

    debug: []*ec2.IpPermission{{
    debug:   FromPort: 0,
    debug:   IpProtocol: "-1",
    debug:   ToPort: 0,
    debug:   UserIdGroupPairs: [{
    debug:       Description: "Allow nodes to communicate with each other",
    debug:       GroupId: "sg-01fb5f6f2cd802f8f"
    debug:     }]
    debug: }}

But the corresponding ingress rule looks like this:

    debug:    {
    debug:       IpProtocol: "-1",
    debug:       UserIdGroupPairs: [{
    debug:           Description: "Allow nodes to communicate with each other",
    debug:           GroupId: "sg-00ed8f2d8501ba93f",
    debug:           UserId: "153052954103"
    debug:         }]
    debug:     }

Which is then what gets embedded into the AWS API request that fails:

Action=RevokeSecurityGroupIngress&GroupId=sg-00ed8f2d8501ba93f&IpPermissions.1.FromPort=0&IpPermissions.1.Groups.1.Description=Allow+nodes+to+communicate+with+each+other&IpPermissions.1.Groups.1.GroupId=sg-01fb5f6f2cd802f8f&IpPermissions.1.IpProtocol=-1&IpPermissions.1.ToPort=0&Version=2016-11-15

Note that these have different GroupId values.

And indeed, when I look at the actual AWS resources, there is a resource corresponding to the second, but no resource corresponding to the first. It is unclear where the Terraform provider is even coming up with that sg-01fb5f6f2cd802f8f value. It's the right value for a different ingress rule, but is not present in any form for this ingress rule.

In the source code, this ingress rule is defined as:

            {
                description: "Allow nodes to communicate with each other",
                fromPort: 0,
                toPort: 0,
                protocol: "-1", // all
                self: true,
            },

Notably - it uses self: true to specify the source securityGroup, but the Terraform Provider appears to be getting confused and using some totally unrelated Security Group in place of the self link.

from pulumi-eks.

metral commented on June 27, 2024

@lukehoban Your analysis is spot on to what I've been seeing. I've even tried removing the use of self in place for the nodeSecurityGroupId itself, but even that doesn't seem to fix the TF provider from tripping up when gathering rules - which is a spot ref'd in a couple of issues.

Moving away from in-line rules to secgroup rules sounds like the best path forward to avoid the errors we're seeing. Though doing so cleanly for existing clusters seems to be a challenge without it wanting to recreate the worker nodes per the dep chain, and it will still hit the same errors. Thoughts?

from pulumi-eks.

lukehoban commented on June 27, 2024

Okay - I'm beginning to believe the root cause here is actually a core Pulumi engine bug. And moreover, I'm also beginning to think it's the same bug I spent most of this week fighting as part of pulumi/pulumi-terraform#362 and ultimately currently being tracked in pulumi/pulumi#2650.

The key is to look at the Pulumi state file (pulumi stack export > stack.json), and notice the following values for ingress on the "inputs" and "outputs": of the security group in question respectively:

Inputs:

"ingress": [
                        {
                            "__defaults": [],
                            "description": "Allow nodes to communicate with each other",
                            "fromPort": 0,
                            "protocol": "-1",
                            "self": true,
                            "toPort": 0
                        },
                        {
                            "__defaults": [
                                "self"
                            ],
                            "description": "Allow worker Kubelets and pods to receive communication from the cluster control plane",
                            "fromPort": 1025,
                            "protocol": "tcp",
                            "securityGroups": [
                                "sg-01fb5f6f2cd802f8f"
                            ],
                            "self": false,
                            "toPort": 65535
                        },
                        {
                            "__defaults": [
                                "self"
                            ],
                            "description": "Allow pods running extension API servers on port 443 to receive communication from cluster control plane",
                            "fromPort": 443,
                            "protocol": "tcp",
                            "securityGroups": [
                                "sg-01fb5f6f2cd802f8f"
                            ],
                            "self": false,
                            "toPort": 443
                        }
                    ],

Outputs:

"ingress": [
                        {
                            "cidrBlocks": [],
                            "description": "Allow worker Kubelets and pods to receive communication from the cluster control plane",
                            "fromPort": 1025,
                            "ipv6CidrBlocks": [],
                            "prefixListIds": [],
                            "protocol": "tcp",
                            "securityGroups": [
                                "sg-01fb5f6f2cd802f8f"
                            ],
                            "self": false,
                            "toPort": 65535
                        },
                        {
                            "cidrBlocks": [],
                            "description": "Allow nodes to communicate with each other",
                            "fromPort": 0,
                            "ipv6CidrBlocks": [],
                            "prefixListIds": [],
                            "protocol": "-1",
                            "securityGroups": [],
                            "self": true,
                            "toPort": 0
                        },
                        {
                            "cidrBlocks": [],
                            "description": "Allow pods running extension API servers on port 443 to receive communication from cluster control plane",
                            "fromPort": 443,
                            "ipv6CidrBlocks": [],
                            "prefixListIds": [],
                            "protocol": "tcp",
                            "securityGroups": [
                                "sg-01fb5f6f2cd802f8f"
                            ],
                            "self": false,
                            "toPort": 443
                        }
                    ],

Note that the two arrays are in different orders. But then as highlighted in pulumi/pulumi#2650, the Pulumi engine does a (highly suspicious) deep merge of input and outputs to compute the "old inputs", including an element-wise merge of arrays. This will cause the values passed to the Terraform provider Update call to be a strange mix of two of the specified rules - which is capable of triggering the results seen above.

And indeed - a close inspection of the Update call Pulumi schedules shows:

I0420 12:11:39.874548   23508 rpc.go:68] Marshaling property for RPC[Provider[aws, 0xc004f81d60].Update(sg-00ed8f2d8501ba93f,urn:pulumi:dev::eks69::eks:index:Cluster$aws:ec2/securityGroup:SecurityGroup::eks69-nodeSecurityGroup).olds]: ingress={[{map[__defaults:{[]} cidrBlocks:{[]} description:{Allow worker Kubelets and pods to receive communication from the cluster control plane} fromPort:{1025} ipv6CidrBlocks:{[]} prefixListIds:{[]} protocol:{tcp} securityGroups:{[{sg-01fb5f6f2cd802f8f}]} self:{false} toPort:{65535}]} {map[__defaults:{[{self}]} cidrBlocks:{[]} description:{Allow nodes to communicate with each other} fromPort:{0} ipv6CidrBlocks:{[]} prefixListIds:{[]} protocol:{-1} securityGroups:{[{sg-01fb5f6f2cd802f8f}]} self:{true} toPort:{0}]} {map[__defaults:{[{self}]} cidrBlocks:{[]} description:{Allow pods running extension API servers on port 443 to receive communication from cluster control plane} fromPort:{443} ipv6CidrBlocks:{[]} prefixListIds:{[]} protocol:{tcp} securityGroups:{[{sg-01fb5f6f2cd802f8f}]} self:{false} toPort:{443}]}]}

Which includes:

{map[__defaults:{[{self}]} cidrBlocks:{[]} description:{Allow nodes to communicate with each other} fromPort:{0} ipv6CidrBlocks:{[]} prefixListIds:{[]} protocol:{-1} securityGroups:{[{sg-01fb5f6f2cd802f8f}]} self:{true} toPort:{0}]}

Which includes both self: true and securityGroups: ["sg-01fb5f6f2cd802f8f"] - even though the user never specified this, and no such thing exists in the cloud or even in the state file.

Net - I have fairly high confidence the root cause here is pulumi/pulumi#2650.

from pulumi-eks.

lukehoban commented on June 27, 2024

As a result - a workaround for this bug is to do the following:

pulumi stack export > stack.json
Find the nodeSecurityGroup outputs ingress array and ensure it is ordered the same as the inputs ingress. This most likely means swapping the first two elements of the three element array. Save the file.
pulumi stack import < stack.json

from pulumi-eks.

lukehoban commented on June 27, 2024

As for near term fixes to the specific case here.

Moving away from in-line rules to secgroup rules sounds like the best path forward to avoid the errors we're seeing.

I agree that in general this is a better approach - both to avoid issues like the one described above - but also just because it's generally easier to extend (users can add their own rules if the want).

Though doing so cleanly for existing clusters seems to be a challenge without it wanting to recreate the worker nodes per the dep chain, and it will still hit the same errors .

I think you are right that any attempt to change the SecurityGroup that this resource currently deploys will fail, leading to any attempt to "fix" this to actually lead to triggering this issue. We may want to wait to see if we can fix pulumi/pulumi#2650 in the core engine first, and then after we do that, I believe we could safely transition to a cleaner approach to managing ingress rules here.

For right now, the workaround in #69 (comment) should unblock anyone currently blocked by this.

from pulumi-eks.

lukehoban commented on June 27, 2024

The original issue here will be fixed by using a pulumi CLI with the fix for pulumi/pulumi#2650. @metral will open a new issue to track some related improvements we can make to avoid using inline security groups. But the specific issue here should be fixed.

from pulumi-eks.

metral commented on June 27, 2024

The fix for this issue is now available in the current latest version of the CLI, v0.17.8: https://github.com/pulumi/pulumi/releases/tag/v0.17.8

from pulumi-eks.

steveellis commented on June 27, 2024

I just encountered Error revoking security group ingress rules: InvalidPermission.NotFound: The specified rule does not exist in this security group. I'm on pulumi cli version 3.19.0. Running pulumi destroy -r was all I needed. Hopefully this will help someone else.

from pulumi-eks.

Unable to replace EKS cluster: Error revoking security group ingress rules about pulumi-eks HOT 16 CLOSED

Comments (16)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent