Comments (16)
Curious: did you use any non-default settings, in particular for the node security group?
from pulumi-eks.
Nothing non-default. I reproduced the issue with an extremely minimal config, just:
const cluster = new eks.Cluster("cluster", {
    vpcId: "vpc-xxxx",
    subnetIds: ["subnet-xxxx", "subnet-xxxx"],
    instanceType: "r5.large",
    desiredCapacity: 2,
    minSize: 2,
    maxSize: 2,
    storageClasses: "gp2",
    deployDashboard: true,
});
And then added some subnets to the list and the same issue occurred.
(The initial subnets were public ones, I added extra private ones, if that makes any difference).
Any updates for this issue? I just ran into the same situation.
This appears to be related to a long-standing TF bug when using descriptions in security group rules: hashicorp/terraform-provider-aws#2879. A fix supposedly landed in terraform-provider-aws 1.19.0, but it doesn't appear to be a remedy in this scenario.
Per user comments, removing the descriptions from the security group rules seems to skate around this issue. Thoughts, @jen20 @lukehoban?
Update: this error seems related to various other TF user issues when mixing & matching secgroup in-line rules vs standalone secgroup rules, and all issues have been bubbled up into hashicorp/terraform-provider-aws#2069 by TF staff. There's a fix that's supposed to correct this in this comment.
However, the merged fix does not actually fix the problem, and a comment in that issue calls out the identical behavior we're experiencing:
In the result of the second apply action above, note the ingress rule with number 2396394866. Terraform somehow, and from somewhere, sees an ingress rule that it tries to delete and which doesn't exist anywhere in my state file or my security group in AWS. And that's why it complains that the security group ingress rule does not exist. Why and how Terraform is seeing something that doesn't exist in the .tf or the state file is the question.
This crossing of ingress rules matches similar experiences we're seeing in CloudTrail events, which denote an attempt to revoke an ingress rule without permission. The rules in question for the nodeSecurityGroup can at times show another rule that doesn't exist anywhere, oftentimes using the eksClusterSecurityGroup id in lieu of the desired nodeSecurityGroup id, or some other variation, while none of this matches the actual state in AWS. There definitely seems to be some crossing of wires happening in the request.
Per the linked issue and others, all signs point to avoiding in-line rule definitions in SecurityGroups in TF, due to issues in how ingress rules are gathered within SecurityGroups; opting for separate SecurityGroupRules instead seems to resolve the issue for most folks. Using SecurityGroupRules seems like the right path, but it does not appear to play well with existing clusters.
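For illustration, the node-to-node rule could be expressed as a standalone rule roughly like this. This is a hedged sketch, not the actual pulumi-eks implementation: the resource names are hypothetical, and it assumes the cluster exposes its node security group as `nodeSecurityGroup`.

```typescript
import * as aws from "@pulumi/aws";
import * as eks from "@pulumi/eks";

// Sketch only: stand up a cluster, then attach the node-to-node ingress as a
// standalone SecurityGroupRule instead of an in-line ingress block on the
// SecurityGroup. Requires a configured Pulumi project to actually run.
const cluster = new eks.Cluster("cluster", { /* ... */ });

const nodeToNode = new aws.ec2.SecurityGroupRule("nodeToNode", {
    type: "ingress",
    description: "Allow nodes to communicate with each other",
    fromPort: 0,
    toPort: 0,
    protocol: "-1", // all
    self: true,
    securityGroupId: cluster.nodeSecurityGroup.id,
});
```

Standalone rules keep each ingress rule a separate resource in state, which also makes it easier for users to append their own rules later.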
@metral Do you have a repro for this behavior so that an issue can be opened on https://github.com/terraform-providers/terraform-provider-aws ?
Here's a minimum repro Pulumi program.
Steps:
- Run `pulumi up` in the unzipped dir.
- After the initial deployment is complete, comment out line #74 (`subnetIds.pop()`), and run another update. This simulates having a VPC with existing subnets, and increasing from using 2 subnets in the cluster to 3 subnets (as noted in the earlier comment).
- After about ~10 min it'll error out with: Error revoking security group ingress rules: InvalidPermission.NotFound: The specified rule does not exist in this security group.
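The toggle in the steps above can be sketched as follows; the actual line #74 and subnet IDs live in the zipped repro, so the IDs here are placeholders.

```typescript
// Hedged sketch of the repro toggle: the program builds a 3-subnet list and
// initially drops the last one; commenting out the pop() grows the cluster
// from 2 subnets to 3 on the second update, which triggers the error.
const subnetIds: string[] = ["subnet-aaaa", "subnet-bbbb", "subnet-cccc"];
subnetIds.pop(); // comment this out before the second `pulumi up`

console.log(subnetIds.length); // 2 with the pop(); 3 once it is commented out
```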
For a little more insight into this - with @metral's repro, here are the changes that happen as part of the second update:
Previewing update (dev):
Type Name Plan Info
pulumi:pulumi:Stack eks69-dev
└─ eks:index:Cluster eks69
+- ├─ aws:eks:Cluster eks69-eksCluster replace [diff: ~name,vpcConfig]
~ ├─ pulumi:providers:kubernetes eks69-eks-k8s update [diff: ~kubeconfig]
~ ├─ kubernetes:storage.k8s.io:StorageClass eks69-gp2 update [diff: ~metadata]; 1 warning
~ ├─ pulumi-nodejs:dynamic:Resource eks69-vpc-cni update [diff: ~kubeconfig]
~ ├─ aws:ec2:SecurityGroup eks69-nodeSecurityGroup update [diff: ~tags]
+- ├─ aws:ec2:LaunchConfiguration eks69-nodeLaunchConfiguration replace [diff: ~name,userData]
~ ├─ aws:cloudformation:Stack eks69-nodes update [diff: ~templateBody]
~ └─ pulumi:providers:kubernetes eks69-provider update [diff: ~kubeconfig]
Note in particular that the SecurityGroup is only even being updated due to a change in its tags.
Turning on verbose logging from the Terraform provider (`export TF_LOG=TRACE`) and then logging the Pulumi update (e.g. `pulumi up --logtostderr -v=9 --debug 2> out.txt`) shows this:
debug: Revoking security group {
debug: Description: "Managed by Pulumi",
debug: GroupId: "sg-00ed8f2d8501ba93f",
debug: GroupName: "eks69-nodeSecurityGroup-dc62a71",
debug: IpPermissions: [{
debug: IpProtocol: "-1",
debug: UserIdGroupPairs: [{
debug: Description: "Allow nodes to communicate with each other",
debug: GroupId: "sg-00ed8f2d8501ba93f",
debug: UserId: "153052954103"
debug: }]
debug: },{
debug: FromPort: 1025,
debug: IpProtocol: "tcp",
debug: ToPort: 65535,
debug: UserIdGroupPairs: [{
debug: Description: "Allow worker Kubelets and pods to receive communication from the cluster control plane",
debug: GroupId: "sg-01fb5f6f2cd802f8f",
debug: UserId: "153052954103"
debug: }]
debug: },{
debug: FromPort: 443,
debug: IpProtocol: "tcp",
debug: ToPort: 443,
debug: UserIdGroupPairs: [{
debug: Description: "Allow pods running extension API servers on port 443 to receive communication from cluster control plane",
debug: GroupId: "sg-01fb5f6f2cd802f8f",
debug: UserId: "153052954103"
debug: }]
debug: }],
debug: IpPermissionsEgress: [{
debug: IpProtocol: "-1",
debug: IpRanges: [{
debug: CidrIp: "0.0.0.0/0",
debug: Description: "Allow internet access."
debug: }]
debug: }],
debug: OwnerId: "153052954103",
debug: Tags: [{
debug: Key: "kubernetes.io/cluster/eks69-eksCluster-254e5bc",
debug: Value: "owned"
debug: }],
debug: VpcId: "vpc-0eab3666fe9e0ab7b"
debug: } ingress rule: []*ec2.IpPermission{{
debug: FromPort: 0,
debug: IpProtocol: "-1",
debug: ToPort: 0,
debug: UserIdGroupPairs: [{
debug: Description: "Allow nodes to communicate with each other",
debug: GroupId: "sg-01fb5f6f2cd802f8f"
debug: }]
debug: }}
debug: [aws-sdk-go] DEBUG: Request ec2/RevokeSecurityGroupIngress Details:
debug: ---[ REQUEST POST-SIGN ]-----------------------------
debug: POST / HTTP/1.1
debug: Host: ec2.us-west-2.amazonaws.com
debug: User-Agent: aws-sdk-go/1.19.4 (go1.12.3; darwin; amd64) APN/1.0 HashiCorp/1.0 Terraform/0.11.12
debug: Content-Length: 297
debug: Authorization: AWS4-HMAC-SHA256 Credential=redacted/20190420/us-west-2/ec2/aws4_request, SignedHeaders=content-length;content-type;host;x-amz-date, Signature=7b1037d29442740b6c5d3cd1b8b86c97304eb2dd84a49b250c6fcf06a920baf9
debug: Content-Type: application/x-www-form-urlencoded; charset=utf-8
debug: X-Amz-Date: 20190420T191141Z
debug: Accept-Encoding: gzip
debug:
debug: Action=RevokeSecurityGroupIngress&GroupId=sg-00ed8f2d8501ba93f&IpPermissions.1.FromPort=0&IpPermissions.1.Groups.1.Description=Allow+nodes+to+communicate+with+each+other&IpPermissions.1.Groups.1.GroupId=sg-01fb5f6f2cd802f8f&IpPermissions.1.IpProtocol=-1&IpPermissions.1.ToPort=0&Version=2016-11-15
debug: -----------------------------------------------------
debug: RegisterResource RPC finished: resource:eks69-gp2[kubernetes:storage.k8s.io/v1:StorageClass]; err: null, resp: urn:pulumi:dev::eks69::eks:index:Cluster$kubernetes:storage.k8s.io/v1:StorageClass::eks69-gp2,eks69-gp2-ffhxrwou,__inputs,,,,,apiVersion,,,storage.k8s.io/v1,kind,,,StorageClass,metadata,,,,,annotations,,,,,pulumi.com/autonamed,,,true,storageclass.kubernetes.io/is-default-class,,,true,name,,,eks69-gp2-ffhxrwou,parameters,,,,,type,,,gp2,provisioner,,,kubernetes.io/aws-ebs,apiVersion,,,storage.k8s.io/v1,kind,,,StorageClass,metadata,,,,,annotations,,,,,pulumi.com/autonamed,,,true,storageclass.kubernetes.io/is-default-class,,,true,creationTimestamp,,,2019-04-20T18:57:19Z,name,,,eks69-gp2-ffhxrwou,resourceVersion,,,370,selfLink,,,/apis/storage.k8s.io/v1/storageclasses/eks69-gp2-ffhxrwou,uid,,,1f9c24b2-639e-11e9-b4bc-02ace6ef69c6,parameters,,,,,type,,,gp2,provisioner,,,kubernetes.io/aws-ebs,reclaimPolicy,,,Delete,volumeBindingMode,,,Immediate,true,
debug: [aws-sdk-go] DEBUG: Response ec2/RevokeSecurityGroupIngress Details:
debug: ---[ RESPONSE ]--------------------------------------
debug: HTTP/1.1 400 Bad Request
debug: Connection: close
debug: Transfer-Encoding: chunked
debug: Date: Sat, 20 Apr 2019 19:11:41 GMT
debug: Server: AmazonEC2
debug:
debug:
debug: -----------------------------------------------------
debug: [aws-sdk-go] <?xml version="1.0" encoding="UTF-8"?>
debug: <Response><Errors><Error><Code>InvalidPermission.NotFound</Code><Message>The specified rule does not exist in this security group.</Message></Error></Errors><RequestID>994d0790-1dc9-487c-969e-993b4e89f8f4</RequestID></Response>
debug: [aws-sdk-go] DEBUG: Validate Response ec2/RevokeSecurityGroupIngress failed, not retrying, error InvalidPermission.NotFound: The specified rule does not exist in this security group.
debug: status code: 400, request id: 994d0790-1dc9-487c-969e-993b4e89f8f4
error: update failed
This corresponds to the code at https://github.com/terraform-providers/terraform-provider-aws/blob/7beffd7ce24efe62af769e581b3c354ddb724a6d/aws/resource_aws_security_group.go#L725.
What's really surprising here is that it says we are revoking this:
debug: []*ec2.IpPermission{{
debug: FromPort: 0,
debug: IpProtocol: "-1",
debug: ToPort: 0,
debug: UserIdGroupPairs: [{
debug: Description: "Allow nodes to communicate with each other",
debug: GroupId: "sg-01fb5f6f2cd802f8f"
debug: }]
debug: }}
But the corresponding ingress rule looks like this:
debug: {
debug: IpProtocol: "-1",
debug: UserIdGroupPairs: [{
debug: Description: "Allow nodes to communicate with each other",
debug: GroupId: "sg-00ed8f2d8501ba93f",
debug: UserId: "153052954103"
debug: }]
debug: }
Which is then what gets embedded into the AWS API request that fails:
Action=RevokeSecurityGroupIngress&GroupId=sg-00ed8f2d8501ba93f&IpPermissions.1.FromPort=0&IpPermissions.1.Groups.1.Description=Allow+nodes+to+communicate+with+each+other&IpPermissions.1.Groups.1.GroupId=sg-01fb5f6f2cd802f8f&IpPermissions.1.IpProtocol=-1&IpPermissions.1.ToPort=0&Version=2016-11-15
Note that these have different GroupId values.
And indeed, when I look at the actual AWS resources, there is a resource corresponding to the second, but no resource corresponding to the first. It is unclear where the Terraform provider is even coming up with that sg-01fb5f6f2cd802f8f value. It's the right value for a different ingress rule, but it is not present in any form for this ingress rule.
In the source code, this ingress rule is defined as:
{
    description: "Allow nodes to communicate with each other",
    fromPort: 0,
    toPort: 0,
    protocol: "-1", // all
    self: true,
},
Notably, it uses self: true to specify the source securityGroup, but the Terraform provider appears to be getting confused and substituting a totally unrelated security group for the self link.
@lukehoban Your analysis is spot on with what I've been seeing. I've even tried replacing the use of self with the nodeSecurityGroupId itself, but even that doesn't stop the TF provider from tripping up when gathering rules - a spot referenced in a couple of issues.
Moving away from in-line rules to secgroup rules sounds like the best path forward to avoid the errors we're seeing. Though doing so cleanly for existing clusters seems to be a challenge: the dep chain makes it want to recreate the worker nodes, and it will still hit the same errors. Thoughts?
Okay - I'm beginning to believe the root cause here is actually a core Pulumi engine bug. And moreover, I'm also beginning to think it's the same bug I spent most of this week fighting as part of pulumi/pulumi-terraform#362 and ultimately currently being tracked in pulumi/pulumi#2650.
The key is to look at the Pulumi state file (`pulumi stack export > stack.json`), and notice the following values for ingress on the "inputs" and "outputs" of the security group in question, respectively:
Inputs:
"ingress": [
{
"__defaults": [],
"description": "Allow nodes to communicate with each other",
"fromPort": 0,
"protocol": "-1",
"self": true,
"toPort": 0
},
{
"__defaults": [
"self"
],
"description": "Allow worker Kubelets and pods to receive communication from the cluster control plane",
"fromPort": 1025,
"protocol": "tcp",
"securityGroups": [
"sg-01fb5f6f2cd802f8f"
],
"self": false,
"toPort": 65535
},
{
"__defaults": [
"self"
],
"description": "Allow pods running extension API servers on port 443 to receive communication from cluster control plane",
"fromPort": 443,
"protocol": "tcp",
"securityGroups": [
"sg-01fb5f6f2cd802f8f"
],
"self": false,
"toPort": 443
}
],
Outputs:
"ingress": [
{
"cidrBlocks": [],
"description": "Allow worker Kubelets and pods to receive communication from the cluster control plane",
"fromPort": 1025,
"ipv6CidrBlocks": [],
"prefixListIds": [],
"protocol": "tcp",
"securityGroups": [
"sg-01fb5f6f2cd802f8f"
],
"self": false,
"toPort": 65535
},
{
"cidrBlocks": [],
"description": "Allow nodes to communicate with each other",
"fromPort": 0,
"ipv6CidrBlocks": [],
"prefixListIds": [],
"protocol": "-1",
"securityGroups": [],
"self": true,
"toPort": 0
},
{
"cidrBlocks": [],
"description": "Allow pods running extension API servers on port 443 to receive communication from cluster control plane",
"fromPort": 443,
"ipv6CidrBlocks": [],
"prefixListIds": [],
"protocol": "tcp",
"securityGroups": [
"sg-01fb5f6f2cd802f8f"
],
"self": false,
"toPort": 443
}
],
Note that the two arrays are in different orders. But then, as highlighted in pulumi/pulumi#2650, the Pulumi engine does a (highly suspicious) deep merge of inputs and outputs to compute the "old inputs", including an element-wise merge of arrays. This will cause the values passed to the Terraform provider's Update call to be a strange mix of two of the specified rules - which is capable of triggering the results seen above.
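The failure mode of that element-wise merge can be sketched with plain objects. This is a deliberately simplified model of the behavior, not the actual engine code, and the field values are trimmed down from the state above.

```typescript
// Simplified model: "inputs" and "outputs" hold the same rules in different
// orders; merging them element-wise by index fabricates a chimera rule.
type Rule = { description: string; self?: boolean; securityGroups?: string[] };

const inputs: Rule[] = [
    { description: "Allow nodes to communicate with each other", self: true },
    { description: "Allow worker Kubelets ...", securityGroups: ["sg-01fb..."] },
];

const outputs: Rule[] = [
    { description: "Allow worker Kubelets ...", securityGroups: ["sg-01fb..."] },
    { description: "Allow nodes to communicate with each other", self: true },
];

// Naive element-wise deep merge: output fields win per element, but fields
// missing from the output fall through to the input element at the same index.
const merged = outputs.map((out, i) => ({ ...inputs[i], ...out }));

// merged[1] now claims BOTH self: true AND securityGroups: ["sg-01fb..."] -
// a rule the user never wrote and that exists nowhere in AWS or the state.
console.log(merged[1]);
```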
And indeed - a close inspection of the Update call Pulumi schedules shows:
I0420 12:11:39.874548 23508 rpc.go:68] Marshaling property for RPC[Provider[aws, 0xc004f81d60].Update(sg-00ed8f2d8501ba93f,urn:pulumi:dev::eks69::eks:index:Cluster$aws:ec2/securityGroup:SecurityGroup::eks69-nodeSecurityGroup).olds]: ingress={[{map[__defaults:{[]} cidrBlocks:{[]} description:{Allow worker Kubelets and pods to receive communication from the cluster control plane} fromPort:{1025} ipv6CidrBlocks:{[]} prefixListIds:{[]} protocol:{tcp} securityGroups:{[{sg-01fb5f6f2cd802f8f}]} self:{false} toPort:{65535}]} {map[__defaults:{[{self}]} cidrBlocks:{[]} description:{Allow nodes to communicate with each other} fromPort:{0} ipv6CidrBlocks:{[]} prefixListIds:{[]} protocol:{-1} securityGroups:{[{sg-01fb5f6f2cd802f8f}]} self:{true} toPort:{0}]} {map[__defaults:{[{self}]} cidrBlocks:{[]} description:{Allow pods running extension API servers on port 443 to receive communication from cluster control plane} fromPort:{443} ipv6CidrBlocks:{[]} prefixListIds:{[]} protocol:{tcp} securityGroups:{[{sg-01fb5f6f2cd802f8f}]} self:{false} toPort:{443}]}]}
Which includes:
{map[__defaults:{[{self}]} cidrBlocks:{[]} description:{Allow nodes to communicate with each other} fromPort:{0} ipv6CidrBlocks:{[]} prefixListIds:{[]} protocol:{-1} securityGroups:{[{sg-01fb5f6f2cd802f8f}]} self:{true} toPort:{0}]}
Which includes both self: true and securityGroups: ["sg-01fb5f6f2cd802f8f"] - even though the user never specified this, and no such thing exists in the cloud or even in the state file.
Net - I have fairly high confidence the root cause here is pulumi/pulumi#2650.
As a result, a workaround for this bug is to do the following:
- Run `pulumi stack export > stack.json`.
- Find the nodeSecurityGroup outputs ingress array and ensure it is ordered the same as the inputs ingress array. This most likely means swapping the first two elements of the three-element array. Save the file.
- Run `pulumi stack import < stack.json`.
As for near-term fixes to the specific case here:
Moving away from in-line rules to secgroup rules sounds like the best path forward to avoid the errors we're seeing.
I agree that in general this is a better approach - both to avoid issues like the one described above, and because it's generally easier to extend (users can add their own rules if they want).
Though doing so cleanly for existing clusters seems to be a challenge without it wanting to recreate the worker nodes per the dep chain, and it will still hit the same errors.
I think you are right that any attempt to change the SecurityGroup that this resource currently deploys will fail, so any attempt to "fix" this would actually trigger this very issue. We may want to wait and see if we can fix pulumi/pulumi#2650 in the core engine first; after that, I believe we could safely transition to a cleaner approach to managing ingress rules here.
For right now, the workaround in #69 (comment) should unblock anyone currently blocked by this.
The original issue here will be fixed by using a pulumi CLI with the fix for pulumi/pulumi#2650. @metral will open a new issue to track some related improvements we can make to avoid using inline security groups. But the specific issue here should be fixed.
The fix for this issue is now available in the current latest version of the CLI, v0.17.8: https://github.com/pulumi/pulumi/releases/tag/v0.17.8
I just encountered Error revoking security group ingress rules: InvalidPermission.NotFound: The specified rule does not exist in this security group.
I'm on pulumi CLI version 3.19.0. Running `pulumi destroy -r` was all I needed. Hopefully this will help someone else.