aws / amazon-cloudwatch-agent-operator
The Amazon CloudWatch Agent Operator is software developed to manage the CloudWatch Agent on Kubernetes.
License: Apache License 2.0
Users may wish to run the CloudWatch Agent using pod-based IAM roles via IRSA or EKS Pod Identities. This was recently enabled (see the PR below): when the environment variable RUN_WITH_IRSA=true is set on the agent pod, the agent uses the default provider chain for AWS authentication.
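For reference, on a self-managed agent pod this is just a container environment variable, along the lines of the following sketch (only the RUN_WITH_IRSA name comes from the PR below):

env:
  - name: RUN_WITH_IRSA
    value: "true"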
However, the EKS Addon for AWS CloudWatch Observability creates a managed AmazonCloudWatchAgent configuration, so adding environment variables to it is unsafe: there is no guarantee they won't be overridden.
CloudWatch Agent PR:
Running the EKS Addon for AWS CloudWatch Observability with pod-based IAM should work by default.
Instead, the agent fails, and there is no knob available to users to make it work.
Either of these solutions would address this:
- The AmazonCloudWatchAgent custom resource should allow enabling the RUN_WITH_IRSA mode directly.
- The cwagentconfig.json, which is managed by the add-on, should accept a configuration key to enable the RUN_WITH_IRSA mode.
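As a sketch of what the second option could look like in cwagentconfig.json (the run_with_irsa key is purely hypothetical; no such key exists today):

{
  "agent": {
    "run_with_irsa": true
  }
}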
Describe the bug
Not all of the pods in the cloudwatch-agent DaemonSet are restarted when running kubectl rollout restart ds cloudwatch-agent -n amazon-cloudwatch. Only one pod is restarted.
Steps to reproduce
Created a cluster of version 1.28 and installed the Amazon CloudWatch Observability add-on, version v1.2.2-eksbuild.1.
Initially we have 2 pods:
kubectl get pods -A -l app.kubernetes.io/component=amazon-cloudwatch-agent -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
amazon-cloudwatch cloudwatch-agent-hdbmv 1/1 Running 0 7s 172.31.78.188 ip-172-31-67-14.ec2.internal <none> <none>
amazon-cloudwatch cloudwatch-agent-ttfbd 1/1 Running 0 7s 172.31.1.111 ip-172-31-5-6.ec2.internal <none> <none>
1st Restart:
kubectl rollout restart ds cloudwatch-agent -n amazon-cloudwatch
daemonset.apps/cloudwatch-agent restarted
We can see that only one pod got restarted; the other pod is still running:
kubectl get pods -A -l app.kubernetes.io/component=amazon-cloudwatch-agent -o wide -w
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
amazon-cloudwatch cloudwatch-agent-hdbmv 1/1 Running 0 33s 172.31.78.188 ip-172-31-67-14.ec2.internal <none> <none>
amazon-cloudwatch cloudwatch-agent-l2mgm 1/1 Running 0 8s 172.31.0.110 ip-172-31-5-6.ec2.internal <none> <none>
Same behaviour every time:
kubectl rollout restart ds cloudwatch-agent -n amazon-cloudwatch
daemonset.apps/cloudwatch-agent restarted
kubectl get pods -A -l app.kubernetes.io/component=amazon-cloudwatch-agent -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
amazon-cloudwatch cloudwatch-agent-hdbmv 1/1 Running 0 71s 172.31.78.188 ip-172-31-67-14.ec2.internal <none> <none>
amazon-cloudwatch cloudwatch-agent-st4p9 1/1 Running 0 4s 172.31.1.111 ip-172-31-5-6.ec2.internal <none> <none>
What did you expect to see?
I expected all of the pods in the DaemonSet to be restarted.
What did you see instead?
Instead, only one pod is restarted.
What version did you use?
v1.2.2-eksbuild.1
What config did you use?
NA
Environment
Tried with cluster versions 1.26, 1.27, and 1.28.
Additional context
I could observe a difference in the creation of the ControllerRevisions.
For a sample DaemonSet where rollout restart works perfectly fine, one new ControllerRevision is created when we perform a rollout restart:
% kubectl get controllerrevision -A
NAMESPACE NAME CONTROLLER REVISION AGE
amazon-cloudwatch cloudwatch-agent-6ddd78df4 daemonset.apps/cloudwatch-agent 1 34m
amazon-cloudwatch fluent-bit-57659b7864 daemonset.apps/fluent-bit 1 34m
default web-79dc58f667 statefulset.apps/web 1 46d
kube-system aws-node-5b47bbc5c8 daemonset.apps/aws-node 2 16d
kube-system aws-node-5bdc4b45f4 daemonset.apps/aws-node 3 16d
kube-system aws-node-7845867c85 daemonset.apps/aws-node 1 31d
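For context, kubectl rollout restart works by stamping a kubectl.kubernetes.io/restartedAt annotation into the pod template, so a healthy controller produces exactly one new revision per restart. The equivalent manual patch would be (timestamp value illustrative):

kubectl patch ds cloudwatch-agent -n amazon-cloudwatch \
  -p '{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":"2024-01-01T00:00:00Z"}}}}}'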
Whereas in the case of the cloudwatch-agent pods, the first ControllerRevision is deleted and two new ControllerRevisions are created; the third one is the same as the first. Below is the pattern:
$kubectl get controllerrevision -A | grep watch
amazon-cloudwatch cloudwatch-agent-5f44485c55 daemonset.apps/cloudwatch-agent 1 20m
$kubectl get controllerrevision -A | grep watch
amazon-cloudwatch cloudwatch-agent-5f44485c55 daemonset.apps/cloudwatch-agent 2 36m
amazon-cloudwatch cloudwatch-agent-746f576ff6 daemonset.apps/cloudwatch-agent 3 47m
$kubectl get controllerrevision -A | grep watch
amazon-cloudwatch cloudwatch-agent-5f44485c55 daemonset.apps/cloudwatch-agent 2 40m
amazon-cloudwatch cloudwatch-agent-746f576ff6 daemonset.apps/cloudwatch-agent 5 51m
amazon-cloudwatch cloudwatch-agent-cd885487d daemonset.apps/cloudwatch-agent 4 16s
$kubectl get controllerrevision -A | grep watch
amazon-cloudwatch cloudwatch-agent-5f44485c55 daemonset.apps/cloudwatch-agent 2 42m
amazon-cloudwatch cloudwatch-agent-746f576ff6 daemonset.apps/cloudwatch-agent 7 53m
amazon-cloudwatch cloudwatch-agent-779d495df4 daemonset.apps/cloudwatch-agent 6 4s
amazon-cloudwatch cloudwatch-agent-cd885487d daemonset.apps/cloudwatch-agent 4 2m2s
$kubectl get controllerrevision -A | grep watch
amazon-cloudwatch cloudwatch-agent-5f44485c55 daemonset.apps/cloudwatch-agent 2 42m
amazon-cloudwatch cloudwatch-agent-746f576ff6 daemonset.apps/cloudwatch-agent 9 53m
amazon-cloudwatch cloudwatch-agent-779d495df4 daemonset.apps/cloudwatch-agent 6 21s
amazon-cloudwatch cloudwatch-agent-84df56d566 daemonset.apps/cloudwatch-agent 8 3s
amazon-cloudwatch cloudwatch-agent-cd885487d daemonset.apps/cloudwatch-agent 4 2m19s
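To see what keeps changing between revisions, the stored templates can be dumped and diffed; if anything other than the restartedAt annotation differs, something else is rewriting the pod template (revision names taken from the output above):

kubectl get controllerrevision cloudwatch-agent-746f576ff6 -n amazon-cloudwatch -o yaml > rev-a.yaml
kubectl get controllerrevision cloudwatch-agent-cd885487d -n amazon-cloudwatch -o yaml > rev-b.yaml
diff rev-a.yaml rev-b.yaml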
We are in the process of adopting the amazon-cloudwatch-observability EKS add-on in our EKS clusters. This add-on seems to use the amazon-cloudwatch-agent-operator project under the hood.
We noticed that the Fluent Bit DaemonSet is enabled even when containerLogs.enabled is set to false in the Helm chart. According to our understanding, the containerLogs.enabled setting only modifies the resource limits and logging configuration for Fluent Bit, but does not control the creation of the DaemonSet itself.
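For reference, this is the values fragment we are setting:

containerLogs:
  enabled: false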
We would like to continue using our existing Fluent Bit deployment model and avoid creating a separate, unnecessary DaemonSet on the EKS nodes when it is not used.
Would it be possible to modify the Helm chart to skip creating the Fluent Bit DaemonSet resource when containerLogs.enabled is set to false? For example, by adding the following condition at line 1 of the helm/templates/linux/fluent-bit-daemonset.yaml file:
{{- if .Values.containerLogs.enabled }}
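together with a matching closer at the end of the same file, so the whole manifest is guarded (sketch):

{{- if .Values.containerLogs.enabled }}
# ... existing DaemonSet manifest ...
{{- end }}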
If you have no objections, we are more than happy to create a pull request for the modification.