opensearch-ci's People

Contributors

abhinavgupta16, amazon-auto, bbarani, dblock, dependabot[bot], divyaasm, gaiksaya, jordarlu, kavilla, mend-for-github-com[bot], peternied, peterzhuamazon, prudhvigodithi, reta, rishabh6788, varun-lodaya, zelinh

opensearch-ci's Issues

Design ECR cdk stack to deploy repositories and roles related to it.

@abhinavGupta16 commented on Thu Feb 03 2022

Description

We are able to deploy the ECR related roles and repositories on Isengard using cdk automatically.
The ecr roles set up trust relationships with existing roles/nodes deployed with jenkins-ci

Acceptance Criteria

  • Write a design/proposal doc for cdk ECR deployment
  • I am able to deploy ecr repositories and roles with cdk
  • roles trust relationships are automatically managed with the nodes deployed by jenkins-ci

@abhinavGupta16 commented on Fri Feb 11 2022

Following an offline team discussion, we will add the cdk to deploy ECR as a part of opensearch-ci. Therefore, moving this issue to opensearch-ci.

[Documentation] Which Jenkins Staging jobs need to be saved / removed

Here is a list of all of the jobs on the Jenkins staging system; let's figure out whether we need them.

Final results are in:

  • Save 1.0.0-maven-build-job
  • Save 1.0.0-maven-sign-and-release
  • Save bundle-build
  • Save test-orchestration-pipeline
  • Save integ-test
  • Save bwc-test
  • Save perf-test
  • Save opensearch-java-maven-sign-and-release

  • Remove Single-Node-CDK-run
  • Remove fetch_build_pipelines_Sai
  • Remove fetch_bundles
  • Remove Integ-test-demo
  • Remove opensearch_bundle_integtest - legacy integ_test job
  • Remove perf-tests
  • Remove perf_test_owais - Has sample script to run perf test
  • Remove random_test
  • Remove bundle-build-inline-test
  • Remove bundle-build-petern
  • Remove bundle_pipeliness_1.0.1
  • Remove sample-cmd-testings
  • Remove Test-handalm
  • Remove test-job-1
  • Remove test-pipeline
  • Remove test-release-pipeline

[Bug] Jenkins doesn't startup

It looks like a Jenkins version update is needed, as some plugins are no longer working after the log4j security vulnerability.

From my dev stack's /var/log/jenkins/jenkins.log:

Running from: /usr/lib/jenkins/jenkins.war
2022-01-27 00:11:34.261+0000 [id=1]     WARNING winstone.Logger#logInternal: Parameter handlerCountMax is now deprecated
2022-01-27 00:11:34.274+0000 [id=1]     WARNING winstone.Logger#logInternal: Parameter handlerCountMaxIdle is now deprecated
2022-01-27 00:11:34.278+0000 [id=1]     INFO    org.eclipse.jetty.util.log.Log#initialized: Logging initialized @308ms to org.eclipse.jetty.util.log.JavaUtilLog
2022-01-27 00:11:34.317+0000 [id=1]     INFO    winstone.Logger#logInternal: Beginning extraction from war file
java.lang.SecurityException: class "Log4jHotPatch"'s signer information does not match signer information of other classes in the same package
        at java.lang.ClassLoader.checkCerts(ClassLoader.java:891)
        at java.lang.ClassLoader.preDefineClass(ClassLoader.java:661)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:754)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:304)
        at sun.instrument.InstrumentationImpl.loadClassAndCallAgentmain(InstrumentationImpl.java:411)
2022-01-27 00:11:35.233+0000 [id=1]     WARNING o.e.j.s.handler.ContextHandler#setContextPath: Empty contextPath
2022-01-27 00:11:35.273+0000 [id=1]     INFO    org.eclipse.jetty.server.Server#doStart: jetty-9.4.33.v20201020; built: 2020-10-20T23:39:24.803Z; git: 1be68755656cef678b79a2ef1c2ebbca99e25420; jvm 1.8.0_312-b07
2022-01-27 00:11:35.463+0000 [id=1]     INFO    o.e.j.w.StandardDescriptorProcessor#visitServlet: NO JSP Support for /, did not find org.eclipse.jetty.jsp.JettyJspServlet
2022-01-27 00:11:35.487+0000 [id=1]     INFO    o.e.j.s.s.DefaultSessionIdManager#doStart: DefaultSessionIdManager workerName=node0
2022-01-27 00:11:35.488+0000 [id=1]     INFO    o.e.j.s.s.DefaultSessionIdManager#doStart: No SessionScavenger set, using defaults
2022-01-27 00:11:35.488+0000 [id=1]     INFO    o.e.j.server.session.HouseKeeper#startScavenging: node0 Scavenging every 660000ms
2022-01-27 00:11:35.719+0000 [id=1]     INFO    hudson.WebAppMain#contextInitialized: Jenkins home directory: /var/lib/jenkins found at: SystemProperties.getProperty("JENKINS_HOME")
2022-01-27 00:11:35.782+0000 [id=1]     INFO    o.e.j.s.handler.ContextHandler#doStart: Started w.@3aacf32a{Jenkins v2.263.4,/,file:///var/cache/jenkins/war/,AVAILABLE}{/var/cache/jenkins/war}

Certification Deployment and Management

We need to import the certificates for the Jenkins instance into the cdk project and adjust the settings and configuration so a new deployment is all wired up.

  1. Create a JenkinsCertsSecretsStack - this will be used to store the certificates and the credentials associated with them
  2. Update the existing stack to pull in the secrets and get them operational on the Jenkins MainNode, leaving http traffic unaffected
  3. Create a new ALB which also has the certificate installed and only routes https traffic to the main node

Acceptance Criteria:

  • After following the new deployment process (multiple stacks) the Jenkins instance is https accessible.
    • It is expected that between the cert stack and the jenkins stack there are manual deployment steps.

Load groovy scripts from a file instead of a string

@abhinavGupta16 Alternatively, would it be better to maybe write the groovy code as a groovy file and import it?
Values can then later be replaced as variables from the props. This would make the groovy code more readable and IDEs would also be able to parse it to verify for errors etc

I did think about using InitFile.fromAsset, but how would we change the values once the file is created on the EC2 instance, or even before that? These files contain dynamic values such as SG, subnet ids, etc. How should we replace those?
I will look into whether we can use Cfn:Fn:sub like I used it here
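
As a rough illustration of the idea discussed above, the groovy code could live in its own file with named placeholders that are filled in at deploy time. The sketch below is a plain TypeScript stand-in for that substitution step, not the project's implementation; the `${Name}` placeholder syntax, the `renderTemplate` helper, and the sample values are all assumptions made for illustration.

```typescript
// Replace ${Name} placeholders in a groovy template with deploy-time values.
// The placeholder syntax and the variable names are hypothetical.
function renderTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\$\{(\w+)\}/g, (_match: string, name: string): string => {
    const value = values[name];
    if (value === undefined) {
      // Fail loudly so a missing value never ends up on the instance.
      throw new Error(`No value supplied for placeholder \${${name}}`);
    }
    return value;
  });
}

// In practice the template would be read from a .groovy file kept in the repo.
const groovyTemplate =
  'def subnetId = "${SubnetId}"\ndef securityGroup = "${SecurityGroupId}"';

const rendered = renderTemplate(groovyTemplate, {
  SubnetId: 'subnet-0123',       // placeholder value
  SecurityGroupId: 'sg-0456',    // placeholder value
});
console.log(rendered);
```

Keeping the template in a separate file means an IDE can still parse it as groovy, while the dynamic values stay in the props.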

Originally posted by @gaiksaya in #39 (comment)

Add settings so opensearch-ci can run docker

Is your feature request related to a problem? Please describe

Currently, no jenkins job will run on jenkins because docker and withAws are not installed.

Describe the solution you'd like

Make changes so jenkins jobs with docker agent nodes are able to run.

Describe alternatives you've considered

No response

Additional context

No response

Upgrade jenkins version to latest LTS version

Is your feature request related to a problem? Please describe

We are currently deploying jenkins version jenkins-2.263.4.
However, the latest LTS has many security fixes, including fixes for the plugins. Jenkins is smart enough to install the latest compatible plugins, so we can also remove the hard-coded plugin versions.
Since the configuration is loaded from jenkins.yaml, the specific plugin versions should not matter.

Describe the solution you'd like

Upgrade to latest LTS version and remove plugin hard coded versions.

Describe alternatives you've considered

No response

Additional context

The upgrade is not a direct change, as jenkins has changed the way it installs on CentOS.
A quick check concluded that changes like https://github.com/opensearch-project/opensearch-ci/blob/main/lib/compute/jenkins-main-node.ts#L213 do not take effect on jenkins.service anymore.
This needs a deep dive into the configuration.

Open-source all CI/CD scripts and configuration that runs in Jenkins

Everything in GHA is open-source and in repos. The Jenkins scripts and configuration should follow suit. Anyone should be able to reproduce the OpenSearch CI/CD infrastructure after #2 and importing scripts and configuration that would be OSS.

We have those scripts in source control now, but they remain private. We need to open-source them.

[Enhancement] Manual `start gradle check` is required to merge anything

Describe the bug

Before PRs are allowed to be merged, the OpenSearch repository enforces running a CI task, ./gradlew check, which runs full unit, integration, smoke, and varying cluster tests. This takes a very long time and consumes a lot of resources. To avoid becoming a bottleneck, we introduced a manual start gradle check that a maintainer can kick off by adding a comment to a PR once a basic review of the PR has been made. This is a manual step.

Expected behavior

No manual interaction from a maintainer other than approving/merging a PR.

[META] Jenkins workflows and fleet monitoring

This is a meta issue tracking all subtasks for enabling Jenkins workflows and fleet monitoring

We already have some system-level alarms configured as a part of the current cdk code (CodeBase).
Apart from that, we need to add the monitoring metrics below:

  1. SSL certs expiration and reporting
  2. Alarms monitoring critical jenkins workflows
  3. More..
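
For the certificate expiration item, the core check is small. The sketch below is one possible shape for it, not an existing part of the stack; the function names and the threshold are assumptions, and the certificate's expiry date is taken as an input rather than fetched from anywhere.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days left before the certificate's notAfter date (negative once expired).
function daysUntilExpiry(notAfter: Date, now: Date): number {
  return Math.floor((notAfter.getTime() - now.getTime()) / MS_PER_DAY);
}

// True when the certificate expires within the given number of days.
// thresholdDays is a hypothetical tuning knob, not a value from the issue.
function shouldAlarm(notAfter: Date, now: Date, thresholdDays: number): boolean {
  return daysUntilExpiry(notAfter, now) <= thresholdDays;
}

const now = new Date('2022-01-01T00:00:00Z');
const notAfter = new Date('2022-01-31T00:00:00Z');
console.log(daysUntilExpiry(notAfter, now), shouldAlarm(notAfter, now, 14));
```

A periodic job could publish the remaining days as a metric and let an alarm fire on the threshold, which keeps the reporting and the alerting separate.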

Make Jenkins instance publicly visible

Is your feature request related to a problem? Please describe.

Community wants to see which jobs are being run for CI/CD and logs.

Describe the solution you'd like

We are running a Jenkins instance that is currently private. Make it public.

Config the environment for release to beta

  • Set up SSL
    • Get certificates
    • Set up monitoring for the certs [will be a part of https://github.com//issues/87]
    • Update Secret values
    • Enable SSL
    • Redeploy the stack
  • Set up OIDC
    • Get credentials
    • Update Secret values
    • Enable OIDC
    • Redeploy the stack

Support building individual components on ARM64

Is your feature request related to a problem? Please describe.
I would like to execute Continuous Integration on arm64 hosts.

Describe the solution you'd like
Follow https://docs.github.com/en/actions/hosting-your-own-runners/adding-self-hosted-runners to create arm64 self-hosted runners at the organization level, so that they can be used by all OpenSearch projects.

Describe alternatives you've considered
Running integration tests manually from ARM64 laptop

Additional context
N/A

[Bug]: useSsl parameter does not work

Describe the bug

The useSsl parameter is always false irrespective of the input parameter value.

To reproduce

Set --parameter useSsl=true and output the value of the useSsl flag. The value will be false.

Expected behavior

The useSsl boolean should be set according to the parameter value.
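
The issue does not state the root cause. One plausible cause, offered only as an assumption, is that the parameter arrives as the string "true" and is then compared against the boolean true, which never matches. A minimal sketch of an explicit conversion follows; `parseBooleanParameter` is a hypothetical helper, not code from the repository.

```typescript
// Convert a raw parameter value into a boolean, rejecting anything ambiguous.
function parseBooleanParameter(name: string, raw: unknown, defaultValue: boolean): boolean {
  if (raw === undefined || raw === null) {
    return defaultValue; // parameter not supplied
  }
  if (typeof raw === 'boolean') {
    return raw;
  }
  const normalized = String(raw).trim().toLowerCase();
  if (normalized === 'true') {
    return true;
  }
  if (normalized === 'false') {
    return false;
  }
  throw new Error(`Parameter ${name} must be "true" or "false", got "${String(raw)}"`);
}

console.log(parseBooleanParameter('useSsl', 'true', false));
```

Throwing on unexpected input, rather than silently falling back to false, would have made this bug visible at deploy time.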

Screenshots

If applicable, add screenshots to help explain your problem.

Host / Environment

No response

Additional context

No response

Relevant log output

No response

[Bug]: Http does not auto redirect to https when SSL is enabled

Describe the bug

http://url should automatically redirect to https when SSL is enabled.
Currently it does not.

Need to change this line to http:// https://github.com/opensearch-project/opensearch-ci/blob/main/lib/compute/jenkins-main-node.ts#L222

To reproduce

Deploy the cluster with SSL enabled and try http://url. It hangs,
whereas https://url works.

Expected behavior

Http should redirect to https automatically
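
The expected behavior amounts to a simple rewrite rule: keep the host, path, and query, and swap the scheme. The sketch below only illustrates that rule in plain TypeScript; where the redirect is actually configured (the web server on the main node or the load balancer) is not settled by the issue, and `httpsRedirectTarget` is a hypothetical name.

```typescript
// Return the https equivalent of an http URL, or undefined if no redirect is needed.
function httpsRedirectTarget(requestUrl: string): string | undefined {
  const match = /^http:\/\/(.*)$/i.exec(requestUrl);
  if (match === null) {
    return undefined; // already https, or not an http URL at all
  }
  return `https://${match[1]}`;
}

console.log(httpsRedirectTarget('http://jenkins.example.com/job/build?x=1'));
```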

Screenshots

If applicable, add screenshots to help explain your problem.

Host / Environment

No response

Additional context

No response

Relevant log output

No response

Setup Agent node config

Is your feature request related to a problem? Please describe

Agent node configuration and key setup are currently done manually and need to be automated.

Describe the solution you'd like

Secrets Management: The setup should be automated and done using the CI-Config-Stack as a part of Jenkins deployment via cdk
Agent Node configuration: TBD

  • Set up KMS key that encrypts all secrets
  • Set up the secrets required by agent nodes to communicate with main node
  • Set up Groovy scripts that describe the agent node configuration
  • Set up the process to execute the groovy scripts (mostly using jenkins-cli)

[Bug] Plugin versions should be locked in the cdk

From #26

@gaiksaya
Quick question about the jenkins plugins, @abhinavGupta16: if we do not specify the version of these plugins, won't it install the most recent version, which can differ in functionality in some way? @peterzhuamazon can weigh in here regarding plugins and how important their versioning is (or is not).

@abhinavGupta16
The script that ran to install plugins did not have versions for the plugins. This is a good point we should think about.

@peterzhuamazon
I don't see this as a problem, as we want to update the plugins to the latest versions to patch all vulnerabilities.
However, I also think maintaining a list of versions for stable usage is beneficial.
You can have 1 template for latest and 1 template for stable. What do you think about this?
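
The "one template for latest, one for stable" idea can be sketched as a single list rendered in two modes. This is only an illustration under assumptions: the `PluginPin` type and `renderPluginList` helper are hypothetical, the versions shown are placeholders rather than recommendations, and the output uses the `name:version` form commonly used in plugin list files.

```typescript
interface PluginPin {
  name: string;
  version: string;
}

// Render one plugin per line: "name:version" for stable, bare "name" for latest.
function renderPluginList(pins: PluginPin[], mode: 'latest' | 'stable'): string {
  return pins
    .map((pin) => (mode === 'stable' ? `${pin.name}:${pin.version}` : pin.name))
    .join('\n');
}

// Placeholder versions for illustration only.
const pins: PluginPin[] = [
  { name: 'git', version: '1.2.3' },
  { name: 'configuration-as-code', version: '4.5.6' },
];

console.log(renderPluginList(pins, 'stable'));
console.log(renderPluginList(pins, 'latest'));
```

Keeping a single source list avoids the two templates drifting apart; only the rendering mode changes between deployments.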

Data retention for Jenkins

Problem Statement

With the current setup, whenever there is a change to the EC2 configuration or to the InitCommands, the stack replaces the EC2 instance with a new one that has the new configuration. In the initial stages this is fine because there is no relevant data. However, at a more advanced stage, replacing the EC2 instance will lead to data loss.

Possible Solutions

Currently looking at the below data that needs to be retained:

  1. Jobs and job history: Stored in $JENKINS_HOME/jobs folder
  2. Config.xml : Contains all the global config changes (eg: env variables, any global settings, etc)
  3. More?

We are uploading all logs from /var/log/jenkins/jenkins.log to CloudWatch, so there is no need to retain those.

I do not have a concrete solution for config.xml because of the following concern:
Config.xml stores all the global configurations in Jenkins including env variables, global settings, etc.
If we replace the newly deployed config.xml, we overwrite new changes made by new deployment.
If we keep the new one, we overwrite all the old changes made by the system.
Possible solution:

Two ways to implement any data storage:

1. Mount it on /data:

  • Sync only the required files with newly deployed system.
  • We would also need a reverse sync to keep the data on multi-attach EBS up to date.

Pros:

  1. Avoids data corruption, as data is written in one direction only

Cons:

  1. Need to manage or come up with the cronjob/script/lambda function that runs this sync inside EC2
  2. Reading the data while Jenkins or System is actively writing might lead to partial data
  3. Depending on the frequency of the sync, if the instance is replaced in between, the data between the last sync and node replacement will be lost

2. Mount directly on the required path, e.g. $JENKINS_HOME/jobs:

Pros:

  1. No syncing between 2 drives required
  2. No external scripts to manage
  3. No time gap for data loss

Cons:

  1. CloudFormation first brings up the new instance and then deletes the old one. With the old instance still writing to the file system, mounting the drive on the new one can lead to data corruption and overwriting of the old data.

Data Storage Types

1. Multi-attach Elastic Block Storage volumes:

A multi-attach EBS volume that can be attached to the EC2 instance.

Pros:

  1. EBS is a good option for storage with high performance, low latency
  2. Easy to mount and attach to EC2 during boot up time
  3. Easy data backup and restoration – via snapshots that can be taken at hourly intervals
  4. EBS ensures all your data is well protected
  5. Multi-attach gives us flexibility to attach the same volume to terminating and initializing instances

Cons:

  1. The EBS volume and the EC2 instance must be locked to the same Availability Zone, so the volume is bound by Availability Zone boundaries
  2. If we take hourly snapshots of the current EBS volume, launch a multi-attach EBS volume from the most recent snapshot, and attach it to the new instance, we might lose up to 1 hour of data

2. Elastic File System:

Use EFS to store the data entirely out of EC2 on shared file system.
Mount the EFS at a specific location so that old data and history is accessible.

Pros:

  1. Auto-scalable solution. Bursting Throughput as the data grows.
  2. No availability region boundaries
  3. Back up available using AWS Backup

Cons:

  1. Needs to be within the same VPC as the EC2 instance, or the VPCs need to be peered
  2. Low performance for high-volume data

Make sure that jenkins cannot be automatically updated by yum

We've seen an unexplained update on our staging jenkins environment; we need to be sure that updates are only driven by operator actions.

sh-4.2$ sudo su -
Last login: Sat Dec 11 03:06:17 UTC 2021 on pts/1
[root@ip-10-0-181-89 ~]# yum history
Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
ID     | Command line             | Date and time    | Action(s)      | Altered
-------------------------------------------------------------------------------
     5 | install -y adoptopenjdk- | 2021-08-12 18:48 | Install        |    1
     4 | install java-11-openjdk  | 2021-08-11 21:57 | Install        |    2
     3 | update                   | 2021-08-11 21:56 | I, O, U        |   23 EE
     2 | install -y jenkins-2.263 | 2021-08-02 04:54 | Install        |    1
     1 | -y install git java-1.8. | 2021-08-02 04:54 | Install        |   82
history list
[root@ip-10-0-181-89 ~]# rpm -qa | grep jenkins
jenkins-2.289.3-1.1.noarch
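
One way to keep a blanket yum update from moving jenkins is to lock the package after installing a pinned version. The sketch below builds such a command list in TypeScript, in the spirit of the init commands the stack already generates. It assumes the yum-plugin-versionlock package is available on the host's repositories, and `pinnedJenkinsInstallCommands` is a hypothetical helper, not code from the repository.

```typescript
// Build shell commands that install a pinned jenkins version and lock it,
// so that a later "yum update" leaves the package untouched.
function pinnedJenkinsInstallCommands(version: string): string[] {
  if (!/^\d+\.\d+(\.\d+)?$/.test(version)) {
    throw new Error(`Unexpected jenkins version format: ${version}`);
  }
  return [
    'yum install -y yum-plugin-versionlock', // assumed to be available on the host
    `yum install -y jenkins-${version}`,
    'yum versionlock add jenkins',
  ];
}

console.log(pinnedJenkinsInstallCommands('2.263.4').join('\n'));
```

With the lock in place, an upgrade becomes an explicit operator action: remove the lock, install the new version, and lock again.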

[META] Make Jenkins infrastructure setup public

The Jenkins instance that runs CI/CD jobs is set up via CDK. There's no reason why that should not be OSS AFAIK, so that anyone can deploy a similar instance.

Work checklist:

  • Create base Jenkins CDK codebase
    • #10
    • #12
    • Deploy main jenkins instance
    • Follow security guidance for locking down the instance
    • Add instance health monitoring
    • Create a health dashboard
  • Create stack for manual secrets deployment
    • House and deploy certs
    • House and deploy github account credentials
    • #23
  • Deploy operational instance
  • Release Process
    • #43
    • #61
    • #85
    • Config the environment for release to prod
    • Finalize any requirements/features before initial release
  • Post Release
    • #87
    • Load saved build scripts from DLS
    • Migrate existing configuration
    • Validate deployment
    • Job Canary

Open questions:

  • How can we make sure that we can recover from a down jenkins main node?

Migrate existing jenkins job to public jenkins

Create a proposal on how we plan to configure new jobs on public jenkins.
The recommended approach is to auto-configure jenkins workflow when a jenkins file is added to https://github.com/opensearch-project/opensearch-build/tree/main/jenkins

We want to avoid workflow configuration via the UI entirely and treat configuration as code.

Acceptance Criteria:

  1. Once the workflow PR is merged (workflow is added to above build repo folder), the jenkins job should be auto-configured.
  2. All jobs and console logs should be publicly viewable.

[META] Design and deploy ECR cdk stack to deploy repositories and roles related to it.

@abhinavGupta16 commented on Thu Feb 03 2022

Description

We are able to deploy the ECR related roles and repositories on Isengard using cdk automatically.
The ecr roles set up trust relationships with existing roles/nodes deployed with jenkins-ci

Tasks

Acceptance Criteria

  • ecr repositories and roles are automatically deployed and setup using cdk
  • roles trust relationships are automatically managed with the nodes deployed by jenkins-ci

@abhinavGupta16 commented on Fri Feb 11 2022

Following an offline team discussion, we will add the cdk to deploy ECR as a part of opensearch-ci. Therefore, moving this issue to opensearch-ci.

[Bug]: Transient issue: Jenkins fails to install/bootstrap correctly

Describe the bug

Sometimes while deploying the stack, jenkins fails to bootstrap correctly.
I see only failed-boot-attempts.txt and the copied yaml in /var/lib/jenkins, nothing else.
The txt file only contains the date/time of the failure.

The logs do not show anything relevant:

2022-04-06 23:07:00.303+0000 [id=1]     INFO    o.e.j.w.StandardDescriptorProcessor#visitServlet: NO JSP Support for /, did not find org.eclipse.jetty.jsp.JettyJspServlet
2022-04-06 23:07:00.330+0000 [id=1]     INFO    o.e.j.s.s.DefaultSessionIdManager#doStart: DefaultSessionIdManager workerName=node0
2022-04-06 23:07:00.330+0000 [id=1]     INFO    o.e.j.s.s.DefaultSessionIdManager#doStart: No SessionScavenger set, using defaults
2022-04-06 23:07:00.331+0000 [id=1]     INFO    o.e.j.server.session.HouseKeeper#startScavenging: node0 Scavenging every 600000ms
2022-04-06 23:07:00.578+0000 [id=1]     INFO    hudson.WebAppMain#contextInitialized: Jenkins home directory: /var/lib/jenkins found at: SystemProperties.getProperty("JENKINS_HOME")
2022-04-06 23:07:00.642+0000 [id=1]     INFO    o.e.j.s.handler.ContextHandler#doStart: Started w.@3aacf32a{Jenkins v2.263.4,/,file:///var/cache/jenkins/war/,AVAILABLE}{/var/cache/jenkins/war}

To reproduce

It's a transient issue. Try deploying the stack using the codebase.

Expected behavior

Jenkins should deploy without any hiccups.

Screenshots

If applicable, add screenshots to help explain your problem.

Host / Environment

No response

Additional context

No response

Relevant log output

No response

Create access logs S3 buckets

Is your feature request related to a problem? Please describe

Currently we need to manually create an S3 bucket, direct all the access logs, vpc flow logs, and similar logs to it, and use it for access logging. We also need to set up all the required policies.

Describe the solution you'd like

  • Automate the creation of an S3 bucket and direct all the access logs, vpc flow logs, and similar logs to it.
  • Set up all the policies for it.
  • Encrypt the S3 bucket that is set up
  • Needs to be a part of the JenkinsStack deployment
  • This would be a part of the ci-stack
  • Create an access-logging.ts file for access logs changes

Describe alternatives you've considered

No response

Additional context

No response

Deployment process for Beta

Plan how we can integrate the public OpenSearch-CI components into the internal infrastructure to manage deployments automatically.

Acceptance Criteria:

  • Figure how we can automatically pull this code base into the internal system
  • Update the internal tooling to pull in the CDK job for deployment
  • Support development environment deployment testing
  • Update any stack compatibility issues between environments
  • Verify that the beta environment gets automatic deployments.

Setup OpenId Connect for Jenkins Cluster Setup

Is your feature request related to a problem? Please describe

Currently, setting up OpenId Connect for the Jenkins cluster is a manual process, which needs to be automated as part of the Jenkins deployment using cdk.

Describe the solution you'd like

More information on Amazon Internal

Describe alternatives you've considered

No response

Additional context

No response

CDK deployment creates role and policy for ECR support

Is your feature request related to a problem? Please describe

Add support to create a role which can log into ecr and push docker images.

Describe the solution you'd like

Create cdk support to be able to create the below policy and add trust relationship with the agent node and main node.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr-public:BatchCheckLayerAvailability",
                "ecr-public:CompleteLayerUpload",
                "ecr-public:InitiateLayerUpload",
                "ecr-public:PutImage",
                "ecr-public:UploadLayerPart"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr-public:GetAuthorizationToken",
                "sts:GetServiceBearerToken"
            ],
            "Resource": "*"
        }
    ]
}
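
To make the request concrete, the sketch below builds the policy above and a matching trust policy as plain TypeScript objects. It is not CDK code and not the project's implementation; the interfaces, function names, and role ARNs are placeholders introduced for illustration, and the trust policy simply assumes the agent node and main node roles are allowed to assume the new role.

```typescript
interface PermissionStatement {
  Effect: 'Allow';
  Action: string[];
  Resource: string;
}

interface TrustStatement {
  Effect: 'Allow';
  Principal: { AWS: string[] };
  Action: string[];
}

// Permissions copied from the policy in the issue.
function ecrPublicPushPolicy(): { Version: string; Statement: PermissionStatement[] } {
  return {
    Version: '2012-10-17',
    Statement: [
      {
        Effect: 'Allow',
        Action: [
          'ecr-public:BatchCheckLayerAvailability',
          'ecr-public:CompleteLayerUpload',
          'ecr-public:InitiateLayerUpload',
          'ecr-public:PutImage',
          'ecr-public:UploadLayerPart',
        ],
        Resource: '*',
      },
      {
        Effect: 'Allow',
        Action: ['ecr-public:GetAuthorizationToken', 'sts:GetServiceBearerToken'],
        Resource: '*',
      },
    ],
  };
}

// Trust relationship letting the given roles (agent node, main node) assume the ECR role.
function trustPolicyFor(roleArns: string[]): { Version: string; Statement: TrustStatement[] } {
  return {
    Version: '2012-10-17',
    Statement: [{ Effect: 'Allow', Principal: { AWS: roleArns }, Action: ['sts:AssumeRole'] }],
  };
}

// Placeholder ARNs; the real role names come from the deployed stack.
const trust = trustPolicyFor([
  'arn:aws:iam::111122223333:role/agent-node-role',
  'arn:aws:iam::111122223333:role/main-node-role',
]);
console.log(JSON.stringify(ecrPublicPushPolicy(), null, 2));
console.log(JSON.stringify(trust, null, 2));
```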

Describe alternatives you've considered

No response

Additional context

No response

Diagrams should be as editable as markdown code

Is your feature request related to a problem? Please describe

When we want to describe a system, make a flow chart, or an activity diagram there should be a descriptive language that is used and can be edited without using another tool.

Describe the solution you'd like

Inline detection via a GitHub Action that converts PlantUML into images would be great.

[Bug]: Add repo description

Coming from opensearch-project/opensearch-plugins#92.

Each plugin has a short descriptive text blurb that says what it does: this appears in plugin-descriptor.properties (OpenSearch plugins) or package.json (OpenSearch Dashboards plugins), as well as on the project website's "source" page and in the "About" section of every repo. These were created one-by-one over the years as plugins were created, so looking at them all together now it would be hard for somebody new to the project to use these to understand what the plugins do.

Dynamically load configuration from Jenkins.yaml

Is your feature request related to a problem? Please describe

Currently, we need to manually change the jenkins.yaml file on the server and reload the config, or load the config file via the UI manually. We need a mechanism which detects the changes in jenkins.yaml and loads the changes on jenkins automatically.

Things to look out for:

  1. jenkins.yaml can potentially change the security config. We need to be careful about changes to the file.
  2. OIDC changes are reflected in the config file. We would need to commit the OIDC changes to the config file. Not doing that would revert the OIDC setup changes.
  3. Reloading the config file requires calling a url. username:password will not work with OIDC

Describe the solution you'd like

Create a job on jenkins that detects the changes in the config file on github and reloads it. We could use jenkins API tokens to call the urls, but they will disappear after the config reload.
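
One option that sidesteps both username:password and API tokens is the token-protected reload endpoint of the Configuration as Code plugin, which accepts a POST to reload-configuration-as-code with a casc-reload-token query parameter when a reload token is configured on the Jenkins side. The sketch below only builds that URL and decides whether a reload is needed; it assumes that endpoint is enabled, and the helper names are hypothetical.

```typescript
// Build the reload URL for the token-protected Configuration as Code endpoint.
// Assumes a reload token has been configured on the Jenkins instance.
function cascReloadUrl(jenkinsUrl: string, reloadToken: string): string {
  const base = jenkinsUrl.replace(/\/+$/, ''); // drop trailing slashes
  return `${base}/reload-configuration-as-code/?casc-reload-token=${encodeURIComponent(reloadToken)}`;
}

// A reload is only needed when the committed jenkins.yaml differs from the deployed one.
function configChanged(deployedYaml: string, committedYaml: string): boolean {
  return deployedYaml !== committedYaml;
}

console.log(cascReloadUrl('https://jenkins.example.com/', 'example token'));
```

Because the token is set at startup rather than stored in the Jenkins configuration, it would not be wiped by the reload itself. The first concern in the list still applies: any job that triggers this URL should only act on reviewed changes.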

Describe alternatives you've considered

No response

Additional context

No response

CDN public artifact locations needs to be managed

For the existing instance we are hardcoding the path https://ci.opensearch.org/ci/dbc set as CDN_PUBLIC_ARTIFACT_PREFIX in the jenkins environment. Is this the correct way to set this and how can we make sure this is unique for all deployments?

[Proposal] Internal Monitoring Agent - How to install

Problem:

At Amazon there is a hard requirement to log shell actions, both to detect compromised EC2 instances and to store audit history. This software agent is not meant for public distribution and as such should not be made available. The software gets updated monthly, and it should be up to date on new instances. The convenient tooling for this system is not available in GitHub repositories and is part of Amazon internal tooling.

Recap: proprietary software, often updated, no existing bridge to this deployment infrastructure.

Proposal 1: Create an internal AMI builder and import images into this infrastructure

By creating a stack within the Amazon tooling ecosystem, it is trivial to install the components and receive updates automatically. On a periodic basis, AMIs could be created with EC2 Image Builder, and then an SNS message could be fired that would trigger this stack to update the configured AMI image.

  • Create Internal Stack ~3 days
  • Deployment to Beta/Prod accounts ~2 days
  • Lambda message subscription to update AMIs in Jenkins ~3 days
    • Note: could be done manually as part of monthly maintenance

Pro:

  • Built into images, easy to validate
  • Reuses Amazon infra

Con:

  • Multiple deployment sources to Beta/Prod stack
  • Does not address long-lived instances such as the Jenkins Main Node

Proposal 2: Create an SSM Job to audit and update the software

AWS Systems Manager can be configured to run scripts against existing hosts and new hosts. By using a State Manager association, new instances can have scripts executed against them. There will also be a periodic scan of hosts to ensure they are up to date.

  • Create Internal Stack ~3 days
  • Deployment to Beta/Prod accounts ~2 days

Pro:

  • Hits all cases for short/long lived instances

Con:

  • Multiple deployment sources to Beta/Prod stack
  • Relies on a Stack Overflow answer with no upvotes for a core implementation detail

Proposal 3: Create hooks for software deployment, copy internal scripts into private repository

Add a cron hook to the long-lived nodes to download and run the existing script from S3, and add a section to the cloud-init for short-lived instances. Port the software over into a private repository, which we were already going to use to manage the actual deployment scripts.

  • Update infra deployment scripts with the assets ~1 day
  • Add hook for Jenkins main instance to run the script via cron ~1 day
  • Copy the download/install script and make sure it is part of the cloud-init part of the jenkins

Pro:

  • No additional internal infrastructure to manage

Con:

  • It feels like a huge hack
  • Requires monthly updates
  • Leaky installation process visible in public jenkins

Load jenkins system configuration

Description

Configuring jenkins is a complex process. Currently, we configure jenkins using the UI which cannot be replicated easily. We need a method to be able to load and change the jenkins configuration through a configuration file.

Acceptance Criteria

CDK deployment deploys a configuration file which is used to configure jenkins automatically during deployment.

[Bug]: Hardcoding of rolename prevents deployment of multiple stacks

Describe the bug

Since IAM is global, every new role name and policy name must be unique.
Having the rolename hard coded throws an error saying that the role already exists, and hence the new role cannot be created.
https://github.com/opensearch-project/opensearch-ci/blob/main/lib/compute/jenkins-main-node.ts#L93

To reproduce

Deploy stack in one region and try deploying the same stack in another region

Expected behavior

Stacks should deploy in multiple regions
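
A minimal sketch of one fix: derive the role name from the stack name and region instead of hard coding it. The `uniqueRoleName` helper and the sample names are assumptions for illustration, not code from the repository; the length check reflects the 64-character limit on IAM role names.

```typescript
// Compose a role name that differs per stack and per region.
function uniqueRoleName(baseName: string, stackName: string, region: string): string {
  const name = `${baseName}-${stackName}-${region}`;
  if (name.length > 64) {
    throw new Error(`Role name "${name}" exceeds the 64 character IAM limit`);
  }
  return name;
}

console.log(uniqueRoleName('jenkins-main', 'CI-Dev', 'us-east-1'));
console.log(uniqueRoleName('jenkins-main', 'CI-Dev', 'us-west-2'));
```

An alternative is to omit the explicit role name entirely and let CloudFormation generate a unique physical name, at the cost of less readable names in the console.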

Screenshots

If applicable, add screenshots to help explain your problem.

Host / Environment

No response

Additional context

No response

Relevant log output

No response

Load jenkins via jenkins.yaml

Load configuration via code using the Jenkins Configuration as Code plugin, which uses jenkins.yaml to load the config. We need to load the entire config via jenkins.yaml. This means we need to move the current config changes made to config.xml into jenkins.yaml:

  • OIDC config (also includes Role Based Auth config): See #89
  • Setting up agent nodes

Originally posted by @gaiksaya in #85 (comment)

Mechanism to verify json from Secret Manager for OIDC

Is your feature request related to a problem? Please describe

Currently, there is no mechanism to verify the JSON format and check the required fields of the JSON put in the secret manager for OIDC. If something is missing, the deployment will fail and there won't be descriptive error messages to help debug the issue.

Describe the solution you'd like

Write a mechanism to verify the JSON coming in from the secret manager for the OIDC credential and fail with descriptive error messages if any field is missing.
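
A minimal sketch of such a check follows. The required field names are not given in the issue, so the ones used here are purely hypothetical, as is the `validateOidcSecret` helper; the secret string is passed in directly rather than fetched from the secret manager.

```typescript
// Parse the secret and fail with a descriptive message if it is malformed
// or if any required field is missing or empty.
function validateOidcSecret(secretString: string, requiredFields: string[]): Record<string, string> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(secretString);
  } catch (e) {
    throw new Error(`OIDC secret is not valid JSON: ${(e as Error).message}`);
  }
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    throw new Error('OIDC secret must be a JSON object');
  }
  const fields = parsed as Record<string, unknown>;
  const missing = requiredFields.filter((field) => {
    const value = fields[field];
    return typeof value !== 'string' || value.length === 0;
  });
  if (missing.length > 0) {
    throw new Error(`OIDC secret is missing required field(s): ${missing.join(', ')}`);
  }
  return fields as Record<string, string>;
}

// Hypothetical field names, for illustration only.
const REQUIRED_OIDC_FIELDS = ['clientId', 'clientSecret'];

const secret = validateOidcSecret('{"clientId":"abc","clientSecret":"xyz"}', REQUIRED_OIDC_FIELDS);
console.log(Object.keys(secret).join(', '));
```

Reporting every missing field at once, rather than stopping at the first, saves a redeploy cycle per mistake.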

Describe alternatives you've considered

No response

Additional context

No response

[Bug]: We need to specify the parameter flags when destroying a stack

Describe the bug

When we destroy the cdk stack, we are currently required to specify the flag parameters. We should be able to destroy the stack without specifying the parameters.

To reproduce

Create a CI-Dev stack and destroy it using cdk destroy CI-Dev

Expected behavior

We should be able to destroy the stack without specifying the parameter flags

Screenshots

If applicable, add screenshots to help explain your problem.

Host / Environment

No response

Additional context

No response

Relevant log output

No response

[Feature Request] Can we disable the requirement for admins to allow PRs to run gradle check

We have been living with the start gradle check passphrase for a while now, and it is grating on our team of admins.

Let's disable this feature for a trial period of 1 week and have a way to determine whether this is something we should do long term.

Beforehand, we need to make sure we have the correct monitoring in place to confirm that our CI fleet is healthy with the reduced human intervention.
