
opstrace's Issues

aws: latest main: https://default.jp2.opstrace.io/ -> 500, https://<clustername>.opstrace.io/ -> 404, https://jp2.opstrace.io/login -> 404

$ ./opstrace --version --log-level=debug
e790a06a-ci
2020-12-02T09:38:13.389Z debug: BUILD_INFO_COMMIT: e790a06
2020-12-02T09:38:13.389Z debug: BUILD_INFO_TIME_RFC3339: 2020-12-02 08:12:55+00:00
2020-12-02T09:38:13.390Z debug: BUILD_INFO_HOSTNAME: c918aab90943
2020-12-02T09:38:13.390Z debug: BUILD_INFO_BRANCH_NAME: main
2020-12-02T09:38:13.391Z debug: shut down logger, then exit with code 0
$ ./opstrace create aws jp2 -c ~/dev/opstrace/ci/cluster-config.yaml 

2020-12-02T09:38:26.924Z info: rendered cluster config:
{
  "data_api_authorized_ip_ranges": [
    "0.0.0.0/0"
  ],
  "data_api_authentication_disabled": false,
  "metric_retention_days": 7,
  "log_retention_days": 7,
  "cert_issuer": "letsencrypt-staging",
  "env_label": "ci",
  "tenants": [
    "default"
  ],
  "controller_image": "opstrace/controller:e790a06a-ci",
  "node_count": 3,
  "aws": {
    "zone_suffix": "a",
    "region": "us-west-2",
    "instance_type": "t3.2xlarge"
  },
  "cloud_provider": "aws",
  "cluster_name": "jp2",

}
2020-12-02T09:38:26.925Z info: Before we continue, please review the set of state-mutating AWS API calls emitted by this CLI during cluster creation: https://go.opstrace.com/cli-aws-mutating-api-calls/e790a06a-ci
Proceed? [y/N] y
...
2020-12-02T10:13:26.405Z info: cluster creation finished: jp2 (aws)

--

Screenshot from 2020-12-02 12-12-39
Screenshot from 2020-12-02 12-12-50

quickstart: broken for linux: bzip2: (stdin) is not a bzip2 file

$ curl -L https://go.opstrace.com/cli-latest-linux-tbz | tar xjf -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   283  100   283    0     0    484      0 --:--:-- --:--:-- --:--:--   484
100     1    0     1    0     0      0      0 --:--:--  0:00:01 --:--:--     0
bzip2: (stdin) is not a bzip2 file.
tar: Child died with signal 13
tar: Error is not recoverable: exiting now

cli/(un)installer: properly handle a 401 response upon dns service login

When the DNS service login fails with a permanent error (which a 401 is supposed to be: a non-retryable error indicating bad credentials), we have to handle that situation properly.

What currently happens is that the error goes unhandled: it emits a stack trace and consumes a high-level create attempt -- until all attempts are exhausted:

[2020-12-01T13:54:24Z] 2020-12-01T13:54:24.270Z info: setting up DNS
[2020-12-01T13:54:24Z] 2020-12-01T13:54:24.270Z debug: DNSClient.GetAll()
[2020-12-01T13:54:24Z] 2020-12-01T13:54:24.342Z error: error during cluster creation (attempt 3):
[2020-12-01T13:54:24Z] Error: Request failed with status code 401
[2020-12-01T13:54:24Z]     at createError (/snapshot/build/node_modules/axios/lib/core/createError.js:16:15)
[2020-12-01T13:54:24Z]     at settle (/snapshot/build/node_modules/axios/lib/core/settle.js:17:12)
[2020-12-01T13:54:24Z]     at IncomingMessage.handleStreamEnd (/snapshot/build/node_modules/axios/lib/adapters/http.js:236:11)
[2020-12-01T13:54:24Z]     at IncomingMessage.emit (events.js:327:22)
[2020-12-01T13:54:24Z]     at IncomingMessage.EventEmitter.emit (domain.js:485:12)
[2020-12-01T13:54:24Z]     at endReadableNT (_stream_readable.js:1224:12)
[2020-12-01T13:54:24Z]     at processTicksAndRejections (internal/process/task_queues.js:84:21) {
[2020-12-01T13:54:24Z]   config: [Object],
[2020-12-01T13:54:24Z]   request: [ClientRequest],
[2020-12-01T13:54:24Z]   response: [Object],
[2020-12-01T13:54:24Z]   isAxiosError: true,
[2020-12-01T13:54:24Z]   toJSON: [Function (anonymous)]
[2020-12-01T13:54:24Z] }
[2020-12-01T13:54:24Z] 2020-12-01T13:54:24.342Z error: JSON representation of err: {
[2020-12-01T13:54:24Z]   "message": "Request failed with status code 401",
[2020-12-01T13:54:24Z]   "name": "Error",
[2020-12-01T13:54:24Z]   "stack": "Error: Request failed with status code 401\n    at createError (/snapshot/build/node_modules/axios/lib/core/createError.js:16:15)\n    at settle (/snapshot/build/node_modules/axios/lib/core/settle.js:17:12)\n    at IncomingMessage.handleStreamEnd (/snapshot/build/node_modules/axios/lib/adapters/http.js:236:11)\n    at IncomingMessage.emit (events.js:327:22)\n    at IncomingMessage.EventEmitter.emit (domain.js:485:12)\n    at endReadableNT (_stream_readable.js:1224:12)\n    at processTicksAndRejections (internal/process/task_queues.js:84:21)",
[2020-12-01T13:54:24Z]   "config": {
[2020-12-01T13:54:24Z]     "url": "https://dns-api.opstrace.net/dns/",
[2020-12-01T13:54:24Z]     "method": "get",
[2020-12-01T13:54:24Z]     "headers": {
[2020-12-01T13:54:24Z]       "Accept": "application/json, text/plain, */*",
[2020-12-01T13:54:24Z]       "authorization": "Bearer null",
[2020-12-01T13:54:24Z]       "Content-Type": "application/json",
[2020-12-01T13:54:24Z]       "User-Agent": "axios/0.19.2"
[2020-12-01T13:54:24Z]     },
[2020-12-01T13:54:24Z]     "transformRequest": [
[2020-12-01T13:54:24Z]       null
[2020-12-01T13:54:24Z]     ],
[2020-12-01T13:54:24Z]     "transformResponse": [
[2020-12-01T13:54:24Z]       null
[2020-12-01T13:54:24Z]     ],
[2020-12-01T13:54:24Z]     "timeout": 0,
[2020-12-01T13:54:24Z]     "xsrfCookieName": "XSRF-TOKEN",
[2020-12-01T13:54:24Z]     "xsrfHeaderName": "X-XSRF-TOKEN",
[2020-12-01T13:54:24Z]     "maxContentLength": -1
[2020-12-01T13:54:24Z]   }
[2020-12-01T13:54:24Z] }
[2020-12-01T13:54:24Z] 2020-12-01T13:54:24.343Z error: 3 attempt(s) failed. Stop retrying. Exit.
[2020-12-01T13:54:24Z] 2020-12-01T13:54:24.343Z debug: shut down logger, then exit with code 1
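A minimal sketch of what proper handling could look like, assuming the axios-based DNS client and a hypothetical die() helper that logs a friendly message and exits right away instead of feeding the error into the generic retry loop (names are illustrative, not the actual CLI internals):

import axios, { AxiosError } from "axios";

// Hypothetical helper: emit an actionable error message and exit non-zero
// immediately, without consuming another high-level create attempt.
function die(msg: string): never {
  console.error(msg);
  process.exit(1);
}

async function dnsGetAll(apiUrl: string, accessToken: string | null) {
  try {
    const resp = await axios.get(apiUrl, {
      headers: { authorization: `Bearer ${accessToken}` }
    });
    return resp.data;
  } catch (err) {
    const e = err as AxiosError;
    // A 401 from the DNS service means the credentials are bad. That is a
    // permanent, non-retryable condition: retrying with the same token
    // cannot succeed.
    if (e.isAxiosError && e.response?.status === 401) {
      die(
        "DNS service login failed with 401 (bad or expired credentials). " +
          "Remove the local token files and log in again."
      );
    }
    throw err; // anything else: let the existing retry machinery decide
  }
}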

installer: show UI URL when cluster creation finished

2020-12-02T09:07:04.532Z info: https://loki.default.jp.opstrace.io/loki/api/v1/labels: got expected HTTP response
2020-12-02T09:07:04.533Z info: All probe URLs returned expected HTTP responses, continue
2020-12-02T09:07:04.533Z info: cluster creation finished: jp (aws)

Show a friendly message pointing to https://jp.opstrace.io/ in that case.

uninstaller: gcp: does not recover from network resource 'projects/vast-pad-snip/global/networks/snip is already being used by 'projects/vast-pad-snip/global/firewalls/k8s-snip-node-hc'

Consumed all high-level retries, didn't break the dependency cycle:

...
2020-12-01T11:07:08.623Z info: Destroying VPC
2020-12-01T11:07:30.531Z info: VPC deletion has started with status: RUNNING
2020-12-01T11:07:37.189Z error: error during cluster teardown (attempt 5):
ApiError: The network resource 'projects/vast-pad-240918/global/networks/jpdev' is already being used by 'projects/vast-pad-<snip>/global/firewalls/k8s-3cc02042af359f14-node-hc'
    at new ApiError (/home/jp/dev/opstrace/node_modules/@google-cloud/common/build/src/util.js:59:15)
    at Util.parseHttpRespBody (/home/jp/dev/opstrace/node_modules/@google-cloud/common/build/src/util.js:194:38)
    at /home/jp/dev/opstrace/node_modules/@google-cloud/compute/src/operation.js:251:46
    at /home/jp/dev/opstrace/node_modules/@google-cloud/compute/src/operation.js:234:7
    at /home/jp/dev/opstrace/node_modules/@google-cloud/common/build/src/service-object.js:193:13
    at /home/jp/dev/opstrace/node_modules/@google-cloud/common/build/src/util.js:369:25
    at Util.handleResp (/home/jp/dev/opstrace/node_modules/@google-cloud/common/build/src/util.js:145:9)
    at /home/jp/dev/opstrace/node_modules/@google-cloud/common/build/src/util.js:434:22
    at onResponse (/home/jp/dev/opstrace/node_modules/retry-request/index.js:214:7)
    at /home/jp/dev/opstrace/node_modules/teeny-request/src/index.ts:325:11 {
  code: undefined,
  errors: [Array],
  response: undefined
}
2020-12-01T11:07:37.190Z error: JSON representation of err: {
  "errors": [
    {
      "code": "RESOURCE_IN_USE_BY_ANOTHER_RESOURCE",
      "message": "The network resource 'projects/vast-pad-240918/global/networks/jpdev' is already being used by 'projects/vast-pad-<snip>/global/firewalls/k8s-3cc02042af359f14-node-hc'"
    }
  ],
  "message": "The network resource 'projects/vast-pad-240918/global/networks/jpdev' is already being used by 'projects/vast-pad-240918/global/firewalls/k8s-3cc02042af359f14-node-hc'"
}
2020-12-01T11:07:37.191Z error: 5 attempt(s) failed. Stop retrying. Exit.
2020-12-01T11:07:37.192Z debug: shut down logger, then exit with code 1

(Seen locally while trying to tear down an oldish GCP dev cluster of mine.)

cli: add ability to add a tenant

If there's a simple and robust way to add a tenant dynamically to a running cluster, then I think we should offer that feature from the CLI soon.

Let's discuss tenant removal separately.

ci: gcp: create SQLInstance: failed: reached the max instance per project/creator limit

Describe the bug

[2020-12-01T20:13:53Z] 2020-12-01T20:13:53.517Z error: error during cluster creation (attempt 3):
[2020-12-01T20:13:53Z] GaxiosError: Failed to create instance because the project or creator has reached the max instance per project/creator limit.
[2020-12-01T20:13:53Z]     at Gaxios._request (/snapshot/build/node_modules/gaxios/src/gaxios.ts:117:15)
[2020-12-01T20:13:53Z]     at runMicrotasks (<anonymous>)
[2020-12-01T20:13:53Z]     at processTicksAndRejections (internal/process/task_queues.js:97:5)
[2020-12-01T20:13:53Z]     at JWT.requestAsync (/snapshot/build/node_modules/google-auth-library/build/src/auth/oauth2client.js:343:18) {
[2020-12-01T20:13:53Z]   response: [Object],
[2020-12-01T20:13:53Z]   config: [Object],
[2020-12-01T20:13:53Z]   code: 403,
[2020-12-01T20:13:53Z]   errors: [Array]
[2020-12-01T20:13:53Z] }

https://buildkite.com/opstrace/scheduled-main-builds/builds/1285#f7b824b4-4945-488d-8367-0cc01229c654/200-1075

installer: aws: MalformedPolicyDocument during createRole() not handled properly

Should not consume a high-level retry attempt:

[2020-12-02T21:57:59Z] 2020-12-02T21:57:59.729Z debug: aws sdk: [AWS iam 400 0.332s 0 retries] createRole({
[2020-12-02T21:57:59Z]   AssumeRolePolicyDocument: '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"AWS":"arn:aws:iam::959325414060:role/bk-1294-f61-a-eks-nodes"},"Action":"sts:AssumeRole"}]}',
[2020-12-02T21:57:59Z]   RoleName: 'bk-1294-f61-a-cert-manager'
[2020-12-02T21:57:59Z] })
[2020-12-02T21:57:59Z] 2020-12-02T21:57:59.734Z error: error during cluster creation (attempt 1):
[2020-12-02T21:57:59Z] MalformedPolicyDocument: Invalid principal in policy: "AWS":"arn:aws:iam::959325414060:role/bk-1294-f61-a-eks-nodes"
[2020-12-02T21:57:59Z]     at Request.extractError (/snapshot/build/node_modules/aws-sdk/lib/protocol/query.js:50:29)
[2020-12-02T21:57:59Z]     at Request.callListeners (/snapshot/build/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
[2020-12-02T21:57:59Z]     at Request.emit (/snapshot/build/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
[2020-12-02T21:57:59Z]     at Request.emit (/snapshot/build/node_modules/aws-sdk/lib/request.js:688:14)
[2020-12-02T21:57:59Z]     at Request.transition (/snapshot/build/node_modules/aws-sdk/lib/request.js:22:10)
[2020-12-02T21:57:59Z]     at AcceptorStateMachine.runTo (/snapshot/build/node_modules/aws-sdk/lib/state_machine.js:14:12)
[2020-12-02T21:57:59Z]     at /snapshot/build/node_modules/aws-sdk/lib/state_machine.js:26:10
[2020-12-02T21:57:59Z]     at Request.<anonymous> (/snapshot/build/node_modules/aws-sdk/lib/request.js:38:9)
[2020-12-02T21:57:59Z]     at Request.<anonymous> (/snapshot/build/node_modules/aws-sdk/lib/request.js:690:12)
[2020-12-02T21:57:59Z]     at Request.callListeners (/snapshot/build/node_modules/aws-sdk/lib/sequential_executor.js:116:18)
[2020-12-02T21:57:59Z]     at Request.emit (/snapshot/build/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
[2020-12-02T21:57:59Z]     at Request.emit (/snapshot/build/node_modules/aws-sdk/lib/request.js:688:14)
[2020-12-02T21:57:59Z]     at Request.transition (/snapshot/build/node_modules/aws-sdk/lib/request.js:22:10)
[2020-12-02T21:57:59Z]     at AcceptorStateMachine.runTo (/snapshot/build/node_modules/aws-sdk/lib/state_machine.js:14:12)
[2020-12-02T21:57:59Z]     at /snapshot/build/node_modules/aws-sdk/lib/state_machine.js:26:10
[2020-12-02T21:57:59Z]     at Request.<anonymous> (/snapshot/build/node_modules/aws-sdk/lib/request.js:38:9)
[2020-12-02T21:57:59Z]     at Request.<anonymous> (/snapshot/build/node_modules/aws-sdk/lib/request.js:690:12)
[2020-12-02T21:57:59Z]     at Request.callListeners (/snapshot/build/node_modules/aws-sdk/lib/sequential_executor.js:116:18)
[2020-12-02T21:57:59Z]     at callNextListener (/snapshot/build/node_modules/aws-sdk/lib/sequential_executor.js:96:12)
[2020-12-02T21:57:59Z]     at IncomingMessage.onEnd (/snapshot/build/node_modules/aws-sdk/lib/event_listeners.js:313:13)
[2020-12-02T21:57:59Z]     at IncomingMessage.emit (events.js:327:22)
[2020-12-02T21:57:59Z]     at IncomingMessage.EventEmitter.emit (domain.js:485:12)
[2020-12-02T21:57:59Z]     at endReadableNT (_stream_readable.js:1224:12)
[2020-12-02T21:57:59Z]     at processTicksAndRejections (internal/process/task_queues.js:84:21) {
[2020-12-02T21:57:59Z]   code: 'MalformedPolicyDocument',
[2020-12-02T21:57:59Z]   time: 2020-12-02T21:57:59.729Z,
[2020-12-02T21:57:59Z]   requestId: '0487635b-a75f-4168-8a32-ccb65cae157b',
[2020-12-02T21:57:59Z]   statusCode: 400,
[2020-12-02T21:57:59Z]   retryable: false,
[2020-12-02T21:57:59Z]   retryDelay: 1000
[2020-12-02T21:57:59Z] }
[2020-12-02T21:57:59Z] 2020-12-02T21:57:59.734Z error: JSON representation of err: {
[2020-12-02T21:57:59Z]   "message": "Invalid principal in policy: \"AWS\":\"arn:aws:iam::959325414060:role/bk-1294-f61-a-eks-nodes\"",
[2020-12-02T21:57:59Z]   "code": "MalformedPolicyDocument",
[2020-12-02T21:57:59Z]   "time": "2020-12-02T21:57:59.729Z",
[2020-12-02T21:57:59Z]   "requestId": "0487635b-a75f-4168-8a32-ccb65cae157b",
[2020-12-02T21:57:59Z]   "statusCode": 400,
[2020-12-02T21:57:59Z]   "retryable": false,
[2020-12-02T21:57:59Z]   "retryDelay": 1000
[2020-12-02T21:57:59Z] }

Let's move the createRole() call under AWSResource's control.

Well. This resolved itself through retrying, i.e. the underlying issue was eventual consistency within AWS.

controller OOM-killed N times in cluster; cluster never became healthy

Carrying over main findings from https://github.com/opstrace/opstrace-prelaunch/issues/1840.

@triclambert reported that the cluster never became healthy.
@sreis found that the controller was OOM-killed, 4 times: https://github.com/opstrace/opstrace-prelaunch/issues/1840#issuecomment-722553747

We did not root-cause this, and as far as I understand we have no reason to believe this is fixed -- the situation may happen again.

Might relate to
https://github.com/opstrace/opstrace-prelaunch/issues/1089
https://github.com/opstrace/opstrace-prelaunch/issues/1089#issuecomment-668786259
kubernetes-client/javascript#494

ci: perform commit message linting

So far, commit message linting happens voluntarily in the dev's environment. CI does not enforce commit message linting rules. Let's change that.

review cluster name length limitation (13 chars right now)

The Opstrace cluster name is used in contexts that impose limitations on length and character set.

Historically, that's how we ended up with the current cluster name limitations.

https://github.com/opstrace/opstrace/blob/7b6b7f589f069c53357be3a83be36e459998b6d1/packages/cli/src/util.ts#L25

Some of these limitations go back to infrastructure components that we don't use anymore, such as Bigtable. See this commit: b1ea161

  // const infraNamePrefix = getInfrastructureName(stack.org, stack.name);
  // return `${infraNamePrefix}-idx`;
  // Note that the derived Bigtable instance cluster ID must not get longer
  // than 30 chars.
  return `${stack.name}-idxdb`;
};

TODO: experiment with this and determine which length limitation is still justified today. Then adjust the limit.
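To make the trade-off concrete, here is a small hypothetical helper showing how the tightest downstream resource name limit plus the longest derived suffix translates into a maximum cluster name length (the 30-char value is just the historical Bigtable example from the commit above; the actual constraints still need to be re-surveyed):

// Hypothetical: the longest suffix we append to the cluster name when
// deriving downstream resource names (e.g. "-idxdb" above).
const LONGEST_DERIVED_SUFFIX = "-idxdb".length;

// Hypothetical: the tightest length limit among resources whose names embed
// the cluster name (30 was the old Bigtable instance cluster ID limit).
const TIGHTEST_RESOURCE_NAME_LIMIT = 30;

// The cluster name must leave room for the longest derived suffix.
const MAX_CLUSTER_NAME_LENGTH =
  TIGHTEST_RESOURCE_NAME_LIMIT - LONGEST_DERIVED_SUFFIX;

export function validateClusterNameLength(name: string): void {
  if (name.length > MAX_CLUSTER_NAME_LENGTH) {
    throw new Error(
      `cluster name '${name}' is longer than ${MAX_CLUSTER_NAME_LENGTH} chars`
    );
  }
}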

cli: redux-saga race effect: automatic cancellation does not go deep

Carried over from opstrace/opstrace-prelaunch/issues/1457, Sep 29.

Having this code:

function* destroyClusterAttemptWithTimeout() {
  log.debug("destroyClusterAttemptWithTimeout");
  const { timeout } = yield race({
    destroy: call(destroyClusterCore),
    timeout: delay(DESTROY_ATTEMPT_TIMEOUT_SECONDS * SECOND)
  });

  if (timeout) {
    // Note that in this case redux-saga guarantees to have cancelled the
    // task(s) that lost the race, i.e. the `destroy` task above.
...

I've seen that redux-saga tasks spawned (well, fork()ed) deep in the redux-saga task hierarchy do not reliably get cancelled upon said timeout.

From the docs (https://redux-saga.js.org/docs/advanced/TaskCancellation.html):

Besides manual cancellation there are cases where cancellation is triggered automatically.
In a race effect, all race competitors, except the winner, are automatically cancelled.

The docs also claim that cancellation propagates through the task hierarchy:

So we saw that Cancellation propagates downward (in contrast returned values and uncaught errors propagates upward).

Data:

[2020-09-28T21:14:36Z] })
[2020-09-28T21:14:36Z] 2020-09-28T21:14:36.143Z debug: internet gateway teardown: cycle 598
[2020-09-28T21:14:36Z] 2020-09-28T21:14:36.184Z debug: aws sdk: [AWS ec2 200 0.041s 0 retries]
[...]
[2020-09-28T21:14:36Z] 2020-09-28T21:14:36.184Z info: internet gateway teardown: sleep 10.00 s
[...]
[2020-09-28T21:14:36Z] 2020-09-28T21:14:36.301Z debug: internet gateway teardown: cycle 420
[...]
[2020-09-28T21:14:41Z] 2020-09-28T21:14:41.409Z warning: cluster teardown attempt timed out after 2100 seconds
[2020-09-28T21:14:41Z] 2020-09-28T21:14:41.410Z info: start attempt 4 in 30 s
[2020-09-28T21:14:45Z] 2020-09-28T21:14:45.574Z debug: aws sdk: [AWS ec2 200 0.045s 0 retries] [...]
[2020-09-28T21:14:45Z] 2020-09-28T21:14:45.574Z debug: internet gateway teardown: cycle 210

Seeing these internet gateway teardown cycle numbers makes it obvious that tasks spawned by one cluster teardown iteration survived even after that iteration timed out.

This might be a misuse of redux-saga, but in view of https://github.com/opstrace/opstrace-prelaunch/issues/1445 I think we might want to look into explicit, self-controlled cancellation upon timeout.
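A minimal sketch of what explicit, self-controlled cancellation could look like: fork the destroy task so we keep a handle on it, and cancel that handle ourselves when the deadline passes, instead of relying on race() to cancel the whole task tree (destroyClusterCore, DESTROY_ATTEMPT_TIMEOUT_SECONDS, SECOND and log are from the snippet above; the rest is illustrative):

import { cancel, delay, fork } from "redux-saga/effects";
import { Task } from "redux-saga";

declare const DESTROY_ATTEMPT_TIMEOUT_SECONDS: number;
declare const SECOND: number;
declare const log: { warning: (msg: string) => void };
declare function destroyClusterCore(): Generator;

function* destroyClusterAttemptWithTimeout() {
  // Fork instead of call so we keep an explicit handle on the task tree.
  const destroyTask: Task = yield fork(destroyClusterCore);

  // Wait for the deadline by polling the task state instead of racing.
  const deadline = DESTROY_ATTEMPT_TIMEOUT_SECONDS * SECOND;
  const pollInterval = 5 * SECOND;
  for (let waited = 0; waited < deadline; waited += pollInterval) {
    if (!destroyTask.isRunning()) {
      return; // destroy finished (or failed) on its own
    }
    yield delay(pollInterval);
  }

  // Deadline hit: cancel explicitly. Cancelling the task handle also cancels
  // its forked children -- the behavior we apparently don't reliably get from
  // the automatic race() cancellation.
  log.warning("destroy attempt timed out, cancelling task tree explicitly");
  yield cancel(destroyTask);
}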


From Nov 13:

Btw, this is as present as ever, and whether or not it breaks a user's workflow depends on the context. It's certainly an architectural bug, and quite a messy state. Here, for example, is a scenario where creation tasks overlap after a high-level timeout "aborted" the first attempt (well, it didn't -- a second high-level attempt just added itself on top of the soup):

[2020-11-13T17:36:19Z] 2020-11-13T17:36:19.640Z info: EKS cluster status: CREATING
[2020-11-13T17:36:19Z] 2020-11-13T17:36:19.642Z info: EKS cluster setup: desired state not reached, sleep 10.00 s
[2020-11-13T17:36:24Z] 2020-11-13T17:36:24.704Z warning: cluster creation attempt timed out after 2400 seconds
[2020-11-13T17:36:24Z] 2020-11-13T17:36:24.705Z info: start attempt 2 in 10 s
[2020-11-13T17:36:29Z] 2020-11-13T17:36:29.643Z debug: EKS cluster setup: cycle 23
[2020-11-13T17:36:30Z] 2020-11-13T17:36:30.148Z debug: aws sdk: [AWS eks 200 0.505s 0 retries] describeCluster({ name: 'bk-2756-fb5-a' })
[2020-11-13T17:36:30Z] 2020-11-13T17:36:30.149Z info: EKS cluster status: CREATING
[2020-11-13T17:36:30Z] 2020-11-13T17:36:30.149Z info: EKS cluster setup: desired state not reached, sleep 10.00 s
[2020-11-13T17:36:34Z] 2020-11-13T17:36:34.705Z debug: createClusterAttemptWithTimeout
[2020-11-13T17:36:34Z] 2020-11-13T17:36:34.705Z info: validate controller config
...

502 response code and unhandled error when dns-service authentication fails

When running the destroy operation with auth tokens generated yesterday:

2020-11-27T16:21:09.528Z info: Try to delete policy matdev-eks-linked-service
2020-11-27T16:21:09.529Z info: Try to delete policy matdev-cortex-s3
2020-11-27T16:21:09.531Z info: Try to delete policy matdev-loki-s3
2020-11-27T16:21:09.532Z info: Try to delete policy matdev-externaldns
2020-11-27T16:21:10.306Z info: All policy-role attachments detached
2020-11-27T16:21:36.913Z error: error during cluster teardown (attempt 1):
Error: Request failed with status code 502
    at createError (/snapshot/opstrace/node_modules/axios/lib/core/createError.js:16:15)
    at settle (/snapshot/opstrace/node_modules/axios/lib/core/settle.js:17:12)
    at IncomingMessage.handleStreamEnd (/snapshot/opstrace/node_modules/axios/lib/adapters/http.js:236:11)
    at IncomingMessage.emit (events.js:327:22)
    at IncomingMessage.EventEmitter.emit (domain.js:485:12)
    at endReadableNT (_stream_readable.js:1224:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21) {
  config: [Object],
  request: [ClientRequest],
  response: [Object],
  isAxiosError: true,
  toJSON: [Function (anonymous)]
}
2020-11-27T16:21:36.913Z error: JSON representation of err: {
  "message": "Request failed with status code 502",
  "name": "Error",
  "stack": "Error: Request failed with status code 502\n    at createError (/snapshot/opstrace/node_modules/axios/lib/core/createError.js:16:15)\n    at settle (/snapshot/opstrace/node_modules/axios/lib/core/settle.js:17:12)\n    at IncomingMessage.handleStreamEnd (/snapshot/opstrace/node_modules/axios/lib/adapters/http.js:236:11)\n    at IncomingMessage.emit (events.js:327:22)\n    at IncomingMessage.EventEmitter.emit (domain.js:485:12)\n    at endReadableNT (_stream_readable.js:1224:12)\n    at processTicksAndRejections (internal/process/task_queues.js:84:21)",
  "config": {
    "url": "https://dns-api.opstrace.net/dns/",
    "method": "delete",
    "data": "{\"clustername\":\"matdev\"}",
    "headers": {
      "Accept": "application/json, text/plain, */*",
      "authorization": "Bearer eyJh<snip>A",
      "x-opstrace-id-token": "eyJhbG<snip>k8eAw",
      "Content-Type": "application/json",
      "User-Agent": "axios/0.19.2",
      "Content-Length": 24
    },
    "transformRequest": [
      null
    ],
    "transformResponse": [
      null
    ],
    "timeout": 0,
    "xsrfCookieName": "XSRF-TOKEN",
    "xsrfHeaderName": "X-XSRF-TOKEN",
    "maxContentLength": -1
  }
}
2020-11-27T16:21:36.915Z info: start attempt 2 in 30 s

After deleting the tokens with rm id.jwt access.jwt, the CLI prompted for a new login and succeeded. It would be nice to handle this auth failure automatically by deleting the local tokens and prompting for login.
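A minimal sketch of that behavior, assuming hypothetical helpers for the token file locations and the interactive login flow (not the actual CLI code paths); note that the 502 above would ideally become a proper 401/403 from the service first:

import fs from "fs";
import { AxiosError } from "axios";

// Hypothetical: where the CLI keeps the cached tokens.
const TOKEN_FILES = ["id.jwt", "access.jwt"];
// Hypothetical: the existing interactive login flow.
declare function interactiveLogin(): Promise<void>;

async function withDnsAuthRetry<T>(doRequest: () => Promise<T>): Promise<T> {
  try {
    return await doRequest();
  } catch (err) {
    const e = err as AxiosError;
    const status = e.response?.status;
    // 401/403 indicate that the locally cached tokens are no longer usable:
    // drop them, prompt for a fresh login, and retry the request once.
    if (e.isAxiosError && (status === 401 || status === 403)) {
      for (const f of TOKEN_FILES) {
        if (fs.existsSync(f)) fs.unlinkSync(f);
      }
      await interactiveLogin();
      return await doRequest();
    }
    throw err;
  }
}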

Improve the docs-only change detection script to reduce false positives

Current Behavior / Proposed Behavior

We are comparing the main branch and the PR branch using git.

We can try using the GitHub API to get the list of files that changed in the PR.

Context

If there are changes merged to the main branch and the docs-only PR branch is not up to date, this can trigger unnecessary CI builds.

Possible Technical Solution

Example:

curl -H "Accept: application/vnd.github.v3+json" https://api.github.com/repos/opstrace/opstrace/pulls/92/files \
  | jq '.[].filename'

Then check whether only docs files were changed.

cli hangs for > 1 min without feedback when having to specify --region during destroy

➜  opstrace git:(mat/deploy-ui-bits) βœ— ./build/bin/opstrace destroy aws $OPSTRACE_CLUSTER_NAME --region us-west2
2020-11-27T04:50:59.012Z info: logging to file: opstrace_cli_destroy_20201127-045059Z.log
2020-11-27T04:50:59.013Z info: Discovered AWS credentials. Access key: AKIA...5LWX
2020-11-27T04:50:59.014Z info: About to destroy cluster matdev (aws).
Proceed? [y/N] y
2020-11-27T04:52:08.309Z error: error during cluster teardown (attempt 1):
UnknownEndpoint: UnknownEndpoint: Inaccessible host: `eks.us-west2.amazonaws.com'. This service may not be available in the `us-west2' region.
    at throwIfAWSAPIError (/snapshot/opstrace/node_modules/@opstrace/aws/build/util.js:0)
    at Object.awsPromErrFilter (/snapshot/opstrace/node_modules/@opstrace/aws/build/util.js:0)
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at getCluster (/snapshot/opstrace/node_modules/@opstrace/aws/build/eks.js:0)
    at Object.doesEKSClusterExist (/snapshot/opstrace/node_modules/@opstrace/aws/build/eks.js:0)
    at getEKSKubeconfig (/snapshot/opstrace/node_modules/@opstrace/uninstaller/build/index.js:0) {
  statusCode: undefined
}
2020-11-27T04:52:08.310Z error: JSON representation of err: {
  "name": "UnknownEndpoint"
}

This is due to an invalid region, but we should at least report to the user as soon as possible that the region might be invalid.
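A minimal sketch of an early region sanity check with the aws-sdk (illustrative; whether to consult EC2 or a static region list, and where exactly this runs in the CLI, are open choices):

import { EC2 } from "aws-sdk";

// Fail fast on an obviously bad region instead of hanging for a minute on an
// unreachable endpoint such as eks.us-west2.amazonaws.com.
async function validateRegion(region: string): Promise<void> {
  let known: string[];
  try {
    // Query the region list from a region we know exists.
    const ec2 = new EC2({ region: "us-east-1" });
    const resp = await ec2.describeRegions({ AllRegions: true }).promise();
    known = (resp.Regions || [])
      .map(r => r.RegionName)
      .filter((n): n is string => n !== undefined);
  } catch (err) {
    // If the lookup itself fails (e.g. no network), don't block -- just warn.
    console.warn(`could not verify region '${region}': ${err}`);
    return;
  }
  if (!known.includes(region)) {
    throw new Error(`'${region}' does not look like a valid AWS region`);
  }
}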

DNS service: do not use 429 response for 'quota reached', but a 400 response

Got a 429 error from the DNS service in CI: https://github.com/opstrace/opstrace-prelaunch/issues/2035

No further detail in the log (no response body, in particular), so the error stayed mysterious for the moment. Improving that is now tracked in https://github.com/opstrace/opstrace/issues/2034.

I wondered why / how we would breach an HTTP request rate limit... then I went into the DNS service code and realized that the 429 is emitted for a 'quota reached' condition, not an actual request rate limit:

https://github.com/opstrace/opstrace/blob/025fc0a3b3e405be5597d45c5128831ea3eb52a3/packages/dns-service/src/controllers/dns.ts#L55

Then I remembered that we had been talking about this. After a bit of digging: here is where I suggested not using a 429 response for that, but a 400 response:

See https://github.com/opstrace/opstrace-prelaunch/issues/1552#issuecomment-713424784

Let's use 429 only for an actual HTTP request rate limit.
For quota/limits I suggest using 400 responses:
AWS VpcLimitExceeded example: #1459
AWS TooManyBuckets example: #1323

(there was no reply about that in #1552).

Still very much my opinion :).
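As a sketch of the suggested behavior, assuming an Express-style handler in the DNS service and a hypothetical per-user quota lookup (the actual handler and quota logic live in packages/dns-service and may look different):

import { Request, Response } from "express";

// Hypothetical per-user cluster quota and lookup.
const MAX_CLUSTERS_PER_USER = 2;
declare function countClustersForUser(req: Request): Promise<number>;

export async function createDnsEntry(req: Request, res: Response) {
  const existing = await countClustersForUser(req);
  if (existing >= MAX_CLUSTERS_PER_USER) {
    // Business-logic quota: respond with 400 and a descriptive body.
    // Reserve 429 for an actual HTTP request rate limit.
    return res.status(400).json({
      error: "quota reached",
      detail: `at most ${MAX_CLUSTERS_PER_USER} clusters (DNS entries) per user`
    });
  }
  // ... create the DNS entry ...
  return res.status(201).json({ ok: true });
}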

ci: aws: quota reached for elastic IP addresses

[2020-12-02T10:51:56Z] 2020-12-02T10:51:56.070Z debug: aws-sdk-js request failed (attempt 0): AddressLimitExceeded: The maximum number of addresses has been reached. (retryable, according to sdk: false)

Remove kubed and have the controller manage the secret with https certificate

Current Behavior / Proposed Behavior

Have the controller copy the secret with the https certificate to the tenant namespaces.

Context

We introduced kubed to copy the https certificate over to the tenant namespaces to be used by the ingresses. But we can also do it in the controller.

The controller can copy objects across namespaces. This is an example of using a secret in the kube-system namespace as the source of truth and then persisting it in the application namespace too: https://github.com/opstrace/opstrace/pull/45/files#diff-f7820c94ae287fc0583ec55d49a886b718acfd1247ac7132d6fd03db040c2ef0R159
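A minimal sketch of such a copy with @kubernetes/client-node (illustrative: the source namespace/secret name are assumptions, and the real controller would do this inside its reconciliation logic):

import { KubeConfig, CoreV1Api, V1Secret } from "@kubernetes/client-node";

// Copy the https certificate secret from its source namespace into each
// tenant namespace, so the tenant ingresses can reference it.
async function copyHttpsCertSecret(tenantNamespaces: string[]) {
  const kc = new KubeConfig();
  kc.loadFromDefault();
  const core = kc.makeApiClient(CoreV1Api);

  // Assumed source location of the certificate secret.
  const sourceNamespace = "ingress";
  const secretName = "https-cert";

  const { body: source } = await core.readNamespacedSecret(secretName, sourceNamespace);

  for (const ns of tenantNamespaces) {
    const copy: V1Secret = {
      metadata: { name: secretName, namespace: ns },
      type: source.type,
      data: source.data // certificate + key, base64-encoded as stored
    };
    try {
      await core.createNamespacedSecret(ns, copy);
    } catch (e) {
      // If it already exists, replace it so certificate renewals propagate.
      await core.replaceNamespacedSecret(secretName, ns, copy);
    }
  }
}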

Something unexpected happened: Error during service worker registration: DOMException: Failed to register a ServiceWorker for scope

Describe the bug

UI throws errors and the javascript console shows

Error during service worker registration: DOMException: Failed to register a ServiceWorker for 
scope ('https://sreis337.opstrace.io/') with script ('https://sreis337.opstrace.io/service-worker.js'): 
An SSL certificate error occurred when fetching the script.

This is related to the use of self-signed certificates:

w3c/ServiceWorker#1159

To Reproduce

Create an Opstrace cluster with letsencrypt-staging cert issuer.

Don't get valid certs when creating new tenants

After an install, add a new tenant to the cluster config file, then run the create command again (this is the simplest way to reproduce, by effectively "adding" a tenant). The new tenant will be created, but the cert is invalid for this tenant's domain. The cert is fine for the original domains. @sreis it seems like the cert isn't getting copied over to the new tenant's namespace, or maybe we're not generating a new cert to cover this domain, now that we combine all subdomains into the same cert?

This cluster has letsencrypt-prod certs and all other tenants created during install have valid certs.

A tenant added after the original install no longer gets a valid cert:

Screen Shot 2020-11-30 at 10 40 23 PM

ci: lint entire TS codebase with ESLint

The goal is that the entire TypeScript code base is linted (and passing!) using ESLint (and a consistent set of rules).

This is a bigger effort (sometimes requiring quite involved code changes) and we should probably decompose this:

  • installer
  • uninstaller
  • lib/aws
  • lib/gcp
  • ...

Started doing that for the CLI:

"lint": "eslint . --ext .ts"

We certainly want to end up in a state where we do not have "lint": "echo done" anymore in our code base :-) (example)

The rules (respected by ESLint in CI, and also by the ESLint extension in VS Code) live here: https://github.com/opstrace/opstrace/blob/main/.eslintrc.js (and can be adjusted).

installer: show controller 'events' when the pod does not get ready (example: ImagePullBackOff)

Not sure if I am using good k8s terminology here.

When ProgressDeadlineExceeded is hit for the controller deployment, the installer will log that (after https://github.com/opstrace/opstrace-prelaunch/pull/2033): https://github.com/opstrace/opstrace-prelaunch/pull/2033#issuecomment-731609108.

That's nice, already much better than before. We can consider that this resolved https://github.com/opstrace/opstrace-prelaunch/issues/1208.

But we can and should do better in terms of showing reasons / specific errors. We could

  • emit a log msg on how to configure kubectl now, and then show a helpful kubectl command
  • do the pod/container inspection ourselves

A good criterion, I think: when e.g. the image can't be found, show the ImagePullBackOff error. Also see https://github.com/opstrace/opstrace-prelaunch/pull/2033#issuecomment-731630012.

I tried that using our Deployment class but couldn't quite get to the EphemeralContainers objects -- will try again via the raw js k8s client lib. Also see https://github.com/opstrace/opstrace-prelaunch/pull/2033#issuecomment-731609108 for inspiration from kubectl.
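A minimal sketch of doing that inspection ourselves with the raw js k8s client: list events in the controller's namespace and surface those attached to the controller pods, so that e.g. an ImagePullBackOff reason shows up in the installer log (namespace and pod name prefix are assumptions):

import { KubeConfig, CoreV1Api } from "@kubernetes/client-node";

// Log recent events for pods belonging to the controller deployment, so that
// e.g. an ImagePullBackOff reason shows up in the installer output instead of
// a silent ProgressDeadlineExceeded timeout.
async function logControllerPodEvents() {
  const kc = new KubeConfig();
  kc.loadFromDefault();
  const core = kc.makeApiClient(CoreV1Api);

  const namespace = "kube-system"; // assumption: where the controller runs
  const events = await core.listNamespacedEvent(namespace);
  const pods = await core.listNamespacedPod(namespace);

  for (const pod of pods.body.items) {
    const name = pod.metadata?.name || "";
    if (!name.startsWith("opstrace-controller")) continue; // assumed name prefix

    for (const ev of events.body.items) {
      if (ev.involvedObject?.name === name) {
        console.log(`controller pod ${name}: ${ev.reason}: ${ev.message}`);
      }
    }
  }
}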

docs: markdown: require no more than one sentence per line (enforce in CI)

I think I'd like to have this. Had a lot of success with that before, especially w.r.t. keeping diffs meaningful.

Quick resource dump.

"what she says": https://sembr.org/ :-)

By inserting line breaks at semantic boundaries, writers, editors, and other collaborators can make source text easier to work with, without affecting how it’s seen by readers.

From: prometheus-community/helm-charts#25

https://sembr.org/ makes a compelling case for this, and I'm inclined to agree.

https://rhodesmill.org/brandon/2012/one-sentence-per-line/

I agree that semantic line breaks / one sentence per line makes it easier to review changes.

DavidAnson/markdownlint#66

This seems to implement that: https://github.com/JoshuaKGoldberg/sentences-per-line

DavidAnson/markdownlint#298

installer: aws-sdk-js retries: use debug log again for non-retryable errors

Too noisy on info log:

2020-12-02T08:25:04.691Z info: aws-sdk-js request failed (attempt 0): InvalidDBInstanceState: Instance jpdev is already being deleted.
2020-12-02T08:25:04.692Z info: RDS Aurora instance teardown: tryDestroy(): ignore aws api error: InvalidDBInstanceState: Instance jpdev is already being deleted. (HTTP status code: 400)
2020-12-02T08:25:05.563Z info: RDS instance status: deleting
2020-12-02T08:25:06.494Z info: RDS cluster status: deleting
2020-12-02T08:25:16.482Z info: aws-sdk-js request failed (attempt 0): InvalidDBInstanceState: Instance jpdev is already being deleted.
2020-12-02T08:25:16.484Z info: RDS Aurora instance teardown: tryDestroy(): ignore aws api error: InvalidDBInstanceState: Instance jpdev is already being deleted. (HTTP status code: 400)

That is, this commit didn't quite do what it was supposed to do: cf44e1f

And this patch wasn't quite sufficient: aws/aws-sdk-js#3402 (does the SDK internally classify many non-retryable errors as retryable?)
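A minimal sketch of the intended behavior, assuming a wrapper around the aws-sdk calls that already knows whether it is going to retry (helper and logger names are illustrative, not the actual @opstrace/aws internals):

import { AWSError } from "aws-sdk";

// Hypothetical logger with the usual levels.
declare const log: { info: (m: string) => void; debug: (m: string) => void };

function logSdkRequestFailure(err: AWSError, attempt: number, willRetry: boolean) {
  const msg = `aws-sdk-js request failed (attempt ${attempt}): ${err.code}: ${err.message}`;
  if (willRetry) {
    // A failure we are going to retry is worth surfacing on info level.
    log.info(msg);
  } else {
    // Non-retryable errors (like InvalidDBInstanceState during teardown) are
    // expected and handled by the caller -- keep them on debug level to avoid
    // the noise shown above.
    log.debug(msg);
  }
}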

dns client: log HTTP response details (including body) for error responses

[2020-11-20T12:35:23Z] 2020-11-20T12:35:23.255Z info: ServiceLinkedRole(elasticloadbalancing.amazonaws.com) setup: reached desired state, done (duration: 0.37 s)
[2020-11-20T12:35:23Z] 2020-11-20T12:35:23.255Z info: setting up DNS
[2020-11-20T12:35:23Z] 2020-11-20T12:35:23.859Z error: error during cluster creation (attempt 1):
[2020-11-20T12:35:23Z] Error: Request failed with status code 429

In this context we have to (debug-)log HTTP response details (such as a body prefix). This is needed for debugging, generally.

Examples for what/how to log:
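A minimal sketch of what that could look like for the axios-based DNS client (illustrative; the field layout and the 500-char body cap mirror what the readiness prober already logs elsewhere in this document):

import { AxiosError } from "axios";

declare const log: { debug: (msg: string) => void };

// Debug-log status, headers, and a bounded body prefix for an error response,
// so that e.g. a 429 from the DNS service is no longer a mystery.
function logHttpErrorResponse(err: AxiosError) {
  if (!err.response) {
    log.debug(`request failed without a response: ${err.message}`);
    return;
  }
  const body =
    typeof err.response.data === "string"
      ? err.response.data
      : JSON.stringify(err.response.data);
  log.debug("HTTP error response details:");
  log.debug(`  status: ${err.response.status}`);
  log.debug(`  headers: ${JSON.stringify(err.response.headers)}`);
  log.debug(`  body[:500]: ${body.slice(0, 500)}`);
}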

ci: scheduled main builds: upon success, tag controller image as latest-main (dev workflow)

When launching a dev cluster using a CLI build not built by CI (e.g. when using the current tsc-built index.js), the default for controller_image usually points to an image that does not exist.

I then usually pick the last controller image built and pushed by CI and add a corresponding controller_image: ... value to the cluster config YAML document from which I am trying to launch a cluster. That controller image reference requires a manual lookup (either from a Buildkite build log, or from Docker Hub).

As an improvement for this dev workflow, let's instead have an alias that we can use when manually building up a cluster config file.

That is, let's have CI add a special tag to the controller image of the last passed CI run from main. This make target is used by CI for that.

It's important to appreciate that this moving target / alias has its purpose for a local dev workflow only, i.e. where the ambiguity w.r.t. the actual controller image you get is a deliberate, manual choice.

create operation: detect "continuation", otherwise fallout such as key material split brain (idempotency violation)

With each create operation, the CLI generates fresh key material early in the process: an RSA key pair, and derived authentication tokens for the data API. It does that before doing any remote state inspection.

If the Opstrace cluster already existed before the current create operation was invoked, the create operation will push the new public key into the existing cluster. However, the deployments will not pick it up, and subsequently the CLI will use the new (bad) authentication tokens for probing cluster readiness in the last phase of said create operation.

That will fail with something like

[2020-11-24T20:57:27Z] 2020-11-24T20:57:27.056Z info: https://loki-external.default.bk-2962-d71-a.opstrace.io:8443/loki/api/v1/labels: still waiting, unexpected HTTP response
[2020-11-24T20:57:27Z] 2020-11-24T20:57:27.543Z debug: HTTP response details:
[2020-11-24T20:57:27Z]   status: 401
[2020-11-24T20:57:27Z]   body[:500]: bad authentication token

This is a known limitation that is not yet documented; and a good reason to create this ticket.

This behavior is also certainly a violation of an idempotency constraint that we talk about every now and then (where we think about and want the create operation to be idempotent).

For example, from the current quickstart documentation:

So you know: The CLI is re-entrant, so if you kill it or it dies, it will pick up right where it left off.

This topic raises so many interesting points!

First-level thinking: we could detect when the k8s cluster & cluster-internal config (controller config) exist, and then not regenerate key material and data API auth tokens; instead, try to read existing authentication tokens (and fail when they can't be discovered).

Second-level thinking: the previous thought reveals a bigger-picture insight: when the cluster already exists, we don't want to overwrite any part of the cluster-internal config state -- which is not well-specified yet.

Third-level thinking: ok, we can explicitly ignore the user-given config, emit a clear warning message, and move on with the Nth create operation on the same cluster; trying to inform the admin via log messages that we just ignored the config they provided.

Fourth-level thinking: but this kind of fallback would be too magic -- we'd be ignoring the user-given config file (or parts of it!) w/o providing a clear signal. No, this must lead to non-zero exit of the current create operation.

Fifth-level thinking: this means the same command run twice (unaltered) can't just magically do the right thing in an idempotent fashion. There's a logical conflict here. Even with a proper config upgrade mechanism we should not magically switch between 'initial create' and 'config upgrade'.

What's the value of the idempotency constraint? I think we have to see that it is a little ignorant -- because we have to think about every aspect of the cluster configuration when discussing idempotency or "continuation" of a previous partial create.

Long term: we need to introduce proper cluster config diff and mutation design

We could specify "well" what it means to overwrite the config (and apply all changes). That could resolve this conflict. Each create operation could simply set the current config, including key material. That sounds like a big, interesting project for later. Done properly, that requires, for example, doing key rotation in the API proxies w/o downtime.

But note: even with a proper config upgrade mechanism we should not magically switch between 'initial create' and 'config upgrade'. That must be an explicit user choice.

Short term: idempotency for cloud infra, pragmatic explicit continuation mode for k8s cluster

We could make the idempotency constraint a little weaker and apply it only to the cloud infrastructure setup that happens before any k8s cluster interaction. I think this would be pragmatic and valuable! That is: implicit continuation, and actual idempotency when we talk about cloud infra, not about the k8s cluster state itself.

How to distinguish those cases? We could, in the beginning of the create operation, check if a corresponding k8s/EKS cluster already exists (not limiting the search to the region specified via config file, but across all regions).
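A minimal sketch of that cross-region lookup with the aws-sdk, assuming the EKS cluster name equals (or is derived from) the Opstrace cluster name; per-region error handling is simplified:

import { EC2, EKS } from "aws-sdk";

// Return the region in which an EKS cluster with the given name exists,
// or undefined if no such cluster is found in any region.
async function findEKSClusterRegion(clusterName: string): Promise<string | undefined> {
  const ec2 = new EC2({ region: "us-east-1" });
  const regionResp = await ec2.describeRegions().promise();
  const regions = (regionResp.Regions || [])
    .map(r => r.RegionName)
    .filter((r): r is string => r !== undefined);

  for (const region of regions) {
    const eks = new EKS({ region });
    try {
      const resp = await eks.listClusters().promise();
      if ((resp.clusters || []).includes(clusterName)) {
        return region;
      }
    } catch (err) {
      // Opted-out / unreachable regions: skip and keep looking.
      continue;
    }
  }
  return undefined;
}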

Question: do we allow the same Opstrace cluster name to be re-used across regions in the same cloud account? No (for now, to have something to base decisions on -- but probably forever; that's just too confusing). (This is already assumed elsewhere, but needs to be written down.)

If we do not find such a k8s cluster

Simply continue -- with key material regeneration :) -- and cloud infra creation (in an idempotent fashion, i.e. this will automatically pick up where we previously left off).

If we find such a k8s cluster

If it does not have a controller deployment and no Opstrace-specific config map set:

Continue. There's no conflict here (again: idempotency).

If it has a controller deployment and any config map set

Then we should abort, exit non-zero; saying that we don't yet have a config-diff-upgrade mechanism.

Then offer a continuation run with a CLI flag (e.g. --continue) that will then move past this point.

This --continue mode would not accept a cluster config, or would explicitly ignore various parts of the cluster configuration file; and it requires successful discovery of authentication token files from a previous create run. To me, that's another argument for: authentication token files should actually always live in a directory as an atomic unit, and we want to introduce a command line flag for discovering that.

I begin to see:

opstrace create aws jpdev --continue [--api-token-dir PATH]

  • --api-token-dir PATH is for discovery in this case; and it has a default, matching the corresponding default for writing these files (upon the first, happy-path create).
  • That command above does not require reading a cluster config document. Maybe we still want to do that, so that one can simply append a --continue for the second cmd invocation (so that one does not need to remove -c or stdin?).

Definite TODO: write authentication tokens to disk only when we are certain that we are creating a new cluster.

That means: delay writing the authentication tokens (compared to what we do today).

And here we are: Nth level thinking: this create --continue is basically the status command that we already have; or at least it's getting super close. I think: no, let's not do opstrace create aws jpdev --continue [--api-token-dir PATH] -- let's re-think status, maybe call it differently.

opstrace wait aws jpdev [--api-token-dir PATH]

It will find the EKS cluster (look for it in all regions), get the list of tenants from the cluster, and then use the API tokens to do its thing.

This would also resolve my major 'design complaint' about the current status implementation: it requires the config file, but shouldn't: https://github.com/opstrace/opstrace/blob/e44644d78f01659cf7d69ee44d6658a0f9119059/packages/cli/src/status.ts#L59

Misc

Other thoughts triggered by this discussion

  • For readiness, do not rely on HTTP / data API interaction -- probe readiness with k8s API means only (the proper solution?).

ci instability: TCP connect() timeout during `openssl s_client -showcerts -connect`

https://buildkite.com/opstrace/prs/builds/3121#5648443f-a53b-4e7b-aef5-bd7715eceb2a/3715

checking cluster is using certificate issued by LetsEncrypt
[2020-12-02T21:36:04Z] + openssl x509 -noout -issuer
[2020-12-02T21:36:04Z] + openssl s_client -showcerts -connect system.bk-3121-df2-a.opstrace.io:443
[2020-12-02T21:36:04Z] + grep 'Fake LE Intermediate'
[2020-12-02T21:38:14Z] 140420438553728:error:0200206E:system library:connect:Connection timed out:../crypto/bio/b_sock2.c:110:
[2020-12-02T21:38:14Z] 140420438553728:error:2008A067:BIO routines:BIO_connect:connect error:../crypto/bio/b_sock2.c:111:
[2020-12-02T21:38:14Z] connect:errno=110
[2020-12-02T21:38:14Z] + teardown
[2020-12-02T21:38:14Z] + LAST_EXITCODE_BEFORE_TEARDOWN=1

cli: allow for configuring kubectl

Something like opstrace kubectl aws jpdev?

input:

  • cloud provider
  • cluster name

(same two parameters as for destroy)

For debuggability.
For replacing the make kconfig-* make targets.
Can refer to that in CLI-emitted log msgs/error msgs.
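A minimal sketch of how such a command could obtain the connection details on AWS, using the aws-sdk's describeCluster (assembling the kubeconfig document and the exec-based authentication section are left out):

import { EKS } from "aws-sdk";

// Fetch what a kubeconfig entry for the cluster needs: API server endpoint
// and cluster CA. Authentication (e.g. via aws eks get-token) is left out.
async function getEKSConnectionDetails(clusterName: string, region: string) {
  const eks = new EKS({ region });
  const resp = await eks.describeCluster({ name: clusterName }).promise();
  const cluster = resp.cluster;
  if (!cluster || !cluster.endpoint || !cluster.certificateAuthority?.data) {
    throw new Error(`EKS cluster ${clusterName} not found or not ready`);
  }
  return {
    endpoint: cluster.endpoint,
    caDataBase64: cluster.certificateAuthority.data
  };
}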

installer: aws: waiting "for Certificate ingress/https-cert to be ready" didn't resolve within ~1 hour

A CLI build from main from today.

2020-12-01T16:02:16.273Z debug: CLI build information: {
  "BRANCH_NAME": "main",
  "VERSION_STRING": "5b8ea45a-ci",
  "COMMIT": "5b8ea45",
  "BUILD_TIME_RFC3339": "2020-12-01 12:05:51+00:00",
  "BUILD_HOSTNAME": "0d5430f98c0d"
}
$ ./opstrace create aws jpdev -c ~/dev/opstrace/ci/cluster-config.yaml 
...
2020-12-01T16:02:16.277Z debug: user-given cluster config parsed. JSON representation:
{
  "tenants": [
    "default"
  ],
  "env_label": "ci",
  "node_count": 3
}
...
2020-12-01T16:22:27.526Z info: waiting for 3 StatefulSets
2020-12-01T16:22:27.526Z info: waiting for 2 Certificates
2020-12-01T16:22:27.526Z debug:     Waiting for Certificate ingress/https-cert to be ready
2020-12-01T16:22:27.527Z debug:     Waiting for Certificate ingress/kubed-apiserver-cert to be ready

...

2020-12-01T16:42:18.212Z info: waiting for 0 Deployments
2020-12-01T16:42:18.212Z info: waiting for 0 DaemonSets
2020-12-01T16:42:18.213Z info: waiting for 0 StatefulSets
2020-12-01T16:42:18.213Z info: waiting for 1 Certificates
2020-12-01T16:42:18.213Z debug:     Waiting for Certificate ingress/https-cert to be ready
2020-12-01T16:42:19.127Z info: shutting down k8s informers
2020-12-01T16:42:19.129Z warning: cluster creation attempt timed out after 2400 seconds
2020-12-01T16:42:19.130Z info: start attempt 2 in 10 s
...
2020-12-01T17:16:00.171Z info: waiting for 0 Deployments
2020-12-01T17:16:00.172Z info: waiting for 0 DaemonSets
2020-12-01T17:16:00.172Z info: waiting for 0 StatefulSets
2020-12-01T17:16:00.172Z info: waiting for 1 Certificates
2020-12-01T17:16:00.172Z debug:     Waiting for Certificate ingress/https-cert to be ready

@sreis I would appreciate it if you could have a look here.

ci instability: during loki query: 502 response, EOF in body

https://buildkite.com/opstrace/scheduled-main-builds/builds/1290#1facfc77-1ab5-432d-a745-407534179efa/2796-4803

[2020-12-02T12:37:00Z] 2020-12-02T12:37:00.972Z info: HTTP resp to GET(https://loki.system.bk-1290-878-a.opstrace.io/loki/api/v1/query_range?query=%7Bk8s_namespace_name%3D%22loki%22%2C+k8s_container_name%3D%22ingester%22%7D+%7C%3D+%22Starting+Loki%22&direction=BACKWARD&regexp=&limit=10&start=1606552581861000000&end=1606916181861000000):
[2020-12-02T12:37:00Z]   status: 502
[2020-12-02T12:37:00Z]   body[:500]: EOF
[2020-12-02T12:37:00Z]   headers: {"server":"openresty/1.15.8.2","date":"Wed, 02 Dec 2020 12:37:00 GMT","content-type":"text/plain; charset=utf-8","content-length":"3","connection":"close","strict-transport-security":"max-age=15724800; includeSubDomains"}
[2020-12-02T12:37:00Z]   totalTime: 39.108 s
[2020-12-02T12:37:00Z]   dnsDone->TCPconnectDone: 0.001 s
[2020-12-02T12:37:00Z]   connectDone->reqSent 0 s
[2020-12-02T12:37:00Z]   reqSent->firstResponseByte: 39.067 s

Roadmap link to community discussions is not working

Describe the bug

The link to community discussions at the end of the Roadmap is not working.
(I assume it's because I'm not a member of the organization, so maybe that is intentional.)
https://opstrace.com/docs/references/roadmap

The link leads to
https://go.opstrace.com/community
which redirects to
https://github.com/opstrace/opstrace/discussions
where I get a 404 response.

To Reproduce

  1. Go to the aforementioned Roadmap link
  2. Scroll down to the end of the page
  3. Click the link while not being signed into GitHub/the Opstrace org.

Expected behavior

I expected to be redirected to a page where I can see ongoing community discussions or a note that this is intended only for members of the organization.

ci instability: loki query: 500 response w/ rpc error: code = Internal desc = received XXX-bytes data exceeding the limit YYY bytes

https://buildkite.com/opstrace/scheduled-main-builds/builds/1278#a54f9f7b-1004-4299-89f8-dd98121eaf75/2721-3563

[2020-11-30T20:39:27Z] 2020-11-30T20:39:27.951Z info: HTTP resp to GET(https://loki.default.bk-1278-bfb-a.opstrace.io/loki/api/v1/query_range?query=%7Bdummystream%3D%22test-remote-ldi-1ZqwhA-0003%22%7D&direction=FORWARD&limit=20000&start=1606768759642000000&end=1606768759644000000):
[2020-11-30T20:39:27Z]   status: 500
[2020-11-30T20:39:27Z]   body[:500]: rpc error: code = Internal desc = received 252090-bytes data exceeding the limit 242040 bytes
[2020-11-30T20:39:27Z]
[2020-11-30T20:39:27Z]   headers: {"server":"openresty/1.15.8.2","date":"Mon, 30 Nov 2020 20:39:27 GMT","content-type":"text/plain; charset=utf-8","content-length":"94","connection":"close","strict-transport-security":"max-age=15724800; includeSubDomains","x-content-type-options":"nosniff"}
[2020-11-30T20:39:27Z]   totalTime: 0.258 s
[2020-11-30T20:39:27Z]   dnsDone->TCPconnectDone: 0.003 s
[2020-11-30T20:39:27Z]   connectDone->reqSent 0 s
[2020-11-30T20:39:27Z]   reqSent->firstResponseByte: 0.215 s
[2020-11-30T20:39:27Z]
[2020-11-30T20:39:27Z]     1) long dummystream insert, validate via query

cluster teardown: be more deliberate about the higher-level mode/goal (obliteration vs hibernation)

From prelaunch repo, Sep 2021.


Thinking far into the future, there might be two "cluster teardown" modes of interest:

  1. obliteration: including all cloud resources ever created for that cluster, including payload data (cloud storage), including -- well -- really everything.
  2. hibernation: with the major goal to keep the data around for little cost, and to be able to start a cluster again w/o complications, exposing the data via the usual query interfaces.

To date, we seem to have been focusing on (1) only; at least the current cluster destroy operation is closer to (1) than to (2). That's because (2) takes a whole lot of thinking & verification/testing work.

But we're not really consistent in doing (1) -- for example, we wait for the Kubernetes deployments to cleanly terminate, which should only be necessary in the context of (2).

Systematic / clear thinking along these lines, especially for (1), might save a lot of work today.

we wait for the kubernetes deployments to cleanly terminate

Btw, we do this to get away with not tearing down disks/EBS volumes "manually".

ci instability: failed: checking cluster is using certificate issued by LetsEncrypt

Describe the bug

checking cluster is using certificate issued by LetsEncrypt
[2020-11-25T00:17:16Z] + openssl s_client -showcerts -connect system.bk-1240-2ac-g.opstrace.io:443
[2020-11-25T00:17:16Z] + grep 'Fake LE Intermediate'
[2020-11-25T00:17:16Z] + openssl x509 -noout -issuer
[2020-11-25T00:17:16Z] 140606379885696:error:2008F002:BIO routines:BIO_lookup_ex:system lib:../crypto/bio/b_addr.c:724:No address associated with hostname
[2020-11-25T00:17:16Z] connect:errno=0
[2020-11-25T00:17:16Z] + teardown
[2020-11-25T00:17:16Z] + LAST_EXITCODE_BEFORE_TEARDOWN=1

Seen here: https://buildkite.com/opstrace/scheduled-main-builds/builds/1240#2fc9a08e-bfc0-41f8-8992-fbee901b3631/988 (scheduled build from main)

To Reproduce

Flaky issue only seen, thus far, in GCP.

Expected behavior

Cluster should be made ready with the requested certificate.

elb/volume/... teardown: encode opstrace cluster fact and opstrace cluster name in k8s-cluster-name tag, use that

This is a follow-up from comments/observations made previously:

ELBs created by entities running on a k8s cluster encode the k8s cluster name in a tag name:

kubernetes.io/cluster/<k8s-cluster-name>

Note that "k8s cluster name" is actually the GKE / EKS cluster name.

Using this tag might be the best bet for teardown to detect:

  1. that the ELB belongs to an opstrace cluster
  2. that the ELB belongs to the specific opstrace cluster <opstrace-cluster-name>

This assumes that (and only needs to be done if) we cannot detect (1) reliably by e.g. setting and reading opstrace_cluster_name.

If we go down this path then we need to encode the fact that a k8s cluster (GKE / EKS cluster) belongs to an Opstrace cluster in the k8s cluster name, e.g. via prefix.

That means: while today, the k8s cluster name corresponds to the opstrace cluster name, in the future we might want to have k8s_cluster_name = "opstrace-${opstrace_cluster_name}".

This approach could replace the 'detect-elbs-belonging-to-opstrace-cluster-via-vpc-association-technique' (https://github.com/opstrace/opstrace-prelaunch/blob/39a5e869d171655268b40dac8330a54801f05683/lib/aws/src/vpc.ts#L14), and could also be applied for other resource types, such as persistent volumes!
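A minimal sketch of the tag-based detection for (v2) load balancers, assuming the proposed k8s cluster name prefix "opstrace-"; pagination and the 20-ARN batch limit of describeTags are omitted:

import { ELBv2 } from "aws-sdk";

// Find load balancer ARNs that belong to the given Opstrace cluster, based on
// the kubernetes.io/cluster/<k8s-cluster-name> tag set by Kubernetes.
async function findOpstraceClusterELBs(
  opstraceClusterName: string,
  region: string
): Promise<string[]> {
  // Assumes the proposed naming scheme k8s_cluster_name = "opstrace-${name}".
  const tagKey = `kubernetes.io/cluster/opstrace-${opstraceClusterName}`;

  const elb = new ELBv2({ region });
  const lbs = await elb.describeLoadBalancers().promise();
  const arns = (lbs.LoadBalancers || [])
    .map(lb => lb.LoadBalancerArn)
    .filter((a): a is string => a !== undefined);
  if (arns.length === 0) return [];

  const tags = await elb.describeTags({ ResourceArns: arns }).promise();
  return (tags.TagDescriptions || [])
    .filter(td => (td.Tags || []).some(t => t.Key === tagKey))
    .map(td => td.ResourceArn)
    .filter((a): a is string => a !== undefined);
}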
