holmes's Introduction

MOSN (Modular Open Smart Network) is a cloud-native network proxy written in Go. It is open-sourced by Ant Group and has been verified on hundreds of thousands of production containers during the 11.11 global shopping festival. MOSN offers multi-protocol support, modularity, intelligence, and security. It integrates a large number of cloud-native components, and also integrates an Envoy network library, which is high-performance and easy to extend. MOSN can be integrated with Istio to build a Service Mesh, and can also be used as a standalone L4/L7 load balancer, API gateway, cloud-native Ingress, and so on.

Core capabilities

  • Istio integration
    • Integrates Istio 1.10 to run in full dynamic resource configuration mode
  • Core forwarding
    • Supports a self-contained server
    • Supports the TCP proxy
    • Supports the UDP proxy
    • Supports transparent traffic hijack mode
  • Multi-protocol
    • Supports HTTP/1.1 and HTTP/2
    • Supports protocol extension based on XProtocol framework
    • Supports protocol automatic identification
    • Supports gRPC
  • Core routing
    • Supports virtual host-based routing
    • Supports headers/URL/prefix/variable/dsl routing
    • Supports redirect/direct response/traffic mirror routing
    • Supports host metadata-based subset routing
    • Supports weighted routing
    • Supports retries and timeout configuration
    • Supports request and response headers to add/remove
  • Back-end management & load balancing
    • Supports connection pools
    • Supports heartbeat handling for persistent connections
    • Supports circuit breaker
    • Supports active back-end health check
    • Supports load balancing policies: random/rr/wrr/edf
    • Supports host metadata-based subset load balancing policies
    • Supports different cluster types: original dst/dns/simple
    • Supports cluster type extension
  • Observability
    • Supports trace module extension
    • Integrates jaeger/skywalking
    • Supports Prometheus-style metrics
    • Supports configurable access logs
    • Supports admin API extension
    • Integrates Holmes to automatically trigger pprof
  • TLS
    • Supports multiple certificate matching and TLS inspector mode
    • Supports SDS for fetching and updating certificates
    • Supports extensible certificate fetching, updating, and verification
    • Supports CGo-based cipher suites: SM3/SM4
  • Process management
    • Supports hot upgrades
    • Supports graceful shutdown
  • Extension capabilities
    • Supports go-plugin based extension
    • Supports process based extension
    • Supports WASM based extension
    • Supports custom extensions configuration
    • Supports custom extensions at the TCP I/O layer and protocol layer

Download&Install

Use go get -u mosn.io/mosn, or you can git clone the repository to $GOPATH/src/mosn.io/mosn.

Documentation

Contributing

See our contributor guide.

Partners

Partners participate in MOSN co-development to make MOSN better.

End Users

The MOSN users. Please leave a comment here to tell us your scenario to make MOSN better!

Ecosystem

The MOSN community actively embraces the open source ecosystem and has established good relationships with the following open source communities.

Community

See our community materials on https://github.com/mosn/community.

Visit the MOSN website for more information on working groups, roadmap, community meetings, MOSN tutorials, and more.

Scan the QR code below with DingTalk(钉钉) to join the MOSN user group.

Community meeting

MOSN community holds regular meetings.

Landscapes

  

MOSN enriches the CNCF CLOUD NATIVE Landscape.

holmes's People

Contributors

atlanci, cch-4321, cch123, champly, cuishuang, doujiang24, dumbfeng, fangchichen, fibbery, hobbybear, ioworker0, istudies, jun10ng, ls-2018, maratrixx, nejisama, songzhibin97, taoyuanyuan, wangjc0216, wjcgithub, xiezhenouc, yudidi

holmes's Issues

feature: support upload profile to external platform.

Currently, profiles can only be written to the local filesystem.
It would be better to upload them to a central platform, so that we can get real-time alerts from that platform and analyze the profiles there.

Design:
We can add a new profile callback option and, if it has been set, call the callback each time a profile file is written.

The callback API might be:

callback(type, filename, reason)
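
A minimal sketch of how such a callback option might look, assuming the three string parameters named in the proposal. The names ProfileCallback, dumper, and writeProfile are hypothetical and are not part of holmes.

```go
package main

import "fmt"

// ProfileCallback is a hypothetical hook that is invoked after a profile file
// has been written. Its parameters mirror callback(type, filename, reason).
type ProfileCallback func(profileType, filename, reason string)

// dumper is a stand-in for the component that writes profiles.
type dumper struct {
	onProfile ProfileCallback // nil means no callback has been set
}

// writeProfile pretends to write a profile and then notifies the callback,
// but only if one has been set.
func (d *dumper) writeProfile(profileType, filename, reason string) {
	// ... the profile would be written to filename here ...
	if d.onProfile != nil {
		d.onProfile(profileType, filename, reason)
	}
}

func main() {
	d := &dumper{onProfile: func(t, f, r string) {
		// A real callback could upload the file to a central platform.
		fmt.Printf("upload %s profile %s (%s)\n", t, f, r)
	}}
	d.writeProfile("cpu", "/tmp/cpu.prof", "cpu usage above threshold")
}
```

Keeping the callback optional means existing users who only want local files are unaffected.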

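consul/api version conflict makes holmes unusable

My project uses consul v1.4.3, while the consul/api version pulled in by holmes is v1.3.0.

Looking at the holmes code, it does not actually use consul. Tracing the dependencies shows the following chain:

mosn.io/pkg -> github.com/dubbogo/gost -> github.com/prometheus/client_golang -> github.com/prometheus/common -> github.com/go-kit/kit -> consul/api v1.3.0

holmes itself has no dependency on consul; even the logging library it uses, mosn.io/pkg/log, does not use any consul/api functionality.

Would it be possible to remove the consul-related dependency from mosn.io/pkg/log? Otherwise it would be a pity that such an excellent open-source project cannot be used.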

Typo in the dump event reporting example

type ReporterImpl struct{}
func (r *ReporterImple) Report(pType string, buf []byte, reason string, eventID string) error {
    // do something
}

The receiver type is spelled ReporterImple, but the declared type is ReporterImpl.

deadlock detection

Investigate approaches to deadlock detection:

If plenty of goroutines are blocked on lock acquisition, is it a deadlock?

If goroutine stacks match certain rules, is it a deadlock?

If the number of runtime_SemacquireMutex frames keeps increasing, is it a deadlock?

Add a dangerous_limit parameter for WithCPUDump method

Hi team,

In my opinion, there should be a dangerous_limit parameter, meaning that holmes will not dump a profile once the current CPU usage has reached this limit. CPU profiling itself usually costs some resources, commonly 5% or less. If running the CPU pprof pushes CPU usage up by 5% and the service crashes as a result, that is not something I want.

func WithCPUDump(min int, diff int, abs int)

change to

func WithCPUDump(min int, diff int, abs int, dangerous int)

or add a new withOption func

func WithDangerousLimit(d int)

I prefer the latter.
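
A rough sketch of the second variant using the functional options pattern. Everything here (the options struct, the Option type, skipCPUDump) is a local stand-in written for illustration and does not reflect how holmes is actually implemented.

```go
package main

import "fmt"

// options is a hypothetical stand-in for the library's option struct.
type options struct {
	cpuMin, cpuDiff, cpuAbs int
	dangerousLimit          int // 0 means the limit is disabled
}

// Option mutates options; this is the usual functional options pattern.
type Option func(*options)

// WithCPUDump keeps the existing three-parameter shape from the proposal.
func WithCPUDump(min, diff, abs int) Option {
	return func(o *options) { o.cpuMin, o.cpuDiff, o.cpuAbs = min, diff, abs }
}

// WithDangerousLimit is the separately proposed option.
func WithDangerousLimit(d int) Option {
	return func(o *options) { o.dangerousLimit = d }
}

func newOptions(opts ...Option) *options {
	o := &options{}
	for _, apply := range opts {
		apply(o)
	}
	return o
}

// skipCPUDump reports whether profiling should be skipped because CPU usage
// has already reached the dangerous limit.
func (o *options) skipCPUDump(curCPU int) bool {
	return o.dangerousLimit > 0 && curCPU >= o.dangerousLimit
}

func main() {
	o := newOptions(WithCPUDump(30, 25, 80), WithDangerousLimit(90))
	fmt.Println(o.skipCPUDump(95)) // true: already at or above the limit
	fmt.Println(o.skipCPUDump(50)) // false: safe to profile
}
```

A separate option leaves the WithCPUDump signature unchanged, so existing callers keep compiling.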

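Monitoring file descriptor exhaustion

Recently, some of our high-traffic modules have occasionally hit "too many files" errors caused by file descriptor exhaustion. One cause is the file descriptor limit of the physical machine. As for the other, it only happens occasionally and we cannot capture the scene, so we still do not know the cause. Could this tool help us capture the scene and pin down the specific problem?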

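A small suggestion

var buf bytes.Buffer
_ = pprof.Lookup("xxx").WriteTo(&buf, int(h.opts.DumpProfileType))
This runs every few seconds.
Could the buffer here be reused? Otherwise it keeps being allocated and reclaimed, and this might become a performance hotspot.
I may have misunderstood.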
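
One possible way to reuse the buffer is a sync.Pool, sketched below. The helper dumpProfile and the pool variable are hypothetical names; whether pooling is a measurable win here would need benchmarking.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"sync"
)

// bufPool hands out reusable buffers so that each collection cycle does not
// have to allocate a fresh one.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// dumpProfile writes the named profile into a pooled buffer and passes the
// bytes to use. The slice is only valid until use returns.
func dumpProfile(name string, debug int, use func([]byte)) error {
	p := pprof.Lookup(name)
	if p == nil {
		return fmt.Errorf("unknown profile %q", name)
	}
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)

	if err := p.WriteTo(buf, debug); err != nil {
		return err
	}
	use(buf.Bytes())
	return nil
}

func main() {
	err := dumpProfile("goroutine", 1, func(b []byte) {
		fmt.Println("profile bytes:", len(b))
	})
	if err != nil {
		fmt.Println("dump failed:", err)
	}
}
```

Because the buffer goes back to the pool, callers that need the data afterwards must copy it inside the use function.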

heap samples are not what I expect

What version of Go are you using (go version)?

$  1.17.8

Does this issue reproduce with the latest release?

Haven't tried.

What operating system and processor architecture are you using (go env)?

go env Output
$ MacOS and amd64

What did you do?

I wrote a piece of code:

import (
    "strconv"
    "time"

    "github.com/gin-gonic/gin"
)

var (
    defaultSize = 1073741824 // 1 GiB
    a           []byte
)

func MakeNGbSlice(ctx *gin.Context) {
    sizeStr, ok := ctx.GetQuery("size")
    if ok {
        size, _ := strconv.Atoi(sizeStr)
        if size > 0 {
            defaultSize = size * defaultSize
        }
    }

    a = make([]byte, 0, defaultSize)
    for i := 0; i < defaultSize; i++ {
        a = append(a, byte('a'))
    }
    time.Sleep(time.Second * 10)
    a = nil // for gc
}

Then I curled the api to trigger the code to run.

What did you expect to see?

  1. The RSS my app used went up to 1GB.
  2. The heap profile data was right and could help me find out the reason why RSS went up after I dumped the heap samples.

What did you see instead?

  1. The RSS my app used went up to 1GB.
  2. The heap profile seemed to be weird:
Type: inuse_space
Time: Aug 3, 2022 at 2:32pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 12052.56kB, 64.38% of 18721.44kB total
Showing top 10 nodes out of 153
      flat  flat%   sum%        cum   cum%
 2562.81kB 13.69% 13.69%  3075.02kB 16.43%  runtime.allocm
 2048.81kB 10.94% 24.63%  2048.81kB 10.94%  runtime.malg
 2048.19kB 10.94% 35.57%  2048.19kB 10.94%  github.com/Shopify/sarama.(*TopicMetadata).decode
 1097.69kB  5.86% 41.44%  1097.69kB  5.86%  github.com/Shopify/sarama.(*client).updateMetadata
 1089.33kB  5.82% 47.26%  1089.33kB  5.82%  google.golang.org/grpc/internal/transport.newBufWriter
// ...
  3. After about 2 minutes, I profiled again and the result was what I expected:
Type: inuse_space
Time: Aug 3, 2022 at 2:33pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 1024.50MB, 98.25% of 1042.78MB total
Dropped 154 nodes (cum <= 5.21MB)
Showing top 10 nodes out of 18
      flat  flat%   sum%        cum   cum%
    1024MB 98.20% 98.20%     1024MB 98.20%  .../examples.Make1GbSlice
    0.50MB 0.048% 98.25%     6.63MB  0.64%  runtime.main
         0     0% 98.25%  1024.51MB 98.25%  github.com/gin-gonic/gin.(*Context).Next (inline)
         0     0% 98.25%  1024.51MB 98.25%  github.com/gin-gonic/gin.(*Engine).ServeHTTP
         0     0% 98.25%  1024.51MB 98.25%  github.com/gin-gonic/gin.(*Engine).handleHTTPRequest

So it seems that the samples pprof dumps right at that moment are wrong.

For more discussion, please see golang issue #54233.

bug: WithCollectInterval takes a long time before collectInterval is changed.

We found that the test case for the WithCollectInterval function fails some of the time. I dug into it, and the root causes of the failure are:

  1. select-case has no order; it picks a random ready case to execute.
  2. the logic after ticker.C takes more than 1 second.

So we should give intervalResetting priority over ticker.C, making sure that a message in the intervalResetting channel is consumed before ticker.C.

add mem cron job dump

In a memory leak scenario, we usually locate the leak by comparing two profiles taken at different times.

holmes already supports dumping a profile at abnormal times.
So I think we also need a feature for dumping at normal times, one that can be scheduled like a cron job.

A discussion about dynamic configuration.

Hi team,

I saw there was an issue about "dynamic configuration", and I want to discuss some questions about it.

  1. What does "dynamic" mean here: that holmes pulls/receives configuration from a remote server, or that holmes can adjust its configuration automatically through pre-prepared plans?
  2. If it is the former, with holmes receiving configuration from Apollo/Redis/an API or some service built by users themselves, we had better draft an abstract API and implement the dynamic config feature on top of it.
  3. Do we support hot-loading configuration? I mean, does holmes support modifying its config while it is running, or only the simpler way of loading configuration from a remote source when holmes is initializing? The change to support the former seems great.

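Extra dot (.) in dump file names

1653288560294_55A35748-CF59-4793-A824-9C8EEA87F574
As shown in the image, there are two dots after the first field of the dump file name. Is this a bug, or is a user-specified string supposed to go here? If it is a user-specified string, how do I specify it? I could not find a relevant API.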

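Some questions about ring

1. After reading the code, my impression is that ring is only used to compute an average. So why not use a queue, removing one element from the head and appending one at the tail whenever an element is added?

2. If the reason for not using a queue is that a queue lacks the strict size limit that ring has and is therefore error-prone, why not wrap the ring provided by the official library instead of writing one from scratch? In the hand-written ring, the maximum value of idx equals len(ring.data), and there are also ring.data[ring.idx] operations, which feels a bit dangerous.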

Support Dump Goroutines to logger

Currently, goroutines are dumped to a specified file path, which can sometimes be hard for us to check.
Would it be possible to support dumping goroutines to the logger? Then we could check them in our existing log system (ELK). Thanks.

feature: support external logger.

Currently, holmes creates a file-based logger, which makes holmes easier to use standalone.
But sometimes we already have a logger, and it would be better to support using the existing logger instead of creating a new one.
This would make holmes easier to integrate.

bug: should log the previous data in human readable order

Currently, the previous data is printed in a confusing order that is not human-readable.
79 and 70 should be the last values.

2022-09-26 17:00:02,934 [ERROR] [holmes.cpu] [Holmes] pprof dump cpu, config_min : 30, config_diff : 25, config_abs : 100, config_max : 0, previous : [13 13 13 13 79 13 13 14 14 13], current: 79
2022-09-26 17:11:02,938 [ERROR] [holmes.cpu] [Holmes] pprof dump cpu, config_min : 30, config_diff : 25, config_abs : 100, config_max : 0, previous : [16 12 13 12 13 13 70 13 13 13], current: 70

This is because we print the data from a ring buffer directly; we should print it in natural order:
https://github.com/mosn/holmes/blob/master/ring.go#L21

Maybe we can introduce a data method for ring, which returns the data in the natural order, so that we can use data() instead of ring.data.
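
A minimal sketch of such a method on a simplified ring buffer. The type below is written from scratch for illustration, with a hypothetical method name ordered, and is not the ring in ring.go.

```go
package main

import "fmt"

// ring is a simplified fixed-size ring buffer of samples.
type ring struct {
	data   []int
	idx    int  // next write position, always in [0, len(data))
	filled bool // true once the buffer has wrapped at least once
}

func newRing(size int) *ring {
	return &ring{data: make([]int, size)}
}

// push overwrites the oldest slot and wraps idx back to 0 before it can
// reach len(data), so data[idx] is always a valid access.
func (r *ring) push(v int) {
	r.data[r.idx] = v
	r.idx++
	if r.idx == len(r.data) {
		r.idx = 0
		r.filled = true
	}
}

// ordered returns a copy of the samples from oldest to newest.
func (r *ring) ordered() []int {
	if !r.filled {
		return append([]int(nil), r.data[:r.idx]...)
	}
	out := make([]int, 0, len(r.data))
	out = append(out, r.data[r.idx:]...) // oldest part
	out = append(out, r.data[:r.idx]...) // newest part
	return out
}

func main() {
	r := newRing(4)
	for _, v := range []int{13, 13, 14, 13, 79} {
		r.push(v)
	}
	fmt.Println(r.data)      // storage order: [79 13 14 13]
	fmt.Println(r.ordered()) // natural order: [13 14 13 79]
}
```

With the natural order, the most recent sample is always the last element, which matches the "current" value in the log line.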

Why CPU Dump didn't work

I run holmes inside my application with the following options.

       // other initial


	h, _ := holmes.New(
		holmes.WithCollectInterval("5s"),
		holmes.WithCoolDown("1m"),
		holmes.WithDumpPath("/tmp"),
		holmes.WithCPUDump(1, 25, 80),
		holmes.WithCPUMax(90),
	)
	h.EnableCPUDump()

	// start the metrics collect and dump loop
	h.Start()

      // server start

Then I ran a CPU-hogging shell script. The following screenshot shows the output of the top command; CPU usage is almost 100%.
image

But when I check holmes.log, it shows a CPU usage of 0%.

image

log printing is not normal

In (*Holmes).EnableDump, the current CPU percent should be taken from curCPU rather than cpu, as in the following code:

func (h *Holmes) EnableDump(curCPU int) (err error) {
	if h.opts.CPUMaxPercent != 0 && curCPU >= h.opts.CPUMaxPercent {
		return fmt.Errorf("current cpu percent [%v] is greater than the CPUMaxPercent [%v]", curCPU, h.opts.CPUMaxPercent)
	}
	return nil
}

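Why record CPU usage and goroutine count?

As I read the README, the problem being solved is that a process killed by the OOM killer leaves no scene behind to inspect.
In that case, would it be enough to monitor only the memory usage of the physical machine or container, without monitoring CPU usage and goroutine count?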

"open file failed" when dump path does not exist

When using a dump path that does not exist, for example

holmes.WithDumpPath("./tmp"),

Expected behavior:
The goroutine profile should be dumped successfully.

Current behavior:
An error like the following occurs:

2022-03-29 17:07:06,609 [ERROR] failed to write profile to file(tmp/goroutine..20220329170706.609.log), err: pprof goroutine open file failed : open tmp/goroutine..20220329170706.609.log: no such file or directory

Source Code

package main

import (
	"fmt"
	"runtime"
	"time"

	"mosn.io/holmes"
	mlog "mosn.io/pkg/log"
)

func main() {
	logger := holmes.NewStdLogger()
	logger.SetLogLevel(mlog.INFO)

	h, _ := holmes.New(
		holmes.WithCollectInterval("1s"),

		holmes.WithTextDump(),
		holmes.WithDumpPath("./tmp"),
		// dump will happen when current_goroutine_num > 500 && current_goroutine_num < 1500
		holmes.WithGoroutineDump(500, 0, 500, 1500, 40*time.Second),
		holmes.WithLogger(logger),
	)
	h.EnableGoroutineDump()
	h.Start()

	spawnGoroutine(490)

	for {
		fmt.Println(time.Now(), "Number of goroutines:", runtime.NumGoroutine())
		spawnGoroutine(10)
		time.Sleep(10 * time.Second)
	}
}

func spawnGoroutine(n int64) {
	for i := int64(0); i < n; i++ {
		go func() {
			time.Sleep(500 * time.Minute)
		}()
	}
}

go.mod

module test_holmes

go 1.16

require (
	mosn.io/holmes v1.0.0
	mosn.io/pkg v0.0.0-20211217101631-d914102d1baf
)

a code issue in releases

While using holmes, I found a code issue in the v1.0.1 and v1.1.0 packages that is not present in the master branch on GitHub. Should a new version tag be created?
holmes.go:
Error code:
func (h *Holmes) DisableMemDump() *Holmes {
	h.opts.gCHeapOpts.Enable = false
	return h
}
Correct code:
func (h *Holmes) DisableMemDump() *Holmes {
	h.opts.memOpts.Enable = false
	return h
}

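Pass more detailed scene information when reporting

Given the needs of online alerting scenarios, it would be great if the current value of the metric, the average of the previous n-1 samples, and the metric's related configuration (min, max, abs, diff, etc.) could be passed in as parameters of Report.
Currently, reason is the only usable parameter, and its content is not very readable.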

Shadowing between different variables with the same name

I found many variables that have different meanings but share the same name and type, and one of them is a global.
I think that is unsafe. For example, if I change the func signature at line 270 to cpuCheckAndDump(cpuUsage int), everything still compiles: the cpu variable at line 280 will silently be resolved to the global, and the IDE/editor/compiler will not even warn. (The same goes for mem.)

holmes/holmes.go

Lines 269 to 284 in 798377c

// cpu start.
func (h *Holmes) cpuCheckAndDump(cpu int) {
	if !h.opts.CPUOpts.Enable {
		return
	}
	if h.cpuCoolDownTime.After(time.Now()) {
		h.logf("[Holmes] cpu dump is in cooldown")
		return
	}
	if triggered := h.cpuProfile(cpu); triggered {
		h.cpuCoolDownTime = time.Now().Add(h.opts.CoolDown)
		h.cpuTriggerCount++
	}
}

holmes/consts.go

Lines 44 to 49 in 798377c

const (
	mem configureType = iota
	cpu
	thread
	goroutine
)

cpu, mem, gNum, tNum, err := collect()

In my opinion, it would be better to rename the constants in consts.go, for example from cpu to cpuType.

A sudden CPU spike triggered a dump, but analyzing the dumped file with go tool pprof does not seem to reveal anything

Code:
h, err := holmes.New(
	holmes.WithProfileReporter(r),
	holmes.WithCollectInterval("30s"),
	holmes.WithDumpPath(pprofDumpPath),
	holmes.WithLogger(holmes.NewFileLog(holmesLogFilePath, mlog.INFO)),
	holmes.WithCPUDump(30, 50, 60, time.Minute),
	holmes.WithCPUMax(90),
	holmes.WithCGroup(true),
)
h.EnableCPUDump()

// start the metrics collect and dump loop
h.Start()

holmes log
image

pprof_file
wecom-temp-151278-ae93f62a8cb7ddf44a471a9863a220a7

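It looks like the CPU spiked briefly and then returned to normal. Is this a problem with the collection interval or the parameter settings?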

add license and release for prod

Some users want to use holmes in their production environment but hesitate because of the missing license and the beta version. Most features of holmes are already stable enough.

v1 should be released soon.

Another question is which license this library should use. @nejisama @doujiang24 @taoyuanyuan
Apache v2?

support stack debug

  • we can get allgs by go:linkname demo

  • can print all goroutines with goroutine id by debug/pprof/goroutine?debug=2

So we can output the stack size of every goroutine.

Is the DisableMemDump method a typo?

// DisableMemDump disables the mem dump.
func (h *BHolmes) DisableMemDump() *BHolmes {
	h.opts.gCHeapOpts.Enable = false
	return h
}
Shouldn't it be:
// DisableMemDump disables the mem dump.
func (h *BHolmes) DisableMemDump() *BHolmes {
	h.opts.memOpts.Enable = false
	return h
}

Has anyone run into this problem?

Similar to golang/go#40974

Running go test in GoLand works fine.

Running the go install command in the project produces the following error:
/usr/local/Cellar/go/1.15.2/libexec/pkg/tool/darwin_amd64/link: running clang failed: exit status 1
ld: sectionForAddress(0x13654DE) address not in any section file '/var/folders/rr/2rdmy50d1jb0tjnjfhcbnbqh0000gn/T/go-link-273425566/go.o' for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

The problem went away after upgrading Go to 1.15.5.
