
rebot's Introduction


ReBot

The rebot tool identifies machines on the M-Lab infrastructure that are no longer reachable and, according to the criteria below, should be rebooted. It then attempts to reboot them through iDRAC.

Criteria for reboot candidates

ReBot checks the following criteria to determine whether a machine needs to be rebooted.

  • machine is offline - port 806 down for the last 15m
  • machine is not lame-ducked - lame_duck_node is not 1
  • site and machine are not in GMX maintenance - gmx_machine_maintenance and gmx_site_maintenance are not 1
  • switch is online - probe_success{instance=~"s1.*", module="icmp"} has not been 0 for the whole of the last 15m
  • there are no NDT tests running - rate(inotify_extension_create_total{ext=".s2c_snaplog"}[15m]) is 0 or not present
  • metrics are actually being collected for all probes (i.e. prometheus was up)
    • count_over_time(probe_success{service="ssh806", module="ssh_v4_online"}[15m]) >= 14
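The criteria above can be read as a single conjunction: a machine is a candidate only if every condition holds. The following is a minimal Go sketch of that reading, not ReBot's actual implementation. The MachineState type, its field names, and IsCandidate are hypothetical and assume the Prometheus results have already been reduced to one value per machine.

```go
package main

import "fmt"

// MachineState is a hypothetical snapshot of the signals listed above,
// already reduced to one value per machine. It is not a type from ReBot.
type MachineState struct {
	Port806DownFor15m   bool    // machine offline: port 806 down for the last 15m
	LameDuck            bool    // lame_duck_node is 1
	MachineMaintenance  bool    // gmx_machine_maintenance is 1
	SiteMaintenance     bool    // gmx_site_maintenance is 1
	SwitchDownFor15m    bool    // switch ICMP probe has been 0 for the last 15m
	NDTTestRate         float64 // 15m rate of new .s2c_snaplog files (0 if absent)
	ProbeSamplesLast15m int     // count_over_time of the ssh806 probe over 15m
}

// IsCandidate reports whether every criterion in the list holds.
func IsCandidate(s MachineState) bool {
	return s.Port806DownFor15m &&
		!s.LameDuck &&
		!s.MachineMaintenance &&
		!s.SiteMaintenance &&
		!s.SwitchDownFor15m &&
		s.NDTTestRate == 0 &&
		s.ProbeSamplesLast15m >= 14
}

func main() {
	offline := MachineState{Port806DownFor15m: true, ProbeSamplesLast15m: 15}
	fmt.Println(IsCandidate(offline)) // true

	// Same machine, but its switch is also down: not a candidate.
	offline.SwitchDownFor15m = true
	fmt.Println(IsCandidate(offline)) // false
}
```

Keeping the check as one boolean expression makes it easy to see that any single failing criterion vetoes the reboot.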

Additionally, ReBot checks the following:

  • the machine has not already been rebooted in the last 24 hours
  • no more than 5 machines should be rebooted together at any time

rebot's People

Contributors

nkinkade, pboothe, robertodauria, rschulman, salarcon215


rebot's Issues

Make Rebot run as a system user

Currently, Rebot runs as salarcon on eb.measurementlab.net. It does so because it uses salarcon's .netrc file to authenticate to Nagios to get status from baselist, and salarcon's PLC account to run the drac.py command from the operator repo. Finally, because the script runs as salarcon, it uses [email protected] to send status emails.

Three changes should happen:

  1. Make a system account to access Nagios, change the script to use the system account, OR cause the script to run baselist locally on eb rather than making a request to http://nagios.measurementlab.net/baseList?show_state=1&service_name=ssh&plugin_output=0&show_problem_acknowledged=1, etc. The decision was made to call a URL to keep the script portable, but maybe that trade-off doesn't pay off.

  2. Use a different PLC account to run drac.py OR somehow make a local copy of DRAC passwords from the PLC database (or otherwise break the dependency on PLC for DRAC passwords).

  3. Cause the script to run as a system user on eb, and make sure that user is allowed to post to the [email protected] Google Group. Current admins for the group are critzo, soltesz, salarcon, and kinkade.

Include a unique value in the subject line for better tracking / filtering

The current alert email subject is the same every time, so all messages from Rebot show up in a single thread in some mail clients.

Consider adding a unique value like the date/time (or whatever makes sense) to the subject line of Rebot alert emails.

Perhaps discuss in Ops standup what this format ought to look like and apply to all alert notifications from different systems?

Rebot sometimes attempts to reboot nodes when the switch is offline

This was observed during the week of Jun 1 on the SRE overview dashboard, when Rebot tried to reboot nodes at GRU01, ORD05 and IAD03 while these sites were offline.

I suspect this is a timing issue that occurs when Rebot runs its SwitchQuery before the switch has been offline continuously for 15 minutes, but this needs to be investigated further.
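To make the suspected gap concrete, here is a small Go sketch of an "offline for the whole window" check applied to per-minute probe samples. It illustrates the hypothesis only and is not a diagnosis or Rebot's code: downForWindow and the sample data are hypothetical, and the sketch assumes one probe sample per minute.

```go
package main

import "fmt"

// downForWindow reports whether every probe sample in the window is 0,
// i.e. the target was down continuously. An empty window counts as not
// down (an assumption of this sketch).
func downForWindow(samples []int) bool {
	if len(samples) == 0 {
		return false
	}
	for _, s := range samples {
		if s != 0 {
			return false
		}
	}
	return true
}

func main() {
	// 15 one-minute samples; 0 means the probe failed.
	machine := make([]int, 15) // machine probe failed for the whole window
	sw := make([]int, 15)
	sw[0] = 1 // the switch still answered one probe at the start of the window

	machineOffline := downForWindow(machine)
	switchOffline := downForWindow(sw)

	// The machine looks offline while the switch does not yet count as
	// offline, so the machine would pass the switch check.
	fmt.Println(machineOffline, switchOffline) // true false
}
```

If this is what happens, a single successful switch probe inside the window is enough for the machines behind that switch to be treated as reboot candidates.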
