puppet

This repository contains the Puppet modules used to maintain and configure the servers and desktops used by the Open Computing Facility at UC Berkeley.

These modules are generally intended to be used on the latest Debian stable release, though they will probably also work on Debian-derived distros (such as Ubuntu).

This README outlines development practices for OCF volunteer staff members. If you're a member of the UC Berkeley community and interested in getting involved, check us out!

Making and testing changes

Puppet environments

Every staffer owns a puppet environment: a copy of this repository that you can use to test changes to this puppet code.

Puppet environments are stored on the puppetmaster in /opt/puppet/env/. Each staffer's environment has the same name as their user name:

ckuehl@lightning:~$ ls -l /opt/puppet/env/
drwxr-xr-x 6 ckuehl  ocf  4.0K Nov  5 11:04 ckuehl
drwxr-xr-x 5 daradib ocf  4.0K Aug 11 22:53 daradib
drwxr-xr-x 5 tzhu    ocf  4.0K Sep  2 17:46 tzhu
drwxr-xr-x 6 willh   ocf  4.0K Oct  9 13:56 willh

The puppetmaster has the service CNAME puppet, so you can connect to it via ssh puppet.

You should make your changes in your puppet environment and test them before pushing them to GitHub to be deployed into production.

Setting up your puppet environment

If you're using your puppet environment for the first time, there's a little setup you'll have to do. cd into your puppet environment (/opt/puppet/env/you) and run:

you@lightning:/opt/puppet/env/you$ git pull
you@lightning:/opt/puppet/env/you$ make

This will update your puppet environment to the latest version on master and install the appropriate third-party modules and the pre-commit hooks.

Testing using your puppet environment

Before pushing, you should test your changes by switching at least one of the affected servers to your puppet environment and triggering a run. Changing environments requires root, so if you don't have root, you will need to ask a root staffer to change the environment.

If you have root, you can change a host's environment with the puppet-trigger command:

ckuehl@supernova:~$ ssh raptors
ckuehl@raptors:~$ sudo puppet-trigger -te ckuehl

This changes the environment to ckuehl and triggers a run.

Make sure to switch the environment back to production after pushing your changes.

Linting and validating the puppet config

We use pre-commit to lint our code before committing. The main checks are:

  • Parsing puppet manifests for syntax errors (puppet parser validate)
  • Validating Ruby erb templates for syntax errors
  • Linting puppet manifests to ensure a consistent style (puppet-lint)
  • Running a bunch of standard Python linters (the same ones we use for all of our Python projects)

While some of the rules might seem a little arbitrary, they help keep the style consistent and ensure that annoying things like trailing whitespace don't creep in.

You can simply run make install-hooks to install the necessary git hooks; once installed, pre-commit will run every time you commit.

Alternatively, if you'd rather not install any hooks, you can use make test to run the checks against every file on demand. This is what Jenkins does before deploying your change.

Deploying changes to production

GitHub is the authoritative source for this repository; at all times, the production environment on the puppetmaster will be a clone of the master branch on GitHub (we use Jenkins to keep it up-to-date).

Pushing to GitHub will immediately update the production environment, but your changes will not take effect until the puppet agent runs on each server (every 30 minutes, at an arbitrary offset). You can use the puppet-trigger script if you want it to happen faster.

Conventions and styling

Naming conventions

All OCF modules that are primarily intended for OCF use (currently, all of them) should be prefixed with ocf_.

For modules that apply only to a specific service (such as the MySQL server), use the service CNAME (such as mysql) for the module name. Otherwise, use common sense to come up with a reasonable name (e.g. ocf_desktop).

For manifests that don't refer to a service but are commonly used, such as one that sets up LDAP/Kerberos authentication (used on every server) or the SSL bundle generation manifest (used by lots of servers), consider just creating a new class under the ocf module.

Try not to refer to servers by hostname (such as lightning). Instead, use the service CNAME (such as puppet) or the top-level variables $::hostname and $::fqdn.

Including third-party modules

Third-party modules can be helpful. Try to only use ones that are actively maintained.

We use r10k to include third-party modules in our config. This has benefits over storing them in a global directory on the puppetmaster (e.g. with the puppet module tool), and is easier to manage than using git submodules (see the sketch after this list):

  • This puppet config repository is self-contained
  • Adding and updating modules can be tested in an environment before being inflicted on every server
  • Staff members can test third-party modules without needing root on the puppetmaster
  • Modules can be installed from Puppet Forge without needing to have a git repository
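
As a rough illustration (module names and versions here are hypothetical, not necessarily what this repo pins), Puppetfile entries look something like this:

# A module from the Puppet Forge, pinned to a version:
mod 'puppetlabs/apt', '4.5.1'

# A module straight from a git repository:
mod 'nginx',
  :git => 'https://github.com/voxpupuli/puppet-nginx',
  :ref => 'v0.11.0'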

Styling

In lieu of an actual style guide, please try to make your code consistent with the existing code (or help write a style guide?), and ensure that it passes validation (including pre-commit).

Minimal config file management

Try to change as few things as possible; this makes upgrading to newer versions of packages and operating systems easier, as well as making it more obvious to future staffers what options you actually changed.

Instead of overwriting an entire config file just to change one value, try to use augeas or sed to change just the necessary values, as in the sketch below.
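
For instance, a minimal augeas sketch (the option and service here are illustrative, not something this repo necessarily manages):

# Flip a single sshd option in place instead of templating the whole file.
augeas { 'sshd-disable-password-auth':
  context => '/files/etc/ssh/sshd_config',
  changes => 'set PasswordAuthentication no',
  notify  => Service['ssh'],
}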

Future improvements

  • Trigger puppet runs automatically after production is updated
  • Better monitoring of puppet runs (e.g. to see when a server has not updated recently, which is a common problem on desktops)

Issues

nix jessie code

Thanks to @jvperrin we're now an all-stretch shop, minus the temporary old-death resurrection. We can now kill any conditionals targeting jessie in the codebase, and start replacing them with conditionals targeting buster, lol.

move desktop files in Xsession

Currently the .desktop files in ~/Desktop are written from the Xsession file directly. It would make a lot more sense to just put them in /etc/skel/Desktop as actual files.

Printing improvements

  • Printing 2 copies of a double-sided document with an odd number of pages causes the second copy to start on the back side of the last page of the first copy. Suggested solution: tell CUPS not to duplicate pages and figure out how to add blank pages and copies manually in GhostScript.
  • Some .docx files converted to PDF by Google Docs cause PostScript errors on the printer. So far they've printed fine after passing through pdf2ps first.
  • Figure out a way to print non-letter-size jobs. Sample PDF.
  • Implement an alerting system for desktops, including notifications for jobs received by the printer, completed jobs, and pages remaining. I think @jvperrin is making some progress on this using dbus, but alternative solutions would be welcome.

Fix docker socket group override

We currently set the Docker socket group by adding a systemd override for docker.socket, but it seems that lately dockerd no longer uses systemd to manage its socket. Hence, our override needs to be updated.

The way to go is probably to override the docker.service command to pass the -G flag with the appropriate group. (Unless someone knows of a better way.)
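
A minimal sketch of that override, managed as a systemd drop-in from puppet (the group name, drop-in path, and daemon-reload exec are assumptions; the new ExecStart must match the packaged unit's other flags):

# Clear the packaged ExecStart and re-run dockerd with -G.
file { '/etc/systemd/system/docker.service.d/socket-group.conf':
  content => "[Service]\nExecStart=\nExecStart=/usr/bin/dockerd -H fd:// -G ocfdocker\n",
  notify  => Exec['systemd-daemon-reload'],  # assumed to run systemctl daemon-reload
}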

sks-keyserver: dump keys to mirrors

There aren't that many high-quality sks keyserver keydumps available for starting new servers from. Most of the existing ones are low-bandwidth, slow, or unreliable at best, so I think it would be valuable for us to offer a keydump. We can put it on mirrors and automatically get HTTP(S), rsync, and FTP as well, at a cost of only ~12GB of disk space.

Here is my tentative proposal for how to do it, but I'm not sure it's a good or tractable design.

Some constraints:

  1. The keyserver itself needs to be turned off during the dump (~3 minutes)
  2. The dump command (sks dump $count $output_directory $output_prefix) is a little too smart - it tries to split the dump into separate files (each with $count keys) and write them to the specified output directory itself, rather than dumping to a single file / stdout and letting us split/compress it ourselves afterwards.

My proposal would be something like the following:

Write a sync-archive-style script in ocf_mirrors that uses a keytab (like ocfbackups does) to log into the keyserver and:

  1. rm -r /var/lib/sks/dump/*
  2. systemctl stop sks-keyserver
  3. docker run --rm -it -w /var/lib/sks -v /var/lib/sks:/var/lib/sks zhusj/sks sks dump 20000 /var/lib/sks/dump sks-dump
  4. systemctl start sks-keyserver
  5. Then, back on the host: rsync -azH --delete something@pgp:/var/lib/sks/dump /opt/mirrors/ftp/sks-keydump/

what do y'all think?

add vsftpd and rsyncd transfer stats to mirrors statistics

/var/log/vsftpd.log contains statistics on files downloaded from mirrors over FTP, and syslog contains log lines from rsyncd (grepping for rsync and 'total size' works as a filter). We might be able to extract this information and add it to the mirrors statistics.

Augment mirrors monitoring

We should add multiple upstreams to healthcheck against and take the max of their last_updated timestamps (or the analogue for other distros). This will prevent the situation where both we and our upstream go out of date and we therefore don't get any alerts from the healthcheck.

We should also add Prometheus monitoring to the systemd units that perform the syncing. If any unit goes into the failed state, we can proactively fix our sync scripts before getting the alert that we are days out of date. This will help our "out-of-date" averages if we want to apply to be a tier 1 mirror.

Move off `validate_re`

The validate_re function is deprecated in recent versions of stdlib and should be replaced with abstract data types, as sketched below.
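
For example (class and parameter names are illustrative), the migration looks roughly like this:

# Before: stdlib's deprecated function.
class example ($ensure) {
  validate_re($ensure, '^(present|absent)$')
}

# After: let the type system do the check.
class example (
  Enum['present', 'absent'] $ensure = 'present',
) {
}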

configure 802.3ad LACP on hypervisors and switch

Last week I reorganized all the 10GbE DAC cables for the hypervisors, such that each hypervisor has both ports on its SolarFlare NIC plugged into the Arista 7050 switch, in consecutive interfaces. For example, riptide is plugged into Ethernet 1 and Ethernet 2 on the switch, which are physically the leftmost top and bottom ports. Each hypervisor is organized this way, taking a top and bottom port, and @dongkyunc has helpfully added labels to the underside of the switch indicating which server each group corresponds to.

Now, we need to activate that second interface on each hypervisor, and preferably put both interfaces into an LACP channel group. This means that while both links are up, the virtual LACP bond interface will have an aggregate bandwidth of 20Gb/s, but if one of the links fails, the interface will not drop entirely but instead merely drop to 10Gb/s. This gives us an element of fault tolerance while doubling the existing bandwidth each server can utilize. Configuring 802.3ad will require configuration on both the hosts and the switch. @gpl and I have experimented with configuring the switch to support LACP while @cg505 and I have experimented with configuring the hosts to support bond0 as the LACP interface. Some work still needs to happen to get everything working, but I was able to configure the bond interface on dev-fallingrocks into active-passive mode before accidentally locking myself out of the machine when trying to configure the switch into LACP mode.

We will need to modify the configuration in https://github.com/ocf/puppet/blob/master/modules/ocf/manifests/networking.pp to bring up the bond interface correctly and bridge it to br0 for VMs as well. Doing this would likely make for an interesting blog post too, since much of the documentation online is rather out of date.
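
As a starting point, a minimal sketch of the host side (interface names vary per host and are assumptions here; the bridging to br0 is omitted, and the ifenslave package must be installed):

# Hypothetical ifupdown stanza for the LACP bond, managed from puppet.
file { '/etc/network/interfaces.d/bond0':
  content => @(EOT),
    auto bond0
    iface bond0 inet manual
        bond-slaves ens5f0 ens5f1
        bond-mode 802.3ad
        bond-miimon 100
    | EOT
}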

don't try to guess the correct interface in ocf::networking

Currently we do:

$br_iface = grep($ifaces_array, 'en.+')[0]

This only works if we have a single active interface connected to our machines, and if the cable is plugged into the first port. That is not a safe assumption now that we have 10GbE cards in our servers. For example, this code picks enp4s0f0 as the 'active' interface on corruption, since that is the first of the interfaces built into corruption's motherboard, but the new SolarFlare cards live at ens5, and the correct interface to write to /etc/network/interfaces is ens5f1np1.

Perhaps a custom fact like

#!/bin/bash
set -euo pipefail

# Print every ethernet interface that actually has link, rather than
# blindly taking the first one matching en*.
for i in /sys/class/net/en*; do
  iface=$(basename "$i")
  if ethtool "$iface" | grep -q 'Link detected: yes'; then
    echo "$iface"
  fi
done

would be better for selecting the correct interface(s) that are actually active and plugged into the machines.

mesos.o.b.e nginx proxies only read DNS at startup, eventually proxy to wrong mesos master

When a leader election occurs, the DNS name leader.mesos changes. Unfortunately, nginx resolves the name only once, at startup, so it continues to proxy to a master that is no longer leading (which then issues a redirect to the new leader, but with the internal, non-proxied hostname and port, so everything just doesn't work).

Seems like the commercial version of nginx supports re-resolving (boo), but there are some tricks online to get the open-source version to re-resolve, too.
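
The usual open-source trick: when proxy_pass takes its upstream from a variable, nginx re-resolves the name at request time, honoring the resolver's TTL. A sketch (the resolver address and port are assumptions):

# Re-resolve leader.mesos on each request instead of once at startup.
resolver 127.0.0.1 valid=30s;

location / {
    set $mesos_leader leader.mesos;
    proxy_pass http://$mesos_leader:5050;
}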

separate out firewall syslog and more aggressively compress/logrotate

Logs coming from the bsecure firewall are massive (at our scale). Even after disabling logging from mirrors and death, it's still 2+ GB a day.

I have some provisional rsyslog and logrotate conf changes that try to redirect firewall log entries from /var/log/remote/ to /var/log/external_firewall/<rule name>/<rule name>.log and then compress the living hell out of them to save space. Luckily they compress well, to ~10% of original size.
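
Roughly what the logrotate side might look like (paths, rotation counts, and the choice of xz are all assumptions):

# Hypothetical /etc/logrotate.d/external_firewall
/var/log/external_firewall/*/*.log {
    daily
    rotate 14
    missingok
    compress
    compresscmd /usr/bin/xz
    compressext .xz
}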

lets-encrypt-update silently failing

For some reason, the lets-encrypt-update script we use to acquire new and renewed certs for death has been silently failing to run for some time (rt#7937, rt#7901, etc.).

There's nothing relevant in the logs as far as I can tell, and we aren't getting any cron emails from failures, so perhaps there's some way to turn up the verbosity and have it report everything for a bit, to see if we can catch a failure?

Get http://rt/ to work from inside the lab

Right now that URL redirects to fluffy (instead of RT as it used to) because the nginx running on marathon-lb doesn't understand vhost aliases or something like that. We should fix that.

mysql root password preseed doesn't work

The preseed file we include for debconf (https://github.com/ocf/puppet/blob/master/modules/ocf_mysql/manifests/init.pp#L6) has two problems. First, we're now on mariadb-server-10.1, so the package name for debconf needs to be updated. More importantly, the mariadb-server-10.1 package's templates don't appear to support the mysql-server/root_password and root_password_again debconf options at all:

root@e5a1d73f257e:/# sudo debconf-show mariadb-server-10.1                                                                                                                               
  mariadb-server-10.1/nis_warning:
  mariadb-server-10.1/old_data_directory_saved:
  mariadb-server-10.1/postrm_remove_databases: false

Unless I'm interpreting this incorrectly, we'll need to change the way we preseed the root password and do initial grants.

Retry obtaining certs if domains don't match

One problem right now with the Let's Encrypt changes (in #337) is that if new domains are added in LDAP, they likely won't be in DNS yet, so the request for a new cert will fail and won't be retried, since the cert has been refreshed once and that's it.

If the domains requested don't match the cert, or maybe if the previous cert request failed, we should retry the request, since giving up isn't a good option. I'm not sure what the best option is here; one quick and dirty trick that comes to mind is a lockfile of sorts, so that if the cert fails to be obtained, a future run can check whether the file exists and schedule another attempt. I'd hope there's something cleaner than that, maybe a custom fact that shows the domains in each cert by parsing openssl x509 -in cert.pem -text output? A rough sketch of such a fact follows.
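
A sketch of that fact, assuming certs live under /etc/ssl/lets-encrypt/ (the path, fact names, and layout are all assumptions):

#!/bin/bash
# Hypothetical external fact: list the SAN domains in each issued cert.
set -euo pipefail

for cert in /etc/ssl/lets-encrypt/*/cert.pem; do
  domains=$(openssl x509 -in "$cert" -noout -text \
    | grep -o 'DNS:[^,]*' | sed 's/DNS://' | paste -sd, -)
  echo "cert_domains_$(basename "$(dirname "$cert")")=$domains"
done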

back up VM XML

We should start dumping the VM XML files and including them in backups if possible. We can have a script that logs in to the hypervisors as ocfbackups, parses the output of virsh list --all (or some more CLI-friendly equivalent), and then runs something like ssh -K ocfbackups@$hypervisor virsh dumpxml $domain > $hypervisor-$domain.xml for each domain; a rough sketch follows.
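
A rough sketch of that script (the hypervisor list and output directory are assumptions):

#!/bin/bash
# Dump every defined domain's XML from each hypervisor for backup.
set -euo pipefail

outdir=/opt/backups/vm-xml  # hypothetical destination
mkdir -p "$outdir"

for hypervisor in riptide hal; do  # hypothetical host list
  for domain in $(ssh -K "ocfbackups@$hypervisor" virsh list --all --name); do
    ssh -K "ocfbackups@$hypervisor" virsh dumpxml "$domain" \
      > "$outdir/$hypervisor-$domain.xml"
  done
done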

Purge any files in /etc/sudoers.d (except the README)

We want any added root users to go through usudo in puppet instead of being added manually, so we should manage the /etc/sudoers.d directory and make sure it is empty, except for the README file provided in it (apparently the directory needs to contain at least one file to work; see the sketch after the excerpt). An excerpt from the /etc/sudoers.d/README file:

# Note that there must be at least one file in the sudoers.d directory (this
# one will do), and all files in this directory should be mode 0440.
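
In puppet this is a recursive purge with the README carved out, roughly (a sketch, not necessarily how the module should structure it):

# Purge anything in /etc/sudoers.d that puppet doesn't manage...
file { '/etc/sudoers.d':
  ensure  => directory,
  recurse => true,
  purge   => true,
}

# ...but keep the README, which sudo needs so the directory isn't empty.
file { '/etc/sudoers.d/README':
  ensure => file,
  mode   => '0440',
}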

disaggregate postgres backups into per-database files

@ja5087 worked on getting postgres databases into our backups in #366 / #377, but we should move to per-database backups instead of a single dump of the entire server's state. This is better for recovery and will also make our backups more efficient by taking advantage of rsnapshot's hardlinking features. A sketch of the loop follows.
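
A minimal sketch of the per-database loop (connection details and output layout are assumptions):

#!/bin/bash
# Dump each database to its own file instead of one big dump.
set -euo pipefail

psql -At -c 'SELECT datname FROM pg_database WHERE NOT datistemplate;' \
  | while read -r db; do
      pg_dump "$db" > "$db.sql"
    done

# Roles and tablespaces aren't covered by pg_dump; grab them separately.
pg_dumpall --globals-only > globals.sql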

/etc/sudoers is broken if staffvm and no owner

The /etc/sudoers file becomes broken if the hiera file adding an owner is not added quickly enough, since it then contains the following line (notice the single space at the front):

 ALL=(ALL) NOPASSWD: ALL

This appears to be because the hiera lookup for an owner returns an empty string in the usudo array, leading to a line being added by the /etc/sudoers template, but without an actual user. We should probably just filter out empty or invalid users in the auth.pp manifest, as sketched below.
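
Something like this in auth.pp, perhaps (variable names are assumptions):

# Drop empty entries before they reach the sudoers template.
$usudo_users = $usudo.filter |$user| { !empty($user) }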

ocfbroker group exists on the desktops

It seems like this shouldn't be the case:

abizer@blackout:~$ check abizer
abizer:*:35934:1000:Abizer Lokhandwala:/home/a/ab/abizer:/bin/zsh
Created on: 2015-10-03
Member of group(s): ocfbroker ocfstaff ocfroot ocfdev ocfofficers

add postgres to ocf_backups

I've done some experimental work on creating an ocfpgbackups postgres role with GRANT SELECT on all databases for backups, but the actual implementation could probably be done by a new staff member to cut their teeth on working on the infra. We'd want to add a rule to rsnapshot.conf in the ocf_backups module, probably using pg_dump or pg_dumpall; see the sketch below.
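
Roughly, with a hypothetical wrapper script that runs the dump as the ocfpgbackups role (rsnapshot.conf fields must be tab-separated):

# Hypothetical rsnapshot.conf entry: run the dump and snapshot its output.
backup_script	/opt/share/backups/pg-dump.sh	postgres/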

Upgrade remaining machines from jessie to stretch

2018-10-23 03:13:07 -0700 //death.ocf.berkeley.edu/Facter (err): error while processing "/opt/puppetlabs/puppet/cache/facts.d/iface-linked" for external facts: child process returned non-zero exit status (1).

The file in question is puppet/modules/ocf/facts.d/iface-linked, line 8.

death, tsunami, and werewolves are the remaining machines currently on jessie. We should tie up the loose ends here before transitioning to buster.

Switching a desktop from staff-only to not staff-only doesn't change the wrong password message

We have a custom message for staff-only desktops that informs users who are denied entry that the desktop is staff-only. However, if a desktop was previously set to staff-only in puppet and is then set back to not staff-only, that message doesn't change.

The command that is supposed to recompile lightdm-greeter to include the new message is here, but since the subscribe points to the template's filename, lightdm-greeter will only be recompiled if the actual contents of the message template change, not when $staff_only changes as intended.

To fix this, you could write both the staff-only and non-staff-only greeter error messages to a single file on the desktop, and have the subscribe point to that file, roughly as sketched below.
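
A sketch of the shape of the fix (resource titles and paths are illustrative):

# The rendered message changes whenever $staff_only flips, so the
# subscribe actually fires.
file { '/opt/share/lightdm/greeter-message':
  content => $staff_only ? {
    true    => 'This desktop is reserved for OCF staff.',
    default => 'Incorrect username or password.',
  },
}

exec { 'recompile-lightdm-greeter':
  command     => '/usr/local/bin/recompile-lightdm-greeter',  # hypothetical path
  refreshonly => true,
  subscribe   => File['/opt/share/lightdm/greeter-message'],
}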
