puppet's Issues

mesos.o.b.e nginx proxies only read DNS at startup, eventually proxy to wrong mesos master

When a leader election occurs, the DNS name leader.mesos changes. Unfortunately, nginx only resolves the name once, at startup, so it continues to proxy to a master which is no longer leading. That master then issues a redirect to the new leader, but with the internal (non-proxied) hostname and port, so everything breaks.

Seems like the commercial version of nginx supports re-resolving (boo), but there are some tricks online to get the open-source version to re-resolve, too.
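
One of those tricks: put the upstream name in a variable. nginx re-resolves variable-based proxy_pass targets through a configured resolver instead of caching the address at config load. A minimal sketch, assuming mesos-dns answers on localhost and the master listens on its default port 5050:

resolver 127.0.0.1 valid=10s;

server {
    listen 80;

    location / {
        # a variable target forces a fresh DNS lookup (per the resolver's
        # TTL) on each request, instead of the one-time startup resolution
        set $mesos_leader leader.mesos;
        proxy_pass http://$mesos_leader:5050;
    }
}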

sks-keyserver: dump keys to mirrors

there aren't that many high-quality sks keyserver keydumps available for starting new servers from. most of the existing ones are low-bandwidth, slow, or unreliable at best, so I think it would be valuable for us to offer a keydump. We can put it on mirrors and automatically get http/s, rsync, and ftp as well, at a cost of just ~12GB of disk space.

Here is my tentative proposal for how to do it, but I'm not sure it's a good or tractable design.

some constraints:

  1. The keyserver itself needs to be turned off during the dump (~3 minutes)
  2. The dump command (sks dump $count $output_directory $output_prefix) is a little too smart - it tries to split the dump into separate files (each with $count keys) and write them to the specified output directory itself, rather than dumping to a single file / stdout and letting us split/compress it ourselves afterwards.

my proposal would be something like the following:

write a sync-archive-style script in ocf_mirrors that uses a keytab like ocfbackups to log into the keyserver and (see the sketch after the list):

  1. rm -r /var/lib/sks/dump/*
  2. systemctl stop sks-keyserver
  3. docker run --rm -it -w /var/lib/sks -v /var/lib/sks:/var/lib/sks zhusj/sks sks dump 20000 /var/lib/sks/dump sks-dump
  4. systemctl start sks-keyserver
    then, back on the host, rsync -azH --delete something@pgp:/var/lib/sks/dump /opt/mirrors/ftp/sks-keydump/
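
a rough sketch of that script, assuming the pgp host alias and the ocfbackups principal (the trap makes sure the keyserver comes back up even if the dump fails):

#!/bin/bash
set -euo pipefail

ssh -K ocfbackups@pgp '
  set -e
  rm -rf /var/lib/sks/dump/*
  systemctl stop sks-keyserver
  # restart the keyserver no matter how the dump goes
  trap "systemctl start sks-keyserver" EXIT
  docker run --rm -w /var/lib/sks -v /var/lib/sks:/var/lib/sks \
    zhusj/sks sks dump 20000 /var/lib/sks/dump sks-dump
'

rsync -azH --delete ocfbackups@pgp:/var/lib/sks/dump/ /opt/mirrors/ftp/sks-keydump/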

what do y'all think?

nix jessie code

thanks to @jvperrin we're now an all-Stretch shop, minus the temporary old-death resurrection. We can now kill any conditionals targeting Jessie in the codebase, and start replacing them with conditionals targeting buster, lol.

Retry obtaining certs if domains don't match

One problem right now with the Let's Encrypt changes (in #337) is that if new domains are added in LDAP, they likely won't be in DNS yet, so the request for a new cert will fail and won't be retried, since the cert has been refreshed once and that's it.

If the domains requested don't match the cert, or perhaps if the previous cert request failed, we should retry the request, since giving up isn't a good option. I'm not sure what the best option is here. One quick and dirty trick that comes to mind is a lockfile of sorts: if a cert fails to be obtained, a future run can check whether the file exists and schedule a retry. I'd hope there's something cleaner than this, maybe a custom fact that shows the domains in each cert by parsing openssl x509 -in cert.pem -text output?
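
A minimal sketch of such a fact, assuming certs live somewhere like /etc/ssl/lets-encrypt (the path and fact name are made up):

#!/bin/bash
# external fact: emit the SANs on each cert so puppet can compare them
# against the domains in LDAP; path and fact name are illustrative
set -euo pipefail
shopt -s nullglob

for cert in /etc/ssl/lets-encrypt/*.crt; do
  name=$(basename "$cert" .crt)
  domains=$(openssl x509 -in "$cert" -noout -text |
    grep -A1 'Subject Alternative Name' | tail -n1 |
    sed -e 's/DNS://g' -e 's/ //g')
  echo "cert_domains_${name}=${domains}"
done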

mysql root password preseed doesn't work

the preseed file we include for debconf (https://github.com/ocf/puppet/blob/master/modules/ocf_mysql/manifests/init.pp#L6) has two problems. First, we're now on mariadb-server-10.1, so the package name for debconf needs to be updated. More importantly, the mariadb-server-10.1 package's debconf templates don't appear to support the mysql-server/root_password and mysql-server/root_password_again options at all:

root@e5a1d73f257e:/# sudo debconf-show mariadb-server-10.1                                                                                                                               
  mariadb-server-10.1/nis_warning:
  mariadb-server-10.1/old_data_directory_saved:
  mariadb-server-10.1/postrm_remove_databases: false

unless I'm interpreting this incorrectly, we'll need to change the way we preseed the root password and do initial grants.
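
If so, one possible replacement is to skip debconf entirely and set the password with plain SQL after installation; a sketch, assuming a fresh Debian mariadb install lets root in over the unix socket without a password (the actual password would be templated in by puppet):

mysql --user=root <<'EOF'
-- MariaDB 10.1 syntax; newer versions could use ALTER USER instead
SET PASSWORD FOR 'root'@'localhost' = PASSWORD('some-root-password');
FLUSH PRIVILEGES;
EOF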

Augment mirrors monitoring

We should add multiple upstreams to healthcheck against and take the max of the last_updated timestamp (or the analogue for other distros). This will prevent the situation where both we and our upstream go out of date, and thus we don't get any alerts from the healthcheck.
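
A sketch of the multi-upstream check, assuming each upstream publishes an epoch timestamp in a last_updated file (the URLs are placeholders):

#!/bin/bash
# keep the freshest timestamp seen across several upstreams, so one stale
# upstream can't make us think we're up to date
set -euo pipefail

newest=0
for upstream in \
    https://mirror1.example.org/distro/last_updated \
    https://mirror2.example.org/distro/last_updated; do
  ts=$(curl -fsS "$upstream" || echo 0)
  if [ "$ts" -gt "$newest" ]; then
    newest=$ts
  fi
done

echo "freshest upstream timestamp: $newest"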

We should also add Prometheus monitoring to our systemd units that perform the syncing. If any unit goes into the failed state we can proactively fix our sync scripts before getting the alert that we are days out of date. This will also help our "out-of-date" averages if we want to apply to be a tier 1 mirror.

Switching a desktop from staff-only to not staff-only doesn't change the wrong password message

We have a custom message for staff-only desktops that informs users who are denied entry that the desktop is staff-only. However, if a desktop was previously set to staff-only in puppet and then was set to not be staff-only, that message doesn't change.

The command that is supposed to recompile lightdm-greeter to include the new message is here, but since the subscribe points to the filename, lightdm-greeter will only be recompiled if the actual contents of the message template change, instead of when $staff_only changes as intended.

To fix this, you could write both the staff-only and not-staff-only greeter error messages to a single file on the desktop, and have the subscribe point to that file.
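
A rough puppet sketch of that approach; the paths, messages, and recompile command are all illustrative:

# render whichever message applies into one managed file...
file { '/etc/lightdm/greeter-error-message':
  content => $staff_only ? {
    true    => 'This desktop is staff-only.',
    default => 'Incorrect password.',
  },
}

# ...so the recompile reruns whenever $staff_only flips the file contents
exec { 'recompile-lightdm-greeter':
  command     => '/usr/local/bin/recompile-lightdm-greeter',
  refreshonly => true,
  subscribe   => File['/etc/lightdm/greeter-error-message'],
}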

Fix docker socket group override

We currently set the Docker socket group by adding a systemd override for docker.socket, but it seems that lately dockerd no longer uses systemd to manage its socket. Hence, our override needs to be updated.

The way to go is probably to override the docker.service command to pass the -G flag with the appropriate group. (Unless someone knows of a better way.)
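
For example, a drop-in at /etc/systemd/system/docker.service.d/override.conf along these lines; the group name is a guess, and the empty ExecStart= clears the packaged command before overriding it:

[Service]
ExecStart=
# same command the packaged unit runs, plus -G; check the real unit's
# ExecStart before copying this
ExecStart=/usr/bin/dockerd -G ocfdocker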

disaggregate postgres backups into per-database files

@ja5087 worked on getting postgres databases into our backups in #366 / #377, but we should move to having backups that are per-database instead of a single dump of the entire server's state. this is better for recovery and will help make our backups more efficient as well by taking advantage of rsnapshot's hardlinking features.
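
a sketch of the per-database loop, assuming a backups role with read access (the role name matches the one proposed in the ocf_backups issue below) and an illustrative output path:

#!/bin/bash
# dump each non-template database to its own plain-text file; dumps of
# unchanged databases should come out byte-identical, which is what lets
# rsnapshot hardlink them
set -euo pipefail

dbs=$(psql -U ocfpgbackups -d postgres -At -c \
  "SELECT datname FROM pg_database WHERE NOT datistemplate;")

for db in $dbs; do
  pg_dump -U ocfpgbackups "$db" > "/opt/backups/postgres/$db.sql"
done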

configure 802.3ad LACP on hypervisors and switch

last week I reorganized all the 10GbE DAC cables for the hypervisors, such that each hypervisor has both ports on the SolarFlare NIC plugged into the Arista 7050 switch, in consecutive interfaces. e.g. riptide is plugged into Ethernet 1 and Ethernet 2 on the switch, which, physically, are the leftmost top and bottom ports on the switch. Each hypervisor is organized this way, taking a top and bottom port, and @dongkyunc has helpfully added labels to the underside of the switch indicating which server each group corresponds to.

Now, we need to activate that second interface on each hypervisor, and preferably put both interfaces into an LACP channel group. This means that while both links are up, the virtual LACP bond interface will have an aggregate bandwidth of 20Gb/s, but if one of the links fails, the interface will not drop entirely but instead merely drop to 10Gb/s. This gives us an element of fault tolerance while doubling the existing bandwidth each server can utilize. Configuring 802.3ad will require configuration on both the hosts and the switch. @gpl and I have experimented with configuring the switch to support LACP while @cg505 and I have experimented with configuring the hosts to support bond0 as the LACP interface. Some work still needs to happen to get everything working, but I was able to configure the bond interface on dev-fallingrocks into active-passive mode before accidentally locking myself out of the machine when trying to configure the switch into LACP mode.

We will need to modify the configuration in https://github.com/ocf/puppet/blob/master/modules/ocf/manifests/networking.pp to bring up the bond interface correctly and to bridge VMs onto it via br0 as well. Doing it would likely make for an interesting blog post of sorts too, since much of the documentation online for doing this is rather out of date.
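
A sketch of the host side in interfaces(5) syntax, assuming the ifenslave package and the SolarFlare port names mentioned above; the addresses are placeholders, and the switch-side channel-group configuration is separate:

auto bond0
iface bond0 inet manual
    # 802.3ad is the LACP mode; miimon polls link state so a dead member
    # drops out of the aggregate instead of blackholing traffic
    bond-slaves ens5f0np0 ens5f1np1
    bond-mode 802.3ad
    bond-miimon 100

auto br0
iface br0 inet static
    address 192.0.2.10/24
    gateway 192.0.2.1
    # bridge the VMs onto the bond instead of a single physical port
    bridge_ports bond0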

Purge any files in /etc/sudoers.d (except the README)

We want any added root users to go through usudo in puppet instead of being added manually, so we should manage the /etc/sudoers.d directory and make sure that it is empty (except for the README file that is provided in it, since apparently it needs to have some file in the directory to work). An excerpt from the /etc/sudoers.d/README file:

# Note that there must be at least one file in the sudoers.d directory (this
# one will do), and all files in this directory should be mode 0440.
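
A minimal puppet sketch: purge plus recurse deletes anything unmanaged, while the README survives because it's a managed file (mode per the excerpt above):

file { '/etc/sudoers.d':
  ensure  => directory,
  recurse => true,
  purge   => true,
}

file { '/etc/sudoers.d/README':
  ensure => file,
  mode   => '0440',
}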

lets-encrypt-update silently failing

for some reason the lets-encrypt-update script we use to acquire new and updated certs for death has been silently failing to run for some time (rt#7937, rt#7901, etc.)

there's nothing relevant in the logs, as far as I can tell, and we aren't getting any cron emails from failures. Perhaps we could turn up the verbosity / have it report everything for a bit to see if we can trigger a failure?
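
one low-effort way to do that, assuming the script runs from cron at the path below, is a wrapper that ships every line of output plus the exit status to syslog:

{
  /usr/local/bin/lets-encrypt-update
  echo "lets-encrypt-update exited $?"
} 2>&1 | logger -t lets-encrypt-update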

Printing improvements

  • Printing 2 copies of a double-sided document with an odd number of pages causes the second copy to start on the back side of the last page of the first copy. Suggested solution: tell CUPS not to duplicate pages and figure out how to add blank pages and copies manually in GhostScript (a sketch follows this list).
  • Some .docx files converted to PDF by Google Docs cause PostScript errors on the printer. So far they've printed fine after passing through pdf2ps first.
  • Figure out a way to print non-letter-size jobs. Sample PDF.
  • Implement an alerting system for desktops, including notifications for jobs received by the printer, completed jobs, and pages remaining. I think @jvperrin is making some progress on this using dbus, but alternative solutions would be welcome.
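
For the first item, a sketch of padding a job to an even page count before duplexing, assuming poppler-utils for the page count and a pre-made single blank page in blank.pdf:

# append a blank page when the count is odd, so the next copy starts on
# a fresh sheet
pages=$(pdfinfo in.pdf | awk '/^Pages:/ {print $2}')
if [ $((pages % 2)) -eq 1 ]; then
  gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite \
    -sOutputFile=padded.pdf in.pdf blank.pdf
else
  cp in.pdf padded.pdf
fi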

add vsftpd and rsyncd transfer stats to mirrors statistics

/var/log/vsftpd.log contains statistics on things downloaded from mirrors over FTP, and syslog contains log lines from rsyncd ('rsync on' and 'total size' are useful search keys). We might be able to extract this information and add it to the mirrors statistics.
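
a rough sketch of pulling numbers out of those logs; the patterns are approximations and would need checking against the real log formats:

# count of FTP downloads logged by vsftpd
grep -c 'OK DOWNLOAD' /var/log/vsftpd.log

# sum the 'total size' figures from rsyncd's syslog lines
grep 'rsyncd' /var/log/syslog |
  grep -o 'total size [0-9]*' |
  awk '{sum += $3} END {print sum}'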

back up VM XML

we should start dumping the VM XML files and including them in backups if possible. We can have a script that logs in to the hypervisors as ocfbackups, parses the output of virsh list --all (or some more CLI-friendly variant), and then runs something like ssh -K ocfbackups@$hypervisor virsh dumpxml $domain > $hypervisor-$domain.xml for each domain.
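
a sketch of that loop; the hypervisor list and the output directory are placeholders, and virsh list --all --name is the CLI-friendly variant that prints bare domain names:

#!/bin/bash
set -euo pipefail

for hypervisor in hypervisor1 hypervisor2; do
  for domain in $(ssh -K "ocfbackups@$hypervisor" virsh list --all --name); do
    ssh -K "ocfbackups@$hypervisor" virsh dumpxml "$domain" \
      > "/opt/backups/vm-xml/$hypervisor-$domain.xml"
  done
done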

separate out firewall syslog and more aggressively compress/logrotate

logs coming from the bsecure firewall are massive (at our scale). Even after disabling logging from mirrors and death, it's still upwards of 2 GB a day.

I have some provisional rsyslog and logrotate conf changes that try to redirect firewall log entries from /var/log/remote/ to /var/log/external_firewall/<rule name>/<rule name>.log and then compress the living hell out of them to save space. Luckily they compress well, to ~10% of original size.
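
the logrotate side could look something like this (retention and the xz level are guesses):

/var/log/external_firewall/*/*.log {
    daily
    rotate 30
    missingok
    compress
    compresscmd /usr/bin/xz
    compressext .xz
    compressoptions -9
}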

/etc/sudoers is broken if staffvm and no owner

The /etc/sudoers file becomes broken if the hiera file to add an owner is not added quickly enough, since it contains the following line (notice the single space at the front):

 ALL=(ALL) NOPASSWD: ALL

This appears to be because the hiera lookup for an owner returns an empty string in the usudo array, leading to a line being added by the /etc/sudoers template, but without an actual user. Probably we should just filter out empty or invalid users in the auth.pp manifest?
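
The filter could be as small as this (variable and structure per the issue; exact names are assumptions):

# drop empty entries before they reach the sudoers template
$usudo_users = $usudo.filter |$user| { !empty($user) }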

Get http://rt/ to work from inside the lab

Right now that URL redirects to fluffy (instead of RT as it used to) because the nginx running on marathon-lb doesn't understand vhost aliases or something like that. We should fix that.

don't try to guess the correct interface in ocf::networking

currently we do

$br_iface = grep($ifaces_array, 'en.+')[0]

this only works if we have a single active interface connected to our machines, and if the cable is plugged into the first port. This is not a safe assumption now that we have 10GbE cards in our servers. For example, this code picks enp4s0f0 as the 'active' interface on corruption, as this is the first of the interfaces built into corruption's motherboard, but the new SolarFlare cards we have live at ens5, and the correct interface we should be writing to /etc/network/interfaces is ens5f1np1.

Perhaps a custom fact like

#!/bin/bash
# external fact: emit the ethernet interfaces that actually have link,
# in facter's key=value format (the fact name is illustrative)
set -euo pipefail
shopt -s nullglob

active=''
for i in /sys/class/net/en*; do
  iface=$(basename "$i")
  if ethtool "$iface" | grep -q 'Link detected: yes'; then
    active="${active:+$active,}$iface"
  fi
done

echo "active_ifaces=$active"

would be better for selecting the correct interface(s) that are actually active and plugged into the machines.

ocfbroker group exists on the desktops

seems like this shouldn't be the case:

abizer@blackout:~$ check abizer
abizer:*:35934:1000:Abizer Lokhandwala:/home/a/ab/abizer:/bin/zsh
Created on: 2015-10-03
Member of group(s): ocfbroker ocfstaff ocfroot ocfdev ocfofficers

Upgrade remaining machines from jessie to stretch

2018-10-23 03:13:07 -0700 //death.ocf.berkeley.edu/Facter (err): error while processing "/opt/puppetlabs/puppet/cache/facts.d/iface-linked" for external facts: child process returned non-zero exit status (1).

File in question is here: puppet/modules/ocf/facts.d/iface-linked on line 8

death, tsunami, and werewolves are the remaining machines currently on jessie. We should tie up the loose ends here before transitioning to buster.

move desktop files in Xsession

Currently the .desktop files in ~/Desktop are written from the Xsession file directly. It would make a lot more sense to just put them in /etc/skel/Desktop as actual files.

add postgres to ocf_backups

I've done some experimental work on creating an ocfpgbackups postgres role with GRANT SELECT on all databases for backups, but the actual implementation could probably be done by a new staff member to cut their teeth on working on the infra. We'd want to add a rule to rsnapshot.conf in the ocf_backups module, probably using pg_dump or pg_dumpall.
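
The rsnapshot side could then be a single backup_script line pointing at a wrapper that runs the dumps (fields are tab-separated; the script path is hypothetical):

backup_script	/usr/local/bin/ocf-postgres-dump	postgres/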

Move off `validate_re`

The validate_re function is deprecated in recent versions of stdlib, and should be replaced by using abstract data types.
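
For example, a regex check on a parameter becomes a type on the parameter itself:

# before: deprecated stdlib function inside the class body
validate_re($ensure, '^(present|absent)$')

# after: an abstract data type enforced at the parameter declaration
Enum['present', 'absent'] $ensure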
