gomods / athens Goto Github PK

View Code? Open in Web Editor NEW

4.3K 95.0 485.0 83.61 MB

A Go module datastore and proxy

Home Page: https://docs.gomods.io

License: MIT License

Go 94.21% Makefile 0.90% Shell 2.68% Dockerfile 0.57% PowerShell 1.57% HCL 0.07%

golang dependencies package-control proxy-server dependency-manager package-manager go-modules goproxy go athens

athens's People

Contributors

Stargazers

Watchers

Forkers

kr makhov lcd1232 marpio oldmantaiter lewiscowper chrispickard crezam pombredanne marstr robertwe arschles marwan-at-work kunde21 tobiaskohlbau michalpristas letory tbpg lucasallan fedepaol lypht es manugupt1 ramyakailas carolynvs rohancme devslives vagruchi kdvy fhs florenciacomuzzi berndverst roshanraj kteb andeeliao franciscocpg adyach mydiemho jeremyrickard dioms ebati joshgav toffentoffen artmello tklauser ndedic hookttg 321dao sauvaget codemonkeysoftware dapchen sjeanpierre stephanhattingh zhanglei 45cali donmcnamara deoakshay yangzx554 volodimyr imskyer forging2012 alexshilucky dut3062796s rottylee bsmr davidmr001 getsgock awesomegolang influx6 sahitpj ljavargo two gcstang pendergraft robbert229 scriptonist piamo msnetc haritsfahreza atul9 zkry rvaughan ellerbrock cloud-architecture brainsnail adinabudriga ravitezu bhaveshpraveen adelowo t-tomalak steakunderscore lgoeller hannansolo cuigh grahams58 som-snytt voutasaurus gotoolkit jarlefosen hp-plus5

athens's Issues

Tests for gomod parsers

After #2 is done
Tests for parsers need to be implemented

proxy: add background worker to fetch requested modules that aren't present locally

Wire up gofmt and golint to CI

Make the GoGet middleware configurable

You should be able to configure it to redirect to any DNS name or IP - including the proxy IP/name, if running in proxy mode. Right now, it's hard-coded to gomods.io

Proposal: Implement MS SQL Server storage driver

For completeness and all SQL Server users.
Pop doesn't cover that but there is go-mssqldb.

Longer term, when Athens reaches a beta release (maybe earlier, when we reach some stability), I want the project to have some milestones so we can plan each release. I volunteer as tribute to do this.

Add Open Tracing support

[ ] to all binaries
[ ] add OT Collector to the docker-compose file.

Split proxy storage, CDN storage, and actions between proxy and registry

The proxy and the registry are different enough that much of the packages and codebase needs to be split apart. For example:

The registry needs to create a CDN getter
The registry should not declare handlers for the metadata/code download paths
- It should instead use a meta redirect to a CDN location
The proxy needs to create storage
The proxy needs to declare handlers for metadata/code download paths

Ref #44

Proxy Architecture Considerations

Athens Proxy Architecture

The Athens repository builds three logical artifacts:

The proxy server
The registry server
The (crude) CLI

This document is focused on discussing the systems architecture and challenges in the proxy.

Registry architecture will be covered in a different document.

Code layout and architecture, dependency management considerations, and discussion on the CLI are out of scope and may be covered in a separate document as well.

The Proxy Server

The proxy server has 2 major responsibilities:

Cache modules locally, in some storage medium (e.g. disk, relational DB, mongoDB...)
On a cache miss, fill the cache from an upstream

Local caching is achieved fairly simply by using existing storage systems. As I write this, we have disk, memory and mongo based storage (see https://github.com/gomods/athens/tree/master/pkg/storage), with relational DB support in progress.

Challenges come up when we introduce cache misses and the cache filling mechanism. Our current plan on (verbal) record is to do the following when vgo requests a module that isn't in the cache:

Return a 404
1. This will effectively tell vgo to get the module upstream somewhere (e.g. a VCS or hosted registry)
Start a background process to fetch the module from upstream

We have 2 challenges here:

How to run background jobs
How to serialize cache fills (to prevent a thundering herd)

Running Background Jobs

Just running background jobs in isolation (challenges will come later 😄) is relatively easy. We use the Buffalo framework, and it gives us built-in, pluggable background jobs support.

The two documented (on the Buffalo site) implementations are in-memory (i.e. a goroutine) and redis (using gocraft/work). We can use the background jobs interface to submit background jobs and we can consume the background jobs from a long-running task.

Aside from the data storage system, the proxy will have two moving parts (the API server and the background workers). Since this software might be deployed by anyone on their infrastructure, a proxy operator is gonna have to figure out how to deploy the database, API server and background worker (and probably a queueing system, depending on the work queue type) on their own. I ❤️ Kubernetes, so I'd like to initially provide Docker images for our software (the API server and background worker), a Helm chart, and really good documentation on how to easily configure and deploy this thing to any Kubernetes cluster. Over time, I hope other folks will contribute documentation to help others deploy into other environments.

Serializing Cache Fills

Suppose you just started up an Athens proxy and everything it needs and you spread the word throughout your company. You have 1000 engineers in your organization and you expect all of them to be heavily using the proxy, so you start 50 API servers and 1000 background workers.

On day 0, all 1000 of the engineers set up GOPROXY and run vgo get github.com/gorilla/mux. They all get a 404 and vgo correctly downloads the package from Github (let's assume everyone has set up their .netrc properly so they don't get rate limited).

On the backend, the proxy has started up 1000 background jobs all to get the same package from Github, and then they all race to write it to the database. The problem is compounded on 2 dimensions: the number of engineers running vgo and the number of imports and transitive dependencies in the codebase.

We need to prevent this behavior!

Invariants

To start, I believe we should treat the cache as write-only. Once [email protected] is written to the cache, it can't be deleted or modified in any way (except by manual operator intervention).

Next, I believe we should aim for these invariants, modulo manual intervention and failure modes (those will be covered later):

If N>1 requests come in for [email protected], we should start exactly zero or one background job between time t0 and tX, where X is when the cache has been filled
If the cache is filled at time t0, no background jobs should ever be started to fill [email protected]
On a cache miss for [email protected], only one background job should ever be started between time t0 and tX, where X is the time at which the cache was filled
No background job should ever be started to fill [email protected] after time tX

In order to maintain these invariants in our proxy, we'll need to coordinate on background jobs. We certainly need to support multi-node deployments (like the 1000 engineer scenario above), so we'll need to distribute the coordination mechanism.

Finally, I believe in adding the absolute least amount of complexity in order to get this job done. My proposal is below.

Distributed Coordination of Background Jobs

The immutable cache helps us here for two reasons:

It speeds up our serialization protocol
It simplifies our serialization protocol & code

Currently, when an API server gets a GET /module/@v/vX.Y.Z.{mod,zip,info}, it checks the cache and returns 404 if [email protected] doesn't exist. It also starts up a background cache-fill job to fetch [email protected].

I propose that we keep that behavior. Note that the API server doesn't participate in any concurrency control protocol. I am limiting concurrency control entirely to background jobs. I suggest that we do this because the API is in the critical path of all vgo get operations (in proxy deployments). I want to keep this code as simple as possible.

On to background jobs. I propose that we add leases to protect individual module@version cache entries. Here's how that would look (in pseudocode):

if exists_in_db("[email protected]") {
	exit()
}
// run_with_lease only runs the function (second parameter) if the lease for 
// "[email protected]" was acquired. when the function exits, the lease is 
// given back up. If the lease couldn't be acquired, do nothing
run_with_lease("[email protected]", {
	// get module metadata & zipfile from upstream
	module = download_module_from_upstream("[email protected]")
	// put all module metadata & zipfile into the cache entry 
	insert_module_to_cache(module)
})

We can then build on this protocol for fetching lists of modules (i.e. handling GET /module/@v/list requests):

if exists_in_db("list:module") {
	exit()
}
versions = []
run_with_lease("list:module", {
	// just get the list of versions from the upstream
	versions = download_versions_from_upstream("module")
	// put the versions list into the cache
	insert_module_list_to_cache(versions)
})
for version in versions {
	// start a cache-fill job (the previous psuedocode)
	enqueue_cache_filler("module@"+version)
})

In either case, if there's a failure, we can release the lease and retry the job. After we hit a maxiumum number of retries, we should write a "failed" message into the appropriate cache entry (list or the actual module).

Open Questions

We've implemented an immutable cache here in the proxy, but we also should consider modules to be mutable upstream. I've included some example scenarios that could result in unexpected, non-repeatable builds:

Scenario: Version Deleted

At time t0, someone requests [email protected] from the proxy
The proxy returns 404 on the /list request
vgo fetches the module from the upstream
The proxy kicks off the list background job (which then kicks off cache-fill jobs)
At time t1, v1.0.0 is deleted

Result: any environment that has access to the proxy builds properly, any that doesn't won't build

Discussion on whether modules are mutable has begun. Regardless of outcome, I believe that the proxy cache should be immutable, and require explicit intervention by operators to delete or mutate an individual module. This behavior helps deliver repeatable, correct builds to an organization using the proxy.

Scenario: Proxy Has Missing Module Version

At time t0, someone requests [email protected] from the proxy
The proxy returns 404 on the /list request
vgo fetches the module from the upstream
At time t1, the proxy kicks off the list background job
At time t2, the proxy saves v1.0.0 as one of the versions in the versions cache entry
At time t3, v1.0.0 is deleted from the upstream
At time t4, the proxy kicks off the cache-fill job for v1.0.0, and cannot find the version upstream

Result: no observable difference between this and the previous scenario

Scenario: Version Mutated

At time t0, someone requests [email protected] from the proxy
The proxy returns 404
At time t1 vgo properly falls back to the upstream
At time t2, v1.0.0 is modified upstream
At time t3, the cache-fill background job fills the cache with v1.0.0

Result: builds on the local machine build with v1.0.0 code from t1, future builds build with v1.0.0 code from t3.

Some of our integrity work may prevent this case

Final Notes

The first scenario above requires us to make some "cultural" decisions on the Go module ecosystem. We'll have to first decide whether module version "should" be mutable.

Personally, I don't think they should be. If someone decides to change or delete a module (i.e. delete or delete-and-recreate a Git tag), the proxy and registry (detailed in another document) should insulate dependent modules from the change.

We could solve the second and third scenario by adding some coordination into the API server. Here's a very rough sketch on how that could look:

The API server checks for [email protected] in the cache. If it finds it, return immediately
If it doesn't find it, check for a lease on [email protected]. If none exists, start the cache-fill job
Wait for the lease to be released. If it released successfully check the cache for v1.0.0 and return it to the client
If the lease expired, look for a new lease to be created on v1.0.0 and goto 3

I've mentioned a few times above that I don't think we should do this. It's much more complex to get right at scale, and if we can get away with saying "don't change or delete modules!" - at least at first - that makes more sense to me, culturally and technically.

cc/ @michalpristas @bketelsen

Pull module feed periodically from Olympus

Prepare a long-running background task which will periodically ask Olympus what's new providing last synchronization point

Proxy needs to makes initial sync request to olympus.golang.org to learn its affinity to Olympus deployment e.g amazon.olympus.golang.org

Olympus recognizes that proxy is asking for the first time and responds with following information:

entire log
sequence_id: which serves as a pointer to latest served entry of the log
olympus_host: with DNS entry of OM

Prereq: #100

Feature: Serve as a proxy for the go tool

if a user sets the GOPROXY environment variable to our domain name, they should use our service to download modules, zip files, metadata, etc from our servers where it is cached from the upstream sources.

This reduces load on upstream VCS servers, and allows us to provide a better experience for end-users by using a CDN for content.

Proposal: hosting docs

Proxies will host multiple versions of a module, and I think it would be very helpful to the community for proxies to also host documentation for each version. This is similar to what godoc.org or the godoc CLI does, except docs would be versioned. I could also imagine this being great for private code 😄

Fix golint issues

At the moment, golint $(go list ./...) reports quite a few issues (missing comments and stutters).
We should run golint with -set_exit_status flag after they are fixed - see #85

Add CI to this repository

Add circleCI (2.0) tests

Feature: Serve as vanity host

as a registry service, provide vanity hosting for modules under a given set of domain names.

Using this service, VCS would be insulated from the end user. As an example if we host at gomods.io, users can register modules with vanity domain names -- also serving as the canonical import path for the module.

package captainhook // import "gomods.io/captainhook"

This functionality is separate from, and in addition to the proxy service.

Build and publish docker images for the Athens server

We'll need to do this on the road to #57 and #80

Add a test for the CLI's file walk function for the zip creator

After #30, there will be a tree walk function that we'll need to test

Make the CLI read the `go.mod` file

The current CLI usage looks like this:

./vgp <directory> <baseURL> <module> <version>

The module part shouldn't be required because the module is listed in the go.mod file. The CLI should just read that entry. The resulting usage would then be:

./vgp <directory> <baseURL> <version>

In disk storage mode, do initial check that disk storage root exists

Feature: Module publishing API

Create an API that allows module authors to publish modules from their CI/CD systems like Travis.

Possibly this could be as simple as a webhook integration.

Make the CLI ignore unnecessary files when zipping source code

The does it work section shows that the default "upload" command in the CLI zips source code that then fails when vgo downloads it:

$ vgo get arschles.com/[email protected]
vgo: downloading arschles.com/testmodule v1.0.0
vgo: import "arschles.com/testmodule": zip for arschles.com/[email protected]. has unexpected file testmodule/.DS_Store

It should ignore at least the .git and .DS_Store files, but probably more. At least to make vgo get succeed in downloading modules for now

TODO: Registry architecture and considerations

We need an issue that lays out the current architecture of the registry and considerations we need to think through in the near future for the registry (similar to #78).

To start, I know we'll need to figure out how to save the CDN metadata and upload module metadata & code to the CDN, atomically. A bit more detail about that in #50

vendoring

sort out vendoring
Ideally, a vgo based solution would be best, but how viable is that with buffalo in current configuration? Research what buffalo dev does and how deep the dep integration is with buffalo.

Add docs on how to deploy Athens to ACI

This would increase the options for folks who want to deploy Athens to the cloud

ACI docs: https://azure.microsoft.com/en-us/services/container-instances/

Implement a Mongo-backed CDN getter

We only have a fake one now

use disk based storage by default

Memory storage is not very useful for almost all uses 😄

Build a storage driver for Azure blob storage

It needs to be able to save module metadata & source (go.mod, add to the versions list, and the zip file), and also fetch it

After this is built, consider building drivers for other CDNs

Host Olympus / Athens

I'd like us to host the gomods.io (and all of the other hostnames!) inside a Kubernetes cluster. I'd also like to provide the option for anyone else to host their own registry or proxy in their own Kubernetes cluster. By providing all the artifacts someone needs to install Athens (proxy or registry) on their cluster, we can do both at once 🎉!

This is a umbrella ticket for hosting Athens (both the registry and proxy server) -- and providing everything other needs to host Athens -- in Kubernetes:

Build docker images (#81)
Create helm charts
- And uploading to a helm repo? When Athens gets some stability (#82) we might consider putting it into the official incubator repo
- Another option for a place to host is chart museum
Connect Athens to required external services (CDN, Databases, etc...)

Implement a PostgreSQL storage driver

The storage drivers at ./pkg/storage are used by the proxy to store and return metadata and source for a given module. We should implement a driver for PostgreSQL.

Implement a CDN saver

This would complement the CDN getter. The getter just gets a redirect location for a module; the setter should upload/update:

The versions list
The go.mod file
The version metadata
The source zip

It should also enter a redirect location (that the getter can fetch) into the CDN metadata store.

Make sure to put the entire operation (enter into CDN metadata store + upload) in a critical section. Ideas:

Lock/lease the directory on the CDN
- Azure blob storage gives you versioned blobs and also lock semantics for blobs (see here)
- Google Cloud Storage gives you generational versioning and preconditions (see here and here)
- Quick research on S3 indicates that blobs (buckets?) have versions. Need more to research whether the API includes preconditions for generations

docs: add documentation endpoint

add documentation endpoint serving nicely formatted docs in markdown

upgrade buffalo to 0.11.0

Build other CDN drivers

There are two pieces of CDN storage:

Functionality that uploads stuff to the CDN itself ("storage drivers")
Functionality that saves and fetches metadata about the CDN ("metadata drivers")

Storage drivers and metadata drivers can be mixed and matched. For example, you can use Azure CDN storage drivers with MySQL metadata drivers, Google Cloud Storage with Mongo, and so on...

The issue for adding more metadata drivers is #147. This issue is about adding more storage drivers for CDNs.

We're initially planning to use Azure CDN (#29), but it would be nice to implement other CDN drivers. They will need to support something like the atomicity features described in #50

Ones I think we should do:

Azure CDN (#29)
Google Cloud Storage (#158)
AWS S3 (#159)

Report 404 for inactive/hidden modules immediately

active/hidden flag means that download is in progress, so we return 404 to vgo and we dont notify Olympus about cache-miss

Prereq: #93

Notify Olympus on cache-miss

When proxy encounters a request for which it does not have data it returns 404 so vgo can ask upstream VCS for requested package
At the same time, handler should fire a goroutine in which it notifies Olympus about cache miss.
Olympus will do the rest.

Prereq: #101

Remove 'dep ensure' from the Dockerfile

We already check in the vendor dir, so we don't need to re-run dep ensure on every build

Use spf13/afero for InMemory and Disk Storage

It would be possible to use the same code for both backends and automatically enable others i.e. sftp
https://github.com/spf13/afero

Proposal: build separate binaries for proxy & registry

We have a growing set of logic that's basically if proxy { ... } else { ... } which will be growing over time. I believe we're better served building separate binaries (i.e. having two main() functions), one for the proxy and one for the registry (and others for components to come if necessary). We already have a cmd directory, so my proposal is to create a cmd/proxy and a cmd/registry, and to house buffalo app inside of each directory. Common code should still be shared via the packages inside pkg/. Finally, I want to keep proxy, registry, and other code inside of a single repository.

Implement a MongoDB Storage Driver

The storage drivers at ./pkg/storage are used by the proxy to store and return metadata and source for a given module. We should implement a driver for MongoDB. By using the MongoDB API, we can connect the proxy to CosmosDB

Fix up readme, license, etc

change to team ownership

Disk storage: Do not read entire source zipfiles into memory

After #20, there will be a disk storage system, which is likely to store large modules. The storage interface's Get function returns []bytes for both the go.mod file and the zipped-up module source code. Of course, that's not ideal because module zips would need to be read into memory.

The storage interface Get method should be changed to return io.ReadClosers instead of []bytes for the zip. I don't think we'll need to do the same for go.mod files for now, since those will be relatively small.

Write a docker-compose file for use with testing

Right now, you need to run a MongoDB server to successfully run tests. The number of service dependencies will grow, though (see #40 and #39 for example), and it'd be nice to only need to run docker-compose up to get all them running before you run make test. That means writing a docker-compose.yml file

Support for Exclude

Proxy by default fetches ALL modules reported from Olympus. Provide a way how to configure proxy with exclude configuration file
which changes this behavior

Example 1: Exclude modules

- github.com/azure
- github.com/foo/bar

this configuration will exclude from processing all repos owned by azure and foo/bar repo.
Other repos by foo will be processed as designed

These prefixes should be supported:

vcs.com/owner/repo
vcs.com/owner
vcs.com

NO regexes

break out different targets into folders in the monorepo

Add Olympus Buffalo App
Add Athens Buffalo App
Move Athens CLI App
Docs (Athens)
Docs (Olympus)
Docs (cli)

Factor all storage tests into a common suite

We have a lot of ./pkg/storage/*/all_test.go files that exercise pretty much the same functionality for all the drivers. We should just factor them all out into one suite

Implement a MySQL Storage driver

The storage drivers at ./pkg/storage are used by the proxy to store and return metadata and source for a given module. We should implement a driver for MySQL.

Build background workers to allow the proxy to fetch source code

When a package is requested by the proxy, it should check if the module is in storage. If it is not, then it should respond with a 404 and kick off a fetch operation in the background

Make the CLI use a subcommand & flags

Instead of ./vgp ./testmodule arschles.com testmodule v1.0.0, it'd be nicer to have something like this:

./vgp upload ./testmodule --baseurl arschles.com --module testmodule --version v1.0.0

It's more human readable, and it also allows for expanding the admin API and supporting it in the CLI. Prefer to use cobra for this

Add a disk storage backend

Since it'll be the second storage backend, it will require that the server also reads some config (prefer to read from env vars) to indicate what storage backend to use

Add support for disabling communication with Olympus

If proxy is used as private there might need to disable communication with Olympus.
Provide configuration settings with the possibility to change value during runtime. Each following operation will use new configuration value

Check before contacting Olympus

Disabling means not reporting cache-misses. Proxy will still try to fetch data from olympus on regular basis and pull modules which pass the filter