jcjones / ct-mapreduce
Map/Reduce functions for processing Certificate Transparency. Used for https://LetsEncrypt.org/stats
Home Page: https://ct.tacticalsecret.com/
With the suggested configuration, ct-fetch doesn't start:
2017/08/16 19:55:33 iniflags: unknown flag name=[geoipDbPath] found at line [2] of file [./ct-fetch.conf]
2017/08/16 19:55:33 iniflags: unknown flag name=[runForever] found at line [10] of file [./ct-fetch.conf]
After commenting out the two lines, it runs.
yeti2019.ct.digicert.com/log/ [-------] 2 % 13h4m10s 5157404 / 277040229
yeti2019.ct.digicert.com/log/ [-------] 3 % 8h52m45s 7253760 / 271896381
yeti2019.ct.digicert.com/log/ [==>---] 57 % 4h11m48s 151620453 / 264662809
No idea how this happened; perhaps redirects, though probably not, since the internal state wouldn't have moved.
Currently we read logs linearly, which is limiting us to single-thread log download throughput. Most logs will accept simultaneous readers, and for catching up it'd be nice to take buckets of, say, 1M entries and process them in parallel.
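A sketch of how that bucketing could look: split the entry range into fixed-size buckets and hand each to its own goroutine. The `bucketRanges` helper and the per-bucket fetch are hypothetical, not the current code.

```go
package main

import (
	"fmt"
	"sync"
)

// bucketRanges splits [start, end) into contiguous buckets of at most size entries.
func bucketRanges(start, end, size int64) [][2]int64 {
	var buckets [][2]int64
	for lo := start; lo < end; lo += size {
		hi := lo + size
		if hi > end {
			hi = end
		}
		buckets = append(buckets, [2]int64{lo, hi})
	}
	return buckets
}

func main() {
	var wg sync.WaitGroup
	for _, b := range bucketRanges(0, 2500000, 1000000) {
		wg.Add(1)
		go func(lo, hi int64) {
			defer wg.Done()
			// A real worker would issue get-entries calls for this bucket.
			fmt.Printf("fetching bucket [%d, %d)\n", lo, hi)
		}(b[0], b[1])
	}
	wg.Wait()
}
```

The log's get-entries rate limits would cap how many buckets can usefully run at once.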
Behind #4 lurks this error:
groovetop:ct-mapreduce lcrouch$ ct-fetch -config ~/.ct-fetch.conf --offset 5068
Saving to disk at /tmp/ct
[https://ct.googleapis.com/rocketeer] Starting download.
[https://ct.googleapis.com/rocketeer] Fetching signed tree head...
[https://ct.googleapis.com/rocketeer] Starting from offset 5068
[https://ct.googleapis.com/rocketeer] 227985342 total entries at Thu Mar 15 09:02:00 2018
[https://ct.googleapis.com/rocketeer] Going from 5068 to 227985342
| 0.0% (17408 of 227980274) Rate: 4069/minute (933h41m0s remaining)
[https://ct.googleapis.com/rocketeer] Download halting, error caught: failed to parse certificate in MerkleTreeLeaf for index 24462: asn1: structure error: E: integer not minimally-encoded
[ct.googleapis.com/rocketeer] Saved state. MaxEntry=23500, LastEntryTime=2014-09-09 08:42:29.000000919 -0500 CDT
Same config as #4.
The on-disk storage for ct-fetch serializes certificates straight to disk; it does not maintain any state about certificates, so it can't tell whether they've already been written. Right now, if you change logs or fetch multiple logs, you'll get duplicates.
The FQDN and RegDom map/reduce functions will probably handle that OK, but it might inflate the cert count since that's done in a simple fashion. Also, it's wasteful to the disk.
Each day's metadata should maintain a list of seen issuer/serial combinations that can be used to de-dupe and decide if we should re-serialize a cert, as it's encountered in CT.
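A minimal sketch of such a seen-set, with hypothetical names; the real structure would have to be persisted alongside the day's metadata:

```go
package main

import "fmt"

// SeenCerts tracks issuer/serial pairs already serialized for a given day,
// so a cert encountered in multiple CT logs is only written to disk once.
type SeenCerts struct {
	seen map[string]struct{}
}

func NewSeenCerts() *SeenCerts {
	return &SeenCerts{seen: map[string]struct{}{}}
}

// ShouldSerialize returns true only the first time an issuer/serial pair is seen.
func (s *SeenCerts) ShouldSerialize(issuer, serial string) bool {
	key := issuer + "/" + serial
	if _, ok := s.seen[key]; ok {
		return false
	}
	s.seen[key] = struct{}{}
	return true
}

func main() {
	s := NewSeenCerts()
	fmt.Println(s.ShouldSerialize("DigiCert", "03d2d9")) // true: first sighting
	fmt.Println(s.ShouldSerialize("DigiCert", "03d2d9")) // false: duplicate from another log
}
```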
Some combination of ct-fetch and ct-reprocess-known-certs fails to set TTL for all Redis cache keys. A recent fixup caught 437 keys that had no TTL set at all and had not yet expired.
ApproximateMostRecentUpdateTimestamp, added in #41, uses the Redis scan, which is just stupidly slow, even for a really narrow key-scope. Most of #41 should be reverted in favor of instead analyzing shared state of the LogSyncEngine's LogWorkers, which it currently does not track, but should.
ct-mapreduce/cmd/ct-fetch/ct-fetch.go
Lines 148 to 157 in 2b0680e
This will require adding shims for the CT client in LogSyncEngine, which will be useful for testing the rest of the mechanism anyway.
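A sketch of what such a shim could look like: hide the CT client behind a small interface so tests can substitute a fake. The names here are hypothetical, not the current API.

```go
package main

import (
	"context"
	"fmt"
)

// CTLogClient is the narrow surface of the CT log client that the sync
// engine needs; a fake implementation can stand in for it in tests.
type CTLogClient interface {
	GetSTHTreeSize(ctx context.Context) (uint64, error)
}

// fakeClient returns a canned tree size, for tests.
type fakeClient struct{ size uint64 }

func (f *fakeClient) GetSTHTreeSize(ctx context.Context) (uint64, error) {
	return f.size, nil
}

// entriesRemaining is an example of engine logic that becomes testable
// once the client is behind an interface.
func entriesRemaining(ctx context.Context, c CTLogClient, current uint64) (uint64, error) {
	size, err := c.GetSTHTreeSize(ctx)
	if err != nil {
		return 0, err
	}
	if current >= size {
		return 0, nil
	}
	return size - current, nil
}

func main() {
	n, _ := entriesRemaining(context.Background(), &fakeClient{size: 100}, 40)
	fmt.Println(n) // 60
}
```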
Fatals received: StreamSerialsForExpirationDateAndIssuer iter.Next unexpected code Internal aborting: (2020-01-21/hOF1upiO-xevnDEQ13g3aY3M3Uqa44RN5rVlwvU2WC8=) (total time: 27m24.196755175s) (count=6069) (offset=160314) err=rpc error: code = Internal desc = unexpected EOF
StreamSerialsForExpirationDateAndIssuer iter.Next unexpected code Internal aborting: (2019-12-21/YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg=) (total time: 4h37m30.615102801s) (count=2807) (offset=841319) err=rpc error: code = Internal desc = unexpected EOF
Because LogDownloader.EntryChan is an unbuffered channel, and LogDownloader.downloadCTRangeToChannel has some weird backoff logic, a ct-fetch process ends up sleeping most of the time. Making LogDownloader.EntryChan a buffered channel of the size of the batch increases ct-fetch performance 10x in our environment:
diff --git a/cmd/ct-fetch/main.go b/cmd/ct-fetch/main.go
index 7bc12b1..f543574 100644
--- a/cmd/ct-fetch/main.go
+++ b/cmd/ct-fetch/main.go
@@ -71,7 +71,7 @@ type LogDownloader struct {
func NewLogDownloader(db storage.CertDatabase) *LogDownloader {
return &LogDownloader{
Database: db,
- EntryChan: make(chan CtLogEntry),
+ EntryChan: make(chan CtLogEntry, 1024),
Display: utils.NewProgressDisplay(),
ThreadWaitGroup: new(sync.WaitGroup),
DownloaderWaitGroup: new(sync.WaitGroup),
I'm not opening this as a PR, because I sense that the backoff logic should probably be removed too. The producer can, and probably just should, block on the channel when it is full.
At 23:01 UTC, aggregate-known from CRLite printed warnings:
No cached certificates for issuer=CN=Go Daddy Secure Certificate Authority - G2,OU=http://certs.godaddy.com/repository/,O=GoDaddy.com\, Inc.,L=Scottsdale,ST=Arizona,C=US (8Rw90Ej3Ttt8RRkrg-WYDS9n7IS03bk5bjP_UXPtaY8=) expDate=2019-11-26-23, but the loader thought there should be. (current count this worker=4782155)
No cached certificates for issuer=CN=cPanel\, Inc. Certification Authority,O=cPanel\, Inc.,L=Houston,ST=TX,C=US (hOF1upiO-xevnDEQ13g3aY3M3Uqa44RN5rVlwvU2WC8=) expDate=2019-11-26-23, but the loader thought there should be. (current count this worker=4406637)
The Redis cache expired those because hour 23 started, but they should have stuck around until hour 23 ended, so we have a fencepost issue somewhere. It's possible this is in my repair script rather than in the Go implementation. (See #32, which explains why I have a repair script.)
Firestore is just not great. I'd rather see a backend implementation using Google Cloud Storage if we needed bulk PEM data again. In the mean time, once Firestore is phased out of the CRLite deployment the code will rot, and I don't think it should remain in-tree.
There's currently no delete method in storage/firestorebackend.go, so that will need to be added.
All the good thread-stuff is missing, since we're forced to a single worker thread right now. Diskdatabase.go should choose a mechanism to ensure threadsafety for individual files.
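One common mechanism is a map of per-path mutexes, so workers serialize only when they touch the same file. A sketch with hypothetical names, not the current code:

```go
package main

import (
	"fmt"
	"sync"
)

// fileLocks hands out one mutex per file path, so concurrent workers can
// serialize writes to the same certificate file without a single global lock.
type fileLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newFileLocks() *fileLocks {
	return &fileLocks{locks: map[string]*sync.Mutex{}}
}

// lockFor returns the (shared) mutex guarding a given path, creating it on
// first use.
func (f *fileLocks) lockFor(path string) *sync.Mutex {
	f.mu.Lock()
	defer f.mu.Unlock()
	l, ok := f.locks[path]
	if !ok {
		l = &sync.Mutex{}
		f.locks[path] = l
	}
	return l
}

func main() {
	fl := newFileLocks()
	var wg sync.WaitGroup
	counter := 0
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l := fl.lockFor("/tmp/ct/2020-03-07.pem")
			l.Lock()
			counter++ // stand-in for appending to the shared file
			l.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(counter) // 100
}
```

The map itself never shrinks here; a real implementation would want to bound it or evict entries for closed files.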
Error writing known certificates 2020-03-07::gxeKFFaZ2HFJIsTdTjEl6nVo3ckTCX-qzRMqb9Xoa1w=: rpc error: code = InvalidArgument desc = A document cannot be written because it exceeds the maximum size allowed.
The Issuer's known certs cache document was lost. This is pretty critical. The max size is 1 MB for a document, and back-of-envelope suggested that was large enough, but it clearly isn't.
Redis documentation says to avoid the KEYS command in favor of SCAN or sets: https://redis.io/commands/keys
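The SCAN replacement is a cursor loop: fetch a page of keys, carry the returned cursor forward, and stop when it comes back as 0. Sketched here against a hypothetical scanner interface with a fake in-memory implementation rather than a live Redis client:

```go
package main

import "fmt"

// scanner abstracts Redis SCAN: given a cursor, return a page of keys and
// the next cursor (0 means the iteration is complete). A real implementation
// would call SCAN with a MATCH pattern and a COUNT hint.
type scanner interface {
	Scan(cursor uint64) (keys []string, next uint64, err error)
}

// allKeys walks the keyspace incrementally, never blocking the server the
// way one KEYS call over a large keyspace does.
func allKeys(s scanner) ([]string, error) {
	var out []string
	var cursor uint64
	for {
		keys, next, err := s.Scan(cursor)
		if err != nil {
			return nil, err
		}
		out = append(out, keys...)
		if next == 0 {
			return out, nil
		}
		cursor = next
	}
}

// fakeScanner serves pages from a slice, for illustration only.
type fakeScanner struct{ pages [][]string }

func (f *fakeScanner) Scan(cursor uint64) ([]string, uint64, error) {
	next := cursor + 1
	if int(next) >= len(f.pages) {
		next = 0
	}
	return f.pages[cursor], next, nil
}

func main() {
	s := &fakeScanner{pages: [][]string{{"serials::a"}, {"serials::b", "serials::c"}}}
	keys, _ := allKeys(s)
	fmt.Println(keys) // [serials::a serials::b serials::c]
}
```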
Now that the Issuer metadata is no longer per-expDate, the only per issuer/expDate structure in FilesystemDatabase's gcache is KnownCertificates, which is now stateless. Basically, that cache is useless and should be removed.
Metrics are great, and we added a ton. Let's ... back that down.
Since the Firestore effectively de-duplicates certs (by colliding), I could filter out to only the Mozilla root-store-included issuers before writing to the Redis cache. I don’t think that will have a huge impact, but it’ll have some.
Pre-certificates, such as:
-----BEGIN CERTIFICATE-----
MIIEI6ADAgECAhADktkhTsTVqOlF0JrhFs/ZMA0GCSqGSIb3DQEBCwUAME0xCzAJ
BgNVBAYTAlVTMRUwEwYDVQQKEwxEaWdpQ2VydCBJbmMxJzAlBgNVBAMTHkRpZ2lD
ZXJ0IFNIQTIgU2VjdXJlIFNlcnZlciBDQTAeFw0xODAzMTYwMDAwMDBaFw0xODA2
MTIxMjAwMDBaMIGDMQswCQYDVQQGEwJTRTESMBAGA1UECBMJU3RvY2tob2xtMRIw
EAYDVQQHEwlTdG9ja2hvbG0xETAPBgNVBAoTCEVyaWNzc29uMRQwEgYDVQQLEwtJ
VCBTRVJWSUNFUzEjMCEGA1UEAxMaYXR0d2lmaS5kcml2ZS5lcmljc3Nvbi5uZXQw
ggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDcWzCAXYfzaz3hzzbtkUJW
N32EzDzNzipCdPirv5dlvJicbh8+rUwTK37jkq+pHtcCLf+gJqgTXsMyB1znYizc
zH2HxZEh8TgMr5/B0VPU/xEysPyioRkDBzHBqXx2WJPZrZuyvK7hmVHHragmHOZa
tHO7zzF/rDMInOGNoZ1IRCpfMi9jMKuWcahCHQ4A9ipgRB0dBOEhvbT7Yg9jfyu4
yUexh2aNbM7ZxZrl8FPhlPgnvJdzaWecDF8BrYgidBtXhhfjDiGgukQg7T2DAzqz
hfbFBqLN4dbHjLrLWst9Z+MZvg67rHWdpKREBF16zeP36j6/Shg6pph9vDQLqCyl
AgMBAAGjggHeMIIB2jAfBgNVHSMEGDAWgBQPgGEcgjFh1S8o541GOLQs4cbZ4jAd
BgNVHQ4EFgQUjVxUn3OkhHByp9sET7m0CNAiXHAwJQYDVR0RBB4wHIIaYXR0d2lm
aS5kcml2ZS5lcmljc3Nvbi5uZXQwDgYDVR0PAQH/BAQDAgWgMB0GA1UdJQQWMBQG
CCsGAQUFBwMBBggrBgEFBQcDAjBrBgNVHR8EZDBiMC+gLaArhilodHRwOi8vY3Js
My5kaWdpY2VydC5jb20vc3NjYS1zaGEyLWc2LmNybDAvoC2gK4YpaHR0cDovL2Ny
bDQuZGlnaWNlcnQuY29tL3NzY2Etc2hhMi1nNi5jcmwwTAYDVR0gBEUwQzA3Bglg
hkgBhv1sAQEwKjAoBggrBgEFBQcCARYcaHR0cHM6Ly93d3cuZGlnaWNlcnQuY29t
L0NQUzAIBgZngQwBAgIwfAYIKwYBBQUHAQEEcDBuMCQGCCsGAQUFBzABhhhodHRw
Oi8vb2NzcC5kaWdpY2VydC5jb20wRgYIKwYBBQUHMAKGOmh0dHA6Ly9jYWNlcnRz
LmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydFNIQTJTZWN1cmVTZXJ2ZXJDQS5jcnQwCQYD
VR0TBAIwAA==
-----END CERTIFICATE-----
prompt errors like:
/tmp/x/2020-05-16/bob.pem:0 Unable to load certificate
Upgrading to Python Cryptography 2.0 provides some CT support, but not enough to avoid having this line fail:
ct-mapreduce/python/ct-mapreduce-map.py
Line 108 in 2337b37
Currently ct-fetch persists its log state whenever a log download completes / catches-up. That's fine for maintenance, but during an initial sync which can take many days, the log states won't persist unless the user manually issues a SIGTERM or ctrl-c. And then restart, of course.
Really, we should download smaller batches and persist state in between them.
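A sketch of that batching loop: download a fixed-size batch, checkpoint the log state, repeat. `downloadBatch` and `saveState` are hypothetical stand-ins for the real operations.

```go
package main

import "fmt"

// syncLog downloads in fixed-size batches and persists the log state after
// each batch, so a multi-day initial sync survives a crash or restart
// without requiring a manual SIGTERM to checkpoint.
func syncLog(start, end, batchSize uint64,
	downloadBatch func(lo, hi uint64) error,
	saveState func(maxEntry uint64) error) error {
	for lo := start; lo < end; lo += batchSize {
		hi := lo + batchSize
		if hi > end {
			hi = end
		}
		if err := downloadBatch(lo, hi); err != nil {
			return err
		}
		// Checkpoint between batches, not only at catch-up.
		if err := saveState(hi); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	var checkpoints []uint64
	_ = syncLog(0, 2500, 1000,
		func(lo, hi uint64) error { return nil },
		func(maxEntry uint64) error { checkpoints = append(checkpoints, maxEntry); return nil })
	fmt.Println(checkpoints) // [1000 2000 2500]
}
```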
ct.googleapis.com/logs/argon2019/ [==================================>-------------------------------------] 49 % 10s 94976 / 192723
that won't complete within 10 minutes, let alone 10 seconds. We should unbreak that.
The number of bytes/cert is directly related to the number of certs in a key set. On a set of a few thousand certs, it's 24 bytes/cert, on one of a few hundred thousand, it's 45 bytes/cert.
Let’s Encrypt’s certs expiring on December 30th - 1509338 certs - is 159937855 bytes, or 105.96 bytes per cert.
To improve space utilization, I can segment the in-cache data by issuer/date/hour-of-day easily, because I always know hour-of-day, and it doesn’t affect the final filter. This also would let me do hourly revocation removals, which is a long term goal.
groovetop:ct-mapreduce lcrouch$ ct-fetch -config ~/.ct-fetch.conf
Saving to disk at /tmp/ct
[https://ct.googleapis.com/rocketeer] Starting download.
[https://ct.googleapis.com/rocketeer] Fetching signed tree head...
[https://ct.googleapis.com/rocketeer] Counting existing entries...
[https://ct.googleapis.com/rocketeer] 227988705 total entries at Thu Mar 15 10:00:55 2018
[https://ct.googleapis.com/rocketeer] Going from 0 to 227988705
| 0.0% (2711 of 227988705) Rate: 5287/minute (718h39m0s remaining)
[https://ct.googleapis.com/rocketeer] Download halting, error caught: failed to parse certificate in MerkleTreeLeaf for index 5067: x509: cannot parse dnsName "vmext21-065.gwdg.de."
[ct.googleapis.com/rocketeer] Saved state. MaxEntry=4096, LastEntryTime=2014-09-09 08:29:53.000000573 -0500 CDT
~/.ct-fetch.conf:
issuerCNList = DigiCert
logList = https://ct.googleapis.com/rocketeer # DigiCert is in here
certPath = /tmp/ct
It'd be nice to have a total-percentage-of-CT figure that sums all logs and their max entries, just as a talking point and a general total-time-to-sync mechanism.
[https://ct.googleapis.com/skydiver/] downloadCTRangeToChannel exited with an error: got HTTP Status "429 Too Many Requests", finalIndex=115867925, finalTime=2019-02-16 22:12:58.000000089 +0000 UTC
This error should be caught inside downloadCTRangeToChannel and use the backoff logic to retry.
Some rate limits may be directly related to whether H2 with multiplexed requests are in-use. Make sure we're using them.
It should be possible to only update the memorycache (Redis). Most notably, this means there needs to be CT Log metadata also stored into Redis.
StreamSerialsForExpirationDateAndIssuer iter.Next unexpected code Unknown aborting: (2019-11-15/hOF1upiO-xevnDEQ13g3aY3M3Uqa44RN5rVlwvU2WC8=) (total time: 1m21.131437568s) (count=0) (offset=24576) err=context deadline exceeded
after many
LoadCertificatePEM failed, issuer=YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg= expDate=2019-11-08, serial=034b33c63cf4212dd42fa582ef6e79789e7c Couldn't get document snapshot for ct/2019-11-08/issuer/YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg=/certs/A0szxjz0IS3UL6WC7255eJ58: context deadline exceeded
The PEM-loading method should be much more forgiving of congestion now, as it's happening in its own thread.
LoadCertificatePEM unexpected error, issuer=YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg= expDate=2020-01-18, serial=035c5c3d88d9c2ca42fbe6204eccd5169348 time=1m0.051885077s skipping: Couldn't get document snapshot for ct/2020-01-18/issuer/YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg=/certs/A1xcPYjZwspC--YgTszVFpNI: rpc error: code = Unavailable desc = The datastore operation timed out, or the data was temporarily unavailable.
LoadCertificatePEM unexpected error, issuer=YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg= expDate=2020-01-03, serial=035bfa58626bca42724556637ffd30fb00c7 time=1m0.128993994s skipping: Couldn't get document snapshot for ct/2020-01-03/issuer/YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg=/certs/A1v6WGJrykJyRVZjf_0w-wDH: rpc error: code = Unavailable desc = The datastore operation timed out, or the data was temporarily unavailable.
Around these lines:
ct-mapreduce/cmd/ct-fetch/ct-fetch.go
Lines 187 to 198 in a691f52
ct-fetch should verify that the certificate was signed by its issuer, to ensure it's a real certificate. This is important in the event that a CT log is coerced to log an invalid certificate.
If the certificate is valid but from an unknown issuer, tools can more readily handle that via whitelisting. But it's much better to ensure that we never log certificates that are actively themselves fraudulent.
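The signature check itself is available in the standard library as x509's CheckSignatureFrom. Sketched below with a throwaway CA and leaf minted on the spot; in ct-fetch the leaf and issuer would come from the CT entry and the issuer store instead.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// mintTestPair builds a throwaway CA and a leaf certificate signed by it,
// purely to exercise the check below.
func mintTestPair() (ca, leaf *x509.Certificate, err error) {
	caKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	caTmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "Test CA"},
		NotBefore:             time.Now(),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	caDER, err := x509.CreateCertificate(rand.Reader, caTmpl, caTmpl, &caKey.PublicKey, caKey)
	if err != nil {
		return nil, nil, err
	}
	ca, err = x509.ParseCertificate(caDER)
	if err != nil {
		return nil, nil, err
	}
	leafKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	leafTmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: "leaf.example"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
	}
	leafDER, err := x509.CreateCertificate(rand.Reader, leafTmpl, caTmpl, &leafKey.PublicKey, caKey)
	if err != nil {
		return nil, nil, err
	}
	leaf, err = x509.ParseCertificate(leafDER)
	return ca, leaf, err
}

func main() {
	ca, leaf, err := mintTestPair()
	if err != nil {
		panic(err)
	}
	// The per-entry check ct-fetch could perform before persisting:
	fmt.Println(leaf.CheckSignatureFrom(ca)) // <nil>: signature verifies
}
```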
Cross-referencing mozilla/crlite#119 ... Basically, I no longer map/reduce CT, and I'm not sure this library could even do it at this point. The ct-fetch tool and its associated pieces should just move into CRLite, and everything not CRLite-related should be removed. This repo should get an update to its README marking it out-of-use and pointing to CRLite.
When running in an environment where progress bars aren't actually displayed, they pollute stdout. It should be possible to turn them off.
This is pretty much my fault for not being more thorough in e09f015; issue #2 predicted this, calling the back-off logic unnecessary, which it really is.
The issue here is that the select statement in downloadCTRangeToChannel provides ways out that neither abort nor pass the CT entry on for evaluation:
ct-mapreduce/cmd/ct-fetch/main.go
Lines 388 to 406 in 8bb323f
Both the path through the SaveTicker and the default case drop the entry in logEntry into the ether, then proceed back to the top of the loop and grab a new entry.
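One way to restructure it so that no branch drops the entry: keep retrying the send, and let the save ticker fire without consuming logEntry. A sketch with simplified types, not a drop-in patch:

```go
package main

import (
	"fmt"
	"time"
)

type CtLogEntry struct{ Index uint64 }

// pushEntry blocks until the entry is accepted or the process is told to
// stop, while still letting the periodic save fire. No branch discards
// logEntry: the ticker path saves state and then retries the SAME entry.
func pushEntry(entryChan chan<- CtLogEntry, saveTick <-chan time.Time,
	quit <-chan struct{}, logEntry CtLogEntry, saveState func()) bool {
	for {
		select {
		case entryChan <- logEntry:
			return true // entry handed off for evaluation
		case <-saveTick:
			saveState() // save, then loop and resend the same entry
		case <-quit:
			return false // aborting; caller persists state and exits
		}
	}
}

func main() {
	ch := make(chan CtLogEntry, 1)
	ok := pushEntry(ch, nil, nil, CtLogEntry{Index: 42}, func() {})
	fmt.Println(ok, (<-ch).Index) // true 42
}
```

Combined with a buffered EntryChan (see the diff above in this thread's context), this removes the need for any sleep-based backoff on the producer side.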