jcjones / ct-mapreduce
Map/Reduce functions for processing Certificate Transparency. Used for https://LetsEncrypt.org/stats
Home Page: https://ct.tacticalsecret.com/
With the suggested configuration, ct-fetch doesn't start:
2017/08/16 19:55:33 iniflags: unknown flag name=[geoipDbPath] found at line [2] of file [./ct-fetch.conf]
2017/08/16 19:55:33 iniflags: unknown flag name=[runForever] found at line [10] of file [./ct-fetch.conf]
After commenting out the two lines, it runs.
yeti2019.ct.digicert.com/log/ [-------] 2 % 13h4m10s 5157404 / 277040229
yeti2019.ct.digicert.com/log/ [-------] 3 % 8h52m45s 7253760 / 271896381
yeti2019.ct.digicert.com/log/ [==>---] 57 % 4h11m48s 151620453 / 264662809
No idea how this happened; perhaps redirects, though probably not, since the internal state wouldn't have moved.
Currently we read logs linearly, which is limiting us to single-thread log download throughput. Most logs will accept simultaneous readers, and for catching up it'd be nice to take buckets of, say, 1M entries and process them in parallel.
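A sketch of how that bucketing could look: split the entry range into fixed-size buckets and hand each to its own goroutine. The `bucketRanges` helper and the per-bucket fetch are hypothetical, not the current code.

```go
package main

import (
	"fmt"
	"sync"
)

// bucketRanges splits [start, end) into contiguous buckets of at most size entries.
func bucketRanges(start, end, size int64) [][2]int64 {
	var buckets [][2]int64
	for lo := start; lo < end; lo += size {
		hi := lo + size
		if hi > end {
			hi = end
		}
		buckets = append(buckets, [2]int64{lo, hi})
	}
	return buckets
}

func main() {
	var wg sync.WaitGroup
	for _, b := range bucketRanges(0, 2500000, 1000000) {
		wg.Add(1)
		go func(lo, hi int64) {
			defer wg.Done()
			// A real worker would issue get-entries calls for this bucket.
			fmt.Printf("fetching bucket [%d, %d)\n", lo, hi)
		}(b[0], b[1])
	}
	wg.Wait()
}
```

The log's get-entries rate limits would cap how many buckets can usefully run at once.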
Behind #4 lurks this error:
groovetop:ct-mapreduce lcrouch$ ct-fetch -config ~/.ct-fetch.conf --offset 5068
Saving to disk at /tmp/ct
[https://ct.googleapis.com/rocketeer] Starting download.
[https://ct.googleapis.com/rocketeer] Fetching signed tree head...
[https://ct.googleapis.com/rocketeer] Starting from offset 5068
[https://ct.googleapis.com/rocketeer] 227985342 total entries at Thu Mar 15 09:02:00 2018
[https://ct.googleapis.com/rocketeer] Going from 5068 to 227985342
| 0.0% (17408 of 227980274) Rate: 4069/minute (933h41m0s remaining)
[https://ct.googleapis.com/rocketeer] Download halting, error caught: failed to parse certificate in MerkleTreeLeaf for index 24462: asn1: structure error: E: integer not minimally-encoded
[ct.googleapis.com/rocketeer] Saved state. MaxEntry=23500, LastEntryTime=2014-09-09 08:42:29.000000919 -0500 CDT
Same config as #4.
The on-disk storage for ct-fetch serializes certificates straight to disk; it does not maintain any state about certificates, so it can't tell whether they've already been written. Right now, if you change logs or fetch multiple logs, you'll get duplicates.
The FQDN and RegDom map/reduce functions will probably handle that OK, but it might inflate the cert count since that's done in a simple fashion. Also, it's wasteful to the disk.
Each day's metadata should maintain a list of seen issuer/serial combinations that can be used to de-dupe and decide if we should re-serialize a cert, as it's encountered in CT.
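A minimal sketch of such a seen-set, with hypothetical names; the real structure would have to be persisted alongside the day's metadata:

```go
package main

import "fmt"

// SeenCerts tracks issuer/serial pairs already serialized for a given day,
// so a cert encountered in multiple CT logs is only written to disk once.
type SeenCerts struct {
	seen map[string]struct{}
}

func NewSeenCerts() *SeenCerts {
	return &SeenCerts{seen: map[string]struct{}{}}
}

// ShouldSerialize returns true only the first time an issuer/serial pair is seen.
func (s *SeenCerts) ShouldSerialize(issuer, serial string) bool {
	key := issuer + "/" + serial
	if _, ok := s.seen[key]; ok {
		return false
	}
	s.seen[key] = struct{}{}
	return true
}

func main() {
	s := NewSeenCerts()
	fmt.Println(s.ShouldSerialize("DigiCert", "03d2d9")) // true: first sighting
	fmt.Println(s.ShouldSerialize("DigiCert", "03d2d9")) // false: duplicate from another log
}
```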
Some combination of ct-fetch and ct-reprocess-known-certs fails to set TTL for all Redis cache keys. A recent fixup caught 437 keys that had no TTL set at all and had not yet expired.
ApproximateMostRecentUpdateTimestamp, added in #41, uses the Redis scan, which is just stupidly slow, even for a really narrow key-scope. Most of #41 should be reverted in favor of instead analyzing shared state of the LogSyncEngine's LogWorkers, which it currently does not track, but should.
ct-mapreduce/cmd/ct-fetch/ct-fetch.go
Lines 148 to 157 in 2b0680e
This will require adding shims for the CT client in LogSyncEngine, which will be useful for testing the rest of the mechanism anyway.
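A sketch of what such a shim could look like: hide the CT client behind a small interface so tests can substitute a fake. The names here are hypothetical, not the current API.

```go
package main

import (
	"context"
	"fmt"
)

// CTLogClient is the narrow surface of the CT log client that the sync
// engine needs; a fake implementation can stand in for it in tests.
type CTLogClient interface {
	GetSTHTreeSize(ctx context.Context) (uint64, error)
}

// fakeClient returns a canned tree size, for tests.
type fakeClient struct{ size uint64 }

func (f *fakeClient) GetSTHTreeSize(ctx context.Context) (uint64, error) {
	return f.size, nil
}

// entriesRemaining is an example of engine logic that becomes testable
// once the client is behind an interface.
func entriesRemaining(ctx context.Context, c CTLogClient, current uint64) (uint64, error) {
	size, err := c.GetSTHTreeSize(ctx)
	if err != nil {
		return 0, err
	}
	if current >= size {
		return 0, nil
	}
	return size - current, nil
}

func main() {
	n, _ := entriesRemaining(context.Background(), &fakeClient{size: 100}, 40)
	fmt.Println(n) // 60
}
```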
Fatals received: StreamSerialsForExpirationDateAndIssuer iter.Next unexpected code Internal aborting: (2020-01-21/hOF1upiO-xevnDEQ13g3aY3M3Uqa44RN5rVlwvU2WC8=) (total time: 27m24.196755175s) (count=6069) (offset=160314) err=rpc error: code = Internal desc = unexpected EOF
StreamSerialsForExpirationDateAndIssuer iter.Next unexpected code Internal aborting: (2019-12-21/YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg=) (total time: 4h37m30.615102801s) (count=2807) (offset=841319) err=rpc error: code = Internal desc = unexpected EOF
Because LogDownloader.EntryChan is an unbuffered channel, and LogDownloader.downloadCTRangeToChannel has some weird backoff logic, a ct-fetch process ends up sleeping most of the time. Making LogDownloader.EntryChan a buffered channel of the size of the batch increases ct-fetch performance 10x in our environment:
diff --git a/cmd/ct-fetch/main.go b/cmd/ct-fetch/main.go
index 7bc12b1..f543574 100644
--- a/cmd/ct-fetch/main.go
+++ b/cmd/ct-fetch/main.go
@@ -71,7 +71,7 @@ type LogDownloader struct {
func NewLogDownloader(db storage.CertDatabase) *LogDownloader {
return &LogDownloader{
Database: db,
- EntryChan: make(chan CtLogEntry),
+ EntryChan: make(chan CtLogEntry, 1024),
Display: utils.NewProgressDisplay(),
ThreadWaitGroup: new(sync.WaitGroup),
DownloaderWaitGroup: new(sync.WaitGroup),
I'm not opening this as a PR, because I sense that the backoff logic should probably be removed too. The producer can, and probably just should, block on the channel when it is full.
At 23:01 UTC, aggregate-known from CRLite printed warnings:
No cached certificates for issuer=CN=Go Daddy Secure Certificate Authority - G2,OU=http://certs.godaddy.com/repository/,O=GoDaddy.com\, Inc.,L=Scottsdale,ST=Arizona,C=US (8Rw90Ej3Ttt8RRkrg-WYDS9n7IS03bk5bjP_UXPtaY8=) expDate=2019-11-26-23, but the loader thought there should be. (current count this worker=4782155)
No cached certificates for issuer=CN=cPanel\, Inc. Certification Authority,O=cPanel\, Inc.,L=Houston,ST=TX,C=US (hOF1upiO-xevnDEQ13g3aY3M3Uqa44RN5rVlwvU2WC8=) expDate=2019-11-26-23, but the loader thought there should be. (current count this worker=4406637)
The Redis cache expired those because hour 23 started, but they should have stuck around until hour 23 ended, so we have a fencepost issue somewhere. It's possible this is in my repair script rather than in the Go implementation. (See #32, which explains why I have a repair script.)
Firestore is just not great. I'd rather see a backend implementation using Google Cloud Storage if we needed bulk PEM data again. In the mean time, once Firestore is phased out of the CRLite deployment the code will rot, and I don't think it should remain in-tree.
There's currently no delete method in storage/firestorebackend.go, so that will need to be added.
All the good thread-stuff is missing, since we're forced to a single worker thread right now. Diskdatabase.go should choose a mechanism to ensure threadsafety for individual files.
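One common mechanism is a map of per-path mutexes, so workers serialize only when they touch the same file. A sketch with hypothetical names, not the current code:

```go
package main

import (
	"fmt"
	"sync"
)

// fileLocks hands out one mutex per file path, so concurrent workers can
// serialize writes to the same certificate file without a single global lock.
type fileLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newFileLocks() *fileLocks {
	return &fileLocks{locks: map[string]*sync.Mutex{}}
}

// lockFor returns the (shared) mutex guarding a given path, creating it on
// first use.
func (f *fileLocks) lockFor(path string) *sync.Mutex {
	f.mu.Lock()
	defer f.mu.Unlock()
	l, ok := f.locks[path]
	if !ok {
		l = &sync.Mutex{}
		f.locks[path] = l
	}
	return l
}

func main() {
	fl := newFileLocks()
	var wg sync.WaitGroup
	counter := 0
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l := fl.lockFor("/tmp/ct/2020-03-07.pem")
			l.Lock()
			counter++ // stand-in for appending to the shared file
			l.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(counter) // 100
}
```

The map itself never shrinks here; a real implementation would want to bound it or evict entries for closed files.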
Error writing known certificates 2020-03-07::gxeKFFaZ2HFJIsTdTjEl6nVo3ckTCX-qzRMqb9Xoa1w=: rpc error: code = InvalidArgument desc = A document cannot be written because it exceeds the maximum size allowed.
The Issuer's known certs cache document was lost. This is pretty critical. The max size is 1 MB for a document, and back-of-envelope suggested that was large enough, but it clearly isn't.
Redis documentation says to avoid the KEYS command in favor of SCAN or sets: https://redis.io/commands/keys
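The SCAN replacement is a cursor loop: fetch a page of keys, carry the returned cursor forward, and stop when it comes back as 0. Sketched here against a hypothetical scanner interface with a fake in-memory implementation rather than a live Redis client:

```go
package main

import "fmt"

// scanner abstracts Redis SCAN: given a cursor, return a page of keys and
// the next cursor (0 means the iteration is complete). A real implementation
// would call SCAN with a MATCH pattern and a COUNT hint.
type scanner interface {
	Scan(cursor uint64) (keys []string, next uint64, err error)
}

// allKeys walks the keyspace incrementally, never blocking the server the
// way one KEYS call over a large keyspace does.
func allKeys(s scanner) ([]string, error) {
	var out []string
	var cursor uint64
	for {
		keys, next, err := s.Scan(cursor)
		if err != nil {
			return nil, err
		}
		out = append(out, keys...)
		if next == 0 {
			return out, nil
		}
		cursor = next
	}
}

// fakeScanner serves pages from a slice, for illustration only.
type fakeScanner struct{ pages [][]string }

func (f *fakeScanner) Scan(cursor uint64) ([]string, uint64, error) {
	next := cursor + 1
	if int(next) >= len(f.pages) {
		next = 0
	}
	return f.pages[cursor], next, nil
}

func main() {
	s := &fakeScanner{pages: [][]string{{"serials::a"}, {"serials::b", "serials::c"}}}
	keys, _ := allKeys(s)
	fmt.Println(keys) // [serials::a serials::b serials::c]
}
```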
Now that the Issuer metadata is no longer per-expDate, the only per issuer/expDate structure in FilesystemDatabase's gcache is KnownCertificates, which is now stateless. Basically, that cache is useless and should be removed.
Metrics are great, and we added a ton. Let's ... back that down.
Since the Firestore effectively de-duplicates certs (by colliding), I could filter out to only the Mozilla root-store-included issuers before writing to the Redis cache. I don’t think that will have a huge impact, but it’ll have some.
Pre-certificates, such as:
-----BEGIN CERTIFICATE-----
MIIEI6ADAgECAhADktkhTsTVqOlF0JrhFs/ZMA0GCSqGSIb3DQEBCwUAME0xCzAJ
BgNVBAYTAlVTMRUwEwYDVQQKEwxEaWdpQ2VydCBJbmMxJzAlBgNVBAMTHkRpZ2lD
ZXJ0IFNIQTIgU2VjdXJlIFNlcnZlciBDQTAeFw0xODAzMTYwMDAwMDBaFw0xODA2
MTIxMjAwMDBaMIGDMQswCQYDVQQGEwJTRTESMBAGA1UECBMJU3RvY2tob2xtMRIw
EAYDVQQHEwlTdG9ja2hvbG0xETAPBgNVBAoTCEVyaWNzc29uMRQwEgYDVQQLEwtJ
VCBTRVJWSUNFUzEjMCEGA1UEAxMaYXR0d2lmaS5kcml2ZS5lcmljc3Nvbi5uZXQw
ggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDcWzCAXYfzaz3hzzbtkUJW
N32EzDzNzipCdPirv5dlvJicbh8+rUwTK37jkq+pHtcCLf+gJqgTXsMyB1znYizc
zH2HxZEh8TgMr5/B0VPU/xEysPyioRkDBzHBqXx2WJPZrZuyvK7hmVHHragmHOZa
tHO7zzF/rDMInOGNoZ1IRCpfMi9jMKuWcahCHQ4A9ipgRB0dBOEhvbT7Yg9jfyu4
yUexh2aNbM7ZxZrl8FPhlPgnvJdzaWecDF8BrYgidBtXhhfjDiGgukQg7T2DAzqz
hfbFBqLN4dbHjLrLWst9Z+MZvg67rHWdpKREBF16zeP36j6/Shg6pph9vDQLqCyl
AgMBAAGjggHeMIIB2jAfBgNVHSMEGDAWgBQPgGEcgjFh1S8o541GOLQs4cbZ4jAd
BgNVHQ4EFgQUjVxUn3OkhHByp9sET7m0CNAiXHAwJQYDVR0RBB4wHIIaYXR0d2lm
aS5kcml2ZS5lcmljc3Nvbi5uZXQwDgYDVR0PAQH/BAQDAgWgMB0GA1UdJQQWMBQG
CCsGAQUFBwMBBggrBgEFBQcDAjBrBgNVHR8EZDBiMC+gLaArhilodHRwOi8vY3Js
My5kaWdpY2VydC5jb20vc3NjYS1zaGEyLWc2LmNybDAvoC2gK4YpaHR0cDovL2Ny
bDQuZGlnaWNlcnQuY29tL3NzY2Etc2hhMi1nNi5jcmwwTAYDVR0gBEUwQzA3Bglg
hkgBhv1sAQEwKjAoBggrBgEFBQcCARYcaHR0cHM6Ly93d3cuZGlnaWNlcnQuY29t
L0NQUzAIBgZngQwBAgIwfAYIKwYBBQUHAQEEcDBuMCQGCCsGAQUFBzABhhhodHRw
Oi8vb2NzcC5kaWdpY2VydC5jb20wRgYIKwYBBQUHMAKGOmh0dHA6Ly9jYWNlcnRz
LmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydFNIQTJTZWN1cmVTZXJ2ZXJDQS5jcnQwCQYD
VR0TBAIwAA==
-----END CERTIFICATE-----
prompt errors like:
/tmp/x/2020-05-16/bob.pem:0 Unable to load certificate
Upgrading to Python Cryptography 2.0 provides some CT support, but not enough to avoid having this line fail:
ct-mapreduce/python/ct-mapreduce-map.py
Line 108 in 2337b37
Currently ct-fetch persists its log state whenever a log download completes / catches-up. That's fine for maintenance, but during an initial sync which can take many days, the log states won't persist unless the user manually issues a SIGTERM or ctrl-c. And then restart, of course.
Really, we should download smaller batches and persist state in between them.
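A sketch of that batching loop: download a fixed-size batch, checkpoint the log state, repeat. `downloadBatch` and `saveState` are hypothetical stand-ins for the real operations.

```go
package main

import "fmt"

// syncLog downloads in fixed-size batches and persists the log state after
// each batch, so a multi-day initial sync survives a crash or restart
// without requiring a manual SIGTERM to checkpoint.
func syncLog(start, end, batchSize uint64,
	downloadBatch func(lo, hi uint64) error,
	saveState func(maxEntry uint64) error) error {
	for lo := start; lo < end; lo += batchSize {
		hi := lo + batchSize
		if hi > end {
			hi = end
		}
		if err := downloadBatch(lo, hi); err != nil {
			return err
		}
		// Checkpoint between batches, not only at catch-up.
		if err := saveState(hi); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	var checkpoints []uint64
	_ = syncLog(0, 2500, 1000,
		func(lo, hi uint64) error { return nil },
		func(maxEntry uint64) error { checkpoints = append(checkpoints, maxEntry); return nil })
	fmt.Println(checkpoints) // [1000 2000 2500]
}
```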
ct.googleapis.com/logs/argon2019/ [==================================>-------------------------------------] 49 % 10s 94976 / 192723
that won't complete within 10 minutes, let alone 10 seconds. We should unbreak that.
The number of bytes/cert is directly related to the number of certs in a key set. On a set of a few thousand certs, it's 24 bytes/cert, on one of a few hundred thousand, it's 45 bytes/cert.
Let’s Encrypt’s certs expiring on December 30th - 1509338 certs - is 159937855 bytes, or 105.96 bytes per cert.
To improve space utilization, I can segment the in-cache data by issuer/date/hour-of-day easily, because I always know hour-of-day, and it doesn’t affect the final filter. This also would let me do hourly revocation removals, which is a long term goal.
groovetop:ct-mapreduce lcrouch$ ct-fetch -config ~/.ct-fetch.conf
Saving to disk at /tmp/ct
[https://ct.googleapis.com/rocketeer] Starting download.
[https://ct.googleapis.com/rocketeer] Fetching signed tree head...
[https://ct.googleapis.com/rocketeer] Counting existing entries...
[https://ct.googleapis.com/rocketeer] 227988705 total entries at Thu Mar 15 10:00:55 2018
[https://ct.googleapis.com/rocketeer] Going from 0 to 227988705
| 0.0% (2711 of 227988705) Rate: 5287/minute (718h39m0s remaining)
[https://ct.googleapis.com/rocketeer] Download halting, error caught: failed to parse certificate in MerkleTreeLeaf for index 5067: x509: cannot parse dnsName "vmext21-065.gwdg.de."
[ct.googleapis.com/rocketeer] Saved state. MaxEntry=4096, LastEntryTime=2014-09-09 08:29:53.000000573 -0500 CDT
~/.ct-fetch.conf:
issuerCNList = DigiCert
logList = https://ct.googleapis.com/rocketeer # DigiCert is in here
certPath = /tmp/ct
It'd be nice to have a total-percentage-of-CT figure that sums all logs and their max entries, just as a talking point and a general total-time-to-sync mechanism.
[https://ct.googleapis.com/skydiver/] downloadCTRangeToChannel exited with an error: got HTTP Status "429 Too Many Requests", finalIndex=115867925, finalTime=2019-02-16 22:12:58.000000089 +0000 UTC
This error should be caught inside downloadCTRangeToChannel and use the backoff logic to retry.
Some rate limits may be directly related to whether H2 with multiplexed requests are in-use. Make sure we're using them.
It should be possible to only update the memorycache (Redis). Most notably, this means there needs to be CT Log metadata also stored into Redis.
StreamSerialsForExpirationDateAndIssuer iter.Next unexpected code Unknown aborting: (2019-11-15/hOF1upiO-xevnDEQ13g3aY3M3Uqa44RN5rVlwvU2WC8=) (total time: 1m21.131437568s) (count=0) (offset=24576) err=context deadline exceeded
after many
LoadCertificatePEM failed, issuer=YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg= expDate=2019-11-08, serial=034b33c63cf4212dd42fa582ef6e79789e7c Couldn't get document snapshot for ct/2019-11-08/issuer/YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg=/certs/A0szxjz0IS3UL6WC7255eJ58: context deadline exceeded
The PEM-loading method should be much more forgiving of congestion now, as it's happening in its own thread.
LoadCertificatePEM unexpected error, issuer=YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg= expDate=2020-01-18, serial=035c5c3d88d9c2ca42fbe6204eccd5169348 time=1m0.051885077s skipping: Couldn't get document snapshot for ct/2020-01-18/issuer/YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg=/certs/A1xcPYjZwspC--YgTszVFpNI: rpc error: code = Unavailable desc = The datastore operation timed out, or the data was temporarily unavailable.
LoadCertificatePEM unexpected error, issuer=YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg= expDate=2020-01-03, serial=035bfa58626bca42724556637ffd30fb00c7 time=1m0.128993994s skipping: Couldn't get document snapshot for ct/2020-01-03/issuer/YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg=/certs/A1v6WGJrykJyRVZjf_0w-wDH: rpc error: code = Unavailable desc = The datastore operation timed out, or the data was temporarily unavailable.
Around these lines:
ct-mapreduce/cmd/ct-fetch/ct-fetch.go
Lines 187 to 198 in a691f52
ct-fetch should verify that the certificate was signed by its issuer, to ensure it's a real certificate. This is important in the event that a CT log is coerced to log an invalid certificate.
If the certificate is valid but from an unknown issuer, tools can more readily handle that via whitelisting. But it's much better to ensure that we never log certificates that are actively themselves fraudulent.
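The signature check itself is available in the standard library as x509's CheckSignatureFrom. Sketched below with a throwaway CA and leaf minted on the spot; in ct-fetch the leaf and issuer would come from the CT entry and the issuer store instead.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// mintTestPair builds a throwaway CA and a leaf certificate signed by it,
// purely to exercise the check below.
func mintTestPair() (ca, leaf *x509.Certificate, err error) {
	caKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	caTmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "Test CA"},
		NotBefore:             time.Now(),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	caDER, err := x509.CreateCertificate(rand.Reader, caTmpl, caTmpl, &caKey.PublicKey, caKey)
	if err != nil {
		return nil, nil, err
	}
	ca, err = x509.ParseCertificate(caDER)
	if err != nil {
		return nil, nil, err
	}
	leafKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	leafTmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: "leaf.example"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
	}
	leafDER, err := x509.CreateCertificate(rand.Reader, leafTmpl, caTmpl, &leafKey.PublicKey, caKey)
	if err != nil {
		return nil, nil, err
	}
	leaf, err = x509.ParseCertificate(leafDER)
	return ca, leaf, err
}

func main() {
	ca, leaf, err := mintTestPair()
	if err != nil {
		panic(err)
	}
	// The per-entry check ct-fetch could perform before persisting:
	fmt.Println(leaf.CheckSignatureFrom(ca)) // <nil>: signature verifies
}
```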
Cross-referencing mozilla/crlite#119 ... Basically, I no longer map/reduce CT, and I'm not sure this library could even do it at this point. The ct-fetch tool and its associated pieces should just move into CRLite, and everything not CRLite-related should be removed. This repo should get an update to its README marking it out-of-use and pointing to CRLite.
When running in an environment where progress bars aren't actually displayed, they pollute stdout. It should be possible to turn them off.
This is pretty much my fault for not being more thorough in e09f015; issue #2 predicted this, calling the back-off logic unnecessary, which it really is.
The issue here is that the select statement in downloadCTRangeToChannel provides ways out that neither abort nor pass the CT entry on for evaluation:
ct-mapreduce/cmd/ct-fetch/main.go
Lines 388 to 406 in 8bb323f
Both the path through the SaveTicker and the default case drop the entry in logEntry into the ether, then proceed back to the top of the loop and grab a new entry.
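One way to restructure it so that no branch drops the entry: keep retrying the send, and let the save ticker fire without consuming logEntry. A sketch with simplified types, not a drop-in patch:

```go
package main

import (
	"fmt"
	"time"
)

type CtLogEntry struct{ Index uint64 }

// pushEntry blocks until the entry is accepted or the process is told to
// stop, while still letting the periodic save fire. No branch discards
// logEntry: the ticker path saves state and then retries the SAME entry.
func pushEntry(entryChan chan<- CtLogEntry, saveTick <-chan time.Time,
	quit <-chan struct{}, logEntry CtLogEntry, saveState func()) bool {
	for {
		select {
		case entryChan <- logEntry:
			return true // entry handed off for evaluation
		case <-saveTick:
			saveState() // save, then loop and resend the same entry
		case <-quit:
			return false // aborting; caller persists state and exits
		}
	}
}

func main() {
	ch := make(chan CtLogEntry, 1)
	ok := pushEntry(ch, nil, nil, CtLogEntry{Index: 42}, func() {})
	fmt.Println(ok, (<-ch).Index) // true 42
}
```

Combined with a buffered EntryChan (see the diff above in this thread's context), this removes the need for any sleep-based backoff on the producer side.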