Giter VIP home page Giter VIP logo

x509-cert-testcorpus's Introduction

x509-cert-testcorpus

This is a corpus of about 1.74 million unique X.509 certificates, all of which have been in public use in the wild by querying about 3.9 million different TLS servers. They have been scraped by using server name lists like the Alexa top Million list and querying every single host on port 443 for their certificate.

The goal is to have a realistic test corpus to test tools against (shameless plug: I did this for my X.509 Swiss Army Knife tool x509sak.

Since scraping takes a long time, it made sense to me to publish the whole corpus so other people don't have to do the scraping themselves.

Certificate content

Because the size of the database has outgrown GitHub (GitHub LFS is too restrictive in the free plan and we really don't want to store the binaries within Git), they're now hosted as a tar.gz archive here:

  • Latest certs.tar.gz from 2019-12-31 10:30:23
  • Download link
  • Alternative download link
  • File size 2054334870 bytes (1.91 GiB)
  • SHA256 04740d4e1205a2274bed78991b20f698e36e9d2b334547e6eea66a7bc702b449

Database structure

The database contains a table of contents (TOC) Sqlite3 database and 256 storage Sqlite3 databases. The TOC contains SHA256 hashes over the DER encoding of the certificates. The first byte of the SHA256 gives the number of the storage database the actual cert can be found in. Thus, the TOC structure is:

CREATE TABLE connections (
	conn_id integer PRIMARY KEY,
	leaf_only boolean NOT NULL,
	fetch_timestamp integer NOT NULL,
	servername varchar NOT NULL,
	cert_hashes blob NOT NULL,
	UNIQUE(servername, fetch_timestamp)
);

The leaf_only flag indicates wether or not only the leaf (server) certificate was stored in the database. This is meaningful to be able to distinguish if the server really returned only its server certificate or if it might have returned more, but those CA certificates were not stored.

The cert_hashes is a binary string of concatenated SHA256 hashes that reference the presented certificates. It is always a multiple of 32 bytes in length and the order is the same order in which the certificates were received (starting from the server certificate itself at the very front).

The storage databases themselves are straightforward in their definition:

CREATE TABLE certificates (
	cert_sha256 blob PRIMARY KEY,
	der_cert blob NOT NULL
);

Date/time of scraping

A first batch of these certificates were scraped over about a week's worth of time starting around 2018-10-06, a second batch around 2019-12-22.

Domain name list

To import a CSV of a domain name list, the following sources can be useful:

They are all in CSV format and can be imported using the import_domainname_csv.py script.

License

Everything in here is CC-0.

x509-cert-testcorpus's People

Contributors

johndoe31415 avatar

Stargazers

 avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.