cache-trace's Introduction

Anonymized Cache Request Traces from Twitter Production

Trace Overview

This repository describes the traces from Twitter's in-memory caching (Twemcache/Pelikan) clusters. The current traces were collected from 54 clusters in March 2020, and each trace covers one week. More details are described in the accompanying paper and blog post.


Trace Format

The traces are compressed with zstd; to decompress a file, run zstd -d /path/file. The decompressed traces are plain text, structured as comma-separated columns. Each row represents one request and has the following fields.

  • timestamp: the time when the cache receives the request, in seconds
  • anonymized key: the original key after anonymization
  • key size: the size of the key in bytes
  • value size: the size of the value in bytes
  • client id: the anonymized client (frontend service) that sends the request
  • operation: one of get/gets/set/add/replace/cas/append/prepend/delete/incr/decr
  • TTL: the time-to-live (TTL) of the object, set by the client; it is 0 when the request is not a write request
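
As a minimal sketch of reading this format, the Python snippet below parses one decompressed row into a small record. The `Request` class and `parse_line` function are hypothetical names introduced here, not part of the repository. It assumes the columns appear in the order listed above, that the timestamp and sizes are integers, and, defensively, that a key might itself contain a comma; none of this is confirmed by the trace documentation, and the sample row is made up.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: int    # seconds; assumed to be an integer
    key: str          # anonymized key, namespaces preserved
    key_size: int     # bytes
    value_size: int   # bytes
    client_id: str    # kept as a string; its exact form is not documented here
    operation: str    # get/gets/set/add/...
    ttl: int          # 0 for non-write requests


def parse_line(line: str) -> Request:
    """Parse one decompressed trace row into a Request."""
    # The timestamp is the first column, so split it off from the left.
    timestamp, rest = line.rstrip("\n").split(",", 1)
    # Split the last five columns from the right, so that a key containing
    # a comma (an assumption, not confirmed by the docs) stays in one piece.
    key, key_size, value_size, client_id, operation, ttl = rest.rsplit(",", 5)
    return Request(int(timestamp), key, int(key_size), int(value_size),
                   client_id, operation, int(ttl))


if __name__ == "__main__":
    # Made-up row for illustration only.
    print(parse_line("0,nz:u:eeW511W3dcH3de3d15ec,25,100,7,get,0"))
```

To process a whole trace without decompressing it to disk first, rows can be streamed from the output of zstd -dc /path/file and passed to `parse_line` one at a time.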

Note that key anonymization preserves the namespaces. For example, if the anonymized key is nz:u:eeW511W3dcH3de3d15ec, the first two fields, nz and u, are namespaces. The namespaces are not necessarily delimited by ":"; different workloads use different delimiters and different numbers of namespaces.
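
For the example key above, namespace extraction can be sketched as follows. `split_namespaces` is a hypothetical helper, and the ":" default applies only to workloads that use that delimiter; for other workloads the delimiter has to be found by inspecting the keys, since the documentation does not list them.

```python
def split_namespaces(key: str, delimiter: str = ":") -> tuple[list[str], str]:
    """Return (namespaces, remainder) for an anonymized key.

    Assumes every field before the last delimiter is a namespace, which
    matches the nz:u:... example but may not hold for every workload.
    """
    parts = key.split(delimiter)
    return parts[:-1], parts[-1]


if __name__ == "__main__":
    namespaces, rest = split_namespaces("nz:u:eeW511W3dcH3de3d15ec")
    print(namespaces, rest)
```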

A sample of the traces is attached under samples.


Trace Download

The full traces are large (2.8 TB in compressed form, 14 TB uncompressed), and can be downloaded from the following places.

Carnegie Mellon University PDL cluster

https://ftp.pdl.cmu.edu/pub/datasets/twemcacheWorkload/open_source

SNIA

http://iotta.snia.org/tracetypes/17

Storj

see storj for how to access (Good for worldwide access, especially Asia and Europe, but not available after Dec 2020)

Baidu pan

https://pan.baidu.com/s/1Jm2nAW-UhsjXU6JYoA07LA access code: wcws (Good for Asia access, but UI only has Chinese)

These traces are split into smaller files of 1000000000 lines each (smaller for SNIA) and compressed with zstd, so a file named clusterN.0.zst contains the first 1000000000 requests of cluster N.

Feel free to contact us if you have problems downloading the traces.


Choice of traces for different evaluations

For different evaluation purposes, we recommend the following clusters/workloads:

  • miss ratio related (admission, eviction): cluster52, cluster17 (low miss ratio), cluster18 (low miss ratio), cluster24, cluster44, cluster45, cluster29.

  • write-heavy workloads: cluster12, cluster15, cluster31, cluster37.

  • TTL-related: mix of small and large TTLs: cluster52, cluster22, cluster25, cluster11; small TTLs only: cluster18, cluster19, cluster6, cluster7.

  • others: feel free to contact us if you are looking for a trace for a specific purpose.


More information about each workload is included under stat/

We release computed statistics for each cluster workload under stat/. The table includes the fields below; each field is the mean value of the metric, taken either from production or from the traces.

The fields include production miss ratio, workload category (1: storage, 2: computation, 3: transient item), key size, value size, request rate, mean object frequency, one-hit-wonder ratio (%), compulsory miss ratio (%), common TTLs, working set size, operations, Zipf alpha.
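
To make a few of these fields concrete, here is a rough sketch of how they could be recomputed from a sequence of requested keys. The definitions used are common ones and are assumptions here: one-hit-wonder ratio as the share of objects requested exactly once, and compulsory miss ratio as unique objects divided by total requests. The files under stat/ may use different definitions, for example by excluding write requests. `object_stats` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable


def object_stats(keys: Iterable[str]) -> dict:
    """Compute per-object statistics from a stream of requested keys."""
    counts = Counter(keys)              # requests per object
    n_requests = sum(counts.values())
    n_objects = len(counts)
    if n_requests == 0:
        raise ValueError("empty trace")
    one_hit = sum(1 for c in counts.values() if c == 1)
    return {
        # average number of requests per distinct object
        "mean_object_frequency": n_requests / n_objects,
        # share of objects that are requested exactly once (assumed definition)
        "one_hit_wonder_ratio_pct": 100.0 * one_hit / n_objects,
        # first access to each object always misses (assumed definition)
        "compulsory_miss_ratio_pct": 100.0 * n_objects / n_requests,
    }


if __name__ == "__main__":
    print(object_stats(["a", "b", "a", "c", "a", "b"]))
```

The counter keeps one entry per distinct key, so memory use grows with the working set size; for the full traces this may call for processing one cluster at a time.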


Misc

  • Please join our discussion channel for questions and updates.
  • We provide a trace bibliography of papers that have used and/or analyzed the traces, and encourage anybody who publishes one to add it to the bibliography by creating an issue or pull request on GitHub.

Acknowledgement

We thank Carnegie Mellon University PDL, SNIA and Storj for hosting the traces.

License

The data and trace documentation are made available under the Creative Commons CC-BY license. By downloading or using them, you agree to the terms of this license.

cache-trace's Issues

.sort or not

When I was downloading the traces, I noticed some files have a .sort tag and some do not. Is there much difference between a .sort file and one that is not? Please let me know the difference.

CMU FTP returns 403 Forbidden

problem
The URL https://ftp.pdl.cmu.edu/pub/datasets/twemcacheWorkload/open_source is no longer accessible (it used to work in the past).

questions

  1. What's the best way to download this dataset (in the US northeast)?
  2. Could we get checksums of each file (e.g. a sha256)?

Recommended tools for replay

Hi,

Thanks for the exciting work and traces.

I wonder if you can recommend a tool to replay these traces.

Thank you so much for your insights and time!

looking for special traces or scripts

I want to do some research on caching with low locality (meaning large reuse distances), but I have not found suitable traces.

So I want to ask: are there traces of this kind in this project?

Or are there any scripts to produce L2 traces by filtering through an L1 cache?

Thank you for your time.

[Question] Can you share more details about namespace delimiters?

Hi!

Note that during key anonymization, we preserve the namespaces, for example, if the anonymized key is nz:u:eeW511W3dcH3de3d15ec, the first two fields nz and u are namespaces, note that the namespaces are not necessarily delimited by :, different workloads use different delimiters with different number of namespaces.

I've already found several different special characters, and I have no idea whether they are the delimiters or not.
Can you share more details about namespace delimiters? like some regex?
It will help me a lot!

Thanks so much!!

Uncompressed trace files

Is it possible to put the uncompressed trace files on ftp.pdl.cmu.edu or somewhere else? That way I can use the HTTP Range header to download part of a file and send it to the cache. For example, with the Meta KV trace, I can use curl -v -r 0-5000 'https://cachelib-workload-sharing.s3-us-west-2.amazonaws.com/pub/kvcache/202206/kvcache_traces_1.csv' to get the first 5001 bytes (bytes 0 through 5000).
