when2pickle

Answering the question: when is the right time to use the Python pickle module?

Introduction

The Python pickle module lets you serialize almost any Python object, for example to a file on disk. It comes with two important caveats:

  1. The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
  2. Pickled objects are not guaranteed to be usable if you change your source code, for instance if you release an update. (This is actually true of all serialization formats, but some of them make it easier to deal with than others.)
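To make the first caveat concrete, here is a minimal sketch of why loading untrusted pickle data is dangerous. The class name `Innocent` is hypothetical, and a harmless arithmetic expression stands in for whatever an attacker would really choose to run.

```python
import pickle


class Innocent:
    """Hypothetical class: looks like data, but controls its own reconstruction."""

    def __reduce__(self):
        # pickle records "call this callable with these arguments" and
        # replays that call at load time. The expression here is harmless;
        # an attacker could pick any importable callable instead.
        return (eval, ("6 * 7",))


payload = pickle.dumps(Innocent())

# Loading does not give back an Innocent instance: it runs eval("6 * 7")
# and returns whatever that call returns.
result = pickle.loads(payload)
print(type(result).__name__, result)
```

The person who produced the bytes decides what gets called, not the code that loads them. That is why pickle files should only come from a source you control.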

Please make sure you understand these two caveats. I completely agree with them.

Having said that, some people think these caveats mean you should never use pickle, and there I disagree. Yes, pickle can be misused, but that doesn't mean it's impossible or even all that hard to use it well. If you think that it's not worth putting any thought into the tool you use, just go ahead and avoid pickle at all costs to make sure you're not misusing it. If you're willing to think a little bit, however, here are the advantages pickle has over other serialization options in Python:

  1. It's fast.
  2. It's built in, so it's available everywhere.
  3. It requires you to write less custom serialization code. Less code means fewer chances for errors. Thus pickle, when used correctly, is less error-prone than other options used correctly.

Therefore I claim that if you're writing a Python program that requires some non-trivial state that can be reconstructed if needed, and you care about speed or about avoiding bugs, then pickle is a good tool to consider.

This repository contains examples of problems where I think it's appropriate to consider pickle. For each problem, I implement several different solutions using different serialization modules. I'll measure runtime benchmarks, as well as give my subjective opinion on the additional code overhead required to use different solutions.
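As a rough illustration of what such a runtime comparison can look like, here is a sketch using only the standard library. It is not the repository's benchmark harness: the `state` object, the `time_loads` helper, and the choice of standard error as the uncertainty are all assumptions made for this example.

```python
import json
import pickle
import statistics
import timeit

# Hypothetical stand-in for "non-trivial state": a dict of nested dicts.
state = {str(i): {"values": list(range(20)), "label": "x" * 10} for i in range(500)}


def time_loads(loads, blob, repeat=5, number=20):
    """Return (mean, standard error) of the per-call load time in seconds."""
    runs = timeit.repeat(lambda: loads(blob), repeat=repeat, number=number)
    per_call = [t / number for t in runs]
    mean = statistics.mean(per_call)
    stderr = statistics.stdev(per_call) / len(per_call) ** 0.5
    return mean, stderr


# Each candidate pairs a deserializer with data serialized in its own format.
candidates = {
    "json": (json.loads, json.dumps(state)),
    "pickle": (pickle.loads, pickle.dumps(state)),
}

results = {name: time_loads(loads, blob) for name, (loads, blob) in candidates.items()}
for name, (mean, stderr) in sorted(results.items()):
    print("{:8s} {:.6f} s +/- {:.6f} s".format(name, mean, stderr))
```

The numbers depend heavily on the machine and on the shape of the data, so treat any single run as indicative only.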

All code is written in the latest version of Python available on Ubuntu (currently Python 3.5). Benchmarks are run on some laptop running Ubuntu.

Example: hyphenation

Problem description

Write a Python script that adds hyphenation breaks to English text, so that you know where you can break a word for hyphenation. Use the algorithm created by Frank Liang, for which there is a Python implementation by Ned Batchelder.

The algorithm involves parsing a large list of strings to build an object that contains two data structures, one of which is a trie-like thing consisting of nested dicts. Instead of parsing the strings and rebuilding the object each time the script is run, consider serializing the object (or its data structures).
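The idea can be sketched as follows. This is a simplified illustration, not the code in this repository: the pattern list is a tiny made-up sample, and `build_trie` and `load_or_build` are hypothetical helper names.

```python
import os
import pickle
import re
import tempfile

# Hypothetical, tiny stand-in for the real (much larger) pattern list.
PATTERNS = ["hy3ph", "he2n", "1na", "n2at"]


def build_trie(patterns):
    """Parse Liang-style patterns into a trie of nested dicts."""
    tree = {}
    for pattern in patterns:
        chars = re.sub(r"[0-9]", "", pattern)
        # One number per gap between letters; a missing digit means 0.
        points = [int(d or 0) for d in re.split(r"[.a-z]", pattern)]
        node = tree
        for c in chars:
            node = node.setdefault(c, {})
        node[None] = points  # the None key marks the end of a pattern
    return tree


def load_or_build(cache_path, patterns):
    """Load the trie from a local pickle cache, rebuilding it if missing."""
    try:
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        tree = build_trie(patterns)
        with open(cache_path, "wb") as f:
            pickle.dump(tree, f, protocol=pickle.HIGHEST_PROTOCOL)
        return tree


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "patterns.pickle")
    first = load_or_build(cache, PATTERNS)   # parses and writes the cache
    second = load_or_build(cache, PATTERNS)  # reads the cache back
    print(first == second)
```

Because the cache is a file the script wrote itself and can always regenerate from the patterns, both caveats from the introduction are manageable here: the data is trusted, and a stale cache can simply be deleted.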

Serialization formats evaluated

  1. No serialization (baseline)
  2. json
  3. pickle
  4. msgpack
  5. yaml

Results

Hyphenation runtimes

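| format   | runtime (s) | uncertainty (s) | file size (kB) |
|----------|-------------|-----------------|----------------|
| baseline | 0.0670      | 0.00083         | 34             |
| json     | 0.0556      | 0.00085         | 185            |
| pickle   | 0.0459      | 0.00174         | 173            |
| msgpack  | 0.0583      | 0.00059         | 64             |
| yaml     | 4.2091      | 0.04437         | 254            |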

Verdict

For this problem, serialization is probably not crucial. The baseline implementation without serialization already runs fairly fast. You can probably get away without serialization here, unless you care a lot about speed.

However, if you do care about speed and decide to serialize, pickle is the clear winner. It's the fastest option, about 32% faster than baseline, and requires zero additional code. json and msgpack are slower than pickle (only 17% and 13% faster than baseline, respectively), and both require tweaking the internals of the hyphenate object. It's not a lot of additional code, but it's the kind of code I hate to write. It's unpythonic: I had to use isinstance to check the types in the deserialized data.

Furthermore, both json and msgpack failed silently before I added this tweak. Each one acted as if it had run correctly, but no hyphenation breaks were actually added. Of course you should always test your code, but be extra vigilant if you're serializing Python objects with json or msgpack.
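Here is one way a silent failure of this kind can arise with json. Whether this is exactly what happened in the hyphenation code is an assumption; the trie fragment and the `lookup` helper below are made up for illustration. JSON object keys are always strings, so a `None` key comes back as the string `"null"` without any error.

```python
import json
import pickle

# Hypothetical trie fragment: string keys for characters, and a None key
# marking the data stored at the end of a pattern.
trie = {"a": {"b": {None: [0, 1, 0]}}}

via_json = json.loads(json.dumps(trie))
via_pickle = pickle.loads(pickle.dumps(trie))


def lookup(tree, word):
    """Walk the trie and return the data stored under the None key, if any."""
    node = tree
    for ch in word:
        node = node.get(ch)
        if node is None:
            return None
    return node.get(None)


print(lookup(via_pickle, "ab"))  # the stored list survives the round trip
print(lookup(via_json, "ab"))    # nothing found: the key is now a string
print(list(via_json["a"]["b"]))  # shows the converted key
```

No exception is raised at any point. The lookup just stops finding anything, which matches the "ran fine but did nothing" symptom described above.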

yaml, like pickle, requires no additional code, but it's completely unusable here. It's far too slow to be worthwhile, around 60x slower than baseline.
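The percentages quoted in this verdict can be rechecked from the rounded values in the results table. The sketch below assumes "X% faster" means "X% less runtime than baseline"; the helper names are made up for this example.

```python
# Runtimes in seconds, copied from the results table.
runtimes = {
    "baseline": 0.0670,
    "json": 0.0556,
    "pickle": 0.0459,
    "msgpack": 0.0583,
    "yaml": 4.2091,
}
baseline = runtimes["baseline"]


def percent_less_runtime(runtime):
    """Runtime saved relative to baseline, as a percentage."""
    return 100.0 * (1.0 - runtime / baseline)


def times_baseline(runtime):
    """How many times longer than baseline a run takes."""
    return runtime / baseline


for name in ("pickle", "json", "msgpack"):
    print("{}: {:.1f}% less runtime than baseline".format(
        name, percent_less_runtime(runtimes[name])))
print("yaml: {:.1f}x the baseline runtime".format(times_baseline(runtimes["yaml"])))
```

Because the table values are rounded, the results can differ slightly from figures computed on the raw measurements.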

msgpack produces the smallest serialized file (64 kB), but that is still larger than the baseline's 34 kB.
