pickle's Introduction

Pickle - Java and .NET library for Python's pickle serialization protocol

Pickle is written by Irmen de Jong ([email protected]). This software is distributed under the terms written in the file LICENSE.

The pickle serialization protocol

This is a feature-complete implementation of the pickle protocol, Python's serialization format. You can read and write pickle files with it.

Pickle protocol version support: reading: 0, 1, 2, 3, 4, 5; writing: 2. The library can read all pickle protocol versions (0 to 5, which includes the out-of-band buffer additions made in Python 3.8). It always writes pickles in protocol version 2. There are no plans to add protocol version 1 support for writing. Protocols 3 and 4 contain some nice new features which may eventually be utilized (protocol 5 is quite obscure), but for now only version 2 is used.
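
As a rough illustration from the Python side (a sketch using only Python's standard `pickle` module, not this library's API), a pickle written with `protocol=2` begins with the PROTO opcode `\x80` followed by the version byte. The sample object is made up for the example, and protocol 5 assumes Python 3.8 or newer:

```python
import pickle

# Hypothetical sample data, only for illustration.
sample = {"name": "example", "values": [1, 2, 3]}

# Protocol 2 is the version this library writes; the stream starts with PROTO 2.
as_protocol_2 = pickle.dumps(sample, protocol=2)
has_proto_2_header = as_protocol_2[:2] == b"\x80\x02"

# Protocols 0 to 5 can all be produced by Python; each one round-trips to the same object.
round_trips = {p: pickle.loads(pickle.dumps(sample, protocol=p)) for p in range(6)}
```

A Python producer that wants to stay close to what this library itself writes could pass `protocol=2` explicitly, although any version from 0 to 5 should be readable.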

Size limitations

Unlike Python, where the length of strings and (byte)arrays is limited only by the available memory, Java and .NET impose a fixed maximum object size. On both platforms, strings and byte arrays are limited to 2^31 - 1 elements (about 2 gigabytes). This is a limitation of the underlying platform, not of the Pickle library. If an object in your pickle exceeds this limit, the code will crash with something like a NegativeArraySizeException, an OverflowException, or perhaps an out-of-memory error of some sort. Make sure in your own code that the size of the pickled objects does not exceed 2 gigabytes.
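
One way to guard against this on the producing side is to check lengths before pickling. The following is only a sketch in Python: the helper name `find_oversized` and its `limit` parameter are hypothetical, it does not handle self-referencing containers, and Python string lengths only approximate the lengths seen on the Java or .NET side.

```python
# Largest array or string length on the JVM and .NET, as stated above.
PLATFORM_MAX_LENGTH = 2**31 - 1


def find_oversized(obj, limit=PLATFORM_MAX_LENGTH, path="root"):
    """Return the paths of str/bytes/bytearray values longer than limit."""
    found = []
    if isinstance(obj, (str, bytes, bytearray)):
        if len(obj) > limit:
            found.append(path)
    elif isinstance(obj, dict):
        for key, value in obj.items():
            found.extend(find_oversized(value, limit, f"{path}[{key!r}]"))
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for index, item in enumerate(obj):
            found.extend(find_oversized(item, limit, f"{path}[{index}]"))
    return found
```

Calling `find_oversized(data)` before `pickle.dumps(data, protocol=2)` returns an empty list when nothing exceeds the limit; a smaller `limit` can be passed to try the check on small inputs.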

Type Mapping

Python to Java (unpickling)

The Unpickler simply returns an Object. Because Java is a statically typed language you will have to cast that to the appropriate type. Refer to this table to see what you can expect to receive.

| PYTHON | JAVA |
|---|---|
| None | null |
| bool | boolean |
| int | int |
| long | long or BigInteger (depending on size) |
| string | String |
| unicode | String |
| complex | net.razorvine.pickle.objects.ComplexNumber |
| datetime.date | java.util.Calendar |
| datetime.datetime | java.util.Calendar |
| datetime.time | net.razorvine.pickle.objects.Time |
| datetime.timedelta | net.razorvine.pickle.objects.TimeDelta |
| float | double (float isn't used) |
| array.array | array of appropriate primitive type (char, int, short, long, float, double) |
| list | java.util.List |
| tuple | Object[] |
| set | java.util.Set |
| dict | java.util.Map |
| bytes | byte[] |
| bytearray | byte[] |
| decimal | BigDecimal (except NaN which is mapped to Double.NaN) |
| custom class | Map<String, Object> (dict with class attributes including its name in "__class__") |
| Pyro4.core.URI | net.razorvine.pyro.PyroURI |
| Pyro4.core.Proxy | net.razorvine.pyro.PyroProxy |
| Pyro4.errors.* | net.razorvine.pyro.PyroException |
| Pyro4.utils.flame.FlameBuiltin | net.razorvine.pyro.FlameBuiltin |
| Pyro4.utils.flame.FlameModule | net.razorvine.pyro.FlameModule |
| Pyro4.utils.flame.RemoteInteractiveConsole | net.razorvine.pyro.FlameRemoteConsole |

Java to Python (pickling)

| JAVA | PYTHON |
|---|---|
| null | None |
| boolean | bool |
| byte | int |
| char | str/unicode (length 1) |
| String | str/unicode |
| double | float |
| float | float |
| int | int |
| short | int |
| BigDecimal | decimal |
| BigInteger | long |
| any array | array if elements are primitive type (else tuple) |
| Object[] | tuple (cannot contain self-references) |
| byte[] | bytearray |
| java.util.Date | datetime.datetime |
| java.util.Calendar | datetime.datetime |
| java.sql.Date | datetime.date |
| java.sql.Time | datetime.time |
| java.sql.Timestamp | datetime.datetime |
| Enum | the enum value as string |
| java.util.Set | set |
| Map, Hashtable | dict |
| Vector, Collection | list |
| Serializable | treated as a JavaBean, see below. |
| JavaBean | dict of the bean's public properties + __class__ for the bean's type. |
| net.razorvine.pyro.PyroURI | Pyro4.core.URI |
| net.razorvine.pyro.PyroProxy | cannot be pickled. |

Python to .NET (unpickling)

The unpickler simply returns an object. Because C# is a statically typed language, you will have to cast it to the appropriate type. Refer to this table to see what you can expect to receive. Tip: you can use the 'dynamic' type in some places to avoid excessive type casting.

| PYTHON | .NET |
|---|---|
| None | null |
| bool | bool |
| int | int |
| long | long (c# doesn't have BigInteger so there's a limit on the size) |
| string | string |
| unicode | string |
| complex | Razorvine.Pickle.Objects.ComplexNumber |
| datetime.date | DateTime |
| datetime.datetime | DateTime |
| datetime.time | TimeSpan |
| datetime.timedelta | TimeSpan |
| float | double |
| array.array | array (all kinds of element types supported) |
| list | ArrayList (of objects) |
| tuple | object[] |
| set | HashSet |
| dict | Hashtable (key=object, value=object) |
| bytes | ubyte[] |
| bytearray | ubyte[] |
| decimal | decimal (except NaN which is mapped to double.NaN) |
| custom class | IDictionary<string, object> (dict with class attributes including its name in "__class__") |
| Pyro4.core.URI | Razorvine.Pyro.PyroURI |
| Pyro4.core.Proxy | Razorvine.Pyro.PyroProxy |
| Pyro4.errors.* | Razorvine.Pyro.PyroException |
| Pyro4.utils.flame.FlameBuiltin | Razorvine.Pyro.FlameBuiltin |
| Pyro4.utils.flame.FlameModule | Razorvine.Pyro.FlameModule |
| Pyro4.utils.flame.RemoteInteractiveConsole | Razorvine.Pyro.FlameRemoteConsole |

.NET to Python (pickling)

| .NET | PYTHON |
|---|---|
| null | None |
| boolean | bool |
| byte | byte |
| sbyte | int |
| char | str/unicode (length 1) |
| string | str/unicode |
| double | float |
| float | float |
| int/short/sbyte | int |
| uint/ushort/byte | int |
| decimal | decimal |
| byte[] | bytearray |
| primitivetype[] | array |
| object[] | tuple (cannot contain self-references) |
| DateTime | datetime.datetime |
| TimeSpan | datetime.timedelta |
| Enum | just the enum value as string |
| HashSet | set |
| Map, Hashtable | dict |
| Collection | list |
| Enumerable | list |
| object with public properties | dictionary of those properties + __class__ |
| anonymous class type | dictionary of the public properties |
| Razorvine.Pyro.PyroURI | Pyro4.core.URI |
| Razorvine.Pyro.PyroProxy | cannot be pickled. |

pickle's People

Contributors

dependabot[bot], irmen, shaltielshmid

pickle's Issues

Should `put_long` use LONG1 encoding for values larger than Integer.MAX_VALUE?

Pickler's put_long method currently falls back on the text-based INT encoding if the long value is too large to be represented as a 4-byte signed integer.

Instead, I'm wondering whether it should use the LONG1 encoding and write it as an 8-byte signed integer. Since this method's parameter is a long I think all of the values should fit in a LONG1. My understanding is that LONG1 should be more time- and space-efficient for these values. Pyrolite already uses LONG1 encoding when writing BigIntegers.


If I use Pyrolite to do pickler.dumps(9223372036854775807L) (which is Long.MAX_VALUE), pickletools disassembles the result as:

    0: \x80 PROTO      2
    2: I    INT        9223372036854775807
   23: .    STOP
highest protocol among opcodes = 2

This matches Python 2.7's behavior.

In contrast, Python 3.7 pickles this value using LONG1 (which requires nearly half the space):

>>> pickletools.dis(pickle.dumps(9223372036854775807, protocol=2))
    0: \x80 PROTO      2
    2: \x8a LONG1      9223372036854775807
   12: .    STOP
highest protocol among opcodes = 2
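
The size difference can be reproduced with Python's standard `pickle` module alone. This is a sketch, not Pyrolite code: the first stream is built by hand to mirror the text-based INT listing above, and the second is whatever Python 3 produces for the same value.

```python
import pickle

value = 9223372036854775807  # Long.MAX_VALUE

# Hand-built protocol 2 stream using the text-based INT opcode ('I'),
# mirroring the first disassembly above: PROTO 2, INT <digits>\n, STOP.
text_encoded = b"\x80\x02I" + str(value).encode("ascii") + b"\n."

# Python 3 picks the binary LONG1 opcode (0x8a) for the same value.
long1_encoded = pickle.dumps(value, protocol=2)

# Both streams decode to the same integer; only their lengths differ.
sizes = (len(text_encoded), len(long1_encoded))
```

Here `sizes` comes out as 24 and 13 bytes, matching the STOP offsets (23 and 12) in the two disassemblies.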

Decimal('NaN') is unsupported in Java unpickling

Looks like decimal('NaN') is unsupported in net.razorvine.pickle:

Python pickle

# float('NaN') can be pickled and loaded
>>> pickled = cloudpickle.dumps(float('NaN'))
>>> pickle.loads(pickled)
nan

# Decimal('NaN') can be pickled and loaded
>>> pickled = cloudpickle.dumps(decimal.Decimal('NaN'))
b'\x80\x05\x95!\x00\x00\x00\x00\x00\x00\x00\x8c\x07decimal\x94\x8c\x07Decimal\x94\x93\x94\x8c\x03NaN\x94\x85\x94R\x94.'
>>> pickle.loads(pickled)
Decimal('NaN')

Java net.razorvine.pickle

float('NaN') can be loaded as expected

scala> unpickle.loads(PickleUtils.str2bytes("\u0080\u0005\u0095\n\u0000\u0000\u0000\u0000\u0000\u0000\u0000G\u007f\u00f8\u0000\u0000\u0000\u0000\u0000\u0000."))

Object = NaN

But Decimal('NaN') can NOT be loaded; it fails with java.lang.reflect.InvocationTargetException

scala> unpickle.loads(PickleUtils.str2bytes("\u0080\u0005\u0095!\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u008c\u0007decimal\u0094\u008c\u0007Decimal\u0094\u0093\u0094\u008c\u0003NaN\u0094\u0085\u0094R\u0094."))
net.razorvine.pickle.PickleException: problem construction object: java.lang.reflect.InvocationTargetException
  at net.razorvine.pickle.objects.AnyClassConstructor.construct(AnyClassConstructor.java:29)
  at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:773)
  at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:213)
  at net.razorvine.pickle.Unpickler.load(Unpickler.java:123)
  at net.razorvine.pickle.Unpickler.loads(Unpickler.java:136)
  ... 48 elided

After some investigation, I found that, as pointed out in https://github.com/irmen/pickle/blob/master/README.md#python-to-java-unpickling , decimal is mapped to Java's BigDecimal, but according to the Java documentation a BigDecimal cannot be NaN.

Do you have any idea how to fix it? Thanks!

Related: SPARK-36000 cc @HyukjinKwon @xinrong-databricks
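
Until the unpickling side handles this case, one possible workaround is to replace Decimal NaN values before pickling. The sketch below is a sender-side idea in plain Python, not the library's fix; the helper name `replace_decimal_nan` and the sample row are hypothetical.

```python
import math
import pickle
from decimal import Decimal


def replace_decimal_nan(value):
    """Map Decimal NaN to float NaN; leave every other value untouched."""
    if isinstance(value, Decimal) and value.is_nan():
        return float("nan")
    return value


# Hypothetical row containing one ordinary decimal and one NaN.
row = [Decimal("1.50"), Decimal("NaN")]
payload = pickle.dumps([replace_decimal_nan(v) for v in row], protocol=2)
restored = pickle.loads(payload)
```

After this substitution the stream carries an ordinary float NaN, which the report above shows loading as expected.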

Cannot load a compressed file

I'm using pandas to save a dataframe with 'xz' compression.

df.to_pickle("abc.pickle", compression="xz")

and the dotnet version threw an exception while loading it.

Unhandled exception. Razorvine.Pickle.InvalidOpcodeException: invalid pickle opcode: 253
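
The opcode value 253 is 0xFD, which is the first byte of the xz container magic, so the unpickler appears to be reading the compressed container rather than pickle data. A minimal Python sketch of the idea (decompress first, then unpickle) follows; the helper name `read_maybe_xz_pickle` is hypothetical, and it assumes the file is a standard xz stream, which Python's `lzma` module produces by default.

```python
import lzma
import pickle

# The six magic bytes at the start of every xz stream.
XZ_MAGIC = b"\xfd7zXZ\x00"


def read_maybe_xz_pickle(raw):
    """Return plain pickle bytes, decompressing first if raw is an xz stream."""
    if raw.startswith(XZ_MAGIC):
        return lzma.decompress(raw)
    return raw


# Simulate a compressed pickle file in memory.
compressed = lzma.compress(pickle.dumps({"a": 1}, protocol=2))
first_byte = compressed[0]  # this is the byte reported as an invalid opcode
plain = read_maybe_xz_pickle(compressed)
```

On the .NET side the equivalent step would be to run the file through an xz decompressor before handing the bytes to the unpickler.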

Performance is slow

We're trying to load an IMDb dataset which is packed in pickle format.
The loading sample is in the TensorFlow.Keras repo.

The specific slow code is:

(NDArray, NDArray) LoadX(byte[] bytes)
{
    var x = np.Load_Npz<int[,]>(bytes);
    return (x["x_train.npy"], x["x_test.npy"]);
}

You can use this code snippet to reproduce the performance issue:

var cfg = new TransformerClassificationConfig();
var dataloader = new IMDbDataset(cfg); //the dataset is initially downloaded at TEMP dir, e.g., C:\Users\{user name}\AppData\Local\Temp\imdb\imdb.npz
var dataset = dataloader.GetData();

Limitation on pickle size?

Hello,

I'm using sklearn2jpmml using this pickle library and facing the following error:

Exception in thread "main" java.lang.NegativeArraySizeException: -1394647443
    at net.razorvine.pickle.PickleUtils.readbytes(PickleUtils.java:54)
    at net.razorvine.pickle.Unpickler.load_binbytes(Unpickler.java:510)
    at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:315)
    at org.jpmml.python.PickleUtil$1.dispatch(PickleUtil.java:83)
    at net.razorvine.pickle.Unpickler.load(Unpickler.java:109)
    at org.jpmml.python.PickleUtil.unpickle(PickleUtil.java:104)
    at org.jpmml.sklearn.Main.run(Main.java:173)
    at org.jpmml.sklearn.Main.main(Main.java:161)

It apparently happens when the model is "too big". Are there known limitations on the size, or is this a bug?
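
The negative number in the exception is consistent with a length field above 2^31 - 1 being read into a signed 32-bit integer. The Python sketch below reinterprets the reported value; it assumes the field was a 4-byte little-endian unsigned length, which is an inference from the `load_binbytes` frame in the stack trace rather than something confirmed by the library source.

```python
import struct

# Array size reported in the NegativeArraySizeException above.
reported = -1394647443

# Reinterpret the same four bytes as an unsigned 32-bit length.
raw_length_field = struct.pack("<i", reported)
actual_length = struct.unpack("<I", raw_length_field)[0]

# True when the object is larger than a Java array can be.
exceeds_limit = actual_length > 2**31 - 1
```

Under that assumption the pickled bytes object is 2,900,319,853 bytes long (about 2.7 GiB), which is beyond the platform limit described in the README's size limitations section.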

Unpickling a binary file created by Python 2

It is known that a pickle file created by Python 2 may fail to load in Python 3 using "pickle.load".
Could your solution unpickle a binary file created by Python 2 that involves NumPy arrays?
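
For context on the Python 3 behaviour mentioned here, the sketch below uses a hand-built protocol 2 stream holding a Python 2 style byte string with one non-ASCII byte. It only shows how Python 3's own `pickle.loads` treats such data; it says nothing about how this library handles NumPy arrays.

```python
import pickle

# Hand-built stream: PROTO 2, SHORT_BINSTRING of length 4 ('caf' plus byte 0xe9), STOP.
# Python 2 writes its 8-bit str type with this opcode.
py2_pickle = b"\x80\x02U\x04caf\xe9."

# The default ASCII decoding fails on the non-ASCII byte.
try:
    pickle.loads(py2_pickle)
    default_failed = False
except UnicodeDecodeError:
    default_failed = True

# Explicit encodings succeed: latin1 maps every byte to a character,
# while "bytes" keeps the raw bytes unchanged.
as_text = pickle.loads(py2_pickle, encoding="latin1")
as_bytes = pickle.loads(py2_pickle, encoding="bytes")
```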

Unpickle failed on __builtin__ bytes with no arguments

Got an exception saying that constructor needs 1 or 2 arguments but got 0 on load_reduce of a byte array.
Pickle file is as follows:
    0: \x80 PROTO      2
    2: }    EMPTY_DICT
    3: q    BINPUT     0
    5: (    MARK
    6: X    BINUNICODE 'modules/ATL.rpymc'
   28: q    BINPUT     1
   30: ]    EMPTY_LIST
   31: q    BINPUT     2
   33: J    BININT     1111638641
   38: J    BININT     1111622394
   43: c    GLOBAL     '__builtin__ bytes'
   62: q    BINPUT     3
   64: )    EMPTY_TUPLE
   65: R    REDUCE
   66: q    BINPUT     4
   68: \x87 TUPLE3
   69: q    BINPUT     5
   71: a    APPEND
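
For comparison, Python 3 accepts the same GLOBAL / EMPTY_TUPLE / REDUCE sequence and produces an empty bytes object. The stream below is a hand-built reduction of the listing above to just those three opcodes, so it is an illustration rather than the original file.

```python
import pickle

# PROTO 2, GLOBAL '__builtin__ bytes', EMPTY_TUPLE, REDUCE, STOP.
stream = b"\x80\x02c__builtin__\nbytes\n)R."

# Python 3 maps the Python 2 module name __builtin__ to builtins and
# then calls bytes() with no arguments.
result = pickle.loads(stream)
```

Here `result` is `b''`, which suggests that a zero-argument construction could reasonably map to an empty byte array.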

Upper unicode points not supported

net.razorvine.pickle.PickleException: invalid escape sequence char 'U' in string "seth rollins \U0001f455 [...]" (possibly truncated)

Python pickles any unicode character above \uffff as an eight-digit \UXXXXXXXX escape (for example \U0001f455).
decode_unicode_escaped() doesn't currently support case 'U' and throws. Can this support be added?
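
The escape can be reproduced with a protocol 0 pickle in Python, and decoding it takes one extra case for the eight-digit form. The decoder below is a simplified Python sketch of that idea (the function name is hypothetical and it handles only the `\u` and `\U` forms); it is not the library's `decode_unicode_escaped()` implementation.

```python
import pickle
import re

# Protocol 0 writes text with raw-unicode-escape, so characters above
# \uffff appear as a backslash, 'U', and eight hex digits.
protocol_0 = pickle.dumps("\U0001f455", protocol=0)

# Matches \uXXXX (4 hex digits) and \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r"\\(u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})")


def decode_raw_unicode_escapes(text):
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the characters they name."""
    return _ESCAPE.sub(lambda match: chr(int(match.group(1)[1:], 16)), text)


decoded = decode_raw_unicode_escapes("seth rollins \\U0001f455")
```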

Load in JSON

I'm trying to load a large series of .sav files using your NuGet package, but my attempts ended in this error.

expected zero arguments for construction of ClassDict (for numpy.dtype). This happens when an unsupported/unregistered class is being unpickled that requires construction arguments. Fix it by registering a custom IObjectConstructor for this class

Is it possible to set a default translation in a KeyValuePair<string, string> for the unregistered/unsupported classes, or add a specific constructor for numpy.dtype class?
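
To find out which classes would need a registered constructor, the pickle can be inspected from Python without loading it. This sketch uses the standard `pickletools.genops` iterator; the helper name `referenced_globals` is hypothetical, and it only looks at the GLOBAL opcode, so pickles written with protocol 4 or later (which use STACK_GLOBAL) would need extra handling.

```python
import pickle
import pickletools
from decimal import Decimal


def referenced_globals(data):
    """List the 'module name' pairs named by GLOBAL opcodes, in first-seen order."""
    names = []
    for opcode, arg, _position in pickletools.genops(data):
        if opcode.name == "GLOBAL" and arg not in names:
            names.append(arg)
    return names


# Small stand-in pickle; a real .sav file would be read from disk instead.
example = pickle.dumps([Decimal("1.5"), Decimal("2.5")], protocol=2)
classes = referenced_globals(example)
```

For the stand-in data `classes` is `['decimal Decimal']`; on a scikit-learn file the list would presumably include entries such as `numpy dtype`, each a candidate for a custom IObjectConstructor.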

DecimalConstructor fails when parsing decimal string using scientific notation

DecimalConstructor.cs is calling

Convert.ToDecimal(stringArg, CultureInfo.InvariantCulture)

So something like

System.Convert.ToDecimal("5E-1",CultureInfo.InvariantCulture)

will throw System.FormatException: Input string was not in a correct format.

This has bitten us when using Microsoft.Spark (which takes this library as a dependency). We build C# UDFs that run in Spark, which requires data to be serialized back and forth between the JVM and the CLR. In Spark we have some decimal values that Spark decides to represent in scientific notation, and when these get pickled into the CLR we hit the above issue.

I did a bit of research and using something like

Decimal.Parse("5E-1", NumberStyles.Any)

would work but is maybe too permissive. I think at a minimum we would need

Decimal.Parse("5E-1", NumberStyles.AllowLeadingSign | NumberStyles.AllowDecimalPoint | NumberStyles.AllowExponent)

but there might be others.
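
Scientific notation can also originate from Python itself, because `str()` of a Decimal switches to exponent form for very small or very large exponents. Where the producer is Python code under your control, a possible workaround is to send fixed-point text instead. This is a hedged sketch (the helper name is hypothetical), and it does not address values that Spark itself formats.

```python
from decimal import Decimal


def to_plain_decimal_string(value):
    """Format a Decimal in fixed-point notation, never with an exponent."""
    return format(value, "f")


tiny = Decimal("1E-7")
scientific = str(tiny)                 # exponent form, as str() renders it
plain = to_plain_decimal_string(tiny)  # fixed-point form of the same value
```

Here `scientific` is `'1E-7'` and `plain` is `'0.0000001'`; both parse back to the same Decimal value.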
