Comments (15)
Can you provide more details on how pickling/unpickling results in max recursion depth errors (maybe you have a .cxt file that you can share)?
In case this is meant for long-term storage, one could consider adding export/import using e.g. a json- or csv-based format.
from concepts.
From the pickle docs:

> Trying to pickle a highly recursive data structure may exceed the maximum recursion depth, a `RecursionError` will be raised in this case. You can carefully raise this limit with `sys.setrecursionlimit()`.

Couldn't pickle use a stack-based implementation to (un)pickle deeply linked object graphs?
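As a minimal illustration of the limitation quoted above (not using concepts itself): pickling a linked chain of plain Python objects descends one recursion level per link, so a chain deeper than `sys.getrecursionlimit()` fails:

```python
import pickle
import sys


class Node:
    """One link of a singly linked chain; pickling follows the references."""

    def __init__(self, nxt=None):
        self.nxt = nxt


# Build a chain noticeably deeper than the current recursion limit.
head = None
for _ in range(sys.getrecursionlimit() * 2):
    head = Node(head)

try:
    pickle.dumps(head)
    hit_limit = False
except RecursionError:
    hit_limit = True

print('RecursionError raised:', hit_limit)
```

Raising the limit with `sys.setrecursionlimit()` can make such a dump succeed, but for graphs hundreds of thousands of objects deep the interpreter's C stack may overflow first, which is presumably why a flat dict-based format is more robust.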
Prototype for long-term serialization and deserialization:

```python
from concepts import Context, EXAMPLE


def to_dict(lattice):
    return {
        'objects': lattice._context.objects,
        'properties': lattice._context.properties,
        'context': lattice._context._intents.index_sets(),
        'concepts': [(tuple(c._extent.iter_set()),
                      tuple(c._intent.iter_set()),
                      tuple(u.index for u in c.upper_neighbors),
                      tuple(l.index for l in c.lower_neighbors),
                      ) for c in lattice]
    }


def from_dict(d):
    from concepts.lattices import Lattice, Concept, Infimum, Supremum, Atom
    ctx = Context(d['objects'],
                  d['properties'],
                  [tuple(i in e for i in range(len(d['properties'])))
                   for e in map(set, d['context'])])
    l = object.__new__(Lattice)
    con = [Concept(l,
                   ctx._Extent.fromint(sum(1 << e for e in ex)),
                   ctx._Intent.fromint(sum(1 << i for i in in_)),
                   up, lo)
           for ex, in_, up, lo in d['concepts']]
    atoms = [con[i] for i in d['concepts'][0][2]]
    for index, cc in enumerate(con):
        cc.index = index
        cc.upper_neighbors = tuple(con[i] for i in cc.upper_neighbors)
        cc.lower_neighbors = tuple(con[i] for i in cc.lower_neighbors)
        cc.atoms = tuple(a for a in atoms
                         if cc._extent | a._extent == cc._extent)
    for dindex, cc in enumerate(sorted(con, key=lambda c: c._extent.longlex())):
        cc.dindex = dindex
    mapping = {c._extent: c for c in con}
    Lattice._annotate(ctx, mapping)
    l._context = ctx
    l._concepts = con
    l._map = mapping
    l.infimum = l._concepts[0]
    l.infimum.__class__ = Infimum
    l.supremum = l._concepts[-1]
    l.supremum.__class__ = Supremum
    for a in l.infimum.upper_neighbors:
        a.__class__ = Atom
    return l


l = Context.fromstring(EXAMPLE).lattice
d = to_dict(l)
ll = from_dict(d)
```
Format example (could use `json`, `pprint`+`ast.literal_eval`, `yaml`, etc. for transport/storage):

```python
{
    'objects': (
        '1sg', '1pl', '2sg', '2pl', '3sg', '3pl'
    ),
    'properties': (
        '+1', '-1', '+2', '-2', '+3', '-3', '+sg', '+pl', '-sg', '-pl'
    ),
    'context': [
        (0, 3, 5, 6, 9),
        (0, 3, 5, 7, 8),
        (1, 2, 5, 6, 9),
        (1, 2, 5, 7, 8),
        (1, 3, 4, 6, 9),
        (1, 3, 4, 7, 8)
    ],
    'concepts': [
        ((), (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), (1, 2, 3, 4, 5, 6), ()),
        ((0,), (0, 3, 5, 6, 9), (7, 8, 9), (0,)),
        ((1,), (0, 3, 5, 7, 8), (7, 10, 11), (0,)),
        ((2,), (1, 2, 5, 6, 9), (8, 12, 13), (0,)),
        ((3,), (1, 2, 5, 7, 8), (10, 12, 14), (0,)),
        ((4,), (1, 3, 4, 6, 9), (9, 13, 15), (0,)),
        ((5,), (1, 3, 4, 7, 8), (11, 14, 15), (0,)),
        ((0, 1), (0, 3, 5), (18, 19), (1, 2)),
        ((0, 2), (5, 6, 9), (16, 18), (1, 3)),
        ((0, 4), (3, 6, 9), (16, 19), (1, 5)),
        ((1, 3), (5, 7, 8), (17, 18), (2, 4)),
        ((1, 5), (3, 7, 8), (17, 19), (2, 6)),
        ((2, 3), (1, 2, 5), (18, 20), (3, 4)),
        ((2, 4), (1, 6, 9), (16, 20), (3, 5)),
        ((3, 5), (1, 7, 8), (17, 20), (4, 6)),
        ((4, 5), (1, 3, 4), (19, 20), (5, 6)),
        ((0, 2, 4), (6, 9), (21,), (8, 9, 13)),
        ((1, 3, 5), (7, 8), (21,), (10, 11, 14)),
        ((0, 1, 2, 3), (5,), (21,), (7, 8, 10, 12)),
        ((0, 1, 4, 5), (3,), (21,), (7, 9, 11, 15)),
        ((2, 3, 4, 5), (1,), (21,), (12, 13, 14, 15)),
        ((0, 1, 2, 3, 4, 5), (), (), (18, 19, 20, 16, 17))
    ]
}
```
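One practical caveat when using json for transport: json has no tuple type, so a round trip through `json.dumps`/`json.loads` turns every tuple in this format into a list. A small sketch of normalizing the shapes back, using a truncated toy instance of the format and a hypothetical `retuple()` helper (not part of the library):

```python
import json

# Truncated toy instance of the format above (two objects, two concepts).
d = {
    'objects': ('1sg', '1pl'),
    'properties': ('+1', '-1'),
    'context': [(0,), (1,)],
    'concepts': [((), (0, 1), (1,), ()),
                 ((0, 1), (), (), (0,))],
}

# After a json round trip, every tuple comes back as a list.
loaded = json.loads(json.dumps(d))
assert isinstance(loaded['objects'], list)


def retuple(d):
    """Restore the tuple shapes a loader like from_dict() above expects."""
    return {
        'objects': tuple(d['objects']),
        'properties': tuple(d['properties']),
        'context': [tuple(e) for e in d['context']],
        'concepts': [tuple(tuple(x) for x in c) for c in d['concepts']],
    }


restored = retuple(loaded)
print(restored == d)
```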
That response is more than I could hope for! Thanks so much.
I indeed found the same documentation on pickle. So far I have tried raising the recursion limit gradually up to 50,000, but without luck. Trying 100,000 now, but testing is very slow because I am testing with large concept lattices. FYI, I'm currently running FCA on 500 objects with 30 attributes, which results in a lattice with 9 atoms and 443,020 concepts. (But do note that pickling also fails with way smaller lattices.) I am also testing with the latest pickle protocol, because according to the documentation it is designed to deal with very large objects.
But the work you did above is definitely the way to go. I had already considered using the json format, but that solution required writing a custom format (like you elegantly did just now) and I didn't know how to proceed. I will try out your idea on a big lattice when I have time and report back to you!
Works like a charm! First I checked that lattices are equal to their reconstruction after deserialization. Then I tested performance on a big lattice: computing the lattice took me 4+ hours, while loading it afterwards from the json file took more or less 20 seconds.
I think this is a really cool feature to integrate into your package. For example, I'm writing a small "query by navigation" program for navigating lattices based on your package, and it is very inconvenient to have to recompute the lattice every time the program runs. Thanks for your effort!
Great, thanks. +1 that this will be a nice feature.
Will start to work on a branch to add this for the next version.
Considering adding `'ordered': True` to the dict, to rely/depend on the order of concepts and upper/lower neighbors (and otherwise to ensure the order via extra sorting).
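The idea of an `'ordered'` flag can be sketched as follows: if the producer guarantees canonical order, the loader trusts it and skips re-sorting. This is only an illustration of the dispatch; `iter_concepts` is a hypothetical name, and the sort key here is a stand-in (the library orders concepts via its own shortlex/longlex on bit sets):

```python
def iter_concepts(d):
    """Return concept tuples in canonical order, sorting only when needed."""
    concepts = d['concepts']
    if not d.get('ordered', False):
        # Stand-in for the canonical order: smaller extents first,
        # ties broken lexicographically by member indices.
        concepts = sorted(concepts, key=lambda c: (len(c[0]), c[0]))
    return concepts


unsorted = {'ordered': False,
            'concepts': [((0, 1), (), (), (0,)),
                         ((), (0, 1), (1,), ())]}
print(iter_concepts(unsorted)[0][0])  # the infimum's empty extent sorts first
```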
merged most of the implementation, see https://github.com/xflr6/concepts/compare/f0a7069153b487aea572bca133f984c5de111056..de2993a2df9d515a84883c19ab788cc274059d3b
Can you maybe check the (speed) difference between these two for a big lattice in `master`?

```python
c = concepts.Context(...)
assert c.lattice
c.tojson(f)

concepts.Context.fromjson(f)
# vs.
concepts.Context.fromjson(f, raw=True)
```
Nice that you merged it! I can test that, but I only have time next week to do so. Feel free to remind me if I have not responded after next week :-)
No rush, just curious :) I think I'm getting close: tell me if you have any concerns, especially about the API.
released concepts 0.9
Hi there, I ran a test on the biggest concept lattice I currently have available, with the following code:

```python
import timeit

setup = '''
from concepts import tools, contexts
import concepts
load = concepts.Context.fromjson
'''

print(min(timeit.Timer('load("lattice_sorted.json", raw=False)',
                       setup=setup).repeat(100, 1)))
```
I tested on a lattice with 143,906 concepts, once with `raw=False` and once with `raw=True`. This gave the following (fastest) run times:

- `raw=False`: 1.03071959 sec.
- `raw=True`: 1.03416534 sec.
However, I must make a note here. Your approach is to serialize the whole context plus lattice to json (which makes sense!), but in my previous work I used the code from here to serialize the lattice only.
I wanted to avoid recomputing my big lattice just in order to serialize it together with its context.
What I did instead was sort the json of the lattice on keys for the above testing.
I'm not sure if that is the only change you made, or whether the above results generalize to the code in `master`.
The specs of the pc on which I tested are:
- 8 GB RAM
- 3.10GHz i5-4440
Thanks. I should have noted that the released implementation differs from the prototype above in using `'lattice'` as the key for the concepts list instead of `'concepts'`, so to re-use the json file you created with the prototype you might need to rename `'concepts'` to `'lattice'` (judging from the numbers, you are probably only loading the context currently, hence there is no difference between the default `raw=False` and `raw=True`). To make this more explicit, you can add `require_lattice=True` to your `load()` call.
Btw, the order of the keys does not matter with either setting (see here and here for details).
Ah, that explains a lot. Here are the updated times, now with the best time over 10 repetitions rather than 100:

- `raw=False`: 28.72975 sec.
- `raw=True`: 354.92165 sec.
Hope that info is helpful :-)
Thanks, definitely helpful. That's consistent with my experiments.
I just wanted to confirm that relying on the order in the serialization significantly speeds things up (otherwise I would have considered making the `raw=True` behaviour the default and only code path). I think `raw=True` might be useful for calculating the lattice with other tools and converting it into the format here (but without the need to replicate the order used/expected by this library).