Comments (30)

gonzaponte commented on July 29, 2024

I have a demonstration of the failure in my fork (gonzaponte/issue149). I had to modify the test file so that it does not include event #0. Although I wrote a new test specifically to demonstrate the bug, all tests fail because of it. Since the new test is more exhaustive, I will rewrite the previous ones to work like the new one.

jacg commented on July 29, 2024

I had to modify the test file so it does not include event #0.

Would it be possible/better to do this by adding another test file? How big a new data file would you need to be able to write this test?

Please do a whitespace cleanup. pmaps_functions_c.pyx, for example, contains a number of spurious whitespace additions.

You have removed the initializations for evt and pk in cdf_to_dict. Is there any good reason for this? A quick search into uninitialized variables in Cython does not inspire confidence. Please initialize these variables, unless there's some good reason not to (which I find hard to imagine).

Please squash into two commits: one adding the tests in their final form, leaving the implementation alone; one implementing the solution, leaving the tests alone.

@jjgomezcadenas Please have a quick look at this, and comment on how this fits into the grand scheme of things. Also please comment on the desirability of adding a new test data file vs. modifying the existing one.

gonzaponte commented on July 29, 2024

Would it be possible/better to do this by adding another test file? How big a new data file would you need to be able to write this test?

It is possible, but maybe redundant. It would not be very large (a few tens or hundreds of kB).

Please do a whitespace cleanup. pmaps_functions_c.pyx, for example, contains a number of spurious whitespace additions.

Ok.

You have removed the initializations for evt and pk in cdf_to_dict. Is there any good reason for this? A quick search into uninitialized variables in Cython does not inspire confidence. Please initialize these variables, unless there's some good reason not to (which I find hard to imagine).

No good reason, it is just not necessary as they are always assigned when needed. Maybe initializing them to 0 can be misleading, but it is not a big deal. I will set them to 0.

Please squash into two commits: one adding the tests in their final form, leaving the implementation alone; one implementing the solution, leaving the tests alone.

Ok.

jacg commented on July 29, 2024

Maybe initializing them to 0 can be misleading, but it is not a big deal. I will set them to 0.

No. My bad. I looked at the code too hastily. You are initializing them before first use, so it's fine as it is. As you say, it might even be better this way.

jacg commented on July 29, 2024

This was bugging me, and it became clear that I wasn't going to get to sleep until I had fixed it ...

It turns out that, beautiful though it was, the xxx implementation (discussed in #158) based on Pandas groupby was two orders of magnitude slower than s12df_to_s12l.

It looks like Pandas groupby is fundamentally slow. There may be a way of achieving the desired effect with numpy.unique or something similar, but, for now, xxx has been reimplemented in Cython, forgetting any attempt to make the result contain small sections of the original dataframe rather than a pair of numpy arrays stuffed into a tuple.
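For orientation, here is a minimal sketch of the groupby idea (not the actual code from #158; it assumes the event/peak/time/ene column layout used throughout this thread):

import pandas as pd

def s12df_to_dict_groupby(df, max_events=None):
    """Group an S12 dataframe with (event, peak, time, ene) columns into
    {event_number: {peak_number: (times, energies)}} using Pandas groupby."""
    if max_events is not None and max_events >= 0:
        df = df[df.event < max_events]
    out = {}
    for (evt, pk), group in df.groupby(['event', 'peak']):
        out.setdefault(evt, {})[pk] = (group.time.values, group.ene.values)
    return out

# Tiny usage example with made-up numbers:
df = pd.DataFrame(dict(event=[0, 0, 1], peak=[0, 0, 0],
                       time=[1.0, 2.0, 5.0], ene=[10.0, 20.0, 7.0]))
print(s12df_to_dict_groupby(df))   # {0: {0: (array of times, array of energies)}, 1: {...}}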

The new xxx now passes exactly the same tests as s12df_to_s12l and the tests used to drive the development of xxx are now also being applied to s12df_to_s12l: all these tests are parametrized with both implementations.
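For illustration, parametrizing the same tests over both implementations could look roughly like this with pytest (the fixture name and the invariant checked here are my inventions, not the actual tests on the branch):

import pytest

import invisible_cities.reco.pmaps_functions   as pmapf
import invisible_cities.reco.pmaps_functions_c as pmapf_c

@pytest.mark.parametrize('df_to_dict',
                         (pmapf.s12df_to_s12l, pmapf_c.xxx),
                         ids=('python', 'cython'))
def test_total_energy_is_preserved(df_to_dict, KrMC_s1_dataframe):
    # KrMC_s1_dataframe: hypothetical fixture holding the S1 dataframe
    # read from database/test_data/KrMC_pmaps.h5
    s12l  = df_to_dict(KrMC_s1_dataframe)
    total = sum(E.sum()
                for peaks in s12l.values()
                for _, E  in peaks.values())
    assert total == pytest.approx(KrMC_s1_dataframe.ene.sum())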

As far as I can see so far, xxx has two advantages over s12df_to_s12l (and its cython core cdf_to_dict):

  • it is legible

  • it is 15% faster

The next step is to see how xxx stacks up against #149, and how it compares to the proposed solution.

In the long term we want to look at whether we want the dictionaries which are produced to contain namedtuples or dataframes, rather than the current tuple of numpy arrays representing the times and energies in each peak.

jacg commented on July 29, 2024

I have merged (yes! merged, don't worry, it's only temporary: it serves the purpose of keeping track of the alternatives until we decide which one we want) the implementation of xxx and the tests (which now run both on xxx and s12df_to_s12l) into the issue149 branch, and pushed it to my fork.

It seems that xxx passes the issue-149 tests out of the box. @gonzaponte please confirm that I haven't missed something, and that I am actually running the test that points out the bug, and not doing anything else stupid.

  • The proposed bugfix seems to slow down s12df_to_s12l a bit, so the performance advantage of xxx increases from 15% to about 20%.

  • The proposed bugfix shortens and clears up s12df_to_s12l quite a bit, but it still seems rather more cryptic than xxx. Then again, as the author of the latter I would say that, wouldn't I!

Are there any other problems or bugs pending with s12df_to_s12l?

As this is a function which is pivotal and has been problematic, we should make an effort to test it more thoroughly.

@neuslopez Could you please confirm that my timings make sense: I merely ran timeit on the s1 dataframe that comes out of database/test_data/KrMC_pmaps.h5.

Good night.

gonzaponte commented on July 29, 2024

As far as I can see so far, xxx has two advantages over s12df_to_s12l (and its cython core cdf_to_dict):

  • it is legible

  • it is 15% faster

Well, I don't find it much more readable. In s12df_to_s12l the workflow is clearer, imho. But of course, this is a matter of interpretation. Concerning performance, s12df_to_s12l carries a function call overhead because it is a python function calling the cython one. It could be written in the same fashion as yours (a pure cython function). I guess this structure exists because we want a pure python interface, @jjgomezcadenas?
In any case, I don't think performance is critical for this function, which will only be executed once per file.

It seems that xxx passes the issue-149 tests out of the box. @gonzaponte please confirm that I haven't missed something, and that I am actually running the test that points out the bug, and not doing anything else stupid.

I think it is ok!

Are there any other problems or bugs pending with s12df_to_s12l?

I don't think there are more bugs, but we can extend testing to include those that came up in issue #146.

neuslopez commented on July 29, 2024

Using branch issue149, see below the new timings using the new xxx (same Kr file):

  1. pf.s12df_to_s12l(S1df,10000): 10 loops, best of 3: 58.2 ms per loop
  2. cpm.xxx (S1df,10000): 10 loops, best of 3: 41.3 ms per loop

jacg commented on July 29, 2024

Concerning performance, s12df_to_s12l carries a function call overhead because it is a python function calling the cython one.

It does, but this experiment

python -m timeit -s 'f = lambda:1' 'f()'
10000000 loops, best of 3: 0.0858 usec per loop

python -m timeit -s 'f = lambda:1' 'f'  
100000000 loops, best of 3: 0.0142 usec per loop

shows that a pure-Python function call can be performed in 86 nanoseconds, while the lookup of the function's name takes 14, leaving no more than about 70 nanoseconds for the overhead of calling a pure-Python function. This tiny O(1) overhead cannot explain the (O(n)-ish) differences of 30 microseconds that I have been observing for small files, nor the 17 milliseconds that @neuslopez reports above for larger files.

It could be written in the same fashion as yours (a pure cython function). I guess this structure exists because we want a pure python interface

In this case I cannot see any advantages offered by the pure python interface wrapper. It just needlessly complicates and obfuscates the issue. I'm always happy to be educated, though.

In any case, the current version of cdf_to_dict is much better (even if it is a little slower) than the old one. The old version was written in write-only style, and I admire your courage and determination in having fixed the bug in it. I was not up to that task without rewriting it from scratch.

Tasks remaining:

  • Decide whether we prefer cdf_to_dict + s12df_to_s12l or xxx.
  • If we choose the former, combine it into a single Cython function.
  • Clean up and expand the tests, including any that emerge from #146.

Once that is done we could merge. Or we could discuss whether we really want dictionaries containing dictionaries, containing tuples, containing arrays; or whether we would like to replace the arrays-in-tuples with dataframes. But I think that the latter would be better discussed in a separate issue/PR. Some observations though:

  • After my experience with writing xxx, I am less keen on the dataframes: indexing them is very slow! But perhaps this doesn't matter in our usage?
  • This issue probably needs to be discussed with reference to #157.

Oh, and for cross-referencing purposes, this issue is addressed in PR #162.

(GitHub used to have the capability to turn issues into PRs. They got rid of it. Shame, as it seems very useful.)

gonzaponte commented on July 29, 2024

In this case I cannot see any advantages offered by the pure python interface wrapper. It just needlessly complicates and obfuscates the issue

The only "advantage" I see is that gathering all the functions in a single file (pmaps_functions.py) allows you to forget about the different implementations (python or cython). For example, in your fork you have:

import invisible_cities.reco.pmaps_functions   as pmapf
import invisible_cities.reco.pmaps_functions_c as pmapf_c
pmaps_read_parm = ('df_to_dict',
                   (pmapf.s12df_to_s12l,
                    pmapf_c.xxx))

The following may be nicer to look at:

import invisible_cities.reco.pmaps_functions   as pmapf
pmaps_read_parm = ('df_to_dict',
                   (pmapf.s12df_to_s12l,
                    pmapf.xxx))

This can still be addressed by creating aliases or using a from statement in the python module.

Once that is done we could merge. Or we could discuss whether we really want dictionaries containing dictionaries, containing tuples, containing arrays; or whether we would like to replace the arrays-in-tuples with dataframes. But I think that the latter would be better discussed in a separate issue/PR. Some observations though:

  • After my experience with writing xxx, I am less keen on the dataframes: indexing them is very slow! But perhaps this doesn't matter in our usage?
  • This issue probably needs to be discussed with reference to #157.

Agreed. JJ wants to use DFs because of the statistical and correlation properties they provide. After using the current tuple-of-arrays structure in a small analysis I don't see any advantage in the DF functionality, but my imagination is limited and future analyses may need it.

jjgomezcadenas commented on July 29, 2024

jacg commented on July 29, 2024

The only "advantage" I see is that gathering all functions in a single file (pmaps_functions.py) allows you to forget the different implementations (either python or cython).

Totally agreed. As far as I am concerned, all the Cython modules are low-level implementations of the Python modules, and I would rather not see them imported directly in client code at all. (IIRC, @jjgomezcadenas disagrees.) But the solution is not to implement pure-Python wrappers containing arbitrary subsets of the functionality, as in the case of s12df_to_s12l (I have grown to hate that name, BTW). No, one should simply import the Cython implementation into the pure python module. In this case, inside pmaps_functions.py we put

from invisible_cities.reco.pmaps_functions_c import xxx as s12df_to_s12l

and we're done. No need to write a wrapper around it. And indeed, you seem to acknowledge this in your comment.

JJ wants to use DFs because of the statistical and correlation properties they provide.

Seems jolly sensible to me. As long as the huge indexing overhead of dataframes doesn't prevent us from doing other things in a reasonable amount of time.

the current tuple-of-arrays

I really would like those to be named tuples, though. I'll implement that tomorrow morning, unless someone beats me to it. It should take all of 30 seconds. Hmm, I might just do it now ... OK, I've pushed it to my fork on the issue149 branch. (No test for accessing time and energy by name yet, though!)
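For concreteness, a minimal sketch of the kind of named tuple meant here (the type and field names are my assumptions, not necessarily what was pushed to the branch):

from collections import namedtuple

import numpy as np

# A peak stays a 2-tuple, so existing index-based code keeps working,
# but the times and energies also become accessible by name.
Peak = namedtuple('Peak', ('t', 'E'))

peak = Peak(t=np.array([1.0, 2.0]), E=np.array([10.0, 20.0]))
assert peak[0] is peak.t and peak[1] is peak.E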

jjgomezcadenas commented on July 29, 2024

jacg commented on July 29, 2024

Was it really necessary that @Jacq and @neuslopez would write yet another implementation?

  • Given that the original was write-only: yes.

  • Given that the brokenness of this function was getting in the way of everything we wanted to do, and writing a new implementation was easier than understanding and fixing the original: yes.

  • Given that it was the quickest way to understand what cdf_to_dict was supposed to do: yes.

  • Given that it demonstrates in 4 clear lines what 49 convoluted lines were failing to do correctly: yes.

  • Given that it increased the bus number of the feature considerably: yes.

being called names, eg art guy

Please don't confuse statements about specific things you say with calling you names. (We've been here before, haven't we?) It is patently clear that you are not an ARTist, but the argument you used on that occasion was very ARTistic. I make no apology for pointing this out, and I would be grateful if you would return the favour, should the situation ever be reversed.

did not reinvent the wheel but turned out to be two orders of magnitude slower

While it was two orders of magnitude slower, it did have three rather important merits over the original.

  • it was correct.

  • it was obviously correct ... not to mention tested.

  • it explained what the faulty one was attempting to do.

The process demonstrated a clear performance problem with Pandas dataframes of which I, for one, was not previously aware. Oh, and it took less time to implement than these meta-discussions.

Then, a new re-reimp resurrected the wheel in cython code.

Yes, the ideas in the high-level imp were expressed at a lower level, and passed all tests at first go, beating all other existing implementations for speed.

can we solve this? By all means get rid of my faulty, clumsy and artistic imp.

That goes without saying: it contains bugs which neither of its successors do.

can we agree on some scheme to share the work that does not prompt us to three reimps of the same function in the future?

I submit to you that if you had been pera-programming with us when we came up with xxx, you would have been an enthusiastic supporter of it. It took little time, and it was a very edifying experience.

I believe that we can do better than spinning around the reimp of a single function!

This meta-discussion is taking more effort than the re-imp. The delays caused by the bugs in the original took up far more resources than the re-imp did. The re-imp increased the bus number of the function in question by two.

If your point is that we should talk less and do more, I agree. If your point is that we should not write code when trying to understand problematic code, then I respectfully and vehemently disagree.

Sorry about top posting, writing from iPhone on the train. Sent from my iPhone.

No worries, GitHub hid the noise.

jacg commented on July 29, 2024

i like pure python hiding cython imp basically for the same reasons offered by @gonzaponte

Stripped to the bare bones, we are talking about these two alternatives:

  • Option 1

    Inside python_implementation.py you write

     import cython_implementation
    
     def feature():
         return cython_implementation.feature()
    
  • Option 2

    Inside python_implementation.py you write

      from cython_implementation import feature
    

Agreed?

jjgomezcadenas commented on July 29, 2024

jacg commented on July 29, 2024

Good. Now, to me, ceteris paribus, one of those options is clearly better than the other.

Agreed?

jacg commented on July 29, 2024

Using branch issue149, see below the new timings using the new xxx (same Kr file):

pf.s12df_to_s12l(S1df,10000): 10 loops, best of 3: 58.2 ms per loop
cpm.xxx (S1df,10000): 10 loops, best of 3: 41.3 ms per loop

@neuslopez Could you please give us some idea of how long it takes read_pmaps to get these data from disk into memory?

jjgomezcadenas commented on July 29, 2024

jjgomezcadenas commented on July 29, 2024

jacg commented on July 29, 2024

Given that someone else (e.g. Gonzalo) was already looking at the problem and he was not around. Debatable, I think.

The bus number of this thing is now 4. Without xxx it would have been 2.

Post xxx I am confident that my opinion on the PR is worth hearing. Before xxx there is no way I could have formed a sensible opinion (in any amount of time that I would have been prepared to invest in forming it).

Oh, well, I will manage to recover from current low level of Karma, don’t worry.

Your low level of Karma is but an illusion. Your Karma remains high.

Incidentally the reason I wrote the write-only, faulty, nasty, ugly cython imp was because I had found long ago that Pandas DF were slow. Alas, I failed to make that part of the public lore. Now it is clear.

  • The personal journey of discovery was worth far more to me than lore.

  • Just because Pandas is slow, doesn't mean that the only alternative is writing write-only, buggy code!

My point is simple: There is life (I hope) beyond XXX. Please, PR XXX and friends and let’s be done with it.

That's what I was proposing before this meta-discussion got started. Repeating the task list I made at that point ...

Tasks remaining:

  • Decide whether we prefer cdf_to_dict + s12df_to_s12l or xxx.
  • If we choose the former, combine it into a single Cython function (i.e. option 2, above).
  • Clean up and expand the tests, including any that emerge from #146.

jjgomezcadenas commented on July 29, 2024

gonzaponte commented on July 29, 2024

If we go for xxx we skip that step. What is your opinion, @gonzaponte?

To me, cdf_to_dict is easier to understand, but I have devoted more time to this function than to xxx. Brais says he likes xxx more and since he is not biased, I would stick to the latter. Let's move on!

gonzaponte commented on July 29, 2024

I've just realized that the xxx function assumes the ordering of the input dataframe. It is not a problem now, but it must be taken into account and should be noted in the function's documentation. This probably means that the PMapWriter function/class/whatever should have a test that ensures the output file is written in the correct order.
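A minimal sketch of the kind of ordering test meant here (the fixture providing the written-and-read-back dataframe is hypothetical):

import numpy as np

def test_pmaps_are_written_in_order(written_pmaps_df):
    # written_pmaps_df: hypothetical fixture with the dataframe read back
    # from a file produced by the pmaps writer.
    events = written_pmaps_df.event.values
    peaks  = written_pmaps_df.peak .values
    # Events must appear in non-decreasing order ...
    assert np.all(np.diff(events) >= 0)
    # ... and within each event so must the peak numbers.
    for evt in np.unique(events):
        assert np.all(np.diff(peaks[events == evt]) >= 0)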

jacg commented on July 29, 2024

OK, I'll clean up the branch and submit before I retire for the night. Failing that, first thing tomorrow morning.

The only thing I don’t like about it is that I wasn’t around eating some peras and increasing my Karma for free. I suggest that we use xxx and my previous version in our docs to show the difference between read/write and write-only code.

I think that last sentence is seriously raising your karma.

Jacq commented on July 29, 2024

Good to know.... I have no idea what this project is about, and suddenly there are 15 notifications from this discussion in my inbox :D
I think that at some point @jacg mutated into @Jacq and the party started.
Cheers

jacg commented on July 29, 2024

I've just realized that the xxx function assumes the ordering of the input data frame.

Well, yes, that did seem to be a fundamental feature of the data I had to go on. But I suppose you are right: it is an assumption which could be violated by the pmaps writer.

While cdf_to_dict is rather more robust in this respect than xxx, it will fail silently if an event limit is requested and some portion of an event with an id number greater than the limit appears before all lower-numbered events have been completed.
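To make that failure mode concrete, here is a tiny hypothetical example of the kind of out-of-order input that would bite any implementation which simply stops at the first row whose event id reaches the limit:

import pandas as pd

# A row of event 3 appears before the last row of event 1.
df = pd.DataFrame(dict(event=[1, 3, 1],
                       peak=[0, 0, 0],
                       time=[10.0, 20.0, 11.0],
                       ene=[1.0, 2.0, 1.5]))

# With max_events=2, breaking out of the loop at the first row with
# event >= 2 silently drops the trailing row that still belongs to event 1.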

FWIW, here is my take on cdf_to_dict, imaginatively called yyy. It passes all the same tests and seems to be a little faster than s12df_to_s12l but slower than xxx. (Remember: we are well within a factor of two between slowest and fastest!)

import numpy as np   # assumed module-level import in the .pyx (yyy uses np.iinfo and np.array)

cpdef yyy(df, max_events=None):

    cdef dict all_events = {}
    cdef dict current_event
    cdef tuple current_peak

    cdef int   [:] event = df.event.values
    cdef char  [:] peak  = df.peak .values
    cdef float [:] time  = df.time .values
    cdef float [:] ene   = df.ene  .values

    cdef int df_size = len(df.index)
    cdef int i
    cdef long limit = np.iinfo(int).max if max_events is None or max_events < 0 else max_events
    cdef int peak_number
    cdef list t, E

    for i in range(df_size):

        if event[i] >= limit: break

        current_event = all_events   .setdefault(event[i], {}      )
        current_peak  = current_event.setdefault( peak[i], ([], []))
        current_peak[0].append(time[i])
        current_peak[1].append( ene[i])

    # Postprocess: Turn lists into numpy arrays before returning
    for current_event in all_events.values():
        for peak_number, (t, E) in current_event.items():
            current_event[peak_number] = (np.array(t), np.array(E))

    return all_events
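As a usage note (with yyy as defined above and S1df being the S1 dataframe from the Kr test file; the particular event and peak numbers are assumptions), the returned structure would be consumed like this:

s12 = yyy(S1df, max_events=2)   # {event: {peak: (times, energies)}}
times, energies = s12[0][0]     # numpy arrays for peak 0 of event 0, assuming they exist
total_energy    = energies.sum()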

Perhaps we prefer this one to xxx?

I will prepare the PR with xxx in place, but it will be trivial to change.

One more thing: Are we attached to the name s12df_to_s12l? I have grown to hate it. The l at the end is especially irksome.

And another thing: do I have your permission to replace the tuple with a namedtuple? (I have the implementation ready and tested. It slightly degrades the performance, but I'm confident that this is completely irrelevant.)

jjgomezcadenas commented on July 29, 2024

My five cents:

  1. I prefer yyy to xxx. It is very concise and elegant, and the time difference is not all that relevant. Since you haven't submitted the PR yet, why don't you go ahead and use yyy instead of xxx?
  2. I suggest df_to_pmaps_dict as the new name.
  3. I am all in favour of a named tuple.

gonzaponte commented on July 29, 2024

What a wonderful method setdefault is. I have been doing stupid things so far because I didn't know about that method (and I thought I knew them all)!

Perhaps we prefer this one to xxx?

Fine with me.

  2. I suggest df_to_pmaps_dict as the new name

Since it is already in pmaps_functions.py, I would avoid naming it pmaps (simply df_to_dict?).

  3. I am all in favour of a named tuple.

Me too.

jacg commented on July 29, 2024

What a wonderful method setdefault is.

Even better, for this purpose, is collections.defaultdict, but Cython doesn't know how to type that one, so I stuck with the basic dict.
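For comparison, here is a pure-Python sketch of the same grouping built on collections.defaultdict (hypothetical code, not on the branch, and left untyped precisely because Cython cannot type defaultdict):

from collections import defaultdict

import numpy as np

def df_to_dict_defaultdict(df, max_events=None):
    """Pure-Python version of the grouping done by yyy, using defaultdict."""
    limit = float('inf') if max_events is None or max_events < 0 else max_events
    acc = defaultdict(lambda: defaultdict(lambda: ([], [])))
    for evt, pk, t, e in zip(df.event.values, df.peak.values,
                             df.time .values, df.ene .values):
        if evt >= limit:
            break
        times, enes = acc[evt][pk]
        times.append(t)
        enes .append(e)
    # Turn the accumulated lists into numpy arrays, as yyy does.
    return {evt: {pk: (np.array(ts), np.array(Es))
                  for pk, (ts, Es) in peaks.items()}
            for evt, peaks in acc.items()}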
