Comments (30)
I have in my fork (gonzaponte/issue149) a demonstration of the failure. I had to modify the test file so that it does not include event #0. Although I have written a new test to demonstrate the bug, all tests fail because of the bug. However, since the new test is more exhaustive, I will rewrite the previous ones along the lines of the new one.
from ic.
I had to modify the test file so it does not include event #0.
Would it be possible/better to do this by adding another test file? How big a new data file would you need to be able to write this test?
Please do a whitespace cleanup. pmaps_functions_c.pyx, for example, contains a number of spurious whitespace additions.
You have removed the initializations for evt and pk in cdf_to_dict. Is there any good reason for this? A quick search into uninitialized variables in Cython does not inspire confidence. Please initialize these variables, unless there's some good reason not to (which I find hard to imagine).
Please squash into two commits: one adding the tests in their final form, leaving the implementation alone; one implementing the solution, leaving the tests alone.
@jjgomezcadenas Please have a quick look at this, and comment on how this fits in the grand scheme of things. Also please comment on the desirability of adding a new test data file vs. modifying the existing one.
Would it be possible/better to do this by adding another test file? How big a new data file would you need to be able to write this test?
It is possible, but maybe redundant. It would not be very large (a few tens or hundreds of kB).
Please do a whitespace cleanup. pmaps_functions_c.pyx, for example, contains a number of spurious whitespace additions.
Ok.
You have removed the initializations for evt and pk in cdf_to_dict. Is there any good reason for this? A quick search into uninitialized variables in Cython does not inspire confidence. Please initialize these variables, unless there's some good reason not to (which I find hard to imagine).
No good reason; it is just not necessary, as they are always assigned when needed. Maybe initializing them to 0 can be misleading, but it is not a big deal. I will set them to 0.
Please squash into two commits: one adding the tests in their final form, leaving the implementation alone; one implementing the solution, leaving the tests alone.
Ok.
Maybe initializing them to 0 can be misleading, but it is not a big deal. I will set them to 0.
No. My bad. I looked at the code too hastily. You are initializing them before first use, so it's fine as it is. As you say, it might even be better this way.
This was bugging me, and it became clear that I wasn't going to get to sleep until I had fixed it ...
It turns out that, beautiful though it was, the xxx implementation (discussed in #158) based on Pandas groupby was two orders of magnitude slower than s12df_to_s12l.

It looks like Pandas groupby is fundamentally slow. There may be a way of achieving the desired effect with numpy.unique or something similar, but, for now, xxx has been reimplemented in Cython, forgetting any attempt to make the result contain small sections of the original dataframe rather than a pair of numpy arrays stuffed into a tuple.
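For the record, the numpy.unique route mentioned above might look something like this sketch (toy data, not the actual IC code; it assumes the input columns are sorted by event, as the pmaps dataframe is):

```python
import numpy as np

# Toy columns mimicking the pmaps dataframe layout: one row per sample,
# tagged with its event number (assumed sorted, as in the real data).
event = np.array([1, 1, 1, 2, 2, 3])
time  = np.array([0., 1., 2., 0., 1., 0.])
ene   = np.array([5., 6., 7., 8., 9., 4.])

# return_index gives the start of each contiguous block of equal ids,
# so splitting at those boundaries groups the samples by event.
ids, starts = np.unique(event, return_index=True)
times       = np.split(time, starts[1:])
energies    = np.split(ene,  starts[1:])

by_event = {int(i): (t, E) for i, t, E in zip(ids, times, energies)}
# by_event[2] → (array([0., 1.]), array([8., 9.]))
```

Whether this beats the Cython loop would have to be measured, but it avoids groupby entirely.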
The new xxx passes exactly the same tests as s12df_to_s12l, and the tests used to drive the development of xxx are now also being applied to s12df_to_s12l: all these tests are parametrized with both implementations.

As far as I can see so far, xxx has two advantages over s12df_to_s12l (and its Cython core cdf_to_dict):

- it is legible
- it is 15% faster

The next step is to see how xxx stacks up against #149, and how it compares to the proposed solution.

In the long term we want to look at whether we want the dictionaries which are produced to contain namedtuples or dataframes, rather than the current tuple of numpy arrays representing the times and energies in each peak.
I have merged (yes! merged, don't worry, it's only temporary: it serves the purpose of keeping track of the alternatives until we decide which one we want) the implementation of xxx and the tests (which now run both on xxx and s12df_to_s12l) into the issue149 branch, and pushed it to my fork.

It seems that xxx passes the issue-149 tests out of the box. @gonzaponte please confirm that I haven't missed something, and that I am actually running the test that points out the bug, and not doing anything else stupid.
- The proposed bugfix seems to slow down s12df_to_s12l a bit, so the performance advantage of xxx increases from 15% to about 20%.
- The proposed bugfix shortens and clears up s12df_to_s12l quite a bit, but it still seems rather more cryptic than xxx; though as the author of the latter I would say that, wouldn't I!
Are there any other problems or bugs pending with s12df_to_s12l? As this is a function which is pivotal and has been problematic, we should make an effort to test it more thoroughly.

@neuslopez Could you please confirm that my timings make sense: I merely ran timeit on the s1 dataframe that comes out of database/test_data/KrMC_pmaps.h5.
Good night.
As far as I can see so far, xxx has two advantages over s12df_to_s12l (and its Cython core cdf_to_dict):

- it is legible
- it is 15% faster
Well, I don't find it much more readable. In s12df_to_s12l the workflow is clearer, imho. But of course, this is a matter of interpretation. Concerning performance, s12df_to_s12l carries a function-call overhead because it is a Python function calling the Cython one. It can be written in the same fashion as yours (a pure Cython function). I guess this structure comes about because we want a pure Python interface, @jjgomezcadenas?
In any case, I don't think performance is critical for this function, which will only be executed once per file.
It seems that xxx passes the issue-149 tests out of the box. @gonzaponte please confirm that I haven't missed something, and that I am actually running the test that points out the bug, and not doing anything else stupid.
I think it is ok!
Are there any other problems or bugs pending with s12df_to_s12l?
I don't think there are more bugs, but we can extend testing to include those that came up in issue #146.
Using branch issue149, see below the new timings using the new xxx (same Kr file):

- pf.s12df_to_s12l(S1df, 10000): 10 loops, best of 3: 58.2 ms per loop
- cpm.xxx(S1df, 10000): 10 loops, best of 3: 41.3 ms per loop
Concerning performance, s12df_to_s12l carries a function-call overhead because it is a Python function calling the Cython one.
It does, but this experiment

```
python -m timeit -s 'f = lambda:1' 'f()'
10000000 loops, best of 3: 0.0858 usec per loop
python -m timeit -s 'f = lambda:1' 'f'
100000000 loops, best of 3: 0.0142 usec per loop
```

shows that a pure-Python function call can be performed in 86 nanoseconds, while the lookup of the function's name takes 14, leaving no more than about 70 nanoseconds for the overhead of calling a pure-Python function. This tiny O(1) overhead cannot explain the (O(n)-ish) differences of 30 microseconds that I have been observing for small files, and the 17 milliseconds that @neuslopez reports above for larger files.
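To make the argument concrete, here is a sketch with made-up stand-in functions (not the real IC code) showing that a thin pure-Python wrapper adds only a constant per-call cost, whatever the size of the input:

```python
import timeit

def core(values):
    # Stand-in for the Cython core: does the O(n) work.
    return sum(values)

def wrapper(values):
    # Thin pure-Python wrapper, analogous to s12df_to_s12l delegating
    # to its Cython core: one extra function call, nothing else.
    return core(values)

for n in (100, 100_000):
    data = list(range(n))
    t_core    = min(timeit.repeat(lambda: core(data),    number=100, repeat=3))
    t_wrapper = min(timeit.repeat(lambda: wrapper(data), number=100, repeat=3))
    # The absolute gap stays at a fixed fraction of a microsecond per
    # call as n grows: an O(1) cost, not an O(n) one.
    print(n, t_wrapper - t_core)
```

So the wrapper can only account for a fixed sliver of time per call, not differences that scale with file size.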
It can be written in the same fashion as yours (a pure Cython function). I guess this structure comes about because we want a pure Python interface
In this case I cannot see any advantages offered by the pure python interface wrapper. It just needlessly complicates and obfuscates the issue. I'm always happy to be educated, though.
In any case, the current version of cdf_to_dict is much better (even if it is a little slower) than the old one. The old version was written in write-only style, and I admire your courage and determination in having fixed the bug in it. I was not up to the task without rewriting it from scratch.
Tasks remaining:

- Decide whether we prefer cdf_to_dict + s12df_to_s12l, or xxx.
- If we choose the former, combine it into a single Cython function.
- Clean up and expand the tests, including any that emerge from #146.
Once that is done we could merge. Or we could discuss whether we really want dictionaries containing dictionaries containing tuples containing arrays; or whether we would like to replace the arrays-in-tuples with dataframes. But I think that the latter would be better discussed in a separate issue/PR. Some observations, though:
- After my experience with writing xxx, I am less keen on the dataframes: indexing them is very slow! But perhaps this doesn't matter in our usage?
- This issue probably needs to be discussed with reference to #157.
Oh, and for cross-referencing purposes, this issue is addressed in PR #162.
(GitHub used to have the capability to turn issues into PRs. They got rid of it. Shame, as it seems very useful.)
In this case I cannot see any advantages offered by the pure python interface wrapper. It just needlessly complicates and obfuscates the issue.
The only "advantage" I see is that gathering all the functions in a single file (pmaps_functions.py) allows you to forget about the different implementations (either Python or Cython). For example, in your fork you have:

```python
import invisible_cities.reco.pmaps_functions   as pmapf
import invisible_cities.reco.pmaps_functions_c as pmapf_c

pmaps_read_parm = ('df_to_dict',
                   (pmapf.s12df_to_s12l,
                    pmapf_c.xxx))
```
The following may be nicer to look at:

```python
import invisible_cities.reco.pmaps_functions as pmapf

pmaps_read_parm = ('df_to_dict',
                   (pmapf.s12df_to_s12l,
                    pmapf.xxx))
```

This can still be addressed by creating aliases or using a from statement in the python module.
Once that is done we could merge. Or we could discuss whether we really want dictionaries containing dictionaries containing tuples containing arrays; or whether we would like to replace the arrays-in-tuples with dataframes. But I think that the latter would be better discussed in a separate issue/PR. Some observations, though:

- After my experience with writing xxx, I am less keen on the dataframes: indexing them is very slow! But perhaps this doesn't matter in our usage?
- This issue probably needs to be discussed with reference to #157.
Agreed. JJ wants to use DFs because of the statistical and correlation properties they provide. After using the current tuple-of-arrays structure in a small analysis I don't see any advantage in using the DF functionality, but my imagination is limited and future analyses may need it.
The only "advantage" I see is that gathering all the functions in a single file (pmaps_functions.py) allows you to forget about the different implementations (either Python or Cython).
Totally agreed. As far as I am concerned, all the Cython modules are low-level implementations of the Python modules, and I would like not to see them being imported directly in client code at all. (IIRC, @jjgomezcadenas disagrees.) But the solution to this is not to implement pure-Python wrappers (containing arbitrary subsets of the functionality, as in the case of s12df_to_s12l (I have grown to hate that name, BTW)). No, one should simply import the Cython implementation into the pure Python module. In this case, inside pmaps_functions.py we put

```python
from invisible_cities.reco.pmaps_functions_c import xxx as s12df_to_s12l
```

and we're done. No need to write a wrapper around it. And indeed, you seem to acknowledge this in your comment.
JJ wants to use DFs because of the statistical and correlation properties they provide.
Seems jolly sensible to me. As long as the huge indexing overhead of dataframes doesn't prevent us from doing other things in a reasonable amount of time.
the current tuple-of-arrays
I really would like those to be named tuples, though. I'll implement that tomorrow morning, unless someone beats me to it. It should take all of 30 seconds. Hmm, I might just do it now ... OK, I've pushed it to my fork on the issue149 branch. (No test for accessing time and energy by name yet, though!)
Was it really necessary that @Jacq and @neuslopez would write yet another implementation?

- Given that the original was write-only: yes.
- Given that the brokenness of this function was getting in our way in everything we wanted to do, and re-implementing a new one was easier than understanding and fixing the original: yes.
- Given that it was the quickest way to understand what cdf_to_dict was supposed to do: yes.
- Given that it demonstrates in 4 clear lines what 49 convoluted lines were failing to do correctly: yes.
- Given that it increased the bus number of the feature considerably: yes.
being called names, e.g. art guy
Please don't confuse statements about specific things you say, with calling you names. (We've been here before, haven't we?). It is patently clear that you are not an ARTist, but the argument you used on that occasion was very ARTistic. I make no apology for pointing this out, and I would be grateful if you return the favour, if the situation is ever reversed.
did not reinvent the wheel but turned out to be two orders of magnitude slower
While it was two orders of magnitude slower, it did have three rather important merits over the original:

- it was correct.
- it was obviously correct ... not to mention tested.
- it explained what the faulty one was attempting to do.
The process demonstrated a clear performance problem with Pandas dataframes of which I, for one, was not previously aware. Oh, and it took less time to implement than these meta-discussions.
Then, a new re-reimp resurrected the wheel in cython code.
Yes, the ideas in the high-level imp were expressed at a lower level, and the result passed all tests at the first go, beating all other existing implementations for speed.
can we solve this? By all means get rid of my faulty, clumsy and artistic imp.
That goes without saying: it contains bugs which neither of its successors do.
can we agree on some scheme to share the work that does not prompt us to three reimps of the same function in the future?
I submit to you that if you were pera-programming with us when we came up with xxx, you would have been an enthusiastic supporter of it. It took little time, and it was a very edifying experience.
I believe that we can do better than spinning around the reimp of a single function!
This meta-discussion is taking more effort than the re-imp. The delays caused by the bugs in the original took up many more resources than the re-imp. The re-imp increased the bus number of the function in question by two.
If your point is that we should talk less and do more, I agree. If your point is that we should not write code when trying to understand problematic code, then I respectfully and vehemently disagree.
Sorry about top-posting; writing from my iPhone on the train. (Sent from my iPhone)
No worries, GitHub hid the noise.
I like pure Python hiding the Cython imp, basically for the same reasons offered by @gonzaponte
Stripped to the bare bones, we are talking about these two alternatives:

Option 1: inside python_implementation.py you write

```python
import cython_implementation

def feature():
    return cython_implementation.feature()
```

Option 2: inside python_implementation.py you write

```python
from cython_implementation import feature
```

Agreed?
Good. Now, to me, ceteris paribus, one of those options is clearly better than the other.
Agreed?
Using branch issue149, see below the new timings using the new xxx (same Kr file):

- pf.s12df_to_s12l(S1df, 10000): 10 loops, best of 3: 58.2 ms per loop
- cpm.xxx(S1df, 10000): 10 loops, best of 3: 41.3 ms per loop

@neuslopez Could you please give us some idea of how long it takes read_pmaps to get these data from disk into memory?
Given that someone else (e.g., Gonzalo) was already looking at the problem and he was not around: debatable, I think.

The bus number of this thing is now 4. Without xxx it would have been 2.

Post xxx, I am confident that my opinion on the PR is worth hearing. Before xxx there is no way I could have formed a sensible opinion (in any amount of time that I would have been prepared to invest in forming it).
Oh, well, I will manage to recover from my current low level of Karma, don't worry.
Your low level of Karma is but an illusion. Your Karma remains high.
Incidentally, the reason I wrote the write-only, faulty, nasty, ugly Cython imp was that I had found long ago that Pandas DFs were slow. Alas, I failed to make that part of the public lore. Now it is clear.
- The personal journey of discovery was worth far more to me than lore.
- Just because Pandas is slow doesn't mean that the only alternative is writing write-only, buggy code!
My point is simple: There is life (I hope) beyond XXX. Please, PR XXX and friends and let’s be done with it.
That's what I was proposing before this meta-discussion got started. Repeating the task list I made at that point ...
Tasks remaining:

- Decide whether we prefer cdf_to_dict + s12df_to_s12l, or xxx.
- If we choose the former, combine it into a single Cython function (i.e. option 2, above).
- Clean up and expand the tests, including any that emerge from #146.
If we go for xxx we skip that step. What is your opinion, @gonzaponte?
To me, cdf_to_dict is easier to understand, but I have devoted more time to this function than to xxx. Brais says he likes xxx more, and since he is not biased, I would stick to the latter. Let's move on!
I've just realized that the xxx function assumes the ordering of the input data frame. It is not a problem now, but it must be taken into account and should be noted in the function's documentation. This probably means that the PMapWriter function/class/whatever should have a test that ensures the output file is written in the correct order.
OK, I'll clean up the branch and submit before I retire for the night. Failing that, first thing tomorrow morning.
The only thing I don’t like about it is that I wasn’t around eating some peras and increasing my Karma for free. I suggest that we use xxx and my previous version in our docs to show the difference between read/write and write-only code.
I think that last sentence is seriously raising your karma.
Good to know.... I have no idea what this project is about, and suddenly there are 15 notifications of this discussion in my inbox :D

I think that at some point @jacg mutated to @Jacq and the party started.

Cheers
I've just realized that the xxx function assumes the ordering of the input data frame.
Well, yes, that did seem to be a fundamental feature of the data I had to go on. But I suppose you are right: it is an assumption which could be violated by the pmaps writer.
While cdf_to_dict is rather more robust in this respect than xxx, it will fail silently if an event limit is required and some portion of an event with an id number greater than the limit appears before all lower-numbered events have been completed.
FWIW, here is my take on cdf_to_dict, imaginatively called yyy. It passes all the same tests and seems to be a little faster than s12df_to_s12l but slower than xxx. (Remember: we are well within a factor of two between slowest and fastest!)
```cython
cpdef yyy(df, max_events=None):
    cdef dict  all_events = {}
    cdef dict  current_event
    cdef tuple current_peak

    cdef int   [:] event = df.event.values
    cdef char  [:] peak  = df.peak .values
    cdef float [:] time  = df.time .values
    cdef float [:] ene   = df.ene  .values

    cdef int  df_size = len(df.index)
    cdef int  i
    cdef long limit = np.iinfo(int).max if max_events is None or max_events < 0 else max_events
    cdef int  peak_number
    cdef list t, E

    for i in range(df_size):
        if event[i] >= limit: break
        current_event = all_events   .setdefault(event[i], {}      )
        current_peak  = current_event.setdefault( peak[i], ([], []))
        current_peak[0].append(time[i])
        current_peak[1].append( ene[i])

    # Postprocess: turn lists into numpy arrays before returning
    for current_event in all_events.values():
        for peak_number, (t, E) in current_event.items():
            current_event[peak_number] = (np.array(t), np.array(E))

    return all_events
```
Perhaps we prefer this one to xxx?

I will prepare the PR with xxx in place, but it will be trivial to change.

One more thing: are we attached to the name s12df_to_s12l? I have grown to hate it. The l at the end is especially irksome.

And another thing: do I have your permission to replace the tuple with a namedtuple? (I have the implementation ready and tested. It slightly degrades the performance, but I'm confident that this is completely irrelevant.)
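For concreteness, a namedtuple-based peak might look like the following sketch (the name Peak and its field names are my invention for illustration, not necessarily what was pushed to the branch):

```python
from collections import namedtuple
import numpy as np

# Hypothetical container: behaves exactly like the current (t, E) tuple,
# but additionally allows access by name.
Peak = namedtuple('Peak', 't E')

peak = Peak(t=np.array([100., 125., 150.]),
            E=np.array([ 10.,  20.,   5.]))

peak.t       # access by name ...
peak[0]      # ... and by position, as before
t, E = peak  # unpacking still works too
```

Since namedtuple is a drop-in replacement for a plain tuple, existing code that indexes or unpacks the pair keeps working unchanged.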
My five cents:
- I prefer yyy to xxx. It is very concise and elegant, and the time difference is not all that relevant. Since you haven't submitted the PR yet, why don't you go ahead and use yyy instead of xxx?
- I suggest df_to_pmaps_dict as the new name.
- I am all in favour of a named tuple.
What a wonderful method setdefault is! I have been doing stupid things so far because I didn't know that method (and I thought I knew them all)!
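For anyone else meeting it for the first time: setdefault looks a key up and, if it is missing, inserts and returns a default — which is what lets yyy build the nested events structure in a single pass. A tiny illustration (toy numbers, not IC data):

```python
events = {}

# First sample of peak 0 in event 7: both nesting levels are created
# on demand, and the fresh ([], []) pair is returned.
peak = events.setdefault(7, {}).setdefault(0, ([], []))
peak[0].append(100.0)   # time
peak[1].append( 12.5)   # energy

# Second sample: the existing dict and lists are returned unchanged,
# so we keep appending to the same pair.
peak = events.setdefault(7, {}).setdefault(0, ([], []))
peak[0].append(125.0)
peak[1].append( 20.0)

# events == {7: {0: ([100.0, 125.0], [12.5, 20.0])}}
```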
Perhaps we prefer this one to xxx?
Fine with me.
- I suggest df_to_pmaps_dict as the new name
Since it is already in pmaps_functions.py, I would avoid putting pmaps in the name (simply df_to_dict?).
- I am all in favour of a named tuple.
Me too.
What a wonderful method setdefault is.

Even better, for this purpose, is collections.defaultdict, but Cython doesn't know how to type that one, so I stuck with the basic dict.
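In pure Python the same shape falls out even more directly with collections.defaultdict (a sketch with toy data; as noted above, this is not usable from typed Cython code):

```python
from collections import defaultdict

# Two levels of autovivifying dicts; the innermost default is a fresh
# (times, energies) pair of lists, so no setdefault calls are needed.
events = defaultdict(lambda: defaultdict(lambda: ([], [])))

events[7][0][0].append(100.0)   # time
events[7][0][1].append( 12.5)   # energy
events[7][0][0].append(125.0)
events[7][0][1].append( 20.0)

# events[7][0] == ([100.0, 125.0], [12.5, 20.0])
```

One caveat of this convenience: merely reading a missing key creates it, which is sometimes unwanted.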