kratsg / Optimization
Code for optimizing simple n-tuples
Home Page: http://giordonstark.com/
License: MIT License
Since we take in a series of files and process each DID separately, we can make this faster by parallelizing across DIDs.
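A minimal sketch of what that could look like using multiprocessing; process_did and the file grouping below are illustrative stand-ins, not the repo's current code:

from multiprocessing import Pool

def process_did(item):
    # stand-in for the existing per-DID processing
    did, filenames = item
    return (did, len(filenames))

# illustrative grouping of the input files by DID
files_by_did = {
    '110351': ['a.root', 'b.root'],
    '110352': ['c.root'],
}

if __name__ == '__main__':
    pool = Pool(processes=2)
    for did, n in pool.map(process_did, sorted(files_by_did.items())):
        print('processed DID %s (%d files)' % (did, n))
    pool.close()
    pool.join()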
NotImplementedError: couldn't find matching opcode for 'mul_bbb'
This occurs because of multiplication between two booleans: numexpr has no opcode for a boolean-times-boolean product. Write the cut as 1*({0} < m_effective)*(m_effective < {0}+200) instead, so the comparisons are promoted to integers first.
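For reference, a small reproduction of the failure and the workaround, assuming the selections are evaluated with numexpr (as the opcode name suggests); the values here are illustrative:

import numpy as np
import numexpr as ne

m_effective = np.array([150., 250., 450.])

# this raises NotImplementedError: couldn't find matching opcode for 'mul_bbb',
# because numexpr cannot multiply two boolean results directly:
#   ne.evaluate('(100 < m_effective) * (m_effective < 300)')

# multiplying by 1 first promotes the comparisons to integers, so this works:
mask = ne.evaluate('1 * (100 < m_effective) * (m_effective < 300)')
print(mask)  # [1 1 0]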
Currently programmed to cut on weighted events.
@swiatlow: let's talk about signal regions now?
Max Swiatlowski [11:22 AM]
Cool
Max Swiatlowski [11:22 AM]
Did you see my slides?
Giordon Stark [11:23 AM]
the new ones?
Giordon Stark [11:23 AM]
with the 4 signal regions?
Max Swiatlowski [11:23 AM]
Yeah
Giordon Stark [11:23 AM]
I don't understand the 4 particularly.
Max Swiatlowski [11:23 AM]
4?
Giordon Stark [11:23 AM]
do you mean 2 signal regions for low lumi, and 2 for high lumi?
Max Swiatlowski [11:24 AM]
Yup
Max Swiatlowski [11:24 AM]
Low boost/ high boost
Max Swiatlowski [11:24 AM]
It's very approximate
Max Swiatlowski [11:24 AM]
The first thing to do is just implement these 4 SRs
Max Swiatlowski [11:25 AM]
Then check, for each lumi, which SR is best for each point
Max Swiatlowski [11:25 AM]
And what the significance is
Giordon Stark [11:25 AM]
so a SR == a cut?
Max Swiatlowski [11:25 AM]
If the sig is very different from the OPTIMAL you found... We might need to adjust, or add a new SR, etc.
Max Swiatlowski [11:25 AM]
SR is a set of cuts
Max Swiatlowski [11:26 AM]
It's the table on slide 15 or whatever
Giordon Stark [11:26 AM]
err, a SR = a supercut?
Max Swiatlowski [11:26 AM]
Yeah, with all pivots
Max Swiatlowski [11:26 AM]
Just fixed cuts
Giordon Stark [11:26 AM]
gotcha. No problem.
Max Swiatlowski [11:26 AM]
Then, you check which SR is best at each point, etc. etc.
Giordon Stark [11:26 AM]
so I run all 4 supercuts I have
Max Swiatlowski [11:27 AM]
Yeah
Giordon Stark [11:27 AM]
and just look at all the significances reported for a given signal region
Giordon Stark [11:27 AM]
find which of the 4 maximizes the significance
Giordon Stark [11:27 AM]
and just make a grid showing 1,2,3,4
Giordon Stark [11:27 AM]
basically saying which one maximized that box?
Max Swiatlowski [11:27 AM]
Yup!
import re
import logging
logger = logging.getLogger(__name__)

#@echo(write=logger.debug)
# matches the 6-8 digit dataset ID (DID), optionally zero-padded, in a path
did_regex = re.compile(r'\.?(?:00)?(\d{6,8})\.?')

def get_did(filename):
    global did_regex
    # format I: the DID is in the parent directory's name
    m = did_regex.search(filename.split("/")[-2])
    if m is None:
        logger.warning("Can't figure out the DID! Trying format II ...")
        # format II: the DID is in the filename itself
        m = did_regex.search(filename.split("/")[-1])
    if m is None:
        logger.warning("Can't figure out the DID! Using input filename ...")
        return filename.split("/")[-1]
    return m.group(1)
This repo is missing a license. Without a license, all code is copyrighted to the author and may not be used by anyone else.
Please use something like http://choosealicense.com/ to decide which license to use. I recommend MIT or GPL.
(A PSA: Add a License, Please.)
We should make sure that it's made clear to the user (with a flag, or just an abort, or something) when B (or S, though B is more likely) is statistically insignificant.
Max Swiatlowski [4:42 PM]
People often do this by requiring B(unweighted) > 10 or something, for example
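A sketch of what such a guard could look like; the function name and default threshold are illustrative, with the B(unweighted) > 10 convention taken from the comment above:

# hedged sketch: flag cuts whose raw (unweighted) yields are too small to trust
MIN_UNWEIGHTED = 10

def is_insignificant(n_bkgd_unweighted, n_sig_unweighted, min_events=MIN_UNWEIGHTED):
    return n_bkgd_unweighted < min_events or n_sig_unweighted < min_events

if is_insignificant(n_bkgd_unweighted=4, n_sig_unweighted=120):
    print('warning: statistically insignificant B or S for this cut')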
Given a supercuts file, iterate through all N-1 combinations and produce plots based on the subset of the supercuts.
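A quick sketch of generating those N-1 subsets with itertools; the cut names are placeholders:

from itertools import combinations

# each N-1 subset drops exactly one of the supercuts
supercuts = ['multiplicity_jet', 'pt_jet_rc8_1', 'm_jet_largeR_1']

for subset in combinations(supercuts, len(supercuts) - 1):
    dropped = (set(supercuts) - set(subset)).pop()
    print('plot using %s (N-1 on %s)' % (list(subset), dropped))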
When applying cuts, it would be nice to have the number of events passing each one.
Just to write this down: we decided in the meeting to optimize towards 0.5 for insignificance.
Aviv Cukierman [12:55 PM]
In optimize
the insignificance for signal and background is the same number
Aviv Cukierman [12:55 PM]
Just --insignificance
Currently, if you have a cuts/ directory, it yells at you.
It should only yell at you if the specific file you're making already exists.
Better, add an option that allows you to overwrite (default false, obviously).
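A sketch of that check with a hypothetical --overwrite flag; the argparse option names are illustrative:

import argparse, os, sys

parser = argparse.ArgumentParser()
parser.add_argument('--output', default='cuts/result.json')
parser.add_argument('--overwrite', action='store_true', default=False)
args = parser.parse_args()

# only complain when the specific output file already exists
if os.path.exists(args.output) and not args.overwrite:
    sys.exit('%s already exists; pass --overwrite to replace it' % args.output)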
Let's say you're given a series of output hashes that contain cuts. You should be able to do something like
from optimize import *
cut = load_cut('/path/to/hash.json')
trees = get_ttrees('....')
signal = get_signal(...)
bkgd = get_bkgd(...)
apply_cut(...)
See weights.yml for an example file. It would be instructive to provide a way to incorporate the weights, similar to how kratsg/TakeOverTheWorld does it.
[
  {
    "branch": "multiplicity_jet",
    "min": 2,
    "max": "...",
    "stepSize": 1
  },
  {
    "branch": "pt_jet_rc8_1",
    "min": 50.0,
    "max": 250.0,
    "stepSize": 2.5
  }
]
Given a list of branches (e.g. m_jet_largeR_0, m_jet_largeR_1, etc.), and a cut (e.g. ">200", or a window cut if that gets supported), create a value that counts the number of those branches that passed that cut. Then, allow us to cut on that value.
Perhaps in supercuts allow a flag called "derived" where you specify:
- a list of branches
- an initial cut (start, step, stop) (e.g. [200;100;500]) [or the initial cut could be fixed]
- a final cut (start, step, stop) (e.g. [0;1;4])
Note that if TA only stores the top 4 large R jets, e.g., as it currently does, then the final cut could only go from 0 to 4. But I think that would be fine (e.g. top tagging is usually 0-1).
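A numpy sketch of the idea, counting how many of the listed branches pass the initial cut and then cutting on that count; the branch names and values are illustrative, not the repo's implementation:

import numpy as np

# stand-in arrays for the top-4 large-R jet masses, one entry per event
jets = {
    'm_jet_largeR_1': np.array([250., 80., 300.]),
    'm_jet_largeR_2': np.array([220., 60., 150.]),
    'm_jet_largeR_3': np.array([90., 40., 210.]),
    'm_jet_largeR_4': np.array([30., 20., 205.]),
}

initial_cut = 200.0  # each branch must exceed this
final_cut = 1        # require more than this many passing branches

n_passing = sum((vals > initial_cut).astype(int) for vals in jets.values())
event_mask = n_passing > final_cut
print(n_passing, event_mask)  # [2 0 3] [ True False  True]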
This needs to be resolved ASAP; it is currently not clear at all.
Need to do something like SetTextColor(kWhite)
If we see we're running into #3, we can try to change the order of the dictionaries to explore the issue.
Often, @AvivCukierman does something like [100, 201, 10],
and this causes the z-axis to look ugly. The best way to fix this is to incorporate logic that does modular arithmetic to fix it :)
Or use numpy arange and grab the last element, as it'll also handle that.
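For example, a minimal check of that approach (not the repo's code):

import numpy as np

# [start, stop, step] = [100, 201, 10], as in the example above
edges = np.arange(100, 201, 10)
print(edges[-1])  # 200: the last value actually on the step grid,
                  # which is what the z-axis range should use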
"selections": "((m_jet_largeR_1 > {0})+ (m_jet_largeR_2 > {0}) + (m_jet_largeR_3 > {0}) + (m_jet_largeR_4 > {0})) > {1}",
"st3": [
[100, 250, 50],
[0, 3, 1]
]
Shouldn't be too hard, and shouldn't consume too much memory.
This is going to be a rather trivial check. Ideally, we want to skip showing the signal points for which the most significant cut happens to be insignificant in signal or background.
We would like a utility that is able to do this. The best idea is to separate the plotting functionality into two pieces.
One piece produces a plain-text file of the points and values at each point, while another piece creates the actual grid. This allows us to diff the text files separately and not have to remake them... and makes this portion easier to do.
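A sketch of the two pieces; the file format and the point coordinates are illustrative:

# hedged sketch: piece one dumps grid points to diff-able plain text; piece
# two reads that file back and would draw the actual grid from it
def dump_points(points, path):
    with open(path, 'w') as f:
        for x, y, value in points:
            f.write('%g %g %g\n' % (x, y, value))

def load_points(path):
    with open(path) as f:
        return [tuple(float(v) for v in line.split()) for line in f]

dump_points([(1000, 600, 2.3), (1100, 500, 1.7)], 'grid_points.txt')
print(load_points('grid_points.txt'))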
For the 3.1 results, all files in PDF format for the INT note, please.
Most likely, create a dictionary of values that maps each item to a corresponding grid entry that we know how to translate in the utility.
Also add in things like plotting Run 1, and so on.
https://github.com/kratsg/Optimization/blob/master/optimize.py#L170
It should not do that.
Add another option to the cuts JSON to flag a branch for exploration as a window cut. This is pretty easy to do in practice: you just have two iterators, one for the start of the window and one for the end. Start with the iterators some MINWINDOW size apart, and advance the right iterator by a step. Advance the left iterator to sweep all available windows, then reset the left iterator and advance the right iterator. You can have an option to search for signal INSIDE the window or outside the window (99% of the time we want inside, but just to be sure).
This is mostly useful for mass cuts on the re-clustered jets: it would be good to see whether a window improves anything over a simple cut.
I would not worry about extending the generator to this case; this is something we can specify manually when we have a good reason to expect a particular branch to want a window.
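A minimal sketch of that sweep under the description above; sweep_windows and its signature are hypothetical, not an existing option in the cuts JSON:

def sweep_windows(start, stop, step, min_window):
    # yield (left, right) window edges at least min_window apart
    right = start + min_window
    while right <= stop:
        left = start
        while right - left >= min_window:
            yield (left, right)
            left += step
        right += step

# windows over [0, 300] in steps of 50, at least 100 wide;
# cut inside the window with (left < x) & (x < right), or invert for outside
for left, right in sweep_windows(0, 300, 50, 100):
    print(left, right)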
This means we need to parse the selection to figure out the (more than one) branches we wish to select when computing the event_weight, as well as incorporating scale factors :)
E.g.: load in the supercuts file, ask the user which cuts to fix and at what values -- and then just generate the hashes. A workflow would be
Giordon Stark [8:26 AM]
this gives us a speed up because we don't spend time recomputing cuts.
See how well it fares for 1, 2, 4, 10 ifb
Would be good to add simple branch math support. This is the ability to specify in the json that we want to scan over a variable like "BRANCH1 + BRANCH2."
I'm not sure about the best way to do this, but TTree::Draw's branch manipulation syntax is very good at doing this sort of thing. I'm sure numpy has something clever as well.
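For instance, a hedged sketch of the numpy route using numexpr; the branch names and the structured-array layout are illustrative:

import numpy as np
import numexpr as ne

# events as a numpy structured array, one field per branch
events = np.zeros(3, dtype=[('BRANCH1', 'f4'), ('BRANCH2', 'f4')])
events['BRANCH1'] = [1., 2., 3.]
events['BRANCH2'] = [10., 20., 30.]

# numexpr resolves bare names against local_dict, so simple branch math
# like "BRANCH1 + BRANCH2" can be scanned over without making a new branch
derived = ne.evaluate('BRANCH1 + BRANCH2',
                      local_dict={name: events[name] for name in events.dtype.names})
print(derived)  # [11. 22. 33.]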
Probably copy Trisha's regions for that.
Doing cuts takes a long time. Doing optimize takes a little time. If we want to quickly check different levels of luminosity (which we will, e.g. when we learn the final luminosity for 2015) then it makes more sense to make luminosity part of optimize. Might as well make all scaling part of optimize as well then, so cuts only returns a weighted (by event weights) value, and no weights file is necessary as an argument.
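A sketch of that split; the names and normalization are illustrative, assuming yields are scaled as weighted_yield x lumi x xsec / (sum of generated weights):

# hedged sketch: cuts returns raw event-weighted yields once; optimize rescales
# them per luminosity, so changing lumi does not re-run the cuts
def scale_yield(weighted_yield, lumi_ifb, xsec_pb, n_generated_weighted):
    # 1 ifb = 1000 ipb; names and normalization are illustrative
    return weighted_yield * lumi_ifb * 1000.0 * xsec_pb / n_generated_weighted

raw = 42.0  # event-weighted yield out of cuts
for lumi in (1, 2, 4, 10):
    print(lumi, scale_yield(raw, lumi, xsec_pb=0.5, n_generated_weighted=50000.0))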