kratsg / Optimization
Code for optimizing simple n-tuples
Home Page: http://giordonstark.com/
License: MIT License
Since we take in a series of files and process each DID separately, we can make this faster by parallelizing across DIDs.
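A minimal sketch of what that could look like using multiprocessing; process_did and the file grouping below are illustrative stand-ins, not the repo's current code:

from multiprocessing import Pool

def process_did(item):
    # stand-in for the existing per-DID processing
    did, filenames = item
    return (did, len(filenames))

# illustrative grouping of the input files by DID
files_by_did = {
    '110351': ['a.root', 'b.root'],
    '110352': ['c.root'],
}

if __name__ == '__main__':
    pool = Pool(processes=2)
    for did, n in pool.map(process_did, sorted(files_by_did.items())):
        print('processed DID %s (%d files)' % (did, n))
    pool.close()
    pool.join()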
NotImplementedError: couldn't find matching opcode for 'mul_bbb'
This occurs because of multiplication between two booleans: numexpr has no opcode for a boolean-times-boolean product. Write the cut as 1*({0} < m_effective)*(m_effective < {0}+200) instead, so the comparisons are promoted to integers first.
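For reference, a small reproduction of the failure and the workaround, assuming the selections are evaluated with numexpr (as the opcode name suggests); the values here are illustrative:

import numpy as np
import numexpr as ne

m_effective = np.array([150., 250., 450.])

# this raises NotImplementedError: couldn't find matching opcode for 'mul_bbb',
# because numexpr cannot multiply two boolean results directly:
#   ne.evaluate('(100 < m_effective) * (m_effective < 300)')

# multiplying by 1 first promotes the comparisons to integers, so this works:
mask = ne.evaluate('1 * (100 < m_effective) * (m_effective < 300)')
print(mask)  # [1 1 0]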
Currently programmed to cut on weighted events.
@swiatlow: let's talk about signal regions now?
Max Swiatlowski [11:22 AM]
Cool
Max Swiatlowski [11:22 AM]
Did you see my slides?
Giordon Stark [11:23 AM]
the new ones?
Giordon Stark [11:23 AM]
with the 4 signal regions?
Max Swiatlowski [11:23 AM]
Yeah
Giordon Stark [11:23 AM]
I don't understand the 4 particularly.
Max Swiatlowski [11:23 AM]
4?
Giordon Stark [11:23 AM]
do you mean 2 signal regions for low lumi, and 2 for high lumi?
Max Swiatlowski [11:24 AM]
Yup
Max Swiatlowski [11:24 AM]
Low boost/ high boost
Max Swiatlowski [11:24 AM]
It's very approximate
Max Swiatlowski [11:24 AM]
The first thing to do is just implement these 4 SRs
Max Swiatlowski [11:25 AM]
Then check, for each lumi, which SR is best for each point
Max Swiatlowski [11:25 AM]
And what the significance is
Giordon Stark [11:25 AM]
so a SR == a cut?
Max Swiatlowski [11:25 AM]
If the sig is very different from the OPTIMAL you found... We might need to adjust, or add a new SR, etc.
Max Swiatlowski [11:25 AM]
SR is a set of cuts
Max Swiatlowski [11:26 AM]
It's the table on slide 15 or whatever
Giordon Stark [11:26 AM]
err, a SR = a supercut?
Max Swiatlowski [11:26 AM]
Yeah, with all pivots
Max Swiatlowski [11:26 AM]
Just fixed cuts
Giordon Stark [11:26 AM]
gotcha. No problem.
Max Swiatlowski [11:26 AM]
Then, you check which SR is best at each point, etc. etc.
Giordon Stark [11:26 AM]
so I run all 4 supercuts I have
Max Swiatlowski [11:27 AM]
Yeah
Giordon Stark [11:27 AM]
and just look at all the significances reported for a given signal region
Giordon Stark [11:27 AM]
find which of the 4 maximizes the significance
Giordon Stark [11:27 AM]
and just make a grid showing 1,2,3,4
Giordon Stark [11:27 AM]
basically saying which one maximized that box?
Max Swiatlowski [11:27 AM]
Yup!
import re
import logging
logger = logging.getLogger(__name__)

#@echo(write=logger.debug)
# matches the 6-8 digit dataset ID (DID), optionally zero-padded, in a path
did_regex = re.compile(r'\.?(?:00)?(\d{6,8})\.?')

def get_did(filename):
    global did_regex
    # format I: the DID is in the parent directory's name
    m = did_regex.search(filename.split("/")[-2])
    if m is None:
        logger.warning("Can't figure out the DID! Trying format II ...")
        # format II: the DID is in the filename itself
        m = did_regex.search(filename.split("/")[-1])
    if m is None:
        logger.warning("Can't figure out the DID! Using input filename ...")
        return filename.split("/")[-1]
    return m.group(1)
This repo is missing a license. Without a license, all code is copyrighted to the author and may not be used by anyone else.
Please use something like http://choosealicense.com/ to decide which license to use. I recommend MIT or GPL.
(A PSA: Add a License, Please.)
We should make sure that it's made clear to the user (with a flag, or just an abort, or something) when B (or S, though B is more likely) is statistically insignificant.
Max Swiatlowski [4:42 PM]
People often do this by requiring B(unweighted) > 10 or something, for example
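A sketch of what such a guard could look like; the function name and default threshold are illustrative, with the B(unweighted) > 10 convention taken from the comment above:

# hedged sketch: flag cuts whose raw (unweighted) yields are too small to trust
MIN_UNWEIGHTED = 10

def is_insignificant(n_bkgd_unweighted, n_sig_unweighted, min_events=MIN_UNWEIGHTED):
    return n_bkgd_unweighted < min_events or n_sig_unweighted < min_events

if is_insignificant(n_bkgd_unweighted=4, n_sig_unweighted=120):
    print('warning: statistically insignificant B or S for this cut')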
Given a supercuts file, iterate through all N-1 combinations and produce plots based on the subset of the supercuts.
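A quick sketch of generating those N-1 subsets with itertools; the cut names are placeholders:

from itertools import combinations

# each N-1 subset drops exactly one of the supercuts
supercuts = ['multiplicity_jet', 'pt_jet_rc8_1', 'm_jet_largeR_1']

for subset in combinations(supercuts, len(supercuts) - 1):
    dropped = (set(supercuts) - set(subset)).pop()
    print('plot using %s (N-1 on %s)' % (list(subset), dropped))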
When applying cuts, it would be nice to have the number of events passing each one.
Just to write this down: we decided in the meeting to optimize towards 0.5 for insignificance.
Aviv Cukierman [12:55 PM]
In optimize
the insignificance for signal and background is the same number
Aviv Cukierman [12:55 PM]
Just --insignificance
Currently, if you have a cuts/ directory, it yells at you.
It should only yell at you if the specific file you're making already exists.
Better, add an option that allows you to overwrite (default false, obviously).
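A sketch of that check with a hypothetical --overwrite flag; the argparse option names are illustrative:

import argparse, os, sys

parser = argparse.ArgumentParser()
parser.add_argument('--output', default='cuts/result.json')
parser.add_argument('--overwrite', action='store_true', default=False)
args = parser.parse_args()

# only complain when the specific output file already exists
if os.path.exists(args.output) and not args.overwrite:
    sys.exit('%s already exists; pass --overwrite to replace it' % args.output)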
Let's say you're given a series of output hashes that contain cuts. You should be able to do something like
from optimize import *
cut = load_cut('/path/to/hash.json')
trees = get_ttrees('....')
signal = get_signal(...)
bkgd = get_bkgd(...)
apply_cut(...)
See weights.yml for an example file. It would be instructive to provide a way to incorporate the weights, similar to how kratsg/TakeOverTheWorld does it.
[
  {
    "branch": "multiplicity_jet",
    "min": 2,
    "max": "...",
    "stepSize": 1
  },
  {
    "branch": "pt_jet_rc8_1",
    "min": 50.0,
    "max": 250.0,
    "stepSize": 2.5
  }
]
Given a list of branches (e.g. m_jet_largeR_0, m_jet_largeR_1, etc.), and a cut (e.g. ">200", or a window cut if that gets supported), create a value that counts the number of those branches that passed that cut. Then, allow us to cut on that value.
Perhaps in supercuts allow a flag called "derived" where you specify:
- a list of branches
- an initial cut (start, step, stop) (e.g. [200;100;500]) [or the initial cut could be fixed]
- a final cut (start, step, stop) (e.g. [0;1;4])
Note that if TA only stores the top 4 large R jets, e.g., as it currently does, then the final cut could only go from 0 to 4. But I think that would be fine (e.g. top tagging is usually 0-1).
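A numpy sketch of the idea, counting how many of the listed branches pass the initial cut and then cutting on that count; the branch names and values are illustrative, not the repo's implementation:

import numpy as np

# stand-in arrays for the top-4 large-R jet masses, one entry per event
jets = {
    'm_jet_largeR_1': np.array([250., 80., 300.]),
    'm_jet_largeR_2': np.array([220., 60., 150.]),
    'm_jet_largeR_3': np.array([90., 40., 210.]),
    'm_jet_largeR_4': np.array([30., 20., 205.]),
}

initial_cut = 200.0  # each branch must exceed this
final_cut = 1        # require more than this many passing branches

n_passing = sum((vals > initial_cut).astype(int) for vals in jets.values())
event_mask = n_passing > final_cut
print(n_passing, event_mask)  # [2 0 3] [ True False  True]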
This needs to be resolved ASAP; it is currently not clear at all.
Need to do something like SetTextColor(kWhite)
If we see we're running into #3, we can try to change the order of the dictionaries to explore the issue.
Often, @AvivCukierman does something like [100, 201, 10],
and this causes the z-axis to look ugly. The best way to fix this is to incorporate logic that does modular arithmetic to fix it :)
Or use numpy arange and grab the last element, as it'll also handle that.
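For example, a minimal check of that approach (not the repo's code):

import numpy as np

# [start, stop, step] = [100, 201, 10], as in the example above
edges = np.arange(100, 201, 10)
print(edges[-1])  # 200: the last value actually on the step grid,
                  # which is what the z-axis range should use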
"selections": "((m_jet_largeR_1 > {0})+ (m_jet_largeR_2 > {0}) + (m_jet_largeR_3 > {0}) + (m_jet_largeR_4 > {0})) > {1}",
"st3": [
[100, 250, 50],
[0, 3, 1]
]
Shouldn't be too hard, and shouldn't consume too much memory.
This is going to be a rather trivial check. Ideally, we want to skip showing the signal points for which the most significant cut happens to be insignificant in signal or background.
We would like a utility that is able to do this. The best idea is to separate the plotting functionality into two pieces.
One piece produces a plain-text file of the points and values at each point, while another piece creates the actual grid. This allows us to diff the text files separately and not have to remake them... and makes this portion easier to do.
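A sketch of the two pieces; the file format and the point coordinates are illustrative:

# hedged sketch: piece one dumps grid points to diff-able plain text; piece
# two reads that file back and would draw the actual grid from it
def dump_points(points, path):
    with open(path, 'w') as f:
        for x, y, value in points:
            f.write('%g %g %g\n' % (x, y, value))

def load_points(path):
    with open(path) as f:
        return [tuple(float(v) for v in line.split()) for line in f]

dump_points([(1000, 600, 2.3), (1100, 500, 1.7)], 'grid_points.txt')
print(load_points('grid_points.txt'))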
For the 3.1 results, all files in PDF format for the INT note, please.
Most likely, create a dictionary of values that maps each item to a corresponding grid entry that we know how to translate in the utility.
Also add in things like plotting Run 1, and so on.
https://github.com/kratsg/Optimization/blob/master/optimize.py#L170
It should not do that.
Add another option to the cuts JSON to flag a branch for exploration as a window cut. This is pretty easy to do in practice: you just have two iterators, one for the start of the window and one for the end. Start with the iterators some MINWINDOW size apart, and advance the right iterator by a step. Advance the left iterator to sweep all available windows, then reset the left iterator and advance the right iterator. You can have an option to search for signal INSIDE the window or outside the window (99% of the time we want inside, but just to be sure).
This is mostly useful for mass cuts on the re-clustered jets: it would be good to see whether a window improves anything over a simple cut.
I would not worry about extending the generator to this case; this is something we can specify manually when we have a good reason to expect a particular branch to want a window.
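A minimal sketch of that sweep under the description above; sweep_windows and its signature are hypothetical, not an existing option in the cuts JSON:

def sweep_windows(start, stop, step, min_window):
    # yield (left, right) window edges at least min_window apart
    right = start + min_window
    while right <= stop:
        left = start
        while right - left >= min_window:
            yield (left, right)
            left += step
        right += step

# windows over [0, 300] in steps of 50, at least 100 wide;
# cut inside the window with (left < x) & (x < right), or invert for outside
for left, right in sweep_windows(0, 300, 50, 100):
    print(left, right)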
This means we need to parse the selection to figure out the (more than one) branches we wish to select when computing the event_weight, as well as incorporating scale factors :)
E.g.: load in the supercuts file, ask the user which cuts to fix and at what values -- and then just generate the hashes. A workflow would be
Giordon Stark [8:26 AM]
this gives us a speed up because we don't spend time recomputing cuts.
See how well it fares for 1, 2, 4, 10 ifb
Would be good to add simple branch math support. This is the ability to specify in the json that we want to scan over a variable like "BRANCH1 + BRANCH2."
I'm not sure about the best way to do this, but TTree::Draw's branch manipulation syntax is very good at doing this sort of thing. I'm sure numpy has something clever as well.
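For instance, a hedged sketch of the numpy route using numexpr; the branch names and the structured-array layout are illustrative:

import numpy as np
import numexpr as ne

# events as a numpy structured array, one field per branch
events = np.zeros(3, dtype=[('BRANCH1', 'f4'), ('BRANCH2', 'f4')])
events['BRANCH1'] = [1., 2., 3.]
events['BRANCH2'] = [10., 20., 30.]

# numexpr resolves bare names against local_dict, so simple branch math
# like "BRANCH1 + BRANCH2" can be scanned over without making a new branch
derived = ne.evaluate('BRANCH1 + BRANCH2',
                      local_dict={name: events[name] for name in events.dtype.names})
print(derived)  # [11. 22. 33.]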
Probably copy Trisha's regions for that.
Doing cuts takes a long time. Doing optimize takes a little time. If we want to quickly check different levels of luminosity (which we will, e.g. when we learn the final luminosity for 2015) then it makes more sense to make luminosity part of optimize. Might as well make all scaling part of optimize as well then, so cuts only returns a weighted (by event weights) value, and no weights file is necessary as an argument.
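A sketch of that split; the names and normalization are illustrative, assuming yields are scaled as weighted_yield x lumi x xsec / (sum of generated weights):

# hedged sketch: cuts returns raw event-weighted yields once; optimize rescales
# them per luminosity, so changing lumi does not re-run the cuts
def scale_yield(weighted_yield, lumi_ifb, xsec_pb, n_generated_weighted):
    # 1 ifb = 1000 ipb; names and normalization are illustrative
    return weighted_yield * lumi_ifb * 1000.0 * xsec_pb / n_generated_weighted

raw = 42.0  # event-weighted yield out of cuts
for lumi in (1, 2, 4, 10):
    print(lumi, scale_yield(raw, lumi, xsec_pb=0.5, n_generated_weighted=50000.0))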