ldmx-software / fire
Event-by-event processing framework using HDF5 and C++17
Home Page: https://ldmx-software.github.io/fire/
License: GNU General Public License v3.0
Is your feature request related to a problem? Please describe.
Demangled type names may change across systems and even across compiler versions. To make fire more future-proof, users should be able to run fire without the string-based type checking that is done at various points.
Describe the solution you'd like
Compile-time or run-time choice allowing users to effectively turn off the string-based type checking that is done.
The places where these demangled names must effectively persist across compiler versions are the Event::add and Event::get methods.
Lines 196 to 209 in 9e62a02
Lines 306 to 317 in 9e62a02
You'll notice that this check is only meant to prevent us from trying to coerce the data on disk into an event object class which cannot hold that data. We could allow the user to run in an "unsafe" mode where we just try to load the data without this run-time check. This may lead to difficult-to-parse exceptions or undefined behavior in the destination class, so this mode should not be the default.
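For illustration, the idea can be sketched outside of C++. The following is a hypothetical Python stand-in (not fire's actual API; Event, add, and get here are invented mimics) showing a string-based type check that an explicit, non-default "unsafe" flag disables:

```python
# Hypothetical sketch, NOT fire's actual API: a string-based type check
# that an explicit, non-default "unsafe" flag turns off.
class Event:
    def __init__(self, unsafe=False):
        self.unsafe = unsafe  # skip string-based type checking when True
        self.objects = {}     # name -> (type name string, data)

    def add(self, name, obj):
        # record the type name alongside the data, akin to a demangled name
        self.objects[name] = (type(obj).__name__, obj)

    def get(self, name, expected_type):
        type_name, data = self.objects[name]
        # this string comparison is what "unsafe" mode skips, since
        # demangled names may differ across compilers and systems
        if not self.unsafe and type_name != expected_type.__name__:
            raise TypeError(f"'{name}' holds {type_name}, "
                            f"not {expected_type.__name__}")
        return data
```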
Is your feature request related to a problem? Please describe.
When running with the most basic python config
import fire.cfg
p = fire.cfg.Process('sim')
the following error is encountered
[Python] Python object does not have a __dict__ member
This is due to the lack of input or output file being specified. The behavior of throwing an error is correct but the message should be more specific so the user knows what to correct.
Describe the solution you'd like
Make the error message more specific in the case that an input or output file is missing from the configuration.
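A minimal sketch of the targeted validation, in plain Python. The key names 'input_files' and 'output_file' are assumptions for illustration, not necessarily fire's actual configuration keys:

```python
# Hypothetical validation sketch; 'input_files' and 'output_file' are
# assumed names, not necessarily fire's actual configuration keys.
def validate(config):
    # raise a specific message instead of a generic '__dict__' error
    if not config.get("input_files") and not config.get("output_file"):
        raise ValueError(
            "Process has no input files and no output file; "
            "specify at least one before running.")

validate({"output_file": "out.h5"})  # a valid minimal configuration
```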
I need to determine if fire should support a HistogramPool. This would significantly affect how a merging program #4 would operate, and it may not even be beneficial given how efficient h5py and numpy are on the analysis end.
Is your feature request related to a problem? Please describe.
A user may want to update an event object's definition while retaining the ability to read data files written with the old version.
Describe the solution you'd like
Both ROOT and Boost.Serialization have a method for versioning the class that is being serialized. Adopting this macro-style versioning scheme would be appropriate.
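For illustration only, here is the versioning idea sketched in Python (the key name and the migration step are invented): the writer tags each record with a schema version, and the reader migrates old layouts forward on load.

```python
SCHEMA_VERSION = 2  # bump whenever the serialized layout changes

def save(obj):
    # tag the record with the schema version it was written with
    record = dict(obj)
    record["__version__"] = SCHEMA_VERSION
    return record

def load(record):
    version = record.get("__version__", 1)  # pre-versioning files are v1
    obj = {k: v for k, v in record.items() if k != "__version__"}
    if version < 2:
        # hypothetical migration step: v1 stored energy in MeV, v2 in GeV
        obj["energy"] = obj["energy"] / 1000.0
    return obj
```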
Now that we are building the tests out of the container, we can start expanding our reach.
@pbutti are you interested in trying to test this out? The dependencies are pretty easy to obtain and I can give you more info if you'd like.
HighFive is a wonderful package; however, it doesn't line up with our workflow perfectly. Moreover, we only use a small subset of its features making it relatively easy to translate HighFive calls into direct HDF5 C-API calls.
Don't translate into the HDF5 C++ API because it doesn't support multi-threading right now. With the potential interest in becoming multi-threaded in the future, using the C API leaves that door open.
Describe the bug
Building of fire fails due to a failure to find the boost headers.
To Reproduce
Build fire on SLAC's SDF using devtoolset 8.
Expected behavior
I expected it to build without failure.
Terminal Print Out
fatal error: boost/core/demangle.hpp: No such file or directory
#include <boost/core/demangle.hpp>
Additional context
Just need to add the appropriate targets.
The goal here is to limit the number of header includes that reach the end user classes. This is for safety's sake, to prevent low-level implementation changes from breaking high-level usage. But it can also hopefully improve compile time.
My thought process is to look into a fire::io::accessor templated struct to interface between a user class and its io::Data wrapper.
Is your feature request related to a problem? Please describe.
We want to ensure that all event objects that pass the drop/keep rules are persisted into the output file, even if they are not accessed by Event::get. The implementation of this for h5::Reader was done in #45; however, implementing this for root::Reader is delayed due to the necessary complexity.
Describe the solution you'd like
An implementation of root::Reader::copy which silently reformats a branch of the LDMX_Events tree into the new HDF5 style. This would effectively open the door to a full reformat program which does nothing except move data from a ROOT file into the new fire/HDF5 files.
Describe alternatives you've considered
A medium-term alternative is to include a processor in the sequence which accesses all of the objects that the user wishes to transport from the input ROOT file into the output HDF5 file. This is annoying because the user then needs to write this producer and make sure to link it against the object definitions and ROOT dictionary. The objects do not need to be processed at all; the Event::get call is all that is necessary.
Additional context
Copying over notes from #45 ...
Sneaking a look at how uproot identifies branch types, I wrote a quick ROOT macro to test it out and it is printing a type name. The key is to retrieve the TLeaf rather than try to use the TBranch::GetClassName directly. I think I can use this to construct a copy mechanism for ROOT files.
// in file: introspect.C
std::vector<TBranch*> flatten(TObjArray* list) {
  std::vector<TBranch*> flattened;
  for (int i{0}; i < list->GetEntries(); i++) {
    TBranch* sub_branch = (TBranch*)list->At(i);
    if (sub_branch->GetListOfBranches()->GetEntries() > 0) {
      std::vector<TBranch*> sub_list{flatten(sub_branch->GetListOfBranches())};
      flattened.insert(flattened.end(), sub_list.begin(), sub_list.end());
    } else {
      flattened.push_back(sub_branch);
    }
  }
  return flattened;
}

void introspect() {
  TFile f{"test.root"};
  auto t{dynamic_cast<TTree*>(f.Get("LDMX_Events"))};
  if (not t) {
    std::cerr << "No tree" << std::endl;
    return;
  }
  auto branches{flatten(t->GetListOfBranches())};
  for (TBranch* br : branches) {
    std::cout << br->GetFullName() << " : " << br->GetTitle();
    if (br->GetListOfLeaves()->GetEntries() > 0) {
      auto leaf{(TLeaf*)br->GetListOfLeaves()->At(0)};
      std::cout << " leaf: "
                << leaf->GetTypeName() << " "
                << leaf->GetFullName() << " "
                << leaf->GetTitle();
    }
    std::cout << std::endl;
  }
}
Is your feature request related to a problem? Please describe.
Related to the bug caught by @omar-moreno in #36. Different OS+compiler combinations have different search paths. Expanding our testing should help to this end.
Describe the solution you'd like
Add more OS options to the test job. We can use the container option to specify an OS not offered by GitHub runners themselves.
Describe alternatives you've considered
The alternative is not to include these OSes in CI testing. I think we can decide on OSes case by case, but CentOS is heavily used in the community, so it should definitely be included.
I want a user to be able to do the following
import fire.ana
with fire.ana.File('input.h5') as f:
    for event in f:
        total_E = sum(event['recon/hits/energy'])
This will make fire explicitly depend on h5py, but only at this module level. I'm thinking the implementation of this would be similar to the current Framework's EventTree module while using h5py to access the data sets on disk. Similar to the EventTree module, this would only be designed to read fire files. The user could still produce other HDF5 files with direct access to h5py, but those files will not be standardized in the way fire files are. (This is similar to how ROOT-based Python analyses function as well.)
@EinarElen suggested the excellent talk Bridge to New Thingia, which gave compelling evidence that a solid transition ramp is necessary to make adoption of a new design successful.
With this in mind, I am putting support for reading ROOT files on my priority list. I am intentionally avoiding writing ROOT files because the transition should only be one directional; however, having a reader that can handle data files generated with older pre-HDF5 versions of fire will help ease the transition to the new framework within ldmx-sw.
Perhaps this read support can be deprecated after some time is given for the old data files to become stale or deleted.
I also want to look into reducing the number of changes necessary in the Processors. These changes are most clearly shown in the ldmx-rootless/Bench module for example Produce. I'm thinking many of these can be easily resolved with some thoughtful additions here.
During LDMX meeting discussions, I've been hearing a lot of comments of the form "I do X in ROOT, can I/how do I do that without ROOT?". Having an extra page in the documentation specifically geared towards answering questions of that sort would help point people in the right direction.
Discussion with @pbutti this morning led to the conclusion that a "parallel track" of event bus objects which are explicitly transient can be helpful: such objects should not be serialized, but could be created by a single processor and then used by one or more downstream processors.
The example PF brought up is some ACTS/Eigen objects which would be difficult to serialize directly because they come from an external package. They can be faithfully constructed from simpler, serializable objects, and we could save processing time by having only one instance of them per event.
@pbutti While writing this, I thought of another potential solution. Basically, you could have members of your serialized class that are intentionally not in the attach method, but can be constructed once per event. I'm imagining something along the following lines.
#include <fire/h5/Data.h>

class Track {
  std::vector<double> parameters_;
  std::vector<double> cov_matrix_;
  std::unique_ptr<ACTS::Track> track_;

  friend class fire::h5::Data<Track>;
  void clear() {
    parameters_.clear();
    cov_matrix_.clear();
    track_.reset(nullptr);
  }
  void attach(fire::h5::Data<Track>& d) {
    d.attach("parameters", parameters_);
    d.attach("cov_matrix", cov_matrix_);
  }
  Track() = default;

 public:
  Track(ACTS::Track& t) {
    track_ = std::make_unique<ACTS::Track>(t);  // copy into our handle to track
    // copy ACTS::Track stuff into our serializable objects
  }
  ACTS::Track& track() {
    if (not track_) {
      // put our serializable objects into ACTS::Track
    }
    return *track_;
  }
};
Right now, there is only basic testing of the Python package, involving only the components used in the test module. Implementing some basic functional testing with pytest would make the package more robust and help quicken future development.
Combining files after a thorough skim would make analysis of the data much easier. I'm imagining another executable (written in C++ or Python) that would handle combining files in a way that preserves the necessary internal file relationships so that the output file can still be an input fire file and work with our ecosystem.
This would include managing the RunHeader dataset so that overwriting is prevented.
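As a sketch of the bookkeeping involved (plain Python, with files modeled as dicts of named event columns plus a run table; all names here are assumptions, not the actual file layout), the merge would append event datasets and refuse to overwrite a conflicting RunHeader:

```python
def merge(files):
    """Merge event columns and run headers; files are modeled as dicts."""
    out = {"events": {}, "runs": {}}
    for f in files:
        # append each event dataset end-to-end
        for name, column in f["events"].items():
            out["events"].setdefault(name, []).extend(column)
        # carry run headers over, preventing silent overwrites
        for run, header in f["runs"].items():
            if run in out["runs"] and out["runs"][run] != header:
                raise ValueError(f"conflicting RunHeader for run {run}")
            out["runs"][run] = header
    return out
```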
In my discussion with @jmmans about this serialization method, I found myself describing the "tree" of classes and how they can recursively handle a multitude of event object types. I think having a diagram showing this "tree", with some annotations describing what the different nodes mean, would be beneficial for documenting how the fire::io::Data class is designed and meant to be used.
Move onto bare metal for testing actions so we can test other OSes and compilers. Should be simple, using HighFive's action as a starting point.
Need to decide whether we should pursue Python bindings rather than the separated structure we currently have.
Currently, we are just serializing bools into shorts. This is not a very satisfactory solution, especially the data copying necessary to get around the std::vector<bool> specialization in C++.
The solution is to implement a bool<->enum mapping and serialize the enum. This has already been done by h5py and would mean that opening a boolean dataset in h5py would work 'out of the box'.
The eudaq Factory is intriguing to me. Their registration process involves much simpler code than what we have inside of our macro:
// registration code in eudaq
namespace {
auto dummy = Factory<eudaq::Producer>::Register<MyProducer, ConstructorArgTypes...>(my_factory_id);
}
Then they create the maker function within the factory itself with a private Factory function:
// in Factory<BASE> definition
template <typename DERIVED, typename... ARGS>
std::uint64_t Register(std::uint32_t id) {
  auto& ins = Instance<ARGS&&...>();
  ins[id] = &MakerFun<DERIVED, ARGS&&...>;
  return reinterpret_cast<std::uintptr_t>(&ins);
}

template <typename DERIVED, typename... ARGS>
static UP_BASE MakerFun(ARGS&&... args) {
  return UP_BASE(new DERIVED(std::forward<ARGS>(args)...),
                 [](BASE* p) { delete p; });
}
I'm not really sure how this registration code gets called when the central run control is opened, but this registration style is more robust and could help simplify the macro that needs to declare objects.
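Presumably the registration runs via ordinary static initialization: the namespace-scope dummy is initialized when the library containing it is loaded, which calls Register before any run-control logic executes. The same self-registration pattern can be sketched in Python for illustration (all names here are invented):

```python
class Factory:
    registry = {}  # id -> maker function

    @classmethod
    def register(cls, derived, factory_id):
        # store a maker function, mirroring eudaq's MakerFun pointer
        cls.registry[factory_id] = lambda *args: derived(*args)
        return factory_id

    @classmethod
    def make(cls, factory_id, *args):
        return cls.registry[factory_id](*args)

class MyProducer:
    def __init__(self, name):
        self.name = name

# analogous to eudaq's `auto dummy = Factory<...>::Register<...>(id);`
# this runs at import time, just as the C++ dummy runs at library load
_dummy = Factory.register(MyProducer, 42)
```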
Describe the bug
Looking deeper at the Event::get method, I've noticed that the only objects from the input file that are copied into the output file are ones that are accessed by processors. This is a somewhat niche bug, so it is not high on my priority list; however, I think it is important to document, since it is a departure from how the current ROOT-based framework operates.
Goal
My goal is to have un-accessed objects copied into the output file in the same way accessed objects are. The copying would only occur for objects passing the drop/keep rules, and entries corresponding to dropped events would be skipped.
Describe the bug
There are other packages that use the name fire, which may cause confusion:
Solutions
It is foreseeable that a user of fire will want to create a new program that reads a file produced by fire. (Specifically, I am thinking of the OverlayProducer in ldmx-sw.)
This UserReader would be able to read from a fire file with more ease than the lower-level h5::Reader. It would support looping from the end to the beginning of the file as well as starting at an offset entry in the file.
This module is quickly getting overloaded, so splitting it will be helpful for future maintainability. I'm imagining that each class will have its own file and the init script for the cfg module would import these classes into the cfg namespace.
Implement an action that can do a relative benchmarking comparison between trunk and the PR branch similar to how ldmx-sw is validated during PRs.
This will help prevent inadvertent performance regressions during future development and make it easier to test other designs for their performance effects.
Currently, the drop/keep rules require the user to know that object names have a <pass>/<name> structure. We can avoid this requirement by expanding the DropKeepRule configuration class and the EventObjectTag, or perhaps adding another class.
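One way to sketch the idea (plain Python; the rule shape and default are assumptions, not the actual DropKeepRule interface): apply each rule to the bare object name, so the user never writes the <pass>/ prefix themselves.

```python
import re

def keep(full_name, rules):
    """full_name is '<pass>/<name>'; rules are (regex, keep?) pairs
    matched against the bare name, with the last match winning."""
    _pass, _, name = full_name.partition("/")
    decision = True  # assumed default: keep everything
    for pattern, keep_it in rules:
        if re.fullmatch(pattern, name):
            decision = keep_it
    return decision
```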
I need to check what happens and decide if it is the behavior we want.
I can't suss out whether add would overwrite what was retrieved by get, or if there is an exception I'm not thinking of.
Describe the solution you'd like
Use cmake_dependent_option to have the fire_USE_ROOT option only available if ROOT is found.
Describe alternatives you've considered
The current solution is the best known alternative, I believe.
Additional context
Found while parsing EUDAQ's cmake infrastructure:
include(CMakeDependentOption)
find_package(Qt5Widgets CONFIG)
cmake_dependent_option(EUDAQ_BUILD_GUI "Compile GUI executables (requires QT5)" ON
                       "Qt5Widgets_FOUND" OFF)
if(NOT EUDAQ_BUILD_GUI)
  message(STATUS "GUIs of executables (euRun, euLog) are NOT to be built (EUDAQ_BUILD_GUI=OFF)")
  return()
endif()
message(STATUS "GUIs of executables (euRun, euLog) are to be built (EUDAQ_BUILD_GUI=ON)")