ldmx-software / fire
Event-by-event processing framework using HDF5 and C++17
Home Page: https://ldmx-software.github.io/fire/
License: GNU General Public License v3.0
Is your feature request related to a problem? Please describe.
Demangled type names may change across systems and even across compiler versions. To make fire more future-proof, users should be able to run fire without the string-based type checking that is done at various points.
Describe the solution you'd like
Compile-time or run-time choice allowing users to effectively turn off the string-based type checking that is done.
The places where these demangled names must effectively persist across compiler versions are the Event::add and Event::get methods.
Lines 196 to 209 in 9e62a02
Lines 306 to 317 in 9e62a02
You'll notice that this check is only meant to prevent us from trying to coerce the data on disk into an event object class which cannot hold that data. We could allow the user to run in an "unsafe" mode where we just try to load the data without this run-time check. This may lead to difficult-to-parse exceptions or undefined behavior in the destination class, so this mode should not be the default.
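For illustration, the idea can be sketched outside of C++. The following is a hypothetical Python stand-in (not fire's actual API; Event, add, and get here are invented mimics) showing a string-based type check that an explicit, non-default "unsafe" flag disables:

```python
# Hypothetical sketch, NOT fire's actual API: a string-based type check
# that an explicit, non-default "unsafe" flag turns off.
class Event:
    def __init__(self, unsafe=False):
        self.unsafe = unsafe  # skip string-based type checking when True
        self.objects = {}     # name -> (type name string, data)

    def add(self, name, obj):
        # record the type name alongside the data, akin to a demangled name
        self.objects[name] = (type(obj).__name__, obj)

    def get(self, name, expected_type):
        type_name, data = self.objects[name]
        # this string comparison is what "unsafe" mode skips, since
        # demangled names may differ across compilers and systems
        if not self.unsafe and type_name != expected_type.__name__:
            raise TypeError(f"'{name}' holds {type_name}, "
                            f"not {expected_type.__name__}")
        return data
```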
Is your feature request related to a problem? Please describe.
When running with the most basic python config
import fire.cfg
p = fire.cfg.Process('sim')
the following error is encountered
[Python] Python object does not have a __dict__ member
This is due to the lack of input or output file being specified. The behavior of throwing an error is correct but the message should be more specific so the user knows what to correct.
Describe the solution you'd like
Make the error message more specific in the case that an input or output file is missing from the configuration.
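A minimal sketch of the targeted validation, in plain Python. The key names 'input_files' and 'output_file' are assumptions for illustration, not necessarily fire's actual configuration keys:

```python
# Hypothetical validation sketch; 'input_files' and 'output_file' are
# assumed names, not necessarily fire's actual configuration keys.
def validate(config):
    # raise a specific message instead of a generic '__dict__' error
    if not config.get("input_files") and not config.get("output_file"):
        raise ValueError(
            "Process has no input files and no output file; "
            "specify at least one before running.")

validate({"output_file": "out.h5"})  # a valid minimal configuration
```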
I need to determine if fire should support a HistogramPool. This would significantly affect how a merging program #4 would operate, and it may not even be beneficial given how efficient h5py and numpy are on the analysis end.
Is your feature request related to a problem? Please describe.
A user may want to update an event object's definition while retaining the ability to read data files written with the old version.
Describe the solution you'd like
Both ROOT and Boost.Serialization have a method for versioning the class that is being serialized. Adopting this macro-style versioning scheme would be appropriate.
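For illustration only, here is the versioning idea sketched in Python (the key name and the migration step are invented): the writer tags each record with a schema version, and the reader migrates old layouts forward on load.

```python
SCHEMA_VERSION = 2  # bump whenever the serialized layout changes

def save(obj):
    # tag the record with the schema version it was written with
    record = dict(obj)
    record["__version__"] = SCHEMA_VERSION
    return record

def load(record):
    version = record.get("__version__", 1)  # pre-versioning files are v1
    obj = {k: v for k, v in record.items() if k != "__version__"}
    if version < 2:
        # hypothetical migration step: v1 stored energy in MeV, v2 in GeV
        obj["energy"] = obj["energy"] / 1000.0
    return obj
```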
Now that we are building the tests out of the container, we can start expanding our reach.
@pbutti are you interested in trying to test this out? The dependencies are pretty easy to obtain and I can give you more info if you'd like.
HighFive is a wonderful package; however, it doesn't line up with our workflow perfectly. Moreover, we only use a small subset of its features making it relatively easy to translate HighFive calls into direct HDF5 C-API calls.
Don't translate into the HDF5 C++ API because it doesn't support multi-threading right now. With the potential interest in becoming multi-threaded in the future, using the C API leaves that door open.
Describe the bug
Building of fire fails due to a failure to find the boost headers.
To Reproduce
Build fire on SLAC's SDF using devtoolset 8.
Expected behavior
I expected it to build without failure.
Terminal Print Out
fatal error: boost/core/demangle.hpp: No such file or directory
#include <boost/core/demangle.hpp>
Additional context
Just need to add the appropriate targets.
The goal here is to limit the number of header includes that reach the end user classes. This is for safety's sake, to prevent low-level implementation changes from breaking high-level usage. But it can also hopefully improve compile time.
My thought process is to look into a fire::io::accessor templated struct to interface between a user class and its io::Data wrapper.
Is your feature request related to a problem? Please describe.
We want to ensure that all event objects that pass the drop/keep rules are persisted into the output file, even if they are not accessed by Event::get. The implementation of this for h5::Reader was done in #45; however, implementing this for root::Reader is delayed due to the necessary complexity.
Describe the solution you'd like
An implementation of root::Reader::copy which silently reformats a branch of the LDMX_Events tree into the new HDF5 style. This would effectively open the door to a full reformat program which does nothing except move data from a ROOT file into the new fire/HDF5 files.
Describe alternatives you've considered
A medium-term alternative is to include a processor in the sequence which accesses all of the objects that the user wishes to transport from the input ROOT file into the output HDF5 file. This is annoying because the user then needs to write this producer and make sure to link it against the object definitions and ROOT dictionary. The objects do not need to be processed at all; the Event::get call is all that is necessary.
Additional context
Copying over notes from #45 ...
Sneaking a look at how uproot identifies branch types, I wrote a quick ROOT macro to test it out and it is printing a type name. The key is to retrieve the TLeaf rather than try to use the TBranch::GetClassName directly. I think I can use this to construct a copy mechanism for ROOT files.
// in file: introspect.C
std::vector<TBranch*> flatten(TObjArray* list) {
  std::vector<TBranch*> flattened;
  for (int i{0}; i < list->GetEntries(); i++) {
    TBranch* sub_branch = (TBranch*)list->At(i);
    if (sub_branch->GetListOfBranches()->GetEntries() > 0) {
      std::vector<TBranch*> sub_list{flatten(sub_branch->GetListOfBranches())};
      flattened.insert(flattened.end(), sub_list.begin(), sub_list.end());
    } else {
      flattened.push_back(sub_branch);
    }
  }
  return flattened;
}

void introspect() {
  TFile f{"test.root"};
  auto t{dynamic_cast<TTree*>(f.Get("LDMX_Events"))};
  if (not t) {
    std::cerr << "No tree" << std::endl;
    return;
  }
  auto branches{flatten(t->GetListOfBranches())};
  for (TBranch* br : branches) {
    std::cout << br->GetFullName() << " : " << br->GetTitle();
    if (br->GetListOfLeaves()->GetEntries() > 0) {
      auto leaf{(TLeaf*)br->GetListOfLeaves()->At(0)};
      std::cout << " leaf: "
                << leaf->GetTypeName() << " "
                << leaf->GetFullName() << " "
                << leaf->GetTitle();
    }
    std::cout << std::endl;
  }
}
Is your feature request related to a problem? Please describe.
Related to the bug caught by @omar-moreno in #36. Different OS+compiler combinations have different search paths. Expanding our testing should help to this end.
Describe the solution you'd like
Add more OS options to the test job. We can use the container option to specify an OS not offered by GitHub runners themselves.
Describe alternatives you've considered
The alternative is not to include these OSes in CI testing. I think we can decide on OSes case by case, but CentOS is heavily used in the community, so it should definitely be included.
I want a user to be able to do the following
import fire.ana
with fire.ana.File('input.h5') as f:
    for event in f:
        total_E = sum(event['recon/hits/energy'])
This will make fire explicitly depend on h5py, but only at this module level. I'm thinking the implementation of this would be similar to the current Framework's EventTree module while using h5py to access the data sets on disk. Similar to the EventTree module, this would only be designed to read fire files. The user could still produce other HDF5 files with direct access to h5py, but those files will not be standardized in the way fire files are. (This is similar to how ROOT-based Python analyses function as well.)
@EinarElen suggested the excellent talk Bridge to New Thingia, which gave compelling evidence that a solid transition ramp is necessary to make adoption of a new design successful.
With this in mind, I am putting support for reading ROOT files on my priority list. I am intentionally avoiding writing ROOT files because the transition should only be one directional; however, having a reader that can handle data files generated with older pre-HDF5 versions of fire will help ease the transition to the new framework within ldmx-sw.
Perhaps this read support can be deprecated after some time is given for the old data files to become stale or deleted.
I also want to look into reducing the number of changes necessary in the Processors. These changes are most clearly shown in the ldmx-rootless/Bench module for example Produce. I'm thinking many of these can be easily resolved with some thoughtful additions here.
During LDMX meeting discussions, I've been hearing a lot of comments of the form "I do X in ROOT, can I/how do I do that without ROOT?". Having an extra page in the documentation specifically geared towards answering questions of that sort would help point people in the right direction.
Discussion with @pbutti this morning led to the conclusion that a "parallel track" of event bus objects which are explicitly transient can be helpful: such objects should not be serialized, but could be created by a single processor and then used by one or more downstream processors.
The example PF brought up is some ACTS/Eigen objects which would be difficult to serialize directly because they come from an external package. They can be faithfully constructed from simpler, serializable objects, and we could save processing time by having only one instance of them per event.
@pbutti While writing this, I thought of another potential solution. Basically, you could have members of your serialized class that are intentionally not in the attach method, but can be constructed once per event. I'm imagining something along the following lines.
#include <fire/h5/Data.h>

class Track {
  std::vector<double> parameters_;
  std::vector<double> cov_matrix_;
  std::unique_ptr<ACTS::Track> track_;

  friend class fire::h5::Data<Track>;
  void clear() {
    parameters_.clear();
    cov_matrix_.clear();
    track_.reset(nullptr);
  }
  void attach(fire::h5::Data<Track>& d) {
    d.attach("parameters", parameters_);
    d.attach("cov_matrix", cov_matrix_);
  }
  Track() = default;

 public:
  Track(ACTS::Track& t) {
    track_ = std::make_unique<ACTS::Track>(t);  // copy into our handle to track
    // copy ACTS::Track stuff into our serializable objects
  }
  ACTS::Track& track() {
    if (not track_) {
      // put our serializable objects into ACTS::Track
    }
    return *track_;
  }
};
Right now, there is only basic testing of the Python package, involving only the components used in the test module. Implementing some basic functional testing with pytest would make the package more robust and help quicken future development.
Combining files after a thorough skim would make analysis of the data much easier. I'm imagining another executable (written in C++ or Python) that would handle combining files in a way that preserves the necessary internal file relationships so that the output file can still be an input fire file and work with our ecosystem.
This would include managing the RunHeader dataset so that overwriting is prevented.
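As a sketch of the bookkeeping involved (plain Python, with files modeled as dicts of named event columns plus a run table; all names here are assumptions, not the actual file layout), the merge would append event datasets and refuse to overwrite a conflicting RunHeader:

```python
def merge(files):
    """Merge event columns and run headers; files are modeled as dicts."""
    out = {"events": {}, "runs": {}}
    for f in files:
        # append each event dataset end-to-end
        for name, column in f["events"].items():
            out["events"].setdefault(name, []).extend(column)
        # carry run headers over, preventing silent overwrites
        for run, header in f["runs"].items():
            if run in out["runs"] and out["runs"][run] != header:
                raise ValueError(f"conflicting RunHeader for run {run}")
            out["runs"][run] = header
    return out
```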
In my discussion with @jmmans about this serialization method, I found myself describing the "tree" of classes and how they can recursively handle a multitude of event object types. I think having a diagram showing this "tree", with some annotations describing what the different nodes mean, would be beneficial for documenting how the fire::io::Data class is designed and meant to be used.
Move onto bare metal for testing actions so we can test other OSes and compilers. Should be simple, using HighFive's action as a starting point.
Need to decide whether we should pursue Python bindings rather than the separated structure we currently have.
Currently, we are just serializing bools into shorts. This is not a very satisfactory solution, especially the data copying necessary to get around the std::vector<bool> specialization in C++.
The solution is to implement a bool<->enum mapping and serialize the enum. This has already been done by h5py and would mean that opening a boolean dataset in h5py would work 'out of the box'.
The eudaq Factory is intriguing to me. Their registration process involves much simpler code than what we have inside of our macro:
// registration code in eudaq
namespace {
auto dummy = Factory<eudaq::Producer>::Register<MyProducer, ConstructorArgTypes...>(my_factory_id);
}
Then they create the maker function within the factory itself with a private Factory function:
// in Factory<BASE> definition
template <typename DERIVED, typename... ARGS>
std::uint64_t Register(std::uint32_t id) {
  auto& ins = Instance<ARGS&&...>();
  ins[id] = &MakerFun<DERIVED, ARGS&&...>;
  return reinterpret_cast<std::uintptr_t>(&ins);
}

template <typename DERIVED, typename... ARGS>
static UP_BASE MakerFun(ARGS&&... args) {
  return UP_BASE(new DERIVED(std::forward<ARGS>(args)...),
                 [](BASE* p) { delete p; });
}
I'm not really sure how this registration code gets called when the central run control is opened, but this registration style is more robust and could help simplify the macro that needs to declare objects.
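Presumably the registration runs via ordinary static initialization: the namespace-scope dummy is initialized when the library containing it is loaded, which calls Register before any run-control logic executes. The same self-registration pattern can be sketched in Python for illustration (all names here are invented):

```python
class Factory:
    registry = {}  # id -> maker function

    @classmethod
    def register(cls, derived, factory_id):
        # store a maker function, mirroring eudaq's MakerFun pointer
        cls.registry[factory_id] = lambda *args: derived(*args)
        return factory_id

    @classmethod
    def make(cls, factory_id, *args):
        return cls.registry[factory_id](*args)

class MyProducer:
    def __init__(self, name):
        self.name = name

# analogous to eudaq's `auto dummy = Factory<...>::Register<...>(id);`
# this runs at import time, just as the C++ dummy runs at library load
_dummy = Factory.register(MyProducer, 42)
```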
Describe the bug
Looking deeper at the Event::get method, I've noticed that the only objects from the input file that are copied into the output file are ones that are accessed by processors. This is a somewhat niche bug, so it is not high on my priority list; however, I think it is important to document, since it is a departure from how the current ROOT-based framework operates.
Goal
My goal is to have un-accessed objects copied into the output file in the same way accessed objects are. The copying would only occur for objects passing the drop/keep rules, and entries corresponding to dropped events would be skipped.
Describe the bug
There are other packages that use the name fire, which may cause confusion:
Solutions
It is foreseeable that a user of fire will want to create a new program that reads a file produced by fire. (Specifically, I am thinking of the OverlayProducer in ldmx-sw.)
This UserReader would be able to read from a fire file with more ease than the lower-level h5::Reader. It would support looping from the end to the beginning of the file as well as starting at an offset entry in the file.
This module is quickly getting overloaded, so splitting it will be helpful for future maintainability. I'm imagining that each class will have its own file and the init script for the cfg module would import these classes into the cfg namespace.
Implement an action that can do a relative benchmarking comparison between trunk and the PR branch similar to how ldmx-sw is validated during PRs.
This will help prevent inadvertent performance regressions during future development and make it easier to test other designs for their performance effects.
Currently, the drop/keep rules require the user to know that object names have a <pass>/<name> structure. We can avoid this requirement by expanding the DropKeepRule configuration class and the EventObjectTag, or perhaps adding another class.
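One way to sketch the idea (plain Python; the rule shape and default are assumptions, not the actual DropKeepRule interface): apply each rule to the bare object name, so the user never writes the <pass>/ prefix themselves.

```python
import re

def keep(full_name, rules):
    """full_name is '<pass>/<name>'; rules are (regex, keep?) pairs
    matched against the bare name, with the last match winning."""
    _pass, _, name = full_name.partition("/")
    decision = True  # assumed default: keep everything
    for pattern, keep_it in rules:
        if re.fullmatch(pattern, name):
            decision = keep_it
    return decision
```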
I need to check what happens and decide if it is the behavior we want.
I can't suss out whether add would overwrite what was retrieved by get, or if there is an exception I'm not thinking of.
Describe the solution you'd like
Use cmake_dependent_option to have the fire_USE_ROOT option only available if ROOT is found.
Describe alternatives you've considered
The current solution is the best known alternative, I believe.
Additional context
Found while parsing EUDAQ's cmake infrastructure:
include(CMakeDependentOption)
find_package(Qt5Widgets CONFIG)
cmake_dependent_option(EUDAQ_BUILD_GUI "Compile GUI executables (requires QT5)" ON
                       "Qt5Widgets_FOUND" OFF)
if(NOT EUDAQ_BUILD_GUI)
  message(STATUS "GUIs of executables (euRun, euLog) are NOT to be built (EUDAQ_BUILD_GUI=OFF)")
  return()
endif()
message(STATUS "GUIs of executables (euRun, euLog) are to be built (EUDAQ_BUILD_GUI=ON)")