spenhouet / tensorboard-aggregator

Aggregate multiple tensorboard runs to new summary or csv files

License: MIT License

tensorboard-aggregator's Introduction

tensorboard-aggregator

This project contains an easy-to-use method to aggregate multiple tensorboard runs. The max, min, mean, median, standard deviation and variance of the scalars from multiple runs are saved either as a new tensorboard summary or as a .csv table.

There is a similar tool which uses PyTorch to output the tensorboard summary: TensorBoard Reducer

Feature Overview

Aggregates scalars of multiple tensorboard files
Saves aggregates as new tensorboard summary or as .csv
Aggregate by any numpy function (default: max, min, mean, median, std, var)
Allows any number of subpath structures
Keeps step numbering
Saves wall time average per step
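
As a rough illustration of the aggregation listed above, the sketch below reduces one scalar across several runs with numpy functions. The run values and step numbers are made up for the example; only the idea of applying each function with axis=0 across runs is taken from the project.

```python
import numpy as np

# Hypothetical data: 3 runs of one scalar tag, logged at the same 4 steps.
steps = np.array([0, 100, 200, 300])
runs = np.array([
    [1.00, 0.80, 0.60, 0.50],  # run_1
    [1.10, 0.70, 0.65, 0.45],  # run_2
    [0.90, 0.75, 0.55, 0.40],  # run_3
])

# Default aggregations named in the feature list; any numpy function
# that accepts an `axis` argument could be added here.
operations = {
    "max": np.max, "min": np.min, "mean": np.mean,
    "median": np.median, "std": np.std, "var": np.var,
}

# axis=0 reduces across runs, leaving one value per step.
aggregates = {name: op(runs, axis=0) for name, op in operations.items()}

for name, values in aggregates.items():
    print(name, dict(zip(steps.tolist(), values.round(3).tolist())))
```

Because every run contributes one row of equal length, the step numbering is preserved: each aggregate has exactly one value per original step.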

Setup and run configuration

Download or clone repository files to your computer
Go into repository folder
Install requirements: pip3 install -r requirements.txt --upgrade
You can now run the aggregation with: python aggregator.py

Parameters

Parameter	Required	Default	Description
--path	optional	current working directory	Path to folder containing runs
--subpaths	optional	`['.']`	List of all subpaths
--output	optional	`summary`	Possible values: `summary`, `csv`
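
The following is a minimal sketch of how these three parameters could be parsed with argparse. It is not the parser used in aggregator.py; the helper names parse_subpaths and build_parser are hypothetical, and the sketch assumes the list is passed as a single quoted argument.

```python
import argparse
import ast
from pathlib import Path


def parse_subpaths(text):
    """Parse a Python-style list literal such as "['test', 'train']"."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        raise argparse.ArgumentTypeError("not a valid list literal: {!r}".format(text))
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise argparse.ArgumentTypeError("expected a list of strings, got {!r}".format(text))
    return value


def build_parser():
    parser = argparse.ArgumentParser(description="Aggregate tensorboard runs")
    parser.add_argument("--path", type=Path, default=Path.cwd(),
                        help="Path to folder containing runs")
    parser.add_argument("--subpaths", type=parse_subpaths, default=["."],
                        help="List of all subpaths")
    parser.add_argument("--output", choices=["summary", "csv"], default="summary",
                        help="Output format")
    return parser


# Simulated command line: --subpaths "['test', 'train']" --output csv
args = build_parser().parse_args(["--subpaths", "['test', 'train']", "--output", "csv"])
print(args.path, args.subpaths, args.output)
```

In this sketch the list has to be quoted on the command line, since most shells would otherwise split it at the space into two arguments.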

Recommendation

Add the repository folder to the PATH (global environment variables).
Create an additional script file within the repository folder containing python static/path/to/aggregator.py
- Script name: aggregate.sh / aggregate.bat / ... (depending on your OS)
- Change default behavior via parameters
- Do not change path parameter since this will by default be the path the script is run from
Workflow from here: Open folder with tensorboard files and call the script: aggregate files will be created for the current directory

Explanation

Example folder structure:

.
├── ...
├── test_param_xy      # Folder containing the runs for aggregation
│   ├── run_1          # Folder containing tensorboard files of one run
│   │   ├── test       # Subpath containing one tensorboard file
│   │   │   └── events.out.tfevents. ...
│   │   └── train   
│   │       └── events.out.tfevents. ...
│   ├── run_2
│   ├── ...
│   └── run_X
└── ...

The folder test_param_xy will be the base path (cd test_param_xy). The tensorboard summaries for the aggregation will be created by calling the aggregate script (containing: python static/path/to/aggregator.py --subpaths ['test', 'train'] --output summary)

The base folder contains multiple subfolders. Each subfolder contains the tensorboard files of one run, and all runs use the same model and configuration.

The resulting folder structure for summary looks like this:

.
├── ...
├── test_param_xy
│   ├── ...
│   └── aggregate
│       ├── test
│       │   ├── max
│       │   │   └── test_param_xy 
│       │   │       └── events.out.tfevents. ...
│       │   ├── min
│       │   ├── mean
│       │   ├── median
│       │   └── std    
│       └── train
└── ...

Multiple aggregate summaries can be put together in one directory. Since the original base folder name is kept as a subfolder of the aggregate function folder, the summaries are distinguishable within tensorboard.

.
├── ...
├── max
│   ├── test_param_x
│   ├── test_param_y
│   ├── test_param_z
│   └── test_param_v 
├── min
├── mean
├── median
└── std

The .csv table files for the aggregation will be created by calling the aggregate script (containing: python static/path/to/aggregator.py --subpaths ['test', 'train'] --output csv)

The resulting folder structure for csv looks like this:

.
├── ...
├── test_param_xy
│   ├── ...
│   └── aggregate
│       ├── test
│       │   ├── max_test_param_xy.csv
│       │   ├── min_test_param_xy.csv
│       │   ├── mean_test_param_xy.csv
│       │   ├── median_test_param_xy.csv
│       │   └── std_test_param_xy.csv
│       └── train
└── ...

The .csv files are primarily for LaTeX plots.
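
The two output layouts shown in the folder trees can be summarized as a small path-building function. This is a sketch derived from those trees only; the function name output_target is hypothetical, and the value "aggregate" for FOLDER_NAME is assumed from the folder name in the trees.

```python
from pathlib import Path

FOLDER_NAME = "aggregate"  # assumed from the output folder name in the trees


def output_target(base, subpath, op_name, output):
    """Return where the aggregate of one operation would be written."""
    base = Path(base)
    root = base / FOLDER_NAME / subpath
    if output == "summary":
        # One folder per operation; the base folder name is kept innermost
        # so that runs stay distinguishable inside tensorboard.
        return root / op_name / base.name
    if output == "csv":
        # One file per operation, named <operation>_<base folder>.csv
        return root / "{}_{}.csv".format(op_name, base.name)
    raise ValueError("output must be 'summary' or 'csv'")


print(output_target("test_param_xy", "test", "max", "summary"))
print(output_target("test_param_xy", "test", "max", "csv"))
```

Note that Path(".").name is an empty string, so a caller starting from the current directory would need to pass a resolved path for the base folder name to appear.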

Limitations

The aggregation only works for scalars and not for other types like histograms
All runs for one aggregation need the exact same tags. Basically the naming and number of scalar metrics need to be equal for all runs.
All runs for one aggregation need the same steps. Basically the number of iterations, epochs and the saving frequency need to be equal for all runs of one scalar.
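
The tag and step requirements above can be checked before aggregating. The sketch below assumes the runs have already been loaded into a plain dictionary of the form {run_name: {tag: [steps]}}; that structure and the function name check_runs are hypothetical and not part of aggregator.py.

```python
def check_runs(runs):
    """Raise ValueError if runs differ in their tags or in the steps of a tag.

    runs: {run_name: {tag: [step, ...]}}
    """
    names = list(runs)
    if not names:
        raise ValueError("no runs given")
    ref_name = names[0]
    ref = runs[ref_name]
    for name in names[1:]:
        run = runs[name]
        # Limitation 1: every run needs exactly the same set of tags.
        if set(run) != set(ref):
            diff = sorted(set(run) ^ set(ref))
            raise ValueError("tags differ between {} and {}: {}".format(ref_name, name, diff))
        # Limitation 2: every tag needs the same steps in every run.
        for tag in ref:
            if list(run[tag]) != list(ref[tag]):
                raise ValueError("steps of '{}' differ between {} and {}".format(tag, ref_name, name))


runs = {
    "run_1": {"loss": [0, 100, 200], "accuracy": [0, 100, 200]},
    "run_2": {"loss": [0, 100, 200], "accuracy": [0, 100, 200]},
}
check_runs(runs)  # passes silently
```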

Contributions

If there are potential problems (bugs, incompatibilities with newer library versions or with an OS) or feature requests, please create a GitHub issue here.

Dependencies are managed using pip-tools. Just add new dependencies to requirements.in and generate a new requirements.txt using pip-compile in the command line.

License

MIT License

tensorboard-aggregator's Issues

What if there are no subfolders test/train?

Describe the bug
If there are no subfolders, the code doesn't work.
I get an error:
/path/test is not a valid path --> since there is no test folder.

No scalars found in event files

events.out.tfevents.1594678042.pi-dell.10209.12.v2.zip

Describe the bug
When running aggregator.py against our event files, we get the error:

  File "/home/patrick/src/tensorboard-aggregator/aggregator.py", line 155, in <module>
    aggregate(path, args.output, args.subpaths)
  File "/home/patrick/src/tensorboard-aggregator/aggregator.py", line 120, in aggregate
    extracts_per_subpath = {subpath: extract(dpath, subpath) for subpath in subpaths}
  File "/home/patrick/src/tensorboard-aggregator/aggregator.py", line 120, in <dictcomp>
    extracts_per_subpath = {subpath: extract(dpath, subpath) for subpath in subpaths}
  File "/home/patrick/src/tensorboard-aggregator/aggregator.py", line 36, in extract
    assert len(set(all_keys)) == 1, "All runs need to have the same scalar keys. There are mismatches in {}".format(all_keys)
AssertionError: All runs need to have the same scalar keys. There are mismatches in []

To Reproduce
Run aggregator.py against the attached event file.

Expected behavior
Expected summary files to be generated from scalars.

Screenshots
None.

Desktop (please complete the following information):

OS: Ubuntu 16.04
python version: 3.6
tensorflow version: 1.15
numpy version: 1.17.2

Additional context
This is the output we get when we run tensorboard --inspect on the same event file:

tensorboard --inspect --event_file events.out.tfevents.1594678042.pi-dell.10209.12.v2
======================================================================
Processing event files... (this can take a few minutes)
======================================================================

These tags are in events.out.tfevents.1594678042.pi-dell.10209.12.v2:
audio -
histograms -
images -
scalars -
tensor
   Metrics/AverageEpisodeLength
   Metrics/AverageReturn
   Metrics/average_distance_to_nearest_neighbor
   Metrics_vs_EnvironmentSteps/AverageEpisodeLength
   Metrics_vs_EnvironmentSteps/AverageReturn
   Metrics_vs_NumberOfEpisodes/AverageEpisodeLength
   Metrics_vs_NumberOfEpisodes/AverageReturn
======================================================================

Event statistics for events.out.tfevents.1594678042.pi-dell.10209.12.v2:
audio -
graph -
histograms -
images -
scalars -
sessionlog:checkpoint -
sessionlog:start -
sessionlog:stop -
tensor
   first_step           0
   last_step            100
   max_step             1100
   min_step             0
   num_steps            4
   outoforder_steps     [(1100, 85), (1100, 100)]
======================================================================

Extracting stored Hyperparameters

I think it would be really useful if this could be extended to extract hyperparameters that are logged in HParams objects

Urgent Help Aggregating

I am trying to use this aggregator, but I am getting the error

  File "aggregator.py", line 31, in extract
    assert len(set(all_keys)) == 1, "All runs need to have the same scalar keys. There are mismatches in {}".format(all_keys)
AssertionError: All runs need to have the same scalar keys. There are mismatches in []

my dir structure looks like this

ls ../results/
inference.run63++100  inference.run64  run61  run62  run63  run63++025  run63++050  run63++075  run63++100  run64  run64-inference  run64-nospeed  run64+vision  run65  run65-pure  run66  run67  run68  run68++050  run68++050-inference

I am trying to aggregate all my runs (which have many variables) into a single csv so I don't have to download each one manually and then aggregate it manually... picture example attached:

readme link to tensorboard-reducer for PyTorch?

I created a pip-installable package with similar functionality to this repo over at https://github.com/janosh/tensorboard-reducer.

It's targeted at PyTorch users and doesn't require a TensorFlow installation. How about we each add a readme link to the other's repo suggesting users of the other framework look there instead?

[Feature] Support variable step counts for a single scalar

I assume you made this work for specifically your project's file system, but it's a good enough idea that I'll probably make it work for mine.

Traceback (most recent call last):
  File "../tensorboard_aggregator/aggregator.py", line 117, in <module>
    aggregate(args.path, args.output, args.subpaths)
  File "../tensorboard_aggregator/aggregator.py", line 85, in aggregate
    aggregations_per_key = {key: op(values, axis=0) for key, values in values_per_key.items()}
  File "../tensorboard_aggregator/aggregator.py", line 85, in <dictcomp>
    aggregations_per_key = {key: op(values, axis=0) for key, values in values_per_key.items()}
  File "/home/vincent/.virtualenvs/py3env/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 3118, in mean
    out=out, **kwargs)
  File "/home/vincent/.virtualenvs/py3env/lib/python3.6/site-packages/numpy/core/_methods.py", line 87, in _mean
    ret = ret / rcount
TypeError: unsupported operand type(s) for /: 'list' and 'int'
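
The error above is likely what happens when runs have logged a different number of steps for the same scalar: the per-run value lists then have unequal lengths and cannot be reduced across runs. One possible way to support this, sketched below under that assumption, is to keep only the steps that every run has logged before aggregating. The function name align_on_common_steps and the {step: value} input format are hypothetical, not part of the tool.

```python
from statistics import mean


def align_on_common_steps(runs):
    """Keep only the steps present in every run.

    runs: list of {step: value} dicts for one scalar tag, one dict per run.
    Returns (steps, values) where values[i] holds run i's values at `steps`.
    """
    common = set(runs[0])
    for run in runs[1:]:
        common &= set(run)
    steps = sorted(common)
    values = [[run[step] for step in steps] for run in runs]
    return steps, values


runs = [
    {0: 1.0, 100: 0.8, 200: 0.6},
    {0: 1.2, 100: 0.6},  # this run stopped early
]
steps, values = align_on_common_steps(runs)

# Every row now has the same length, so a per-step reduction is well defined.
means = [mean(column) for column in zip(*values)]
print(steps, means)
```

Dropping steps is only one design choice; interpolating missing steps or aggregating over however many runs reached a step would be alternatives with different trade-offs.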

trying to use this with a TensorBoard result

Hello Spen,

Thanks for writing this tool. I'm trying to use it with a model. Here is my usage and my error.

(py3) Huo-Yang~/progs/GSD-ML/bc-eta/gsdeta_trained$ adoit.sh
Traceback (most recent call last):
  File "/Users/davis/progs/notmine/tensorboard-aggregator/aggregator.py", line 145, in <module>
    raise argparse.ArgumentTypeError("Parameter {} is not a valid path".format(subpath))
argparse.ArgumentTypeError: Parameter /Users/davis/progs/GSD-ML/bc-eta/gsdeta_trained/model.ckpt-99000.index/test is not a valid path

Here is my model trained directory contents

(py3) Huo-Yang~/progs/GSD-ML/bc-eta/gsdeta_trained$ ls
checkpoint                                    model.ckpt-100000.index                       model.ckpt-98500.data-00001-of-00002          model.ckpt-99500.data-00000-of-00002
eval                                          model.ckpt-100000.meta                        model.ckpt-98500.index                        model.ckpt-99500.data-00001-of-00002
events.out.tfevents.1562007676.Huo-Yang.local model.ckpt-98000.data-00000-of-00002          model.ckpt-98500.meta                         model.ckpt-99500.index
export                                        model.ckpt-98000.data-00001-of-00002          model.ckpt-99000.data-00000-of-00002          model.ckpt-99500.meta
graph.pbtxt                                   model.ckpt-98000.index                        model.ckpt-99000.data-00001-of-00002
model.ckpt-100000.data-00000-of-00002         model.ckpt-98000.meta                         model.ckpt-99000.index
model.ckpt-100000.data-00001-of-00002         model.ckpt-98500.data-00000-of-00002          model.ckpt-99000.meta

Here is my script, with which I've made multiple attempts

(py3) Huo-Yang~/progs/GSD-ML/bc-eta/gsdeta_trained$ cat /Users/davis/progs/notmine/tensorboard-aggregator/adoit.sh
#!/bin/bash
#python /Users/davis/progs/notmine/tensorboard-aggregator/aggregator.py --subpaths ['train','eval']
#python /Users/davis/progs/notmine/tensorboard-aggregator/aggregator.py --subpaths ['.']
python /Users/davis/progs/notmine/tensorboard-aggregator/aggregator.py

Here is how I started TensorBoard

tensorboard --logdir=/Users/davis/progs/GSD-ML/bc-eta/gsdeta_trained

What platform is this code tested on?

Describe the bug
I run into several bugs regarding the path variable.

The first attempt:

E:\Program\Anaconda3\envs\Py35\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "aggregator.py", line 142, in <module>
    subpaths = [path / dname / subpath for subpath in args.subpaths for dname in os.listdir(path) if dname != FOLDER_NAME]
  File "aggregator.py", line 142, in <listcomp>
    subpaths = [path / dname / subpath for subpath in args.subpaths for dname in os.listdir(path) if dname != FOLDER_NAME]
TypeError: listdir: illegal type for path parameter

This is because the os.listdir call shown in the traceback does not accept a Path object; the Path method Path.iterdir() should be used instead.

Anyway, I gave it a second try after fixing this and hit another bug:

Traceback (most recent call last):
  File "aggregator.py", line 146, in <module>
    if not os.path.exists(subpath):
  File "E:\Program\Anaconda3\envs\Py35\lib\genericpath.py", line 19, in exists
    os.stat(path)
TypeError: argument should be string, bytes or integer, not WindowsPath

The problem is still related to the wrong use of a Path object with the os library.
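
A sketch of how the run discovery could be written with pathlib methods only, so that no Path object is passed to an os function, follows. The function name find_subpaths is hypothetical and the value of FOLDER_NAME is assumed to be "aggregate"; this is not the code from aggregator.py. On Python 3.5 an alternative workaround is to wrap the path in str() before calling os functions.

```python
from pathlib import Path

FOLDER_NAME = "aggregate"  # assumed name of the output folder to skip


def find_subpaths(path, subpaths):
    """Return <path>/<run>/<subpath> for every run folder and subpath."""
    path = Path(path)
    found = []
    for subpath in subpaths:
        # Path.iterdir() replaces os.listdir(path)
        for run_dir in sorted(path.iterdir()):
            if not run_dir.is_dir() or run_dir.name == FOLDER_NAME:
                continue
            candidate = run_dir / subpath
            # Path.exists() replaces os.path.exists(candidate)
            if not candidate.exists():
                raise FileNotFoundError("{} is not a valid path".format(candidate))
            found.append(candidate)
    return found
```

Unlike the line in the traceback, this sketch also skips plain files in the base folder, which is a deliberate deviation: files next to the run folders are then not mistaken for runs.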

Desktop (please complete the following information):

OS: Windows 10.0
python version 3.5
tensorflow version
numpy version

Event in tf2

bug
Unable to import Event:
from tensorflow.core.util.event_pb2 import Event
Unresolved reference Event
Desktop :

OS: Windows 10
python version : 3.7
tensorflow version : 2.0
numpy version : 1.17.2

Sample figure for the documentation

Is your feature request related to a problem? Please describe.
I am not sure what this library achieves is exactly what I want.

Describe the solution you'd like
You could add a sample figure to the README file.

Tensorflow Migration Issues

Dear creators of this repository,

I wanted to use this tool for my master's thesis, but I experienced some migration issues related to Tensorflow.
The issues occurred in the function 'write_summary'. I only wanted to generate a tensorboard summary, so these issues might occur in other places of the code as well.

I used this code to fix the issues:

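import tensorflow as tf


def write_summary(dpath, aggregations_per_key):
    # TF2 summary writer for the output directory
    writer = tf.summary.create_file_writer(str(dpath))

    # Enter the writer context once instead of once per scalar
    with writer.as_default():
        for key, (steps, wall_times, aggregations) in aggregations_per_key.items():
            # wall_times is unused: tf.summary.scalar takes no wall-time argument
            for step, aggregation in zip(steps, aggregations):
                tf.summary.scalar(key, aggregation, step=step)
    # Flush once after all scalars are written
    writer.flush()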

I am definitely not experienced in Tensorflow. I did a quick internet search and updated your code.
I created this issue to make you aware of these migration issues, as well as to help other users with the same problems.

My versions:
Pandas: 1.2.4
Numpy: 1.19.5
Tensorflow: 2.4.1
Tensorboard: 2.5.0

Have a great day.
Cheers.