mosaicml / composer
Supercharge Your Model Training
Home Page: http://docs.mosaicml.com
License: Apache License 2.0
When running a baseline resnet50 model on imagenet, I encountered this error:
wandb: ERROR Error while calling W&B API: Error 1062: Duplicate entry '6394579-1' for key 'unique_artifact_collection_membership_version' (<Response [409]>)
Exception in thread Thread-7:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.8/dist-packages/wandb/filesync/step_upload.py", line 50, in _thread_body
self._handle_event(event)
File "/usr/local/lib/python3.8/dist-packages/wandb/filesync/step_upload.py", line 79, in _handle_event
self._maybe_commit_artifact(job.artifact_id)
File "/usr/local/lib/python3.8/dist-packages/wandb/filesync/step_upload.py", line 161, in _maybe_commit_artifact
self._api.commit_artifact(artifact_id)
File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/internal/internal_api.py", line 2235, in commit_artifact
response = self.gql(mutation, variable_values={"artifactID": artifact_id})
File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/lib/retry.py", line 102, in __call__
result = self._call_fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/internal/internal_api.py", line 147, in execute
six.reraise(*sys.exc_info())
File "/usr/local/lib/python3.8/dist-packages/six.py", line 719, in reraise
raise value
File "/usr/local/lib/python3.8/dist-packages/wandb/sdk/internal/internal_api.py", line 141, in execute
return self.client.execute(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 52, in execute
result = self._get_result(document, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/client.py", line 60, in _get_result
return self.transport.execute(document, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/wandb/vendor/gql-0.2.0/gql/transport/requests.py", line 39, in execute
request.raise_for_status()
File "/usr/local/lib/python3.8/dist-packages/requests/models.py", line 953, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://api.wandb.ai/graphql
I've asked the WandB folks and they think it's from an attempted upload of an artifact with the same ID as another. The recent addition of artifact uploading from run_directory seems to be causing this, so PR #89 will disable it by default, but we need to verify that artifact uploads are working as expected.
If the seed is not set in hparams, it is randomly selected in __init__. Each DDP process, when it starts up, gets a different random seed.
The seed from the rank 0 process is saved in checkpoints.
When resuming from a checkpoint, the seed from the rank 0 process is restored across all DDP processes.
This leads to inconsistent behavior, since the non-rank-0 processes now resume with a different seed than they first trained with.
To fix: add the seed to the RNG state, and sync it across all DDP processes.
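The proposed fix can be pictured with a minimal sketch: every rank adopts rank 0's seed, and the seed travels with the RNG state in the checkpoint. The names `choose_shared_seed`, `capture_rng_state`, `restore_rng_state`, and the `broadcast_from_rank_zero` callable are hypothetical and not part of Composer; in a real DDP job the callable would wrap a collective such as `torch.distributed.broadcast`.

```python
import random
from typing import Any, Callable, Dict, Optional


def choose_shared_seed(
    configured_seed: Optional[int],
    broadcast_from_rank_zero: Callable[[int], int],
) -> int:
    """Return the seed every rank should use.

    `broadcast_from_rank_zero` is a hypothetical stand-in for a collective
    that returns rank 0's value on every rank.
    """
    if configured_seed is None:
        # Without syncing, each process would keep its own random seed.
        local_seed = random.SystemRandom().randint(0, 2**31 - 1)
    else:
        local_seed = configured_seed
    return broadcast_from_rank_zero(local_seed)


def capture_rng_state(seed: int) -> Dict[str, Any]:
    """Bundle the seed with the RNG state so both land in the checkpoint."""
    return {"seed": seed, "python_rng": random.getstate()}


def restore_rng_state(state: Dict[str, Any]) -> int:
    """Restore the RNG state and return the seed that was saved with it."""
    random.setstate(state["python_rng"])
    return state["seed"]
```

Only Python's `random` module is captured here; a real checkpoint would presumably also need the NumPy and PyTorch generator states.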
mosaicml/research:latest docker container on 3080s.
Command:
python examples/run_mosaic_trainer.py -f composer/yamls/models/resnet50.yaml --loggers wandb --loggers.wandb.entity mosaic-ml --loggers.wandb.project landan-random --callbacks speed_monitor lr_monitor --callbacks.speed_monitor.window_size 100
I believe Cory saw hanging at the end of the CIFAR-10 benchmark as well, so that may be sufficient to reproduce the bug.
All (sub)processes should be killed at the end of training.
Training runs hang at the end of training. This means the processes will continue to run although training is complete.
The unet and gpt models currently fail on tests/test_load.py due to something about the mock model.
They likely need a mock model of the appropriate type.
Need to debug and fix these tests.
Supporting stage 3 is expected to be non-trivial, since we can no longer store a complete copy of the model on each node.
The gpt2 models in the code (38m, 85m, 114m) are different from what's in the docs (52m, 83m, 125m). Also, I'm pretty sure d_attn for GPT2-52m is incorrect in the table in the docs.
Add a semantic segmentation benchmark based on the Cityscapes dataset and the Deeplabv3 architecture.
Our current segmentation benchmark is based on the Multimodal Brain Tumor Segmentation Challenge (BraTS) and the Unet architecture. There are a couple of reasons why we may want to add another segmentation benchmark:
Cityscapes appears to be the second most common semantic segmentation benchmark (behind Pascal VOC), so evaluating methods on Cityscapes should be relevant to the community. Cityscapes image resolution is 1024 x 2048 and the training set contains 2,975 densely and 20,000 coarsely annotated images (not as many as we would like, but a start). Alternatively, we could use ADE20k or Pascal VOC segmentation if others feel strongly towards either dataset.
It would be easier to benchmark with Deeplabv3 since the hyperparameters and target performance on Cityscapes are known. As of now, we have no numbers on training time, so this will be unknown. For Unet, we would need to tune hyperparameters and would not be sure if we are achieving an expected performance.
Simple implementation outline, but should be made more detailed:
Cityscapes DataloaderSpec: will try to use torchvision.datasets.cityscapes if it fits our use case
Deeplabv3 BaseMosaicModel: will try to use torchvision.models.segmentation.deeplabv3_resnet101 if it fits our use case
Implement intersection-over-union (IoU) metric for evaluation: should be easyish?
Dataset and model throughput profiling
Dataset and model card
In order to space out calls to the wandb client, we should support the same frequency settings as the FileLogger.
selective_backprop needs to be the first algorithm used in the AFTER_DATALOADER event because it prunes data samples and we only want to run other data-modification algorithms on the pruned set of data samples.
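A minimal sketch of such an ordering rule, assuming algorithms and events can be identified by name. The function `order_for_event` and the string constants are hypothetical, chosen only for illustration; the other algorithm names are just example strings.

```python
from typing import List, Sequence

# Hypothetical identifiers, used only for this illustration.
AFTER_DATALOADER = "after_dataloader"
SELECTIVE_BACKPROP = "selective_backprop"


def order_for_event(event: str, algorithm_names: Sequence[str]) -> List[str]:
    """Return the order in which algorithms should run for `event`."""
    if event != AFTER_DATALOADER:
        return list(algorithm_names)
    # sorted() is stable and False sorts before True, so selective_backprop
    # moves to the front while every other algorithm keeps its relative order.
    return sorted(algorithm_names, key=lambda name: name != SELECTIVE_BACKPROP)
```

Because the sort is stable, the rule only pins the one algorithm that must run first and leaves the user's ordering otherwise untouched.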
The seed is stored in the State object in the Trainer, but it should instead be stored in the checkpoint_rng object.
Note that right now, if the user does not set a seed on trainer init, then a different seed is created on each process but only the rank 0 seed is saved in the checkpoint. We want to enforce each device using the same seed, which will be addressed by #12.
Brats does not support synthetic data. It would be great to add support for it.
For example: https://github.com/mosaicml/composer/tree/main/composer/models/resnet50
The links seem to point to themselves.
Add a --smoke-test flag or something similar.
I would like to be able to start a run that simply checks one step of training and one step of validation to ensure as well as possible that the training pipeline is working. This will make it easier when running many runs in parallel, where a small bug in the validation loop can waste a lot of time and compute resources.
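One possible shape for the flag, as a minimal sketch using `argparse`. The parser, the `step_limits` helper, and the idea of expressing the smoke test as per-loop step limits are assumptions made for illustration, not the existing Composer CLI.

```python
import argparse
from typing import Optional, Sequence, Tuple


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical trainer entry point")
    parser.add_argument(
        "--smoke-test",
        action="store_true",
        help="run one training step and one validation step, then exit",
    )
    return parser


def step_limits(argv: Sequence[str]) -> Tuple[Optional[int], Optional[int]]:
    """Return (max_train_steps, max_eval_steps); None means no limit."""
    args = build_parser().parse_args(argv)
    return (1, 1) if args.smoke_test else (None, None)
```

The training and validation loops would then stop after the returned number of steps, so a bug in the validation loop surfaces within seconds instead of after a full epoch.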
When testing, benchmarking, smoke testing, and profiling, it is helpful to be able to easily get synthetic data that can then be passed into the model.forward() function for any type of model. However, it is impossible to automatically read the input (tensor) shape off of the model graph, so we are currently manually specifying the input shape wherever we perform synthetic passes (e.g. in tests, when constructing the synthetic dataset, etc...)
Because different models have different input formats, it would be difficult to describe this via a static parameter such as input shape -- e.g. nlp models use an input dictionary. As such, generating a synthetic batch would be preferred.
get_synthetic_batch(batch_size) on each BaseMosaicModel:
import abc

class BaseMosaicModel(abc.ABC):
    @abc.abstractmethod
    def get_synthetic_batch(self, batch_size: int, synthetic_data_distribution: SyntheticDataDistributionEnum) -> Batch:
        # for ease of subclass implementation, a set of helper methods would be available
        pass
Then, anything that needs to perform a forward pass could do:
def my_profiling_script(model: BaseMosaicModel):
    batch = model.get_synthetic_batch(batch_size=10)  # returns a batch of 10 samples that the model can train on
    output = model(batch)
We could also generalize the synthetic dataset to do something like:
class SyntheticDataset:
    def __init__(self, model):
        self.model = model

    def __getitem__(self, i):
        return self.model.get_synthetic_batch(1)
Instead of storing how to generate synthetic batch information on each model, this could instead be stored in a common registry-like design. For example:
class SyntheticDatasetGenerator:
    def get_synthetic_dataset(self, model, *args, **kwargs):
        if isinstance(model, MNIST):
            return SyntheticDataset(*args, input_shape=(1, 28, 28), **kwargs)
        if isinstance(model, ResNet):
            return SyntheticDataset(*args, input_shape=(3, 224, 224), **kwargs)
This option would require generalization of the SyntheticDataset to support NLP data.
Add in the officially supported docker images to docker/README.md
Originally posted by @ravi-mosaicml in #66 (comment)
The Trainer class should have a predict() function as a convenience for a user who wants to run inference on a trained model.
With #65, the global rank is now known when the python process starts. Thus, for rank zero loggers, it is not necessary to wait until training start to initialize the logger. Instead, loggers should initialize on the INIT event, and process all logging calls immediately.
By convention, there will not be any calls to the loggers before the init event.
Add callbacks to upload the run directory to blob stores (s3, gcs)
Currently, the run directory is only saved locally (or, uploaded to WANDB, but we're running into issues with that). When a K8S pod dies, we lose the run directory. We store logs, checkpoints, traces, etc... in the run directory, so this should be persisted.
This can be implemented quite trivially via a callback. It would be best to delegate the directory monitoring and uploading to a subprocess (not a sub-thread), so as not to use GIL time in the main training loop. While network I/O happens outside the GIL, other work related to uploading (e.g. computing file hashes) does occur within the GIL, so it would be best to offload this. However, an initial implementation can use a background thread.
For cross-cloud compatibility, going to use apache libcloud.
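The monitoring half of such a callback could look roughly like the sketch below, which hashes files in the run directory and reports those that are new or changed since the last scan. `file_digest` and `files_to_upload` are hypothetical helpers; the actual upload call (for example through apache libcloud) is deliberately left out.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in chunks so large checkpoints are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_upload(run_directory: Path, uploaded: Dict[str, str]) -> List[Path]:
    """Return files that are new or changed since the last scan.

    `uploaded` maps relative path -> digest and is updated in place.
    """
    changed = []
    for path in sorted(p for p in run_directory.rglob("*") if p.is_file()):
        key = str(path.relative_to(run_directory))
        digest = file_digest(path)
        if uploaded.get(key) != digest:
            uploaded[key] = digest
            changed.append(path)
    return changed
```

Hashing is the part that holds the GIL, which is why running this scan in a subprocess rather than a thread matches the reasoning above.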
Efficient SAM (anonymous, 2022) is a proposed duo of SAM optimizations to reduce the throughput hit of SAM. The composer repo already supports an interval hyperparameter which has empirically been found to maintain much of the quality improvement of SAM while sacrificing little throughput, but it would be interesting to see if ESAM could enable setting lower values of interval.
Enable GitHub Actions for:
yapf, pyright
Add support for python 3.7 build
I wanted to play with composer but was not able to install via pip because google Colab runs in a python 3.7 environment
Anytime an error occurs while I am using multi-gpu training, the job crashes, but the error is not printed. I need to run the experiment with a single GPU to find what the error was.
Is there a way to fix this? It makes determining the issue very difficult.
I can try to create an example with the current release if needed.
Ordinarily, when training with gradient accumulation, we only need to do a DDP sync on the final microbatch, because synced gradients aren't needed until the optimizer runs at the end of the batch. However, the find_unused_parameters flag indicates that some algorithms (such as stochastic depth) may cause not all gradients to be generated. Critically, the set of unused parameters may vary between microbatches. Syncing on only the last microbatch may cause some parameters used in earlier microbatches but unused in the final microbatch to not be properly synced, resulting in severe quality degradations.
Our current solution to this issue is just to sync all microbatches when the find_unused_parameters flag is set, but this incurs a throughput penalty of about 5%, depending on the gradient accumulation setting. We would like to investigate whether it is possible to sync all parameters used in any microbatch, to avoid this throughput penalty.
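The current behavior described above boils down to a small decision rule, sketched here. `should_sync_gradients` is a hypothetical helper, not a Composer function; in PyTorch, the microbatches for which it returns False would typically run their backward pass inside `DistributedDataParallel.no_sync()`.

```python
def should_sync_gradients(
    microbatch_idx: int,
    num_microbatches: int,
    find_unused_parameters: bool,
) -> bool:
    """Decide whether this microbatch's backward pass should all-reduce gradients."""
    if find_unused_parameters:
        # The set of unused parameters may differ per microbatch, so every
        # microbatch is synced (the roughly 5% throughput cost noted above).
        return True
    # Otherwise synced gradients are only needed before the optimizer step.
    return microbatch_idx == num_microbatches - 1
```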
LM datasets do not support synthetic data. It would be great to add support for it.
This was renamed to seq_length_warmup but for some reason curriculum_learning.py still exists.
Methods such as AdaHessian or similar need support for this feature.
Instead of getting a ValueError (and stack trace) when running run_mosaic_trainer.py without arguments, it might be a bit more friendly to print out the help text from -h.
Giving CLI users a good first impression (rather than making them feel accused) is good practice, and we can print out the help text easily.
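A minimal sketch of that behavior with `argparse`, assuming the script builds its own parser. Only the `-f` option appears in the commands quoted in these issues; the `--file` long name, the `main` signature, and the placeholder message are assumptions for illustration.

```python
import argparse
import io
from typing import Sequence, TextIO


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run_mosaic_trainer.py")
    # `--file` is an assumed long name for the `-f` option.
    parser.add_argument("-f", "--file", help="path to a YAML hparams file")
    return parser


def main(argv: Sequence[str], out: TextIO) -> int:
    parser = build_parser()
    if len(argv) == 0:
        # No arguments: show the same text as -h instead of raising.
        parser.print_help(file=out)
        return 0
    args = parser.parse_args(argv)
    out.write(f"would train with {args.file}\n")  # placeholder for real work
    return 0
```

In a script, this would be called as `main(sys.argv[1:], sys.stdout)`.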
DeepSpeed currently crashes if you try using it to train RN50 with FP16 (FP32 works fine). The problem is that the model needs the input tensor to also be in FP16, but the dataloader does nothing to change the dtype of the batches it returns according to the current precision. This isn't a problem for NLP models because the dtypes of NLP batches are generally all integer types anyways, so those models already handle casting the batch types (or something like that, I'm a bit unclear on exactly what's happening).
My proposed fix is fairly hacky. I'd like to avoid having to add code to dataloaders, datasets, and models to handle FP16 precision settings. Instead, I'd like to have the trainer itself handle casting batches to FP16 as appropriate. The hacky part of this is that the trainer needs to be able to determine when this cast should be done, as for ImageNet, but not for NLP. There's no perfect way to do this. I'm going to try having it just cast any FP32 tensor it sees in loaded batches to FP16.
We have a number of unit tests (e.g. tests/trainer/test_trainer.py and tests/trainer/test_checkpoint.py) which use GPUs as a part of the test. However, these tests are not run as a part of the GitHub Actions tests, which leaves room for GPU-related bugs. We should have a system in place which runs GPU tests before code can be merged into dev.
There have been GPU-specific bugs in the past that were not caught because GPU tests do not run in our unit testing suite.
Can use CircleCI for this.
For post-hoc measurements on different datasets, we want to be able to load a checkpoint and run --eval_only.
Add an --eval_only flag that loads from a checkpoint and only runs eval. The user would need to specify a new dataset/dataloader that differs from the checkpointed hparams.
Add support for training on only a subset of the dataset on each epoch.
During testing and profiling, it can be important to skip over the first epoch (e.g. to ignore io bandwidth), but it is usually not needed to train over the entire dataset. Only a small subset is needed.
Add support for https://pytorch.org/docs/stable/_modules/torch/utils/data/sampler.html#SubsetRandomSampler.
It will be a bit more complicated to make a DDP version of this.
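One way the DDP version might work, as a sketch in plain Python: every rank draws the same random subset by seeding with the epoch, then takes a strided slice so the ranks see disjoint samples. `subset_indices_for_rank` and its parameters are hypothetical names, and this is not how `SubsetRandomSampler` itself behaves.

```python
import random
from typing import List


def subset_indices_for_rank(
    dataset_size: int,
    subset_size: int,
    epoch: int,
    rank: int,
    world_size: int,
    seed: int = 0,
) -> List[int]:
    """Indices this rank should train on for the given epoch."""
    if subset_size > dataset_size:
        raise ValueError("subset_size cannot exceed dataset_size")
    # The same (seed, epoch) on every rank yields the same subset everywhere.
    rng = random.Random(seed + epoch)
    subset = rng.sample(range(dataset_size), subset_size)
    # A strided slice gives each rank a disjoint share of that subset.
    return subset[rank::world_size]
```

If `subset_size` is not divisible by `world_size`, the shards differ in length by one sample, which a real sampler would need to pad or trim so all ranks run the same number of steps.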
Trainers do not clean up properly on a KeyboardInterrupt. The trainer should clean up the model, or possibly keep it in a state where it is partially trained but can still be evaluated, when .fit() is exited early. It should probably exit gracefully and clean up for interactive composer users.
To reproduce
from composer import algorithms, trainer, Trainer
from composer.core.types import Precision
hparams = trainer.load("classify_mnist_cpu") # loads from composer/yamls/models/classify_mnist_cpu.yaml
hparams.algorithms = algorithms.load_multiple("blurpool", "label_smoothing")
# edit other properties in the hparams object
hparams.precision = Precision.FP32
hparams.grad_accum = 2
hparams.set_datadir("~/datasets")
trainer = Trainer.create_from_hparams(hparams)
trainer.fit()
then CTRL-C, Keyboard Escape
then
trainer.fit()
Produces
>>> trainer.fit()
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
send_bytes(obj)
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
send_bytes(obj)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
self._send(header + buf)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
self._send(header + buf)
BrokenPipeError: [Errno 32] Broken pipe
File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
File "/home/avery/mosaic/composer/composer/trainer/trainer.py", line 356, in fit
BrokenPipeError: [Errno 32] Broken pipe
self._train_loop()
File "/home/avery/mosaic/composer/composer/trainer/trainer.py", line 488, in _train_loop
assert isinstance(original_model, BaseMosaicModel)
AssertionError
When resuming from a checkpoint, max_epochs currently defaults to the original max_epochs, which prevents users from being able to train for more than the original max_epochs when resuming from a checkpoint.
It would be good to be able to resume from checkpoint and train for more epochs than the original max_epochs. However, we need to come up with a scheme to make this work with scale_schedule_ratio because scale schedule ratios are computed assuming that max_epochs does not change.
How should we go about handling this?
When running tests, we validate that algorithms run on each model type. Some algorithms are not compatible with some models (e.g. NLP algs on image classification models), so we manually hard-code this in the tests. It would be helpful to have a first-class API to get which models support which algorithms, and which algorithms support which models.
The engine could also use this information to perform a static analysis to detect runtime issues before they arise.
One possible design could be to have a ModelType that would work like this:
class ModelType(StringEnum):
    CLASSIFICATION = "Classification"
    NLP = "Nlp"
    ...

class BaseMosaicModel:
    model_type: ModelType  # would be set on each model
    ...

class Algorithm:
    @classmethod
    def get_supported_model_types(cls):
        return list(ModelType)  # can be overridden on each algorithm
All non-core dependencies should be lazily loaded, so one can use the library without having to install composer[all]
This likely means that functions that depend on a non-core dependency should import that dependency inside the function, not at module-level.
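A minimal sketch of the function-level import pattern. `import_optional` and `log_to_wandb` are hypothetical helpers, and the `composer[wandb]` extra named in the usage is an assumption (only `composer[all]` is mentioned above).

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, extra: str) -> ModuleType:
    """Import a non-core dependency, pointing at the matching extra on failure."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"{module_name} is required for this feature. "
            f"Install it with: pip install 'composer[{extra}]'"
        ) from e


def log_to_wandb(metrics: dict) -> None:
    # The import happens only when the function is called, so importing
    # this module never requires wandb to be installed.
    wandb = import_optional("wandb", extra="wandb")  # extra name is assumed
    wandb.log(metrics)
```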
Right now model surgery does not work after the model parameters have been passed to an optimizer. As a result, we call the Event.INIT callback (which is used by model-modifying methods such as Blurpool and SqueezeExcite) in the Trainer __init__ before the optimizer is constructed, rather than in the training loop.
This yields API complications because the user cannot pass a pre-constructed optimizer into the Trainer __init__.
We need to get surgery working properly and test it on Blurpool and SqueezeExcite to make sure there are no regressions.
Our current trainer relaunches itself N times to create N processes for DDP. The problem with this is that DDP does so by rerunning the very script that launched the trainer in the first place. This is problematic for any user invoking DDP via a custom script, and also for testing.
The canonical solution to this problem is to provide a launch executable that wraps a user provided script to initialize a trainer. The launch executable runs the script N times to create N processes. This appears to be the direction that many ML frameworks, including DeepSpeed, are moving towards.
This will simplify testing and allow us to accurately calculate coverage metrics. This is also essentially a prerequisite to integrating the trainer with DeepSpeed, which also uses an executable.
Run regression tests on pytorch v1.10 (https://pytorch.org/blog/pytorch-1.10-released/)
Steps to reproduce the behavior:
Call .fit() twice on the same Trainer in a script or notebook (port usage for torch.distributed is not cleaned up).
Ideally, the Trainer by default won't use a static port in TCPStore and will instead select an open port to use for torch.distributed coordination.
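Selecting an open port is commonly done by binding to port 0 and letting the operating system choose, as in this sketch. `find_open_port` is a hypothetical helper; in a multi-process job, presumably only rank 0 would call it and share the result with the other ranks.

```python
import socket


def find_open_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 means "pick any free port"
        return s.getsockname()[1]
```

The port is released when the socket closes, so another process could in principle claim it before the store binds to it; a retry around store creation would cover that window.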
The trainer convergence test is flaky right now. This is likely due to the fact that we are using a CNN for the test which does significant dimensionality reduction and is thus hard to reason about in terms of linear separability of gaussian data. A fix would be to convert the test into training logistic regression.
To reproduce
Run the test many times on the same code (seems to fail once every ~50-100 times)
The test behavior should be consistent (i.e. if it passes once on some code then it should always pass on that code).
As of this update in pytorch: pytorch/pytorch#61044, we no longer need our implementation of soft_cross_entropy in composer/models/loss.py and should remove it in favor of the one in pytorch.
The trainer can automatically determine the appropriate grad_accum to use based off of hardware properties.
It is cumbersome to manually specify the grad accum for every hardware and model.
while True:
    try:
        train_model()
        break  # stop retrying once training succeeds
    except CudaOOM:
        state.grad_accum += 1
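The retry loop can be made concrete with a small runnable sketch. `SimulatedCudaOOM` is a hypothetical stand-in for whatever exception the framework raises on a CUDA out-of-memory error, and `fit_with_adaptive_grad_accum`, `train_batch`, and `max_grad_accum` are illustrative names rather than trainer APIs.

```python
from typing import Callable


class SimulatedCudaOOM(RuntimeError):
    """Hypothetical stand-in for a CUDA out-of-memory error."""


def fit_with_adaptive_grad_accum(
    train_batch: Callable[[int], None],
    grad_accum: int = 1,
    max_grad_accum: int = 64,
) -> int:
    """Retry `train_batch` with a larger grad_accum until it fits in memory."""
    while True:
        try:
            train_batch(grad_accum)
            return grad_accum  # the first setting that fit
        except SimulatedCudaOOM:
            if grad_accum >= max_grad_accum:
                raise  # give up instead of looping forever
            grad_accum += 1
```

A real implementation would probably also need to free cached GPU memory between attempts and keep all ranks in agreement on the chosen value.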
For fine-tuning tasks (e.g. GLUE) and also many vision experiments, we need to support multiple eval datasets. The metrics needed could be different across different datasets.
eval_dataloaders as a List
eval loop: run through multiple dataloaders and log the metrics for each dataset
metric function: return different metrics depending on the dataset
Add a helper method in Callback for run_event. This helper method would then call the correct method on the callback. It would look something like:
class Callback:
    def run_event(self, state: State, logger: Logger, event: Event):
        if event == Event.TRAINING_START:
            self.training_start(state, logger)
        if event == Event.BEFORE_FORWARD:
            self.before_forward(state, logger)
        ...
Then, the engine would do callback.run_event(state, logger, event).
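An alternative to the chain of `if` statements, sketched under the assumption that each event's value matches the name of its callback method, is to dispatch with `getattr`. The two-member `Event` enum and `RecordingCallback` here are cut-down illustrations, not the real Composer classes.

```python
from enum import Enum
from typing import Any, List


class Event(Enum):
    # Assumption: each value equals the name of the matching Callback method.
    TRAINING_START = "training_start"
    BEFORE_FORWARD = "before_forward"


class Callback:
    def run_event(self, state: Any, logger: Any, event: Event) -> None:
        """Call the method named after the event."""
        getattr(self, event.value)(state, logger)

    def training_start(self, state: Any, logger: Any) -> None:
        pass

    def before_forward(self, state: Any, logger: Any) -> None:
        pass


class RecordingCallback(Callback):
    """Example subclass that only cares about one event."""

    def __init__(self) -> None:
        self.seen: List[str] = []

    def before_forward(self, state: Any, logger: Any) -> None:
        self.seen.append("before_forward")
```

With this shape, adding a new event only requires a new enum member and a no-op method on the base class.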
This would help clean up code in the following places:
EventCounterCallback basically does this, via monkeypatching
To reproduce
Steps to reproduce the behavior:
Produces a traceback in DDP spawn (cpu only), where workers crash (still trains fine)
from composer.trainer import TrainerHparams, Trainer
hparams = TrainerHparams.create('composer/yamls/models/classify_mnist_cpu.yaml')
hparams.set_datadir("~/datasets")
trainer = Trainer.create_from_hparams(hparams)
trainer.fit()
/home/avery/mosaic/composer/composer/utils/ddp.py:20: UserWarning: DDPDefaultValueWarning: RANK env var not set and process group not initialized; returning 0 for global rank.
warnings.warn(f"DDPDefaultValueWarning: {env_var} env var not set"
Epoch 1: 3%|▎ | 1/29 [00:00<00:21, 1.32it/s] /home/avery/mosaic/composer/composer/utils/ddp.py:20: UserWarning: DDPDefaultValueWarning: WORLD_SIZE env var not set and process group not initialized; returning 1 for world size.
warnings.warn(f"DDPDefaultValueWarning: {env_var} env var not set"
Epoch 1: 3%|▎ | 1/29 [00:00<00:21, 1.32it/s, loss/train=2.3191] Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7ff33771e5e0>
Traceback (most recent call last):
File "/home/avery/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1328, in __del__
self._shutdown_workers()
File "/home/avery/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1320, in _shutdown_workers
if w.is_alive():
File "/usr/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7ff33771e5e0>
Traceback (most recent call last):
File "/home/avery/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1328, in __del__
self._shutdown_workers()
File "/home/avery/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1320, in _shutdown_workers
if w.is_alive():
File "/usr/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Add a callback to monitor memory statistics during training such as memory reserved by the caching allocator, number of malloc calls, number of free calls, etc...
Having memory allocator statistics during training available is very helpful for debugging issues such as OOM and memory leaks.
See: https://pytorch.org/docs/stable/generated/torch.cuda.memory_stats.html#torch.cuda.memory_stats for the API that gives this information.
Integration with DeepSpeed. The V0 use case is targeted only on data parallelism strategies like ZeRO.
Necessary to train GPT models above 1.3B parameters.