
wilds's Introduction



Overview

WILDS is a benchmark of in-the-wild distribution shifts spanning diverse data modalities and applications, from tumor identification to wildlife monitoring to poverty mapping.

The WILDS package contains:

  1. Data loaders that automatically handle data downloading, processing, and splitting, and
  2. Dataset evaluators that standardize model evaluation for each dataset.

In addition, the example scripts contain default models, optimizers, schedulers, and training/evaluation code. New algorithms can be easily added and run on all of the WILDS datasets.

For more information, please visit our website or read the main WILDS paper (1) and its follow-up integrating unlabeled data (2). For questions and feedback, please post on the discussion board.

Installation

We recommend using pip to install WILDS:

pip install wilds

If you have already installed it, please check that you have the latest version:

python -c "import wilds; print(wilds.__version__)"
# This should print "2.0.0". If it doesn't, update by running:
pip install -U wilds

If you plan to edit or contribute to WILDS, you should install from source:

git clone git@github.com:p-lambda/wilds.git
cd wilds
pip install -e .

In examples/, we provide a set of scripts that can be used to train models on the WILDS datasets. These scripts were also used to benchmark baselines in our papers (1, 2). These scripts are not part of the installed WILDS package. To use them, you should install from source, as described above.

Requirements

The WILDS package depends on the following requirements:

  • numpy>=1.19.1
  • ogb>=1.2.6
  • outdated>=0.2.0
  • pandas>=1.1.0
  • pillow>=7.2.0
  • pytz>=2020.4
  • torch>=1.7.0
  • torch-scatter>=2.0.5
  • torch-geometric>=2.0.1
  • torchvision>=0.8.2
  • tqdm>=4.53.0
  • scikit-learn>=0.20.0
  • scipy>=1.5.4

Running pip install wilds or pip install -e . will automatically check for and install all of these requirements except for the torch-scatter and torch-geometric packages, which require a quick manual install.
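As a rough sketch (not an official WILDS instruction), the manual install typically follows the PyTorch Geometric installation guide and looks something like the following, where the wheel index URL must be adjusted to match your installed torch and CUDA versions:

# Example only; replace 1.10.0 and cu113 with your own torch and CUDA versions
pip install torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
pip install torch-geometric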

Example script requirements

To run the example scripts, you will also need to install a few additional dependencies beyond the core package; for example, the text datasets rely on the transformers library, and optional experiment logging uses wandb.

All baseline experiments in the paper were run on Python 3.8.5 and CUDA 10.1.

Datasets

WILDS currently includes 10 datasets, which we've briefly listed below. For full dataset descriptions, please see our papers (1, 2).

| Dataset | Modality | Labeled splits | Unlabeled splits |
|---|---|---|---|
| iwildcam | Image | train, val, test, id_val, id_test | extra_unlabeled |
| camelyon17 | Image | train, val, test, id_val | train_unlabeled, val_unlabeled, test_unlabeled |
| rxrx1 | Image | train, val, test, id_test | - |
| ogb-molpcba | Graph | train, val, test | train_unlabeled, val_unlabeled, test_unlabeled |
| globalwheat | Image | train, val, test, id_val, id_test | train_unlabeled, val_unlabeled, test_unlabeled, extra_unlabeled |
| civilcomments | Text | train, val, test | extra_unlabeled |
| fmow | Image | train, val, test, id_val, id_test | train_unlabeled, val_unlabeled, test_unlabeled |
| poverty | Image | train, val, test, id_val, id_test | train_unlabeled, val_unlabeled, test_unlabeled |
| amazon | Text | train, val, test, id_val, id_test | val_unlabeled, test_unlabeled, extra_unlabeled |
| py150 | Text | train, val, test, id_val, id_test | - |

Using the WILDS package

Data

The WILDS package provides a simple, standardized interface for all datasets in the benchmark. The short Python snippet below covers all of the steps needed to get started with a WILDS dataset, including dataset download and initialization, accessing the various splits, and preparing a user-customizable data loader. We discuss data loading in more detail in the Data loading section below.

from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader
import torchvision.transforms as transforms

# Load the full dataset, and download it if necessary
dataset = get_dataset(dataset="iwildcam", download=True)

# Get the training set
train_data = dataset.get_subset(
    "train",
    transform=transforms.Compose(
        [transforms.Resize((448, 448)), transforms.ToTensor()]
    ),
)

# Prepare the standard data loader
train_loader = get_train_loader("standard", train_data, batch_size=16)

# (Optional) Load unlabeled data
dataset = get_dataset(dataset="iwildcam", download=True, unlabeled=True)
unlabeled_data = dataset.get_subset(
    "test_unlabeled",
    transform=transforms.Compose(
        [transforms.Resize((448, 448)), transforms.ToTensor()]
    ),
)
unlabeled_loader = get_train_loader("standard", unlabeled_data, batch_size=16)

# Train loop
for labeled_batch, unlabeled_batch in zip(train_loader, unlabeled_loader):
    x, y, metadata = labeled_batch
    unlabeled_x, unlabeled_metadata = unlabeled_batch
    ...

The metadata contains information such as the domain identity (e.g., which camera a photo was taken from, or which hospital the patient's data came from), along with other dataset-specific fields.
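As a minimal sketch (the exact fields vary by dataset; metadata_fields is the attribute WILDS datasets use to name the metadata columns), the metadata tensor can be inspected like this:

# Each column of the metadata tensor corresponds to one entry in dataset.metadata_fields
# (for iWildCam, this includes the 'location' field used as the domain)
print(dataset.metadata_fields)

# Slice out the domain column from a batch of metadata
location_idx = dataset.metadata_fields.index("location")
for x, y, metadata in train_loader:
    locations = metadata[:, location_idx]
    ...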

Domain information

To allow algorithms to leverage domain annotations, as well as other groupings over the available metadata, the WILDS package provides Grouper objects. These helper objects extract group annotations from the metadata, allowing users to specify the grouping scheme in a flexible fashion. They are used to initialize group-aware data loaders (as discussed in the Data loading section below) and to implement algorithms that rely on domain annotations (e.g., Group DRO). In the following code snippet, we initialize and use a Grouper that extracts the domain annotations on the iWildCam dataset, where the domain is the camera location.

from wilds.common.grouper import CombinatorialGrouper

# Initialize grouper, which extracts domain information
# In this example, we form domains based on location
grouper = CombinatorialGrouper(dataset, ['location'])

# Train loop
for x, y_true, metadata in train_loader:
    z = grouper.metadata_to_group(metadata)
    ...

Data loading

For training, the WILDS package provides two types of data loaders. The standard data loader shuffles examples in the training set, and is used for the standard approach of empirical risk minimization (ERM), where we minimize the average loss.

from wilds.common.data_loaders import get_train_loader

# Prepare the standard data loader
train_loader = get_train_loader('standard', train_data, batch_size=16)

To support other algorithms that rely on specific data loading schemes, we also provide the group data loader. In each minibatch, the group loader first samples a specified number of groups, and then samples a fixed number of examples from each of those groups. (By default, the groups are sampled uniformly at random, which upweights minority groups as a result. This can be toggled with the uniform_over_groups parameter.) We initialize group loaders as follows, using a Grouper that specifies the grouping scheme.

# Prepare a group data loader that samples from user-specified groups
train_loader = get_train_loader(
    "group", train_data, grouper=grouper, n_groups_per_batch=2, batch_size=16
)

Lastly, we also provide a data loader for evaluation, which loads examples without shuffling (unlike the training loaders).

from wilds.common.data_loaders import get_eval_loader

# Get the test set
test_data = dataset.get_subset(
    "test",
    transform=transforms.Compose(
        [transforms.Resize((224, 224)), transforms.ToTensor()]
    ),
)

# Prepare the evaluation data loader
test_loader = get_eval_loader("standard", test_data, batch_size=16)

Evaluators

The WILDS package standardizes and automates evaluation for each dataset. Invoking the eval method of each dataset yields all metrics reported in the paper and on the leaderboard.

import torch
from wilds.common.data_loaders import get_eval_loader

# Get the test set
test_data = dataset.get_subset(
    "test",
    transform=transforms.Compose(
        [transforms.Resize((224, 224)), transforms.ToTensor()]
    ),
)

# Prepare the data loader
test_loader = get_eval_loader("standard", test_data, batch_size=16)

# Get predictions for the full test set
all_y_pred, all_y_true, all_metadata = [], [], []
with torch.no_grad():
    for x, y_true, metadata in test_loader:
        y_pred = model(x)
        # Accumulate y_true, y_pred, metadata
        all_y_pred.append(y_pred)
        all_y_true.append(y_true)
        all_metadata.append(metadata)

# Evaluate
# Note: the expected format of the predictions (e.g., labels vs. logits)
# is documented in each dataset's eval docstring.
dataset.eval(torch.cat(all_y_pred), torch.cat(all_y_true), torch.cat(all_metadata))
# {'recall_macro_all': 0.66, ...}

Most eval methods expect predicted labels for all_y_pred, but the exact expected inputs vary across datasets and are documented in the eval docstrings of the corresponding dataset classes.

Using the example scripts

In examples/, we provide a set of scripts that can be used to train models on the WILDS datasets.

python examples/run_expt.py --dataset iwildcam --algorithm ERM --root_dir data
python examples/run_expt.py --dataset civilcomments --algorithm groupDRO --root_dir data
python examples/run_expt.py --dataset fmow --algorithm DANN --unlabeled_split test_unlabeled --root_dir data

The scripts are configured to use the default models and reasonable hyperparameters. For exact hyperparameter settings used in our papers, please see our CodaLab executable paper.

Downloading and training on the WILDS datasets

The first time you run these scripts, you might need to download the datasets. You can do so with the --download argument, for example:

# downloads (labeled) dataset
python examples/run_expt.py --dataset globalwheat --algorithm groupDRO --root_dir data --download

# additionally downloads all unlabeled data
python examples/run_expt.py --dataset globalwheat --algorithm groupDRO --root_dir data --download  --unlabeled_split [...]

Note that downloading the large amount of unlabeled data is optional; unlabeled data will only be downloaded if some --unlabeled_split is set. (It does not matter which --unlabeled_split is set; all unlabeled data will be downloaded together.)

Alternatively, you can use the standalone wilds/download_datasets.py script to download the datasets, for example:

# downloads (labeled) data
python wilds/download_datasets.py --root_dir data

# downloads (unlabeled) data
python wilds/download_datasets.py --root_dir data --unlabeled

This will download all datasets to the specified data folder. You can also use the --datasets argument to download particular datasets.
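For instance, assuming --datasets accepts one or more dataset names (the same names listed in the tables below), a command along these lines restricts the download:

# downloads only the camelyon17 and civilcomments datasets (sketch)
python wilds/download_datasets.py --root_dir data --datasets camelyon17 civilcomments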

The table below lists the size of each dataset, as well as the approximate time needed to train and evaluate the default model for a single ERM run on an NVIDIA V100 GPU.

| Dataset command | Modality | Download size (GB) | Size on disk (GB) | Train+eval time (hours) |
|---|---|---|---|---|
| iwildcam | Image | 11 | 25 | 7 |
| camelyon17 | Image | 10 | 15 | 2 |
| rxrx1 | Image | 7 | 7 | 11 |
| ogb-molpcba | Graph | 0.04 | 2 | 15 |
| globalwheat | Image | 10 | 10 | 2 |
| civilcomments | Text | 0.1 | 0.3 | 4.5 |
| fmow | Image | 50 | 55 | 6 |
| poverty | Image | 12 | 14 | 5 |
| amazon | Text | 7 | 7 | 5 |
| py150 | Text | 0.1 | 0.8 | 9.5 |

The following are the sizes of the unlabeled data bundles:

| Dataset command | Modality | Download size (GB) | Size on disk (GB) |
|---|---|---|---|
| iwildcam | Image | 41 | 41 |
| camelyon17 | Image | 69.4 | 96 |
| ogb-molpcba | Graph | 1.2 | 21 |
| globalwheat | Image | 103 | 108 |
| civilcomments | Text | 0.3 | 0.6 |
| fmow* | Image | 50 | 55 |
| poverty | Image | 172 | 184 |
| amazon* | Text | 7 | 7 |

* These unlabeled datasets are downloaded simultaneously with the labeled data and do not need to be downloaded separately.

While the camelyon17 dataset is small and fast to train on, we advise against using it as the only dataset to prototype methods on, as the test performance of models trained on this dataset tends to exhibit a large degree of variability over random seeds.

The image datasets (iwildcam, camelyon17, rxrx1, globalwheat, fmow, and poverty) tend to have high disk I/O usage. If training time is much slower for you than the approximate times listed above, consider checking if I/O is a bottleneck (e.g., by moving to a local disk if you are using a network drive, or by increasing the number of data loader workers). To speed up training, you could also disable evaluation at each epoch or for all splits by toggling --evaluate_all_splits and related arguments.
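As a minimal sketch of the data loader suggestion, and assuming that extra keyword arguments to the WILDS loaders are forwarded to the underlying torch DataLoader (num_workers, pin_memory, etc.), increasing the number of workers looks like this:

from wilds.common.data_loaders import get_train_loader

# Extra keyword arguments are passed through to torch.utils.data.DataLoader,
# so more workers and pinned memory can help when disk I/O is the bottleneck.
train_loader = get_train_loader(
    "standard", train_data, batch_size=16, num_workers=8, pin_memory=True
)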

Algorithms

In the examples/algorithms folder, we provide implementations of the adaptation algorithms benchmarked in our papers (1, 2). All algorithms train on labeled data from a WILDS dataset's train split. Some algorithms are designed to also leverage unlabeled data. To load unlabeled data, specify an --unlabeled_split when running.

In addition to shared hyperparameters such as lr, weight_decay, batch_size, and unlabeled_batch_size, the scripts also take command-line arguments for algorithm-specific hyperparameters, listed in the table below (see the example command after the table).

| Algorithm command | Hyperparameters | Notes | See WILDS paper |
|---|---|---|---|
| ERM | - | Only uses labeled data | (1, 2) |
| groupDRO | group_dro_step_size | Only uses labeled data | (1) |
| deepCORAL | coral_penalty_weight | Can optionally use unlabeled data | (1, 2) |
| IRM | irm_lambda, irm_penalty_anneal_iters | Only uses labeled data | (1) |
| DANN | dann_penalty_weight, dann_classifier_lr, dann_featurizer_lr, dann_discriminator_lr | Can use unlabeled data | (2) |
| AFN | afn_penalty_weight, safn_delta_r, hafn_r | Designed to use unlabeled data | (2) |
| FixMatch | self_training_lambda, self_training_threshold | Designed to use unlabeled data | (2) |
| PseudoLabel | self_training_lambda, self_training_threshold, pseudolabel_T2 | Designed to use unlabeled data | (2) |
| NoisyStudent | soft_pseudolabels, noisystudent_dropout_rate | Designed to use unlabeled data | (2) |
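For example, algorithm-specific hyperparameters can be passed directly on the command line alongside the shared ones. The command below is a sketch only: the flag names mirror the hyperparameter names in the table above and the shared hyperparameters listed earlier, and the values are illustrative rather than recommended settings.

python examples/run_expt.py --dataset camelyon17 --algorithm deepCORAL --root_dir data \
    --lr 1e-4 --weight_decay 0.01 --batch_size 32 --coral_penalty_weight 0.1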

The repository is set up to facilitate general-purpose algorithm development: new algorithms can be added to examples/algorithms and then run on all of the WILDS datasets using the default models.

Evaluating trained models

We also provide an evaluation script that aggregates prediction CSV files for different replicates and reports on their combined evaluation. To use this, run:

python examples/evaluate.py <predictions_dir> <output_dir> --root_dir <root_dir>

where <predictions_dir> is the path to your predictions directory, <output_dir> is where the results JSON will be written, and <root_dir> is the dataset root directory. The predictions directory should have a subdirectory for each dataset (e.g., iwildcam) containing the prediction CSV files to evaluate; see our submission guidelines for the format. The evaluation script will skip over any dataset that has missing prediction files. Any dataset not in <root_dir> will be downloaded to <root_dir>.
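To make the expected layout concrete, here is a hypothetical sketch based on the {dataset}_split:{split}_seed:{seed}_epoch:{epoch}.csv naming pattern used by the example scripts; the submission guidelines remain the authoritative reference for the exact file names.

predictions_dir/
    camelyon17/
        camelyon17_split:test_seed:0_epoch:<N>.csv
        camelyon17_split:val_seed:0_epoch:<N>.csv
    iwildcam/
        iwildcam_split:test_seed:0_epoch:<N>.csv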

Reproducibility

We have an executable version of our paper on CodaLab that contains the exact commands, code, and data for the experiments reported in our paper, which rely on these scripts. Trained model weights for all datasets can also be found there. All configurations and hyperparameters can also be found in the examples/configs folder of this repo, and dataset-specific parameters are in examples/configs/datasets.py.

Leaderboard

If you are developing new training algorithms and/or models on WILDS, please consider submitting them to our public leaderboard.

Citing WILDS (Bibtex)

If you use WILDS datasets in your work, please cite our paper:

  1. WILDS: A Benchmark of in-the-Wild Distribution Shifts. Pang Wei Koh*, Shiori Sagawa*, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. ICML 2021.

If you use unlabeled data from the WILDS datasets, please also cite:

  2. Extending the WILDS Benchmark for Unsupervised Adaptation. Shiori Sagawa*, Pang Wei Koh*, Tony Lee*, Irena Gao*, Sang Michael Xie, Kendrick Shen, Ananya Kumar, Weihua Hu, Michihiro Yasunaga, Henrik Marklund, Sara Beery, Etienne David, Ian Stavness, Wei Guo, Jure Leskovec, Kate Saenko, Tatsunori Hashimoto, Sergey Levine, Chelsea Finn, and Percy Liang. ICLR 2022.

In addition, please cite the original papers that introduced the datasets, as listed on the datasets page.

Acknowledgements

The design of the WILDS benchmark was inspired by the Open Graph Benchmark, and we are grateful to the Open Graph Benchmark team for their advice and help in setting up WILDS.

wilds's People

Contributors

b-akshay, bearnshaw, etiennedavid, henrikmarklund, keawang, kohpangwei, michiyasunaga, rlphilli, ssagawa, teetone, weihua916


wilds's Issues

What are the random seeds used in the paper

Could you please share the random seeds used for the experiments in the paper? I think using the same set of random seeds for our own experiments would offer a fairer comparison of results.

Obtaining (full) model predictions for trained models

Hello and thank you for all your work on this important project!

I am wondering if it's possible to share the full predictions of the trained models on the val/test data.

If I understand correctly, this is similar to the output files
{dataset}_split:{split}_seed:{seed}_epoch:{epoch}.csv (e.g. this file for Rxrx1), but where the information in the csv is not only the argmax class prediction, but the entire logits vectors (e.g. in this case a vector in R^1139).

I think this would be useful as it will allow people (myself included) to evaluate trained models using a variety of custom metrics, but without actually downloading the data and doing the evaluation (which could be prohibitive for some of the larger datasets).

Thanks!
Gal

run_expt.py: --device argument doesn't set the device

Hey, I'm running Wilds on a p2.8xlarge AWS EC2 instance with 8 K80 GPUs. I noticed that when I try to run run_expt.py and use the --device argument to divide the jobs I'm trying to run between the GPUs, they all end up running on GPU 0. I verified this by the memory usage in nvidia-smi as well as printing the device used by torch using torch.cuda.current_device(). My guess is that the CUDA_VISIBLE_DEVICES environment variable, set here, is set too late and PyTorch just defaults to device 0.

I've worked around this by setting the CUDA_VISIBLE_DEVICES variable manually, before running the script. I just thought I'd let you know I encountered this issue.
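For example, something along these lines (illustrative command; any of the usual run_expt.py arguments can follow):

CUDA_VISIBLE_DEVICES=3 python examples/run_expt.py --dataset camelyon17 --algorithm ERM --root_dir data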

Really appreciate the project by the way! Being able to access multiple datasets for domain generalization with the same interface is really useful, and I managed to use run_expt pretty easily to run my own experiments.

camelyon17 split scheme: in-dist

I am not able to run Camelyon17 with --split_scheme in-dist (I'm assuming this corresponds to the setting with ID val data).

Any pointers on how to run this, or in general how to run camelyon with the ID val data?

Thank you for the help!

Calculation of OOD within the paper

Hello everyone,

First of all, I would like to thank you for making the paper and code for "WILDS: A Benchmark of in-the-Wild Distribution Shifts" publicly available. What caught my interest when reading the paper was the estimation of in-distribution (ID) and out-of-distribution (OOD) performance, which was evaluated using empirical risk minimization (Table 1, page 20). My question is how the ID and OOD numbers were calculated. Did you use the softmax with temperature scaling according to the paper "Enhancing the reliability of out of distribution image detection in neural networks"? If not, can you give a reference for the way you tackled this problem?

Thank you in advance for your kind reply.

algorithm.eval() vs. algorithm.model.eval()

Hi,

Really nice job with this repo! I had a small comment on the use of algorithm.eval() vs. algorithm.model.eval() in the wilds/examples/train.py file that might be useful to others.

I wasn't able to find this in the code, but how does algorithm.eval() differ from algorithm.model.eval()?

I ask because algorithm.model.eval() preserves the grad_fn attribute on the model output, while algorithm.eval() does not. This was unexpected behavior since pytorch's .eval() function doesn't do this. This is important for my use case, since I'm trying to evaluate the gradients when the model is in eval mode. If this does not break behavior elsewhere, I'd suggest switching to algorithm.model.eval().

Happy to explain more if this was confusing!

Model loaded from a .pth predicts only zeros

Hello !

I downloaded for the Camelyon17 dataset your trained model from CodaLab (ERM and seed0). I have installed all packages correctly according to your readme and load the model as follows:

path = "/best_model.pth"
state = torch.load(path)['algorithm']

state_dict = {}
 
for key in list(state.keys()):
    state_dict[key.replace('model.', '')] = state[key]

model.load_state_dict(state_dict)

model.eval()

I initialize the dataset I use for testing the model as follows:

import datasets_load  # from wilds package
dataset = datasets_load.Dataset('camelyon17', 32, '/data', 0.75, False)

For the prediction I used the following piece of code:

from wilds.common.data_loaders import get_eval_loader

test_data = dataset.test_set
test_loader = get_eval_loader('standard', test_data, batch_size=32)

with torch.no_grad():
    for x, y_true, metadata in test_loader:
          y_pred = model(x)
          labels = y_true
          _, predicted = torch.max(y_pred, 1)
          # print statements to check the output
          print("Labels: ", labels)
          print("Predicted: ", predicted)
          print("Correct: ", (predicted == labels).sum().item())

So far so good. When I run the code, the labels are printed (they always consist of 1s at the beginning, because shuffle=False), along with the predictions, which always consist of 0 values.

Labels:  tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
Predicted:  tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
Correct:  0

I would appreciate any advice or assistance. Many thanks in advance.
Tim

Oracle results for UDA tasks

Hi,

Could you please share Oracle, i.e. training on labeled target domain, results for the Camelyon17 and iWildCAM datasets in "Extending the WILDS benchmark for unsupervised adaptation"? Oracle results for the other datasets would be appreciated too.

If Oracle results are not available, could you please share the commands that can be used to obtain them?

Thanks.

Error loading the ogb-molpcba dataset

In ogbmolpcba_dataset.py line 96 (and similarly line 98), a PyGCollater object is initialized without passing a required positional argument (dataset), which raises an error when calling the get_dataset function for the molecule dataset. I think this can be fixed by replacing it with
self._collate = PyGCollater(self.ogb_dataset, follow_batch=[], exclude_keys=[]). Or is this how it is supposed to be and I am missing something?

Downloading FMoW dataset

Hello author,
Thanks for the release of the code of your paper. I love your work.

The download gets stuck in the middle of the process when calling wilds.get_dataset to download the FMoW dataset.
Could you check this one?

Thank you in advance.

releasing smaller subsets of the datasets

Hi, thanks for releasing the benchmark and the datasets. I was wondering if it would be possible to release smaller subsets of the datasets (e.g., similar in size to CIFAR, MNIST, etc.) to allow for rapid prototyping? As it stands, it currently takes more than 2 days just to download one dataset, which can also occupy the majority of disk space. This alone could prohibit people from trying out and exploring the datasets.

Also, it would be nice if you could list on the GitHub info page the download size of each dataset and its actual size on disk.

Thanks!

How do I access data from only one group?

Hello, Thanks for the fantastic library!

I have two questions:

  1. Is there any way I can get a per-group dataloader in wilds? This will help with, for instance, training a separate model for each group of data.
  2. Can I change the split of data for each dataset? My application requires 50% of the data for each group/domain for testing.

Thanks!

Replicating CivilComments results (with standard deviation) using the Group DRO (label) algorithm

Dear Team,

I am trying to replicate the leaderboard results for the CivilComments dataset using Group DRO grouped by label (i.e., group by = 'Y').
The test average accuracy is listed as 90.2 (0.3) and the validation average accuracy as 90.4 (0.4). I do not understand how these values were obtained: for each seed, is the maximum average accuracy over the 5 epochs used, or only the average accuracy from the last (5th) epoch?

I tried averaging the 5th-epoch average accuracy over all 5 seeds (using the values from the test_eval.csv files of the 5 seeds published in the notebook), but I did not reach 90.2 for the test average accuracy.

Could you please help me with how to replicate the results including the standard deviation?

`assert` error in new wilds version with FMoW

Hello, I am using the new version of WILDS and getting the error:

... wilds/common/utils.py" line 86, in avg_over_groups
    assert v.numel()==g.numel()

any ideas? It may be a bug on my end and if I catch it I'll update here.

Map for adding cross validation training and evaluation

Hello and thank you for this amazing package.

Instead of using replicates, I would be interested in adding a cross validation training and evaluation scheme based on the domain metadata.

Say a dataset has domain: A,B,C. I would like to:

  • train on 70% of the data sampled from A,B; evaluate in-distribution on the remaining 30% from A,B and out-of-distribution on C.
  • train on 70% of the data sampled from B,C; evaluate in-distribution on the remaining 30% from B,C and out-of-distribution on A.
  • train on 70% of the data sampled from C,A; evaluate in-distribution on the remaining 30% from C,A and out-of-distribution on B.

Finally, average the in-distribution and out-of-distribution metrics to obtain the final performance.

Here the 70-30 split is arbitrary and should be modifiable.

I am just starting to explore the package, having only replicated the ERM result on the camelyon17 dataset.

It seems that the Grouper object might be a good starting point for implementing this procedure, but I am still lacking a high-level overview of the code. How would you do this?

Cannot fetch 'ogb-molpcba' dataset due to missing arg

dataset = get_dataset(dataset='ogb-molpcba', download=True, root_dir='../data/')

Results in the following error:

--------------------------------------------------------------------
TypeError                          Traceback (most recent call last)
<ipython-input-2-c369817b9157> in <module>
----> 1 dataset = get_dataset(dataset='ogb-molpcba', download=True, root_dir='../data/')

~/anaconda3/envs/benchmark/lib/python3.7/site-packages/wilds/get_dataset.py in get_dataset(dataset, version, **dataset_kwargs)
     51     elif dataset == 'ogb-molpcba':
     52         from wilds.datasets.ogbmolpcba_dataset import OGBPCBADataset
---> 53         return OGBPCBADataset(version=version, **dataset_kwargs)
     54 
     55     elif dataset == 'poverty':

~/anaconda3/envs/benchmark/lib/python3.7/site-packages/wilds/datasets/ogbmolpcba_dataset.py in __init__(self, version, root_dir, download, split_scheme)
     88             download_url('https://snap.stanford.edu/ogb/data/misc/ogbg_molpcba/scaffold_group.npy', os.path.join(self.ogb_dataset.root, 'raw'))
     89         self._metadata_array = torch.from_numpy(np.load(metadata_file_path)).reshape(-1,1).long()
---> 90         self._collate = PyGCollater(follow_batch=[])
     91 
     92         self._metric = Evaluator('ogbg-molpcba')

TypeError: __init__() missing 1 required positional argument: 'exclude_keys'

Versions:

wilds 1.1.0
torch_geometric 1.7.0

[Question] Easily accessible pre-trained models

Hi, is there any way to easily access pretrained models for quick evaluation?

For instance something like the following,

| Algorithm | Model    | Parameters |
|-----------|----------|------------|
| ERM       | ResNet50 | Weights50  |
| ...       | ...      | ...        |

Fail to download ogb-molpcba dataset caused by the version of torch_geometric.

I ran python wilds/wilds/download_datasets.py --root_dir data --datasets ogb-molpcba.
And got an error message like this.

Traceback (most recent call last):
  File "wilds/wilds/download_datasets.py", line 34, in <module>
    main()
  File "wilds/wilds/download_datasets.py", line 27, in main
    wilds.get_dataset(
  File "..../wilds/get_dataset.py", line 52, in get_dataset
    from wilds.datasets.ogbmolpcba_dataset import OGBPCBADataset
  File "..../wilds/datasets/ogbmolpcba_dataset.py", line 7, in <module>
    from torch_geometric.data.dataloader import Collater as PyGCollater
ModuleNotFoundError: No module named 'torch_geometric.data.dataloader'

I found it is caused by torch_geometric changing the module name or moving the function.
I fixed it by changing from torch_geometric.data.dataloader import Collater as PyGCollater to from torch_geometric.loader.dataloader import Collater as PyGCollater. And successfully downloaded the data.

I guess you could check the dependency and solve it. The version of my torch_geometric is 2.0.2.

BTW This benchmark is very useful. It would be nice to have a TensorFlow version. I am looking forward to it.

Support for faster model training

May I check whether there are plans to support: 1) multi-GPU parallel training; 2) fp16; 3) gradient accumulation? These features would allow us to train models much faster and with larger batch sizes (especially for large models like BERT).

Unable to retrieve CodaLab experiment outputs

Hello,

I am trying to download the trained models using the link provided (CodaLab).

When clicking on any of the iWildCam v2.0 (or any other dataset) experiment results in CodaLab, I get a page with the command line to run (to train the model myself) and a loading logo underneath it.

It seems like it is trying to load something, but I have had this page open for hours and it still isn't giving me anything. When I click the 'download' button on the left, it leads me to an error page.
Is there a way I can get the results, like the best_model.pth that the CodaLab page describes?

Thank you!

Could you provide the trained weights?

Hello,

I am training BERT+ERM on the Amazon dataset, but it is very time-consuming. Would it be possible to provide the best trained parameters to users? (Just as BERT provides pretrained weights, maybe you could add another folder under examples containing all the weights.) It would save users about a week of computation. Thank you!

n_groups_per_batch does not work for Poverty

Hi WILDS Team,
I am currently working with the WILDS repository, specifically the poverty dataset. I've run the script run_expt.py in the examples folder with the argument --n_groups_per_batch=3 (or a different number). However, per batch I get samples from more than 3 different groups. Am I using this argument incorrectly? I understood --n_groups_per_batch as the number of different environments from which samples appear in one batch.

The command line reads:
python examples/run_expt.py --dataset poverty --algorithm ERM --root_dir data --n_epochs=200 --seed=0 --log_every=200 --batch_size=64 --n_groups_per_batch=2 --progress_bar True

The output when I use the n_groups variable defined in IRM.py:
n groups: 13
groups: tensor([ 3, 5, 7, 9, 10, 11, 13, 14, 16, 19, 20, 21, 22], device='cuda:0')

In addition, is the --uniform_over_groups flag valid for Poverty, given that the samples are not uniformly distributed over the different environments used in the training split?

Thanks in advance for your help.

Niels

Data loader for PovertyMap is very slow

Hi -

Ran into a bit of an issue with data loading the Povertymap dataset - loading a single minibatch with 128 examples takes about 5-6 seconds. This is not a huge deal but slow enough to make me curious if there's a faster way of doing this.

Digging into the code a bit, it looks like the slowdown is mostly due to the array copy on line 239 of poverty_dataset.py

img = self.imgs[idx].copy()

FWIW it looks like this is a known issue for memory-mapped numpy arrays on Linux systems (https://stackoverflow.com/questions/42864320/numpy-memmap-performance-issues).

I'm not sure if there are any recommendations for getting around this, or if there's another way the data could be loaded in? Or let me know if I'm totally off-base here. Thanks!

Installating via pip seems to miss `torch_scatter` dependency

Hey,

I noticed that the installation via pip install wilds seems to miss the torch_scatter dependency that is also listed in the README. When e.g. trying to do from wilds.datasets.amazon_dataset import AmazonDataset I got

from wilds.datasets.amazon_dataset import AmazonDataset
  File "/Users/deul/Desktop/wilds/wilds/datasets/amazon_dataset.py", line 6, in <module>
    from wilds.common.utils import map_to_id_array
  File "/Users/deul/Desktop/wilds/wilds/common/utils.py", line 1, in <module>
    import torch, torch_scatter
ModuleNotFoundError: No module named 'torch_scatter'

As far as I can see, the solution should be as easy as adding torch_scatter>=2.0.5 to the install_requires attribute in setup.py. In my case, the error was resolved after installing torch_scatter separately.

fmow and Pandas 2.0.0 datetime conversion

I'm getting an error when initializing the "fmow" dataset. I got the following error for the conversion of the timestamp to datetime with Pandas:

ValueError: time data "2011-02-07T02:48:56.643Z" doesn't match format "%Y-%m-%dT%H:%M:%S%z", at position 92. You might want to try:
- passing format if your strings have a consistent format;
- passing format='ISO8601' if your strings are all ISO8601 but not necessarily in exactly the same format;
- passing format='mixed', and the format will be inferred for each element individually. You might want to use dayfirst alongside this.

I noticed I was using Pandas 2.0.0 (presumably the most recent version) and when I reverted to Pandas 1.5.3, the issue seemed to go away. I'm guessing the datetime formatting was changed in version 2 and it might be good to update WILDS to still work with the new version. Thanks!

Error with example fMOW command: incorrect value of "unlabeled_n_groups_per_batch"

Hello,
If I directly run this command suggested in the README:
python examples/run_expt.py --dataset fmow --algorithm DANN --unlabeled_split test_unlabeled --root_dir data

I get the following exeption:

Traceback (most recent call last):
  File "/mnt/beegfs/bulk/mirror/jyf6/datasets/wilds/examples/run_expt.py", line 491, in <module>
    main()
  File "/mnt/beegfs/bulk/mirror/jyf6/datasets/wilds/examples/run_expt.py", line 454, in main
    train(
  File "/mnt/beegfs/bulk/mirror/jyf6/datasets/wilds/examples/train.py", line 114, in train
    run_epoch(algorithm, datasets['train'], general_logger, epoch, config, train=True, unlabeled_dataset=unlabeled_dataset)
  File "/mnt/beegfs/bulk/mirror/jyf6/datasets/wilds/examples/train.py", line 38, in run_epoch
    unlabeled_data_iterator = InfiniteDataIterator(unlabeled_dataset['loader'])
  File "/mnt/beegfs/bulk/mirror/jyf6/datasets/wilds/examples/utils.py", line 393, in __init__
    self.iter = iter(self.data_loader)
  File "/home/fs01/jyf6/miniconda3/envs/ponds/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 442, in __iter__
    return self._get_iterator()
  File "/home/fs01/jyf6/miniconda3/envs/ponds/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/fs01/jyf6/miniconda3/envs/ponds/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1085, in __init__
    self._reset(loader, first_iter=True)
  File "/home/fs01/jyf6/miniconda3/envs/ponds/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1118, in _reset
    self._try_put_index()
  File "/home/fs01/jyf6/miniconda3/envs/ponds/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1352, in _try_put_index
    index = self._next_index()
  File "/home/fs01/jyf6/miniconda3/envs/ponds/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 624, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/mnt/beegfs/bulk/mirror/jyf6/datasets/wilds/wilds/common/data_loaders.py", line 131, in __iter__
    groups_for_batch = np.random.choice(
  File "mtrand.pyx", line 984, in numpy.random.mtrand.RandomState.choice
ValueError: Cannot take a larger sample than population when 'replace=False'

I think this occurs because there are only 2 unique years in the test_unlabeled split, but unlabeled_n_groups_per_batch is set to 8, so it tries to sample 8 years without replacement.

I was able to fix this by changing the argument unlabeled_n_groups_per_batch to 2, here: https://github.com/p-lambda/wilds/blob/main/examples/configs/datasets.py#L220

It would be great if this can be fixed. Thank you so much for releasing these wonderful datasets and baseline algorithms!

"Corrupt" image in dataset iWildCam version 2.0

Hi and thanks for sharing the code.

I found an issue when training on dataset iWildCam version 2.0. The problem doesn't exist in iWildCam v1.0.

I think that there is at least 1 corrupt image in iWildCam v2.0. When I train on this dataset, the training breaks because it cannot open the image.

The corrupt image is: /iwildcam_v2.0/train/8ad9843e-21bc-11ea-a13a-137349068a90.jpg
I also tried to open this image with Image Viewer on Ubuntu and it didn't work.

Other people on the internet encountered the problem as well:
https://www.kaggle.com/c/iwildcam-2020-fgvc7/discussion/134923

I'm not exactly sure what the best solution here is, but maybe if you temporarily made "v1.0" the default, it might reduce the number of people who stumble into this.

Thanks,
George

Understanding the prediction_dir format for leaderboard submission

I wonder if the log folder used during training is the prediction_dir described in Get Started: Evaluating trained models.

I tried to reproduce the ERM result on a subset of camelyon with the following command:

python examples/run_expt.py --dataset camelyon17 --algorithm ERM --root_dir data --frac 0.1 --log_dir log_erm_01

Training goes well.

But my file camelyon17_split:id_val_seed:0_epoch is empty.

Then I ran the following command:
python examples/evaluate.py log_erm_01 erm_01_output --root-dir data --dataset camelyon17

And I got this:

Traceback (most recent call last):
  File "examples/evaluate.py", line 282, in <module>
    main()
  File "examples/evaluate.py", line 244, in main
    evaluate_benchmark(
  File "examples/evaluate.py", line 136, in evaluate_benchmark
    predictions_file = get_prediction_file(
  File "examples/evaluate.py", line 89, in get_prediction_file
    raise FileNotFoundError(
FileNotFoundError: Could not find CSV or pth prediction file that starts with camelyon17_split:id_val_seed:0.

So my question is: is the log folder the prediction_dir described in Get Started?

Label Description

Thanks for sharing the dataset. Where can I find the label description in the code?

The Waterbirds dataset's link is invalid

Hi,

The Waterbirds dataset with UUID: '0x505056d5cdea4e4eaa0e242cbfe2daa4" on CodaLab is invalid right now with Error: 404 (cannot manually download it from the link or the page). It would be greatly appreciated if you could kindly fix it.

Thank you very much!

Issue in OOD data distribution when Grouper is set to "regions" for FMoW

Hi,

I am trying to change the groupby from "year" to "region". I have followed the instructions in the README page and currently using the following command:
python3 wilds/examples/run_expt.py --dataset fmow --algorithm ERM --groupby_fields region --root_dir wilds_fmow/

However, the issue is that the training dataset is not being separated into distinct regions in an ID/OOD manner; that is, all regions are included in ID as well as OOD (see the screenshot of the output attached to the original issue).

Therefore, I was wondering if that is a bug in the code or am I missing something?

Thanks
Sara A. Al-Emadi

Dataset Split Size

I noticed that the datasets have been updated (e.g., iwildcam). Where can I find the latest information about the dataset split sizes?

Waterbirds give 0 worst-group accuracy

Training the Waterbirds dataset out of the box gives 0 worst-group accuracy. Digging deeper, I noticed that all the predictions immediately become the 0 label. Any advice would be helpful. Thanks in advance.

Question about the creation of WILDS-FMoW subset

Hi,

In your paper, you mention that you used a subset of FMoW. However, the provided rgb_metadata.csv file covers the entire FMoW dataset, and I couldn't find where in the code you create the subset (i.e., sample from the rgb_metadata.csv file). I have also looked at the frac parameter, which was equal to 1.0 in the config file as well as in the worksheet (https://worksheets.codalab.org/rest/bundles/0x20182ee424504e4a916fe88c91afd5a2/contents/blob/log.txt). I would therefore greatly appreciate it if you could let me know how you created the subset.

Thank you.

Sara A. Al-Emadi

pre-trained SwAV model weights for Camelyon17

Hi,

  1. Are pre-trained SwAV model weights for Camelyon17 publicly shared?
    I refer to this file used in the fine-tuning step: "--pretrained_model_path pretrained/checkpoints/ckp-55.pth"

  2. Also, may I know which commands were used for the SwAV-Camelyon17 results in Table 2. of the paper [1]? I can find three sets of commands (camelyon17_swav55_ermaugment_seed, camelyon17_swav55_ermaugment_val_seed, camelyon17_swav55_ermaugment_train_seed) at this link: https://worksheets.codalab.org/worksheets/0xb148346a5e4f4ce9b7cfc35c6dcedd63.

    I am not sure which ones were used as I get slightly different results when I calculate the results using the logs.

Thanks!

[1] Extending the WILDS benchmark for unsupervised adaptation

Figuring out the log files

I am trying to understand the log output.
After running a training command, for instance python examples/run_expt.py --dataset camelyon17 --algorithm ERM --root_dir data, I get a log folder with many files. What is the difference between test_algo.csv and test_eval.csv? I have seen that they are related to two loggers:

datasets[split]['eval_logger'] = BatchLogger(
            os.path.join(config.log_dir, f'{split}_eval.csv'), mode=mode, use_wandb=(config.use_wandb and verbose))
datasets[split]['algo_logger'] = BatchLogger(
            os.path.join(config.log_dir, f'{split}_algo.csv'), mode=mode, use_wandb=(config.use_wandb and verbose))

What is the difference between algo and eval?

ModuleNotFoundError: No module named 'transformers'

Hello, in several of your files in examples (e.g., optimizer.py and transforms.py), you import functions from transformers, but it is not provided in the current version. Could you please upload the module file? Thanks!

Poverty Map: Unable to map the image_id to its corresponding wealth_pooled and country domain from the dhs_meta.csv file.

Hi, in the train_mixup(train_loader, epoch, agg) function, for the ith sample of a batch, I got image id = 5863, domain = 13, with wealth_pooled = -0.8209. Upon looking into the dhs_meta.csv file, I figured out the wealth_pooled value and the country domain; however, the corresponding image_id is not 5863. Could you please help me map the image_id to the corresponding country and wealth_pooled? Thanks.

Unable to Train ERM model with civilcomments

Hi,

I am having trouble running the code with the command
python3 wilds/examples/run_expt.py --dataset civilcomments --algorithm ERM --root_dir data --download
Everything gets stuck: no error is reported, and neither the GPU nor the CPU is being used.

If I press Ctrl+C, it shows a traceback (screenshot attached in the original issue).

The same thing does not happen when I run the same script with groupDRO.

It would be very helpful if you have any clue on this, and thank you for your amazing, well-developed code!
