
Comments (4)

paulosabile-wb commented on August 18, 2024

Hi @yangjhui, good day and thank you for reporting this to us. Happy to help you with this!

To investigate this further, could you share with us how you are logging to wandb? A code snippet of the logging calls in your training script would help.

Also, to get more information about this error, could you also share the debug-internal.log and debug.log files for the affected run? These files are under the local folder wandb/run-_-/logs in the same directory where you're running your code. Thank you!
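For reference, a minimal logging setup usually looks something like the sketch below (the project name, run name, and metric values here are placeholders, not your actual code):

import wandb

# Minimal sketch of a typical wandb logging loop; all names and values are placeholders.
run = wandb.init(project="my-project", name="example-run")
for epoch in range(3):
    train_loss = 1.0 / (epoch + 1)  # stand-in for a real training loss
    wandb.log({"epoch": epoch, "train_loss": train_loss})
run.finish()

Seeing your actual equivalent of this loop will show us where and how the metrics are produced.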


yangjhui commented on August 18, 2024

Hello @paulosabile-wb, thank you for your quick reply.
The content of "train.py" is as follows:

import os
import random
import time
from datetime import datetime

import numpy as np
import torch
import wandb
from torch_geometric.loader import DataLoader

# Project-specific imports; the module paths below are assumed from the
# PowerFlowNet repository layout and may need adjusting to your local copy.
from datasets.PowerFlowData import PowerFlowData
from networks.MPN import (MPN, MPN_simplenet, SkipMPN, MaskEmbdMPN,
                          MultiConvNet, MultiMPN, MaskEmbdMultiMPN)
from utils.argument_parser import argument_parser
from utils.custom_loss_functions import Masked_L2_loss, PowerImbalance, MixedMSEPoweImbalance
from utils.training import train_epoch, append_to_json
from utils.evaluation import evaluate_epoch


def main():
    start = time.perf_counter()  # start timing; 'end' is taken at the bottom of main()

    # Step 0: Parse Arguments and Setup
    args = argument_parser()  # parse command-line arguments
    run_id = datetime.now().strftime("%Y%m%d") + '-' + str(random.randint(0, 9999))  # unique run ID built from the current date plus a random number
    LOG_DIR = 'logs'  # log directory
    SAVE_DIR = 'models'  # model directory
    TRAIN_LOG_PATH = os.path.join(LOG_DIR, 'train_log/train_log_' + run_id + '.pt')  # training-log file path
    SAVE_LOG_PATH = os.path.join(LOG_DIR, 'save_logs_HBN.json')  # save-log file path
    SAVE_MODEL_PATH = os.path.join(SAVE_DIR, 'model_' + run_id + '.pt')  # saved-model file path
    # Model registry: model name -> model class
    models = {
        'MPN': MPN,
        'MPN_simplenet': MPN_simplenet,
        'SkipMPN': SkipMPN,
        'MaskEmbdMPN': MaskEmbdMPN,
        'MultiConvNet': MultiConvNet,
        'MultiMPN': MultiMPN,
        'MaskEmbdMultiMPN': MaskEmbdMultiMPN
    }
    mixed_cases = ['118v2', '14v2']  # list of mixed grid cases

    # Training parameters
    data_dir = args.data_dir  # data directory ('data')
    normalize_data = not args.disable_normalize  # whether to normalize the data (not False)
    # num_epochs = args.num_epochs  # number of training epochs (100)
    num_epochs = 100  # number of training epochs, hard-coded to 100
    # Loss function; with regularization enabled, a regularization term with the given coefficient is added
    loss_fn = Masked_L2_loss(regularize=args.regularize, regcoeff=args.regularization_coeff)  # regularize=args.regularize (True), regcoeff=args.regularization_coeff (1)
    eval_loss_fn = Masked_L2_loss(regularize=False)  # evaluation loss function, without regularization
    lr = args.lr  # learning rate (1e-3)
    batch_size = args.batch_size  # batch size (128)
    # grid_case = args.case  # grid case, originally 'case14'
    # grid_case = '14'  # grid case
    grid_case = 'originalHBN40000'  # grid case, hard-coded to 'originalHBN40000'

    # Network parameters
    nfeature_dim = args.nfeature_dim  # node-feature dimension (6; 7 for the HBN case)
    efeature_dim = args.efeature_dim  # edge-feature dimension (6; 5 for the HBN case)
    hidden_dim = args.hidden_dim  # hidden dimension (128; 512 for the HBN case)
    output_dim = args.output_dim  # output dimension (6; 7 for the HBN case)
    n_gnn_layers = args.n_gnn_layers  # number of GNN layers (4; 5 for the HBN case)
    conv_K = args.K  # K value of the convolution layers (3)
    dropout_rate = args.dropout_rate  # dropout rate (0.2)
    # model = models[args.model]  # select the model class from the arguments (MPN)
    model = models['MaskEmbdMultiMPN']  # model class, hard-coded to MaskEmbdMultiMPN

    log_to_wandb = True  # whether to log to WandB

    # wandb_entity = args.wandb_entity  # WandB entity (PowerFlowNet)
    if log_to_wandb:
        wandb.init(project="PowerFlowNet",  # initialize the WandB run
                   # entity=wandb_entity,  # WandB entity
                   name=run_id,  # run name
                   config=args)  # pass the argument configuration to WandB

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # device = torch.device('cpu')
    # print(f"device:{device}")
    torch.manual_seed(1234)  # fix the random seed for reproducibility
    np.random.seed(1234)  # fix NumPy's random seed
    # torch.backends.cudnn.deterministic = True  # deterministic cuDNN kernels: more stable, possibly slower
    # torch.backends.cudnn.benchmark = False  # disable benchmark mode so cuDNN re-selects algorithms when input sizes vary

    # Step 1: Load data
    print('Initializing the training dataset:')
    trainset = PowerFlowData(root=data_dir, case=grid_case, split=[.5, .2, .3], task='train', normalize=normalize_data)
    print('Initializing the validation dataset:')
    valset = PowerFlowData(root=data_dir, case=grid_case, split=[.5, .2, .3], task='val', normalize=normalize_data)
    print('Initializing the test dataset:')
    testset = PowerFlowData(root=data_dir, case=grid_case, split=[.5, .2, .3], task='test', normalize=normalize_data)
    # Create a directory for the normalization parameters
    os.makedirs(os.path.join(data_dir, 'params'), exist_ok=True)
    # Save the normalization parameters computed on the training set
    torch.save({
        'xymean': trainset.xymean,      # mean of the node features
        'xystd': trainset.xystd,        # standard deviation of the node features
        'edgemean': trainset.edgemean,  # mean of the edge features
        'edgestd': trainset.edgestd,    # standard deviation of the edge features
    }, os.path.join(data_dir, 'params', f'data_params_{run_id}.pt'))

    # Data loaders: shuffle the training set; keep the validation and test order fixed
    train_loader = DataLoader(trainset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(valset, batch_size=batch_size, shuffle=False)
    test_loader = DataLoader(testset, batch_size=batch_size, shuffle=False)
    
    ## [Optional] physics-informed loss function
    # Select the training loss function according to the arguments
    print(f'args.train_loss_fn: {args.train_loss_fn}')
    if args.train_loss_fn == 'power_imbalance':
        # Power-imbalance loss, initialized with the training-set means/stds and moved to the device
        loss_fn = PowerImbalance(*trainset.get_data_means_stds()).to(device)
    elif args.train_loss_fn == 'masked_l2':
        # Masked L2 loss, with an optional regularization term and coefficient
        loss_fn = Masked_L2_loss(regularize=args.regularize, regcoeff=args.regularization_coeff)
    elif args.train_loss_fn == 'mixed_mse_power_imbalance':
        # Mixed MSE / power-imbalance loss, initialized with the training-set means/stds and weight alpha, moved to the device
        loss_fn = MixedMSEPoweImbalance(*trainset.get_data_means_stds(), alpha=0.9).to(device)
    else:
        # Default: plain MSE loss
        loss_fn = torch.nn.MSELoss()

    # Step 2: Create model and optimizer (and scheduler)
    # Query the dataset for the node input/output dimensions and the edge dimension
    node_in_dim, node_out_dim, edge_dim = trainset.get_data_dimensions()
    # Sanity check on the node input dimension
    # assert node_in_dim == 16  # case14
    assert node_in_dim == 18  # HBN case
    # Instantiate the selected model class with the dimensions above and move it to the device
    model = model(
        nfeature_dim=nfeature_dim,  # node-feature dimension
        efeature_dim=efeature_dim,  # edge-feature dimension
        output_dim=output_dim,      # output dimension
        hidden_dim=hidden_dim,      # hidden dimension
        n_gnn_layers=n_gnn_layers,  # number of GNN layers
        K=conv_K,                   # K value of the convolution layers
        dropout_rate=dropout_rate   # dropout rate
    ).to(device)

    # Calculate the model size (4,483,591 parameters for this configuration)
    pytorch_total_params = sum(p.numel() for p in model.parameters())
    print("Total number of parameters: ", pytorch_total_params)

    # AdamW optimizer over the model parameters with learning rate lr
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    # scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
    #                                                        mode='min',
    #                                                        factor=0.5,
    #                                                        patience=5,
    #                                                        verbose=True)
    # OneCycleLR scheduler with max_lr=lr.
    # NOTE: steps_per_epoch=len(train_loader) configures one step per batch, but
    # scheduler.step() is called once per epoch below (see the note after this listing).
    scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=lr, steps_per_epoch=len(train_loader), epochs=num_epochs)

    # Step 3: Train model
    # Initialize the best training and validation losses
    best_train_loss = 10000.
    best_val_loss = 10000.
    train_log = {
        'train': {
            'loss': []},
        'val': {
            'loss': []},
    }
    # pbar = tqdm(range(num_epochs), total=num_epochs, position=0, leave=True)
    # Loop over the epochs
    for epoch in range(num_epochs):  # num_epochs = 100
        # Train for one epoch and compute the training loss
        train_loss = train_epoch(
            model, train_loader, loss_fn, optimizer, device)
        # Compute this epoch's validation loss
        val_loss = evaluate_epoch(model, val_loader, eval_loss_fn, device)
        # val_loss = evaluate_epoch(model, val_loader, loss_fn, device)
        # Update the learning rate
        scheduler.step()
        # Record the training and validation losses
        train_log['train']['loss'].append(train_loss)
        train_log['val']['loss'].append(val_loss)

        # If wandb logging is enabled, log the training and validation losses
        if log_to_wandb:
            wandb.log({'train_loss': train_loss,
                       'val_loss': val_loss})

        # Update the best training loss
        if train_loss < best_train_loss:
            best_train_loss = train_loss

        # If this epoch's validation loss beats the best so far, update it and save the model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            # Save the model if saving is enabled
            if args.save:
                # Data to save
                _to_save = {
                    'epoch': epoch,
                    'args': args,
                    'val_loss': best_val_loss,
                    'model_state_dict': model.state_dict(),
                }
                # Make sure the model directory exists
                os.makedirs('models', exist_ok=True)
                # Save the checkpoint to the model path
                torch.save(_to_save, SAVE_MODEL_PATH)
                # Append this model's information to the save log
                append_to_json(
                    SAVE_LOG_PATH,
                    run_id,
                    {
                        'val_loss': f"{best_val_loss: .4f}",
                        # 'test_loss': f"{test_loss: .4f}",
                        'train_log': TRAIN_LOG_PATH,
                        'saved_file': SAVE_MODEL_PATH,
                        'epoch': epoch,
                        'model': args.model,
                        'train_case': args.case,
                        'train_loss_fn': args.train_loss_fn,
                        'args': vars(args)
                    }
                )
                # Save the training log
                torch.save(train_log, TRAIN_LOG_PATH)
        # Print this epoch's training, validation, and best validation losses
        print(f"Epoch {epoch+1} / {num_epochs}: train_loss={train_loss:.4f}, val_loss={val_loss:.4f}, best_val_loss={best_val_loss:.4f}")

    # Report completion and the best validation loss
    print(f"Training Complete. Best validation loss: {best_val_loss:.4f}")

    # Step 4: Evaluate model
    # If saving is enabled, reload the saved model and evaluate the test loss
    if args.save:
        _to_load = torch.load(SAVE_MODEL_PATH)
        model.load_state_dict(_to_load['model_state_dict'])
        test_loss = evaluate_epoch(model, test_loader, eval_loss_fn, device)
        # test_loss = evaluate_epoch(model, test_loader, loss_fn, device)
        print(f"Test loss: {test_loss:.4f}")
        # If wandb logging is enabled, log the test loss
        if log_to_wandb:
            wandb.log({'test_loss': test_loss})  # logged as a dict

    # Step 5: Save results
    # Make sure the training-log directory exists
    os.makedirs(os.path.join(LOG_DIR, 'train_log'), exist_ok=True)
    # If saving is enabled, save the training log
    if args.save:
        torch.save(train_log, TRAIN_LOG_PATH)

    end = time.perf_counter()  # stop timing
    runtime = end - start
    s = int(runtime % 60)
    m = int(runtime // 60 % 60)
    h = int(runtime // 3600)
    print(f"Total runtime: {runtime:.2f} s")
    print("Runtime: {:0>2} h:{:0>2} min:{:0>2} s".format(h, m, s))


if __name__ == '__main__':
    main()
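A side note on the snippet above: OneCycleLR is configured with steps_per_epoch=len(train_loader), which assumes scheduler.step() is called once per batch, while the loop above steps it once per epoch. A minimal sketch of the per-batch pattern follows (the toy model and loader are placeholders, not the project's train_epoch):

import torch

# Stand-in model, data, and loop; only the scheduler stepping pattern matters here.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = [(torch.randn(8, 4), torch.randn(8, 2))] * 10  # stand-in for a DataLoader
num_epochs = 3
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, steps_per_epoch=len(loader), epochs=num_epochs)
loss_fn = torch.nn.MSELoss()

for epoch in range(num_epochs):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()  # one scheduler step per batch, matching steps_per_epoch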

The debug-internal.log and debug.log are as follows:
debug.log
debug-internal.log

Thank you again for your reply; any further help would be greatly appreciated.


