paddle-ce-latest-kpis's Introduction

Paddle Continuous Evaluation Baselines

Howtos

Add New Evaluation Task

Using the mnist task as a reference, the CE framework requires the following files:

  • run.xsh, a script that starts the evaluation run
    • this can be any shell script; put #!/bin/bash or #!/bin/xonsh at the top, depending on whether it is written in bash or xonsh
  • continuous_evaluation.py, which declares all the KPIs this task tracks
  • latest_kpis directory, which contains all the baseline files

PR and Add to Service

  • Open a PR to the fast branch and run the ce-kpi-fast-test test on TeamCity.
  • If it passes, open a PR from the fast branch to the master branch.

Add new KPI to track

See the interface in kpi.py; there are two basic KPIs:

  • LessWorseKpi
  • GreaterWorseKpi
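
The class names suggest the direction in which a change counts as a regression: for a "less is worse" KPI (accuracy, for example) a drop is bad, and for a "greater is worse" KPI (cost or duration) a rise is bad. The sketch below illustrates that idea only. It is not the code in kpi.py; the function names and the use of a relative-change threshold are assumptions made for illustration.

```python
def relative_change(baseline, current):
    """Relative change of the current value with respect to the baseline."""
    return (current - baseline) / baseline


def is_regression(baseline, current, threshold, greater_is_worse):
    """Return True if the value moved in the 'worse' direction by more than threshold.

    Hypothetical helper: the real KPI classes may compare values differently.
    """
    change = relative_change(baseline, current)
    if greater_is_worse:
        # e.g. duration or cost: an increase beyond the threshold is a regression
        return change > threshold
    # e.g. accuracy: a decrease beyond the threshold is a regression
    return -change > threshold


if __name__ == "__main__":
    # Illustrative numbers only, not measured data.
    print(is_regression(10.0, 11.0, 0.06, greater_is_worse=True))    # duration rose by 10%
    print(is_regression(0.99, 0.995, 0.005, greater_is_worse=False)) # accuracy improved
```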

paddle-ce-latest-kpis's People

Contributors

aurelius84, ccmeteorljh, chengduozh, dddivano, dzhwinter, guochaorong, guoshengcs, iamwhtwd, jiaxiao243, kolinwei, luotao1, mmglove, oliverlph, paddlece, panyx0718, pcjmmc, putcn, qingqing01, ray2020bd, reyoung, sneaxiy, superjomn, velconia, wanghaoshuang, xiaosang, xiegegege, yancey1989, yanmeng1019, zeref996, zhengya01

paddle-ce-latest-kpis's Issues

Model stability analysis

Environment: Tesla V100, CUDA version: 384.81

We ran each model ten times and collected the following data.

resnet50:

kpi                      min    max    mean    median  std    (std/mean)*100%
cifar10_128_gpu_memory   1566   1596   1589.6  1596    10     0.6%
cifar10_128_train_acc    0.966  0.998  0.985   0.990   0.011  1.1%
cifar10_128_train_speed  346.3  406.1  373.0   370.7   19.7   5.2%
flowers_64_gpu_memory    10680  12848  12139   12459   779.9  6.4%
flowers_64_train_speed   70.3   76.3   74.1    74.9    2.14   2.8%

resnet30:

kpi             min    max    mean    median  std    (std/mean)*100%
train_cost      2.32   2.69   2.51    2.51    0.096  3.8%
train_duration  10.24  10.65  10.366  10.347  0.101  0.9%

mnist:

kpi             min      max      mean     median   std     (std/mean)*100%
test_acc        0.9858   0.9890   0.9877   0.9876   0.0010  0.1%
train_acc       0.9919   0.9933   0.9929   0.9929   0.0003  0.03%
train_duration  37.6037  38.7788  38.2673  38.5541  0.4592  1.2%
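
The columns above can be reproduced from raw per-run values with the Python standard library. The following is a minimal sketch. It uses the sample standard deviation, which is an assumption, since the issue does not say whether the sample or population form was used, and the input values are made up for illustration.

```python
import statistics


def summarize(values):
    """Compute the summary columns used in the stability tables for one KPI."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "std": std,
        "cv_percent": std / mean * 100.0,  # the (std/mean)*100% column
    }


if __name__ == "__main__":
    # Hypothetical durations from repeated runs, not the measured data above.
    runs = [37.6, 38.0, 38.4, 38.8]
    for name, value in summarize(runs).items():
        print("%-10s %.4f" % (name, value))
```

A KPI whose (std/mean)*100% value is close to or above its diff threshold is likely to raise alarms from run-to-run noise alone.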

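How to debug a single model

[image: ggg]

The bottom entry is correct and the top entry is wrong,
so the problem should lie somewhere between them.

Troubleshooting method:
[image: ggg]

[image: kkk]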
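
If the entries referred to above form an ordered history (of commits or builds, for example) with one known-good end and one known-bad end, the range can be narrowed by bisection. That reading is an assumption, because the screenshots are not available here. The sketch below shows the idea; `first_bad` and `is_good` are hypothetical names, and `is_good` stands for whatever check is run against a single entry, such as rerunning the model at that commit.

```python
def first_bad(entries, is_good):
    """Return the first bad entry by bisection.

    Assumes entries are ordered oldest to newest, entries[0] is known good,
    entries[-1] is known bad, and every entry after the first bad one is bad.
    """
    lo, hi = 0, len(entries) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(entries[mid]):
            lo = mid
        else:
            hi = mid
    return entries[hi]


if __name__ == "__main__":
    # Hypothetical history: commits c1..c8, where the regression starts at c6.
    commits = ["c%d" % i for i in range(1, 9)]
    print(first_bad(commits, lambda c: int(c[1:]) < 6))
```

With n entries between the two ends, this needs about log2(n) checks rather than n.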

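Clean up the logs of each task

Tasks currently print too much log output. The TeamCity log in particular is too cluttered, which makes it hard to locate errors.

Some of the output could be removed, or the print frequency could be reduced.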

vgg16 model randomly hits "Segmentation fault"

Under the CE framework, vgg16 has hit a segmentation fault twice.

First job: http://18.222.34.7:8080/viewLog.html?buildId=1383&buildTypeId=Paddle_ContinuousEvaluation&tab=buildLog

Second job: http://180.76.57.222:8111/viewLog.html?buildId=118&buildTypeId=PaddleCe_CEBuild&tab=buildLog&_focus=7990

[17:55:19][Step 1/1] Pass: 0, Loss: 4.501836, Train Accuray: 0.000000
[17:55:19][Step 1/1] 
[17:55:19][Step 1/1] 
[17:55:19][Step 1/1] Total examples: 3040, total time: 68.43846, 44.41947 examples/sed
[17:55:19][Step 1/1] 
[17:55:19][Step 1/1] *** Aborted at 1531245319 (unix time) try "date -d @1531245319" if you are using GNU date ***
[17:55:19][Step 1/1] PC: @                0x0 (unknown)
[17:55:19][Step 1/1] *** SIGSEGV (@0x58) received by PID 4890 (TID 0x7fbc3a8c7700) from PID 88; stack trace: ***
[17:55:19][Step 1/1]     @     0x7fbcc2fe37e0 (unknown)
[17:55:19][Step 1/1]     @     0x7fbcc32f650c PyEval_EvalFrameEx
[17:55:19][Step 1/1]     @     0x7fbcc32ff37d PyEval_EvalCodeEx
[17:55:19][Step 1/1]     @     0x7fbcc3276905 (unknown)
[17:55:19][Step 1/1]     @     0x7fbcc3244d33 PyObject_Call
[17:55:19][Step 1/1]     @     0x7fbcc32fa0a2 PyEval_EvalFrameEx
[17:55:19][Step 1/1]     @     0x7fbcc32fce9e PyEval_EvalFrameEx
[17:55:19][Step 1/1]     @     0x7fbcc32fce9e PyEval_EvalFrameEx
[17:55:19][Step 1/1]     @     0x7fbcc32ff37d PyEval_EvalCodeEx
[17:55:19][Step 1/1]     @     0x7fbcc3276830 (unknown)
[17:55:19][Step 1/1]     @     0x7fbcc3244d33 PyObject_Call
[17:55:19][Step 1/1]     @     0x7fbcc325374d (unknown)
[17:55:19][Step 1/1]     @     0x7fbcc3244d33 PyObject_Call
[17:55:19][Step 1/1]     @     0x7fbcc32f5897 PyEval_CallObjectWithKeywords
[17:55:19][Step 1/1]     @     0x7fbcc3341f32 (unknown)
[17:55:19][Step 1/1]     @     0x7fbcc2fdbaa1 start_thread
[17:55:19][Step 1/1]     @     0x7fbcc269dbcd clone
[17:55:19][Step 1/1]     @                0x0 (unknown)
[17:55:19][Step 1/1] ./run.xsh: line 14:  4890 Segmentation fault      FLAGS_benchmark=true FLAGS_fraction_of_gpu_memory_to_use=0.0 python model.py --device=GPU --batch_size=32 --data_set=flowers --iterations=100 --gpu_id=$cudaid
[17:55:20][Step 1/1] 4887

Both crashes happened in the final prediction stage:

        if args.with_test:
            pass_test_acc = test(exe)
        break

Model code:
https://github.com/PaddlePaddle/paddle-ce-latest-kpis/blob/master/vgg16/model.py

Setting the KPI thresholds used for CE monitoring of models repo models

Add a launch script for the model, .run.sh:

#!/bin/bash
rm -rf *_factor.txt
model_file='model.py'
python $model_file --batch_size 128 --pass_num 5 --device CPU

Add a threshold file for the model, .continuous_evaluation.py:

import os
import sys
sys.path.append(os.environ['ceroot'])
from kpi import CostKpi, DurationKpi, AccKpi

train_cost_kpi = CostKpi('train_cost', 0.02, actived=True)
test_acc_kpi = AccKpi('test_acc', 0.005, actived=True)
train_duration_kpi = DurationKpi('train_duration', 0.06, actived=True)
train_acc_kpi = AccKpi('train_acc', 0.005, actived=True)

tracking_kpis = [
    train_acc_kpi,
    train_cost_kpi,
    test_acc_kpi,
    train_duration_kpi,
]

Add the baseline KPI data for the model in a hidden directory, .latest_kpis,
which holds the baseline data for each KPI:
test_acc_factor.txt train_acc_factor.txt train_cost_factor.txt train_duration_factor.txt

Reference: the CE mnist model:
https://github.com/PaddlePaddle/paddle-ce-latest-kpis/tree/master/mnist

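Conversion guidelines for models repo models pending release

  1. Convert every model to the new API, e.g. replace ParallelDo with ParallelExecutor.
  2. Every model should support CPU, single-GPU, and multi-GPU runs.
  3. Every model should follow the same layout: one file for the model, one for training, one for inference.

Please add any further suggestions.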

Where can I import commands from?

Hi, guys,

I noticed that ce_models/resnet50_net_GPU/train.py, line 12, has:
import commands

but I can't find where to import commands from:
ModuleNotFoundError: No module named 'commands'
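
For context, `commands` is a Python 2 standard library module that was removed in Python 3, where `subprocess.getoutput` and `subprocess.getstatusoutput` provide the equivalent helpers. The following is a minimal sketch of a compatibility import. It assumes train.py only calls `commands.getoutput` or `commands.getstatusoutput`, which has not been checked against the file.

```python
try:
    import commands  # Python 2 only
except ImportError:
    # Python 3: subprocess offers getoutput/getstatusoutput with the same call shape
    import subprocess as commands

# Run a shell command and capture its exit status and output.
status, output = commands.getstatusoutput("echo hello")
print(status, output)
```

The error message in the issue (ModuleNotFoundError) only exists in Python 3, so the script was most likely written for Python 2 and run under Python 3.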

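Remove randomness from all models

The CE models were moved over gradually from the PaddlePaddle models repo. There are currently 12 of them,
in 3 categories:

NLP: seq2seq, lstm, language_model, transformer, sequence_tagging_for_ner, text_classification
Image: mnist, image_classification, resnet50, vgg16, object_detection
Multi-machine: vgg16_aws_dist

CE has detected that the data of some models is still nondeterministic, for example the duration and memory KPIs,
and from time to time the acc (accuracy) of some models fluctuates again after a few days. Current overall status: http://18.222.34.7/
For example:

[image: 11]

#41
#42
http://18.222.34.7/commit/draw_scalar?task=mnist
http://18.222.34.7/commit/draw_scalar?task=image_classification
http://18.222.34.7:8080/viewLog.html?tab=buildLog&buildTypeId=Paddle_ContinuousEvaluation&buildId=828

Each model area needs an owner, and each owner should determine the thresholds for the unstable KPIs of their models.
All threshold alarms caused by unstable KPIs should be eliminated within the next two weeks.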

Performance differences between AWS and intranet machines

Diff reference:

https://github.com/PaddlePaddle/paddle-ce-latest-kpis/pull/74/files

AWS machine specification:

  • GPU
Tesla V100-SXM2-16GB
Tesla V100-SXM2-16GB
Tesla V100-SXM2-16GB
Tesla V100-SXM2-16GB
  • Others
ip-172-31-23-234
    description: Computer
    width: 64 bits
    capabilities: vsyscall32
  *-core
       description: Motherboard
       physical id: 0
     *-memory
          description: System memory
          physical id: 0
          size: 240GiB
     *-cpu
          product: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
          vendor: Intel Corp.
          physical id: 1
          bus info: cpu@0
          size: 2699MHz
          capacity: 3GHz
          width: 64 bits
          capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp x86-64 constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single retpoline kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt cpufreq
     *-pci
          description: Host bridge
          product: 440FX - 82441FX PMC [Natoma]
          vendor: Intel Corporation
          physical id: 100
          bus info: pci@0000:00:00.0
          version: 02
          width: 32 bits
          clock: 33MHz
        *-isa
             description: ISA bridge
             product: 82371SB PIIX3 ISA [Natoma/Triton II]
             vendor: Intel Corporation
             physical id: 1
             bus info: pci@0000:00:01.0
             version: 00
             width: 32 bits
             clock: 33MHz
             capabilities: isa bus_master
             configuration: latency=0
        *-ide
             description: IDE interface
             product: 82371SB PIIX3 IDE [Natoma/Triton II]
             vendor: Intel Corporation
             physical id: 1.1
             bus info: pci@0000:00:01.1
             version: 00
             width: 32 bits
             clock: 33MHz
             capabilities: ide bus_master
             configuration: driver=ata_piix latency=64
             resources: irq:0 ioport:1f0(size=8) ioport:3f6 ioport:170(size=8) ioport:376 ioport:c100(size=16)
        *-bridge UNCLAIMED
             description: Bridge
             product: 82371AB/EB/MB PIIX4 ACPI
             vendor: Intel Corporation
             physical id: 1.3
             bus info: pci@0000:00:01.3
             version: 01
             width: 32 bits
             clock: 33MHz
             capabilities: bridge bus_master
             configuration: latency=0
        *-display:0 UNCLAIMED
             description: VGA compatible controller
             product: GD 5446
             vendor: Cirrus Logic
             physical id: 2
             bus info: pci@0000:00:02.0
             version: 00
             width: 32 bits
             clock: 33MHz
             capabilities: vga_controller bus_master
             configuration: latency=0
             resources: memory:80000000-81ffffff memory:8f004000-8f004fff
        *-network
             description: Ethernet interface
             physical id: 3
             bus info: pci@0000:00:03.0
             logical name: ens3
             version: 00
             serial: 06:64:f5:87:78:b8
             width: 32 bits
             clock: 33MHz
             capabilities: bus_master cap_list ethernet physical
             configuration: broadcast=yes driver=ena driverversion=1.3.0K ip=172.31.23.234 latency=0 link=yes multicast=yes
             resources: irq:0 memory:8f000000-8f003fff
        *-display:1
             description: 3D controller
             product: NVIDIA Corporation
             vendor: NVIDIA Corporation
             physical id: 1b
             bus info: pci@0000:00:1b.0
             version: a1
             width: 64 bits
             clock: 33MHz
             capabilities: bus_master cap_list
             configuration: driver=nvidia latency=248
             resources: iomemory:400-3ff irq:252 memory:8a000000-8affffff memory:4000000000-43ffffffff memory:82000000-83ffffff
        *-display:2
             description: 3D controller
             product: NVIDIA Corporation
             vendor: NVIDIA Corporation
             physical id: 1c
             bus info: pci@0000:00:1c.0
             version: a1
             width: 64 bits
             clock: 33MHz
             capabilities: bus_master cap_list
             configuration: driver=nvidia latency=248
             resources: iomemory:440-43f irq:253 memory:8b000000-8bffffff memory:4400000000-47ffffffff memory:84000000-85ffffff
        *-display:3
             description: 3D controller
             product: NVIDIA Corporation
             vendor: NVIDIA Corporation
             physical id: 1d
             bus info: pci@0000:00:1d.0
             version: a1
             width: 64 bits
             clock: 33MHz
             capabilities: bus_master cap_list
             configuration: driver=nvidia latency=248
             resources: iomemory:480-47f irq:254 memory:8c000000-8cffffff memory:4800000000-4bffffffff memory:86000000-87ffffff
        *-display:4
             description: 3D controller
             product: NVIDIA Corporation
             vendor: NVIDIA Corporation
             physical id: 1e
             bus info: pci@0000:00:1e.0
             version: a1
             width: 64 bits
             clock: 33MHz
             capabilities: bus_master cap_list
             configuration: driver=nvidia latency=248
             resources: iomemory:4c0-4bf irq:255 memory:8d000000-8dffffff memory:4c00000000-4fffffffff memory:88000000-89ffffff
        *-generic
             description: Unassigned class
             product: Xen Platform Device
             vendor: XenSource, Inc.
             physical id: 1f
             bus info: pci@0000:00:1f.0
             version: 01
             width: 32 bits
             clock: 33MHz
             capabilities: bus_master
             configuration: driver=xen-platform-pci latency=0
             resources: irq:47 ioport:c000(size=256) memory:8e000000-8effffff
  *-network:0
       description: Ethernet interface
       physical id: 1
       logical name: veth1678770
       serial: 72:b6:14:e8:37:ef
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:1
       description: Ethernet interface
       physical id: 2
       logical name: vethe2a2a5e
       serial: 5a:7a:fe:1a:a0:22
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:2
       description: Ethernet interface
       physical id: 3
       logical name: vetha97198a
       serial: 12:9e:41:ec:6b:1e
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:3
       description: Ethernet interface
       physical id: 4
       logical name: vethf7fcba7
       serial: 1a:c9:0f:cd:60:42
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:4
       description: Ethernet interface
       physical id: 5
       logical name: veth437a440
       serial: ce:8b:df:ab:d2:77
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:5
       description: Ethernet interface
       physical id: 6
       logical name: veth67fe6d0
       serial: 52:f5:1d:0a:f1:59
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full link=yes multicast=yes port=twisted pair speed=10Gbit/s

Where is memory.txt?

Hi there,

ce_models/seq2seq/get_gpu_data.py, line 29, has:
with open('memory.txt', 'r') as f:

Where is memory.txt? I can't find it anywhere.

Speedup comparison across GPU types

benchmark file: benchmark/fluid/resnet.py
dataset: flowers
batch size: 64

GPU type  2 GPUs  4 GPUs  8 GPUs
K40       1.78    2.82    -
P40       1.61    2.05    2.67
V100      1.38    1.93    -
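
One way to read the table is to convert each speedup into a scaling efficiency, that is, the speedup divided by the number of GPUs. The sketch below does this for the values above. It assumes each speedup is measured relative to a single-GPU run of the same GPU type, which the issue does not state explicitly.

```python
# Speedups copied from the table above; missing cells are simply omitted.
speedups = {
    "K40": {2: 1.78, 4: 2.82},
    "P40": {2: 1.61, 4: 2.05, 8: 2.67},
    "V100": {2: 1.38, 4: 1.93},
}


def scaling_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return speedup / num_gpus


efficiency = {
    gpu: {n: scaling_efficiency(s, n) for n, s in row.items()}
    for gpu, row in speedups.items()
}

if __name__ == "__main__":
    for gpu, row in efficiency.items():
        cells = ", ".join("%d GPUs: %.0f%%" % (n, e * 100) for n, e in sorted(row.items()))
        print("%-5s %s" % (gpu, cells))
```

Under that assumption, efficiency falls as the GPU count grows for every type listed, and the faster the GPU, the lower the efficiency at the same GPU count.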

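Integrating models repo models into the CE monitoring framework

To support the models release planned for mid-August, as well as ongoing monitoring of the models repo (https://github.com/PaddlePaddle/models),
we plan to adapt the models in the models repo so that they plug into the CE monitoring framework.

Plan

Models to integrate and their owners (to be completed by August 8) [model conversion guidelines]: #106
Two PRs need to be merged.
Vision, contact: 青青
1. mnist  郭超容 [done]
2. object_detection  一帆 [done, monitoring metrics]
3. image_classification  青青 [merged, monitoring metrics]
4. ocr_recognition  豪爽 [merged, thresholds to be set]
5. icnet  豪爽 [merged, thresholds to be set]

NLP, contact: 毅冰
1. seq2seq  青晟 [done]
2. language_model  超容 [done, monitoring metrics]
3. transformer  郭晟 [merged, monitoring metrics]
4. sequence_tagging_for_ner  毅冰 [done, monitoring job metrics]
5. text_classification  毅冰 [done, monitoring job metrics]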

How

Changes to the model

Example PRs:
mnist: https://github.com/PaddlePaddle/models/pull/1080/files
seq2seq: PaddlePaddle/models#1104
Taking mnist as the example:

1. Changes to the model in the models repo

mnist code reference: PaddlePaddle/models#1080

  1. Print the KPI values in model.py.
    (For the KPIs to print, refer to the corresponding model in the CE models repo: find the model's KPI file continuous_evaluation.py in https://github.com/PaddlePaddle/paddle-ce-latest-kpis.)
    Fields are tab-separated:
        print("kpis\ttrain_acc\t%f" % train_avg_acc)
        print("kpis\ttrain_cost\t%f" % train_avg_loss)
        print("kpis\ttest_acc\t%f" % test_avg_acc)
        print("kpis\ttrain_duration\t%f" % (pass_end - pass_start))
  2. Add a _ce.py file that parses the KPI values from the log printed by model.py.
# This file is only used for continuous evaluation test!

import os
import sys
sys.path.append(os.environ['ceroot'])
from kpi import CostKpi, DurationKpi, AccKpi

# NOTE: kpi.py should be shared with models in some way!

train_cost_kpi = CostKpi('train_cost', 0.02, 0, actived=True)
test_acc_kpi = AccKpi('test_acc', 0.005, 0, actived=True)
train_duration_kpi = DurationKpi('train_duration', 0.06, 0, actived=True)
train_acc_kpi = AccKpi('train_acc', 0.005, 0, actived=True)

tracking_kpis = [
    train_acc_kpi,
    train_cost_kpi,
    test_acc_kpi,
    train_duration_kpi,
]

def parse_log(log):
    '''
    This method should be implemented by model developers.

    Suggested format: each KPI line in the log consists of three
    tab-separated fields: the literal prefix "kpis", the KPI name,
    and the value. For example:

    "
    kpis\ttrain_cost\t1.0
    kpis\ttest_cost\t1.0
    kpis\ttrain_acc\t1.2
    "
    '''
    for line in log.split('\n'):
        fs = line.strip().split('\t')
        print (fs)
        # only lines of the form "kpis<TAB>name<TAB>value" are KPI records
        if len(fs) == 3 and fs[0] == 'kpis':
            print ("-----%s" % fs)
            kpi_name = fs[1]
            kpi_value = float(fs[2])
            yield kpi_name, kpi_value


def log_to_ce(log):
    kpi_tracker = {}
    for kpi in tracking_kpis:
        kpi_tracker[kpi.name] = kpi

    for (kpi_name, kpi_value) in parse_log(log):
        print (kpi_name, kpi_value)
        kpi_tracker[kpi_name].add_record(kpi_value)
        kpi_tracker[kpi_name].persist()


if __name__ == '__main__':
    log = sys.stdin.read()
    print ("*****")
    print (log)
    print ("****")
    log_to_ce(log) 
  3. Add a launch script .run_ce.sh (make it executable):
#!/bin/bash
# This file is only used for continuous evaluation.

model_file='model.py'
python $model_file --batch_size 128 --pass_num 5 --device CPU | python _ce.py
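
The log contract between model.py and _ce.py can be tried out without the CE framework. The following is a minimal standalone sketch: it only extracts the `kpis<TAB>name<TAB>value` lines and does not use kpi.py. The function name `parse_kpi_lines` and the sample log are made up for illustration.

```python
def parse_kpi_lines(log):
    """Extract (name, value) pairs from lines of the form 'kpis<TAB>name<TAB>value'."""
    records = []
    for line in log.splitlines():
        fields = line.strip().split("\t")
        if len(fields) == 3 and fields[0] == "kpis":
            records.append((fields[1], float(fields[2])))
    return records


# Hypothetical log: two valid KPI lines, one ordinary line, and one line
# that uses spaces instead of tabs and is therefore ignored.
sample_log = "\n".join([
    "Pass 0 finished",
    "kpis\ttrain_acc\t0.993373",
    "kpis\ttrain_cost\t0.021950",
    "kpis  test_acc  0.984175",
])

records = parse_kpi_lines(sample_log)
print(records)  # [('train_acc', 0.993373), ('train_cost', 0.02195)]
```

The last sample line shows why the separator matters: a KPI printed with spaces instead of tabs is silently dropped by a parser of this shape.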

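2. Changes to the model in the CE models repo

Once step 1 above is merged, set the thresholds in step 2.
mnist model code reference: https://github.com/PaddlePaddle/paddle-ce-latest-kpis/tree/master/model_mnist

In the root directory of this repo (paddle-ce-latest-kpis), add a directory for the model whose name starts with the prefix 'model_'. The directories in the two repos correspond as follows:

 #ls models_repo/fluid/mnist/   <---->    model_mnist/ 

In that directory:

  1. Add a launch script run.xsh (executable):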
#!/bin/bash

./.run_ce.sh
  2. Add the corresponding baseline KPI data
    in the latest_kpis directory:
test_acc_factor.txt  train_acc_factor.txt  train_cost_factor.txt  train_duration_factor.txt

How to test locally

In the model directory:
Copy kpi.py from https://github.com/PaddlePaddle/continuous_evaluation/blob/develop/kpi.py
Copy config.py from https://github.com/PaddlePaddle/continuous_evaluation/blob/develop/config.py
Run sh .run_ce.sh
Output like the following indicates success:

['kpis', 'train_acc', '0.993373']
('train_acc', 0.993373)
['kpis', 'train_cost', '0.021950']
('train_cost', 0.02195)
['kpis', 'test_acc', '0.984175']
('test_acc', 0.984175)
['kpis', 'train_duration', '42.244845']
('train_duration', 42.244845)

Example:
models repo mnist CE monitoring job:
http://ce.paddlepaddle.org:8080/viewLog.html?buildId=668&buildTypeId=PaddleCe_ModelsRepo

Appendix: how CE supports monitoring of models repo models: #105

Where can I import "kpi" from?

Hi there,
from kpi import AccKpi
from kpi import CostKpi
from kpi import DurationKpi

Where can I find kpi?
Python can't import it:
ModuleNotFoundError: No module named 'kpi'
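
For context, the sample files in this repository locate kpi.py through `sys.path.append(os.environ['ceroot'])`, and the local-test notes copy kpi.py from the PaddlePaddle/continuous_evaluation repo into the model directory. The following is a minimal sketch of a guarded version of that path setup. The helper name `add_ce_root` is hypothetical, and it assumes `ceroot` points at a checkout that contains kpi.py.

```python
import os
import sys


def add_ce_root(environ=os.environ, path=sys.path):
    """Append the directory named by the 'ceroot' variable to the import path.

    Returns False when 'ceroot' is not set, so the caller can report a clear
    error instead of hitting ModuleNotFoundError later.
    """
    ce_root = environ.get("ceroot")
    if not ce_root:
        return False
    if ce_root not in path:
        path.append(ce_root)
    return True


if __name__ == "__main__":
    if add_ce_root():
        print("ceroot added to sys.path; 'from kpi import ...' can now be tried")
    else:
        print("ceroot is not set; point it at the directory that contains kpi.py")
```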

How CE supports monitoring of models repo models

A models repo monitoring job has been set up in CE:
http://ce.paddlepaddle.org:8080/viewType.html?buildTypeId=PaddleCe_ModelsRepo

Configuration:

git clone https://github.com/PaddlePaddle/models.git models_repo  # fetch the models repo models
git clone https://github.com/PaddlePaddle/paddle-ce-latest-kpis.git tasks  # fetch the CE models

export specific_tasks='model_mnist';
# list of models repo models to monitor; mnist corresponds to model_mnist in the CE repo
array=(${specific_tasks//,/ });
alias cp='cp';
for task in ${array[@]}  # for each model to monitor
do
    echo $task;
    cp -rf models_repo/fluid/${task:6}/. tasks/${task}/;  # copy the model's files from the models repo (including hidden files such as .run_ce.sh, and _ce.py)
done;
./main.xsh
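
The `${task:6}` expansion in the loop above drops the first six characters of the task name, that is, the `model_` prefix, to obtain the directory name in the models repo. The sketch below shows the same mapping in Python without copying anything. The function names are hypothetical, and unlike the shell version it rejects task names that lack the prefix.

```python
PREFIX = "model_"


def source_dir_for(task, models_root="models_repo/fluid"):
    """Map a CE task name such as 'model_mnist' to its models repo directory."""
    if not task.startswith(PREFIX):
        raise ValueError("task name must start with %r: %r" % (PREFIX, task))
    return "%s/%s" % (models_root, task[len(PREFIX):])


def plan_copies(specific_tasks, tasks_root="tasks"):
    """Return (source, destination) pairs for a comma-separated task list."""
    return [
        (source_dir_for(task), "%s/%s" % (tasks_root, task))
        for task in specific_tasks.split(",")
        if task
    ]


if __name__ == "__main__":
    # 'model_seq2seq' is a made-up second entry to show the comma-separated form.
    for src, dst in plan_copies("model_mnist,model_seq2seq"):
        print("%s/. -> %s/" % (src, dst))
```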

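A model in the models repo is configured the same way as its counterpart in the CE models repo. Once the models repo version runs stably, the counterpart is removed from the CE models repo.
Monitoring of the 10 models pending release will be migrated gradually in this way.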

Models active in CE

Baseline models

model     kpi                      active  diff threshold
resnet50  cifar10_128_train_acc    yes     0.03
          cifar10_128_train_speed          0.03
          cifar10_128_gpu_memory           0.05
          flowers_64_train_speed           0.05
          flowers_64_gpu_memory            0.05
lstm      imdb_32_train_speed              0.03
          imdb_32_gpu_memory               0.05
vgg16     cifar10_128_train_speed          0.02
          cifar10_128_gpu_memory           0.05
          flowers_32_train_speed           0.02
          flowers_32_gpu_memory            0.05

Business models

model                     kpi                        active  diff threshold
text_classification       lstm_pass_duration                 0.02
language_model            imikolov_20_pass_duration          0.02
sequence_tagging_for_ner  pass_duration                      0.02
object_detection          train_cost                         0.02
                          train_speed                        0.02
CE model owners

CE repo models: https://github.com/PaddlePaddle/paddle-ce-latest-kpis

Model owners (any future issue with a model is owned by its owner):

青晟: seq2seq
于洋: language_model
郭晟: transformer
毅冰: sequence_tagging_for_ner
春伟: text_classification
志宏: lstm
超容: mnist
豪爽: vgg16
白一帆: object_detection
青青: image_classification
卫科: resnet50
佳宜: resnet50_net_CPU
成舵: resnet50_net_GPU

yibing and 青青 will each confirm the models and test scenarios that still need to be added:
NLP:
Vision:
