38. Notes on using Huawei ModelArts to train a YOLOv5 model and run inference on an NPU

Tags: yolov5 38 ModelArts label ckpt lr file path size

Basic idea: I had the chance to try Huawei's ModelArts cloud service, so I am recording the process here.

Step 1: Log in to your account and check the service configuration. Choose an image yourself and pay for the resources.

[ma-user ~]$npu-smi info
+------------------------------------------------------------------------------------+
| npu-smi 21.0.3.1                 Version: 21.0.2                                   |
+----------------------+---------------+---------------------------------------------+
| NPU   Name           | Health        | Power(W)   Temp(C)                          |
| Chip                 | Bus-Id        | AICore(%)  Memory-Usage(MB)  HBM-Usage(MB)  |
+======================+===============+=============================================+
| 0     910PremiumA    | OK            | 87.5       38                               |
| 0                    | 0000:C1:00.0  | 0          758  / 15171      0    / 32768   |
+======================+===============+=============================================+

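Download the official YOLOv5 source code for testing. Be sure to download it into the work directory; otherwise the files will be gone the next time you log in. The source code below comes from the official repository (Appendix 1):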

Link: https://pan.baidu.com/s/11hsEUT-1abqIZNQy8rc4Dw  Extraction code: cp7k
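Because only the work directory survives a restart, it can help to check a path before downloading code or writing outputs there. The sketch below assumes the persistent directory is `/home/ma-user/work` (the usual ModelArts notebook mount; verify it on your own instance); `is_persistent` is a hypothetical helper, not part of any ModelArts SDK.

```python
from pathlib import Path

# Assumption: the persistent storage of a ModelArts notebook is mounted here.
PERSISTENT_ROOT = Path("/home/ma-user/work")


def is_persistent(path, root=PERSISTENT_ROOT):
    """Return True if `path` is the persistent root itself or lies inside it."""
    p = Path(path).expanduser().resolve()
    r = Path(root).resolve()
    # Compare whole path components, so ".../workspace" does not count as ".../work".
    return p == r or r in p.parents


print(is_persistent("/home/ma-user/work/yolov5"))  # code kept across restarts
print(is_persistent("/home/ma-user/yolov5"))       # outside work: may be lost
```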

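Step 2: Upload the folder and the training data. Uploading follows the same rules as a regular notebook: only files can be uploaded, not folders. Following the official training code, I converted the data into a COCO-format dataset for training. Because there are too many images, they need to be packed into an archive and uploaded via OBS as an intermediate store.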
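A minimal sketch of the packing step using only the Python standard library: zip the COCO-style dataset folder before uploading it to OBS, then extract it on the notebook side. The folder and archive names are placeholders, and how the archive is copied to and from OBS (console upload, OBS Browser, or an SDK) is not shown here.

```python
import shutil
import zipfile
from pathlib import Path


def pack_dataset(src_dir, archive_base):
    """Zip `src_dir` into `<archive_base>.zip` and return the archive path.

    The folder's own name is kept as the top-level entry, so extracting the
    archive recreates e.g. `coco/annotations/...`.
    """
    src = Path(src_dir).resolve()
    return shutil.make_archive(str(archive_base), "zip",
                               root_dir=str(src.parent), base_dir=src.name)


def unpack_dataset(archive_path, dest_dir):
    """Extract the archive into `dest_dir` after it has been copied from OBS."""
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)
```

For example, `pack_dataset("coco", "/tmp/coco_upload")` produces `/tmp/coco_upload.zip`, and `unpack_dataset("/home/ma-user/work/coco_upload.zip", "/home/ma-user/work/dataset")` restores `dataset/coco/...` (all paths here are illustrative).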

1) Configure default_config.yaml for this training run

# Builtin Configurations(DO NOT CHANGE THESE CONFIGURATIONS unless you know exactly what you are doing)
enable_modelarts: False
# Url for modelarts
data_url: ""
train_url: ""
checkpoint_url: ""
# Path for local
output_dir: "/cache"
data_path: "/cache/data"
output_path: "/cache/train"
load_path: "/cache/checkpoint_path"
device_target: "Ascend"
need_modelarts_dataset_unzip: True
modelarts_dataset_unzip_name: "coco"

# ==============================================================================
# Train options
data_dir: "dataset/"
per_batch_size: 32
yolov5_version: "yolov5s"
pretrained_backbone: ""
resume_yolov5: ""
pretrained_checkpoint: ""

lr_scheduler: "cosine_annealing"
lr: 0.013
lr_epochs: "220,250"
lr_gamma: 0.1
eta_min: 0.0
T_max: 300
# Number of training epochs
max_epoch: 3000
warmup_epochs: 20
weight_decay: 0.0005
momentum: 0.9
loss_scale: 1024
label_smooth: 0
label_smooth_factor: 0.1
log_interval: 100
# Checkpoint (ckpt) output location
ckpt_path: "outputs/"
ckpt_interval: 1
is_save_on_master: 1
is_distributed: 0
rank: 0
group_size: 1
need_profiler: 0
training_shape: ""
resize_rate: 10
is_modelArts: 0

# Eval options
pretrained: ""
# Log output location
log_path: "outputs/"
ann_val_file: "annotations/val.json"
eval（num_classes + 5)

input_shape: [[3, 32, 64, 128, 256, 512, 1],
              [3, 48, 96, 192, 384, 768, 2],
              [3, 64, 128, 256, 512, 1024, 3],
              [3, 80, 160, 320, 640, 1280, 4]]

# test_param
test_img_shape: [640, 640]

labels: [ 'seatbelt', 'helmet', 'no_coverall', 'no_helmet', 'coverall', 'no_seatbelt']

coco_ids: [ 1, 2, 3, 4, 5, 6 ]

result_files: './result_Files'

---

# Help description for each configuration
# Train options
data_dir: "Train dataset directory."
per_batch_size: "Batch size for Training."
pretrained_backbone: "The ckpt file of CspDarkNet53."
resume_yolov5: "The ckpt file of YOLOv5, which is used for fine-tuning."
pretrained_checkpoint: "The ckpt file of YOLOv5CspDarkNet53."
lr_scheduler: "Learning rate scheduler, options: exponential, cosine_annealing."
lr: "Learning rate."
lr_epochs: "Epochs at which the lr changes, separated by ','."
lr_gamma: "Decrease lr by a factor of exponential lr_scheduler."
eta_min: "Eta_min in cosine_annealing scheduler."
T_max: "T-max in cosine_annealing scheduler."
max_epoch: "Max epoch num to train the model."
warmup_epochs: "Warmup epochs."
weight_decay: "Weight decay factor."
momentum: "Momentum."
loss_scale: "Static loss scale."
label_smooth: "Whether to use label smooth in CE."
label_smooth_factor: "Smooth strength of original one-hot."
log_interval: "Logging interval steps."
ckpt_path: "Checkpoint save location."
ckpt_interval: "Save checkpoint interval."
is_save_on_master: "Save ckpt on master or all rank, 1 for master, 0 for all ranks."
is_distributed: "Distribute train or not, 1 for yes, 0 for no."
rank: "Local rank of distributed."
group_size: "World size of device."
need_profiler: "Whether use profiler. 0 for no, 1 for yes."
training_shape: "Fix training shape."
resize_rate: "Resize rate for multi-scale training."
ann_file: "path to annotation"
each_multiscale: "Apply multi-scale for each scale"
labels: "the label of train data"
multi_label: "use multi label to nms"
multi_label_thresh: "multi label thresh"

# Eval options
pretrained: "model_path, local pretrained model to load"
log_path: "log save location"
ann_val_file: "path to annotation"

# Export options
device_id: "Device id for export"
batch_size: "batch size for export"
testing_shape: "shape for test"
ckpt_file: "Checkpoint file path for export"
file_name: "output file name for export"
file_format: "file format for export"
result_files: 'path to 310 infer result folder'
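To see what the `cosine_annealing` settings above imply, here is a sketch of a common linear-warmup plus cosine-annealing formula, evaluated per epoch with `lr: 0.013`, `warmup_epochs: 20`, `T_max: 300` and `eta_min: 0.0`. It is an assumption that the repository's scheduler uses exactly this form (it most likely steps per iteration rather than per epoch), so treat the numbers as illustrative.

```python
import math


def warmup_cosine_lr(epoch, base_lr=0.013, warmup_epochs=20, t_max=300, eta_min=0.0):
    """Learning rate at a 0-based epoch: linear warmup, then cosine annealing."""
    if epoch < warmup_epochs:
        # Ramp linearly from base_lr / warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine term: base_lr at epoch 0, eta_min at epoch t_max, base_lr again at 2 * t_max.
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2


for e in (0, 19, 20, 150, 300, 600):
    print(e, round(warmup_cosine_lr(e), 6))
```

Under this formula the learning rate reaches `eta_min` at epoch 300 and climbs back to `lr` by epoch 600, so `max_epoch: 3000` with `T_max: 300` would cycle repeatedly instead of decaying once. If that is not intended, setting `T_max` equal to `max_epoch` is one option worth checking against the repository's own scheduler code.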

Other files also need changes: update the class list and the dataset paths in postprocess.py and src/yolo.py.
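Since the classes appear in several places, a quick consistency check can catch mismatches early. The `(num_classes + 5)` term in the config reflects that, in YOLO-style heads, each anchor predicts 4 box coordinates, 1 objectness score and one score per class; assuming 3 anchors per output scale as in standard YOLOv5, each head needs `3 * (num_classes + 5)` channels. `check_class_config` is a hypothetical helper, and where exactly the repository stores `num_classes` and the head width is not shown in this post.

```python
def check_class_config(labels, coco_ids, anchors_per_scale=3):
    """Validate the class settings and return (num_classes, head_channels).

    head_channels = anchors_per_scale * (4 box values + 1 objectness + num_classes).
    """
    if len(labels) != len(coco_ids):
        raise ValueError(f"{len(labels)} labels but {len(coco_ids)} coco_ids")
    if len(set(labels)) != len(labels):
        raise ValueError("duplicate label names")
    if len(set(coco_ids)) != len(coco_ids):
        raise ValueError("duplicate coco_ids")
    num_classes = len(labels)
    return num_classes, anchors_per_scale * (num_classes + 5)


# Values copied from the default_config.yaml above.
labels = ['seatbelt', 'helmet', 'no_coverall', 'no_helmet', 'coverall', 'no_seatbelt']
coco_ids = [1, 2, 3, 4, 5, 6]
print(check_class_config(labels, coco_ids))  # (6, 33)
```

With the six labels used here, every place that hard-codes the COCO defaults (80 classes, 255 output channels) would need to become 6 and 33 respectively, under the 3-anchor assumption.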

2) Start training

[ma-user outputs]$python3 train.py


During training, you can monitor NPU utilization.
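Besides watching `npu-smi info` by hand, the chip row can be parsed to log AI Core utilization and memory over time. This sketch assumes the table layout printed by npu-smi 21.0.x shown at the top of this post; other driver versions may format the table differently, so the regular expression may need adjusting.

```python
import re
import subprocess

# Chip row of the npu-smi 21.0.x table, e.g.
# "| 0    | 0000:C1:00.0  | 0    758  / 15171    0    / 32768   |"
# Groups: chip id, bus id, AI Core %, memory used/total (MB), HBM used/total (MB).
CHIP_ROW = re.compile(
    r"^\|\s*(\d+)\s*\|\s*([0-9A-Fa-f:.]+)\s*\|"
    r"\s*(\d+)\s+(\d+)\s*/\s*(\d+)\s+(\d+)\s*/\s*(\d+)\s*\|"
)


def parse_npu_smi(text):
    """Return one dict per chip row found in `npu-smi info` output."""
    stats = []
    for line in text.splitlines():
        m = CHIP_ROW.match(line.strip())
        if m:
            chip, bus, aicore, mem_used, mem_total, hbm_used, hbm_total = m.groups()
            stats.append({
                "chip": int(chip),
                "bus_id": bus,
                "aicore_pct": int(aicore),
                "mem_used_mb": int(mem_used),
                "mem_total_mb": int(mem_total),
                "hbm_used_mb": int(hbm_used),
                "hbm_total_mb": int(hbm_total),
            })
    return stats


def read_npu_smi():
    """Run `npu-smi info` and parse it (needs the Ascend driver tools on PATH)."""
    result = subprocess.run(["npu-smi", "info"], capture_output=True, text=True, check=True)
    return parse_npu_smi(result.stdout)
```

Calling `read_npu_smi()` every few seconds from a second terminal and appending the results to a CSV gives a simple utilization log for the whole run.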


Step 3: Evaluate the training results

[ma-user yolov5_self]$python eval.py --data_dir=dataset/ --eval_shape=640 --pretrained="outputs/2022-09-07_time_14_04_50/ckpt_0/0-2_168.ckpt"
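Each training run writes checkpoints into a timestamped folder, so picking the newest one by hand is error-prone. The sketch below assumes MindSpore's `ModelCheckpoint` naming scheme `<prefix>-<epoch>_<step>.ckpt` (as in `0-2_168.ckpt`) and sorts by epoch and step numerically rather than alphabetically, so `0-10_...` ranks above `0-2_...`; `latest_checkpoint` is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumed naming: "<prefix>-<epoch>_<step>.ckpt", e.g. "0-2_168.ckpt".
CKPT_NAME = re.compile(r"^(?P<prefix>.+)-(?P<epoch>\d+)_(?P<step>\d+)\.ckpt$")


def latest_checkpoint(ckpt_dir):
    """Return the path of the checkpoint with the highest (epoch, step), or None."""
    best_key, best_path = None, None
    for path in Path(ckpt_dir).glob("*.ckpt"):
        m = CKPT_NAME.match(path.name)
        if m is None:
            continue  # skip files that do not follow the assumed naming scheme
        key = (int(m["epoch"]), int(m["step"]))
        if best_key is None or key > best_key:
            best_key, best_path = key, path
    return None if best_path is None else str(best_path)


if __name__ == "__main__":
    # Directory taken from the eval command above; adjust it to your own run.
    ckpt = latest_checkpoint("outputs/2022-09-07_time_14_04_50/ckpt_0")
    if ckpt:
        print(f'python eval.py --data_dir=dataset/ --eval_shape=640 --pretrained="{ckpt}"')
```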


Detection results


References

LegnaDoc

models: Models of MindSpore - Gitee.com

From: https://blog.51cto.com/u_12504263/5719075
