Distribute tensorflow model training on a kubernetes cluster

Time: 2024-02-03 18:13:03

Tags: training strat dist kubernetes Distribute worker server tf tensorflow

[ERROR: AttributeError: module 'tensorflow' has no attribute 'app']

(base) maye@maye-Inspiron-5547:~/github_repository/tensorflow_ecosystem/distribution_strategy$ kubectl describe pod dist-strat-example-worker-0-w6rsb
Name: dist-strat-example-worker-0-w6rsb
Namespace: default
Priority: 0
Service Account: default
Node: maye-inspiron-5547/192.168.0.104
Start Time: Sat, 03 Feb 2024 12:56:01 +0800
Labels: job=worker
name=dist-strat-example
task=0
Annotations:
Status: Running
IP: 10.244.0.30
IPs:
IP: 10.244.0.30
Controlled By: ReplicationController/dist-strat-example-worker-0
Containers:
tensorflow:
Container ID: containerd://4d271f040fdfaeebcc6f111fb6fa6666cee129d52cb429a89407c60c3c1180e6
Image: tf_std_server:v1
Image ID: sha256:d39144c35ea9a32641039358493137fdbce32ee5688b2c307cf255d127e6a0ed
Port: 5000/TCP
Host Port: 0/TCP
Command:
/usr/bin/python
/tf_std_server.py

State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       Error
  Exit Code:    1
  Started:      Sat, 03 Feb 2024 15:38:32 +0800
  Finished:     Sat, 03 Feb 2024 15:38:37 +0800
Ready:          False
Restart Count:  36
Environment:
  TF_CONFIG:                       { "cluster": { "worker": ["dist-strat-example-worker-0:5000","dist-strat-example-worker-1:5000"]}, "task": { "type": "worker", "index": "0" } }
  GOOGLE_APPLICATION_CREDENTIALS:  /var/secrets/google/key.json
Mounts:
  /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-b8qlz (ro)

Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-b8qlz:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message

Normal Pulled 59m (x26 over 165m) kubelet Container image "tf_std_server:v1" already present on machine
Warning BackOff 4m55s (x716 over 164m) kubelet Back-off restarting failed container tensorflow in pod dist-strat-example-worker-0-w6rsb_default(a31b45d9-1dbf-43c1-95c6-4f8d8112c5e9)

(base) maye@maye-Inspiron-5547:~/github_repository/tensorflow_ecosystem/distribution_strategy$ kubectl logs dist-strat-example-worker-0-w6rsb
2024-02-03 07:38:33.104872: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
File "/tf_std_server.py", line 35, in <module>
tf.app.run()
^^^^^^
AttributeError: module 'tensorflow' has no attribute 'app'
(base) maye@maye-Inspiron-5547:~/github_repository/tensorflow_ecosystem/distribution_strategy$

[SOLUTION]
This happens because the TensorFlow in use is v2, while tf.app.run() belongs to the TensorFlow v1 API. The app module is no longer exposed as tf.app in TensorFlow v2 (it is still reachable as tf.compat.v1.app).

In effect, tf.app.run() calls main(sys.argv).

In the file tf_std_server.py, add "import sys" and replace tf.app.run() with main(sys.argv).
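
The change to the entry point can be sketched as follows. This is only a minimal illustration, not the actual contents of tf_std_server.py: the body of main() here is a hypothetical placeholder, and the real script keeps its own server logic inside main().

```python
import sys


def main(argv):
    """Hypothetical stand-in for the existing main() in tf_std_server.py."""
    # argv[0] is the script path, exactly as in sys.argv.
    print("entry point called with", len(argv) - 1, "extra argument(s)")
    return 0


if __name__ == "__main__":
    # TensorFlow v1 style, which fails under v2 with the AttributeError above:
    #     tf.app.run()
    # Replacement that does not depend on tf.app:
    main(sys.argv)
```

If the script relies on command-line flags being parsed before main() runs, calling tf.compat.v1.app.run() instead may be closer to the original behaviour.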
Then rebuild the container image of the TensorFlow standard server:

$ cd <directory-which-contains-Dockerfile.tf_std_server>
$ nerdctl build --no-cache -t tf_std_server:v1 -f Dockerfile.tf_std_server . --namespace k8s.io

Attention:

The "." specifies the current directory as the build context, i.e. nerdctl build looks for the files it needs in this directory. If no context is specified, the build fails with the error: "FATA[0004] context needs to be specified".
If no namespace is specified, the built image ends up in the namespace "default", whereas crictl (the Kubernetes container runtime interface CLI) can only see images in the namespace "k8s.io".

Note:

"exit code: 1": the process itself exited with a general error, for example an uncaught exception in the application code (as with the AttributeError above).
"exit code: 137": the process was terminated by SIGKILL (128 + 9). In Kubernetes, when kubelet needs to stop a container it asks containerd to do so; the container process receives SIGTERM first, and SIGKILL if it has not exited within the grace period. The Linux kernel's OOM killer also sends SIGKILL when memory runs out, for example when a container exceeds its memory limit.
"exit code: 139": segmentation fault (SIGSEGV, 128 + 11). The process tried to access memory that it is not allowed to access, such as an out-of-bounds address.
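
The exit codes above follow the common convention that a process killed by signal N is reported with exit code 128 + N. A small sketch of decoding such codes, assuming a Linux host (signal numbers differ on other platforms); describe_exit_code is a hypothetical helper name, not part of kubectl or Kubernetes:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a container exit code using the 128 + signal-number convention."""
    if code == 0:
        return "success"
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            return f"terminated by {signal.Signals(signum).name}"
        except ValueError:
            return f"terminated by unknown signal {signum}"
    # Codes 1..128 are chosen by the program itself.
    return "application error"


for code in (1, 137, 139):
    print(code, "->", describe_exit_code(code))
```

On Linux this maps 137 to SIGKILL and 139 to SIGSEGV, while 1 is reported as an ordinary application error.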

From: https://www.cnblogs.com/zhenxia-jiuyou/p/18005018
