Debug: tf distribute strategy parameter server: NOT_FOUND: No such file or directory

Date: 2024-02-14 18:00:13

[ERROR: NOT_FOUND: /tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory]

Log of pod tfx-trainer-component:

ERROR:tensorflow: /job:worker/task:0 encountered the following error when processing closure: NotFoundError():Graph execution error:

2 root error(s) found.
  (0) NOT_FOUND:   /tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
	 [[RemoteCall]]
	 [[IteratorGetNextAsOptional]]
Additional GRPC error information from remote target /job:worker/replica:0/task:0 while calling /tensorflow.WorkerService/RecvTensor:
:{"created":"@1707896978.099891609","description":"Error received from peer ipv4:10.102.74.8:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"/tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory\n\t [[{{node MultiDeviceIteratorGetNextFromShard}}]]\n\t [[RemoteCall]]\n\t [[IteratorGetNextAsOptional]]","grpc_status":5}
	 [[Cast_27/_24]]
Additional GRPC error information from remote target /job:ps/replica:0/task:0/device:CPU:0 while calling /tensorflow.eager.EagerService/RunComponentFunction:
:{"created":"@1707896978.103488999","description":"Error received from peer ipv4:10.96.200.160:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":" /tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory\n\t [[{{node MultiDeviceIteratorGetNextFromShard}}]]\n\t [[RemoteCall]]\n\t [[IteratorGetNextAsOptional]]\nAdditional GRPC error information from remote target /job:worker/replica:0/task:0 while calling /tensorflow.WorkerService/RecvTensor:\n:{"created":"@1707896978.099891609","description":"Error received from peer ipv4:10.102.74.8:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"/tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory\n\t [[{{node MultiDeviceIteratorGetNextFromShard}}]]\n\t [[RemoteCall]]\n\t [[IteratorGetNextAsOptional]]","grpc_status":5}\n\t [[Cast_27/_24]]","grpc_status":5}
  (1) NOT_FOUND:  /tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
	 [[RemoteCall]]
	 [[IteratorGetNextAsOptional]]
0 successful operations.
0 derived errors ignored.

[SOLUTION]

This error occurs because the pipeline_root directory is not mounted into the file system of the worker container, so the worker cannot read the TFRecord files. Mount it in the worker's YAML definition (see volumeMounts and volumes in the Deployment):

# YAML definition of the worker Service and Deployment
kind: Service
apiVersion: v1
metadata:
  name: dist-strat-example-worker-0
  
  namespace: kubeflow
  
spec:

  type: LoadBalancer

  selector:
    app: dist-strat-example-worker-0
       
  ports:
  - port: 5000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: dist-strat-example-worker-0

  name: dist-strat-example-worker-0
  
  namespace: kubeflow

spec:
  replicas: 1
  
  selector:
    matchLabels:
      app: dist-strat-example-worker-0

      
  template:
    metadata:
      labels:
        app: dist-strat-example-worker-0
    
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - maye-inspiron-5547
      
      containers:

      - name: tensorflow
        image: tf_std_server:v1
        resources:
          limits:
            #nvidia.com/gpu: 2

        env:

        - name: TF_CONFIG
          value: "{
  \"cluster\": {
    \"worker\": [\"dist-strat-example-worker-0:5000\",\"dist-strat-example-worker-1:5000\"], 
    \"ps\": [\"dist-strat-example-ps-0:5000\"]},
  \"task\": {
    \"type\": \"worker\",
    \"index\": \"0\"
  }
}"

        #- name: GOOGLE_APPLICATION_CREDENTIALS
        #  value: "/var/secrets/google/key.json"
        ports:
        - containerPort: 5000

        command:
        - "/usr/bin/python"
        - "/tf_std_server.py"
        - ""
        
        
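        volumeMounts:
        # Mount the shared volume at the path under which pipeline_root lives,
        # so the worker can read the examples written by ImportExampleGen.
        - mountPath: /tfx/tfx_pv
          name: tfx-pv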
        
        #- name: credential
        #  mountPath: /var/secrets/google
        
        
      volumes:
      - name: tfx-pv
        persistentVolumeClaim:
          claimName: tfx-pv-claim


      #- name: credential
      #  secret:
      #    secretName: credential
---
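
Before rerunning the pipeline, it can help to confirm from inside the worker container that the mounted path is actually visible. The following is a minimal sketch, not part of the original setup: the helper name missing_paths is hypothetical, and it only checks path existence with the Python standard library.

```python
import os


def missing_paths(paths):
    """Return the paths that do not exist on this container's file system."""
    return [p for p in paths if not os.path.exists(p)]


# Example (assumption): check the mount point used in the manifest above.
# An empty list means the path is visible inside this container.
print(missing_paths(["/tfx/tfx_pv"]))
```

One way to use it is to run the snippet with the Python interpreter inside the worker pod (for example through kubectl exec). If the mount point is reported as missing, the volume is not mounted in that container.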

Note:

  1. A persistentVolumeClaim can only be used by resources in the same namespace. In this example, the persistentVolumeClaim "tfx-pv-claim" is in namespace "kubeflow", so the worker Service and the worker Deployment must also specify namespace "kubeflow". Otherwise the pod cannot be scheduled and the following event is reported:
...
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  83s   default-scheduler  0/2 nodes are available: persistentvolumeclaim "tfx-pv-claim" not found. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
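
The namespace rule above can be checked before applying a manifest. The sketch below assumes the Deployment manifest has already been parsed into a Python dict (for example by a YAML parser) and that you supply a mapping from claim name to the namespace where the claim lives. The function name and the mapping are hypothetical, and the fallback to "default" assumes no other namespace is set in the current kubectl context.

```python
def claims_outside_namespace(deployment, claim_namespaces):
    """Return claim names used by the Deployment that live in another namespace.

    deployment: a Deployment manifest parsed into a dict.
    claim_namespaces: hypothetical mapping of claim name -> namespace of the claim.
    """
    # Assumption: a manifest without metadata.namespace ends up in "default".
    namespace = deployment.get("metadata", {}).get("namespace", "default")
    volumes = deployment["spec"]["template"]["spec"].get("volumes", [])
    mismatched = []
    for volume in volumes:
        claim = volume.get("persistentVolumeClaim")
        if claim is None:
            continue  # not a PVC-backed volume
        name = claim["claimName"]
        if claim_namespaces.get(name) != namespace:
            mismatched.append(name)
    return mismatched


# Example mirroring this post: the claim lives in "kubeflow".
worker = {
    "metadata": {"name": "dist-strat-example-worker-0", "namespace": "kubeflow"},
    "spec": {"template": {"spec": {"volumes": [
        {"name": "tfx-pv", "persistentVolumeClaim": {"claimName": "tfx-pv-claim"}},
    ]}}},
}
print(claims_outside_namespace(worker, {"tfx-pv-claim": "kubeflow"}))
```

A non-empty result lists the claims that the pod would not be able to find in its own namespace.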

From: https://www.cnblogs.com/zhenxia-jiuyou/p/18015391
