[ERROR] NOT_FOUND: /tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory

Log of pod tfx-trainer-component:
ERROR:tensorflow: /job:worker/task:0 encountered the following error when processing closure: NotFoundError():Graph execution error:
2 root error(s) found.
(0) NOT_FOUND: /tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
Additional GRPC error information from remote target /job:worker/replica:0/task:0 while calling /tensorflow.WorkerService/RecvTensor:
:{"created":"@1707896978.099891609","description":"Error received from peer ipv4:10.102.74.8:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"/tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory\n\t [[{{node MultiDeviceIteratorGetNextFromShard}}]]\n\t [[RemoteCall]]\n\t [[IteratorGetNextAsOptional]]","grpc_status":5}
[[Cast_27/_24]]
Additional GRPC error information from remote target /job:ps/replica:0/task:0/device:CPU:0 while calling /tensorflow.eager.EagerService/RunComponentFunction:
:{"created":"@1707896978.103488999","description":"Error received from peer ipv4:10.96.200.160:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":" /tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory\n\t [[{{node MultiDeviceIteratorGetNextFromShard}}]]\n\t [[RemoteCall]]\n\t [[IteratorGetNextAsOptional]]\nAdditional GRPC error information from remote target /job:worker/replica:0/task:0 while calling /tensorflow.WorkerService/RecvTensor:\n:{"created":"@1707896978.099891609","description":"Error received from peer ipv4:10.102.74.8:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"/tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory\n\t [[{{node MultiDeviceIteratorGetNextFromShard}}]]\n\t [[RemoteCall]]\n\t [[IteratorGetNextAsOptional]]","grpc_status":5}\n\t [[Cast_27/_24]]","grpc_status":5}
(1) NOT_FOUND: /tfx/tfx_pv/pipelines/detect_anomolies_on_wafer_tfdv_schema/ImportExampleGen/examples/67/Split-train/data_tfrecord-00000-of-00001.gz; No such file or directory
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
0 successful operations.
0 derived errors ignored.
[SOLUTION]
This error occurs because the pipeline_root directory has not been mounted into the worker container's file system. Mount it in the definition YAML of the worker (the volumeMounts/volumes entries below):
# definition yaml file of worker service
kind: Service
apiVersion: v1
metadata:
  name: dist-strat-example-worker-0
  namespace: kubeflow
spec:
  type: LoadBalancer
  selector:
    app: dist-strat-example-worker-0
  ports:
  - port: 5000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: dist-strat-example-worker-0
  name: dist-strat-example-worker-0
  namespace: kubeflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dist-strat-example-worker-0
  template:
    metadata:
      labels:
        app: dist-strat-example-worker-0
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - maye-inspiron-5547
      containers:
      - name: tensorflow
        image: tf_std_server:v1
        resources:
          limits:
            #nvidia.com/gpu: 2
        env:
        - name: TF_CONFIG
          value: "{
            \"cluster\": {
              \"worker\": [\"dist-strat-example-worker-0:5000\",\"dist-strat-example-worker-1:5000\"],
              \"ps\": [\"dist-strat-example-ps-0:5000\"]},
            \"task\": {
              \"type\": \"worker\",
              \"index\": \"0\"
            }
          }"
        #- name: GOOGLE_APPLICATION_CREDENTIALS
        #  value: "/var/secrets/google/key.json"
        ports:
        - containerPort: 5000
        command:
        - "/usr/bin/python"
        - "/tf_std_server.py"
        - ""
        volumeMounts:
        - mountPath: /tfx/tfx_pv
          name: tfx-pv
        #- name: credential
        #  mountPath: /var/secrets/google
      volumes:
      - name: tfx-pv
        persistentVolumeClaim:
          claimName: tfx-pv-claim
      #- name: credential
      #  secret:
      #    secretName: credential
---
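The TF_CONFIG value injected by the Deployment above tells each pod its role in the cluster. The contents of tf_std_server.py are not shown in this post, so the following is only a sketch of how such a server script (or tf.distribute's TFConfigClusterResolver) typically reads the variable:

```python
import json
import os

# Assumption: TF_CONFIG holds the same JSON as in the Deployment manifest
# above. Here we set it manually only so the sketch runs standalone.
os.environ.setdefault("TF_CONFIG", json.dumps({
    "cluster": {
        "worker": ["dist-strat-example-worker-0:5000",
                   "dist-strat-example-worker-1:5000"],
        "ps": ["dist-strat-example-ps-0:5000"],
    },
    "task": {"type": "worker", "index": "0"},
}))

tf_config = json.loads(os.environ["TF_CONFIG"])
cluster = tf_config["cluster"]   # Service names double as stable DNS addresses
task = tf_config["task"]         # this pod is worker 0

# The address this pod should bind its server to:
own_address = cluster[task["type"]][int(task["index"])]
print(own_address)  # dist-strat-example-worker-0:5000
```

Because every member of the cluster resolves the same ImportExampleGen output path, each worker pod must mount the same tfx-pv claim at the same mountPath; a pod without the mount fails with exactly the NOT_FOUND error above.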
Attention:
- A persistentVolumeClaim can only be referenced by resources in the same namespace. In this example, persistentVolumeClaim "tfx-pv-claim" is in namespace "kubeflow", so the worker Service and Deployment must also specify namespace "kubeflow"; otherwise scheduling fails with an error like:
...
Events:
  Type     Reason            Age  From               Message
  ----     ------            ---  ----               -------
  Warning  FailedScheduling  83s  default-scheduler  0/2 nodes are available: persistentvolumeclaim "tfx-pv-claim" not found. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..
Tags: pv, No, grpc, distribute, worker, server, tfx, file, example
From: https://www.cnblogs.com/zhenxia-jiuyou/p/18015391