Tags: trainer, OutOfRangeError, End, lib, py, usr, File, line, python3.8

[ERROR: tf distribute strategy parameter server: tfx component trainer: OutOfRangeError(), Node: 'cond/IteratorGetNext' End of sequence]

Error log of pod tfx-component-trainer:

2024-02-14 09:43:48.571820: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:58] Ignoring an error encountered when deleting remote tensors handles: INVALID_ARGUMENT: Unable to find the relevant tensor remote_handle: Op ID: 5019, Output num: 7
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1707903828.498704799","description":"Error received from peer ipv4:10.105.206.29:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 5019, Output num: 7","grpc_status":3} [type.googleapis.com/tensorflow.core.platform.ErrorSourceProto='\x08\x05']
ERROR:tensorflow: /job:worker/task:1 encountered the following error when processing closure: OutOfRangeError():Graph execution error:

Detected at node 'cond/IteratorGetNext' defined at (most recent call last):
    File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 510, in <module>
      main(sys.argv[1:])
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 502, in main
      execution_info = component_launcher.launch()
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 574, in launch
      executor_output = self._run_executor(execution_info)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 449, in _run_executor
      executor_output = self._executor_operator.run_executor(execution_info)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 135, in run_executor
      return run_with_executor(execution_info, executor)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 58, in run_with_executor
      result = executor.Do(execution_info.input_dict, output_dict,
    File "/usr/local/lib/python3.8/dist-packages/tfx/components/trainer/executor.py", line 178, in Do
      run_fn(fn_args)
    File "/tmp/tmpss6gfjtk/detect_anomalies_in_wafer_trainer.py", line 316, in run_fn
      trainer_train_history = model.fit(
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1729, in fit
      val_logs = self.evaluate(
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 2072, in evaluate
      tmp_logs = self.test_function(iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1861, in <lambda>
      lambda it: self._cluster_coordinator.schedule(
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1852, in test_function
      return step_function(self, iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1835, in step_function
      data = next(iterator)
Node: 'cond/IteratorGetNext'
Detected at node 'cond/IteratorGetNext' defined at (most recent call last):
    File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 510, in <module>
      main(sys.argv[1:])
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 502, in main
      execution_info = component_launcher.launch()
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 574, in launch
      executor_output = self._run_executor(execution_info)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 449, in _run_executor
      executor_output = self._executor_operator.run_executor(execution_info)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 135, in run_executor
      return run_with_executor(execution_info, executor)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 58, in run_with_executor
      result = executor.Do(execution_info.input_dict, output_dict,
    File "/usr/local/lib/python3.8/dist-packages/tfx/components/trainer/executor.py", line 178, in Do
      run_fn(fn_args)
    File "/tmp/tmpss6gfjtk/detect_anomalies_in_wafer_trainer.py", line 316, in run_fn
      trainer_train_history = model.fit(
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1729, in fit
      val_logs = self.evaluate(
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 2072, in evaluate
      tmp_logs = self.test_function(iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1861, in <lambda>
      lambda it: self._cluster_coordinator.schedule(
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1852, in test_function
      return step_function(self, iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1835, in step_function
      data = next(iterator)
Node: 'cond/IteratorGetNext'
End of sequence
	 [[{{node cond/IteratorGetNext}}]]
Additional GRPC error information from remote target /job:worker/replica:0/task:1:
:{"created":"@1707903828.595072820","description":"Error received from peer ipv4:10.102.137.138:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":" End of sequence\n\t [[{{node cond/IteratorGetNext}}]]","grpc_status":11} [Op:__inference_test_function_29421]
ERROR:tensorflow: /job:worker/task:1 encountered the following error when processing closure: OutOfRangeError():Graph execution error:

Detected at node 'cond/IteratorGetNext' defined at (most recent call last):
    File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 510, in <module>
      main(sys.argv[1:])
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 502, in main
      execution_info = component_launcher.launch()
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 574, in launch
      executor_output = self._run_executor(execution_info)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 449, in _run_executor
      executor_output = self._executor_operator.run_executor(execution_info)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 135, in run_executor
      return run_with_executor(execution_info, executor)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 58, in run_with_executor
      result = executor.Do(execution_info.input_dict, output_dict,
    File "/usr/local/lib/python3.8/dist-packages/tfx/components/trainer/executor.py", line 178, in Do
      run_fn(fn_args)
    File "/tmp/tmpss6gfjtk/detect_anomalies_in_wafer_trainer.py", line 316, in run_fn
      trainer_train_history = model.fit(
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1729, in fit
      val_logs = self.evaluate(
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 2072, in evaluate
      tmp_logs = self.test_function(iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1861, in <lambda>
      lambda it: self._cluster_coordinator.schedule(
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1852, in test_function
      return step_function(self, iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1835, in step_function
      data = next(iterator)
Node: 'cond/IteratorGetNext'
Detected at node 'cond/IteratorGetNext' defined at (most recent call last):
    File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 510, in <module>
      main(sys.argv[1:])
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 502, in main
      execution_info = component_launcher.launch()
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 574, in launch
      executor_output = self._run_executor(execution_info)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 449, in _run_executor
      executor_output = self._executor_operator.run_executor(execution_info)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 135, in run_executor
      return run_with_executor(execution_info, executor)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 58, in run_with_executor
      result = executor.Do(execution_info.input_dict, output_dict,
    File "/usr/local/lib/python3.8/dist-packages/tfx/components/trainer/executor.py", line 178, in Do
      run_fn(fn_args)
    File "/tmp/tmpss6gfjtk/detect_anomalies_in_wafer_trainer.py", line 316, in run_fn
      trainer_train_history = model.fit(
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1729, in fit
      val_logs = self.evaluate(
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 2072, in evaluate
      tmp_logs = self.test_function(iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1861, in <lambda>
      lambda it: self._cluster_coordinator.schedule(
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1852, in test_function
      return step_function(self, iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1835, in step_function
      data = next(iterator)
Node: 'cond/IteratorGetNext'
End of sequence
	 [[{{node cond/IteratorGetNext}}]]
Additional GRPC error information from remote target /job:worker/replica:0/task:1:
:{"created":"@1707903828.595072820","description":"Error received from peer ipv4:10.102.137.138:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":" End of sequence\n\t [[{{node cond/IteratorGetNext}}]]","grpc_status":11} [Op:__inference_test_function_29421]
ERROR:tensorflow:Start cancelling closures due to error OutOfRangeError(): Graph execution error:

[SOLUTION]

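This error occurs because the validation dataset is finite: the IteratorGetNext operation reaches the end of the validation dataset and raises OutOfRangeError ("End of sequence").
The solution is to repeat the validation dataset indefinitely and pass validation_steps to model.fit: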

# Repeat the validation dataset indefinitely so its iterator never reaches the end.
validation_dataset = validation_dataset.repeat()
# Specify validation_steps explicitly; one step = one batch.
model.fit(..., validation_data=validation_dataset, validation_steps=<an integer>)
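
To choose a value for validation_steps, one option is to cover the validation set roughly once per evaluation pass. The sketch below is a pure-Python analogy, not TensorFlow code: the helper names compute_validation_steps and run_steps and the example sizes are hypothetical, and itertools.cycle merely stands in for what Dataset.repeat() does. It shows why a finite iterator fails when more batches are requested than it holds, while a repeated one does not.

```python
import itertools
import math


def compute_validation_steps(num_examples: int, batch_size: int) -> int:
    """Number of batches needed to cover the validation set once."""
    if num_examples <= 0 or batch_size <= 0:
        raise ValueError("num_examples and batch_size must be positive")
    return math.ceil(num_examples / batch_size)


def run_steps(batches, steps: int) -> int:
    """Pull `steps` batches from an iterator, as an evaluation loop would."""
    consumed = 0
    for _ in range(steps):
        next(batches)  # raises StopIteration once the iterator is exhausted
        consumed += 1
    return consumed


# Illustrative sizes only (assumed, not taken from the original pipeline).
finite_batches = ["batch0", "batch1", "batch2"]
steps = compute_validation_steps(num_examples=384, batch_size=128)

# Finite iterator: requesting one batch more than it holds fails,
# which is the pure-Python counterpart of "End of sequence".
try:
    run_steps(iter(finite_batches), steps + 1)
    finite_ok = True
except StopIteration:
    finite_ok = False

# Repeated iterator: any number of steps can be served.
repeated_ok = run_steps(itertools.cycle(finite_batches), steps + 1) == steps + 1
```

Because the repeated dataset never ends, validation_steps is what bounds each evaluation pass, so it has to be set explicitly.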

Log of pod tfx-component-trainer after the fix:

Epoch 1/50
/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py:467: UserWarning: To make it possible to preserve tf.data options across serialization boundaries, their implementation has moved to be part of the TensorFlow graph. As a consequence, the options value is in general no longer known at graph construction time. Invoking this method in graph mode retains the legacy behavior of the original implementation, but note that the returned value might not reflect the actual value of the options.
  warnings.warn("To make it possible to preserve tf.data options across "
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
21/21 - 31s - loss: 0.7040 - cross entropy: 0.7036 - tp: 882.0000 - fp: 820.0000 - tn: 547.0000 - fn: 439.0000 - precision: 0.5182 - recall: 0.6677 - auc: 0.5415 - prc: 0.5361 - val_loss: 0.6753 - val_cross entropy: 0.6749 - val_tp: 30.0000 - val_fp: 181.0000 - val_tn: 166.0000 - val_fn: 7.0000 - val_precision: 0.1422 - val_recall: 0.8108 - val_auc: 0.7620 - val_prc: 0.3268 - 31s/epoch - 1s/step
Epoch 2/50
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
21/21 - 14s - loss: 0.5821 - cross entropy: 0.5817 - tp: 1036.0000 - fp: 439.0000 - tn: 926.0000 - fn: 287.0000 - precision: 0.7024 - recall: 0.7831 - auc: 0.8096 - prc: 0.8139 - val_loss: 0.5677 - val_cross entropy: 0.5673 - val_tp: 25.0000 - val_fp: 82.0000 - val_tn: 271.0000 - val_fn: 6.0000 - val_precision: 0.2336 - val_recall: 0.8065 - val_auc: 0.8646 - val_prc: 0.3799 - 14s/epoch - 657ms/step
Epoch 3/50
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.

From: https://www.cnblogs.com/zhenxia-jiuyou/p/18015849
