Debug: tf distribute strategy parameter server: tfx component trainer: OutOfRangeError(), End of seq

Date: 2024-02-14 23:44:16

[ERROR: tf distribute strategy parameter server: tfx component trainer: OutOfRangeError(), Node: 'cond/IteratorGetNext' End of sequence]

Log of pod tfx-component-trainer (failing run):

2024-02-14 09:43:48.571820: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:58] Ignoring an error encountered when deleting remote tensors handles: INVALID_ARGUMENT: Unable to find the relevant tensor remote_handle: Op ID: 5019, Output num: 7
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"@1707903828.498704799","description":"Error received from peer ipv4:10.105.206.29:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 5019, Output num: 7","grpc_status":3} [type.googleapis.com/tensorflow.core.platform.ErrorSourceProto='\x08\x05']
ERROR:tensorflow: /job:worker/task:1 encountered the following error when processing closure: OutOfRangeError():Graph execution error:

Detected at node 'cond/IteratorGetNext' defined at (most recent call last):
    File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 510, in <module>
      main(sys.argv[1:])
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 502, in main
      execution_info = component_launcher.launch()
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 574, in launch
      executor_output = self._run_executor(execution_info)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 449, in _run_executor
      executor_output = self._executor_operator.run_executor(execution_info)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 135, in run_executor
      return run_with_executor(execution_info, executor)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 58, in run_with_executor
      result = executor.Do(execution_info.input_dict, output_dict,
    File "/usr/local/lib/python3.8/dist-packages/tfx/components/trainer/executor.py", line 178, in Do
      run_fn(fn_args)
    File "/tmp/tmpss6gfjtk/detect_anomalies_in_wafer_trainer.py", line 316, in run_fn
      trainer_train_history = model.fit(
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1729, in fit
      val_logs = self.evaluate(
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 2072, in evaluate
      tmp_logs = self.test_function(iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1861, in <lambda>
      lambda it: self._cluster_coordinator.schedule(
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1852, in test_function
      return step_function(self, iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1835, in step_function
      data = next(iterator)
Node: 'cond/IteratorGetNext'
Detected at node 'cond/IteratorGetNext' defined at (most recent call last):
    File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 510, in <module>
      main(sys.argv[1:])
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 502, in main
      execution_info = component_launcher.launch()
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 574, in launch
      executor_output = self._run_executor(execution_info)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/launcher.py", line 449, in _run_executor
      executor_output = self._executor_operator.run_executor(execution_info)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 135, in run_executor
      return run_with_executor(execution_info, executor)
    File "/usr/local/lib/python3.8/dist-packages/tfx/orchestration/portable/python_executor_operator.py", line 58, in run_with_executor
      result = executor.Do(execution_info.input_dict, output_dict,
    File "/usr/local/lib/python3.8/dist-packages/tfx/components/trainer/executor.py", line 178, in Do
      run_fn(fn_args)
    File "/tmp/tmpss6gfjtk/detect_anomalies_in_wafer_trainer.py", line 316, in run_fn
      trainer_train_history = model.fit(
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1729, in fit
      val_logs = self.evaluate(
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 2072, in evaluate
      tmp_logs = self.test_function(iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1861, in <lambda>
      lambda it: self._cluster_coordinator.schedule(
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1852, in test_function
      return step_function(self, iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1835, in step_function
      data = next(iterator)
Node: 'cond/IteratorGetNext'
End of sequence
	 [[{{node cond/IteratorGetNext}}]]
Additional GRPC error information from remote target /job:worker/replica:0/task:1:
:{"created":"@1707903828.595072820","description":"Error received from peer ipv4:10.102.137.138:5000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":" End of sequence\n\t [[{{node cond/IteratorGetNext}}]]","grpc_status":11} [Op:__inference_test_function_29421]
ERROR:tensorflow:Start cancelling closures due to error OutOfRangeError(): Graph execution error:

[SOLUTION]

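This error occurs because the validation dataset is finite: during the validation pass inside model.fit, the IteratorGetNext op on a worker reaches the end of the validation dataset and raises OutOfRangeError ("End of sequence").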
The solution is:

# repeat the validation dataset indefinitely
validation_dataset = validation_dataset.repeat()
# pass the repeated dataset and specify validation_steps; one step = one batch
model.fit(validation_data=validation_dataset, validation_steps=<an integer>, ...)
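
The effect of the fix can be illustrated without TensorFlow. The sketch below is a plain-Python analogy, not the Keras implementation: a finite list of batches stands in for the validation dataset, itertools.cycle stands in for Dataset.repeat(), and the hypothetical helpers steps_for and run_eval are introduced only for illustration. The example sizes are assumptions and are not taken from the original pipeline.

```python
import itertools
import math


def steps_for(num_examples: int, batch_size: int) -> int:
    """Batches needed to see every example once (one step = one batch)."""
    return math.ceil(num_examples / batch_size)


def run_eval(batches, steps: int) -> int:
    """Pull exactly `steps` batches from an iterable, like one validation pass."""
    it = iter(batches)
    consumed = 0
    for _ in range(steps):
        next(it)  # raises StopIteration once a finite source is exhausted
        consumed += 1
    return consumed


# Assumed example sizes (hypothetical).
num_examples, batch_size = 10, 4
finite = [
    list(range(start, min(start + batch_size, num_examples)))
    for start in range(0, num_examples, batch_size)
]
validation_steps = steps_for(num_examples, batch_size)

# A finite source fails as soon as more steps are requested than it holds:
# the plain-Python counterpart of "End of sequence".
try:
    run_eval(finite, validation_steps + 1)
    exhausted = False
except StopIteration:
    exhausted = True

# A repeated source never runs out; the step count alone ends the pass.
repeated_ok = run_eval(itertools.cycle(finite), validation_steps + 1)
```

With a repeated dataset the number of batches per validation pass is decided only by validation_steps, so it is worth choosing a value that covers the validation data roughly once, for example the ceiling of the example count divided by the batch size.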

Log of pod tfx-component-trainer after the fix:

Epoch 1/50
/usr/local/lib/python3.8/dist-packages/tensorflow/python/data/ops/dataset_ops.py:467: UserWarning: To make it possible to preserve tf.data options across serialization boundaries, their implementation has moved to be part of the TensorFlow graph. As a consequence, the options value is in general no longer known at graph construction time. Invoking this method in graph mode retains the legacy behavior of the original implementation, but note that the returned value might not reflect the actual value of the options.
  warnings.warn("To make it possible to preserve tf.data options across "
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Reduce to /device:CPU:0 then broadcast to ('/replica:0/device:CPU:0',).
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
21/21 - 31s - loss: 0.7040 - cross entropy: 0.7036 - tp: 882.0000 - fp: 820.0000 - tn: 547.0000 - fn: 439.0000 - precision: 0.5182 - recall: 0.6677 - auc: 0.5415 - prc: 0.5361 - val_loss: 0.6753 - val_cross entropy: 0.6749 - val_tp: 30.0000 - val_fp: 181.0000 - val_tn: 166.0000 - val_fn: 7.0000 - val_precision: 0.1422 - val_recall: 0.8108 - val_auc: 0.7620 - val_prc: 0.3268 - 31s/epoch - 1s/step
Epoch 2/50
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
21/21 - 14s - loss: 0.5821 - cross entropy: 0.5817 - tp: 1036.0000 - fp: 439.0000 - tn: 926.0000 - fn: 287.0000 - precision: 0.7024 - recall: 0.7831 - auc: 0.8096 - prc: 0.8139 - val_loss: 0.5677 - val_cross entropy: 0.5673 - val_tp: 25.0000 - val_fp: 82.0000 - val_tn: 271.0000 - val_fn: 6.0000 - val_precision: 0.2336 - val_recall: 0.8065 - val_auc: 0.8646 - val_prc: 0.3799 - 14s/epoch - 657ms/step
Epoch 3/50
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
INFO:tensorflow:Waiting for all global closures to be finished.
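
As a quick sanity check on the log above, the validation metrics can be recomputed from the confusion-matrix counts. The sketch below copies the val_tp, val_fp, val_tn and val_fn values from the "Epoch 1/50" and "Epoch 2/50" lines; the helper names are hypothetical. Both epochs sum to the same number of validation examples, which is consistent with a fixed validation_steps being applied to the repeated dataset (the batch size itself is not stated in the log).

```python
def precision(tp: float, fp: float) -> float:
    """Fraction of predicted positives that are true positives."""
    return tp / (tp + fp)


def recall(tp: float, fn: float) -> float:
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn)


# Validation counts copied from the epoch 1 and epoch 2 log lines.
epochs = {
    1: {"tp": 30.0, "fp": 181.0, "tn": 166.0, "fn": 7.0},
    2: {"tp": 25.0, "fp": 82.0, "tn": 271.0, "fn": 6.0},
}

summary = {}
for epoch, c in epochs.items():
    summary[epoch] = {
        "precision": round(precision(c["tp"], c["fp"]), 4),
        "recall": round(recall(c["tp"], c["fn"]), 4),
        "examples": int(sum(c.values())),
    }

# Every validation pass should see the same number of examples
# when validation_steps and the batch size are fixed.
same_size = len({s["examples"] for s in summary.values()}) == 1
```

The recomputed values match the val_precision and val_recall figures printed by Keras for both epochs.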

From: https://www.cnblogs.com/zhenxia-jiuyou/p/18015849