
Debug: tf distribute strategy parameter server: stuck at "INFO:tensorflow:ParameterServerStrate

Date: 2024-02-14 16:22:06

[ERROR: stuck at "INFO:tensorflow:ParameterServerStrategyV2 is now connecting to cluster with cluster_spec: ClusterSpec({'ps': ['dist-strat-example-ps-0:5000'], 'worker': ['dist-strat-example-worker-0:5000', 'dist-strat-example-worker-1:5000']})"]

# service dist-strat-example-ps-0 definition yaml file
---
kind: Service
apiVersion: v1
metadata:
  name: dist-strat-example-ps-0
spec:
  type: LoadBalancer
  
  selector:
    app: dist-strat-example-ps-0  
  
  ports:
  - port: 5000
---

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: dist-strat-example-ps-0

  name: dist-strat-example-ps-0

spec:
        
  replicas: 1
  
  selector:
    matchLabels:
      app: dist-strat-example-ps-0  
  
  
  template:
    metadata:
      labels:
        app: dist-strat-example-ps-0 
  
  
    spec:

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - maye-inspiron-5547   

    
      containers:

      - name: tensorflow
        image: tf_std_server:v1
        resources:
          limits:
            #nvidia.com/gpu: 2

        env:

        - name: TF_CONFIG
          value: "{
  \"cluster\": {
    \"worker\": [\"dist-strat-example-worker-0:5000\",\"dist-strat-example-worker-1:5000\"],
    \"ps\": [\"dist-strat-example-ps-0:5000\"]},
  \"task\": {
    \"type\": \"ps\",
    \"index\": \"0\"
  }
}"

        #- name: GOOGLE_APPLICATION_CREDENTIALS
        #  value: "/var/secrets/google/key.json"
        ports:
        - containerPort: 5000

        command:
        - "/usr/bin/python"
        - "/tf_std_server.py"
        - ""
        #volumeMounts:
        #- name: credential
        #  mountPath: /var/secrets/google
      #volumes:
      #- name: credential
      #  secret:
      #    secretName: credential
---
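The TF_CONFIG value in the manifest above is a hand-escaped JSON string, which is easy to get wrong. A minimal sketch of generating the same structure with `json.dumps` follows. The helper name `build_tf_config` is hypothetical and not part of the original setup. The sketch writes the task index as an integer, as TensorFlow's documentation examples do, whereas the manifest above passes it as the string "0".

```python
import json


def build_tf_config(workers, ps, task_type, task_index):
    """Return a TF_CONFIG JSON string for one task of the cluster.

    Hypothetical helper: `workers` and `ps` are lists of "host:port"
    strings, `task_type` is "worker" or "ps", and `task_index` is the
    position of this task in the corresponding list.
    """
    cluster = {"worker": list(workers), "ps": list(ps)}
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type!r}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"task index {task_index} out of range for {task_type!r}")
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": task_index}}
    )


# Same cluster as in the manifest above, for the parameter server task.
ps_tf_config = build_tf_config(
    workers=["dist-strat-example-worker-0:5000", "dist-strat-example-worker-1:5000"],
    ps=["dist-strat-example-ps-0:5000"],
    task_type="ps",
    task_index=0,
)
print(ps_tf_config)
```

The printed string can be pasted into the `value:` field of the TF_CONFIG environment variable, or written by whatever tool renders the manifest.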
# run_fn in module file of tfx component trainer
def run_fn(fn_args: tfx.components.FnArgs):
    
    cluster_dict = {}

    # Using the Service names here is what triggered the hang described above;
    # the Services' ClusterIPs (commented out below) worked instead.
    cluster_dict["worker"] = ["dist-strat-example-worker-0:5000", "dist-strat-example-worker-1:5000"]
    cluster_dict["ps"] = ["dist-strat-example-ps-0:5000"]
    
    #cluster_dict["worker"] = ["10.102.74.8:5000", "10.100.198.218:5000"]
    #cluster_dict["ps"] = ["10.96.200.160:5000"]
    
    cluster_spec = tf.train.ClusterSpec(cluster_dict)
    
    cluster_resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
      cluster_spec, rpc_layer="grpc")
    
    strategy = tf.distribute.ParameterServerStrategy(
    cluster_resolver,)
    
    tf_transform_output = tft.TFTransformOutput(fn_args.transform_output)
    
    train_dataset = _input_fn(
        fn_args.train_files,
        fn_args.data_accessor,
        tf_transform_output,
        batch_size=_TRAIN_BATCH_SIZE,
    )
    
    
    resampled_train_dataset = _resample_train_dataset(train_dataset, 
                                                      batch_size=_TRAIN_BATCH_SIZE)
    
    #tf.print(f"resampled_train_dataset {resampled_train_dataset.cardinality()}")
    
    val_dataset = _input_fn(
        fn_args.eval_files,
        fn_args.data_accessor,
        tf_transform_output,
        batch_size=_EVAL_BATCH_SIZE,
    )
          
    with strategy.scope():
        model = _build_keras_model()

    trainer_train_history = model.fit(
        resampled_train_dataset,
        epochs=fn_args.custom_config['epochs'],
        steps_per_epoch=fn_args.train_steps,
        validation_data=val_dataset,
        #callbacks=[tensorboard_callback],
    )
    
    with open('trainer_train_history.json', 'w') as f:
        json.dump(trainer_train_history.history, f)
    
    signatures = {
        'serving_default': _get_serve_tf_examples_fn(model, tf_transform_output),
    }
    
    model.save(fn_args.serving_model_dir, save_format='tf', signatures=signatures)
$ kubectl logs pod-tfx-trainer-component -n kubeflow
...
INFO:absl:Successfully installed '/tfx/pipelines/tfx_user_code_Trainer-0.0+a0a99f38e703a50fc266bc1da356164d31c1f23c893900324e04c03582c72555-py3-none-any.whl'.
INFO:absl:Training model.
INFO:tensorflow:`tf.distribute.experimental.ParameterServerStrategy` is initialized with cluster_spec: ClusterSpec({'ps': ['dist-strat-example-ps-0:5000'], 'worker': ['dist-strat-example-worker-0:5000', 'dist-strat-example-worker-1:5000']})
INFO:tensorflow:`tf.distribute.experimental.ParameterServerStrategy` is initialized with cluster_spec: ClusterSpec({'ps': ['dist-strat-example-ps-0:5000'], 'worker': ['dist-strat-example-worker-0:5000', 'dist-strat-example-worker-1:5000']})
INFO:tensorflow:ParameterServerStrategyV2 is now connecting to cluster with cluster_spec: ClusterSpec({'ps': ['dist-strat-example-ps-0:5000'], 'worker': ['dist-strat-example-worker-0:5000', 'dist-strat-example-worker-1:5000']})
INFO:tensorflow:ParameterServerStrategyV2 is now connecting to cluster with cluster_spec: ClusterSpec({'ps': ['dist-strat-example-ps-0:5000'], 'worker': ['dist-strat-example-worker-0:5000', 'dist-strat-example-worker-1:5000']})

(base) maye@maye-Inspiron-5547:~/github_repository/tensorflow_ecosystem/distribution_strategy$ kubectl logs dist-strat-example-ps-0-85fdfdddcb-9x6mt 
2024-02-14 05:51:36.101034: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-14 05:51:38.566981: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: UNKNOWN ERROR (34)
2024-02-14 05:51:38.570913: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:457] Started server with target: grpc://dist-strat-example-ps-0:5000

[SOLUTION]

The hang occurs because Service names such as "dist-strat-example-worker-0" are passed to tf.distribute.ParameterServerStrategy(). "dist-strat-example-worker-0" is the name of the Kubernetes Service in front of worker-0, not the worker's host name, so tf.distribute.ParameterServerStrategy() considers worker-0 not ready and keeps waiting.
Use the ClusterIP of each Service (for example, that of "dist-strat-example-worker-0") instead, so that tf.distribute.ParameterServerStrategy() can connect to it.
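One way to see which addresses the trainer pod can actually reach, before the strategy starts waiting, is a plain TCP connection test against every entry of the cluster spec. The following is a minimal sketch using only the standard library. The function name `check_endpoints` is hypothetical. A successful TCP connection only shows that the name resolves and the port accepts connections, not that a TensorFlow server is healthy.

```python
import socket


def check_endpoints(addresses, timeout=3.0):
    """Try a TCP connection to each "host:port" address.

    Returns a dict mapping each address to None when the connection
    succeeded, or to a short error description when it did not
    (for example a DNS failure, a refused connection, or a timeout).
    """
    results = {}
    for address in addresses:
        host, _, port = address.rpartition(":")
        try:
            # create_connection resolves the host name and connects.
            with socket.create_connection((host, int(port)), timeout=timeout):
                results[address] = None
        except OSError as exc:  # includes socket.gaierror and timeouts
            results[address] = f"{type(exc).__name__}: {exc}"
    return results
```

Called from inside the trainer pod, for instance at the top of run_fn with `check_endpoints(cluster_dict["worker"] + cluster_dict["ps"])`, this would report a name-resolution error for any entry the pod cannot resolve, which distinguishes a DNS problem from a server that is simply not started yet.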

def run_fn(fn_args: tfx.components.FnArgs):
    
    cluster_dict = {}
    #cluster_dict["worker"] = ["dist-strat-example-worker-0:5000", "dist-strat-example-worker-1:5000"]
    #cluster_dict["ps"] = ["dist-strat-example-ps-0:5000"]
    
    cluster_dict["worker"] = ["10.102.74.8:5000", "10.100.198.218:5000"]
    cluster_dict["ps"] = ["10.96.200.160:5000"]
    
    cluster_spec = tf.train.ClusterSpec(cluster_dict)
    
    cluster_resolver = tf.distribute.cluster_resolver.SimpleClusterResolver(
      cluster_spec, rpc_layer="grpc")
    
    strategy = tf.distribute.ParameterServerStrategy(
    cluster_resolver,)
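ClusterIPs change whenever a Service is recreated, so hard-coding them is fragile. The trainer pod's logs above were read with `-n kubeflow`, while the Service manifest sets no namespace, so the Services may live in a different namespace from the trainer. If that is the case, the short Service names would not resolve from the trainer pod, and namespace-qualified DNS names might work in place of the IPs. The sketch below builds such names. It assumes the Services are in the `default` namespace and that the cluster uses the default `cluster.local` domain. Neither is confirmed by the original post, and this alternative was not tested against the author's cluster. The helper name `service_addresses` is hypothetical.

```python
def service_addresses(service_names, namespace, port=5000, cluster_domain="cluster.local"):
    """Build namespace-qualified "host:port" addresses for Kubernetes Services.

    Assumption: `namespace` and `cluster_domain` must match the actual
    cluster; the values used below are guesses, not taken from the post.
    """
    return [
        f"{name}.{namespace}.svc.{cluster_domain}:{port}"
        for name in service_names
    ]


# Same Services as above, addressed by fully qualified DNS name instead of ClusterIP.
cluster_dict = {
    "worker": service_addresses(
        ["dist-strat-example-worker-0", "dist-strat-example-worker-1"],
        namespace="default",  # assumed namespace
    ),
    "ps": service_addresses(["dist-strat-example-ps-0"], namespace="default"),
}
print(cluster_dict)
```

The resulting `cluster_dict` has the same shape as the one passed to tf.train.ClusterSpec above, so it could be swapped in without changing the rest of run_fn.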

From: https://www.cnblogs.com/zhenxia-jiuyou/p/18015258
