What is an Artifact?
An Artifact is a file or directory produced by a TFX component. It can be passed to a downstream component, which then consumes it as an input.
How does TFX pass an Artifact between components?
A TFX pipeline takes a "pipeline_root" argument when it is instantiated; this is the directory where components put their output Artifacts. When a component finishes executing, it stores its output Artifacts under pipeline_root and records the uri (namely the path) of each output Artifact in the metadata database (metadb). When a downstream component declares an input Artifact, namely it needs an output Artifact of an upstream component, it queries the metadata database according to the input Artifact's channel (information about which pipeline, which pipeline run, and which pipeline node the Artifact belongs to) to find the matching Artifact record. That record contains the Artifact's uri, and the downstream component reads the input Artifact's data from that uri. pipeline_root and the metadata connection are configured when the pipeline is defined:
pipeline = tfx.dsl.Pipeline(
    pipeline_name=pipeline_name,
    pipeline_root=pipeline_root,
    metadata_connection_config=tfx.orchestration.metadata
        .sqlite_metadata_connection_config(metadata_path),
    components=components,
)
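What ends up in the metadata database can be inspected directly with the ml-metadata (MLMD) client that TFX uses underneath. A minimal sketch, assuming a local sqlite metadb and the standard 'Examples' artifact type; 'metadata.sqlite' stands in for whatever metadata_path points to:

# A sketch for inspecting the metadb; the db filename is an assumption.
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.sqlite.filename_uri = 'metadata.sqlite'
connection_config.sqlite.connection_mode = (
    metadata_store_pb2.SqliteMetadataSourceConfig.READONLY)

store = metadata_store.MetadataStore(connection_config)
# Each Artifact record carries the uri a downstream component reads from.
for artifact in store.get_artifacts_by_type('Examples'):
    print(artifact.id, artifact.uri)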
When running a TFX pipeline on Kubeflow Pipelines, how is an Artifact passed between components?
When a TFX pipeline is run on Kubeflow Pipelines, the pipeline runs on a Kubernetes cluster: each component runs in its own pod, and a container in a pod has a standalone file system. Even though pipeline_root is the same in each component's container, the paths belong to separate file systems, so they are different directories. One component therefore cannot read an Artifact from its uri (pipeline_root/xxx), because the Artifact is stored at pipeline_root/xxx of another container's file system.
pipeline_root needs to be a directory that can be read and written by all components of the pipeline. One solution is to mount one PersistentVolume at each component container's pipeline_root directory, and the PersistentVolume should be backed by NFS (network file system), since components of a pipeline normally run on different computers (namely nodes) in the Kubernetes cluster.
# create the PersistentVolume tfx-pv in the kubernetes cluster
kubectl apply -f tfx_pv.yaml
# create the PersistentVolumeClaim tfx-pv-claim in the kubernetes cluster.
# Attention: tfx-pv-claim needs to be in the same namespace as the objects that use it,
# in this case the components of the tfx pipeline (namespace kubeflow).
# With a storage class whose volumeBindingMode is WaitForFirstConsumer, a pv claim waits
# for its first consumer (namely a pod that uses the claim) before binding to an
# available pv (persistent volume).
kubectl apply -f tfx_pv_claim.yaml
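The volume and volumeMounts entries in the compiled workflow below come from the compilation step. A minimal sketch of compiling with the PVC mounted into every component container, assuming kfp 1.x and the TFX KubeflowDagRunner; 'pipeline' is the tfx.dsl.Pipeline defined above, and the claim name and mount path match the manifests used here:

# A sketch of requesting the mount at compile time (kfp 1.x assumed).
from kfp import onprem
from tfx.orchestration.kubeflow import kubeflow_dag_runner

runner_config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
    pipeline_operator_funcs=(
        # keep the defaults, then mount pvc tfx-pv-claim as volume tfx-pv
        # at /tfx/tfx_pv in every component pod
        kubeflow_dag_runner.get_default_pipeline_operator_funcs()
        + [onprem.mount_pvc('tfx-pv-claim', 'tfx-pv', '/tfx/tfx_pv')]
    ),
)
# Compiles the pipeline into an Argo workflow (the pipeline.yaml below)
# instead of running it locally.
kubeflow_dag_runner.KubeflowDagRunner(config=runner_config).run(pipeline)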
# pipeline.yaml (the Argo Workflow compiled from the tfx pipeline; abridged)
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: detect-anomolies-on-wafer-tfdv-schema-
  annotations: {pipelines.kubeflow.org/kfp_sdk_version: 1.8.0,
    pipelines.kubeflow.org/pipeline_compilation_time: '2024-01-07T22:16:36.438482',
    pipelines.kubeflow.org/pipeline_spec: '{"description": "Constructs a Kubeflow
      pipeline.", "inputs": [{"default": "pipelines/detect_anomolies_on_wafer_tfdv_schema",
      "name": "pipeline-root"}], "name": "detect_anomolies_on_wafer_tfdv_schema"}'}
  labels: {pipelines.kubeflow.org/kfp_sdk_version: 1.8.0}
spec:
  entrypoint: detect-anomolies-on-wafer-tfdv-schema
  ...
  volumes:
  - name: tfx-pv
    persistentVolumeClaim:
      claimName: tfx-pv-claim
  templates:
  - name: detect-anomolies-on-wafer-tfdv-schema
    inputs:
      parameters:
      - {name: pipeline-root}
    dag:
      tasks:
      - name: importexamplegen
        template: importexamplegen
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: wafer-data
        - name: tfx-pv
      - name: pusher
        template: pusher
        dependencies: [trainer]
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: tfx-pv
      - name: schema-importer
        template: schema-importer
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: tfx-pv
        - name: schema-path
      - name: statisticsgen
        template: statisticsgen
        dependencies: [importexamplegen]
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: tfx-pv
      - name: trainer
        template: trainer
        dependencies: [importexamplegen, transform]
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: trainer-module
        - name: tfx-pv
      - name: transform
        template: transform
        dependencies: [importexamplegen, schema-importer]
        arguments:
          parameters:
          - {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
        volumes:
        - name: transform-module
        - name: tfx-pv
        - name: schema-path
  - name: importexamplegen
    container:
      ...
      volumeMounts:
      - mountPath: /maye/trainEvalData
        name: wafer-data
      - mountPath: /tfx/tfx_pv
        name: tfx-pv
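For the shared volume to take effect, pipeline_root has to point inside the mount path that every component container shares. A sketch with assumed paths matching the volumeMounts above:

# Assumed layout: /tfx/tfx_pv is where volume tfx-pv is mounted in every
# component container, so a pipeline_root under it is the same nfs-backed
# directory for all components.
import os

MOUNT_PATH = '/tfx/tfx_pv'
pipeline_root = os.path.join(
    MOUNT_PATH, 'pipelines', 'detect_anomolies_on_wafer_tfdv_schema')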
# tfx_pv_claim.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfx-pv-claim
  # must be in the same namespace as the pods that use it (the pipeline's components)
  namespace: kubeflow
spec:
  accessModes:
  - ReadWriteOnce
  # set to match the PV's storageClassName so the claim can bind to tfx-pv
  storageClassName: local-storage
  resources:
    requests:
      storage: 5Gi
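Whether the claim actually binds can be checked with kubectl get pvc, or from Python with the official kubernetes client; a minimal sketch, assuming a working kubeconfig and the names from the manifest above:

# A sketch using the 'kubernetes' Python client.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
pvc = v1.read_namespaced_persistent_volume_claim('tfx-pv-claim', 'kubeflow')
# Stays 'Pending' until the first consumer pod is scheduled
# (WaitForFirstConsumer), then becomes 'Bound'.
print(pvc.status.phase)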
# tfx_pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: tfx-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
  # ReadWriteOnce allows mounting from a single node; on a multi-node cluster
  # an nfs volume can use ReadWriteMany instead
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  nfs:
    # placeholder: replace with the address of the nfs server
    server: nfs-server-ip
    path: /home/maye/nfs/tfx_pv
  # pins consumers of this pv to one node; fine for a single-node cluster,
  # but an nfs pv on a multi-node cluster does not need nodeAffinity
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - maye-inspiron-5547
From: https://www.cnblogs.com/zhenxia-jiuyou/p/17999702