1. What is Kubeflow Pipelines for a TFX pipeline?
Kubeflow Pipelines is an orchestrator for TFX pipelines that runs on a Kubernetes cluster.
LocalDagRunner is an orchestrator for TFX pipelines that runs locally.
# run a TFX pipeline using LocalDagRunner
tfx.orchestration.LocalDagRunner().run(
    _create_schema_pipeline(
        pipeline_name=SCHEMA_PIPELINE_NAME,
        pipeline_root=SCHEMA_PIPELINE_ROOT,
        data_root=DATA_ROOT,
        schema_path=SCHEMA_PATH,
        metadata_path=SCHEMA_METADATA_PATH,
        module_file=_trainer_module_file,
        serving_model_dir=SERVING_MODEL_DIR,
    )
)
# run a tfx pipeline using KubeflowDagRunner
tfx.orchestration.experimental.KubeflowDagRunner().run(
    _create_schema_pipeline(
        pipeline_name=SCHEMA_PIPELINE_NAME,
        pipeline_root=SCHEMA_PIPELINE_ROOT,
        data_root=DATA_ROOT,
        schema_path=SCHEMA_PATH,
        metadata_path=SCHEMA_METADATA_PATH,
        module_file=_trainer_module_file,
        serving_model_dir=SERVING_MODEL_DIR,
    )
)
2. Steps for running a TFX pipeline with Kubeflow Pipelines
2.1 Generate the file pipeline.yaml (the Kubeflow pipeline definition file):
tfx.orchestration.experimental.KubeflowDagRunner().run(
    _create_schema_pipeline(
        pipeline_name=SCHEMA_PIPELINE_NAME,
        pipeline_root=SCHEMA_PIPELINE_ROOT,
        data_root=DATA_ROOT,
        schema_path=SCHEMA_PATH,
        metadata_path=SCHEMA_METADATA_PATH,
        module_file=_trainer_module_file,
        serving_model_dir=SERVING_MODEL_DIR,
    )
)
2.2 Change the image registry in pipeline.yaml, because the default registry (docker.io) is not accessible in China.
# in file pipeline.yaml
# original image generated by tfx.orchestration.experimental.KubeflowDagRunner().run();
# it resolves to docker.io/tensorflow/tfx:1.14.0, which is not accessible in China.
#image: tensorflow/tfx:1.14.0
# replacement image
# docker.nju.edu.cn does not have tfx:1.14.0 yet, since it is the latest version
# and the mirror has not pulled it,
# so use tfx:1.13.0 for now.
image: docker.nju.edu.cn/tensorflow/tfx:1.13.0
imagePullPolicy: Never
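The edit in step 2.2 can also be scripted. The following is a minimal sketch, assuming the generated file contains plain `image: tensorflow/tfx:1.14.0` lines and does not already set imagePullPolicy for those containers. `use_mirror_image` is a hypothetical helper written for these notes, not part of TFX or Kubeflow.

```python
import re


def use_mirror_image(yaml_text, original, replacement, pull_policy="Never"):
    """Swap `image: <original>` lines and add an imagePullPolicy line after each.

    Hypothetical helper: it edits the YAML as text, so it assumes the image
    is written on a single line and has no imagePullPolicy yet.
    """
    pattern = re.compile(
        r"^(?P<prefix> *(?:- +)?)image: *[\"']?"
        + re.escape(original)
        + r"[\"']? *$"
    )
    out = []
    for line in yaml_text.splitlines():
        match = pattern.match(line)
        if match is None:
            out.append(line)  # unrelated line, keep as is
            continue
        prefix = match.group("prefix")
        out.append(f"{prefix}image: {replacement}")
        # Align the new key with `image:`, also when the line starts a list item.
        out.append(" " * len(prefix) + f"imagePullPolicy: {pull_policy}")
    return "\n".join(out) + "\n"
```

To apply it to a file, read pipeline.yaml with `pathlib.Path("pipeline.yaml").read_text()`, pass the text through `use_mirror_image(text, "tensorflow/tfx:1.14.0", "docker.nju.edu.cn/tensorflow/tfx:1.13.0")`, and write the result back.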
Attention
- The image tensorflow/tfx:1.13.0 is about 30 GB, and its compressed blobs (gzip) are about 9 GB, so it is better to pull (download) it beforehand.
- `imagePullPolicy: Never` means the image is never pulled when the container is started; the other options are `Always` and `IfNotPresent`. With `imagePullPolicy: Never`, the image must already exist on every node the pod may be assigned to. Otherwise, once the pod is scheduled (assigned to a node in the Kubernetes cluster) onto a node without the image, it fails with:
'Warning ErrImageNeverPull 2m36s (x10 over 4m16s) kubelet Container image "docker.nju.edu.cn/tensorflow/tfx:1.13.0" is not present with pull policy of Never'
Alternatively, set nodeAffinity so that the pod is assigned to the node that has the image:
# pod spec form of node affinity
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - maye-inspiron-5547
- containerd keeps images in namespaces. The default namespace is "default"; images pulled by containers in a Kubernetes cluster live in the namespace "k8s.io".
`crictl` is the Container Runtime Interface (CRI) CLI for Kubernetes.
crictl works with a single image namespace, k8s.io, which is therefore also the namespace it pulls images into by default.
If an image is pulled with ctr without placing it in the k8s.io namespace, crictl cannot see that local image.
`ctr` is the command-line tool that ships with containerd. There are three namespaces in total: default, k8s.io, and moby; the default one is default.
`nerdctl` is a Docker-compatible CLI for containerd.
`ctr image ls`: lists the images in the namespace "default".
Pull an image into the k8s.io namespace:
nerdctl pull nginx:latest --namespace k8s.io
List the images in the k8s.io namespace:
sudo nerdctl images --namespace k8s.io
Attention
If `nerdctl image list --namespace k8s.io` shows no images, the cause is rootless nerdctl: accessing the image namespace k8s.io requires root privileges, so run the command with sudo.
Copy an image from one namespace to another:
ctr -n default image export my-image.tar my-image
ctr -n k8s.io image import my-image.tar
# or,
nerdctl --namespace default save -o my-image.tar my-image
nerdctl --namespace k8s.io load -i my-image.tar
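When several images need to be moved, the ctr export and import steps above can be wrapped in a small script. This is only a sketch: `copy_image_commands` and `copy_image` are hypothetical helper names, and it assumes ctr is on the PATH and that the script has enough privileges to reach the k8s.io namespace (typically root).

```python
import subprocess


def copy_image_commands(image, tar_path, src_ns="default", dst_ns="k8s.io"):
    """Return the two ctr commands: export from src_ns, then import into dst_ns."""
    export_cmd = ["ctr", "-n", src_ns, "image", "export", tar_path, image]
    import_cmd = ["ctr", "-n", dst_ns, "image", "import", tar_path]
    return [export_cmd, import_cmd]


def copy_image(image, tar_path, run=subprocess.run, **namespaces):
    """Run the commands in order; check=True stops at the first failure."""
    for cmd in copy_image_commands(image, tar_path, **namespaces):
        run(cmd, check=True)
```

For example, `copy_image("my-image", "my-image.tar")` runs the same two ctr commands as listed above. The `run` parameter exists only so the commands can be inspected or dry-run without touching containerd.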
Attention
If a needed image is not in the namespace "k8s.io", Kubernetes cannot see it,
so it cannot find the image when creating the pod (ErrImageNeverPull when imagePullPolicy is set to Never).
2.3 Error & Solution
[ERROR: Failed to pull image]
(base) maye@maye-Inspiron-5547:~$ kubectl describe pod detect-anomolies-on-wafer-tfdv-schema-ldvtw-1952722848 -n kubeflow
Events:
Type Reason Age From Message
Warning Failed 52m (x4 over 92m) kubelet Error: ImagePullBackOff
Warning Failed 13m (x9 over 92m) kubelet Error: ErrImagePull
Warning Failed 4m42s (x9 over 92m) kubelet Failed to pull image "tensorflow/tfx:1.14.0": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/tensorflow/tfx:1.14.0": failed to copy: httpReadSeeker: failed open: unexpected status code https://pft7f97f.mirror.aliyuncs.com/v2/tensorflow/tfx/blobs/sha256:f2cce533751060f702397991bc7f0acf6d691c898fe1c7cc25b3ece25a409879?ns=docker.io: 500 Internal Server Error - Server message: unknown: unknown error
Normal BackOff 4m17s (x13 over 92m) kubelet Back-off pulling image "tensorflow/tfx:1.14.0"
(base) maye@maye-Inspiron-5547:~$
[Solution]
This happens because docker.io is not accessible in China. In pipeline.yaml, replace it with one of its mirror sites, such as docker.nju.edu.cn, i.e. change "tensorflow/tfx:1.14.0" to "docker.nju.edu.cn/tensorflow/tfx:1.14.0" (or to a tag the mirror already has, such as 1.13.0).
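To spot this kind of failure without reading the whole event list, the `kubectl describe pod` output can be scanned for the image-pull failure reasons. The sketch below is illustrative only: `find_image_pull_failures` and `with_mirror` are hypothetical helpers, the reason names are the ones seen in these notes, and `with_mirror` assumes a Docker Hub style reference without a registry host.

```python
import re

# Failure reasons that appear in the events quoted in these notes.
PULL_FAILURE_REASONS = ("ErrImagePull", "ImagePullBackOff", "ErrImageNeverPull")
_IMAGE_RE = re.compile(r'image "([^"]+)"')


def find_image_pull_failures(describe_output):
    """Return (sorted reasons, sorted images) found in the event lines."""
    reasons, images = set(), set()
    for line in describe_output.splitlines():
        reasons.update(r for r in PULL_FAILURE_REASONS if r in line)
        # Take only the first quoted image on a line: the name as written in the spec.
        match = _IMAGE_RE.search(line)
        if match:
            images.add(match.group(1))
    return sorted(reasons), sorted(images)


def with_mirror(image, mirror="docker.nju.edu.cn"):
    """Prefix a reference such as "tensorflow/tfx:1.14.0" with a mirror host."""
    return f"{mirror}/{image}"
```

Feeding it the events shown above yields the two failure reasons and the image "tensorflow/tfx:1.14.0", which `with_mirror` turns into the replacement reference used in the solution.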
Note:
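- Mirror sites for hub.docker.com
“
Summary of mirrors available in China
DaoCloud mirror site
Mirror address: https://docker.m.daocloud.io
Supports: Docker Hub, GCR, K8S, GHCR, Quay, NVCR, etc.
Free for public use: yes
NetEase Cloud
Mirror address: https://hub-mirror.c.163.com
Supports: Docker Hub
Free for public use: yes
Docker image proxy
Mirror address: https://dockerproxy.com
Supports: Docker Hub, GCR, K8S, GHCR
Free for public use: yes
Baidu Cloud
Mirror address: https://mirror.baidubce.com
Supports: Docker Hub
Free for public use: yes
Nanjing University mirror site
Mirror address: https://docker.nju.edu.cn
Supports: Docker Hub, GCR, GHCR, Quay, NVCR, etc.
Free for public use: yes
Shanghai Jiao Tong University mirror site
Mirror address: https://docker.mirrors.sjtug.sjtu.edu.cn/
Supports: Docker Hub, GCR, etc.
Restrictions: none
Alibaba Cloud
Mirror address: https://<your_code>.mirror.aliyuncs.com
Supports: Docker Hub
Restrictions: requires logging in to an account to obtain the code
” [1]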
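- Inspect a failing Linux process that runs in the background (with trace=none, strace prints no system calls, only the signals the process receives and its exit status):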
strace -e trace=none -p <PID>
Reference: