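The following walkthrough uses containerd as the runtime.
Introduction
Ignite is an engine for launching Firecracker VMs; it hosts each Firecracker VM inside a container. The project is currently stalled, which is a pity. Reading the code to understand how Ignite works taught me a lot, and I hope to use that understanding to help maintain the project.
Ignite works much like Kubernetes: Firecracker can be seen as runc and Ignite as the CRI (there is also Footloose, which can be seen as docker-compose). In addition, it uses a store (referred to below as Storage) to hold the cluster's metadata (image/kernel/vm), which plays the role etcd plays in Kubernetes.
In Ignite's VM creation flow, the VM is actually created by the firecracker command running inside a container. Both containers and VMs come up below, so take care to tell them apart.
First, containerd is used to create a namespace named firecracker. Images are then pulled and containers created in that namespace, and inside the container ignite-spawn calls firecracker to create the VM. The whole process involves building and mounting the VM filesystem, configuring and creating the container with containerd, configuring the VM network, and starting the VM with firecracker. The VM process is started in the following steps: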
- /usr/bin/containerd-shim-runc-v2 -namespace firecracker -id ignite-ddf49307b5b27c34 -address /run/containerd/containerd.sock
- /usr/local/bin/ignite-spawn --log-level=info ddf49307b5b27c34
- firecracker --api-sock /var/lib/firecracker/vm/ddf49307b5b27c34/firecracker.sock
The resulting process tree is:
containerd-shim─┬─ignite-spawn─┬─firecracker───2*[{firecracker}]
                │              └─14*[{ignite-spawn}]
                └─11*[{containerd-shim}]
Ignite operates on containers through the following interface, which covers much the same functionality as the usual docker command line:
type Interface interface {
PullImage(image meta.OCIImageRef) error
InspectImage(image meta.OCIImageRef) (*ImageInspectResult, error)
ExportImage(image meta.OCIImageRef) (io.ReadCloser, func() error, error)
InspectContainer(container string) (*ContainerInspectResult, error)
AttachContainer(container string) error
RunContainer(image meta.OCIImageRef, config *ContainerConfig, name, id string) (string, error)
StopContainer(container string, timeout *time.Duration) error
KillContainer(container, signal string) error
RemoveContainer(container string) error
ContainerLogs(container string) (io.ReadCloser, error)
Name() Name
RawClient() interface{}
PreflightChecker() preflight.Checker
}
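Ignite has three kinds of resources: Image, Kernel, and VM, representing the base image, the kernel image, and the virtual machine respectively. It uses Storage to save the metadata of these resources. The metadata path is constants.DATA_DIR, which the code defines as /var/lib/firecracker.
Ignite has two main directories:
- /var/lib/firecracker/: holds the metadata of Image, Kernel, and VM objects, along with kernel files and VM filesystem files. Each of the three object kinds has a UID, and its resources live in the corresponding /var/lib/firecracker/<image/kernel/vm>/<UID> directory.
- /etc/firecracker/manifests: VM manifest files used by the ignited daemon, which manages VMs by watching these files.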
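As a rough illustration of this layout, the sketch below builds the on-disk directory of an object from its kind and UID. The helper name objectPath and the lowercase kind strings are assumptions made for the example, not Ignite's actual API; only the /var/lib/firecracker root and the <kind>/<UID> layout come from the description above.

```go
package main

import (
	"fmt"
	"path"
)

// dataDir mirrors the value the text gives for constants.DATA_DIR.
const dataDir = "/var/lib/firecracker"

// objectPath is a hypothetical helper (not part of Ignite) that returns the
// directory holding the metadata and files of one object.
func objectPath(kind, uid string) string {
	return path.Join(dataDir, kind, uid)
}

func main() {
	// Directory of an image object and the metadata file of a VM object.
	fmt.Println(objectPath("image", "669a5721d130ef1d"))
	fmt.Println(path.Join(objectPath("vm", "ddf49307b5b27c34"), "metadata.json"))
}
```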
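Ignite uses several storage types. The underlying Storage interface is shown below. Its methods closely resemble the way client-go operates on Kubernetes resources, and Storage holds Ignite's CRD objects: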
type Storage interface {
// New creates a new Object for the specified kind
New(gvk schema.GroupVersionKind) (runtime.Object, error)
// Get returns a new Object for the resource at the specified kind/uid path, based on the file content
Get(gvk schema.GroupVersionKind, uid runtime.UID) (runtime.Object, error)
// GetMeta returns a new Object's APIType representation for the resource at the specified kind/uid path
GetMeta(gvk schema.GroupVersionKind, uid runtime.UID) (runtime.Object, error)
// Set saves the Object to disk. If the Object does not exist, the
// ObjectMeta.Created field is set automatically
Set(gvk schema.GroupVersionKind, obj runtime.Object) error
// Patch performs a strategic merge patch on the Object with the given UID, using the byte-encoded patch given
Patch(gvk schema.GroupVersionKind, uid runtime.UID, patch []byte) error
// Delete removes an Object from the storage
Delete(gvk schema.GroupVersionKind, uid runtime.UID) error
// List lists Objects for the specific kind
List(gvk schema.GroupVersionKind) ([]runtime.Object, error)
// ListMeta lists all Objects' APIType representation. In other words,
// only metadata about each Object is unmarshalled (uid/name/kind/apiVersion).
// This allows for faster runs (no need to unmarshal "the world"), and less
// resource usage, when only metadata is unmarshalled into memory
ListMeta(gvk schema.GroupVersionKind) ([]runtime.Object, error)
// Count returns the amount of available Objects of a specific kind
// This is used by Caches to check if all Objects are cached to perform a List
Count(gvk schema.GroupVersionKind) (uint64, error)
// Checksum returns a string representing the state of an Object on disk
// The checksum should change if any modifications have been made to the
// Object on disk, it can be e.g. the Object's modification timestamp or
// calculated checksum
Checksum(gvk schema.GroupVersionKind, uid runtime.UID) (string, error)
// RawStorage returns the RawStorage instance backing this Storage
RawStorage() RawStorage
// Serializer returns the serializer
Serializer() serializer.Serializer
// Close closes all underlying resources (e.g. goroutines) used; before the application exits
Close() error
}
Ignite defines the resources it manages as CRDs. The corresponding GVK is: group ignite.weave.works; versions v1alpha2, v1alpha3, and v1alpha4; kinds Image, Kernel, and VM. scheme.Serializer provides the encoding and decoding for these CRDs:
var (
// Scheme is the runtime.Scheme to which all types are registered.
Scheme = runtime.NewScheme()
// codecs provides access to encoding and decoding for the scheme.
// codecs is private, as Serializer will be used for all higher-level encoding/decoding
codecs = k8sserializer.NewCodecFactory(Scheme)
// Serializer provides high-level encoding/decoding functions
Serializer = serializer.NewSerializer(Scheme, &codecs)
)
func init() {
AddToScheme(Scheme)
}
// AddToScheme builds the scheme using all known versions of the api.
func AddToScheme(scheme *runtime.Scheme) {
utilruntime.Must(ignite.AddToScheme(Scheme))
utilruntime.Must(v1alpha2.AddToScheme(Scheme))
utilruntime.Must(v1alpha3.AddToScheme(Scheme))
utilruntime.Must(v1alpha4.AddToScheme(Scheme))
utilruntime.Must(scheme.SetVersionPriority(v1alpha4.SchemeGroupVersion))
}
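Running
Ignite has two ways to manage VMs, each with its own command: ignite and ignited. The former manages VMs through manual command-line invocations; the latter manages them automatically by watching VM manifest files.
The Storage used by ignite is called GenericStorage, while the one used by ignited is called ManifestStorage. They are initialized as follows: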
func SetGenericStorage() error {
log.Trace("Initializing the GenericStorage provider...")
providers.Storage = cache.NewCache(
storage.NewGenericStorage(
storage.NewGenericRawStorage(constants.DATA_DIR), scheme.Serializer))
return nil
}
func SetManifestStorage() (err error) {
log.Trace("Initializing the ManifestStorage provider...")
ManifestStorage, err = manifest.NewTwoWayManifestStorage(constants.MANIFEST_DIR, constants.DATA_DIR, scheme.Serializer)
if err != nil {
return
}
providers.Storage = cache.NewCache(ManifestStorage)
return
}
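Ignite uses several variables to operate on its resources: Runtime manages containerd resources, Client manages Ignite's own resources, and NetworkPlugin configures the container's CNI.
Building the VM filesystem
The VM filesystem has two parts: the base filesystem and the kernel files, which come from the base image and the kernel image respectively. While building the VM filesystem, the two parts are merged into one complete filesystem, which is later mounted as the VM's root fs when firecracker starts the VM.
The ignite CLI pulls the base image and the kernel image with the following two commands: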
$ ignite image import <OCI image> [flags]
$ ignite kernel import <OCI image> [flags]
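Building the VM base filesystem file
Pulling an image requires two parameters: providers.RuntimeName (default containerd) and providers.NetworkPluginName (default cni). The former is used to create the containerdClient, which pulls images and creates containers; the latter configures the CNI network and is not used while building the filesystem.
Creating the containerdClient
Creating the containerdClient requires two things: containerd.sock and containerd-shim. A system may have several containerd-shim versions installed; io.containerd.runc.v2 is preferred.
Ignite uses containerd to create a namespace named firecracker, and all of Ignite's later image and VM operations happen in that namespace. You can inspect Ignite's containers and images directly with the ctr command. The listing below shows the containers and images Ignite created. Ignite's container names start with ignite-, but ignite vm ls does not show that prefix (you can also think of that command as listing the name of the VM inside the container):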
$ ctr -n firecracker containers ls
CONTAINER IMAGE RUNTIME
ignite-ddf49307b5b27c34 docker.io/weaveworks/ignite:v0.10.0 io.containerd.runc.v2
$ ctr -n firecracker images ls
REF TYPE DIGEST SIZE PLATFORMS LABELS
docker.io/weaveworks/ignite-kernel:5.10.51 application/vnd.docker.distribution.manifest.list.v2+json sha256:c1d99eafa5b2bcaeab26c0a093d83d709a560e4721f52b6e7c5ef7e9e771189d 15.0 MiB linux/amd64,linux/arm64 -
docker.io/weaveworks/ignite-ubuntu:latest application/vnd.docker.distribution.manifest.list.v2+json sha256:11550e0912d24aeaad847f06fdf2133302f2af2fd2ce231723d078ffce9216ba 78.1 MiB linux/amd64,linux/arm64/v8 -
docker.io/weaveworks/ignite:v0.10.0 application/vnd.docker.distribution.manifest.list.v2+json sha256:b8cc53c5cba81d685b1dc95a0f34ca3fa732ddd450b6f0eba0c829ccc1c67462 16.5 MiB linux/amd64,linux/arm64 -
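The containerdClient is created as shown below; all it needs is the containerd socket and a runtime. Note that when looking up the runtime, if RuntimeRuncV2 is not available, the code falls back to RuntimeRuncV1: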
func GetContainerdClient() (*ctdClient, error) {
ctdSocket, err := StatContainerdSocket()
if err != nil {
return nil, err
}
runtime, err := getNewestAvailableContainerdRuntime() // pick an available runtime
if err != nil {
// proceed with the default runtime -- our PATH can't see a shim binary, but containerd might be able to
log.Warningf("Proceeding with default runtime %q: %v", runtime, err)
}
cli, err := containerd.New(
ctdSocket,
containerd.WithDefaultRuntime(runtime),
)
if err != nil {
return nil, err
}
return &ctdClient{
client: cli,
ctx: namespaces.WithNamespace(context.Background(), ctdNamespace), // use Ignite's namespace
}, nil
}
func getNewestAvailableContainerdRuntime() (string, error) {
for _, rt := range v2ShimRuntimes {
binary := v2shim.BinaryName(rt)
if binary == "" {
// this shouldn't happen if the matching test is passing, but it's not fatal -- just log and continue
log.Errorf("shim binary could not be found -- %q is an invalid runtime/v2/shim", rt)
} else if _, err := exec.LookPath(binary); err == nil {
return rt, nil
}
}
...
}
v2ShimRuntimes = []string{
plugin.RuntimeRuncV2,
plugin.RuntimeRuncV1,
}
const (
// RuntimeLinuxV1 is the legacy linux runtime
RuntimeLinuxV1 = "io.containerd.runtime.v1.linux"
// RuntimeRuncV1 is the runc runtime that supports a single container
RuntimeRuncV1 = "io.containerd.runc.v1"
// RuntimeRuncV2 is the runc runtime that supports multiple containers per shim
RuntimeRuncV2 = "io.containerd.runc.v2"
)
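The fallback can be sketched in isolation as follows. This is a simplified stand-in, not Ignite's or containerd's code: shimBinaryName approximates the runtime-to-binary mapping (io.containerd.runc.v2 becomes containerd-shim-runc-v2), fakeLookPath replaces exec.LookPath so the example runs without containerd installed, and the error returned when nothing is found is an assumption, since the excerpt above elides that part.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// shimBinaryName approximates the runtime name -> shim binary mapping:
// the last two dot-separated parts become "containerd-shim-<name>-<version>".
// It returns "" when the name has too few parts.
func shimBinaryName(runtime string) string {
	parts := strings.Split(runtime, ".")
	if len(parts) < 2 {
		return ""
	}
	return fmt.Sprintf("containerd-shim-%s-%s", parts[len(parts)-2], parts[len(parts)-1])
}

// fakeLookPath returns a lookup function that only "finds" the given binaries.
// It stands in for exec.LookPath in this sketch.
func fakeLookPath(installed ...string) func(string) (string, error) {
	return func(bin string) (string, error) {
		for _, name := range installed {
			if name == bin {
				return "/usr/bin/" + bin, nil
			}
		}
		return "", errors.New("not found: " + bin)
	}
}

// pickRuntime returns the first candidate whose shim binary can be located.
func pickRuntime(candidates []string, lookPath func(string) (string, error)) (string, error) {
	for _, rt := range candidates {
		bin := shimBinaryName(rt)
		if bin == "" {
			continue // invalid runtime name, try the next candidate
		}
		if _, err := lookPath(bin); err == nil {
			return rt, nil
		}
	}
	return "", errors.New("no shim binary found for any candidate runtime")
}

func main() {
	candidates := []string{"io.containerd.runc.v2", "io.containerd.runc.v1"}
	// Pretend only the v1 shim is installed, so the v2 lookup fails first.
	rt, err := pickRuntime(candidates, fakeLookPath("containerd-shim-runc-v1"))
	fmt.Println(rt, err)
}
```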
Finally, containerd.New creates the containerdClient, which is stored in the providers.Runtime variable:
cli, err := containerd.New(
ctdSocket,
containerd.WithDefaultRuntime(runtime),
)
The containerd-related defaults are:
const (
// DefaultRootDir is the default location used by containerd to store
// persistent data
DefaultRootDir = "/var/lib/containerd"
// DefaultStateDir is the default location used by containerd to store
// transient data
DefaultStateDir = "/run/containerd"
// DefaultAddress is the default unix socket address
DefaultAddress = "/run/containerd/containerd.sock"
// DefaultDebugAddress is the default unix socket address for pprof data
DefaultDebugAddress = "/run/containerd/debug.sock"
// DefaultFIFODir is the default location used by client-side cio library
// to store FIFOs.
DefaultFIFODir = "/run/containerd/fifo"
// DefaultRuntime is the default linux runtime
DefaultRuntime = "io.containerd.runc.v2"
// DefaultConfigDir is the default location for config files.
DefaultConfigDir = "/etc/containerd"
)
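For the differences between containerd-shim versions, see the article "Containerd shim 原理深入解读".
Creating the cniInstance
The cniInstance sets up the container's network. It is not used while building the VM filesystem, but it is initialized along with everything else.
Creating the cniInstance depends on the providers.Runtime obtained above, which identifies the container runtime whose CNI network is being configured. The result is stored in providers.NetworkPlugin.
This step mainly initializes a CNI instance, cniInstance, via gocni.New; the container network is set up later through cniInstance.Setup (see the "CNI" section below).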
func GetCNINetworkPlugin(runtime runtime.Interface) (network.Plugin, error) {
// If the CNI configuration directory doesn't exist, create it
if !util.DirExists(CNIConfDir) {
if err := os.MkdirAll(CNIConfDir, constants.DATA_DIR_PERM); err != nil {
return nil, err
}
}
binDirs := []string{CNIBinDir}
cniInstance, err := gocni.New(gocni.WithMinNetworkCount(2),
gocni.WithPluginConfDir(CNIConfDir),
gocni.WithPluginDir(binDirs))
if err != nil {
return nil, err
}
return &cniNetworkPlugin{
runtime: runtime,
cni: cniInstance,
once: &sync.Once{},
}, nil
}
// CNIBinDir describes the directory where the CNI binaries are stored
CNIBinDir = "/opt/cni/bin"
// CNIConfDir describes the directory where the CNI plugin's configuration is stored
CNIConfDir = "/etc/cni/net.d"
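Pulling the base image
Ignite first checks whether the image already exists by looking it up in its own image metadata. If it is not there, the containerdClient (when the runtime is containerd) looks for it locally (similar to running ctr --namespace firecracker images ls), and only if that also fails is the image pulled from the remote registry.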
func FindOrImportImage(c *client.Client, ociRef meta.OCIImageRef) (*api.Image, error) {
log.Debugf("Ensuring image %s exists, or importing it...", ociRef)
image, err := c.Images().Find(filter.NewIDNameFilter(ociRef.String())) // check whether the metadata already contains the image
if err == nil {
// Return the image found
log.Debugf("Found image with UID %s", image.GetUID())
return image, nil
}
switch err.(type) {
case *filterer.NonexistentError:
return importImage(c, ociRef) // load the image from containerd locally or from the remote
default:
return nil, err
}
}
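The find-or-import pattern used here can be shown with a small, self-contained sketch. Everything in it is hypothetical: store stands in for Ignite's image metadata, notFoundError plays the role of filterer.NonexistentError, and the importer is a stub. The point is only the control flow: return a hit, import on a not-found error, and propagate any other error.

```go
package main

import (
	"errors"
	"fmt"
)

// notFoundError plays the role of a "nonexistent" lookup error in this sketch.
type notFoundError struct{ name string }

func (e *notFoundError) Error() string { return "no match for " + e.name }

// store is a hypothetical in-memory stand-in for image metadata: name -> UID.
type store map[string]string

// find returns the UID for name, or a *notFoundError if it is not stored.
func (s store) find(name string) (string, error) {
	if uid, ok := s[name]; ok {
		return uid, nil
	}
	return "", &notFoundError{name: name}
}

// findOrImport returns the stored UID and calls importFn only when the lookup
// fails with a not-found error; any other error is returned unchanged.
func findOrImport(s store, name string, importFn func(string) (string, error)) (string, error) {
	uid, err := s.find(name)
	if err == nil {
		return uid, nil
	}
	var nf *notFoundError
	if !errors.As(err, &nf) {
		return "", err
	}
	uid, err = importFn(name)
	if err != nil {
		return "", err
	}
	s[name] = uid // remember the result so the next lookup is a hit
	return uid, nil
}

func main() {
	s := store{}
	importer := func(string) (string, error) { return "669a5721d130ef1d", nil }
	uid, err := findOrImport(s, "weaveworks/ignite-ubuntu:latest", importer)
	fmt.Println(uid, err)
}
```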
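Now look at how imageClient is initialized: it takes the Storage and the GVK of the image resource, the same logic client-go uses to look up Kubernetes resources. The kernel client and VM client are initialized the same way as the image client, with kind set to the corresponding type.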
func newImageClient(s storage.Storage, gv schema.GroupVersion) ImageClient {
return &imageClient{
storage: s,
filterer: filterer.NewFilterer(s),
gvk: gv.WithKind(api.KindImage.Title()),
}
}
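The main handler, importImage, is shown below. After the image has been loaded from containerd locally or from the remote, it initializes an image object with the given GVK and sets its parameters: the image name, the image's OCI reference (e.g. weaveworks/ignite-ubuntu:latest), and the image's UID. The UID identifies a unique image object (note that it identifies the CRD object, not the image's SHA digest). The image metadata can be inspected in /var/lib/firecracker/image/<UID>/metadata.json.
Once the image object is configured, dmlegacy.CreateImageFilesystem creates a file named image.ext4 in /var/lib/firecracker/image/<UID>/, resizes it with truncate, and initializes it as an empty ext4 filesystem with mkfs.ext4 -b 4096 -I 256 -F -E lazy_itable_init=0,lazy_journal_init=0 /var/lib/firecracker/image/<UID>/image.ext4. The image.ext4 file is then attached to a /dev/loop device so that it can be used as a filesystem, mounted, and populated with the base image's files (details in "Creating the base filesystem file"). Finally the filesystem is unmounted, which completes the base filesystem file (image.ext4). The image metadata is then saved to Ignite's Storage for later lookup: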
func importImage(c *client.Client, ociRef meta.OCIImageRef) (*api.Image, error) {
log.Debugf("Importing image with ociRef %q", ociRef)
// Parse the source
dockerSource := source.NewDockerSource()
src, err := dockerSource.Parse(ociRef) // load from containerd locally, or pull from the remote
if err != nil {
return nil, err
}
image := c.Images().New() // initialize an Ignite image object
// Set the image name
image.Name = ociRef.String()
// Set the image's ociRef
image.Spec.OCI = ociRef
// Set the image's ociSource
image.Status.OCISource = *src
// Generate UID automatically
if err := metadata.SetNameAndUID(image, c); err != nil { // set the image object's UID
return nil, err
}
log.Infoln("Starting image import...")
// Truncate a file for the filesystem, format it with ext4, and copy in the files from the source
if err := dmlegacy.CreateImageFilesystem(image, dockerSource); err != nil { // create the ext4 filesystem and import the image files
return nil, err
}
if err := c.Images().Set(image); err != nil { // store the new image's metadata
return nil, err
}
log.Infof("Imported OCI image %q (%s) to base image with UID %q", ociRef, image.Status.OCISource.Size, image.GetUID())
return image, nil
}
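The internals of dmlegacy.CreateImageFilesystem are not shown here, so the following is only a sketch of the first two steps described above, under stated assumptions: allocateSparse mimics what truncate does, and mkfsArgs merely assembles the mkfs.ext4 arguments quoted in the text without running the command. The helper names and the 64 MiB size are made up for the example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// allocateSparse creates (or truncates) the file at p and sets its length to
// size bytes, like `truncate -s`. No data blocks are written.
func allocateSparse(p string, size int64) error {
	f, err := os.Create(p)
	if err != nil {
		return err
	}
	defer f.Close()
	return f.Truncate(size)
}

// mkfsArgs returns the mkfs.ext4 arguments quoted in the text for the file p.
// The command is only assembled here, not executed.
func mkfsArgs(p string) []string {
	return []string{"-b", "4096", "-I", "256", "-F", "-E",
		"lazy_itable_init=0,lazy_journal_init=0", p}
}

// allocateInTemp allocates an image.ext4 of the given size in a throwaway
// directory and reports the resulting file size.
func allocateInTemp(size int64) (int64, error) {
	dir, err := os.MkdirTemp("", "image-sketch")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	p := filepath.Join(dir, "image.ext4")
	if err := allocateSparse(p, size); err != nil {
		return 0, err
	}
	fi, err := os.Stat(p)
	if err != nil {
		return 0, err
	}
	return fi.Size(), nil
}

func main() {
	n, err := allocateInTemp(64 << 20)
	fmt.Println(n, err)
	fmt.Println("mkfs.ext4", mkfsArgs("/var/lib/firecracker/image/<UID>/image.ext4"))
}
```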
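Next, how Ignite loads an image from containerd or from the remote. The implementation is simple and relies on providers.Runtime: it first looks for a local image via providers.Runtime.InspectImage, and pulls from the remote if there is none (like ctr --namespace firecracker images pull):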
func (ds *DockerSource) Parse(ociRef meta.OCIImageRef) (*api.OCIImageSource, error) {
res, err := providers.Runtime.InspectImage(ociRef)
if err != nil {
log.Infof("%s image %q not found locally, pulling...", providers.Runtime.Name(), ociRef)
if err := providers.Runtime.PullImage(ociRef); err != nil {
return nil, err
}
if res, err = providers.Runtime.InspectImage(ociRef); err != nil {
return nil, err
}
}
if res.Size == 0 || res.ID == nil {
return nil, fmt.Errorf("parsing %s image %q data failed", providers.Runtime.Name(), ociRef)
}
ds.imageRef = ociRef
return &api.OCIImageSource{
ID: res.ID,
Size: meta.NewSizeFromBytes(uint64(res.Size)),
}, nil
}
Once the image has been loaded, its metadata can be found under constants.DATA_DIR (/var/lib/firecracker). Below is the image metadata for weaveworks/ignite-ubuntu:latest, stored in /var/lib/firecracker/image/<UID>. metadata.json holds the image metadata in JSON format; the CRD's group/version is ignite.weave.works/v1alpha4 and the kind is Image:
# cd /var/lib/firecracker/image/669a5721d130ef1d
# ll
-rw-r--r--. 1 root root 295698432 Jul 14 10:53 image.ext4
-rw-r--r--. 1 root root 464 Jul 14 10:53 metadata.json
# cat metadata.json
{
"kind": "Image",
"apiVersion": "ignite.weave.works/v1alpha4",
"metadata": {
"name": "weaveworks/ignite-ubuntu:latest",
"uid": "669a5721d130ef1d",
"created": "2023-07-14T02:53:01Z"
},
"spec": {
"oci": "weaveworks/ignite-ubuntu:latest"
},
"status": {
"ociSource": {
"id": "oci://docker.io/weaveworks/ignite-ubuntu@sha256:52414720f26c808bc1273845c6d0f0a99472dfa8eaf8df52429261cbac27f1ba",
"size": "249308KB"
}
}
}
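The image object is defined as follows. It is a standard Kubernetes-style object (TypeMeta, ObjectMeta, Spec, Status) that maps directly onto the metadata.json above: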
type Image struct {
runtime.TypeMeta `json:",inline"`
// runtime.ObjectMeta is also embedded into the struct, and defines the human-readable name, and the machine-readable ID
// Name is available at the .metadata.name JSON path
// ID is available at the .metadata.uid JSON path (the Go type is k8s.io/apimachinery/pkg/types.UID, which is only a typed string)
runtime.ObjectMeta `json:"metadata"`
Spec ImageSpec `json:"spec"`
Status ImageStatus `json:"status"`
}
// ImageSpec declares what the image contains
type ImageSpec struct {
OCI meta.OCIImageRef `json:"oci"`
}
type OCIImageSource struct {
// ID defines the source's content ID (e.g. the canonical OCI path or Docker image ID)
ID *meta.OCIContentID `json:"id"`
// Size defines the size of the source in bytes
Size meta.Size `json:"size"`
}
// ImageStatus defines the status of the image
type ImageStatus struct {
// OCISource contains the information about how this OCI image was imported
OCISource OCIImageSource `json:"ociSource"`
}
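To make the mapping between the struct and metadata.json concrete, here is a minimal sketch that decodes a shortened copy of the file with encoding/json and splits apiVersion into group and version. The imageMeta type is a trimmed-down stand-in written for this example, not Ignite's api.Image, and parseImageMeta is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// sampleMetadata is a shortened copy of the metadata.json shown earlier.
const sampleMetadata = `{
  "kind": "Image",
  "apiVersion": "ignite.weave.works/v1alpha4",
  "metadata": {"name": "weaveworks/ignite-ubuntu:latest", "uid": "669a5721d130ef1d"},
  "spec": {"oci": "weaveworks/ignite-ubuntu:latest"}
}`

// imageMeta is a trimmed-down stand-in for the Image type above.
type imageMeta struct {
	Kind       string `json:"kind"`
	APIVersion string `json:"apiVersion"`
	Metadata   struct {
		Name string `json:"name"`
		UID  string `json:"uid"`
	} `json:"metadata"`
	Spec struct {
		OCI string `json:"oci"`
	} `json:"spec"`
}

// parseImageMeta decodes the JSON and splits apiVersion at the first "/"
// into its group and version parts.
func parseImageMeta(data []byte) (m imageMeta, group, version string, err error) {
	if err = json.Unmarshal(data, &m); err != nil {
		return
	}
	group, version, _ = strings.Cut(m.APIVersion, "/")
	return
}

func main() {
	m, group, version, err := parseImageMeta([]byte(sampleMetadata))
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Kind, group, version, m.Metadata.UID)
}
```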
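Creating the base filesystem file
As noted above, Ignite's Storage saves the image's metadata, while the image contents are imported into the filesystem created with mkfs.ext4. The process is:
- Locate the image.ext4 file created by mkfs, create a temporary directory, and mount image.ext4 onto it (a loop mount; /dev/loop presents the file as a block device).
- Export the image as a tar archive (similar to docker export) and extract it into the temporary directory.
- Set up /etc/resolv.conf, mainly to make sure the file exists.
- Unmount and delete the temporary directory. This completes the base image filesystem file.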
func addFiles(img *api.Image, src source.Source) (err error) {
log.Debugf("Copying in files to the image file from a source...")
p := path.Join(img.ObjectPath(), constants.IMAGE_FS) // path of the ext4 file created by mkfs
tempDir, err := ioutil.TempDir("", "") // create a temporary directory
if err != nil {
return
}
defer os.RemoveAll(tempDir)
if _, err := util.ExecuteCommand("mount", "-o", "loop", p, tempDir); err != nil { // mount the ext4 filesystem on the temporary directory
return fmt.Errorf("failed to mount image %q: %v", p, err)
}
defer util.DeferErr(&err, func() error {
_, execErr := util.ExecuteCommand("umount", tempDir)
return execErr
})
err = source.TarExtract(src, tempDir) // extract the base image into the temporary directory
if err != nil {
return
}
err = setupResolvConf(tempDir) // make sure /etc/resolv.conf exists
return
}
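addFiles relies on util.DeferErr so that a failing umount is reported without hiding an earlier error. The helper's source is not shown here, so the version below is an assumed reimplementation of that idea rather than Ignite's code: the cleanup always runs, and its error is stored only when the function has not already failed.

```go
package main

import (
	"errors"
	"fmt"
)

// errCleanup is a sample cleanup failure used by the demo below.
var errCleanup = errors.New("umount failed")

// deferErr runs cleanup and stores its error in *err only if *err is still
// nil, so the first failure is the one the caller sees.
func deferErr(err *error, cleanup func() error) {
	if cerr := cleanup(); cerr != nil && *err == nil {
		*err = cerr
	}
}

// work simulates a function with a deferred cleanup step. The named return
// value lets the deferred call adjust the error after the body returns.
func work(fail bool, cleanupErr error) (err error) {
	defer deferErr(&err, func() error { return cleanupErr })
	if fail {
		return errors.New("work failed")
	}
	return nil
}

func main() {
	fmt.Println(work(false, nil))        // body and cleanup both succeed
	fmt.Println(work(false, errCleanup)) // only the cleanup fails
	fmt.Println(work(true, errCleanup))  // body fails first, its error is kept
}
```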
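Building the VM kernel files
As with the base filesystem file, building the kernel files requires pulling the kernel image, and likewise needs "Creating the containerdClient" and "Creating the cniInstance". The difference is that the kind in the GVK is Kernel here.
The kernel files are created as shown below. The process broadly mirrors creating the base filesystem, but not every file in the kernel image is needed: only the /boot and /lib/modules directories are extracted, and /boot must contain the vmlinux file (vmlinux is required for KVM to create the VM). The steps are:
- Find the kernel image (locally, or by pulling it from the remote).
- Create the kernel object and set its parameters, such as name and UID.
- Extract the /boot and /lib/modules directories from the kernel image.
- Copy the vmlinux file into the kernel's directory under constants.DATA_DIR.
- Pack the extracted files into kernel.tar in the same directory, to be merged with the base filesystem later. The directory looks like this: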
$ pwd
/var/lib/firecracker/kernel/1bdd3b2354873157
$ ll
-rw-r--r--. 1 root root 73574400 Jul 14 10:53 kernel.tar
-rw-r--r--. 1 root root 492 Jul 14 10:53 metadata.json
-rwxr-xr-x. 1 root root 43526368 Jul 14 10:53 vmlinux
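Because the kernel files will later be placed into the filesystem, no separate filesystem is needed for them; the required files are simply copied and packed locally. During "Create vm", the packed kernel files are extracted into the base filesystem to merge the two: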
// importKernel imports a kernel from an OCI image
func importKernel(c *client.Client, ociRef meta.OCIImageRef) (*api.Kernel, error) {
log.Debugf("Importing kernel with ociRef %q", ociRef)
// Parse the source
dockerSource := source.NewDockerSource()
src, err := dockerSource.Parse(ociRef) // load the image from containerd locally or from the remote
if err != nil {
return nil, err
}
kernel := c.Kernels().New() // initialize a kernel object
// Set the kernel name
kernel.Name = ociRef.String()
// Set the kernel's ociRef
kernel.Spec.OCI = ociRef
// Set the kernel's ociSource
kernel.Status.OCISource = *src
// Generate UID automatically
if err := metadata.SetNameAndUID(kernel, c); err != nil { // set the kernel object's UID
return nil, err
}
// Cache the kernel contents in the kernel tar file
kernelTarFile := path.Join(kernel.ObjectPath(), constants.KERNEL_TAR)
// vmlinuxFile describes the uncompressed kernel file at /var/lib/firecracker/kernel/<id>/vmlinux
vmlinuxFile := path.Join(kernel.ObjectPath(), constants.KERNEL_FILE)
// Create both the kernel tar file and the vmlinux file it either doesn't exist
if !util.FileExists(kernelTarFile) || !util.FileExists(vmlinuxFile) {
// Create a temporary directory for extracting
// the necessary files from the OCI image
tempDir, err := ioutil.TempDir("", "")
if err != nil {
return nil, err
}
// Extract only the /boot and /lib directories of the tar stream into the tempDir
err = source.TarExtract(dockerSource, tempDir, "boot", "lib/modules") // extract the needed kernel files into the temporary directory
if err != nil {
return nil, err
}
// Locate the kernel file in the temporary directory
kernelTmpFile, err := findKernel(tempDir) // locate the vmlinux file
if err != nil {
return nil, err
}
// Copy the vmlinux file
if err := util.CopyFile(kernelTmpFile, vmlinuxFile); err != nil {
return nil, fmt.Errorf("failed to copy kernel file %q to kernel %q: %v", kernelTmpFile, kernel.GetUID(), err)
}
// Pack the extracted kernel files into /var/lib/firecracker/kernel/<UID>/kernel.tar
if _, err := util.ExecuteCommand("tar", "-cf", kernelTarFile, "-C", tempDir, "."); err != nil {
return nil, err
}
// Remove the temporary directory
if err := os.RemoveAll(tempDir); err != nil {
return nil, err
}
}
// Populate the kernel version field if possible
if len(kernel.Status.Version) == 0 {
cmd := fmt.Sprintf("strings %s | grep 'Linux version' | awk '{print $3}'", vmlinuxFile)
// Use the pipefail option to return an error if any of the pipeline commands is not available
out, err := util.ExecuteCommand("/bin/bash", "-o", "pipefail", "-c", cmd)
if err != nil {
kernel.Status.Version = "<unknown>"
} else {
kernel.Status.Version = out
}
}
if err := c.Kernels().Set(kernel); err != nil { // save the kernel object to Storage
return nil, err
}
log.Infof("Imported OCI image %q (%s) to kernel image with UID %q", ociRef, kernel.Status.OCISource.Size, kernel.GetUID())
return kernel, nil
}
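The selective extraction step (keeping only boot and lib/modules) can be illustrated in pure Go with archive/tar. This is an analogue written for this walkthrough, not how source.TarExtract is implemented: it works on an in-memory archive, keeps file contents in a map instead of writing to disk, and the names wanted, selectEntries, and buildTar are all hypothetical.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"path"
	"strings"
)

// wanted reports whether name lies at or below one of the given directories.
func wanted(name string, dirs []string) bool {
	name = strings.TrimPrefix(path.Clean("/"+name), "/")
	for _, d := range dirs {
		if name == d || strings.HasPrefix(name, d+"/") {
			return true
		}
	}
	return false
}

// selectEntries reads a tar stream and returns the contents of the regular
// files located under dirs, keyed by their name in the archive.
func selectEntries(r io.Reader, dirs ...string) (map[string][]byte, error) {
	out := map[string][]byte{}
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		if hdr.Typeflag != tar.TypeReg || !wanted(hdr.Name, dirs) {
			continue
		}
		data, err := io.ReadAll(tr)
		if err != nil {
			return nil, err
		}
		out[hdr.Name] = data
	}
}

// buildTar packs name -> content pairs into an in-memory tar archive.
func buildTar(files map[string]string) (*bytes.Buffer, error) {
	buf := &bytes.Buffer{}
	tw := tar.NewWriter(buf)
	for name, content := range files {
		hdr := &tar.Header{Name: name, Mode: 0o644, Size: int64(len(content)), Typeflag: tar.TypeReg}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write([]byte(content)); err != nil {
			return nil, err
		}
	}
	return buf, tw.Close()
}

func main() {
	buf, err := buildTar(map[string]string{
		"boot/vmlinux":                   "kernel",
		"lib/modules/5.10.51/modules.dep": "deps",
		"etc/passwd":                     "root",
	})
	if err != nil {
		panic(err)
	}
	got, err := selectEntries(buf, "boot", "lib/modules")
	if err != nil {
		panic(err)
	}
	for name := range got {
		fmt.Println("kept:", name)
	}
}
```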
The kernel image's metadata is:
# cat metadata.json
{
"kind": "Kernel",
"apiVersion": "ignite.weave.works/v1alpha4",
"metadata": {
"name": "weaveworks/ignite-kernel:5.10.51",
"uid": "1bdd3b2354873157",
"created": "2023-07-14T02:53:10Z"
},
"spec": {
"oci": "weaveworks/ignite-kernel:5.10.51"
},
"status": {
"version": "5.10.51",
"ociSource": {
"id": "oci://docker.io/weaveworks/ignite-kernel@sha256:a992aa9f7b6f5e7945e72610017c3f4f38338ff1452964e30410bb6110a794a7",
"size": "72588KB"
}
}
}
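The version field above is filled in by the strings | grep 'Linux version' | awk '{print $3}' pipeline in importKernel. The sketch below is a pure-Go approximation of the grep and awk stages only, applied to text that has already been extracted; kernelVersion is a hypothetical helper, and unlike awk it stops at the first matching line.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// kernelVersion returns the third whitespace-separated field of the first
// line containing "Linux version", or "<unknown>" if there is no such line.
func kernelVersion(text string) string {
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, "Linux version") {
			continue
		}
		if fields := strings.Fields(line); len(fields) >= 3 {
			return fields[2]
		}
	}
	return "<unknown>"
}

func main() {
	// The banner text is made up for the example.
	banner := "some other string\nLinux version 5.10.51 (builder@host) #1 SMP\n"
	fmt.Println(kernelVersion(banner))
	fmt.Println(kernelVersion("no banner here"))
}
```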
The kernel object is defined as follows, matching the metadata.json above:
type Kernel struct {
runtime.TypeMeta `json:",inline"`
// runtime.ObjectMeta is also embedded into the struct, and defines the human-readable name, and the machine-readable ID
// Name is available at the .metadata.name JSON path
// ID is available at the .metadata.uid JSON path (the Go type is k8s.io/apimachinery/pkg/types.UID, which is only a typed string)
runtime.ObjectMeta `json:"metadata"`
Spec KernelSpec `json:"spec"`
Status KernelStatus `json:"status"`
}
// KernelSpec describes the properties of a kernel
type KernelSpec struct {
OCI meta.OCIImageRef `json:"oci"`
// Optional future feature, support per-kernel specific default command lines
// DefaultCmdLine string
}
// KernelStatus describes the status of a kernel
type KernelStatus struct {
Version string `json:"version"`
OCISource OCIImageSource `json:"ociSource"`
}
Create vm
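The command for creating a VM is ignite vm create. This step only prepares the VM for startup; to actually start it, you also need to run ignite vm start.
Configuring the VM object
First a VM object is initialized, which involves:
- Setting the VM object's image, runtime, and network
- Merging in custom parameters passed on the command line
- Validating the VM object
- Trying to pull the base image and kernel image, and setting the image and kernel information on the VM object
To run a VM, either execute ignite vm create followed by ignite vm start, or execute ignite vm run directly.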
func (cf *CreateFlags) NewCreateOptions(args []string, fs *flag.FlagSet) (*CreateOptions, error) {
// Create a new base VM and configure it by combining the component config,
// VM config file and flags.
baseVM := providers.Client.VMs().New() // initialize a VM object
// If component config is in use, set the VMDefaults on the base VM.
if providers.ComponentConfig != nil {
baseVM.Spec = providers.ComponentConfig.Spec.VMDefaults
}
// Resolve registry configuration used for pulling image if required.
cmdutil.ResolveRegistryConfigDir()
// Initialize the VM's Prefixer
baseVM.Status.IDPrefix = providers.IDPrefix // set the VM object's basic information
// Set the runtime and network-plugin on the VM, then override the global config.
baseVM.Status.Runtime.Name = providers.RuntimeName // record the runtime and network plugin names
baseVM.Status.Network.Plugin = providers.NetworkPluginName
// Populate the runtime and network-plugin providers.
if err := config.SetAndPopulateProviders(providers.RuntimeName, providers.NetworkPluginName); err != nil {
return nil, err
}
// Set the passed image argument on the new VM spec.
// Image is necessary while serializing the VM spec.
if len(args) == 1 {
ociRef, err := meta.NewOCIImageRef(args[0])
if err != nil {
return nil, err
}
baseVM.Spec.Image.OCI = ociRef
}
// Generate a VM name and UID if not set yet.
if err := metadata.SetNameAndUID(baseVM, providers.Client); err != nil { // set the VM's UID and name
return nil, err
}
// Apply the VM config on the base VM, if a VM config is given.
if len(cf.ConfigFile) != 0 { // if a config file was given for the VM object, merge it into the VM object
if err := applyVMConfigFile(baseVM, cf.ConfigFile); err != nil {
return nil, err
}
}
// Apply flag overrides.
if err := applyVMFlagOverrides(baseVM, cf, fs); err != nil { // override the VM object with command-line flags
return nil, err
}
// If --require-name is true, VM name must be provided.
if cf.RequireName && len(baseVM.Name) == 0 {
return nil, fmt.Errorf("must set VM name, flag --require-name set")
}
// Assign the new VM to the configFlag.
cf.VM = baseVM
// Validate the VM object.
if err := validation.ValidateVM(cf.VM).ToAggregate(); err != nil { // validate the VM object
return nil, err
}
co := &CreateOptions{CreateFlags: cf}
// The code below pulls the base image and kernel image, equivalent to ignite image import and ignite kernel import
// Get the image, or import it if it doesn't exist.
var err error
co.image, err = operations.FindOrImportImage(providers.Client, cf.VM.Spec.Image.OCI)
if err != nil {
return nil, err
}
// Populate relevant data from the Image on the VM object.
cf.VM.SetImage(co.image) // set the VM object's image information
// Get the kernel, or import it if it doesn't exist.
co.kernel, err = operations.FindOrImportKernel(providers.Client, cf.VM.Spec.Kernel.OCI)
if err != nil {
return nil, err
}
// Populate relevant data from the Kernel on the VM object.
cf.VM.SetKernel(co.kernel) // set the VM object's kernel metadata
return co, nil
}
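The layering in NewCreateOptions (component-config defaults, then the VM config file, then command-line flags) can be sketched with a toy spec type. This is an illustration under assumptions: vmSpec, overlay, and resolve are hypothetical, and "non-zero field wins" is a simplification chosen for the example; how Ignite actually decides whether a flag was set is not shown in this excerpt.

```go
package main

import "fmt"

// vmSpec is a trimmed-down, hypothetical stand-in for Ignite's VMSpec.
type vmSpec struct {
	CPUs     uint64
	Memory   string
	DiskSize string
}

// overlay copies every non-zero field of src onto dst.
func overlay(dst *vmSpec, src vmSpec) {
	if src.CPUs != 0 {
		dst.CPUs = src.CPUs
	}
	if src.Memory != "" {
		dst.Memory = src.Memory
	}
	if src.DiskSize != "" {
		dst.DiskSize = src.DiskSize
	}
}

// resolve applies the layers in increasing priority:
// defaults first, then the config file, then the flags.
func resolve(defaults, file, flags vmSpec) vmSpec {
	out := defaults
	overlay(&out, file)
	overlay(&out, flags)
	return out
}

func main() {
	got := resolve(
		vmSpec{CPUs: 1, Memory: "512MB", DiskSize: "4GB"}, // defaults
		vmSpec{CPUs: 2},                                   // config file
		vmSpec{Memory: "1GB"},                             // flags
	)
	fmt.Printf("%+v\n", got)
}
```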
The VM object's metadata is:
$ cat metadata.json
{
"kind": "VM",
"apiVersion": "ignite.weave.works/v1alpha4",
"metadata": {
"name": "restless-waterfall",
"uid": "ddf49307b5b27c34",
"created": "2023-07-18T08:33:25Z"
},
"spec": {
"image": {
"oci": "weaveworks/ignite-ubuntu:latest"
},
"sandbox": {
"oci": "weaveworks/ignite:v0.10.0"
},
"kernel": {
"oci": "weaveworks/ignite-kernel:5.10.51",
"cmdLine": "console=ttyS0 reboot=k panic=1 pci=off ip=dhcp"
},
"cpus": 1,
"memory": "512MB",
"diskSize": "4GB",
"network": {
},
"storage": {
},
"ssh": true
},
"status": {
"running": true,
"runtime": {
"id": "ignite-ddf49307b5b27c34",
"name": "containerd"
},
"startTime": "2023-07-18T08:33:25Z",
"network": {
"plugin": "cni",
"ipAddresses": [
"10.61.0.3"
]
},
"image": {
"id": "oci://docker.io/weaveworks/ignite-ubuntu@sha256:52414720f26c808bc1273845c6d0f0a99472dfa8eaf8df52429261cbac27f1ba",
"size": "249308KB"
},
"kernel": {
"id": "oci://docker.io/weaveworks/ignite-kernel@sha256:a992aa9f7b6f5e7945e72610017c3f4f38338ff1452964e30410bb6110a794a7",
"size": "72588KB"
},
"idPrefix": "ignite"
}
}
The VM object is defined as follows, matching the metadata.json above:
type VM struct {
runtime.TypeMeta `json:",inline"`
// runtime.ObjectMeta is also embedded into the struct, and defines the human-readable name, and the machine-readable ID
// Name is available at the .metadata.name JSON path
// ID is available at the .metadata.uid JSON path (the Go type is k8s.io/apimachinery/pkg/types.UID, which is only a typed string)
runtime.ObjectMeta `json:"metadata"`
Spec VMSpec `json:"spec"`
Status VMStatus `json:"status"`
}
// VMSpec describes the configuration of a VM
type VMSpec struct {
Image VMImageSpec `json:"image"`
Sandbox VMSandboxSpec `json:"sandbox"`
Kernel VMKernelSpec `json:"kernel"`
CPUs uint64 `json:"cpus"`
Memory meta.Size `json:"memory"`
DiskSize meta.Size `json:"diskSize"`
// TODO: Implement working omitempty without pointers for the following entries
// Currently both will show in the JSON output as empty arrays. Making them
// pointers requires plenty of nil checks (as their contents are accessed directly)
// and is very risky for stability. APIMachinery potentially has a solution.
Network VMNetworkSpec `json:"network,omitempty"`
Storage VMStorageSpec `json:"storage,omitempty"`
// This will be done at either "ignite start" or "ignite create" time
// TODO: We might revisit this later
CopyFiles []FileMapping `json:"copyFiles,omitempty"`
// SSH specifies how the SSH setup should be done
// nil here means "don't do anything special"
// If SSH.Generate is set, Ignite will generate a new SSH key and copy it in to authorized_keys in the VM
// Specifying a path in SSH.Generate means "use this public key"
// If SSH.PublicKey is set, this struct will marshal as a string using that path
// If SSH.Generate is set, this struct will marshal as a bool => true
SSH *SSH `json:"ssh,omitempty"`
}
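The comments on the SSH field describe a value that serializes either as a string (a public key path) or as a bool (generate a key), which is why the metadata above shows "ssh": true. A minimal sketch of that dual encoding follows. The sshSpec type and its methods are written for this example and are not Ignite's implementation; in particular, letting PublicKey take precedence when both fields are set is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sshSpec is a hypothetical stand-in for the SSH type referenced by VMSpec.
type sshSpec struct {
	Generate  bool
	PublicKey string
}

// MarshalJSON emits the public key path as a JSON string when one is set,
// and otherwise the Generate flag as a JSON bool.
func (s sshSpec) MarshalJSON() ([]byte, error) {
	if s.PublicKey != "" {
		return json.Marshal(s.PublicKey)
	}
	return json.Marshal(s.Generate)
}

// UnmarshalJSON accepts either a JSON bool or a JSON string.
func (s *sshSpec) UnmarshalJSON(data []byte) error {
	var generate bool
	if err := json.Unmarshal(data, &generate); err == nil {
		*s = sshSpec{Generate: generate}
		return nil
	}
	var key string
	if err := json.Unmarshal(data, &key); err != nil {
		return fmt.Errorf("ssh must be a bool or a string: %w", err)
	}
	*s = sshSpec{PublicKey: key}
	return nil
}

// encode returns the JSON text for s.
func encode(s sshSpec) (string, error) {
	b, err := json.Marshal(s)
	return string(b), err
}

// decode parses JSON text into an sshSpec.
func decode(text string) (sshSpec, error) {
	var s sshSpec
	err := json.Unmarshal([]byte(text), &s)
	return s, err
}

func main() {
	a, _ := encode(sshSpec{Generate: true})
	b, _ := encode(sshSpec{PublicKey: "/root/.ssh/id_rsa.pub"})
	fmt.Println(a, b)
	back, err := decode(b)
	fmt.Printf("%+v %v\n", back, err)
}
```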
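Creating the VM filesystem
So far we have created the base filesystem file, extracted the necessary kernel files, and created a VM object, but a VM still needs a complete filesystem. "Building the VM filesystem" above only produced the base filesystem file and the kernel files separately; they still have to be merged into one complete filesystem.
Below is the entry point of the ignite vm create command. It first sets the VM's UID and name (already done in "Configuring the VM object"; here it mainly makes sure they exist) and its labels, then saves the VM to Storage and creates the VM's filesystem: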
func Create(co *CreateOptions) (err error) {
// Generate a random UID and Name
if err = metadata.SetNameAndUID(co.VM, providers.Client); err != nil {
return
}
// Set VM labels.
if err = metadata.SetLabels(co.VM, co.Labels); err != nil {
return
}
defer util.DeferErr(&err, func() error { return metadata.Cleanup(co.VM, false) })
if err = providers.Client.VMs().Set(co.VM); err != nil { // store the VM object in Storage
return
}
// Allocate and populate the overlay file
if err = dmlegacy.AllocateAndPopulateOverlay(co.VM); err != nil { // create the VM filesystem
return
}
err = metadata.Success(co.VM)
return
}
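AllocateAndPopulateOverlay is the entry point for building the filesystem, and ultimately produces a devicemapper device:
- Use the image reference (name:tag) in the VM to find the image's UID, which locates the local /var/lib/firecracker/image/<UID>/image.ext4.
- Use that UID to locate the base image's filesystem file, /var/lib/firecracker/image/<imageUID>/image.ext4, and read its size (used when sizing the overlay). The file later serves as the origin device of a devicemapper snapshot (devicemapper snapshots are introduced below).
- Create the directory /var/lib/firecracker/vm/<vmUID> and the file /var/lib/firecracker/vm/<vmUID>/overlay.dm. The size of overlay.dm comes from the command line or the VM config file and may not be smaller than image.ext4. The file later serves as the COW device of the devicemapper snapshot.
- Create a snapshot-type devicemapper device from image.ext4 and overlay.dm; at this point the snapshot contains the base filesystem.
- Extract the kernel files into the snapshot to merge them in, and adjust other settings of the VM filesystem such as hostname and DNS. This completes the VM filesystem.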
func AllocateAndPopulateOverlay(vm *api.VM) error {
requestedSize := vm.Spec.DiskSize.Bytes()
// Truncate only accepts an int64
if requestedSize > math.MaxInt64 {
return fmt.Errorf("requested size %d too large, cannot truncate", requestedSize)
}
size := int64(requestedSize)
// Get the base image's UID, used to find image.ext4 under /var/lib/firecracker/image
imageUID, err := lookup.ImageUIDForVM(vm, providers.Client)
if err != nil {
return err
}
// Get the size of the image ext4 file
fi, err := os.Stat(path.Join(constants.IMAGE_DIR, imageUID.String(), constants.IMAGE_FS)) // locate image.ext4
if err != nil {
return err
}
imageSize := fi.Size()
// The overlay needs to be at least as large as the image
if size < imageSize { // raise the overlay.dm size to at least the image size
log.Warnf("warning: requested overlay size (%s) < image size (%s), using image size for overlay\n",
vm.Spec.DiskSize.String(), meta.NewSizeFromBytes(uint64(imageSize)).String())
size = imageSize
}
// Make sure the all directories above the snapshot directory exists
if err := os.MkdirAll(path.Dir(vm.OverlayFile()), constants.DATA_DIR_PERM); err != nil {
return err
}
overlayFile, err := os.Create(vm.OverlayFile()) // create the VM's overlay file
if err != nil {
return fmt.Errorf("failed to create overlay file for %q, %v", vm.GetUID(), err)
}
defer overlayFile.Close()
if err := overlayFile.Truncate(size); err != nil { // set the overlay file's size
return fmt.Errorf("failed to allocate overlay file for VM %q: %v", vm.GetUID(), err)
}
// populate the filesystem
return copyToOverlay(vm) // create the snapshot-type devicemapper device
}
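The sizing rule in AllocateAndPopulateOverlay is easy to isolate: reject sizes that do not fit into an int64, and never let the overlay be smaller than the image. The sketch below restates just that rule; overlaySize is a hypothetical helper name, and the sample sizes are the diskSize and image.ext4 size from the listings above.

```go
package main

import (
	"fmt"
	"math"
)

// overlaySize returns the size to use for overlay.dm: the requested size,
// raised to the image size if it is smaller. Requests that do not fit into
// an int64 are rejected, because truncation takes an int64.
func overlaySize(requested uint64, imageSize int64) (int64, error) {
	if requested > math.MaxInt64 {
		return 0, fmt.Errorf("requested size %d too large, cannot truncate", requested)
	}
	size := int64(requested)
	if size < imageSize {
		size = imageSize
	}
	return size, nil
}

func main() {
	const imageSize = 295698432 // size of image.ext4 in the listing above

	size, err := overlaySize(4<<30, imageSize) // a 4 GiB request stays as is
	fmt.Println(size, err)

	size, err = overlaySize(100<<20, imageSize) // 100 MiB is raised to the image size
	fmt.Println(size, err)
}
```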
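copyToOverlay is implemented as follows. ActivateSnapshot creates the devicemapper snapshot storage the VM runs on; everything else adjusts the VM filesystem, such as importing the kernel files and setting up SSH.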
func copyToOverlay(vm *api.VM) (err error) {
_, err = ActivateSnapshot(vm) // create the devicemapper snapshot storage that the VM boots from
if err != nil {
return
}
defer util.DeferErr(&err, func() error { return DeactivateSnapshot(vm) })
mp, err := util.Mount(vm.SnapshotDev()) // mount the snapshot storage
if err != nil {
return
}
defer util.DeferErr(&err, mp.Umount)
// Copy the kernel files to the VM. TODO: Use snapshot overlaying instead.
// Extract /var/lib/firecracker/kernel/<UID>/kernel.tar into the mount path, merging it with the base filesystem
if err = copyKernelToOverlay(vm, mp.Path); err != nil {
return
}
// do not mutate vm.Spec.CopyFiles
fileMappings := vm.Spec.CopyFiles
if vm.Spec.SSH != nil { // SSH is enabled: import the given public key, or generate a new key pair for the VM
pubKeyPath := vm.Spec.SSH.PublicKey
if vm.Spec.SSH.Generate {
// generate a key if PublicKey is empty
pubKeyPath, err = newSSHKeypair(vm)
if err != nil {
return
}
}
if len(pubKeyPath) > 0 {
fileMappings = append(fileMappings, api.FileMapping{
HostPath: pubKeyPath,
VMPath: vmAuthorizedKeys,
})
}
}
// TODO: File/directory permissions?
for _, mapping := range fileMappings { // host-to-VM file mappings are handled by copying
vmFilePath := path.Join(mp.Path, mapping.VMPath)
if err = os.MkdirAll(path.Dir(vmFilePath), constants.DATA_DIR_PERM); err != nil {
return
}
if err = util.CopyFile(mapping.HostPath, vmFilePath); err != nil {
return
}
}
ip := net.IP{127, 0, 0, 1}
if len(vm.Status.Network.IPAddresses) > 0 {
ip = vm.Status.Network.IPAddresses[0]
}
// Write /etc/hosts for the VM (maps the VM's hostname to its IP address)
if err = writeEtcHosts(mp.Path, vm.GetUID().String(), ip); err != nil {
return
}
// Write the UID to /etc/hostname for the VM (sets the VM's hostname)
if err = writeEtcHostname(mp.Path, vm.GetUID().String()); err != nil {
return
}
// Populate /etc/fstab with the VM's volume mounts (the volumes defined in vm.Spec.Storage)
if err = populateFstab(vm, mp.Path); err != nil {
return
}
// Set overlay root permissions
err = os.Chmod(mp.Path, constants.DATA_DIR_PERM)
return
}
ActivateSnapshot creates the devicemapper block device used by the VM. The main steps are:
- Use losetup to attach the image file image.ext4 to a /dev/loop device, so that the file can be used as a block device holding a filesystem.
- Use losetup to attach the overlay file overlay.dm to another /dev/loop device. Running losetup lists the attached devices:
$ losetup
NAME SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE DIO LOG-SEC
/dev/loop1 0 0 1 1 /var/lib/firecracker/image/669a5721d130ef1d/image.ext4 0 512
/dev/loop2 0 0 1 0 /var/lib/firecracker/vm/ddf49307b5b27c34/overlay.dm 0 512
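- If the overlay loop device is larger than the image loop device, the image loop device has to be extended (an upstream requirement). This is done by creating a linear-type devicemapper device that maps the image loop device, and extending that dm device with a zero-type target. A linear dm target can join several storage devices together or split one device into several dm devices. The extension looks like this (each line of the table is read from stdin):
$ dmsetup create test-snapshot <<EOF
0 8388608 linear /dev/loop0 0
8388608 12582912 zero
EOF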
- Use dmsetup to create a snapshot-type devicemapper device as shown below, with the image loop device (or its extended base device) as the origin device and the overlay loop device as the COW device:
$ dmsetup create ignite-ddf49307b5b27c34 --table '0 8388608 snapshot /dev/{loop0,mapper/ignite-<uid>-base} /dev/loop1 P 8'
The created dm devices can be listed with:
$ dmsetup status
ignite-ddf49307b5b27c34: 0 8388608 snapshot 274328/8388608 1080 # the snapshot device that was created
ignite-ddf49307b5b27c34-base: 0 577536 linear # devicemapper device created to extend the image loop device; it maps to the image loop device
ignite-ddf49307b5b27c34-base: 577536 8388608 zero # zero target that extends ignite-ddf49307b5b27c34-base
The upstream documentation describes the snapshot target as follows: data written to the snapshot goes only to the COW device, while reads come from the COW device for changed chunks and from the origin device for unchanged data. It also notes that the COW device is often (not necessarily) smaller than the origin.
*) snapshot <origin> <COW device> <persistent?> <chunksize>
A snapshot of the <origin> block device is created. Changed chunks of <chunksize> sectors will be stored on the <COW device>. Writes will only go to the <COW device>. Reads will come from the <COW device> or from <origin> for unchanged data. <COW device> will often be smaller than the origin and if it fills up the snapshot will become useless and be disabled, returning errors. So it is important to monitor the amount of free space and expand the <COW device> before it fills up.
<persistent?> is P (Persistent) or N (Not persistent - will not survive after reboot). O (Overflow) can be added as a persistent store option to allow userspace to advertise its support for seeing "Overflow" in the snapshot status. So supported store types are "P", "PO" and "N".
A snapshot-type dm device needs two parts: an origin device, which is read-only, and a COW device, which is readable and writable. The example below shows how this device type is used:
$ mkdir -p /tmp/mnt
# Copy a VM image file and attach it to /dev/loop5
$ cp /var/lib/firecracker/image/669a5721d130ef1d/image.ext4 /home
$ losetup /dev/loop5 image.ext4
# Create an overlay file of the same size as the VM image and attach it to /dev/loop6
$ dd if=/dev/zero of=overlay.dm bs=512 count=577536
$ losetup /dev/loop6 overlay.dm
# Get the device size in 512-byte sectors
$ blockdev --getsz /dev/loop5
577536
# Create the snapshot device and mount it on /tmp/mnt
$ dmsetup create test-snapshot --table '0 577536 snapshot /dev/loop5 /dev/loop6 P 8'
$ mount /dev/mapper/test-snapshot /tmp/mnt
# Detach the loop devices now, so that they are released automatically once the snapshot device is removed
$ losetup -d /dev/loop5
$ losetup -d /dev/loop6
The mounted directory contains a complete Linux filesystem. Modifying this filesystem does not affect the VM image (after making changes, unmount the snapshot device and mount the image's /dev/loop5 on its own: it shows no changes; recreating the snapshot brings the changes back):
$ ll /tmp/mnt/
total 76
lrwxrwxrwx. 1 root root 7 Oct 7 2021 bin -> usr/bin
drwxr-xr-x. 2 root root 4096 Apr 15 2020 boot
drwxr-xr-x. 2 root root 4096 Oct 7 2021 dev
drwxr-xr-x. 52 root root 4096 Jul 14 10:52 etc
drwxr-xr-x. 2 root root 4096 Apr 15 2020 home
lrwxrwxrwx. 1 root root 7 Oct 7 2021 lib -> usr/lib
lrwxrwxrwx. 1 root root 9 Oct 7 2021 lib32 -> usr/lib32
lrwxrwxrwx. 1 root root 9 Oct 7 2021 lib64 -> usr/lib64
lrwxrwxrwx. 1 root root 10 Oct 7 2021 libx32 -> usr/libx32
drwx------. 2 root root 16384 Jul 14 10:52 lost+found
drwxr-xr-x. 2 root root 4096 Oct 7 2021 media
drwxr-xr-x. 2 root root 4096 Oct 7 2021 mnt
drwxr-xr-x. 2 root root 4096 Oct 7 2021 opt
drwxr-xr-x. 2 root root 4096 Apr 15 2020 proc
drwx------. 2 root root 4096 Oct 7 2021 root
drwxr-xr-x. 8 root root 4096 Nov 9 2021 run
lrwxrwxrwx. 1 root root 8 Oct 7 2021 sbin -> usr/sbin
drwxr-xr-x. 2 root root 4096 Oct 7 2021 srv
drwxr-xr-x. 2 root root 4096 Apr 15 2020 sys
drwxrwxrwt. 2 root root 4096 Oct 7 2021 tmp
drwxr-xr-x. 13 root root 4096 Oct 7 2021 usr
drwxr-xr-x. 11 root root 4096 Oct 7 2021 var
Clean up the environment:
$ umount /tmp/mnt
$ dmsetup remove test-snapshot
- Use e2fsck to repair any filesystem errors that may exist:
$ e2fsck -p -f /dev/mapper/<snapshot>
- Use losetup to detach the image and overlay loop devices, so that the underlying loop devices are removed automatically once the snapshot is removed:
$ losetup -d /dev/loop0
$ losetup -d /dev/loop1
This completes the storage for the VM filesystem; afterwards it only needs to be mounted to be usable by the VM.
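The dmsetup tables used in the steps above follow a fixed "start length target args" layout, counted in 512-byte sectors. The sketch below builds such table strings in Go. The function names are hypothetical and this is not ignite's implementation; in particular, the sketch pads the base device to exactly the overlay size, which is an assumption of the example.

```go
package main

import "fmt"

// baseTable describes a base device made of the image (linear target)
// followed by zero-filled padding (zero target). Each table line is
// "<start sector> <length in sectors> <target> <args>".
func baseTable(imageSectors, overlaySectors int64, imageLoop string) string {
	pad := overlaySectors - imageSectors
	return fmt.Sprintf("0 %d linear %s 0\n%d %d zero", imageSectors, imageLoop, imageSectors, pad)
}

// snapshotTable describes a persistent (P) snapshot with 8-sector chunks,
// using origin as the read-only device and cow as the writable COW device.
func snapshotTable(sectors int64, origin, cow string) string {
	return fmt.Sprintf("0 %d snapshot %s %s P 8", sectors, origin, cow)
}

func main() {
	// Sizes taken from the example output above: a 577536-sector image
	// and an 8388608-sector overlay.
	fmt.Println(baseTable(577536, 8388608, "/dev/loop0"))
	fmt.Println(snapshotTable(8388608, "/dev/mapper/demo-base", "/dev/loop1"))
}
```

The resulting strings could be passed to dmsetup create with --table or on stdin; /dev/mapper/demo-base is a placeholder name.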
Configuring SSH
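SSH has to be configured while the VM filesystem is being created (in the copyToOverlay method). The code is shown below; it essentially copies the public key (the file ending in .pub) to /root/.ssh/authorized_keys in the VM. If vm.Spec.SSH.Generate is true, a new key pair is generated with the ssh-keygen command: the private key is written to /var/lib/firecracker/vm/<UID>/id_<UID> and the public key to the same path with a .pub suffix. The public key is still copied to /root/.ssh/authorized_keys in the VM, and the private key is used by the SSH client to connect.
This makes it possible to log in to the VM over SSH later: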
if vm.Spec.SSH != nil {
pubKeyPath := vm.Spec.SSH.PublicKey
if vm.Spec.SSH.Generate {
// generate a key if PublicKey is empty
pubKeyPath, err = newSSHKeypair(vm)
if err != nil {
return
}
}
if len(pubKeyPath) > 0 {
fileMappings = append(fileMappings, api.FileMapping{
HostPath: pubKeyPath,
VMPath: vmAuthorizedKeys,
})
}
}
for _, mapping := range fileMappings {
vmFilePath := path.Join(mp.Path, mapping.VMPath)
if err = os.MkdirAll(path.Dir(vmFilePath), constants.DATA_DIR_PERM); err != nil {
return
}
if err = util.CopyFile(mapping.HostPath, vmFilePath); err != nil {
return
}
}
The SSH key pair is generated as follows:
// Generate a new SSH keypair for the vm
func newSSHKeypair(vm *api.VM) (string, error) {
privKeyPath := path.Join(vm.ObjectPath(), fmt.Sprintf(constants.VM_SSH_KEY_TEMPLATE, vm.GetUID()))
// TODO: In future versions, let the user specify what key algorithm to use through the API types
sshKeyAlgorithm := "ed25519"
if util.FIPSEnabled() {
// Use rsa on FIPS machines
sshKeyAlgorithm = "rsa"
}
_, err := util.ExecuteCommand("ssh-keygen", "-q", "-t", sshKeyAlgorithm, "-N", "", "-f", privKeyPath)
if err != nil {
return "", err
}
return fmt.Sprintf("%s.pub", privKeyPath), nil
}
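To make the resulting paths and arguments concrete, here is a minimal sketch that only builds the ssh-keygen argument list and the public key path, without running the command. The function keygenArgs is hypothetical, and the "id_%s" file name pattern is an assumption made for the example (the real value comes from constants.VM_SSH_KEY_TEMPLATE).

```go
package main

import (
	"fmt"
	"path"
)

// keygenArgs returns the ssh-keygen arguments for a VM and the path of the
// public key that would be copied into the VM afterwards.
// The "id_%s" naming is an assumption of this sketch.
func keygenArgs(vmDir, uid string, fips bool) ([]string, string) {
	algorithm := "ed25519"
	if fips {
		// rsa is used on FIPS machines
		algorithm = "rsa"
	}
	privKeyPath := path.Join(vmDir, fmt.Sprintf("id_%s", uid))
	args := []string{"-q", "-t", algorithm, "-N", "", "-f", privKeyPath}
	return args, privKeyPath + ".pub"
}

func main() {
	args, pub := keygenArgs("/var/lib/firecracker/vm/ddf49307b5b27c34", "ddf49307b5b27c34", false)
	fmt.Println(args)
	fmt.Println(pub)
}
```

The empty string after -N means the private key has no passphrase, which is what allows non-interactive logins later.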
Start vm
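An ignite VM is simply a VM created by the firecracker command inside a container, so creating a VM starts with starting a container. The container has its own filesystem; in the figure below, the container's filesystem is provided by the ignite image. The other filesystem is the one the VM needs: the devicemapper device created above, which firecracker later attaches as the VM's root fs.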
The VM is started with ignite vm start, which starts a VM object previously created by ignite vm create.
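The first step looks up the VM object in Storage by name and then starts it. The argument so holds the VM object and its options. After the VM has started, SSH readiness and attaching to the VM still have to be handled: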
func Start(so *StartOptions, fs *flag.FlagSet) error {
// Check if the given VM is already running
if so.vm.Running() {
return fmt.Errorf("VM %q is already running", so.vm.GetUID())
}
// The following mainly sets the names and clients of the runtime and the network plugin
// Stopped VMs don't contain the runtime and network information. Set the
// default runtime and network from the providers if empty.
if so.vm.Status.Runtime.Name == "" {
so.vm.Status.Runtime.Name = providers.RuntimeName
}
if so.vm.Status.Network.Plugin == "" {
so.vm.Status.Network.Plugin = providers.NetworkPluginName
}
// In case the runtime and network-plugin are specified explicitly at
// start, set the runtime and network-plugin on the VM. This overrides the
// global config and config on the VM object, if any.
if fs.Changed("runtime") {
so.vm.Status.Runtime.Name = providers.RuntimeName
}
if fs.Changed("network-plugin") {
so.vm.Status.Network.Plugin = providers.NetworkPluginName
}
// Set the runtime and network-plugin providers from the VM status.
if err := config.SetAndPopulateProviders(so.vm.Status.Runtime.Name, so.vm.Status.Network.Plugin); err != nil {
return err
}
// Preflight checks: mainly that required files exist, such as dependent binaries, CNI files, and /dev/kvm, /dev/net/tun and /dev/mapper/control
ignoredPreflightErrors := sets.NewString(util.ToLower(so.StartFlags.IgnoredPreflightErrors)...)
if err := checkers.StartCmdChecks(so.vm, ignoredPreflightErrors); err != nil {
return err
}
// Start the VM
if err := operations.StartVM(so.vm, so.Debug); err != nil {
return err
}
// Wait for the SSH service to become ready
// When --ssh is enabled, wait until SSH service started on port 22 at most N seconds
if ssh := so.vm.Spec.SSH; ssh != nil && ssh.Generate && len(so.vm.Status.Network.IPAddresses) > 0 {
if err := waitForSSH(so.vm, constants.SSH_DEFAULT_TIMEOUT_SECONDS, constants.IGNITE_SPAWN_TIMEOUT); err != nil {
return err
}
}
// If starting interactively, attach after starting
if so.Interactive {
return Attach(so.AttachOptions)
}
return nil
}
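StartVM is the entry point for starting a VM. vmChans.SpawnFinished reports whether the VM came up and its object was saved to Storage successfully; the wait times out after 2 minutes, and a timeout is returned as a start failure.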
func StartVM(vm *api.VM, debug bool) error {
vmChans, err := StartVMNonBlocking(vm, debug)
if err != nil {
return err
}
if err := <-vmChans.SpawnFinished; err != nil {
return err
}
return nil
}
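The blocking wrapper around a non-blocking start is a common Go pattern: a goroutine does the waiting and reports the outcome on an error channel. The sketch below illustrates the pattern only; the type vmChannels, the readiness callback and the millisecond timeout are all hypothetical stand-ins, not ignite's waitForSpawn.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// vmChannels is a stand-in for the channel struct returned by the
// non-blocking start call.
type vmChannels struct {
	SpawnFinished chan error
}

// startNonBlocking returns immediately; a goroutine polls ready() and sends
// nil on success or an error once timeoutMs has elapsed.
func startNonBlocking(ready func() bool, timeoutMs int) *vmChannels {
	chans := &vmChannels{SpawnFinished: make(chan error)}
	go func() {
		deadline := time.Now().Add(time.Duration(timeoutMs) * time.Millisecond)
		for !ready() {
			if time.Now().After(deadline) {
				chans.SpawnFinished <- errors.New("timed out waiting for spawn")
				return
			}
			time.Sleep(5 * time.Millisecond)
		}
		chans.SpawnFinished <- nil
	}()
	return chans
}

// start blocks until the goroutine reports the result.
func start(ready func() bool, timeoutMs int) error {
	return <-startNonBlocking(ready, timeoutMs).SpawnFinished
}

func main() {
	fmt.Println(start(func() bool { return true }, 100))
	fmt.Println(start(func() bool { return false }, 50))
}
```

Keeping the non-blocking variant separate lets a caller start several VMs first and collect their results afterwards.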
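Starting a VM requires some prerequisites, such as the filesystem, the network and directory mounts. The method below starts the VM; understanding how a VM is started largely explains how ignite works.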
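- First check whether the VM's container already exists and remove it if so. Note that RemoveContainer deletes the container through containerdClient; a running container cannot be deleted, in which case the call returns immediately and aborts the rest of the flow (is a kill missing?).
- Call ActivateSnapshot to set up the snapshot device that is passed to the container.
- Get the VM directory (/var/lib/firecracker/vm/<UID>) and the kernel directory (/var/lib/firecracker/kernel/<UID>), used to mount the VM's metadata.json file and the kernel's vmlinux file respectively; firecracker later uses these two files to start the VM.
- Add environment variables, the devices to mount (such as /dev/mapper/control and /dev/net/tun) and custom directories, which include the VM's filesystem. The mounts of a VM can be inspected with ctr --namespace firecracker containers info <UID>.
- Call providers.Runtime.RunContainer to start the VM's container.
- Configure the container's CNI network.
- Set the runtime field of the VM object; it is used later to determine which runtime the VM uses.
- Save the VM metadata to Storage.
- Wait for the VM to be created successfully through vmChans.SpawnFinished.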
func StartVMNonBlocking(vm *api.VM, debug bool) (*VMChannels, error) {
// Inspect the VM container and remove it if it exists
inspectResult, _ := providers.Runtime.InspectContainer(vm.PrefixedID())
RemoveVMContainer(inspectResult)
// Make sure we always initialize all channels
vmChans := &VMChannels{
SpawnFinished: make(chan error),
}
// Setup the snapshot overlay filesystem
snapshotDevPath, err := dmlegacy.ActivateSnapshot(vm)
if err != nil {
return vmChans, err
}
kernelUID, err := lookup.KernelUIDForVM(vm, providers.Client)
if err != nil {
return vmChans, err
}
// Resolve the VM and kernel directories, used to mount the VM metadata and the kernel's vmlinux file
vmDir := filepath.Join(constants.VM_DIR, vm.GetUID().String())
kernelDir := filepath.Join(constants.KERNEL_DIR, kernelUID.String())
// Verify that the image containing ignite-spawn is pulled
// TODO: Integrate automatic pulling into pkg/runtime
// Check that the sandbox image (the one containing ignite-spawn) is present, and pull it if it is not
if err := verifyPulled(vm.Spec.Sandbox.OCI); err != nil {
return vmChans, err
}
// Set up the mounts: the /var/lib/firecracker/vm/<UID>/ directory and its metadata.json, the vmlinux file from the kernel directory, and several devices under /dev
config := &runtime.ContainerConfig{
Cmd: []string{
fmt.Sprintf("--log-level=%s", logs.Logger.Level.String()),
vm.GetUID().String(),
},
Labels: map[string]string{"ignite.name": vm.GetName()},
Binds: []*runtime.Bind{
{
HostPath: vmDir,
ContainerPath: vmDir,
},
{
// Mount the metadata.json file specifically into the container, to a well-known place for ignite-spawn to access
HostPath: path.Join(vmDir, constants.METADATA),
ContainerPath: constants.IGNITE_SPAWN_VM_FILE_PATH,
},
{
// Mount the vmlinux file specifically into the container, to a well-known place for ignite-spawn to access
HostPath: path.Join(kernelDir, constants.KERNEL_FILE),
ContainerPath: constants.IGNITE_SPAWN_VMLINUX_FILE_PATH,
},
},
CapAdds: []string{
"SYS_ADMIN", // Needed to run "dmsetup remove" inside the container
"NET_ADMIN", // Needed for removing the IP from the container's interface
},
Devices: []*runtime.Bind{
runtime.BindBoth("/dev/mapper/control"), // This enables containerized Ignite to remove its own dm snapshot
runtime.BindBoth("/dev/net/tun"), // Needed for creating TAP adapters
runtime.BindBoth("/dev/kvm"), // Pass through virtualization support
runtime.BindBoth(snapshotDevPath), // The block device to boot from
},
StopTimeout: constants.STOP_TIMEOUT + constants.IGNITE_TIMEOUT,
PortBindings: vm.Spec.Network.Ports, // Add the port mappings to Docker
}
// Configure environment variables
var envVars []string
for k, v := range vm.GetObjectMeta().Annotations {
if strings.HasPrefix(k, constants.IGNITE_SANDBOX_ENV_VAR) {
k := strings.TrimPrefix(k, constants.IGNITE_SANDBOX_ENV_VAR)
envVars = append(envVars, fmt.Sprintf("%s=%s", k, v))
}
}
config.EnvVars = envVars
// Add user-defined block device volumes
for _, volume := range vm.Spec.Storage.Volumes {
if volume.BlockDevice == nil {
continue // Skip all non block device volumes for now
}
config.Devices = append(config.Devices, &runtime.Bind{
HostPath: volume.BlockDevice.Path,
ContainerPath: path.Join(constants.IGNITE_SPAWN_VOLUME_DIR, volume.Name),
})
}
// Prepare the networking for the container, for the given network plugin
if err := providers.NetworkPlugin.PrepareContainerSpec(config); err != nil {
return vmChans, err
}
// If we're not debugging, remove the container post-run
if !debug {
config.AutoRemove = true
}
// Run the VM container in Docker
containerID, err := providers.Runtime.RunContainer(vm.Spec.Sandbox.OCI, config, vm.PrefixedID(), vm.GetUID().String())
if err != nil {
return vmChans, fmt.Errorf("failed to start container for VM %q: %v", vm.GetUID(), err)
}
// Configure the CNI network
result, err := providers.NetworkPlugin.SetupContainerNetwork(containerID, vm.Spec.Network.Ports...)
if err != nil {
return vmChans, err
}
if !logs.Quiet {
log.Infof("Networking is handled by %q", providers.NetworkPlugin.Name())
log.Infof("Started Firecracker VM %q in a container with ID %q", vm.GetUID(), containerID)
}
// Set the container ID for the VM
vm.Status.Runtime.ID = containerID
vm.Status.Runtime.Name = providers.RuntimeName
// Append non-loopback runtime IP addresses of the VM to its state
for _, addr := range result.Addresses {
if !addr.IP.IsLoopback() {
vm.Status.Network.IPAddresses = append(vm.Status.Network.IPAddresses, addr.IP)
}
}
vm.Status.Network.Plugin = providers.NetworkPluginName
// write the API object in a non-running state before we wait for spawn's network logic and firecracker
if err := providers.Client.VMs().Set(vm); err != nil {
return vmChans, err
}
// TODO: This is temporary until we have proper communication to the container
// It's best to perform any imperative changes to the VM object pointer before this go-routine starts
go waitForSpawn(vm, vmChans)
return vmChans, nil
}
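One detail of the function above that is easy to miss is how annotations become environment variables of the container. The following sketch reproduces that filtering in isolation. The prefix value "example.test/sandbox-env-" is purely hypothetical (the real one is constants.IGNITE_SANDBOX_ENV_VAR), and the result is sorted only to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sandboxEnv turns every annotation whose key starts with prefix into a
// KEY=VALUE entry, with the prefix stripped from the key.
func sandboxEnv(annotations map[string]string, prefix string) []string {
	var envVars []string
	for k, v := range annotations {
		if strings.HasPrefix(k, prefix) {
			name := strings.TrimPrefix(k, prefix)
			envVars = append(envVars, fmt.Sprintf("%s=%s", name, v))
		}
	}
	// Map iteration order is random in Go, so sort for stable output.
	sort.Strings(envVars)
	return envVars
}

func main() {
	annotations := map[string]string{
		"example.test/sandbox-env-HTTP_PROXY": "http://10.0.0.1:3128",
		"example.test/sandbox-env-DEBUG":      "1",
		"unrelated/annotation":                "ignored",
	}
	fmt.Println(sandboxEnv(annotations, "example.test/sandbox-env-"))
}
```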
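The RunContainer method below involves a snapshotService. This snapshot is not the same thing as the devicemapper snapshot: the snapshotService here unpacks and mounts the container image, providing the filesystem the container needs to start.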
Below is containerd's root directory. containerd itself is plugin-based, and each subdirectory here is created by a different plugin. The supported plugins can be listed with ctr plugin list:
$ cd /var/lib/containerd/
$ ll
drwxr-xr-x. 4 root root 33 Mar 15 14:14 io.containerd.content.v1.content
drwx--x--x. 2 root root 21 Mar 15 11:05 io.containerd.metadata.v1.bolt
drwx--x--x. 2 root root 6 Mar 15 11:05 io.containerd.runtime.v1.linux
drwx--x--x. 4 root root 37 Jul 14 10:53 io.containerd.runtime.v2.task
drwx------. 3 root root 23 Mar 15 11:05 io.containerd.snapshotter.v1.native
drwx------. 3 root root 42 Jul 14 10:52 io.containerd.snapshotter.v1.overlayfs
- io.containerd.content.v1.content: stores OCI images; see the oci image spec for details.
- io.containerd.metadata.v1.bolt: stores the metadata of the images, containers and snapshots managed by containerd; see the source code for what is stored.
- io.containerd.snapshotter.v1.<type>: snapshotter directories; see the Snapshotters documentation.
  - io.containerd.snapshotter.v1.btrfs: directory for container snapshots created on the btrfs filesystem
  - io.containerd.snapshotter.v1.overlayfs: the default snapshotter, which creates snapshots with overlayfs.
That covers containerd's root directory. Image content is stored in io.containerd.content.v1.content, then unpacked by the snapshotter and mounted from io.containerd.snapshotter.v1.overlayfs (overlayfs is used here) for the container to use.
Below is an ignite VM started on this machine; the snapshot provides the lowerdir, upperdir and workdir that overlayfs needs.
$ mount|grep ignit
overlay on /run/containerd/io.containerd.runtime.v2.task/firecracker/ignite-272a0eab-75be-4131-a022-1fde8012f9f6/rootfs type overlay (rw,relatime,seclabel,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/10/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/17/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/17/work)
The containerd snapshots seen here are of two kinds, Active and Committed, which correspond to the container layer of a running container (upperdir, workdir) and the image layers (lowerdir). Changes made to an Active snapshot are not saved; to keep them, convert the snapshot to the Committed state with snapshot commit. The current snapshots can be listed with ctr --namespace firecracker snapshot ls. Snapshots are hierarchical, and ctr --namespace firecracker snapshot tree shows the hierarchy.
$ ctr --namespace firecracker snapshot ls
KEY PARENT KIND
ignite-272a0eab-75be-4131-a022-1fde8012f9f6 sha256:6d1a1092846de7c30d76df9c7aa787b50ad4dee32d32daebe0c7a87ffede14b9 Active
sha256:0c949a3342f6400d49b4d378bf7b20b768cf09bef107cc6c5d58a1f3e50e06f3 sha256:38facc6304c0b9270805fab2c549a3fef82dce370cab3f24d922e0a3b46c2541 Committed
sha256:38facc6304c0b9270805fab2c549a3fef82dce370cab3f24d922e0a3b46c2541 Committed
sha256:6d1a1092846de7c30d76df9c7aa787b50ad4dee32d32daebe0c7a87ffede14b9 Committed
sha256:9f54eef412758095c8079ac465d494a2872e02e90bf1fb5f12a1641c0d1bb78b Committed
sha256:ac9030d17ea3c723f7ff631b7e9c16f0d914ecf43f37b3e0f7cb5cae8012b39d sha256:f0e76d36d3129de5a1ddb77efc4963b2dfec81f9c5ca21e117198a3c2ae9f397 Committed
sha256:bc98849e95ef9484381c1a36ce97339d7cd8675f23a37766ed47b7fcc947bb91 sha256:9f54eef412758095c8079ac465d494a2872e02e90bf1fb5f12a1641c0d1bb78b Committed
sha256:f0e76d36d3129de5a1ddb77efc4963b2dfec81f9c5ca21e117198a3c2ae9f397 sha256:bc98849e95ef9484381c1a36ce97339d7cd8675f23a37766ed47b7fcc947bb91 Committed
sha256:f9e99b137a1976a6aaa287cb3cddea2f6e6545707ad1302c454fd4d06ffbb2ab sha256:ac9030d17ea3c723f7ff631b7e9c16f0d914ecf43f37b3e0f7cb5cae8012b39d Committed
The following example shows how snapshots work.
# Create a containerd namespace named test
$ ctr ns create test
# Prepare the mount point
$ mkdir /var/lib/containerd/custom_dir
# First commit (root)
$ ctr -n test snapshot prepare activeLayer0 # prepare creates a layer in the working (active) state
# Generate and run the mount command for the snapshot filesystem (overlayfs snapshotter)
$ ctr -n test snapshot mount /var/lib/containerd/custom_dir activeLayer0 | xargs sudo
$ echo "1" > /var/lib/containerd/custom_dir/add01 # add a file as the first change
$ umount /var/lib/containerd/custom_dir # umount
$ ctr -n test snapshot commit commit_add01 activeLayer0 # commit: changes the snapshot state and saves the layer
The mount command generated by snapshot mount above is: "mount -t bind /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21866/fs /var/lib/containerd/custom_dir -o rw,rbind"
Listing the snapshots now shows that a snapshot in the Committed state has been created:
$ ctr -n test snapshot ls
KEY PARENT KIND
commit_add01 Committed
The io.containerd.snapshotter.v1.overlayfs/snapshots directory now contains a new folder, 21866, which is the filesystem produced by the commit:
$ ll io.containerd.snapshotter.v1.overlayfs/snapshots/21841/fs/
-rw-r----- 1 root root 2 Jul 26 01:33 add01
Next, commit another change. First create an active snapshot whose parent is commit_add01:
# Second commit, with the first layer as parent
$ ctr -n test snapshot prepare activeLayer0 commit_add01
Listing the snapshots shows that the parent of the active snapshot is the snapshot committed above:
$ ctr -n test snapshot ls
KEY PARENT KIND
activeLayer0 commit_add01 Active
commit_add01 Committed
Commit the second change. In this step ctr generates the mount command "mount -t overlay overlay /var/lib/containerd/custom_dir -o index=off,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21867/work,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21867/fs,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21866/fs". This is exactly the overlay filesystem a container runs on, and lowerdir is the filesystem produced by commit_add01.
$ ctr -n test snapshot mount /var/lib/containerd/custom_dir activeLayer0 | xargs sudo
$ echo "2" > /var/lib/containerd/custom_dir/add02
$ umount /var/lib/containerd/custom_dir
$ ctr -n test snapshot commit commit_add02 activeLayer0
Listing the snapshots shows a new snapshot, commit_add02, whose parent is commit_add01; in other words, the commit produced a child snapshot.
$ ctr -n test snapshot ls
KEY PARENT KIND
commit_add01 Committed
commit_add02 commit_add01 Committed
If an overlay is now created with the new snapshot commit_add02 as parent, will it also include the changes from commit_add01?
$ ctr -n test snapshot prepare activeLayer0 commit_add02
$ ctr -n test snapshot mount /var/lib/containerd/custom_dir activeLayer0
Below is the mount command generated by snapshot mount; lowerdir contains the filesystems of both commit_add01 and commit_add02.
$ mount -t overlay overlay /var/lib/containerd/custom_dir -o index=off,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21869/work,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21869/fs,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21867/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/21866/fs
Clean up the environment. Remove child snapshots first, otherwise the error "cannot remove snapshot with child: failed precondition" is returned.
$ ctr -n test snapshot rm commit_add02
$ ctr -n test snapshot rm commit_add01
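In addition, ctr snapshot view creates a read-only filesystem; writing to the mounted directory then fails with "Read-only file system".
In summary: first create a snapshot with prepare (active, writable) or view (read-only), then mount it with the mount command; after modifying an active snapshot, the commit command persists the changes. See Snapshots for more.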
containerd has two concepts: container and task. A container can be seen as the environment prepared for running, such as cgroups and mounted volumes, while a task is the process running inside it. As shown below, listing containers shows the image and runtime in use, while listing tasks shows the process and its status:
$ ctr -n firecracker container ls
CONTAINER IMAGE RUNTIME
ignite-4a64e75d-c7fb-43ba-aaed-6e7923374ba5 docker.io/weaveworks/ignite:v0.10.0 io.containerd.runc.v2
$ ctr -n firecracker task ls
TASK PID STATUS
ignite-4a64e75d-c7fb-43ba-aaed-6e7923374ba5 10332 RUNNING
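With this background, the RunContainer flow is easy to follow:
- First remove the container if it exists and is not running.
- Write the contents of the host's /etc/resolv.conf to the runtime.containerd.resolv.conf file in the VM directory and add it to the mount configuration; it is later mounted as the container's /etc/resolv.conf.
- Build the OCI spec options needed to create the container; this adds the configured environment variables, the hostname, the bind mounts and the devices under /dev.
- Create a containerd snapshot service, which provides the container's rootfs.
- Build the container creation options, using the OCI spec options and the rootfs from above.
- Create the containerd container and task, then start the task.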
Everything below is the standard flow for starting a containerd container; usage examples can also be found in the _test.go files of the containerd source:
func (cc *ctdClient) RunContainer(image meta.OCIImageRef, config *runtime.ContainerConfig, name, id string) (s string, err error) {
img, err := cc.client.GetImage(cc.ctx, image.Normalized())
if err != nil {
return
}
// Remove the container if it exists
if err = cc.RemoveContainer(name); err != nil {
return
}
// Load the default snapshotter
snapshotter := cc.client.SnapshotService(containerd.DefaultSnapshotter)
// Add the /etc/resolv.conf mount, this isn't done automatically by containerd
// Ensure a resolv.conf exists in the vmDir. Calculate path using the vm id
resolvConfPath := filepath.Join(constants.VM_DIR, id, resolvConfName)
// Read the host's /etc/resolv.conf and write it to runtime.containerd.resolv.conf in the VM directory
err = resolvconf.EnsureResolvConf(resolvConfPath, constants.DATA_DIR_FILE_PERM)
if err != nil {
return
}
config.Binds = append(
config.Binds,
&runtime.Bind{
HostPath: resolvConfPath,
ContainerPath: "/etc/resolv.conf", // mount runtime.containerd.resolv.conf into the container
},
)
// Add the stop timeout as a label, as containerd doesn't natively support it
config.Labels[stopTimeoutLabel] = strconv.FormatUint(uint64(config.StopTimeout), 10)
// Build the OCI specification
opts := []oci.SpecOpts{
oci.WithDefaultSpec(),
oci.WithDefaultUnixDevices,
oci.WithTTY,
oci.WithImageConfigArgs(img, config.Cmd),
oci.WithEnv(config.EnvVars),
withAddedCaps(config.CapAdds),
withHostname(config.Hostname),
withMounts(config.Binds), // bind mounts
withDevices(config.Devices), // devices
}
// Known limitations, containerd doesn't support the following config fields:
// - StopTimeout
// - AutoRemove
// - NetworkMode (only CNI supported)
// - PortBindings
snapshotOpt := containerd.WithSnapshot(name)
if _, err = snapshotter.Stat(cc.ctx, name); errdefs.IsNotFound(err) {
// Even if "read only" is set, we don't use a KindView snapshot here (#1495).
// We pass the writable snapshot to the OCI runtime, and the runtime remounts
// it as read-only after creating some mount points on-demand.
snapshotOpt = containerd.WithNewSnapshot(name, img)
} else if err != nil {
return
}
cOpts := []containerd.NewContainerOpts{
containerd.WithImage(img),
snapshotOpt,
//containerd.WithImageStopSignal(img, "SIGTERM"),
containerd.WithNewSpec(opts...),
containerd.WithContainerLabels(config.Labels),
}
cont, err := cc.client.NewContainer(cc.ctx, name, cOpts...)
if err != nil {
return
}
// This is a dummy PTY to silence output
// when starting without attach breaking
con, _, err := console.NewPty()
if err != nil {
return
}
defer util.DeferErr(&err, con.Close)
// We need a temporary dummy stdin reader that
// actually works, can't use nullReader here
dummyReader, _, err := os.Pipe()
if err != nil {
return
}
defer util.DeferErr(&err, dummyReader.Close)
// Spawn the Creator with the dummy streams
ioCreator := cio.NewCreator(cio.WithTerminal, cio.WithStreams(dummyReader, con, con))
task, err := cont.NewTask(cc.ctx, ioCreator)
if err != nil {
return
}
if err = task.Start(cc.ctx); err != nil {
return
}
// TODO: Save task.Pid() somewhere for attaching?
s = task.ID()
return
}
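At this point the container has been started. Once it is running, the ignite-spawn command invokes firecracker to start the VM. As the Dockerfile shows, the container's entrypoint is: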
ENTRYPOINT ["/usr/local/bin/ignite-spawn"]
firecracker start vm
Parsing the configuration
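The first step converts the mounted IGNITE_SPAWN_VM_FILE_PATH into a VM object. This file is the /var/lib/firecracker/vm/<UID>/metadata.json mounted when the VM was started, and it holds the specification for creating the VM, such as CPU, memory, disk and network. Note that the VM is started from inside the container.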
func decodeVM(vmID string) (*api.VM, error) {
filePath := constants.IGNITE_SPAWN_VM_FILE_PATH
obj, err := scheme.Serializer.DecodeFile(filePath, true)
if err != nil {
return nil, err
}
vm, ok := obj.(*api.VM)
if !ok {
return nil, fmt.Errorf("object couldn't be converted to VM")
}
// Explicitly set the GVK on this object
vm.SetGroupVersionKind(api.SchemeGroupVersion.WithKind(api.KindVM.Title()))
return vm, nil
}
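The decode-then-type-check step can be illustrated with the standard library alone. This is a sketch under loud assumptions: the struct vmMeta and its three JSON fields are a made-up minimal subset chosen for the example, and encoding/json stands in for ignite's scheme.Serializer, which also handles versioning and defaulting.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// vmMeta is a hypothetical, heavily reduced view of a metadata.json file.
type vmMeta struct {
	Kind     string `json:"kind"`
	Metadata struct {
		Name string `json:"name"`
		UID  string `json:"uid"`
	} `json:"metadata"`
}

// decodeVM parses the raw file content and rejects objects of another kind,
// similar in spirit to the type assertion in the function above.
func decodeVM(data []byte) (*vmMeta, error) {
	var vm vmMeta
	if err := json.Unmarshal(data, &vm); err != nil {
		return nil, err
	}
	if vm.Kind != "VM" {
		return nil, fmt.Errorf("object of kind %q couldn't be converted to VM", vm.Kind)
	}
	return &vm, nil
}

func main() {
	raw := []byte(`{"kind":"VM","metadata":{"name":"demo","uid":"ddf49307b5b27c34"}}`)
	vm, err := decodeVM(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(vm.Metadata.Name, vm.Metadata.UID)
}
```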
Starting the vm
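Starting the VM takes the following three steps:
- Configure the container network: mainly check that the interface addresses are in place and create interfaces for the VM
- Configure DHCP
- Start the VM: firecracker starts the VM in this step, using the interfaces prepared in the first step, the devicemapper device from the host, and so on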
func StartVM(vm *api.VM) (err error) {
// Setup networking inside of the container, return the available interfaces
fcIfaces, dhcpIfaces, err := container.SetupContainerNetworking(vm) // configure the container network
if err != nil {
return fmt.Errorf("network setup failed: %v", err)
}
// Serve DHCP requests for those interfaces
// This function returns the available IP addresses that are being
// served over DHCP now
if err = container.StartDHCPServers(vm, dhcpIfaces); err != nil { // configure DHCP
return
}
// Serve metrics over an unix socket in the VM's own directory
metricsSocket := path.Join(vm.ObjectPath(), constants.PROMETHEUS_SOCKET)
serveMetrics(metricsSocket)
// Patches the VM object to set state to stopped, and clear IP addresses
defer util.DeferErr(&err, func() error { return patchStopped(vm) })
// Remove the snapshot overlay post-run, which also removes the detached backing loop devices
defer util.DeferErr(&err, func() error { return dmlegacy.DeactivateSnapshot(vm) })
// Remove the Prometheus socket post-run
defer util.DeferErr(&err, func() error { return os.Remove(metricsSocket) })
// Execute Firecracker
if err = container.ExecuteFirecracker(vm, fcIfaces); err != nil { // start the VM
return fmt.Errorf("runtime error for VM %q: %v", vm.GetUID(), err)
}
return
}
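The three deferred cleanups above run in reverse order of registration once firecracker exits, and each one can still influence the returned error. The sketch below demonstrates that mechanism with a hypothetical deferErr helper; it keeps only the first error, which may differ from what ignite's util.DeferErr actually does with multiple errors.

```go
package main

import (
	"errors"
	"fmt"
)

// deferErr runs f and stores its error in *err unless an earlier error is
// already recorded. It is a simplified stand-in for a DeferErr-style helper.
func deferErr(err *error, f func() error) {
	if e := f(); e != nil && *err == nil {
		*err = e
	}
}

// run registers three cleanups and records the order in which things happen.
func run(log *[]string, failCleanup bool) (err error) {
	defer deferErr(&err, func() error {
		*log = append(*log, "patch stopped")
		return nil
	})
	defer deferErr(&err, func() error {
		*log = append(*log, "deactivate snapshot")
		if failCleanup {
			return errors.New("deactivate failed")
		}
		return nil
	})
	defer deferErr(&err, func() error {
		*log = append(*log, "remove metrics socket")
		return nil
	})
	*log = append(*log, "run firecracker")
	return nil
}

func main() {
	var log []string
	err := run(&log, true)
	fmt.Println(log)
	fmt.Println(err)
}
```

The named return value is what makes this work: the deferred calls receive a pointer to it and can set it after the function body has already executed its return statement.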
Configuring the container network
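firecracker creates the VM inside the container, so the network environment for the VM has to be prepared inside the container. Extra interfaces can be added to the VM through the VM object's annotation ignite.weave.works/interface.
Note: ignite supports two network modes, MODE_DHCP and MODE_TC; MODE_DHCP is the one currently used.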
func SetupContainerNetworking(vm *api.VM) (firecracker.NetworkInterfaces, []DHCPInterface, error) {
var dhcpIntfs []DHCPInterface
var fcIntfs firecracker.NetworkInterfaces
// Extra interfaces can be added through the ignite.weave.works/interface annotation in the VM's metadata.json
vmIntfs := parseExtraIntfs(vm) // vmIntfs: map[<interface_name>]<interface_mode>
// If eth0 is not listed, add it and set it to DHCP mode
if _, ok := vmIntfs[mainInterface]; !ok {
vmIntfs[mainInterface] = MODE_DHCP
}
interval := 1 * time.Second
// Wait until the interfaces are ready; the condition returns true once they are
err := wait.PollImmediate(interval, constants.IGNITE_SPAWN_TIMEOUT, func() (bool, error) {
// Check that the interfaces exist and are configured correctly
retry, err := collectInterfaces(vmIntfs)
if err == nil {
// We're done here
return true, nil
}
if retry {
// We got an error, but let's ignore it and try again
log.Warnf("Got an error while trying to set up networking, but retrying: %v", err)
return false, nil
}
// The error was fatal, return it
return false, err
})
if err != nil {
return nil, nil, err
}
// Prepare the interfaces and the rest of the network environment for the VM
if err := networkSetup(&fcIntfs, &dhcpIntfs, vmIntfs); err != nil {
return nil, nil, err
}
return fcIntfs, dhcpIntfs, nil
}
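The polling callback above translates a (retry, err) pair into the (done, err) convention of the polling helper: no error means done, a retryable error means "not yet", and any other error aborts the wait. Below is a self-contained sketch of that translation with a hand-written polling loop. The names poll, waitForInterfaces and errNotReady are hypothetical, and the loop is a simplification rather than the wait.PollImmediate implementation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errNotReady = errors.New("interface not ready")

// poll calls cond immediately and then every interval until cond reports
// done, returns an error, or the timeout expires.
func poll(interval, timeout time.Duration, cond func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := cond()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for the condition")
		}
		time.Sleep(interval)
	}
}

// waitForInterfaces adapts a check that returns (retry, err).
func waitForInterfaces(check func() (retry bool, err error), intervalMs, timeoutMs int) error {
	cond := func() (bool, error) {
		retry, err := check()
		switch {
		case err == nil:
			return true, nil // done
		case retry:
			return false, nil // ignore the error and try again
		default:
			return false, err // fatal error, stop polling
		}
	}
	return poll(time.Duration(intervalMs)*time.Millisecond, time.Duration(timeoutMs)*time.Millisecond, cond)
}

func main() {
	attempts := 0
	err := waitForInterfaces(func() (bool, error) {
		attempts++
		if attempts < 3 {
			return true, errNotReady
		}
		return false, nil
	}, 10, 1000)
	fmt.Println(attempts, err)
}
```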
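The collectInterfaces method checks that the interfaces exist and are configured correctly. Its flow is:
- Get all current interfaces
- Verify that the expected interfaces in vmIntfs exist and that each DHCP-mode interface has an IP address configured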
func collectInterfaces(vmIntfs map[string]string) (bool, error) {
// Get all interfaces
allIntfs, err := net.Interfaces()
if err != nil || allIntfs == nil || len(allIntfs) == 0 {
return false, fmt.Errorf("cannot get local network interfaces: %v", err)
}
// create a map of candidate interfaces
foundIntfs := make(map[string]net.Interface)
for _, intf := range allIntfs {
if _, ok := ignoreInterfaces[intf.Name]; ok {
continue
}
foundIntfs[intf.Name] = intf
// If the interface is explicitly defined, no changes are needed
if _, ok := vmIntfs[intf.Name]; ok { // the interface is already defined, so no mode needs to be assigned
continue
}
// default fallback behaviour to always consider intfs with an address
addrs, _ := intf.Addrs()
if len(addrs) > 0 {
vmIntfs[intf.Name] = MODE_DHCP
}
}
// Verify that the expected interfaces have been created
for intfName, mode := range vmIntfs {
if _, ok := foundIntfs[intfName]; !ok {
return true, fmt.Errorf("interface %q (mode %q) is still not found", intfName, mode)
}
// for DHCP interface, we need to make sure IP and route exist
if mode == MODE_DHCP {
intf := foundIntfs[intfName]
_, _, _, noIPs, err := getAddress(&intf) // returns the interface's IP/mask, gateway and link; used here to check whether the interface has an IP
if err != nil {
return true, err
}
if noIPs {
return true, fmt.Errorf("IP is still not found on %q", intfName)
}
}
}
return false, nil
}
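After the vmIntfs interfaces have been validated, the network can be configured for the VM. The main flow is:
- Iterate over all expected interfaces in the container, take the first address of each interface, remove that address from the interface, and return it. It later becomes the address of the firecracker VM, so the VM effectively borrows the container's address.
- For each expected interface, create a TAP interface and a bridge interface, then attach both the expected interface and the TAP interface to the bridge. The TAP interface is later given to the VM created by firecracker, which is configured with the address returned in the first step.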
func networkSetup(fcIntfs *firecracker.NetworkInterfaces, dhcpIntfs *[]DHCPInterface, vmIntfs map[string]string) error {
// The order in which interfaces are plugged in is intentionally deterministic
// All interfaces are sorted alphabetically and 'eth0' is always first
var keys []string
for k := range vmIntfs {
keys = append(keys, k)
}
sort.Strings(keys)
sort.Slice(keys, func(i, j int) bool {
return keys[i] == mainInterface
})
for _, intfName := range keys {
intf, err := net.InterfaceByName(intfName) // look up the container interface by name
if err != nil {
return fmt.Errorf("cannot find interface %q: %s", intfName, err)
}
switch vmIntfs[intfName] {
case MODE_DHCP:
ipNet, gw, err := takeAddress(intf) // take the first address of the container interface, remove it from the interface and return it; it becomes the VM's interface address
if err != nil {
return fmt.Errorf("error parsing interface %q: %s", intfName, err)
}
dhcpIface, err := bridge(intf) // create the TAP and bridge interfaces and bridge them; returns dhcpIface, the interface handed to the VM
if err != nil {
return fmt.Errorf("bridging interface %q failed: %v", intfName, err)
}
dhcpIface.VMIPNet = ipNet
dhcpIface.GatewayIP = gw
*dhcpIntfs = append(*dhcpIntfs, *dhcpIface) // add the DHCP interface
*fcIntfs = append(*fcIntfs, firecracker.NetworkInterface{
StaticConfiguration: &firecracker.StaticNetworkConfiguration{
MacAddress: dhcpIface.MACFilter,
HostDevName: dhcpIface.VMTAP,
},
})
case MODE_TC:
tcInterface, err := addTcRedirect(intf)
if err != nil {
log.Errorf("Failed to setup tc redirect %v", err)
continue
}
*fcIntfs = append(*fcIntfs, *tcInterface)
}
}
return nil
}
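The comment at the top of the function states the intended ordering: interfaces sorted alphabetically, with eth0 always first. One way to express that ordering with a single comparison function is sketched below. The function orderInterfaces is hypothetical and is offered as an illustration of the intent, not as a replacement for the code above.

```go
package main

import (
	"fmt"
	"sort"
)

// orderInterfaces returns a sorted copy of names in which primary comes
// first and all other names follow in alphabetical order.
func orderInterfaces(names []string, primary string) []string {
	ordered := append([]string(nil), names...)
	sort.Slice(ordered, func(i, j int) bool {
		if ordered[i] == primary {
			// primary sorts before everything except itself
			return ordered[j] != primary
		}
		if ordered[j] == primary {
			return false
		}
		return ordered[i] < ordered[j]
	})
	return ordered
}

func main() {
	fmt.Println(orderInterfaces([]string{"eth2", "eth0", "eth1"}, "eth0"))
}
```

A deterministic order matters here because the position of an interface in the list decides which device the guest sees it as.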
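Below are the container interfaces created when the bridge CNI is used. The IP on eth0 has been removed (by takeAddress); vm_eth0 and br_eth0 are the TAP interface and the bridge interface created for eth0: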
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
3: eth0@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br_eth0 state UP group default
link/ether 8e:c9:3a:f0:50:67 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::8cc9:3aff:fef0:5067/64 scope link
valid_lft forever preferred_lft forever
4: vm_eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master br_eth0 state UP group default qlen 1000
link/ether d6:20:6b:d4:3e:2a brd ff:ff:ff:ff:ff:ff
inet6 fe80::d420:6bff:fed4:3e2a/64 scope link
valid_lft forever preferred_lft forever
5: br_eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 92:79:7d:39:28:b6 brd ff:ff:ff:ff:ff:ff
inet6 fe80::9079:7dff:fe39:28b6/64 scope link
valid_lft forever preferred_lft forever
$ ip link show master br_eth0
3: eth0@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br_eth0 state UP mode DEFAULT group default
link/ether 8e:c9:3a:f0:50:67 brd ff:ff:ff:ff:ff:ff link-netnsid 0
4: vm_eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master br_eth0 state UP mode DEFAULT group default qlen 1000
link/ether d6:20:6b:d4:3e:2a brd ff:ff:ff:ff:ff:ff
The VM's interfaces and routes are shown below. Its eth0 is backed by the vm_eth0 tap interface in the container:
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether 1e:7d:6c:90:99:18 brd ff:ff:ff:ff:ff:ff
3: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether e6:18:3d:88:a9:ff brd ff:ff:ff:ff:ff:ff
4: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 76:6f:77:5d:6d:1c brd ff:ff:ff:ff:ff:ff
inet 10.61.0.41/16 brd 10.61.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::746f:77ff:fe5d:6d1c/64 scope link
valid_lft forever preferred_lft forever
$ ip route
default via 10.61.0.1 dev eth0
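Put together, the container's eth0 and the tap interface vm_eth0 are both attached to the bridge br_eth0, and the VM's eth0 is the guest side of vm_eth0.
Configuring DHCP
This step starts a DHCP server on each bridge interface: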
// StartDHCPServers starts multiple DHCP servers for the VM, one per interface
// It returns the IP addresses that the API object may post in .status, and a potential error
func StartDHCPServers(vm *api.VM, dhcpIfaces []DHCPInterface) error {
// Fetch the DNS servers given to the container
clientConfig, err := dns.ClientConfigFromFile("/etc/resolv.conf")
if err != nil {
return fmt.Errorf("failed to get DNS configuration: %v", err)
}
for i := range dhcpIfaces {
dhcpIface := &dhcpIfaces[i]
// Set the VM hostname to the VM ID
dhcpIface.Hostname = vm.GetUID().String()
// Add the DNS servers from the container
dhcpIface.SetDNSServers(clientConfig.Servers)
go func() {
log.Infof("Starting DHCP server for interface %q (%s)\n", dhcpIface.Bridge, dhcpIface.VMIPNet.IP)
if err := dhcpIface.StartBlockingServer(); err != nil {
log.Errorf("%q DHCP server error: %v\n", dhcpIface.Bridge, err)
}
}()
}
return nil
}
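To make the role of VMIPNet and GatewayIP more concrete, here is a minimal sketch of the parameters such a DHCP server would hand to the VM. The lease type and buildLease function are hypothetical and are not ignite's DHCP implementation. The address and gateway are taken from the ip output above, and the DNS server is an example value.

```go
package main

import (
	"fmt"
	"net"
)

// lease is a hypothetical summary of what the DHCP server offers to the VM.
type lease struct {
	IP      string
	Netmask string
	Gateway string
	DNS     []string
}

// buildLease derives the lease parameters from the address that was taken
// from the container interface (in CIDR form) and the gateway.
func buildLease(cidr, gateway string, dns []string) (lease, error) {
	ip, ipNet, err := net.ParseCIDR(cidr)
	if err != nil {
		return lease{}, fmt.Errorf("invalid VM address %q: %w", cidr, err)
	}
	return lease{
		IP:      ip.String(),
		Netmask: net.IP(ipNet.Mask).String(), // dotted form of the /16 prefix
		Gateway: gateway,
		DNS:     dns,
	}, nil
}

func main() {
	l, err := buildLease("10.61.0.41/16", "10.61.0.1", []string{"1.1.1.1"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", l)
}
```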
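Starting the VM
Starting a VM with Firecracker involves the following basic steps:
- Read the CPU and memory settings from the VM metadata
- Configure the devicemapper device, the network interfaces, and the mounted volumes
- Initialize and start a Firecracker VM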
func ExecuteFirecracker(vm *api.VM, fcIfaces firecracker.NetworkInterfaces) (err error) {
drivePath := vm.SnapshotDev() // the VM's devicemapper device; the host's /dev is mounted into the container, so it can be used directly
vCPUCount := int64(vm.Spec.CPUs) // CPU and memory resources
memSizeMib := int64(vm.Spec.Memory.MBytes())
cmdLine := vm.Spec.Kernel.CmdLine
if len(cmdLine) == 0 {
// if for some reason cmdline would be unpopulated, set it to the default
cmdLine = constants.VM_DEFAULT_KERNEL_ARGS
}
// Convert the logrus error level to a Firecracker compatible error level.
// Firecracker accepts "Error", "Warning", "Info", and "Debug", case-sensitive.
fcLogLevel := "Debug"
switch logs.Logger.Level {
case log.InfoLevel:
fcLogLevel = "Info"
case log.WarnLevel:
fcLogLevel = "Warning"
case log.ErrorLevel, log.FatalLevel, log.PanicLevel:
fcLogLevel = "Error"
}
firecrackerSocketPath := path.Join(vm.ObjectPath(), constants.FIRECRACKER_API_SOCKET)
logSocketPath := path.Join(vm.ObjectPath(), constants.LOG_FIFO)
metricsSocketPath := path.Join(vm.ObjectPath(), constants.METRICS_FIFO)
cfg := firecracker.Config{
SocketPath: firecrackerSocketPath,
KernelImagePath: constants.IGNITE_SPAWN_VMLINUX_FILE_PATH, // path of the vmlinux file mounted into the container
KernelArgs: cmdLine,
Drives: []models.Drive{{
DriveID: firecracker.String("1"),
IsReadOnly: firecracker.Bool(false),
IsRootDevice: firecracker.Bool(true),
PathOnHost: &drivePath, // the devicemapper device
}},
NetworkInterfaces: fcIfaces, // the VM's network interfaces
MachineCfg: models.MachineConfiguration{
VcpuCount: &vCPUCount,
MemSizeMib: &memSizeMib,
HtEnabled: firecracker.Bool(true),
},
//JailerCfg: firecracker.JailerConfig{
// GID: firecracker.Int(0),
// UID: firecracker.Int(0),
// ID: vm.ID,
// NumaNode: firecracker.Int(0),
// ExecFile: "firecracker",
//},
LogLevel: fcLogLevel,
// TODO: We could use /dev/null, but firecracker-go-sdk issues Mkfifo which collides with the existing device
LogFifo: logSocketPath,
MetricsFifo: metricsSocketPath,
}
// Add the volumes to the VM
for i, volume := range vm.Spec.Storage.Volumes {
volumePath := path.Join(constants.IGNITE_SPAWN_VOLUME_DIR, volume.Name)
if !util.FileExists(volumePath) {
log.Warnf("Skipping nonexistent volume: %q", volume.Name)
continue // Skip all nonexistent volumes
}
cfg.Drives = append(cfg.Drives, models.Drive{
DriveID: firecracker.String(strconv.Itoa(i + 2)),
IsReadOnly: firecracker.Bool(false), // TODO: Support read-only volumes
IsRootDevice: firecracker.Bool(false),
PathOnHost: &volumePath, // the volume path; volumes are passed from host to container to VM
})
}
// Remove these FIFOs for now
defer os.Remove(logSocketPath)
defer os.Remove(metricsSocketPath)
ctx, vmmCancel := context.WithCancel(context.Background())
defer vmmCancel()
cmd := firecracker.VMCommandBuilder{}.
WithBin("firecracker").
WithSocketPath(firecrackerSocketPath).
WithStdin(os.Stdin).
WithStdout(os.Stdout).
WithStderr(os.Stderr).
Build(ctx)
m, err := firecracker.NewMachine(ctx, cfg, firecracker.WithProcessRunner(cmd))
if err != nil {
return fmt.Errorf("failed to create machine: %s", err)
}
//defer os.Remove(cfg.SocketPath)
//if opts.validMetadata != nil {
// m.EnableMetadata(opts.validMetadata)
//}
if err = m.Start(ctx); err != nil { // start the VM
return fmt.Errorf("failed to start machine: %v", err)
}
defer util.DeferErr(&err, m.StopVMM)
installSignalHandlers(ctx, m)
// wait for the VMM to exit
if err = m.Wait(ctx); err != nil {
return fmt.Errorf("wait returned an error %s", err)
}
return
}
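One detail of the volume loop above is how drive IDs are assigned: the root device is "1", and each volume gets its slice index plus 2, so a skipped volume leaves a gap in the numbering. The following minimal sketch reproduces only that numbering. driveIDs and the exists callback are hypothetical stand-ins for the loop and util.FileExists.

```go
package main

import (
	"fmt"
	"strconv"
)

// driveIDs mirrors the numbering used in ExecuteFirecracker: the root drive
// is "1" and volume i (zero-based) becomes drive i+2. Volumes for which
// exists returns false are skipped, which leaves a gap in the IDs.
func driveIDs(volumes []string, exists func(name string) bool) []string {
	ids := []string{"1"} // root device
	for i, name := range volumes {
		if !exists(name) {
			continue // skip nonexistent volumes, as the original loop does
		}
		ids = append(ids, strconv.Itoa(i+2))
	}
	return ids
}

func main() {
	present := map[string]bool{"data": true, "logs": true}
	ids := driveIDs([]string{"data", "missing", "logs"}, func(name string) bool {
		return present[name]
	})
	fmt.Println(ids)
}
```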
Run vm
Below is the entry point for running a VM. Internally it only calls Create and Start, that is, it performs the "Create vm" and "Start vm" steps described above:
func Run(ro *RunOptions, fs *flag.FlagSet) error {
if err := Create(ro.CreateOptions); err != nil {
return err
}
// Copy the pointer over for Start
// TODO: This is pretty bad, fix this
ro.vm = ro.VM
return Start(ro.StartOptions, fs)
}
Kill VM
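kill vm forcibly stops a VM without deleting it; the VM's metadata and storage remain under the /var/lib/firecracker/vm/<UID> directory.
kill vm relies on the same StopVM method that remove vm calls, but it invokes providers.Runtime.KillContainer to stop the containerd task.
A containerd task must be killed before it can be deleted.
Note that the network resources are released here; they are configured again for the container when ignite vm start is executed.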
func StopVM(vm *api.VM, kill, silent bool) error {
var err error
container := vm.PrefixedID()
action := "stop"
if !vm.Running() && !logs.Quiet {
log.Warnf("VM %q is not running but trying to cleanup networking for stopped container\n", vm.GetUID())
}
// Release the network resources
if err = removeNetworking(vm.Status.Runtime.ID, vm.Spec.Network.Ports...); err != nil {
log.Warnf("Failed to cleanup networking for stopped container %s %q: %v", vm.GetKind(), vm.GetUID(), err)
return err
}
if vm.Running() {
// Stop or kill the VM container
if kill {
action = "kill"
err = providers.Runtime.KillContainer(container, signalSIGQUIT) // TODO: common constant for SIGQUIT
} else {
err = providers.Runtime.StopContainer(container, nil)
}
if err != nil {
return fmt.Errorf("failed to %s container for %s %q: %v", action, vm.GetKind(), vm.GetUID(), err)
}
if silent {
return nil
}
if logs.Quiet {
fmt.Println(vm.GetUID())
} else {
log.Infof("Stopped %s with name %q and ID %q", vm.GetKind(), vm.GetName(), vm.GetUID())
}
}
return nil
}
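KillContainer is implemented as follows: it loads the containerd container and its task, sends syscall.SIGQUIT to the task to force the container process to stop, and uses task.Wait to wait for the process to exit. Note that in the code shown the signal argument is not used; SIGQUIT is hardcoded.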
func (cc *ctdClient) KillContainer(container, signal string) (err error) {
cont, err := cc.client.LoadContainer(cc.ctx, container)
if err != nil {
// If the container is not found, return nil, no-op.
if errdefs.IsNotFound(err) {
log.Warn(err)
err = nil
}
return
}
task, err := cont.Task(cc.ctx, cio.Load)
if err != nil {
// If the task is not found, return nil, no-op.
if errdefs.IsNotFound(err) {
log.Warn(err)
err = nil
}
return
}
// Initiate a wait
waitC, err := task.Wait(cc.ctx)
if err != nil {
return
}
// Send a SIGQUIT signal to force stop
if err = task.Kill(cc.ctx, syscall.SIGQUIT); err != nil {
return
}
// Wait for the container to stop
<-waitC
// Delete the task
_, err = task.Delete(cc.ctx)
return
}
Stop VM
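stop vm also uses the StopVM method, but it invokes providers.Runtime.StopContainer. Compared with kill vm, it adds a waiting period and is therefore more graceful:
- First, send syscall.SIGTERM to the container process to request a graceful shutdown
- If the process has not exited within the timeout (30s), send syscall.SIGQUIT to the container process to force it to stop
- Finally, call task.Delete to delete the task
The core code is as follows: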
waitC, err := task.Wait(cc.ctx)
if err != nil {
return
}
// Send a SIGTERM signal to request a clean shutdown
if err = task.Kill(cc.ctx, syscall.SIGTERM); err != nil {
return
}
// After sending the signal, start the timer to force-kill the task
timeoutC := make(chan error)
timer := time.AfterFunc(*timeout, func() {
timeoutC <- task.Kill(cc.ctx, syscall.SIGQUIT)
})
// Wait for the task to stop or the timer to fire
select {
case exitStatus := <-waitC:
timer.Stop() // Cancel the force-kill timer
err = exitStatus.Error() // TODO: Handle exit code
case err = <-timeoutC: // The kill timer has fired
}
// Delete the task
if _, e := task.Delete(cc.ctx); e != nil {
if err != nil {
err = fmt.Errorf("%v, task deletion failed: %v", err, e) // TODO: Multierror
} else {
err = e
}
}
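The escalation from SIGTERM to SIGQUIT can be tried out without containerd. The following is a minimal sketch under stated assumptions: fakeTask is a hypothetical stand-in for a containerd task, signals are plain strings, and the timeout is shortened to 50ms so the example finishes quickly.

```go
package main

import (
	"fmt"
	"time"
)

// fakeTask is a hypothetical stand-in for a containerd task. It only records
// which signal made it exit.
type fakeTask struct {
	exited      chan string // receives the signal that terminated the task
	ignoresTerm bool        // simulates a process that does not react to SIGTERM
}

func newFakeTask(ignoresTerm bool) *fakeTask {
	return &fakeTask{exited: make(chan string, 1), ignoresTerm: ignoresTerm}
}

// kill delivers a signal to the fake task.
func (t *fakeTask) kill(sig string) {
	if sig == "SIGTERM" && t.ignoresTerm {
		return
	}
	select {
	case t.exited <- sig:
	default: // the task has already exited
	}
}

// stopGracefully sends SIGTERM, arms a timer that escalates to SIGQUIT, and
// returns the signal that actually ended the task.
func stopGracefully(t *fakeTask, timeout time.Duration) string {
	t.kill("SIGTERM")
	timer := time.AfterFunc(timeout, func() { t.kill("SIGQUIT") })
	defer timer.Stop() // cancel the force-kill timer if it has not fired
	return <-t.exited
}

// demo runs a well-behaved task and a task that ignores SIGTERM.
func demo() (graceful, forced string) {
	graceful = stopGracefully(newFakeTask(false), 50*time.Millisecond)
	forced = stopGracefully(newFakeTask(true), 50*time.Millisecond)
	return graceful, forced
}

func main() {
	graceful, forced := demo()
	fmt.Println(graceful, forced)
}
```

The sketch uses a buffered channel so that the timer callback never blocks, even if nobody is receiving any more.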
Remove vm
Below is the entry point for removing a VM:
func Rm(ro *RmOptions) error {
for _, vm := range ro.vms {
// A running VM can only be removed when force is specified, just like removing a running container with the docker CLI
if vm.Running() && !ro.Force {
return fmt.Errorf("%s is running", vm.GetUID())
}
// Runtime and network info are present only when the VM is running.
if vm.Running() {
// Set the runtime and network-plugin providers from the VM status.
if err := config.SetAndPopulateProviders(vm.Status.Runtime.Name, vm.Status.Network.Plugin); err != nil {
return err
}
}
// This will first kill the VM container, and then remove it
if err := operations.DeleteVM(providers.Client, vm); err != nil {
return err
}
}
return nil
}
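A running VM owns several resources: the containerd task, the containerd container, the CNI network, the devicemapper snapshot device mounted by the VM, the VM log file, and the VM object stored in Storage. Removing a VM means cleaning up all of them.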
func DeleteVM(c *client.Client, vm *api.VM) error {
if err := CleanupVM(vm); err != nil {
return err
}
// Remove the VM object and the /var/lib/firecracker/vm/<UID>/ directory
return c.VMs().Delete(vm.GetUID())
}
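CleanupVM is the main cleanup method. It first calls StopVM (see the "Kill VM" and "Stop VM" sections) to stop and delete the container's task and remove the container network, then calls RemoveVMContainer to clean up the containerd resources, and finally calls dmlegacy.DeactivateSnapshot to remove the VM's filesystem (which runs dmsetup remove internally). The steps are:
- If the VM is running, call StopVM to remove the network and stop the task of the containerd container. When a VM is removed, the corresponding container must be removed as well; otherwise resources are leaked (see the related issue).
- Delete the container that hosts the VM
- Remove the devicemapper snapshot device mounted by the VM and the VM log file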
func CleanupVM(vm *api.VM) error {
// Runtime information is available only when the VM is running.
if vm.Running() {
// Inspect the container before trying to stop it and it gets auto-removed
inspectResult, _ := providers.Runtime.InspectContainer(vm.PrefixedID())
// If the VM is running, try to kill it first so we don't leave dangling containers. Otherwise, try to cleanup VM networking.
if err := StopVM(vm, true, true); err != nil {
if vm.Running() {
return err
}
}
// Remove the VM container if it exists
// TODO should this function return a proper error?
RemoveVMContainer(inspectResult)
}
// After removing the VM container, if the Snapshot Device is still there, clean up
if _, err := os.Stat(vm.SnapshotDev()); err == nil {
// try remove it again with DeactivateSnapshot
if err := dmlegacy.DeactivateSnapshot(vm); err != nil {
return err
}
}
if logs.Quiet {
fmt.Println(vm.GetUID())
} else {
log.Infof("Removed %s with name %q and ID %q", vm.GetKind(), vm.GetName(), vm.GetUID())
}
return nil
}
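RemoveContainer performs the following cleanup:
- Load the container that hosts the VM from containerd by name
- Load and delete the container's task
- Delete the container itself
- Remove the VM log file /tmp/<containerName>.log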
func (cc *ctdClient) RemoveContainer(container string) error {
// Remove the container if it exists
cont, contLoadErr := cc.client.LoadContainer(cc.ctx, container)
if errdefs.IsNotFound(contLoadErr) {
log.Debug(contLoadErr)
return nil
} else if contLoadErr != nil {
return contLoadErr
}
// Load the container's task without attaching
task, taskLoadErr := cont.Task(cc.ctx, nil)
if errdefs.IsNotFound(taskLoadErr) {
log.Debug(taskLoadErr)
} else if taskLoadErr != nil {
return taskLoadErr
} else {
_, taskDeleteErr := task.Delete(cc.ctx)
if taskDeleteErr != nil {
log.Debug(taskDeleteErr)
}
}
// Delete the container
deleteContErr := cont.Delete(cc.ctx, containerd.WithSnapshotCleanup)
if errdefs.IsNotFound(deleteContErr) {
log.Debug(deleteContErr)
} else if deleteContErr != nil {
return deleteContErr
}
// Remove the log file if it exists
logFile := fmt.Sprintf(logPathTemplate, container)
if util.FileExists(logFile) {
logDeleteErr := os.RemoveAll(logFile)
if logDeleteErr != nil {
return logDeleteErr
}
}
return nil
}
Auxiliary commands
vm logs
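Fetching the VM logs really means attaching to the task of the container that hosts the VM, capturing its output, and writing it to the /tmp/ignite-<UID>.log file: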
func (cc *ctdClient) ContainerLogs(container string) (r io.ReadCloser, err error) {
var (
cont containerd.Container
)
if cont, err = cc.client.LoadContainer(cc.ctx, container); err != nil {
return
}
var retriever *logRetriever
if retriever, err = newlogRetriever(fmt.Sprintf(logPathTemplate, container)); err != nil {
return
}
if _, err = cont.Task(cc.ctx, cio.NewAttach(retriever.Opt())); err != nil {
return
}
// Currently we have no way of detecting if the task's attach has filled the stdout and stderr
// buffers without asynchronous I/O (syscall.Conn and syscall.Splice). If the read reaches
// the end, the application hangs indefinitely waiting for new output from the container.
// TODO: Get rid of this, implement asynchronous I/O and read until the streams have been exhausted
time.Sleep(time.Second)
// Close the writer to signal EOF
if err = retriever.CloseWriter(); err != nil {
return
}
return retriever, nil
}
Attach vm
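ignite offers two ways to connect to a terminal: attach and ssh. The difference is that every ssh invocation creates a new session, whereas every attach operates on the system console. ssh is therefore the usual way to obtain a terminal session.
This part of ignite's code is based on the attach implementation in containerd.
The attach operation first obtains the current terminal and then handles input and output. ignite configures the terminal with oci.WithTTY when the container is started.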
func (cc *ctdClient) AttachContainer(container string) (err error) {
var (
cont containerd.Container
spec *oci.Spec
)
if cont, err = cc.client.LoadContainer(cc.ctx, container); err != nil {
return
}
if spec, err = cont.Spec(cc.ctx); err != nil {
return
}
var (
con console.Console
tty = spec.Process.Terminal
)
if tty {
con = console.Current() // get the current terminal
defer util.DeferErr(&err, con.Reset)
if err = con.SetRaw(); err != nil {
return
}
}
var (
task containerd.Task
statusC <-chan containerd.ExitStatus
igniteIO *igniteIO
)
if igniteIO, err = newIgniteIO(fmt.Sprintf(logPathTemplate, container)); err != nil {
return
}
defer util.DeferErr(&err, igniteIO.Close)
if task, err = cont.Task(cc.ctx, cio.NewAttach(igniteIO.Opt())); err != nil { // attach I/O, including the log output
return
}
if statusC, err = task.Wait(cc.ctx); err != nil {
return
}
if tty {
if err := HandleConsoleResize(cc.ctx, task, con); err != nil {
log.Errorf("console resize failed: %v", err)
}
} else {
sigc := ForwardAllSignals(cc.ctx, task)
defer StopCatch(sigc)
}
var code uint32
select {
case ec := <-statusC:
code, _, err = ec.Result()
case <-igniteIO.Detach():
fmt.Println() // Use a new line for the log entry
log.Println("Detached")
}
if code != 0 && err == nil {
err = fmt.Errorf("attach exited with code %d", code)
}
return
}
Inspect vm
inspect works on all three resource types (image, kernel, and vm): it loads the object from Storage and prints it in the requested output format.
ssh vm
SSH access can also be enabled by passing the --ssh flag to create; the VM can then be reached with:
$ ignite ssh my-vm
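The "create vm -> configuring ssh" section already described how the SSH service is set up in the VM. Here we look at how the client connects to it.
A key pair is used for the SSH connection; most of this is standard SSH client code (see the demo).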
// runSSH creates and runs ssh session based on the provided arguments.
// If the command list is empty, ssh shell is created, else the ssh command is
// executed.
func runSSH(vm *api.VM, privKeyFile string, command []string, tty bool, timeout uint32) (err error) {
// Check if the VM is running.
if !vm.Running() {
return fmt.Errorf("VM %q is not running", vm.GetUID())
}
// Get the IP address.
ipAddrs := vm.Status.Network.IPAddresses // IP addresses for the SSH connection
if len(ipAddrs) == 0 {
return fmt.Errorf("VM %q has no usable IP addresses", vm.GetUID())
}
// Get private key file path.
if len(privKeyFile) == 0 { // fall back to the VM's locally stored private key
privKeyFile = path.Join(vm.ObjectPath(), fmt.Sprintf(constants.VM_SSH_KEY_TEMPLATE, vm.GetUID()))
if !util.FileExists(privKeyFile) {
return fmt.Errorf("no private key found for VM %q", vm.GetUID())
}
}
// Create a new ssh signer for the private key.
signer, err := newSignerForKey(privKeyFile)
if err != nil {
return fmt.Errorf("unable to create signer for private key: %v", err)
}
// Defer exit here and set the exit code based on any ssh error, so that
// this ssh command returns the correct ssh exit code. Since this function
// results in an os.Exit, any error returned by this function won't be
// received by the caller. Print the error to make the error message
// visible and set the error code when an error is found.
exitCode := 0
defer func() {
os.Exit(exitCode)
}()
// printErrAndSetExitCode is used to print an error message, set exit code
// and return nil. This is needed because once the ssh connection is
// established, to return the error code of the actual ssh session, instead
// of returning an error, the runSSH function defers os.Exit with the ssh
// exit code. For showing any error to the user, it needs to be printed.
printErrAndSetExitCode := func(errMsg error, exitCode *int, code int) error {
log.Errorf("%v\n", errMsg)
*exitCode = code
return nil
}
// Create an SSH client, and connect.
config := newSSHConfig(signer, timeout)
client, err := ssh.Dial(defaultSSHNetwork, net.JoinHostPort(ipAddrs[0].String(), defaultSSHPort), config)
if err != nil {
return printErrAndSetExitCode(fmt.Errorf("failed to dial: %v", err), &exitCode, 1)
}
defer util.DeferErr(&err, client.Close)
// Create a session.
session, err := client.NewSession()
if err != nil {
return printErrAndSetExitCode(fmt.Errorf("failed to create session: %v", err), &exitCode, 1)
}
defer util.DeferErr(&err, session.Close)
// Configure tty if requested.
if tty {
// Get stdin file descriptor reference.
fd := int(os.Stdin.Fd())
// Store the raw state of the terminal.
state, err := terminal.MakeRaw(fd)
if err != nil {
return printErrAndSetExitCode(fmt.Errorf("failed to make terminal raw: %v", err), &exitCode, 1)
}
defer util.DeferErr(&err, func() error { return terminal.Restore(fd, state) })
// Get the terminal dimensions.
w, h, err := terminal.GetSize(fd)
if err != nil {
return printErrAndSetExitCode(fmt.Errorf("failed to get terminal size: %v", err), &exitCode, 1)
}
// Set terminal modes.
modes := ssh.TerminalModes{
ssh.ECHO: 1,
}
// Read the TERM environment variable and use it to request the PTY.
term := os.Getenv("TERM")
if term == "" {
term = defaultTerm
}
if err = session.RequestPty(term, h, w, modes); err != nil {
return printErrAndSetExitCode(fmt.Errorf("request for pseudo terminal failed: %v", err), &exitCode, 1)
}
}
// Connect input / output.
// TODO: these should come from the cobra command instead of hardcoding
// os.Stderr etc.
session.Stderr = os.Stderr
session.Stdout = os.Stdout
session.Stdin = os.Stdin
if len(command) == 0 {
if err = session.Shell(); err != nil {
return printErrAndSetExitCode(fmt.Errorf("failed to start shell: %v", err), &exitCode, 1)
}
if err = session.Wait(); err != nil {
if e, ok := err.(*ssh.ExitError); ok {
return printErrAndSetExitCode(err, &exitCode, e.ExitStatus())
}
return printErrAndSetExitCode(fmt.Errorf("failed waiting for session to exit: %v", err), &exitCode, 1)
}
} else {
if err = session.Run(joinShellCommand(command)); err != nil {
if e, ok := err.(*ssh.ExitError); ok {
return printErrAndSetExitCode(err, &exitCode, e.ExitStatus())
}
return printErrAndSetExitCode(fmt.Errorf("failed to run shell command: %s", err), &exitCode, 1)
}
}
return
}
func newSSHConfig(publicKey ssh.Signer, timeout uint32) *ssh.ClientConfig {
return &ssh.ClientConfig{
User: "root",
Auth: []ssh.AuthMethod{
ssh.PublicKeys(publicKey),
},
HostKeyCallback: ssh.InsecureIgnoreHostKey(), // TODO: use ssh.FixedPublicKey instead
Timeout: time.Second * time.Duration(timeout),
}
}
exec vm
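As shown below, exec is implemented on top of ssh: it first uses waitForSSH to wait until the SSH service is working, then uses runSSH to log in and run the command: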
func Exec(eo *ExecOptions) error {
if err := waitForSSH(eo.vm, constants.SSH_DEFAULT_TIMEOUT_SECONDS, time.Duration(eo.Timeout)*time.Second); err != nil {
return err
}
return runSSH(eo.vm, eo.IdentityFile, eo.command, eo.Tty, eo.Timeout)
}
func waitForSSH(vm *ignite.VM, dialSeconds int, sshTimeout time.Duration) error {
if err := dialSuccess(vm, dialSeconds); err != nil { // check that the SSH service is reachable
return err
}
certCheck := &ssh.CertChecker{
IsHostAuthority: func(auth ssh.PublicKey, address string) bool {
return true
},
IsRevoked: func(cert *ssh.Certificate) bool {
return false
},
HostKeyFallback: func(hostname string, remote net.Addr, key ssh.PublicKey) error {
return nil
},
}
config := &ssh.ClientConfig{ // client config without any authentication method
HostKeyCallback: certCheck.CheckHostKey,
Timeout: sshTimeout,
}
addr := vm.Status.Network.IPAddresses[0].String() + ":22"
sshConn, err := ssh.Dial("tcp", addr, config) // expect an "unable to authenticate" error, which shows that the SSH service is up
if err != nil {
if strings.Contains(err.Error(), "unable to authenticate") {
// we connected to the ssh server and received the expected failure
return nil
}
return err
}
defer sshConn.Close()
return fmt.Errorf("waitForSSH: connected successfully with no authentication -- failure was expected")
}
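waitForSSH builds the dial address by string concatenation, while runSSH uses net.JoinHostPort. The two agree for IPv4, but only net.JoinHostPort brackets IPv6 literals. The short sketch below shows the difference; sshAddress is a hypothetical helper and the addresses are example values.

```go
package main

import (
	"fmt"
	"net"
)

// sshAddress is a hypothetical helper that builds a dial address for the
// SSH port. net.JoinHostPort adds the brackets that IPv6 literals require.
func sshAddress(host string) string {
	return net.JoinHostPort(host, "22")
}

func main() {
	fmt.Println(sshAddress("10.61.0.41")) // IPv4: same as host + ":22"
	fmt.Println(sshAddress("fe80::1"))    // IPv6: the host is wrapped in brackets
}
```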
rm image
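- Look up the image object in Storage by image ID, and fetch all VM objects
- If the --force flag is specified when removing the image, the VMs that use the image are deleted as well
- Delete the image's directory /var/lib/firecracker/image/<UID>
If multiple images are specified, they are processed one by one.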
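The decision made in the first two steps can be sketched with in-memory data. This is a minimal illustration, not ignite's implementation: the vm type and vmsToRemove function are hypothetical, and the sketch assumes that removal without --force is refused while a VM still uses the image (the steps above only describe the --force case).

```go
package main

import (
	"errors"
	"fmt"
)

// vm is a hypothetical, reduced view of a VM object: its own UID and the
// UID of the image it was created from.
type vm struct {
	UID      string
	ImageUID string
}

var errImageInUse = errors.New("image is in use")

// vmsToRemove returns the UIDs of the VMs that must be deleted together with
// the image. Without force, an image that is still in use is not removed.
func vmsToRemove(imageUID string, vms []vm, force bool) ([]string, error) {
	var users []string
	for _, v := range vms {
		if v.ImageUID == imageUID {
			users = append(users, v.UID)
		}
	}
	if len(users) > 0 && !force {
		return nil, fmt.Errorf("%w by VMs %v", errImageInUse, users)
	}
	return users, nil
}

func main() {
	vms := []vm{{UID: "vm1", ImageUID: "img1"}, {UID: "vm2", ImageUID: "img2"}}

	_, err := vmsToRemove("img1", vms, false)
	fmt.Println("without force:", err)

	users, err := vmsToRemove("img1", vms, true)
	fmt.Println("with force:", users, err)
}
```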
rm kernel
The logic is the same as for rm image.
For more CLI operations, see the official documentation.
Ignited Daemon
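The ignited daemon is ignite's background process. When a user creates a VM manifest file under constants.MANIFEST_DIR (default /etc/firecracker/manifests), ignited detects the file change automatically and reads the images and metadata needed to create the VM from constants.DATA_DIR (default /var/lib/firecracker).
ignited uses a ManifestStorage to manage the two directories constants.MANIFEST_DIR and constants.DATA_DIR, and stores the resulting ManifestStorage, wrapped in a cache, in providers.Storage: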
func SetManifestStorage() (err error) {
log.Trace("Initializing the ManifestStorage provider...")
ManifestStorage, err = manifest.NewTwoWayManifestStorage(constants.MANIFEST_DIR, constants.DATA_DIR, scheme.Serializer)
if err != nil {
return
}
providers.Storage = cache.NewCache(ManifestStorage)
return
}
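Because changes under constants.MANIFEST_DIR have to be watched, a GenericWatchStorage is used here. Internally it relies on the rjeczalik/notify library to report file changes (Create/Modify/Delete).
constants.DATA_DIR stores image-related files, which only need to be read when a VM is created and do not need to be watched, so it is managed with a GenericStorage.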
func NewTwoWayManifestStorage(manifestDir, dataDir string, ser serializer.Serializer) (*ManifestStorage, error) {
ws, err := watch.NewGenericWatchStorage(storage.NewGenericStorage(storage.NewGenericMappedRawStorage(manifestDir), ser))
if err != nil {
return nil, err
}
ss := sync.NewSyncStorage(
storage.NewGenericStorage(
storage.NewGenericRawStorage(dataDir), ser),
ws)
return &ManifestStorage{
Storage: ss,
}, nil
}
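The relationship between the syncStorage and the main loop of the ignited daemon (ReconcileManifests) is as follows. A syncStorage can front multiple Storages and operate on them together, for example to get, set, or delete a resource object across several Storages.
The watchStorage passes file events to the syncStorage through a channel named eventStream, and the syncStorage forwards them to the main loop of the ignited daemon through a channel named updateStream. The main loop then acts according to the event type and the object that produced the event (create, delete, modify, and so on). Note that the main loop only cares about events for VM objects.
With this in mind, the main loop of the ignited daemon is fairly simple: it performs the operation on the VM that matches the event type.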
func ReconcileManifests(s *manifest.ManifestStorage) {
startMetricsThread()
// Wrap the Manifest Storage with a cache for better performance, and create a client
c = client.NewClient(cache.NewCache(s))
// Consume the events forwarded by the syncStorage
for upd := range s.GetUpdateStream() {
// Only events for VM resources are of interest
if upd.APIType.GetKind() != api.KindVM {
log.Tracef("GitOps: Ignoring kind %s", upd.APIType.GetKind())
kindIgnored.Inc()
continue
}
var vm *api.VM
var err error
// For a delete event, the VM manifest has already been removed from the ManifestStorage, so the VM cannot be fetched from it
if upd.Event == update.ObjectEventDelete {
// As we know this VM was deleted, it wouldn't show up in a Get() call
// Construct a temporary VM object for passing to the delete function
vm = &api.VM{
TypeMeta: *upd.APIType.GetTypeMeta(),
ObjectMeta: *upd.APIType.GetObjectMeta(),
Status: api.VMStatus{
Running: true, // TODO: Fix this in StopVM
},
}
} else {
// Get the real API object
vm, err = c.VMs().Get(upd.APIType.GetUID())
if err != nil {
log.Errorf("Getting %s %q returned an error: %v", upd.APIType.GetKind(), upd.APIType.GetUID(), err)
continue
}
// If the object was existent in the storage; validate it
// Validate the VM object
// TODO: Validate name uniqueness
if err := validation.ValidateVM(vm).ToAggregate(); err != nil {
log.Warnf("Skipping %s of %s %q, not valid: %v.", upd.Event, upd.APIType.GetKind(), upd.APIType.GetUID(), err)
continue
}
}
// TODO: Parallelization
switch upd.Event {
case update.ObjectEventCreate, update.ObjectEventModify: // handle create and modify events
runHandle(func() error {
return handleChange(vm)
})
case update.ObjectEventDelete: // handle delete events
runHandle(func() error {
// TODO: Temporary VM Object for removal
return handleDelete(vm)
})
default:
log.Infof("Unrecognized Git update type %s\n", upd.Event)
continue
}
}
}
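The event flow from the watchStorage through the syncStorage to the main loop can be modelled with two plain channels. The following is a minimal sketch, not ignite's code: the event type, the string kinds, and the pipeline function are hypothetical, and only the channel names eventStream and updateStream follow the description above.

```go
package main

import "fmt"

// event is a hypothetical, reduced form of a storage update.
type event struct {
	Kind string // resource kind, e.g. "VM" or "Image"
	Op   string // "create", "modify" or "delete"
	UID  string
}

// pipeline pushes the given file events through two stages, standing in for
// watchStorage -> eventStream -> syncStorage -> updateStream, and returns the
// events the main loop acted on.
func pipeline(fileEvents []event) []string {
	eventStream := make(chan event)
	updateStream := make(chan event)

	// watchStorage: emits one event per observed file change.
	go func() {
		defer close(eventStream)
		for _, e := range fileEvents {
			eventStream <- e
		}
	}()

	// syncStorage: forwards every event to the main loop.
	go func() {
		defer close(updateStream)
		for e := range eventStream {
			updateStream <- e
		}
	}()

	// Main loop: only VM events are handled, everything else is ignored.
	var handled []string
	for upd := range updateStream {
		if upd.Kind != "VM" {
			continue
		}
		handled = append(handled, upd.Op+":"+upd.UID)
	}
	return handled
}

func main() {
	fmt.Println(pipeline([]event{
		{Kind: "Image", Op: "create", UID: "img1"},
		{Kind: "VM", Op: "create", UID: "vm1"},
		{Kind: "VM", Op: "delete", UID: "vm1"},
	}))
}
```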
The event handlers are shown below:
func handleChange(vm *api.VM) (err error) {
// Only apply the new state if it
// differs from the current state
running := currentState(vm)
if vm.Status.Running && !running { // the metadata says running but the VM is not actually running: start it
err = start(vm)
} else if !vm.Status.Running && running { // the metadata says not running but the VM is actually running: stop it
err = stop(vm)
}
return
}
func handleDelete(vm *api.VM) error {
return remove(vm)
}
func remove(vm *api.VM) error {
log.Infof("Removing VM %q with name %q...", vm.GetUID(), vm.GetName())
vmDeleted.Inc()
// Object deletion is performed by the SyncStorage, so we just
// need to clean up any remaining resources of the VM here
return operations.CleanupVM(vm)
}
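handleChange is a small reconcile step: it compares the desired state from the manifest with the observed state and acts only when they differ. The following minimal sketch restates that decision as a pure function; the action type and decide function are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// action is a hypothetical name for the outcome of one reconcile step.
type action string

const (
	actionNone  action = "none"
	actionStart action = "start"
	actionStop  action = "stop"
)

// decide compares the desired state (from the VM metadata) with the actual
// state and returns what has to be done to make them match.
func decide(desiredRunning, actuallyRunning bool) action {
	switch {
	case desiredRunning && !actuallyRunning:
		return actionStart
	case !desiredRunning && actuallyRunning:
		return actionStop
	default:
		return actionNone // the states already agree
	}
}

func main() {
	for _, desired := range []bool{true, false} {
		for _, actual := range []bool{true, false} {
			fmt.Printf("desired=%v actual=%v -> %s\n", desired, actual, decide(desired, actual))
		}
	}
}
```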
CNI
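ignite uses CNI to configure the host and container networks.
Default CNI
The default CNI plugin is bridge, whose source lives in the plugins repository. Its main drawback is that it cannot provide cross-node communication between VMs.
The bridge CNI configuration in /etc/cni/net.d/10-ignite.conflist is as follows: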
{
"cniVersion": "0.4.0",
"name": "ignite-cni-bridge",
"plugins": [
{
"type": "bridge",
"bridge": "ignite0",
"isGateway": true,
"isDefaultGateway": true,
"promiscMode": true,
"ipMasq": true,
"ipam": {
"type": "host-local",
"subnet": "10.61.0.0/16"
}
},
{
"type": "portmap",
"capabilities": {
"portMappings": true
}
},
{
"type": "firewall"
}
]
}
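The structure of this conflist can be checked with a few lines of Go. The following is a minimal sketch that only uses encoding/json: the struct types and the summarize function are hypothetical and cover just the fields shown above, not the full CNI schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// conflist is a shortened copy of the configuration shown above.
const conflist = `{
  "cniVersion": "0.4.0",
  "name": "ignite-cni-bridge",
  "plugins": [
    {"type": "bridge", "bridge": "ignite0", "isGateway": true,
     "ipam": {"type": "host-local", "subnet": "10.61.0.0/16"}},
    {"type": "portmap", "capabilities": {"portMappings": true}},
    {"type": "firewall"}
  ]
}`

// The types below are hypothetical and model only the fields used here.
type ipamConf struct {
	Type   string `json:"type"`
	Subnet string `json:"subnet"`
}

type pluginConf struct {
	Type   string    `json:"type"`
	Bridge string    `json:"bridge,omitempty"`
	IPAM   *ipamConf `json:"ipam,omitempty"`
}

type confList struct {
	CNIVersion string       `json:"cniVersion"`
	Name       string       `json:"name"`
	Plugins    []pluginConf `json:"plugins"`
}

// summarize returns the network name, the plugin chain in order, and the
// subnet of the first plugin that carries an IPAM section.
func summarize(data string) (name string, chain []string, subnet string, err error) {
	var cl confList
	if err = json.Unmarshal([]byte(data), &cl); err != nil {
		return "", nil, "", err
	}
	for _, p := range cl.Plugins {
		chain = append(chain, p.Type)
		if subnet == "" && p.IPAM != nil {
			subnet = p.IPAM.Subnet
		}
	}
	return cl.Name, chain, subnet, nil
}

func main() {
	name, chain, subnet, err := summarize(conflist)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name, chain, subnet)
}
```

The plugins are applied in the listed order, so the bridge plugin with host-local IPAM allocates the address from the subnet before portmap and firewall run.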
ignite uses go-cni to configure CNI. It follows the basic go-cni usage, i.e. the following interface:
New(config ...CNIOpt) (CNI, error) // initialize a CNI object
// Setup setup the network for the namespace
Setup(ctx context.Context, id string, path string, opts ...NamespaceOpts) (*CNIResult, error)
// Remove tears down the network of the namespace.
Remove(ctx context.Context, id string, path string, opts ...NamespaceOpts) error
// Load loads the cni network config
Load(opts ...CNIOpt) error
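In addition, the Flannel plugin can be used for cross-host communication, but Flannel needs etcd to maintain the network state. See the official documentation for more details.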