Implementing the Docker run command
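This article implements the first command, Mydocker run, which mirrors docker run -it [command]. The new process is created in new Namespaces, which give it an isolated view of the system.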
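Each namespace a process belongs to is visible as a symlink under /proc/self/ns. Two processes share a namespace exactly when their links carry the same inode number. The sketch below reads the five namespace types that Mydocker's Cloneflags create and parses the link targets. parseNsLink is a hypothetical helper, not part of Mydocker. If you run it on the host and then inside a container started with those flags, you would expect a different inode for each type.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink splits a namespace link target such as "uts:[4026531838]"
// into its namespace type and inode number.
func parseNsLink(target string) (string, uint64, error) {
	kind, rest, ok := strings.Cut(target, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, err
	}
	return kind, inode, nil
}

func main() {
	// the namespace types that Mydocker's Cloneflags create fresh copies of
	for _, ns := range []string{"uts", "net", "pid", "mnt", "ipc"} {
		target, err := os.Readlink("/proc/self/ns/" + ns)
		if err != nil {
			fmt.Println(ns, "error:", err)
			continue
		}
		kind, inode, err := parseNsLink(target)
		if err != nil {
			fmt.Println(ns, "error:", err)
			continue
		}
		fmt.Printf("%-4s %s inode=%d\n", ns, kind, inode)
	}
}
```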
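The core problems to solve are:
- Command-line parsing: the github.com/urfave/cli library parses the user's input, and the commands to support are run and init;
- Isolating system information between containers, and exposing that information inside each container (done by mounting /proc).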
Implementation
docker run execution flow
Mydocker has to parse the user's command-line arguments, for example Mydocker run -it /bin/sh. The first step is to recognize and parse the run subcommand.
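The run handler creates and initializes the container process, and each container process is isolated with Namespaces.

A new container process is created by executing /proc/self/exe. /proc/self refers to the current process, so this re-runs the Mydocker binary itself. The binary is invoked as /proc/self/exe init, and the init argument tells the forked child process to perform the container initialization.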
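Once the container process is initialized, it has to run the actual command, for example /bin/sh. The parent passes the command-line arguments to the child through an anonymous pipe.

The child reads the pipe and calls the execve(fileName, argv, env) system call. This replaces the current process's program image, data, and stack with the user command. The command therefore runs inside the namespaces already set up for the container, and it keeps the init process's PID.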
Main function
The main function defines the container's core commands and how they are parsed, using the urfave/cli library.
// main.go
package main

import (
	"os"

	log "github.com/sirupsen/logrus"
	"github.com/urfave/cli"
)

const usage = `mydocker is a simple container runtime implementation.
The purpose of this project is to learn how docker works and how to write a docker by ourselves
Enjoy it, just for fun.`

func main() {
	app := cli.NewApp()
	app.Name = "Mydocker"
	app.Usage = usage
	// register the subcommands: initCommand and runCommand
	app.Commands = []cli.Command{
		initCommand,
		runCommand,
	}
	// configure logrus before any command runs
	app.Before = func(ctx *cli.Context) error {
		log.SetFormatter(&log.JSONFormatter{})
		log.SetOutput(os.Stdout)
		return nil
	}
	if err := app.Run(os.Args); err != nil {
		log.Fatal(err)
	}
}
The main function does the following:
- creates the urfave/cli app and defines how command-line arguments are parsed;
- sets the log output format.
Parsing user command-line arguments
With docker, you start a container and run a command in it with docker run xxx. The full format is docker run [OPTIONS] IMAGE [COMMAND] [ARG...].
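OPTIONS:
- -a stdin: choose which standard streams to attach; the choices are STDIN, STDOUT, and STDERR;
- -d: run the container in the background and print its container ID;
- -i: run the container in interactive mode, usually combined with -t;
- -P: publish the container's ports to random host ports;
- -p: publish a specific port mapping, in the form hostPort:containerPort;
- -t: allocate a pseudo-TTY for the container, usually combined with -i;
- --expose=[]: expose a port or a range of ports;
- --volume, -v: bind-mount a volume.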
package main

import (
	"Mydockker/container"
	"fmt"

	log "github.com/sirupsen/logrus"
	"github.com/urfave/cli"
)

/**
 * start procedure:
 * 1. the user runs Mydocker run by hand;
 * 2. urfave/cli parses the user's command;
 * 3. runCommand's Action calls Run, which builds the cmd object;
 * 4. NewParentProcess returns the cmd object (and the pipe's write end) to Run;
 * 5. starting the cmd executes /proc/self/exe init, i.e. mydocker's init command, which initializes the container environment;
 * 6. all init procedures end.
 */

/**
 * container start command, for example: Mydocker run -it /bin/bash
 */
var runCommand = cli.Command{
	Name: "run",
	Usage: `Create a container with namespace and cgroups limit
mydocker run -it [command]`,
	Flags: []cli.Flag{
		cli.BoolFlag{
			Name:  "it",
			Usage: "enable tty",
		},
		cli.BoolFlag{
			Name:  "d",
			Usage: "detach container",
		},
		cli.StringFlag{
			Name:  "m",
			Usage: "memory limit",
		},
		cli.StringFlag{
			Name:  "cpushare",
			Usage: "cpushare limit",
		},
		cli.StringFlag{
			Name:  "cpuset",
			Usage: "cpuset limit",
		},
		cli.StringFlag{
			Name:  "name",
			Usage: "container name",
		},
		cli.StringFlag{
			Name:  "v",
			Usage: "volume",
		},
		cli.StringSliceFlag{
			Name:  "e",
			Usage: "set environment",
		},
		cli.StringFlag{
			Name:  "net",
			Usage: "container network",
		},
		cli.StringSliceFlag{
			Name:  "p",
			Usage: "port mapping",
		},
	},
	/**
	 * parse the command line; tty means the container is attached to the current terminal
	 */
	Action: func(context *cli.Context) error {
		if len(context.Args()) < 1 {
			return fmt.Errorf("missing container command")
		}
		// collect the positional arguments (the container command)
		var cmdArray []string
		for _, arg := range context.Args() {
			cmdArray = append(cmdArray, arg)
		}
		// -it: interactive mode with a tty attached
		tty := context.Bool("it")
		// -name: container name
		containerName := context.String("name")
		// -e: environment variables; the lookup key must match the flag name "e"
		envSlice := context.StringSlice("e")
		// at this stage the first argument doubles as the image name
		imageName := cmdArray[0]
		log.Infof("exec run command, bashMode:%v, imageName:%v", tty, imageName)
		// create the container process
		Run(tty, cmdArray, containerName, imageName, envSlice)
		return nil
	},
}

/**
 * container initialization command
 */
var initCommand = cli.Command{
	Name:  "init",
	Usage: "Init container process run user's process in container. Do not call it outside",
	/**
	 * initialize process resources after the container is created
	 */
	Action: func(context *cli.Context) error {
		log.Infof("exec init command")
		return container.ContainerResourceInit()
	},
}
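runCommand leaves flag parsing to urfave/cli. To show what that parsing yields without pulling in the dependency, here is a sketch that swaps in Go's standard flag package (not urfave/cli's code path) for a subset of the flags. It covers -it, -name, and the repeatable -e, which is collected by a custom flag.Value. parseRunArgs, runOptions and stringSlice are hypothetical names. With the standard library, parsing stops at the first non-flag argument, so the container command keeps its own flags such as -c. urfave/cli's handling of flag order may differ, so check its documentation before relying on either behavior.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// stringSlice collects a repeatable flag such as -e KEY=VALUE,
// similar in spirit to cli.StringSliceFlag.
type stringSlice []string

func (s *stringSlice) String() string { return strings.Join(*s, ",") }

func (s *stringSlice) Set(v string) error {
	*s = append(*s, v)
	return nil
}

type runOptions struct {
	tty     bool
	name    string
	env     []string
	command []string
}

// parseRunArgs parses the arguments that follow "run".
func parseRunArgs(args []string) (runOptions, error) {
	fs := flag.NewFlagSet("run", flag.ContinueOnError)
	tty := fs.Bool("it", false, "enable tty")
	name := fs.String("name", "", "container name")
	var env stringSlice
	fs.Var(&env, "e", "set environment")
	if err := fs.Parse(args); err != nil {
		return runOptions{}, err
	}
	if fs.NArg() < 1 {
		return runOptions{}, fmt.Errorf("missing container command")
	}
	// everything after the first positional argument is the container command
	return runOptions{tty: *tty, name: *name, env: env, command: fs.Args()}, nil
}

func main() {
	opts, err := parseRunArgs([]string{"-it", "-name", "demo", "-e", "A=1", "-e", "B=2", "/bin/sh", "-c", "ls -l"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", opts)
}
```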
Note that the run handler launches /proc/self/exe init, i.e. the same executable with the init argument. The new child process therefore runs the init handler.
The init handler initializes the process's resources, mounts /proc, and executes the shell command.
Process creation and initialization
// run.go
package main

import (
	"Mydockker/container"

	log "github.com/sirupsen/logrus"
)

/**
 * clone a process isolated by namespaces, and use /proc/self/exe to init its resources
 * attention:
 * 1. the parent writes to writePipe only after the child process has been started
 */
func Run(tty bool, cmdArray []string, containerName, imageName string, envSlice []string) {
	// build the container process command; argument order must match NewParentProcess(tty, containerName, imageName, envSlice)
	cmdProcess, writePipe := container.NewParentProcess(tty, containerName, imageName, envSlice)
	if cmdProcess == nil {
		log.Errorf("run::Run create child process failed")
		return
	}
	// start the container process, which runs /proc/self/exe init
	if err := cmdProcess.Start(); err != nil {
		log.Errorf("run::Run parent Start failed %v", err)
		return
	}
	// send the user command to the child after it has started (sendInitCommands is defined elsewhere in run.go)
	sendInitCommands(cmdArray, writePipe)
	if tty {
		cmdProcess.Wait()
	}
}

// package container (InfoLogFormat, Perm0622, LogFileName and readUserCommands are defined elsewhere in the package)
package container

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"syscall"

	log "github.com/sirupsen/logrus"
)

/**
 * start a new process and return the executable command
 * 1. use /proc/self/exe to create a child process isolated by namespaces;
 * 2. pass the init argument so the child initializes itself;
 * 3. redirect stdin/stdout/stderr;
 *
 * perf:
 * 1. use a pipe to transfer parameters between parent and child, which avoids overly long command-line arguments
 */
func NewParentProcess(tty bool, containerName, imageName string, envSlice []string) (*exec.Cmd, *os.File) {
	// create the pipe that carries parameters from parent to child
	readPipe, writePipe, err := os.Pipe()
	if err != nil {
		log.Errorf("container_process::NewParentProcess new pipe failed")
		return nil, nil
	}
	// resolve the path of the current executable
	exePath, err := os.Readlink("/proc/self/exe")
	if err != nil {
		log.Errorf("container_process::NewParentProcess can't find /proc/self/exe link")
		return nil, nil
	}
	processCmd := exec.Command(exePath, "init")
	processCmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWNET | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS | syscall.CLONE_NEWIPC,
	}
	// redirect stdin/stdout/stderr
	if tty {
		processCmd.Stdin = os.Stdin
		processCmd.Stdout = os.Stdout
		processCmd.Stderr = os.Stderr
	} else {
		// detached mode: write the container's stdout to a log file
		dirURL := fmt.Sprintf(InfoLogFormat, containerName)
		if err := os.MkdirAll(dirURL, Perm0622); err != nil {
			log.Errorf("container_process::NewParentProcess mkdir log directory failed %s", dirURL)
			return nil, nil
		}
		logPath := dirURL + LogFileName
		file, err := os.Create(logPath)
		if err != nil {
			log.Errorf("container_process::NewParentProcess create logFile failed %s", logPath)
			return nil, nil
		}
		processCmd.Stdout = file
	}
	// pass readPipe to the child: ExtraFiles[0] becomes fd 3, the fourth fd after stdin/stdout/stderr
	processCmd.ExtraFiles = []*os.File{readPipe}
	return processCmd, writePipe
}

/**
 * the first code run in the new container process; initializes its resources
 * 1. read the user command from readPipe;
 * 2. mount /proc for the current process;
 * 3. look up the command and execve it, replacing the init process
 */
func ContainerResourceInit() error {
	// read parameters from readPipe
	cmdArrays := readUserCommands()
	if len(cmdArrays) == 0 {
		return errors.New("init::ContainerResourceInit userCommands is nil")
	}
	// mount /proc
	mountProc()
	// resolve the absolute path of the command
	path, err := exec.LookPath(cmdArrays[0])
	if err != nil {
		log.Errorf("init::ContainerResourceInit exec lookPath failed, err=%v", err)
		return err
	}
	log.Infof("init::ContainerResourceInit execuatble path=%v", path)
	// on success syscall.Exec never returns: the user command replaces this process
	if err = syscall.Exec(path, cmdArrays[0:], os.Environ()); err != nil {
		log.Errorf("init::ContainerResourceInit exec failed, err=%v", err)
		return err
	}
	return nil
}

/**
 * mount the proc filesystem for the current process
 * mountFlags:
 * 1. syscall.MS_NOEXEC: do not allow programs to be executed from this filesystem;
 * 2. syscall.MS_NOSUID: ignore set-user-ID and set-group-ID bits when running programs from this filesystem;
 * 3. syscall.MS_NODEV: do not allow access to device files on this filesystem;
 * on systemd hosts mounts are shared by default, so the mount tree must first be made private explicitly
 */
func mountProc() {
	if err := syscall.Mount("", "/", "", syscall.MS_PRIVATE|syscall.MS_REC, ""); err != nil {
		log.Errorf("mount default namespace failed, err = %v", err)
		return
	}
	defaultMountFlags := syscall.MS_NOEXEC | syscall.MS_NOSUID | syscall.MS_NODEV
	if err := syscall.Mount("proc", "/proc", "proc", uintptr(defaultMountFlags), ""); err != nil {
		log.Errorf("mount proc failed, err = %v", err)
		return
	}
}
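/proc is a virtual filesystem that exposes information about the kernel and running processes. This includes runtime system information such as memory usage, mounted devices, and hardware configuration. It lives in memory and takes no disk space. Mounting /proc lets us read this kernel information.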
Inside a container, /proc must be isolated from the host's /proc, so the container's init step has to remount the /proc filesystem. The equivalent shell command is mount -t proc proc /proc:
syscall.Mount("proc", "/proc", "proc", uintptr(mountFlags), "")
Following the logic above as-is, however, runs into this problem:
root@mydocker:~/mydocker# ./mydocker run -it /bin/ls
{"level":"info","msg":"init come on","time":"2024-01-03T15:07:27+08:00"}
{"level":"info","msg":"command: /bin/ls","time":"2024-01-03T15:07:27+08:00"}
{"level":"info","msg":"command:/bin/ls","time":"2024-01-03T15:07:27+08:00"}
LICENSE Makefile README.md container example go.mod go.sum main.go main_command.go mydocker run.go
root@mydocker:~/mydocker# ./mydocker run -it /bin/ls
{"level":"error","msg":"fork/exec /proc/self/exe: no such file or directory","time":"2024-01-03T15:07:28+08:00"}
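Starting a container a second time fails because /proc/self/exe cannot be found. On Linux systems that use systemd, mounts are shared by default, so a mount made in the container's new mount namespace propagates back to the host.

The proc instance mounted by the first container therefore covers the host's /proc. That instance belongs to the container's PID namespace, where host processes have no entry, so /proc/self no longer resolves on the host and fork/exec of /proc/self/exe fails.

The fix is to explicitly mark the mounts private first. This keeps mount events from leaking out and leaves the host's /proc intact: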
func mountProc() {
	// make all mounts private, recursively, so mount events do not propagate to the host
	if err := syscall.Mount("", "/", "", syscall.MS_PRIVATE|syscall.MS_REC, ""); err != nil {
		log.Errorf("mount default namespace failed, err = %v", err)
		return
	}
	// mount a fresh proc filesystem on /proc for this process
	defaultMountFlags := syscall.MS_NOEXEC | syscall.MS_NOSUID | syscall.MS_NODEV
	if err := syscall.Mount("proc", "/proc", "proc", uintptr(defaultMountFlags), ""); err != nil {
		log.Errorf("mount proc failed, err = %v", err)
		return
	}
}
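You can check whether a mount is shared in /proc/self/mountinfo. As described in proc(5), each line carries zero or more optional fields between the mount options and a "-" separator. Propagation tags appear there, such as shared:N or master:N, and no tag means the mount is private.

The sketch below prints the tags for /. propagationOf is a hypothetical helper. On a systemd host you would typically expect a shared:N tag outside the container, and no tag inside the container after the MS_PRIVATE|MS_REC remount.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// propagationOf parses one /proc/<pid>/mountinfo line. If its mount point
// (field 5) equals target, it returns the optional fields found between the
// mount options (field 6) and the "-" separator, e.g. "shared:1".
func propagationOf(line, target string) ([]string, bool) {
	fields := strings.Fields(line)
	if len(fields) < 7 || fields[4] != target {
		return nil, false
	}
	var tags []string
	for _, f := range fields[6:] {
		if f == "-" {
			break
		}
		tags = append(tags, f)
	}
	return tags, true
}

func main() {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if tags, ok := propagationOf(sc.Text(), "/"); ok {
			// an empty list means the mount is private
			fmt.Printf("/ propagation tags: %v\n", tags)
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```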
Testing
[Figure: project directory layout]
Build and run:
[root@localhost Mydocker]# go build .
[root@localhost Mydocker]# ./Mydockker run -it /bin/sh
{"level":"info","msg":"exec run command, bashMode:true, imageName:/bin/sh","time":"2024-01-31T23:06:07+08:00"}
{"level":"info","msg":"run::sendInitCommands all commands:/bin/sh","time":"2024-01-31T23:06:07+08:00"}
{"level":"info","msg":"exec init command","time":"2024-01-31T23:06:07+08:00"}
{"level":"info","msg":"init::ContainerResourceInit execuatble path=/bin/sh","time":"2024-01-31T23:06:07+08:00"}
# list the directory inside the container
sh-4.2# ls
container go.sum mainCommands.go Mydockker
go.mod log main.go run.go
# ps -af inside the container shows /bin/sh as the first process (PID 1), as expected
sh-4.2# ps -af
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 23:06 pts/0 00:00:00 /bin/sh
root 7 1 0 23:06 pts/0 00:00:00 ps -af
From: https://www.cnblogs.com/istitches/p/18000339