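ELF is the most widely used executable format on Linux. To understand how the Linux kernel maps an ELF file precisely into its intended memory locations, I spent last weekend reading through the sys_execve part of the kernel. A summary follows:
1. The ELF format
An ELF file specifies where the text, data, and bss segments should be placed in the process's virtual address space, and records the dynamic linker (interpreter) and the shared libraries the process needs.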
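2. Rough execution flow of sys_execve
1) Open the ELF binary and read in the ELF header.
2) Discard the mm-related state inherited from the parent process.
3) Following the ELF headers, map the interpreter, text, data, and other segments into memory (which implies that Linux cannot directly execute a compressed binary). Set up the stack and update the mm fields.
4) "Forge" the register frame saved on this process's kernel stack in preparation for the return to user mode. The saved ip points at the interpreter's entry point.
5) The sys_execve system call returns to user mode and the interpreter starts running (the interpreter is usually ld-linux.so.2 or similar).
What does the interpreter do once it is running in user mode?
6) The interpreter loads the shared libraries for the process and performs all the relocation and mapping work.
7) The interpreter jumps to the program's entry point (normally _start), which eventually calls main.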
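A few questions deserve a closer look:
1> What does the kernel stack look like when sys_execve is called, and how do user-mode arguments reach the kernel?
Only with this understood is it clear how the kernel can return to the interpreter's entry point.
A: See the material on Linux system calls. Linux handles system call arguments in one uniform way that is well worth studying; I will write up its design in a separate post.
2> Where do the interpreter's arguments come from, and how does the interpreter get to main?
A: This is puzzling from the viewpoint of ordinary C function calls. At the assembly level, however, the call stack can be adjusted and "forged" dynamically and deliberately, which makes it easy to switch between functions and pass arguments.
The kernel builds the argument stack the interpreter needs, and the interpreter in turn leaves the stack in the form the program's startup code expects. The user stack is set up in setup_arg_pages.
3> How does the kernel make sure each segment is mapped at the expected address?
A: Pass the MAP_FIXED flag to mmap.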
Appendix: annotated notes
/* Replace the mm of the current task with the mm passed in. Called by
 * int flush_old_exec(struct linux_binprm * bprm).
 * The reference to the old mm is dropped.
 */
static int exec_mmap(struct mm_struct *mm)
{
	struct task_struct *tsk;
	struct mm_struct * old_mm, *active_mm;
	/* Notify parent that we're no longer interested in the old VM */
	tsk = current;
	old_mm = current->mm;
	/* Let go of the current task's old mm (out with the old!) */
	mm_release(tsk, old_mm);
	if (old_mm) { /* cannot proceed if the old mm is in use by a core dump */
		/*
		 * Make sure that if there is a core dump in progress
		 * for the old mm, we get out and die instead of going
		 * through with the exec. We must hold mmap_sem around
		 * checking core_state and changing tsk->mm.
		 */
		down_read(&old_mm->mmap_sem);
		if (unlikely(old_mm->core_state)) {
			up_read(&old_mm->mmap_sem);
			return -EINTR;
		}
	}
	/* Done with the old mm; time to bring in the new one */
	task_lock(tsk);
	/* For a kernel thread, which has no mm of its own, active_mm is the one that matters */
	active_mm = tsk->active_mm;
	/* Install the new mm */
	tsk->mm = mm;
	tsk->active_mm = mm;
	/* From here on the new mm is in charge: switch to its address space */
	activate_mm(active_mm, mm);
	task_unlock(tsk);
	/* Sets a few function pointers in mm. What for? */
	arch_pick_mmap_layout(mm);
	if (old_mm) {
		/* If old_mm is still alive at this point,
		 * it is because active_mm refers to it
		 */
		up_read(&old_mm->mmap_sem);
		BUG_ON(active_mm != old_mm);
		/* If someone else still uses the old mm, hand ownership over to them */
		mm_update_next_owner(old_mm);
		/* Drop our own reference to the old mm */
		mmput(old_mm);
		return 0;
	}
	/* Reached only when old_mm is NULL: drop the borrowed active_mm
	 * for good. Is this for the multithreaded case? */
	mmdrop(active_mm);
	return 0;
}
/* Map the ELF file into the virtual memory of the current process.
 * Overall approach: see item 6 of the background notes below.
 *
 *
 */
/* Background
Complete reference on the ELF format:
http://www.muppetlabs.com/~breadbox/software/ELF.txt
1. To follow the code below, it helps to know the layout of the ELF header:
typedef struct elf32_hdr{
  unsigned char e_ident[EI_NIDENT]; /* Magic number */
  Elf32_Half e_type; /* ET_EXEC or ET_DYN: executable image or shared object */
  Elf32_Half e_machine; /* Target CPU type */
  Elf32_Word e_version; /* */
  Elf32_Addr e_entry; /* Entry point, usually the start of _start() */
  Elf32_Off e_phoff; /* File offset of the program header array */
  Elf32_Off e_shoff; /* File offset of the section header array,
                        which delimits the "code section", "data section", and so on */
  Elf32_Word e_flags;
  Elf32_Half e_ehsize; /* Size of this ELF header itself */
  Elf32_Half e_phentsize; /* Size of one program header entry */
  Elf32_Half e_phnum; /* Number of program header entries */
  Elf32_Half e_shentsize; /* Size of one section header entry */
  Elf32_Half e_shnum; /* Number of section header entries */
  Elf32_Half e_shstrndx;
} Elf32_Ehdr;
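2. What does each program header contain?
typedef struct elf32_phdr{
  Elf32_Word p_type; /* Segment type; in particular, PT_LOAD marks a loadable segment */
  Elf32_Off p_offset; /* Offset of the segment from byte 0 of the file */
  Elf32_Addr p_vaddr; /* Start address the segment occupies in the process address space once loaded */
  Elf32_Addr p_paddr; /* Ignored on operating systems that use paging */
  Elf32_Word p_filesz; /* Size in bytes of the segment in the file. Some segments take up
                          memory but have no content in the file; this field is then 0 */
  Elf32_Word p_memsz; /* Size in bytes of the segment in memory. Some segments exist only
                         in the file and are not loaded into memory; this field is then 0. */
  Elf32_Word p_flags;
  Elf32_Word p_align; /* Alignment */
} Elf32_Phdr;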
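3. What does each section header contain?
The section table is the linker's view of an ELF file. From that point of view the file is divided into many sections,
each holding data for a different purpose; the same data may also be referenced by the program headers described above.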
typedef struct elf64_shdr {
Elf64_Word sh_name; /* Section name, index in string tbl */
Elf64_Word sh_type; /* Type of section */
Elf64_Xword sh_flags; /* Miscellaneous section attributes */
Elf64_Addr sh_addr; /* Section virtual addr at execution */
Elf64_Off sh_offset; /* Section file offset */
Elf64_Xword sh_size; /* Size of section in bytes */
Elf64_Word sh_link; /* Index of another section */
Elf64_Word sh_info; /* Additional section information */
Elf64_Xword sh_addralign; /* Section alignment */
Elf64_Xword sh_entsize; /* Entry size if section holds table */
} Elf64_Shdr;
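4. What is the difference between program headers and section headers?
The linker and the loader look at an ELF file in completely different ways.
The linker sees a collection of logical sections described by the section header table (it ignores the program header table).
The loader sees a collection of segments described by the program header table (it ignores the section header table).
Illustration: http://img.ddvip.com/2009_09_10/1252583354_ddvip_9407.jpeg
Segments are the division from the image-loading point of view; sections are the division from the linking point of view.
Taking Wine as an example,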
Section to Segment mapping:
Segment Sections...
00
01 .interp
02 .interp .note.ABI-tag .hash .dynsym .dynstr .gnu.version .gnu.version_r
.rel.dyn .rel.plt .init .plt .text .fini .rodata .eh_frame
03 .data .dynamic .ctors .dtors .jcr .got .bss
04 .dynamic
05 .note.ABI-tag
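5. How is each segment mapped at the expected virtual address?
The flags argument of mmap accepts MAP_FIXED. With this flag set, the mapping is placed at exactly the requested address, and mmap returns an error if that is not possible.
6. Taken as a whole, load_elf_binary does the following:
1) Maps the data of each ELF segment into memory.
2) Loads the interpreter and maps it into memory (including choosing its load address).
3) Sets ip, sp, and so on in the regs structure, ready to start the process.
Open question: how does the interpreter hand control to main()?
My own analysis:
Once load_elf_binary has the entry address of ld-linux.so.2 (call it eax), the effect is as if it executed
push eax
ret
which lands in ld-linux.so.2, where the shared libraries get loaded. (The kernel does not literally run these two instructions: start_thread stores the address in regs, and the return to user mode performs the jump.)
Q1. How does it know which libraries to load? Where do its arguments come from?
Q2. After loading, how does it get to main and start the main program?
A1. Through the stack. The two instructions above amount to a jump, so no new frame is created:
the interpreter starts running on the user stack that the kernel has already filled in.
A2. By unwinding the interpreter's own stack frames and then jumping to the program's entry point, which leads to main.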
The code below is taken from the GNU ELF interpreter and shows how ld.so finishes the job.
Code from http://ftp.gnu.org >
gnu > glibc > glibc-2.5.tar.bz2 > glibc-2.5 > sysdeps > i386 > dl-machine.h
/* Initial entry point code for the dynamic linker.
The C function `_dl_start' is the real entry point;
its return value is the user program's entry point. */
#define RTLD_START asm ("\n\
	.text\n\
	.align 16\n\
0:	movl (%esp), %ebx\n\
	ret\n\
	.align 16\n\
.globl _start\n\
.globl _dl_start_user\n\
_start:\n\
	# Note that _dl_start gets the parameter in %eax.\n\
	movl %esp, %eax\n\
	call _dl_start\n\
_dl_start_user:\n\
	# Save the user entry point address in %edi.\n\
	movl %eax, %edi\n\
	# Point %ebx at the GOT.\n\
	call 0b\n\
	addl $_GLOBAL_OFFSET_TABLE_, %ebx\n\
	# See if we were run as a command with the executable file\n\
	# name as an extra leading argument.\n\
	movl _dl_skip_args@GOTOFF(%ebx), %eax\n\
	# Pop the original argument count.\n\
	popl %edx\n\
	# Adjust the stack pointer to skip _dl_skip_args words.\n\
	leal (%esp,%eax,4), %esp\n\
	# Subtract _dl_skip_args from argc.\n\
	subl %eax, %edx\n\
	# Push argc back on the stack.\n\
	push %edx\n\
	# The special initializer gets called with the stack just\n\
	# as the application's entry point will see it; it can\n\
	# switch stacks if it moves these contents over.\n\
" RTLD_START_SPECIAL_INIT "\n\
	# Load the parameters again.\n\
	# (eax, edx, ecx, *--esp) = (_dl_loaded, argc, argv, envp)\n\
	movl _rtld_local@GOTOFF(%ebx), %eax\n\
	leal 8(%esp,%edx,4), %esi\n\
	leal 4(%esp), %ecx\n\
	movl %esp, %ebp\n\
	# Make sure _dl_init is run with 16 byte aligned stack.\n\
	andl $-16, %esp\n\
	pushl %eax\n\
	pushl %eax\n\
	pushl %ebp\n\
	pushl %esi\n\
	# Clear %ebp, so that even constructors have terminated backchain.\n\
	xorl %ebp, %ebp\n\
	# Call the function to run the initializers.\n\
	call _dl_init_internal@PLT\n\
	# Pass our finalizer function to the user in %edx, as per ELF ABI.\n\
	leal _dl_fini@GOTOFF(%ebx), %edx\n\
	# Restore %esp _start expects.\n\
	movl (%esp), %esp\n\
	# Jump to the user's entry point.\n\
	jmp *%edi\n\
	.previous\n\
");
/* Call the OS-dependent function to set up life so we can do things like
file access. It will call `dl_main' (below) to do all the real work
of the dynamic linker, and then unwind our frame and run the user
entry point on the same stack we entered on. */
Code in rtld.c ....
*/
static int load_elf_binary(struct linux_binprm *bprm, struct pt_regs *regs)
{
struct file *interpreter = NULL; /* to shut gcc up */
unsigned long load_addr = 0, load_bias = 0;
int load_addr_set = 0;
char * elf_interpreter = NULL;
unsigned long error;
struct elf_phdr *elf_ppnt, *elf_phdata;
unsigned long elf_bss, elf_brk;
int elf_exec_fileno;
int retval, i;
unsigned int size;
unsigned long elf_entry;
unsigned long interp_load_addr = 0;
unsigned long start_code, end_code, start_data, end_data;
unsigned long reloc_func_desc = 0;
int executable_stack = EXSTACK_DEFAULT;
unsigned long def_flags = 0;
struct {
struct elfhdr elf_ex;
struct elfhdr interp_elf_ex;
} *loc;
loc = kmalloc(sizeof(*loc), GFP_KERNEL);
if (!loc) {
retval = -ENOMEM;
goto out_ret;
}
/* Get the exec-header */
loc->elf_ex = *((struct elfhdr *)bprm->buf);
retval = -ENOEXEC;
/* First of all, some simple consistency checks */
if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
goto out;
if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type != ET_DYN)
goto out;
if (!elf_check_arch(&loc->elf_ex))
goto out;
/* The filesystem holding the ELF file must support mmap */
if (!bprm->file->f_op||!bprm->file->f_op->mmap)
goto out;
/* Now read in all of the header information */
if (loc->elf_ex.e_phentsize != sizeof(struct elf_phdr))
goto out;
if (loc->elf_ex.e_phnum < 1 ||
loc->elf_ex.e_phnum > 65536U / sizeof(struct elf_phdr))
goto out;
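/* Note: the ELF loader (as opposed to the linker) uses only the program headers.
 * Allocate space for the program header table here.
 * The program headers say how each segment is to be loaded into memory.
 */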
size = loc->elf_ex.e_phnum * sizeof(struct elf_phdr);
retval = -ENOMEM;
elf_phdata = kmalloc(size, GFP_KERNEL);
if (!elf_phdata)
goto out;
/* Read the program header table of the ELF file into the buffer */
retval = kernel_read(bprm->file, loc->elf_ex.e_phoff,
(char *)elf_phdata, size);
if (retval != size) {
if (retval >= 0)
retval = -EIO;
goto out_free_ph;
}
/* The operations on the ELF file below presumably need an fd (?) */
retval = get_unused_fd();
if (retval < 0)
goto out_free_ph;
get_file(bprm->file);
fd_install(elf_exec_fileno = retval, bprm->file);
elf_ppnt = elf_phdata;
elf_bss = 0;
elf_brk = 0;
start_code = ~0UL;
end_code = 0;
start_data = 0;
end_data = 0;
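/* The code below walks the program header array three times:
 * the first pass handles the PT_INTERP segment,
 * the second pass handles the PT_GNU_STACK segment,
 * the third pass handles the PT_LOAD segments.
 * NOTE: PT_DYNAMIC is not handled here; it is left to the interpreter to map and relocate.
 * Each pass is annotated below.
 */
/*
 * First pass: the PT_INTERP segment
 */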
for (i = 0; i < loc->elf_ex.e_phnum; i++) {
if (elf_ppnt->p_type == PT_INTERP) {
/* This is the program interpreter used for
* shared libraries - for now assume that this
* is an a.out format binary
*/
retval = -ENOEXEC;
if (elf_ppnt->p_filesz > PATH_MAX ||
elf_ppnt->p_filesz < 2)
goto out_free_file;
retval = -ENOMEM;
elf_interpreter = kmalloc(elf_ppnt->p_filesz,
GFP_KERNEL);
if (!elf_interpreter)
goto out_free_file;
/* The PT_INTERP segment holds the path name of the dynamic linker.
 * The kernel handles it before any other segment.
 * Its content looks like:
 * /lib64/ld-linux-x86-64.so.2
 */
retval = kernel_read(bprm->file, elf_ppnt->p_offset,
elf_interpreter,
elf_ppnt->p_filesz);
if (retval != elf_ppnt->p_filesz) {
if (retval >= 0)
retval = -EIO;
goto out_free_interp;
}
/* make sure path is NULL terminated */
retval = -ENOEXEC;
if (elf_interpreter[elf_ppnt->p_filesz - 1] != '\0')
goto out_free_interp;
/*
* The early SET_PERSONALITY here is so that the lookup
* for the interpreter happens in the namespace of the
* to-be-execed image. SET_PERSONALITY can select an
* alternate root.
*
* However, SET_PERSONALITY is NOT allowed to switch
* this task into the new images's memory mapping
* policy - that is, TASK_SIZE must still evaluate to
* that which is appropriate to the execing application.
* This is because exit_mmap() needs to have TASK_SIZE
* evaluate to the size of the old image.
*
* So if (say) a 64-bit application is execing a 32-bit
* application it is the architecture's responsibility
* to defer changing the value of TASK_SIZE until the
* switch really is going to happen - do this in
* flush_thread(). - akpm
*/
SET_PERSONALITY(loc->elf_ex);
/* Open the dynamic linker file; returns a struct file pointer */
interpreter = open_exec(elf_interpreter);
retval = PTR_ERR(interpreter);
if (IS_ERR(interpreter))
goto out_free_interp;
/*
* If the binary is not readable then enforce
* mm->dumpable = 0 regardless of the interpreter's
* permissions.
*/
if (file_permission(interpreter, MAY_READ) < 0)
bprm->interp_flags |= BINPRM_FLAGS_ENFORCE_NONDUMP;
/* Read the start of the dynamic linker file, which contains its ELF header */
retval = kernel_read(interpreter, 0, bprm->buf,
BINPRM_BUF_SIZE);
if (retval != BINPRM_BUF_SIZE) {
if (retval >= 0)
retval = -EIO;
goto out_free_dentry;
}
/* Get the exec headers */
loc->interp_elf_ex = *((struct elfhdr *)bprm->buf);
break;
}
elf_ppnt++;
}
/*
 * Second pass: the PT_GNU_STACK segment
 */
elf_ppnt = elf_phdata;
for (i = 0; i < loc->elf_ex.e_phnum; i++, elf_ppnt++)
if (elf_ppnt->p_type == PT_GNU_STACK) {
/* As the code shows, this segment only supplies a flag;
 * it carries no actual segment data
 */
if (elf_ppnt->p_flags & PF_X)
executable_stack = EXSTACK_ENABLE_X;
else
executable_stack = EXSTACK_DISABLE_X;
break;
}
/* Check the dynamic linker's ELF magic and that its target architecture is valid */
/* Some simple consistency checks for the interpreter */
if (elf_interpreter) {
retval = -ELIBBAD;
/* Not an ELF interpreter */
if (memcmp(loc->interp_elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
goto out_free_dentry;
/* Verify the interpreter has a valid arch */
if (!elf_check_arch(&loc->interp_elf_ex))
goto out_free_dentry;
} else {
/* Executables without an interpreter also need a personality */
SET_PERSONALITY(loc->elf_ex);
}
/* Flush all traces of the currently running executable */
retval = flush_old_exec(bprm);
if (retval)
goto out_free_dentry;
/* OK, This is the point of no return */
current->flags &= ~PF_FORKNOEXEC;
current->mm->def_flags = def_flags;
/* Do this immediately, since STACK_TOP as used in setup_arg_pages
may depend on the personality. */
SET_PERSONALITY(loc->elf_ex);
if (elf_read_implies_exec(loc->elf_ex, executable_stack))
current->personality |= READ_IMPLIES_EXEC;
if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)
current->flags |= PF_RANDOMIZE;
arch_pick_mmap_layout(current->mm);
/* Do this so that we can load the interpreter, if need be. We will
change some of these later */
current->mm->free_area_cache = current->mm->mmap_base;
current->mm->cached_hole_size = 0;
retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
executable_stack);
if (retval < 0) {
send_sig(SIGKILL, current, 0);
goto out_free_dentry;
}
current->mm->start_stack = bprm->p;
/*
 * Third pass: the PT_LOAD segments
 */
/* Now we do a little grungy work by mmaping the ELF image into
the correct location in memory. */
for(i = 0, elf_ppnt = elf_phdata;
i < loc->elf_ex.e_phnum; i++, elf_ppnt++) {
int elf_prot = 0, elf_flags;
unsigned long k, vaddr;
if (elf_ppnt->p_type != PT_LOAD)
continue;
if (unlikely (elf_brk > elf_bss)) {
unsigned long nbyte;
/* There was a PT_LOAD segment with p_memsz > p_filesz
before this one. Map anonymous pages, if needed,
and clear the area. */
retval = set_brk (elf_bss + load_bias,
elf_brk + load_bias);
if (retval) {
send_sig(SIGKILL, current, 0);
goto out_free_dentry;
}
nbyte = ELF_PAGEOFFSET(elf_bss);
if (nbyte) {
nbyte = ELF_MIN_ALIGN - nbyte;
if (nbyte > elf_brk - elf_bss)
nbyte = elf_brk - elf_bss;
if (clear_user((void __user *)elf_bss +
load_bias, nbyte)) {
/*
* This bss-zeroing can fail if the ELF
* file specifies odd protections. So
* we don't check the return value
*/
}
}
}
if (elf_ppnt->p_flags & PF_R)
elf_prot |= PROT_READ;
if (elf_ppnt->p_flags & PF_W)
elf_prot |= PROT_WRITE;
if (elf_ppnt->p_flags & PF_X)
elf_prot |= PROT_EXEC;
elf_flags = MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE;
vaddr = elf_ppnt->p_vaddr;
if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
/* Not position-independent: the segment must be mapped at the
 * requested address, hence MAP_FIXED
 */
elf_flags |= MAP_FIXED;
} else if (loc->elf_ex.e_type == ET_DYN) {
/* Try and get dynamic programs out of the way of the
* default mmap base, as well as whatever program they
* might try to exec. This is because the brk will
* follow the loader, and is not movable. */
#ifdef CONFIG_X86
load_bias = 0;
#else
load_bias = ELF_PAGESTART(ELF_ET_DYN_BASE - vaddr);
#endif
}
/* Key code:
 * map the corresponding segment of the file at vaddr
 */
error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt,
elf_prot, elf_flags, 0);
if (BAD_ADDR(error)) {
send_sig(SIGKILL, current, 0);
retval = IS_ERR((void *)error) ?
PTR_ERR((void*)error) : -EINVAL;
goto out_free_dentry;
}
/* This block runs only for the first PT_LOAD segment */
if (!load_addr_set) {
load_addr_set = 1;
load_addr = (elf_ppnt->p_vaddr - elf_ppnt->p_offset);
if (loc->elf_ex.e_type == ET_DYN) {
load_bias += error -
ELF_PAGESTART(load_bias + vaddr);
load_addr += load_bias;
reloc_func_desc = load_bias;
}
}
k = elf_ppnt->p_vaddr;
if (k < start_code)
start_code = k;
if (start_data < k)
start_data = k;
/*
* Check to see if the section's size will overflow the
* allowed task size. Note that p_filesz must always be
* <= p_memsz so it is only necessary to check p_memsz.
*/
if (BAD_ADDR(k) || elf_ppnt->p_filesz > elf_ppnt->p_memsz ||
elf_ppnt->p_memsz > TASK_SIZE ||
TASK_SIZE - elf_ppnt->p_memsz < k) {
/* set_brk can never work. Avoid overflows. */
send_sig(SIGKILL, current, 0);
retval = -EINVAL;
goto out_free_dentry;
}
k = elf_ppnt->p_vaddr + elf_ppnt->p_filesz;
if (k > elf_bss)
elf_bss = k;
if ((elf_ppnt->p_flags & PF_X) && end_code < k)
end_code = k;
if (end_data < k)
end_data = k;
k = elf_ppnt->p_vaddr + elf_ppnt->p_memsz;
if (k > elf_brk)
elf_brk = k;
} /* end of for PT_LOAD */
/* All the PT_LOAD work above produces the values below, plus
 * the memory mappings already established
 */
loc->elf_ex.e_entry += load_bias;
elf_bss += load_bias;
elf_brk += load_bias;
start_code += load_bias;
end_code += load_bias;
start_data += load_bias;
end_data += load_bias;
/* Calling set_brk effectively mmaps the pages that we need
* for the bss and break sections. We must do this before
* mapping in the interpreter, to make sure it doesn't wind
* up getting placed where the bss needs to go.
*/
retval = set_brk(elf_bss, elf_brk);
if (retval) {
send_sig(SIGKILL, current, 0);
goto out_free_dentry;
}
if (likely(elf_bss != elf_brk) && unlikely(padzero(elf_bss))) {
send_sig(SIGSEGV, current, 0);
retval = -EFAULT; /* Nobody gets to see this, but.. */
goto out_free_dentry;
}
/* Load the dynamic linker into memory and record its entry address */
if (elf_interpreter) {
unsigned long uninitialized_var(interp_map_addr);
elf_entry = load_elf_interp(&loc->interp_elf_ex,
interpreter,
&interp_map_addr,
load_bias);
if (!IS_ERR((void *)elf_entry)) {
/*
* load_elf_interp() returns relocation
* adjustment
*/
interp_load_addr = elf_entry;
elf_entry += loc->interp_elf_ex.e_entry;
}
if (BAD_ADDR(elf_entry)) {
force_sig(SIGSEGV, current);
retval = IS_ERR((void *)elf_entry) ?
(int)elf_entry : -EINVAL;
goto out_free_dentry;
}
reloc_func_desc = interp_load_addr;
allow_write_access(interpreter);
fput(interpreter);
kfree(elf_interpreter);
} else {
elf_entry = loc->elf_ex.e_entry;
if (BAD_ADDR(elf_entry)) {
force_sig(SIGSEGV, current);
retval = -EINVAL;
goto out_free_dentry;
}
}
kfree(elf_phdata);
sys_close(elf_exec_fileno);
set_binfmt(&elf_format);
#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
retval = arch_setup_additional_pages(bprm, executable_stack);
if (retval < 0) {
send_sig(SIGKILL, current, 0);
goto out;
}
#endif /* ARCH_HAS_SETUP_ADDITIONAL_PAGES */
compute_creds(bprm);
current->flags &= ~PF_FORKNOEXEC;
/* This function does a lot and deserves careful analysis!
 * bprm->p is modified here.
 */
retval = create_elf_tables(bprm, &loc->elf_ex,
load_addr, interp_load_addr);
if (retval < 0) {
send_sig(SIGKILL, current, 0);
goto out;
}
/* N.B. passed_fileno might not be initialized? */
current->mm->end_code = end_code;
current->mm->start_code = start_code;
current->mm->start_data = start_data;
current->mm->end_data = end_data;
current->mm->start_stack = bprm->p;
#ifdef arch_randomize_brk
if ((current->flags & PF_RANDOMIZE) && (randomize_va_space > 1))
current->mm->brk = current->mm->start_brk =
arch_randomize_brk(current->mm);
#endif
if (current->personality & MMAP_PAGE_ZERO) {
/* Why this, you ask??? Well SVr4 maps page 0 as read-only,
and some applications "depend" upon this behavior.
Since we do not have the power to recompile these, we
emulate the SVr4 behavior. Sigh. */
down_write(&current->mm->mmap_sem);
error = do_mmap(NULL, 0, PAGE_SIZE, PROT_READ | PROT_EXEC,
MAP_FIXED | MAP_PRIVATE, 0);
up_write(&current->mm->mmap_sem);
}
#ifdef ELF_PLAT_INIT
/*
* The ABI may specify that certain registers be set up in special
* ways (on i386 %edx is the address of a DT_FINI function, for
* example. In addition, it may also specify (eg, PowerPC64 ELF)
* that the e_entry field is the address of the function descriptor
* for the startup routine, rather than the address of the startup
* routine itself. This macro performs whatever initialization to
* the regs structure is required as well as any relocations to the
* function descriptor entries when executing dynamically links apps.
*/
ELF_PLAT_INIT(regs, reloc_func_desc);
#endif
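/* start_thread is a misnomer; prepare_user_thread() would fit better.
 * It stores elf_entry and the user stack pointer into regs
 * in preparation for the start. The user-mode program actually starts
 * running when sys_execve() returns to user mode!
 * http://lkml.indiana.edu/hypermail/linux/kernel/0105.2/0910.html
 */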
start_thread(regs, elf_entry, bprm->p);
retval = 0;
out:
kfree(loc);
out_ret:
return retval;
/* error cleanup */
out_free_dentry:
allow_write_access(interpreter);
if (interpreter)
fput(interpreter);
out_free_interp:
kfree(elf_interpreter);
out_free_file:
sys_close(elf_exec_fileno);
out_free_ph:
kfree(elf_phdata);
goto out;
}
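/*
How exactly do ip and sp get switched? There is a trick here!
sys_execve->do_execve->search_binary_handler->load_binary->load_elf_binary->(code above)
First, get the following questions straight:
1. In a system call, how are the user arguments and the user stack managed? Where are they saved?
First, here is what the stack looks like on entry to the kernel: ( in the famous 8K space )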
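struct pt_regs {
  unsigned long bx; /* pushed by SAVE_ALL after kernel entry */ low address
  unsigned long cx; /* pushed by SAVE_ALL after kernel entry */
  unsigned long dx; /* pushed by SAVE_ALL after kernel entry */
  unsigned long si; /* pushed by SAVE_ALL after kernel entry */
  unsigned long di; /* pushed by SAVE_ALL after kernel entry */ ^
  unsigned long bp; /* pushed by SAVE_ALL after kernel entry */ ^
  unsigned long ax; /* pushed by SAVE_ALL after kernel entry */ ^
  unsigned long ds; /* pushed by SAVE_ALL after kernel entry */ ^
  unsigned long es; /* pushed by SAVE_ALL after kernel entry */ ^
  unsigned long fs; /* pushed by SAVE_ALL after kernel entry */ ^
  /* int gs; */
  unsigned long orig_ax;/* pushed by "pushl %eax" after kernel entry */
  unsigned long ip; /* pushed automatically by the CPU on the trap */
  unsigned long cs; /* pushed automatically by the CPU on the trap */
  unsigned long flags; /* pushed automatically by the CPU on the trap */
  unsigned long sp; /* pushed automatically by the CPU on the trap */
  unsigned long ss; /* pushed automatically by the CPU on the trap */ high address
};
NOTE: the lower a field appears in this listing, the earlier it was pushed onto the stack.
Below is the kernel-entry code from a 2.6 kernel.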
# system call handler stub
ENTRY(system_call)
RING0_INT_FRAME # can't unwind into user space anyway
pushl %eax # save orig_eax
CFI_ADJUST_CFA_OFFSET 4 # cld instruction
SAVE_ALL
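GET_THREAD_INFO(%ebp) # esp now points at the start of pt_regs (its lowest address, i.e. the top of the stack)
syscall_call:
call *sys_call_table(,%eax,4) # calls the handler whose address is stored at sys_call_table + eax*4; eax holds the syscall number.
# The call pushes the return address, so esp now points at it,
# and the handler finds pt_regs just above it on the stack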
movl %eax,PT_EAX(%esp) # store the return value
syscall_exit:
LOCKDEP_SYS_EXIT
DISABLE_INTERRUPTS(CLBR_ANY) # make sure we don't miss an interrupt
# setting need_resched or sigpending
# between sampling and the iret
......
On a 64-bit machine, sys_execve disassembles as follows:
(objdump -d /lib/modules/2.6.18.8/source/arch/x86_64/kernel/process.o)
0000000000000000 <sys_execve>:
0: 48 83 ec 28 sub $0x28,%rsp # make room for local variables on the stack
4: 48 89 5c 24 08 mov %rbx,0x8(%rsp) # save some registers in that scratch area
9: 48 89 6c 24 10 mov %rbp,0x10(%rsp)
e: 48 89 d5 mov %rdx,%rbp # Why rdx? no idea.
11: 4c 89 64 24 18 mov %r12,0x18(%rsp)
16: 4c 89 6c 24 20 mov %r13,0x20(%rsp)
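Enough of this... it gets messy! The one thing to remember is that pt_regs holds what you need.
As for how a Linux user process passes arguments to the system call interrupt handler,
Linux passes them in general-purpose registers such as ebx, ecx, and edx.
An obvious advantage of passing arguments in registers is that
when the handler saves the register values on entry,
the argument registers land on the kernel stack automatically,
so no special handling of them is needed.
2. How does this work with execve?
With pt_regs at hand, ip and esp can be set. A system call such as execve replaces the saved ip and esp,
so that the return to user mode lands in a different program.
3. How does user mode pass arguments to the kernel stack?
For example, when write is called in user mode:
write:
pushl %ebx
movl 8(%esp), %ebx ; this is how Linux's _syscall3 expands here
movl 12(%esp), %ecx ; which is what makes register argument passing work
movl 16(%esp), %edx ; clearly, this does not depend on the compiler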
movl $4, %eax
int $0x80
....
Read more from this perfect online book-store:
http://my.safaribooksonline.com/0-596-00002-2/ch08-10-fm2xml
Chapter 8. System Calls > Anticipating Linux 2.4 - Pg. 241
*/
/*
* sys_execve() executes a new program.
*/
asmlinkage int sys_execve(struct pt_regs regs)
{
int error;
char * filename;
filename = getname((char __user *) regs.bx);
error = PTR_ERR(filename);
if (IS_ERR(filename))
goto out;
error = do_execve(filename,
(char __user * __user *) regs.cx,
(char __user * __user *) regs.dx,
&regs);
if (error == 0) {
/* Make sure we don't return using sysenter.. */
set_thread_flag(TIF_IRET);
}
putname(filename);
out:
return error;
}
From: https://blog.51cto.com/maray/6510890