I. Introduction
1. Android has fully supported eBPF since version 9.0, where it is mainly used for network traffic accounting. Beyond that, eBPF can be combined with kernel facilities such as kprobes, tracepoints, and socket filters to hook kernel events and monitor the corresponding system state.
II. bpf service startup and program loading
1. Android provides a number of wrapper libraries for eBPF, plus a dedicated loader, bpfloader. The main modules are:
(1) bpfloader: [/system/bin/bpfloader] loads the eBPF object files under /system/etc/bpf at system startup.
(2) libbpf_android: [system/bpf/libbpf_android] builds libbpf_android.so, which provides interfaces for creating bpf map containers and loading bpf object files.
(3) libbpf: [external/libbpf] builds libbpf_minimal.so, which wraps the bpf system calls and provides bpf operation APIs such as attach/detach.
2. The overall flow: after init finishes filesystem initialization at boot, it starts the bpfloader service, which scans and parses the ELF-format bpf object files under /system/etc/bpf/*.o. The loader reads and validates the critical, license, and bpfloader* sections, then reads all bpf prog/map data from the progs and maps sections and pins each one to a file node under /sys/fs/bpf/ named after the file and symbol, completing the bpf load.
# system/core/rootdir/init.rc
on late-init
    trigger load_bpf_programs

# system/bpf/bpfloader/bpfloader.rc
on load_bpf_programs
    write /proc/sys/kernel/unprivileged_bpf_disabled 0
    write /proc/sys/net/core/bpf_jit_enable 1       # enable JIT compilation
    write /proc/sys/net/core/bpf_jit_kallsyms 1
    exec_start bpfloader                            # start bpfloader

service bpfloader /system/bin/bpfloader
    capabilities CHOWN SYS_ADMIN NET_ADMIN
    rlimit memlock 1073741824 1073741824
    oneshot
    reboot_on_failure reboot,bpfloader-failed
    updatable
bpfloader is implemented in system/bpf/bpfloader/BpfLoader.cpp. Its main logic:
const Location locations[] = {
    ...
    {
        // Load bpf .o ELF files from this directory
        .dir = "/apex/com.android.tethering/etc/bpf/net_private/",
        // The resulting prog/map file handles are pinned under /sys/fs/bpf/net_private
        .prefix = "net_private/",
        .allowedDomainBitmask = kTetheringApexDomainBitmask,
    },
    // Core operating system
    {
        .dir = "/system/etc/bpf/",
        .prefix = "",
        .allowedDomainBitmask = domainToBitmask(domain::platform),
    },
    ...
};

int main(int argc, char** argv) {
    /* Create all the required subdirectories under /sys/fs/bpf */
    for (const auto& location : locations) {
        createSysFsBpfSubDir(location.prefix);
    }

    // Load all ELF objects, create programs and maps, and pin them
    /*
     * Load every .o bpf ELF file: read the "critical" and "license" sections,
     * check the bpfloader version numbers, and verify that the map and prog
     * definition sizes match. Then read the bpf code from the "progs" section
     * and load it into the kernel via the bpf() syscall. Each program ends up
     * pinned to a /sys/fs/bpf/<prefix/>prog_<filename>_<progname> handle.
     */
    for (const auto& location : locations) {
        loadAllElfObjects(location);  // error handling elided
    }

    /*
     * Map structures defined in the code are also loaded into the kernel via
     * the bpf() syscall, and the kernel pins each map to a
     * /sys/fs/bpf/<prefix/>map_<filename>_<mapname> handle.
     */
    int key = 1, value = 123;
    android::base::unique_fd map(
            android::bpf::createMap(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value), 2, 0));
    android::bpf::writeToMapEntry(map, &key, &value, BPF_ANY);

    android::base::SetProperty("bpf.progs_loaded", "1");
    return 0;
}
III. bpf programs
1. As an example, consider the source of the gpu_mem.o ELF file,
frameworks/native/services/gpuservice/bpfprogs/gpu_mem.c:
#include <bpf_helpers.h>

#define GPU_MEM_TOTAL_MAP_SIZE 1024

/*
 * Defines a hash map named gpu_mem_total_map that can hold up to
 * GPU_MEM_TOTAL_MAP_SIZE entries, with the given key and value types. The
 * group is set to GRAPHICS, so only graphics-related services can access it.
 * Userspace then reads or modifies the kernel-side data through the pinned
 * file /sys/fs/bpf/map_gpu_mem_gpu_mem_total_map bound to this map.
 */
DEFINE_BPF_MAP_GRO(gpu_mem_total_map, HASH, uint64_t, uint64_t,
                   GPU_MEM_TOTAL_MAP_SIZE, AID_GRAPHICS);

/* This struct must match the gpu_mem_total tracepoint format exactly */
struct gpu_mem_total_args {
    /* The common tracepoint fields, not user-defined */
    uint64_t ignore;
    /* The user-defined fields, starting at offset 8 */
    uint32_t gpu_id;
    uint32_t pid;
    uint64_t size;
};

/*
 * This macro marks the prog as a tracepoint type and names the tracepoint it
 * attaches to; that string is also the ELF section the code is placed in.
 * The remaining arguments are the owner and group. The macro generates a
 * function, placed in the section named after the tracepoint, that writes
 * into the map whenever the tracepoint fires. The kernel pins this code to
 * the /sys/fs/bpf/prog_gpu_mem_tracepoint_gpu_mem_gpu_mem_total handle,
 * through which the in-kernel program can later be found.
 */
DEFINE_BPF_PROG("tracepoint/gpu_mem/gpu_mem_total", AID_ROOT, AID_GRAPHICS, tp_gpu_mem_total)
(struct gpu_mem_total_args* args) {
    uint64_t key = 0;
    uint64_t cur_val = 0;
    uint64_t* prev_val = NULL;

    /* args carries the gpu_mem_total_args payload passed in by the tracepoint */
    /* The upper 32 bits are for gpu_id while the lower is the pid */
    key = ((uint64_t)args->gpu_id << 32) | args->pid;
    /* The size reported when the tracepoint fired: how much of this GPU's
     * memory this pid is using */
    cur_val = args->size;

    if (!cur_val) {
        bpf_gpu_mem_total_map_delete_elem(&key); // generated by DEFINE_BPF_MAP_GRO
        return 0;
    }

    /* Look up the key first: update the entry if it exists, otherwise create one */
    prev_val = bpf_gpu_mem_total_map_lookup_elem(&key); // generated by DEFINE_BPF_MAP_GRO
    if (prev_val) {
        *prev_val = cur_val;
    } else {
        bpf_gpu_mem_total_map_update_elem(&key, &cur_val, BPF_NOEXIST); // generated by DEFINE_BPF_MAP_GRO
    }
    return 0;
}

LICENSE("Apache 2.0");
The build configuration, frameworks/native/services/gpuservice/bpfprogs/Android.bp:
package {
    default_applicable_licenses: ["frameworks_native_license"],
}

bpf {
    name: "gpu_mem.o",   // the generated object file
    srcs: ["gpu_mem.c"], // the source file
    cflags: [
        "-Wall",
        "-Werror",
    ],
}
The code logic, roughly:
(1) DEFINE_BPF_MAP, an Android wrapper macro, defines the type of the BPF data container and its access interfaces.
(2) DEFINE_BPF_PROG declares and defines the hook function.
(3) LICENSE specifies the program's license.
2. Using the bpf program
(1) Activating the code behind the prog handle
For this example, the bpf program is activated in native/services/gpuservice/gpumem/GpuMem.cpp:
static constexpr char kGpuMemTotalProgPath[] =
        "/sys/fs/bpf/prog_gpu_mem_tracepoint_gpu_mem_gpu_mem_total";
static constexpr char kGpuMemTotalMapPath[] = "/sys/fs/bpf/map_gpu_mem_gpu_mem_total_map";

void GpuMem::initialize() {
    /* Make sure bpf programs are loaded */
    bpf::waitForProgsLoaded();

    int fd = bpf::retrieveProgram(kGpuMemTotalProgPath);
    int count = 0;
    /* Attach the program to the tracepoint; this also enables the tracepoint */
    while (bpf_attach_tracepoint(fd, "gpu_mem", "gpu_mem_total") < 0) {
        if (++count > kGpuWaitTimeout) {
            return;
        }
        /* Retry until GPU driver loaded or timeout */
        sleep(1);
    }

    /* Only a read-only mapping is created here */
    auto map = bpf::BpfMapRO<uint64_t, uint64_t>(kGpuMemTotalMapPath);
    setGpuMemTotalMap(map);
}
Once attached, data appears behind the map handle whenever the gpu_mem_total tracepoint fires.
(2) Consuming data through the map handle's file
Using this bpf program simply means reading the /sys/fs/bpf/map_gpu_mem_gpu_mem_total_map file, for example:
# cat /sys/fs/bpf/map_gpu_mem_gpu_mem_total_map
4205: 14106624
0: 425660416
10341: 16977920
...
The map can also be inspected with bpftool:
root@localhost:# bpftool map list | grep gpu_mem   // list all maps to find each map's id
17: hash  name gpu_mem_total_m  flags 0x0
root@localhost:# bpftool map dump id 17            // dump the map's contents, which match the file above
[{
        "key": 4205,
        "value": 14778368
    },{
        "key": 10341,
        "value": 16977920
    },{
        "key": 2992,
        "value": 2686976
    },
...
]
The GPU memory statistics reported by Android's gpu service come from this bpf program:
# dumpsys gpu --gpumem
Memory snapshot for GPU 0:
Global total: 358850560
Proc 1655 total: 184938496
Proc 2174 total: 2658304
Proc 2992 total: 2686976
Proc 3956 total: 10371072
Proc 4205 total: 14778368
Proc 5729 total: 26066944
Proc 6110 total: 2654208
Proc 8168 total: 112107520
Proc 10341 total: 16977920
In code, for example, system/memory/libmeminfo/sysmeminfo.cpp reads the map file like this:
bool ReadPerProcessGpuMem([[maybe_unused]] std::unordered_map<uint32_t, uint64_t>* out) {
    static constexpr const char kBpfGpuMemTotalMap[] =
            "/sys/fs/bpf/map_gpu_mem_gpu_mem_total_map";
    /* Use the read-only wrapper BpfMapRO to properly retrieve the read-only map. */
    auto map = bpf::BpfMapRO<uint64_t, uint64_t>(kBpfGpuMemTotalMap);
    out->clear();
    auto map_key = map.getFirstKey();
    do {
        uint64_t key = map_key.value();
        uint32_t pid = key; // BPF Key [32-bits GPU ID | 32-bits PID]
        auto gpu_mem = map.readValue(key);
        ...
        map_key = map.getNextKey(key);
    } while (map_key.ok());
    return true;
}
IV. Parsing the bpf ELF file format
1. objdump can be used to inspect the bytecode in a bpf ELF file.
Compiling a bpf program produces several sections; all defined map structures are stored in the maps section.
root@localhost:/# llvm-objdump-11 -h -d /system/etc/bpf/gpu_mem.o

/system/etc/bpf/gpu_mem.o:      file format elf64-bpf

Sections:
Idx Name                                  Size     VMA              Type
  0                                       00000000 0000000000000000
  1 .strtab                               00000110 0000000000000000
  2 .text                                 00000000 0000000000000000 TEXT
  3 tracepoint/gpu_mem/gpu_mem_total      00000100 0000000000000000 TEXT  // the prog's section in the ELF file
  4 .reltracepoint/gpu_mem/gpu_mem_total  00000030 0000000000000000
  5 maps                                  00000074 0000000000000000 DATA
  6 .maps.gpu_mem_total_map               00000010 0000000000000000 DATA  // the map's section in the ELF file
  7 progs                                 0000005c 0000000000000000 DATA
  8 bpfloader_min_ver                     00000004 0000000000000000 DATA
  9 bpfloader_max_ver                     00000004 0000000000000000 DATA
 10 size_of_bpf_map_def                   00000008 0000000000000000 DATA
 11 size_of_bpf_prog_def                  00000008 0000000000000000 DATA
 12 license                               0000000b 0000000000000000 DATA
 13 .BTF                                  00000c1b 0000000000000000
 14 .llvm_addrsig                         00000009 0000000000000000
 15 .symtab                               00000108 0000000000000000

Disassembly of section tracepoint/gpu_mem/gpu_mem_total:

0000000000000000 <tp_gpu_mem_total>:
       0: 61 12 08 00 00 00 00 00 r2 = *(u32 *)(r1 + 8)   // r1 points to gpu_mem_total_args* args; skip the common fields and load gpu_id
       1: 67 02 00 00 20 00 00 00 r2 <<= 32
       2: 61 13 0c 00 00 00 00 00 r3 = *(u32 *)(r1 + 12)  // load pid
       3: 4f 32 00 00 00 00 00 00 r2 |= r3                // combine gpu_id|pid into the hash key
       4: 7b 2a f8 ff 00 00 00 00 *(u64 *)(r10 - 8) = r2
       5: 79 16 10 00 00 00 00 00 r6 = *(u64 *)(r1 + 16)  // load size
       6: 7b 6a f0 ff 00 00 00 00 *(u64 *)(r10 - 16) = r6
       7: 55 06 06 00 00 00 00 00 if r6 != 0 goto +6 <tp_gpu_mem_total+0x70>
       8: bf a2 00 00 00 00 00 00 r2 = r10
       9: 07 02 00 00 f8 ff ff ff r2 += -8
      10: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
      12: 85 00 00 00 03 00 00 00 call 3
      13: 05 00 10 00 00 00 00 00 goto +16 <tp_gpu_mem_total+0xf0>
      14: bf a2 00 00 00 00 00 00 r2 = r10
      15: 07 02 00 00 f8 ff ff ff r2 += -8
      16: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
      18: 85 00 00 00 01 00 00 00 call 1
      19: 15 00 02 00 00 00 00 00 if r0 == 0 goto +2 <tp_gpu_mem_total+0xb0>
      20: 7b 60 00 00 00 00 00 00 *(u64 *)(r0 + 0) = r6
      21: 05 00 08 00 00 00 00 00 goto +8 <tp_gpu_mem_total+0xf0>
      22: bf a2 00 00 00 00 00 00 r2 = r10
      23: 07 02 00 00 f8 ff ff ff r2 += -8
      24: bf a3 00 00 00 00 00 00 r3 = r10
      25: 07 03 00 00 f0 ff ff ff r3 += -16
      26: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
      28: b7 04 00 00 01 00 00 00 r4 = 1
      29: 85 00 00 00 02 00 00 00 call 2
      30: b7 00 00 00 00 00 00 00 r0 = 0
      31: 95 00 00 00 00 00 00 00 exit
bpftool can also dump the loaded program:
root@localhost:/# bpftool prog | grep gpu
15: tracepoint  name tracepoint_gpu_  tag 37955a3ec8581e93
root@localhost:/#
root@localhost:/# bpftool prog dump xlated id 15
   0: (61) r2 = *(u32 *)(r1 +8)
   1: (67) r2 <<= 32
   2: (61) r3 = *(u32 *)(r1 +12)
   3: (4f) r2 |= r3
   4: (7b) *(u64 *)(r10 -8) = r2
   5: (79) r6 = *(u64 *)(r1 +16)
   6: (7b) *(u64 *)(r10 -16) = r6
   7: (55) if r6 != 0x0 goto pc+6
   8: (bf) r2 = r10
   9: (07) r2 += -8
  10: (18) r1 = map[id:17]
  12: (85) call 0xffffffe0a837f6c8#89744
  13: (05) goto pc+18
  14: (bf) r2 = r10
  15: (07) r2 += -8
  16: (18) r1 = map[id:17]
  18: (85) call 0xffffffe0a837f590#89432
  19: (15) if r0 == 0x0 goto pc+1
  20: (07) r0 += 56
  21: (15) if r0 == 0x0 goto pc+2
  22: (7b) *(u64 *)(r0 +0) = r6
  23: (05) goto pc+8
  24: (bf) r2 = r10
  25: (07) r2 += -8
  26: (bf) r3 = r10
  27: (07) r3 += -16
  28: (18) r1 = map[id:17]
  30: (b7) r4 = 1
  31: (85) call 0xffffffe0a837f648#89616
  32: (b7) r0 = 0
  33: (95) exit
As shown, tp_gpu_mem_total parses its arguments according to the tracepoint's format file:
# cat /sys/kernel/tracing/events/gpu_mem/gpu_mem_total/format
name: gpu_mem_total
ID: 671
format:
    field:unsigned short common_type;          offset:0;  size:2; signed:0;  // the first 8 bytes are common to all tracepoints
    field:unsigned char common_flags;          offset:2;  size:1; signed:0;
    field:unsigned char common_preempt_count;  offset:3;  size:1; signed:0;
    field:int common_pid;                      offset:4;  size:4; signed:1;

    field:uint32_t gpu_id;                     offset:8;  size:4; signed:0;  // the user-defined fields start here
    field:uint32_t pid;                        offset:12; size:4; signed:0;
    field:uint64_t size;                       offset:16; size:8; signed:0;

print fmt: "gpu_id=%u pid=%u size=%llu", REC->gpu_id, REC->pid, REC->size
readelf can also show each section's information, including the file offsets:
root@localhost:/system/etc/bpf# llvm-readelf-11 -s -S gpu_mem.o
There are 16 section headers, starting at offset 0x10c0:

Section Headers:
  [Nr] Name                                 Type          Address          Off    Size   ES Flg Lk Inf Al
  [ 0]                                      NULL          0000000000000000 000000 000000 00      0   0  0
  [ 1] .strtab                              STRTAB        0000000000000000 000fa9 000110 00      0   0  1
  [ 2] .text                                PROGBITS      0000000000000000 000040 000000 00  AX  0   0  4
  [ 3] tracepoint/gpu_mem/gpu_mem_total     PROGBITS      0000000000000000 000040 000100 00  AX  0   0  8
  [ 4] .reltracepoint/gpu_mem/gpu_mem_total REL           0000000000000000 000f70 000030 10   I 15   3  8
  [ 5] maps                                 PROGBITS      0000000000000000 000140 000074 00   A  0   0  4
  [ 6] .maps.gpu_mem_total_map              PROGBITS      0000000000000000 0001b8 000010 00  WA  0   0  8
  [ 7] progs                                PROGBITS      0000000000000000 0001c8 00005c 00   A  0   0  4
  [ 8] bpfloader_min_ver                    PROGBITS      0000000000000000 000224 000004 00  WA  0   0  4
  [ 9] bpfloader_max_ver                    PROGBITS      0000000000000000 000228 000004 00  WA  0   0  4
  [10] size_of_bpf_map_def                  PROGBITS      0000000000000000 000230 000008 00  WA  0   0  8
  [11] size_of_bpf_prog_def                 PROGBITS      0000000000000000 000238 000008 00  WA  0   0  8
  [12] license                              PROGBITS      0000000000000000 000240 00000b 00  WA  0   0  1
  [13] .BTF                                 PROGBITS      0000000000000000 00024c 000c1b 00      0   0  4
  [14] .llvm_addrsig                        LLVM_ADDRSIG  0000000000000000 000fa0 000009 00   E  0   0  1
  [15] .symtab                              SYMTAB        0000000000000000 000e68 000108 18      1   2  8
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  p (processor specific)

Symbol table '.symtab' contains 11 entries:
   Num:    Value          Size Type    Bind   Vis     Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT UND
     1: 0000000000000000     0 SECTION LOCAL  DEFAULT   3 tracepoint/gpu_mem/gpu_mem_total
     2: 0000000000000000   256 FUNC    GLOBAL DEFAULT   3 tp_gpu_mem_total
     3: 0000000000000000   116 OBJECT  GLOBAL DEFAULT   5 gpu_mem_total_map
     4: 0000000000000000    16 OBJECT  GLOBAL DEFAULT   6 ____btf_map_gpu_mem_total_map
     5: 0000000000000000    92 OBJECT  GLOBAL DEFAULT   7 tp_gpu_mem_total_def
     6: 0000000000000000     4 OBJECT  GLOBAL DEFAULT   8 _bpfloader_min_ver
     7: 0000000000000000     4 OBJECT  GLOBAL DEFAULT   9 _bpfloader_max_ver
     8: 0000000000000000     8 OBJECT  GLOBAL DEFAULT  10 _size_of_bpf_map_def
     9: 0000000000000000     8 OBJECT  GLOBAL DEFAULT  11 _size_of_bpf_prog_def
    10: 0000000000000000    11 OBJECT  GLOBAL DEFAULT  12 _license
V. Summary
1. When the bpfloader service starts, it loads every .o bpf ELF file from the paths listed in BpfLoader.cpp and exports prog and map handles under /sys/fs/bpf: a prog handle corresponds to the ELF program's code, while a map handle is the data read/write interface. Some program must then attach the prog to its hook point; only then is data collected whenever the hook fires, and the collected data is exported through the map handle's file for other processes to read and write.
VI. Related material
1. eBPF architecture diagrams:
https://www.cnblogs.com/janeysj/p/16185097.html
https://cloudnative.to/blog/bpf-intro/
From: https://www.cnblogs.com/hellokitty2/p/16915814.html