Linux file system management seen through the implementation of unlink/rm

Tags: file, struct, dentry, Linux, rm, unlink, inode

Contents

1. Introduction
2. File system structure
3. The unlink implementation

The kernel source code referenced in this article is version 3.10.1.

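1. Introduction

At work I heard a colleague describe the unlink system call this way: unlink does not really erase a file's data from disk; it drops the references held on the dentry and inode of that file or directory. I went looking at what the kernel actually does to make a file invisible to users, and this article summarizes what I found.

When we run rm or call unlink (rm itself ends up in the unlink or unlinkat system call), the file appears to have been "cleaned up". More precisely, we can no longer reach it through the operating system.

unlink is used like this: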

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Report the result of an operation and exit on failure. */
void check_op(long ret, const char *op) {
  if (ret <= -1) {
    printf("%s file failed %ld\n", op, ret);
    exit(1);
  } else {
    printf("%s file success !!!\n", op);
  }
}

int main(int argc, char *argv[])
{
  struct stat sb;
  if (argc <= 1) {
    printf("not enough arguments!\n");
    exit(1);
  }

  char *file_name = argv[1];
  printf("file name is %s\n", file_name);

  /* O_CREAT requires the third (mode) argument. */
  int fd = open(file_name, O_CREAT | O_RDWR, 0644);
  if (fd == -1) {
    printf("open file %s failed\n", file_name);
    exit(1);
  }

  /* A 20-byte buffer, so writing 20 bytes never reads past the string. */
  char buf[20] = "i am a coding boy";
  check_op(write(fd, buf, sizeof(buf)), "write");

  check_op(lstat(file_name, &sb), "lstat");
  printf("mode : %u\n"
         "size : %lld\n"
         "user id : %u\n"
         "blocks : %lld\n"
         "block size: %ld\n",
         (unsigned)sb.st_mode, (long long)sb.st_size, (unsigned)sb.st_uid,
         (long long)sb.st_blocks, (long)sb.st_blksize);

  check_op(unlink(file_name), "unlink");

  /* The name is gone, so this lstat fails and check_op exits. */
  check_op(lstat(file_name, &sb), "lstat");

  close(fd);
  return 0;
}

Sample output (the mode, user id and block values depend on the system and the umask):

file name is 11111
write file success !!!
lstat file success !!!
mode : 32768
size : 20
user id : 501
blocks : 8
block size: 4096
unlink file success !!!
lstat file failed -1

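After unlink we can no longer get any information about the file, and it is tempting to conclude that its data was erased from disk right away. Is it really erased immediately?

To understand the deletion process fully, we first need a general picture of how the file system works.

2. File system structure

Let us start with Linux disk file systems, taking ext4 as the example (other file systems are broadly similar).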

[Figure: on-disk layout of an ext4 file system. Image from Geek Time (极客时间), "趣谈Linux操作系统"; will be removed on request.]

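A disk file system is laid out on top of a disk.

The figure above shows how a disk file system is stored on disk: the global super block, the block group descriptors, and the inode bitmap and block bitmap that belong to each block group.

A file stored in ext4 has its contents in one or more of the data blocks in the figure above. The inode is the metadata structure that points to those data blocks, and every file has its own inode. Creating a file means requesting space from the file system: first a free inode is found through the inode bitmap; then, when data is written to the file, free blocks are allocated through the block bitmap and attached to the inode, which tracks them from then on.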

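PS: inodes and data blocks are organized as a tree. A directory also has an inode; a directory may contain several directories, each of which may contain several files, so the structure mirrors the directory tree itself.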

That is roughly what a file system looks like.

The basic architecture of file storage on Linux is as follows:

[Figure: basic architecture of Linux file storage. Image from Geek Time (极客时间), "趣谈Linux操作系统"; will be removed on request.]

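As the figure shows, user processes always reach files through the VFS. Given a directory path and a file name, the kernel finds the directory entry (dentry) and the file's metadata (inode), and then carries out the operations on that dentry and inode through the corresponding disk file system.
Take /home/zhg/hello_world.txt as an example. The kernel resolves the path one component at a time: /home/zhg is the dentry of the parent directory, and looking up the name hello_world.txt under it yields the file's own dentry, which points to the file's inode.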

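A file has to be opened before its contents can be used. To make directory-entry lookups fast (reading from disk every time a user touches a set of files would be far too expensive), the kernel keeps the dcache and the icache, short for dentry cache and inode cache. Taking the dcache as the example, the kernel maintains two data structures in it to speed up dentry lookup: lru-list and hash-list.

hash-list is the lookup table for cached dentries: each dentry is linked into the dentry hash table through its d_hash field, so a name can be found without going to disk.
lru-list holds the dentries in the dcache that are no longer in use; the memory they occupy is what gets reclaimed.

Inodes have the corresponding i_hash and i_lru lists in their in-memory cache.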

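When a file is accessed and its dentry is not in the dcache, the dentry is loaded from disk into memory and added to the hash-list as an active dentry; the memory for it is allocated by the slub allocator beforehand. That memory goes back to the slub allocator in one of two ways:

Reclaim from the tail of the lru-list (the least recently used dentry).
A user deletes the file, and the dentry is taken out of the hash-list directly. (Important: this is the path unlink mainly takes.)

In both cases the memory is released only once d_count and i_count have dropped to 0. These two fields are the reference counts of the dentry and the inode: while any process still has the file open, the count stays above 0 and the objects are kept. Hard links are tracked separately, in the inode's link count (i_nlink). Unlinking one name of a file that has other hard links succeeds and only removes that name; symbolic links do not affect either count.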

As shown in the figure below:

[Figure: the dentry hash-list and lru-list in the dcache]

The VFS dentry and inode structures look like this:

struct dentry {
  /* RCU lookup touched fields */
  unsigned int d_flags;   /* protected by d_lock */
  seqcount_t d_seq;   /* per dentry seqlock */
  struct hlist_bl_node d_hash;  /* lookup hash list */
  struct dentry *d_parent;  /* parent directory */
  struct qstr d_name;
  struct inode *d_inode;    /* Where the name belongs to - NULL is
           * negative */ 
  unsigned char d_iname[DNAME_INLINE_LEN];  /* small names */

  /* Ref lookup also touches following */
  unsigned int d_count;   /* protected by d_lock */ // reference count
  spinlock_t d_lock;    /* per dentry lock */
  const struct dentry_operations *d_op;
  struct super_block *d_sb; /* The root of the dentry tree */
  unsigned long d_time;   /* used by d_revalidate */
  void *d_fsdata;     /* fs-specific data */

  struct list_head d_lru;   /* LRU list */
  ......
};

struct inode {
  umode_t     i_mode; /* File mode */
  unsigned short    i_opflags;
  kuid_t      i_uid; /* Owner uid */
  kgid_t      i_gid;
  ......
  unsigned long   i_ino; /* inode number */
  ......
  struct timespec   i_atime; /* Access time */
  struct timespec   i_mtime; /* Modification time */
  struct timespec   i_ctime; /* Inode change time */
  spinlock_t    i_lock; /* i_blocks, i_bytes, maybe i_size */
  ......

  /* Misc */
  unsigned long   i_state;
  struct mutex    i_mutex;

  unsigned long   dirtied_when; /* jiffies of first dirtying */

  struct hlist_node i_hash;
  struct list_head  i_wb_list;  /* backing dev IO list */
  struct list_head  i_lru;    /* inode LRU list */
  struct list_head  i_sb_list;
  ......
};

PS: these are the in-memory structures of the virtual file system, not the on-disk structures of the underlying disk file system.

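3. The unlink implementation

With the rough description above we have a basic idea of the file system structure, the roles dentry and inode play when a file is opened, and how the dcache and icache affect file operations.

Before going through the unlink code, we can use what we know so far to guess at the basic logic of deleting a file.

Making a file disappear from the operating system comes down to tearing down its metadata and clearing every cached copy of that metadata along the file system path.

This is done mainly by changing the reference counts and the link count held in the file's dentry and inode.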

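In other words, a file is really deleted only when its link count is 0 and the number of processes referencing it is also 0. The program below shows this by writing to the file again after unlinking it, while the descriptor is still open.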

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Report the result of an operation and exit on failure. */
void check_op(long ret, const char *op) {
  if (ret <= -1) {
    printf("%s file failed\n", op);
    exit(1);
  } else {
    printf("%s file success !!!\n", op);
  }
}

int main(int argc, char *argv[])
{
  struct stat sb;
  if (argc <= 1) {
    printf("not enough arguments!\n");
    exit(1);
  }

  char *file_name = argv[1];
  printf("file name is %s\n", file_name);

  /* O_CREAT requires the third (mode) argument. */
  int fd = open(file_name, O_CREAT | O_RDWR, 0644);
  if (fd == -1) {
    printf("open file %s failed\n", file_name);
    exit(1);
  }

  const char *buf = "i am a coding boy";
  check_op(write(fd, buf, strlen(buf)), "write");

  check_op(lstat(file_name, &sb), "lstat");
  printf("mode : %u\n"
         "size : %lld\n"
         "user id : %u\n"
         "blocks : %lld\n"
         "block size: %ld\n",
         (unsigned)sb.st_mode, (long long)sb.st_size, (unsigned)sb.st_uid,
         (long long)sb.st_blocks, (long)sb.st_blksize);

  check_op(unlink(file_name), "unlink");

  /* The descriptor is still open, so this write succeeds. */
  check_op(write(fd, buf, strlen(buf)), "write");
  /* The name is gone, so this lstat fails and check_op exits. */
  check_op(lstat(file_name, &sb), "lstat");

  close(fd);
  return 0;
}

Sample output (the mode, user id and block values depend on the system and the umask):

file name is 11111
write file success !!!
lstat file success !!!
mode : 32768
size : 17
user id : 501
blocks : 8
block size: 4096
unlink file success !!!
write file success !!!
lstat file failed

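After unlink, writing to the file through the open descriptor still succeeds, and so does reading (although other processes can no longer find the file through the VFS). Only when the fd is closed, or the process using the file exits, does the operating system really delete the file.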

Next, let us follow unlink through the source code.
The unlink system call entry point:

SYSCALL_DEFINE1(unlink, const char __user *, pathname)
{
  return do_unlinkat(AT_FDCWD, pathname);
}

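This leads into do_unlinkat, the main body of the unlink operation. It does the following:

Get the file name.
Look up the dentry for the name in the dcache; if it is not there, look it up in the disk file system.
Run the security checks required before deletion.
Call vfs_unlink to perform the actual unlink.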

static long do_unlinkat(int dfd, const char __user *pathname)
{
  ......
  mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
  // look up the dentry in the dcache; on a miss, ask the file system
  dentry = lookup_hash(&nd);
  error = PTR_ERR(dentry);
  if (!IS_ERR(dentry)) {
    /* Why not before? Because we want correct error value */
    if (nd.last.name[nd.last.len])
      goto slashes;
    // get the inode
    inode = dentry->d_inode;
    if (!inode)
      goto slashes;
    ihold(inode);
    // security check
    error = security_path_unlink(&nd.path, dentry);
    if (error)
      goto exit2;
    // main unlink entry point
    error = vfs_unlink(nd.path.dentry->d_inode, dentry);
exit2:
    dput(dentry);
  }
  mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
  if (inode)
    iput(inode);  /* truncate the inode here */
  mnt_drop_write(nd.path.mnt);
  ......
}

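Readers interested in how the dentry is found in the dcache can read lookup_hash above. Here we go straight to vfs_unlink. It first checks whether deletion is allowed at all: a mount point, missing or read-only permissions and similar conditions mean the entry cannot be deleted, and the function returns at once.

If the checks pass, the rest of the logic runs: the inode's link count is reduced (by the file system's own unlink method, in the part of the listing that is elided below), and d_delete releases the dentry's hold on the inode, drops the dentry's count, and removes the dentry from the dcache hash table.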

int vfs_unlink(struct inode *dir, struct dentry *dentry)
{
  // check whether this entry may be deleted;
  // a mount point, read-only permissions and so on rule deletion out
  int error = may_delete(dir, dentry, 0);

  if (error)
    return error;
  ......

  /* We don't d_delete() NFS sillyrenamed files--they still exist. */
  if (!error && !(dentry->d_flags & DCACHE_NFSFS_RENAMED)) {
    // notify watchers that the inode's link count changed
    fsnotify_link_count(dentry->d_inode);
    // detach the inode from the dentry and drop the dentry's count
    d_delete(dentry);
  }
  
  return error;
}

The logic of d_delete is as follows:

void d_delete(struct dentry * dentry)
{
  struct inode *inode;
  int isdir = 0;
  /*
   * Are we the only user?
   */
again:
  spin_lock(&dentry->d_lock);
  inode = dentry->d_inode;
  isdir = S_ISDIR(inode->i_mode);
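  // if this dentry is referenced only by the current caller (d_count == 1),
  // work on the inode directly: detach it from the dentry and drop the
  // reference to it; otherwise fall through and unhash with __d_drop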
  if (dentry->d_count == 1) {
    if (!spin_trylock(&inode->i_lock)) {
      spin_unlock(&dentry->d_lock);
      cpu_relax();
      goto again;
    }
    dentry->d_flags &= ~DCACHE_CANT_MOUNT;
    // detach the inode from the dentry and drop its reference (i_count);
    // if i_count reaches 0 and no links remain, evict(inode) releases
    // the inode's resources
    dentry_unlink_inode(dentry);
    // notify watchers that the name was removed
    fsnotify_nameremove(dentry, isdir);
    return;
  }

  // the dentry is still referenced elsewhere:
  // unhash it so the name can no longer be found through the VFS;
  // this is mostly list-pointer manipulation
  if (!d_unhashed(dentry))
    __d_drop(dentry);

  spin_unlock(&dentry->d_lock);

  fsnotify_nameremove(dentry, isdir);
}

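There is a lot of detail in the code. Interested readers can follow the call chain further, which gives a deeper understanding of how Linux manages file system metadata.

To summarize: unlink does not immediately wipe the file's contents from disk. It clears or decrements the counts held in the file's metadata (i_nlink, i_count, d_count) and removes the entries from the relevant caches so that the VFS can no longer find the file. The data on the disk file system is released afterwards by the kernel, once the last reference is gone.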

From： https://blog.51cto.com/u_13456560/5823174
