random随机数性能问题

标签：rte __ rand 性能 random uint64 state 随机数

random函数性能问题

测试dpvs新建conn的性能过程中，根据火焰图，发现random函数占用了大量的cpu资源，如下所示。

random随机数性能问题_随机数

random函数实现

man random 知道，random函数为glibc的库函数，查看源码如下所示：

long int
__random (void)
{
  int32_t retval;
  __libc_lock_lock (lock);
  (void) __random_r (&unsafe_state, &retval);
  __libc_lock_unlock (lock);
  return retval;
}

weak_alias (__random, random)
ps: 
glibc源码在线查看：https://elixir.bootlin.com/glibc/latest/source/stdlib/random.c#L300

分析：
如上所示，random添加了锁，所以是线程安全的函数，但是锁影响了性能。

lrand48实现

lrand48同样是库函数，实现如下所示：

long int
lrand48 (void)
{
  long int result;
  (void) __nrand48_r (__libc_drand48_data.__x, &__libc_drand48_data, &result);
  return result;
}

分析：
如上所示，lrand48
没有加锁，
不是线程安全的，多线程下会有cache miss的开销问题；
因为多个线程公用了 __libc_drand48_data 变量。在 __nrand48_r—> __drand48_iterate 中会对其进行更改。
一个线程对于共享变量进行更改，则另外一个线程读取就需要从内存中读取，而不可以从cache中读取，因为cache从内存中一次性读取一个cacheline的数据，一个线程对于数据进行更改，那么其他线程中cache中保存的该数据就不是最新的数据了。
所以：不是线程安全（函数中读以及写了static变量或者全局变量）的函数；多线程之间，就可能存在cache 一致性同步的问题。

lrand48_r实现

int lrand48_r (struct drand48_data *buffer, long int *result)
{
  /* Be generous for the arguments, detect some errors.  */
  if (buffer == NULL)
   return -1;
  return __nrand48_r (buffer->__x, buffer, result);
}

分析：
没有加锁
自定义生成随机数的种子，可以做到线程安全；
如上所示，lrand48_r 自定义生成随机数的种子数据，将生成的随机数保存到 result中。
每个线程都有自己的随机数种子，这样可以保证 lrand48_r是线程安全的，且可以做到不存在cache miss。

dpdk中的rte_rand

dpdk 18.11中的 rte_rand

static inline uint64_t rte_rand(void)
{
  uint64_t val;
  val = (uint64_t)lrand48();
  val <<= 32;
  val += (uint64_t)lrand48();
  return val;
}

分析
通过调用两次的 lrand48 生成一个 uint64的随机数。
但是 lrand48 由于共享变量，在多线程之间存在 cache 一致性同步的问题。

dpdk 20.11中的 rte_rand

struct rte_rand_state {
  uint64_t z1;
  uint64_t z2;
  uint64_t z3;
  uint64_t z4;
  uint64_t z5;
} __rte_cache_aligned;
static struct rte_rand_state rand_states[RTE_MAX_LCORE];

static __rte_always_inline
struct rte_rand_state *__rte_rand_get_state(void)
{
  unsigned int lcore_id;
  lcore_id = rte_lcore_id();
  if (unlikely(lcore_id == LCORE_ID_ANY))
    lcore_id = rte_get_main_lcore();
  return &rand_states[lcore_id];
}

uint64_t  rte_rand(void)
{
  struct rte_rand_state *state;
  state = __rte_rand_get_state();
  return __rte_rand_lfsr258(state);
}

static __rte_always_inline uint64_t
__rte_rand_lfsr258(struct rte_rand_state *state)
{
  state->z1 = __rte_rand_lfsr258_comp(state->z1, 1UL, 53UL,
              18446744073709551614UL, 10UL);
  state->z2 = __rte_rand_lfsr258_comp(state->z2, 24UL, 50UL,
              18446744073709551104UL, 5UL);
  state->z3 = __rte_rand_lfsr258_comp(state->z3, 3UL, 23UL,
              18446744073709547520UL, 29UL);
  state->z4 = __rte_rand_lfsr258_comp(state->z4, 5UL, 24UL,
              18446744073709420544UL, 23UL);
  state->z5 = __rte_rand_lfsr258_comp(state->z5, 3UL, 33UL,
              18446744073701163008UL, 8UL);
  return state->z1 ^ state->z2 ^ state->z3 ^ state->z4 ^ state->z5;
}

分析：
rand_states[RTE_MAX_LCORE]数组，每个core有一个自己独占的元素；
struct rte_rand_state 是cacheline对齐的，即每个core对应的元素是cacheline对齐的，各个元素之间的更改互不影响。
这个很重要，因为数组中每个元素如果不是cacheline对齐的，那么整个数组就会是在一个cacheline中。
一个线程对于自己对应的数组元素进行更改，则会造成另外一个线程，其cache中数据为老数据，需要从内存中读取。因为cpu每次从内存中读取一个cacheline的数据，同时在cache中保存一份备份。
一个线程对于自身对应元素的更改，虽然不会更改其他元素的值，会影响到整个cacheline的变化。那么其他的每个core对应的cache中的数据就都是老数据了。

自定义比较好的随机函数

dpdk 中  获取时间戳的方法：
Intel的X86中的RDTSC即Read Time Stamp Counter 读取时间计数器的指令。

static inline uint64_t rte_rdtsc(void)
{
  union {
    uint64_t tsc_64;
    RTE_STD_C11
    struct {
      uint32_t lo_32;
      uint32_t hi_32;
    };
  } tsc;

...

  asm volatile("rdtsc" :
         "=a" (tsc.lo_32),
         "=d" (tsc.hi_32));
  return tsc.tsc_64;
}

分析：
背景：
Linux提供的API——gettimeofday()可以获取微秒级的精度。但是，首先它不能提供纳秒级精度，其次，他是一个系统调用，自身就有一定的开销。
Tsc介绍
每个核都有自己独立的一套寄存器，TSC是一个64位的寄存器。
从Intel Pentium开始，在所有的x86平台上均会提供。
TSC寄存器中存放的是CPU从启动以来执行的指令周期数。
通过rdtsc指令，可以将TSC的数值存放在EDX:EAX中。如下所示：
uint64_t get_tsc() { uint64_t a, d; __asm__ volatile("rdtsc" : "=a"(a), "=d"(d)); return (d << 32) | a; }TSC曾经是一个==极高精度，极低开销==的取时间的方法。
TSC的坑：
结论：
TSC寄存器在每个core中都有一份，其值是放在寄存器中。
所以获取TSC（cpu启动以来的指令周期数）的值的开销极低。
可基于rdtsc，然后乘以一个放大的系数，再mode一个数字，得到一个随机数。

参考

dpdk中线程安全的高性能伪随机数生成器：
https://www.dpdk.org/wp-content/uploads/sites/35/2019/10/Threadsafe.pdf

TSC的坑
http://www.wangkaixuan.tech/?p=901

标签：rte,__,rand,性能,random,uint64,state,随机数
From： https://blog.51cto.com/u_15911260/5934627

目录