Hash魔数的奇妙之旅

2024-09-14 13:24:11 #Linux Kernel #Hash

一、起因

在看《深入理解linux内核》有关hash函数部分时候，发现了一个有趣的魔数0x9e370001UL。书中对于该数的解释如下：

也许你会想常量0x9e370001(=2654 404 609)究竟是怎么得出的。这种散列函数是基于表索引乘以一个适当的大数，于是结果溢出，就把留在32位变量中的值作为模数操作的结果。Knuth建议，要得到满意的结果，这个大柬数就应当是接近黄金比例的2”的一个素数（32位是80x86寄存器的大小）。这里，2654 404 609就是接近$2^{32}×(\sqrt{5}-1)/2$的一个素数，这个数可以方便地通过加运算和位移运算得到，因为它等于∶ $2^{31}+2^{29}-2^{25}+2^{22}-2^{19}-2^{16}+1$。

unsigned long hash_long(unsigned long val,unsigned int bits)
{
        unsigned long hash = val* 0×9e370001UL;
        return hash S >> (32-bits);
}

于是乎，我去翻找了对应的源代码文件/include/linux/hash.h，发现了有趣的地方，这个魔数在最新版本的kernel中并不存在，取而代之的是0x61C88647。

#define GOLDEN_RATIO_32 0x61C88647
#define GOLDEN_RATIO_64 0x61C8864680B583EBull

#ifdef CONFIG_HAVE_ARCH_HASH
/* This header may use the GOLDEN_RATIO_xx constants */
#include <asm/hash.h>
#endif

/*
 * The _generic versions exist only so lib/test_hash.c can compare
 * the arch-optimized versions with the generic.
 *
 * Note that if you change these, any <asm/hash.h> that aren't updated
 * to match need to have their HAVE_ARCH_* define values updated so the
 * self-test will not false-positive.
 */
#ifndef HAVE_ARCH__HASH_32
#define __hash_32 __hash_32_generic
#endif
static inline u32 __hash_32_generic(u32 val)
{
	return val * GOLDEN_RATIO_32;
}

static inline u32 hash_32(u32 val, unsigned int bits)
{
	/* High bits are more random, so use them. */
	return __hash_32(val) >> (32 - bits);
}

#ifndef HAVE_ARCH_HASH_64
#define hash_64 hash_64_generic
#endif
static __always_inline u32 hash_64_generic(u64 val, unsigned int bits)
{
#if BITS_PER_LONG == 64
	/* 64x64-bit multiply is efficient on all 64-bit processors */
	return val * GOLDEN_RATIO_64 >> (64 - bits);
#else
	/* Hash 64 bits using only 32x32-bit multiply. */
	return hash_32((u32)val ^ __hash_32(val >> 32), bits);
#endif
}

为什么会出现这个问题呢？我们找到了hash.h的修改日志，发现了2016-05-02的一条修改记录引人注目。

变量名	值
GOLDEN_RATIO_PRIME_32	0x9e370001
GOLDEN_RATIO_PRIME_64	0x9e37fffffffc0001B
GOLDEN_RATIO_32	0x61C88647
GOLDEN_RATIO_64	0x61C8864680B583EB

二、分析

这次patch往hash文件中增加了新的魔数0x61C88647，但是并没有删除原来的魔数（为了兼容性和效率）。其中注释说到的理由可以概括为两点：

上方的质数（0x9e370001与0x9e37fffffffc0001）过于稀疏（sparse），对64位hash函数影响显著。
是否是质数对于乘法hash并没有影响。

因此，原来的最接近黄金分割的素数0x9e370001被替换成最接近黄金分割的奇数0x61C88647。

diff --git a/include/linux/hash.h b/include/linux/hash.h
index 1afde47e1528c..79c52fa81cac9 100644
--- a/include/linux/hash.h
+++ b/include/linux/hash.h
@@ -32,12 +32,28 @@
 #error Wordsize not 32 or 64
 #endif
 
+/*
+ * The above primes are actively bad for hashing, since they are
+ * too sparse. The 32-bit one is mostly ok, the 64-bit one causes
+ * real problems. Besides, the "prime" part is pointless for the
+ * multiplicative hash.
+ *
+ * Although a random odd number will do, it turns out that the golden
+ * ratio phi = (sqrt(5)-1)/2, or its negative, has particularly nice
+ * properties.
+ *
+ * These are the negative, (1 - phi) = (phi^2) = (3 - sqrt(5))/2.
+ * (See Knuth vol 3, section 6.4, exercise 9.)
+ */
+#define GOLDEN_RATIO_32 0x61C88647
+#define GOLDEN_RATIO_64 0x61C8864680B583EBull
+
 static __always_inline u64 hash_64(u64 val, unsigned int bits)
 {
 	u64 hash = val;
 
-#if defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER) && BITS_PER_LONG == 64
-	hash = hash * GOLDEN_RATIO_PRIME_64;
+#if BITS_PER_LONG == 64
+	hash = hash * GOLDEN_RATIO_64;
 #else
 	/*  Sigh, gcc can't optimise this alone like it does for 32 bits. */
 	u64 n = hash;

这两个原因可能让人不太理解，下面我们来详细讲一下两个原因：

2.1 稀疏

之前讲过，乘法hash是将key乘以一个适当的大数，于是结果溢出，保留一些高位的值得到一个相对的伪随机数。这里我们需要考虑两个因素：一个是随机性，我们希望能够得到一个相对随机性较大的结果；另外我们还需要考虑效率问题，因为大数乘法的效率比较低。早前选取了一个位稀疏（bit sparse）的大质数0x9e370001，这个数可以表示为 $2^{31}+2^{29}-2^{25}+2^{22}-2^{19}-2^{16}+1$。因此，编译器在优化乘法的时候会将其优化成位操作与加法，64位下也是同理。0x61C8864680B583EB可以表示为$ 2^{63} + 2^{61} - 2^{57} + 2^{54} - 2^{51} - 2^{18} + 1 $。可以看到位稀疏的大数在乘法的时候，可以通过位操作完成大数乘法，如下：

static __always_inline u64 hash_64(u64 val, unsigned int bits)
{
	...
	/*  Sigh, gcc can't optimise this alone like it does for 32 bits. */
	u64 n = hash;
	n <<= 18;
	hash -= n;
	n <<= 33;
	hash -= n;
	n <<= 3;
	hash += n;
	n <<= 3;
	hash -= n;
	n <<= 4;
	hash += n;
	n <<= 2;
	hash += n;
	...
	/* High bits are more random, so use them. */
	return hash >> (64 - bits);
}

可以看到，位越稀疏的大数，所消耗的开销会越低，但是这也会带来一些问题。在32位下，这些问题并不显著；而在64位下，这会带来一些灾难性的问题。我们将一个等待hash的数设为$n$，则$ n[64] = {n[63]n[62]…n[0]} $,其中$O$负责将数组补全到64位，则上述的步骤就会变成以下步骤：

$ out = n[64] $
$ out -= {n[45]n[41]….n[0]O}$ // 左移18位后
$$out -= {n[12]n[11]….n[0]O}$ // 左移33位后
$out += {n[9]n[8]….n[0]O}$ // 左移3位后
$ out += {n[6]n[5]….n[0]O}$ // 左移3位后
$ out -={n[2]n[1]n[0]O}$ // 左移4位后
$out += {n[0]O}$ // 左移2位后

这个稀疏的数存在的问题便是，高位的一些数对于out结果的决定性较小，而低位的一些数对于out结果的决定性较大，这会带来什么问题呢？我们知道，计算机中往往存在一些对齐的地址（比如页对齐等）。例如一个值0xf10000，这个数的低12位都为0，这会导致上方步骤中的第3至第7步的右侧结果都为0，而高位0xf1承载了更多的信息的比重反而十分低了。这会让hash函数的碰撞率升高。我引入以下测试代码：

#include <iostream>

#define GOLDEN_RATIO_PRIME 0x9e37fffffffc0001UL

using namespace std;
static inline uint64_t hash_64(uint64_t  val, unsigned int bits)
{
    uint64_t hash = val;
    hash = hash * GOLDEN_RATIO_PRIME;
    return hash >> (64 - bits);
}

#define BIT 32
int main() {
    uint64_t arrays[] = {0xf10000UL,0xf20000UL,0xf30000UL,0xf40000UL,0xfe0000UL, 0xff0000UL};
    for (int i = 0; i < sizeof(arrays)/sizeof(uint64_t); ++i) {
            uint64_t hash = hash_64(arrays[i], BIT);
            cout << hex << "0x" << hash << endl;

    }
}

结果：

0xfffffc3c
0xfffffc38
0xfffffc34
0xfffffc30
0xfffffc08
0xfffffc04

可以看到，当hash长度设置为32位时，这一串基址的hash结果非常接近，而当我们把BIT设置更短时，例如28，就会产生碰撞：

0xfffffc3
0xfffffc3
0xfffffc3
0xfffffc3
0xfffffc0
0xfffffc0

而将GOLDEN_RATIO_PRIME设为0x61C8864680B583EB则不存在这个问题，同样的输入，同样BIT为28，结果会更加离散：

0x685f2ae
0xeea5ab9
0x74ec2c4
0xfb32ad0
0x39f3b41
0xc03a34c

2.2 质数

那么为什么把质数替换成非质数呢？我从《Vol,3 SortingAndSearching-Donald Knuth （计算机程序设计艺术（第三卷））》中找到了一些答案。该书的6.4节中提到，有两大散列函数十分友好。一是以除法为基础，而另一是以乘法为基础。除法如下：

$$h(k)=K\mod M$$

此时$M$应当选择为一个大质数。而乘法并没有要求数为质数，早期的开发者可能混淆了这个概念。具体情况可以去翻一下这本书。

总结

一个数的变化Linux都会斟酌很久，反复改进，应该学习这种精神，也算是学到不少知识。

2024-09-14 13:24:11 #Linux Kernel #Hash