哈希函数的原理、实现与应用场景

1. 哈希函数原理
2. 常用哈希算法
3. std::hash 的实现
4. 应用场景

1. 哈希函数原理

1.1. 基本概念

哈希函数是一种将任意长度的输入（键）映射到固定长度输出（哈希值）的函数：

1	h: Key → Hash Value

1.2. 哈希函数特性

确定性：相同的输入总是产生相同的输出，这是哈希函数的基本要求
高效性：计算速度快，时间复杂度通常为O(1)或O(n)。
均匀分布：好的哈希函数应该使输出在值域内均匀分布，减少哈希冲突的概率。

1.3. 哈希冲突

当两个不同的输入产生相同的哈希值时，称为哈希冲突。

解决冲突的方法：

开放定址法：线性探测、二次探测
链地址法：每个桶使用链表存储冲突元素
再哈希法：使用多个哈希函数

2. 常用哈希算法

2.1. 非加密哈希函数

2.1.1. 多项式滚动哈希

这个算法其实没有特定的名字，只是其哈希算法的原理是多项式滚动。

虽然没有特定的名称，但它是JDK源码中String类的哈希计算方法，在Java中被广泛使用。这个算法的实现非常简单，所以在很多开源的项目中也被采用。

C/C++实现

uint32_t polynomial_rolling(const std::string& str)
{
    int h = 0;
    for (int i = 0; i < str.length(); i++)
    {
        h = 31 * h + str[i];
    }
    return h;
}

JDK源码实现

在Java 8及以后版本中，String.hashCode()的实现如下：

public final class String {
    /** Cache the hash code for the string */
    private int hash; // Default to 0
    
    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;
            
            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }
}

算法讲解

JDK中String类的哈希计算采用的是多项式滚动哈希：

1	h(s) = s[0] × 31^(n-1) + s[1] × 31^(n-2) + ... + s[n-1] × 31^0

数学推导过程如下：

初始: h = 0
第1次: h = 31 × 0 + s[0] = s[0]
第2次: h = 31 × (31 × 0 + s[0]) + s[1] = 31 × s[0] + s[1]
第3次: h = 31 × (31 × s[0] + s[1]) + s[2] = 31² × s[0] + 31 × s[1] + s[2]
...
最终: h = 31^(n-1) × s[0] + 31^(n-2) × s[1] + ... + 31^0 × s[n-1]

为什么选择31?

质数性质： 31是质数，减少哈希冲突。质数乘法有更好的分布特性。
性能优化： 31 * i 可以被JVM优化为 (i << 5) - i。移位操作比乘法快很多。
历史原因： 经过大量测试，31在各种场景下表现良好。在ASCII字符范围内分布均匀。

2.1.2. DJB2

算法实现

// DJB2哈希算法
uint32_t DJB2(const std::string& str)
{
    uint32_t DJB2 = 5381;
    for (char c : str)
    {
        DJB2 = ((DJB2 << 5) + DJB2) + c; // DJB2 * 33 + c
    }
    return DJB2;
}

算法讲解

初始值： 5381（经过大量测试选择的质数）
核心操作： hash = hash * 33 + char
乘法优化： hash * 33 可以优化为 (hash << 5) + hash
特点： 简单高效，分布均匀

2.1.3. FNV-1/FNV-1a

Fowler-Noll-Vo哈希算法，有两种变体。

算法实现

static const uint32_t FNV_OFFSET_BASIS = 2166136261u;
static const uint32_t FNV_PRIME = 16777619u;

// FNV-1 32位版本
static uint32_t fnv1_32(const std::string& str)
{
    uint32_t hash = FNV_OFFSET_BASIS;
    for (char c : str)
    {
        hash = (hash * FNV_PRIME) ^ c;
    }
    return hash;
}

// FNV-1a 32位版本（推荐使用）
static uint32_t fnv1a_32(const std::string& str)
{
    uint32_t hash = FNV_OFFSET_BASIS;
    for (char c : str)
    {
        hash = (hash ^ c) * FNV_PRIME;
    }
    return hash;
}

static const uint64_t FNV_OFFSET_BASIS_64 = 14695981039346656037u;
static const uint64_t FNV_PRIME_64 = 1099511628211u;

// FNV-1a 64位版本
static uint64_t fnv1a_64(const std::string& str)
{
    uint64_t hash = FNV_OFFSET_BASIS_64;
    for (char c : str)
    {
        hash = (hash ^ static_cast<uint8_t>(c)) * FNV_PRIME_64;
    }
    return hash;
}

算法讲解

FNV-1： 先乘后异或 hash = (hash * prime) ^ byte。
FNV-1a： 先异或后乘 hash = (hash ^ byte) * prime。
FNV-1a更好： 对短字符串和零字节处理更好。
质数选择： 精心选择的质数保证良好分布。

2.1.4. Jenkins

Bob Jenkins开发了一系列高质量的哈希函数，以出色的分布性和低冲突率著称。

主要算法成员：

one-at-a-time：简单但高质量的通用哈希
lookup2：经典的32位哈希
lookup3：改进的64位友好版本
SpookyHash：更现代的快速哈希
CityHash： Google基于Jenkins工作的扩展

Bob Jenkins是哈希算法领域的传奇人物，其设计的Lookup3等哈希函数以卓越性能和低碰撞率成为行业基石，深刻影响了后世算法发展，并被数据库、编译器等多种核心软件广泛采用。

算法实现

// Jenkins one_at_a_time 哈希（简单但高质量）
uint32_t one_at_a_time(const std::string& str)
{
    uint32_t hash = 0;

    for (char c : str)
    {
        hash += static_cast<uint8_t>(c);
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }

    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);

    return hash;
}

// Jenkins lookup3 哈希（更复杂的Jenkins哈希）
uint32_t lookup3(const void* key, size_t length, uint32_t initval = 0)
{
    uint32_t a, b, c;
    const uint8_t* k = static_cast<const uint8_t*>(key);

    // 设置初始值
    a = b = c = 0xdeadbeef + static_cast<uint32_t>(length) + initval;

    // 主要哈希循环
    while (length > 12)
    {
        a += k[0] + (static_cast<uint32_t>(k[1]) << 8) + (static_cast<uint32_t>(k[2]) << 16) +
             (static_cast<uint32_t>(k[3]) << 24);
        b += k[4] + (static_cast<uint32_t>(k[5]) << 8) + (static_cast<uint32_t>(k[6]) << 16) +
             (static_cast<uint32_t>(k[7]) << 24);
        c += k[8] + (static_cast<uint32_t>(k[9]) << 8) + (static_cast<uint32_t>(k[10]) << 16) +
             (static_cast<uint32_t>(k[11]) << 24);

        // 混合函数
        a -= c;
        a ^= ((c << 4) | (c >> 28));
        c += b;
        b -= a;
        b ^= ((a << 6) | (a >> 26));
        a += c;
        c -= b;
        c ^= ((b << 8) | (b >> 24));
        b += a;
        a -= c;
        a ^= ((c << 16) | (c >> 16));
        c += b;
        b -= a;
        b ^= ((a << 19) | (a >> 13));
        a += c;
        c -= b;
        c ^= ((b << 4) | (b >> 28));
        b += a;

        k += 12;
        length -= 12;
    }

    // 处理最后1-12个字节
    switch (length)
    {
        case 12:
            c += (static_cast<uint32_t>(k[11]) << 24);
            [[fallthrough]];
        case 11:
            c += (static_cast<uint32_t>(k[10]) << 16);
            [[fallthrough]];
        case 10:
            c += (static_cast<uint32_t>(k[9]) << 8);
            [[fallthrough]];
        case 9:
            c += k[8];
            [[fallthrough]];
        case 8:
            b += (static_cast<uint32_t>(k[7]) << 24);
            [[fallthrough]];
        case 7:
            b += (static_cast<uint32_t>(k[6]) << 16);
            [[fallthrough]];
        case 6:
            b += (static_cast<uint32_t>(k[5]) << 8);
            [[fallthrough]];
        case 5:
            b += k[4];
            [[fallthrough]];
        case 4:
            a += (static_cast<uint32_t>(k[3]) << 24);
            [[fallthrough]];
        case 3:
            a += (static_cast<uint32_t>(k[2]) << 16);
            [[fallthrough]];
        case 2:
            a += (static_cast<uint32_t>(k[1]) << 8);
            [[fallthrough]];
        case 1:
            a += k[0];
            break;
        case 0:
            return c;
    }

    // 最终混合
    c ^= b;
    c -= ((b << 14) | (b >> 18));
    a ^= c;
    a -= ((c << 11) | (c >> 21));
    b ^= a;
    b -= ((a << 25) | (a >> 7));
    c ^= b;
    c -= ((b << 16) | (b >> 16));
    a ^= c;
    a -= ((c << 4) | (c >> 28));
    b ^= a;
    b -= ((a << 14) | (a >> 18));
    c ^= b;
    c -= ((b << 24) | (b >> 8));

    return c;
}

算法讲解

one_at_a_time:

这个算法的核心思想是充分混合：

加法：引入算术相关性
移位：扩散比特位影响
异或：引入非线性变换

每个操作都帮助打破输入数据的模式，确保微小的输入变化导致哈希值的巨大变化。

jenkins算法比较：

特性	one-at-a-time	lookup2	lookup3	SpookyHash	CityHash
推出时间	1990s后期	1997	2006	2012	2011
位宽	32位	32位	32/64位	32/64/128位	64/128位
速度	中等	中等	快	非常快	极快
质量	好	优秀	卓越	卓越	优秀
复杂度	简单	中等	中等	中等	复杂
目标平台	通用	32位系统	32/64位	64位优化	64位优化

推荐选择:

平衡选择: lookup3 - 质量与性能的最佳平衡
性能优先: SpookyHash - 现代系统的优秀选择
简单优先: one-at-a-time - 资源受限环境
极致性能: CityHash - x86-64服务器环境

2.1.5. MurmurHash

算法实现

// MurmurHash3 32位版本
uint32_t murmur3_32(const void* key, size_t len, uint32_t seed = 0)
{
    const uint8_t* data = static_cast<const uint8_t*>(key);
    const int nblocks = len / 4;

    uint32_t h1 = seed;

    const uint32_t c1 = 0xcc9e2d51;
    const uint32_t c2 = 0x1b873593;

    // 主体处理
    const uint32_t* blocks = reinterpret_cast<const uint32_t*>(data + nblocks * 4);
    for (int i = -nblocks; i; i++)
    {
        uint32_t k1 = blocks[i];

        k1 *= c1;
        k1 = (k1 << 15) | (k1 >> 17);
        k1 *= c2;

        h1 ^= k1;
        h1 = (h1 << 13) | (h1 >> 19);
        h1 = h1 * 5 + 0xe6546b64;
    }

    // 尾部处理
    const uint8_t* tail = data + nblocks * 4;
    uint32_t k1 = 0;

    switch (len & 3)
    {
        case 3:
            k1 ^= tail[2] << 16;
            [[fallthrough]];
        case 2:
            k1 ^= tail[1] << 8;
            [[fallthrough]];
        case 1:
            k1 ^= tail[0];
            k1 *= c1;
            k1 = (k1 << 15) | (k1 >> 17);
            k1 *= c2;
            h1 ^= k1;
    }

    // 最终混合
    h1 ^= len;
    h1 ^= h1 >> 16;
    h1 *= 0x85ebca6b;
    h1 ^= h1 >> 13;
    h1 *= 0xc2b2ae35;
    h1 ^= h1 >> 16;

    return h1;
}

算法讲解

名称来源: Multiply and Rotate - 乘法和旋转操作
开发者: Austin Appleby (2008)
设计目标: 高性能、优良分布、简单实现
应用场景: 哈希表、布隆过滤器、分布式系统

2.2. 加密哈希函数

2.2.1. MD5

MD5（Message-Digest Algorithm 5）是一种广泛使用的密码散列函数，由Ron Rivest在1991年设计，是MD4算法的改进版。

基本特性:

输出长度: 128位（16字节）
块大小: 512位（64字节）
设计目标: 数据完整性验证、数字签名
现状: 已不安全，不推荐用于安全敏感场景

源码实现：

通常会使用OpenSSL库来实现MD5哈希函数。OpenSSL是一个开源的、功能全面的密码学工具库，提供了 SSL/TLS 协议实现以及各种强大的加密、解密和证书管理功能。

#include <iomanip>
#include <openssl/evp.h>
#include <openssl/md5.h>
#include <sstream>
#include <string>

std::string md5_hash(const std::string& input)
{
    EVP_MD_CTX* context = EVP_MD_CTX_new();
    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int digest_len;

    // 使用EVP接口
    EVP_DigestInit_ex(context, EVP_md5(), nullptr);
    EVP_DigestUpdate(context, input.c_str(), input.length());
    EVP_DigestFinal_ex(context, digest, &digest_len);
    EVP_MD_CTX_free(context);

    // 转换为十六进制字符串
    std::stringstream ss;
    for (unsigned int i = 0; i < digest_len; i++)
    {
        ss << std::hex << std::setw(2) << std::setfill('0') << static_cast<int>(digest[i]);
    }
    return ss.str();
}

2.2.2. SHA

SHA系列算法概述

SHA(Secure Hash Algorithm)是由美国国家安全局设计、美国国家标准与技术研究院发布的一系列密码散列函数。

SHA系列包含多个版本，按发布时间和安全性演进如下：

SHA-0：1993年发布，因存在安全缺陷，很快被SHA-1取代。
SHA-1：1995年发布，输出160位（20字节）散列值，曾广泛用于SSL/TLS、Git、数字证书等，但已被证明不安全。
SHA-2：2001年发布，包含SHA-224、SHA-256、SHA-384、SHA-512等，其中SHA-256是应用最广泛的版本。
SHA-3：2015年发布，基于全新的海绵结构设计，作为SHA-2的替代方案，包含SHA3-256、SHA3-512等。

SHA-1 基本特性

输出长度: 160位（20字节）
块大小: 512位
设计年代: 1995年
现状: 已不安全，被证实存在碰撞攻击

SHA-256 基本特性

输出长度: 256位（32字节）
块大小: 512位
所属标准: SHA-2家族
现状: 目前安全，广泛使用

源码实现

#include <iomanip>
#include <memory>
#include <openssl/evp.h>
#include <sstream>
#include <string>

class SHAHash
{
public:
    // SHA-256 使用 EVP 接口
    static std::string sha256(const std::string& input) { return hash_evp(EVP_sha256(), input); }

    // SHA-1 使用 EVP 接口
    static std::string sha1(const std::string& input) { return hash_evp(EVP_sha1(), input); }

    // SHA-512 使用 EVP 接口
    static std::string sha512(const std::string& input) { return hash_evp(EVP_sha512(), input); }

    // SHA-384 使用 EVP 接口
    static std::string sha384(const std::string& input) { return hash_evp(EVP_sha384(), input); }

    // 通用的 SHA-224
    static std::string sha224(const std::string& input) { return hash_evp(EVP_sha224(), input); }

private:
    // 通用的 EVP 哈希函数
    static std::string hash_evp(const EVP_MD* md_type, const std::string& input)
    {
        std::unique_ptr<EVP_MD_CTX, decltype(&EVP_MD_CTX_free)> context(EVP_MD_CTX_new(),
                                                                        EVP_MD_CTX_free);

        if (!context)
        {
            throw std::runtime_error("Failed to create EVP_MD_CTX");
        }

        unsigned char digest[EVP_MAX_MD_SIZE];
        unsigned int digest_len;

        // 初始化哈希上下文
        if (EVP_DigestInit_ex(context.get(), md_type, nullptr) != 1)
        {
            throw std::runtime_error("EVP_DigestInit_ex failed");
        }

        // 更新哈希数据
        if (EVP_DigestUpdate(context.get(), input.c_str(), input.length()) != 1)
        {
            throw std::runtime_error("EVP_DigestUpdate failed");
        }

        // 完成哈希计算
        if (EVP_DigestFinal_ex(context.get(), digest, &digest_len) != 1)
        {
            throw std::runtime_error("EVP_DigestFinal_ex failed");
        }

        // 转换为十六进制字符串
        return bytes_to_hex(digest, digest_len);
    }

    // 将字节数组转换为十六进制字符串
    static std::string bytes_to_hex(const unsigned char* bytes, size_t length)
    {
        std::stringstream ss;
        for (size_t i = 0; i < length; i++)
        {
            ss << std::hex << std::setw(2) << std::setfill('0') << static_cast<int>(bytes[i]);
        }
        return ss.str();
    }
};

2.3. 专用哈希函数

2.3.1. 整数哈希函数

整数哈希函数的主要目的是改善原始整数的分布特性，而不是压缩数据。

应用场景：

哈希表优化
数据库分片
随机数生成器种子优化

源码实现：

// Thomas Wang的整数哈希函数
uint32_t thomas_wang(uint32_t key)
{
    key = ~key + (key << 15);
    key = key ^ (key >> 12);
    key = key + (key << 2);
    key = key ^ (key >> 4);
    key = key * 2057;
    key = key ^ (key >> 16);
    return key;
}

// 乘法哈希（适用于哈希表）
uint32_t multiplicative_hash(uint32_t key, uint32_t table_size)
{
    const double A = 0.6180339887; // 黄金比例的分数部分
    double product = key * A;
    double fractional = product - static_cast<uint32_t>(product);
    return static_cast<uint32_t>(table_size * fractional);
}

// 除法哈希
uint32_t division_hash(uint32_t key, uint32_t table_size)
{
    return key % table_size;
}

使用示例：

template<typename K>
struct SimpleHasher {
    size_t operator()(const K& key) const {
        return key;  // 直接使用原始值
    }
};

template<typename K>  
struct BetterHasher {
    size_t operator()(const K& key) const {
        return thomas_wang(key);  // 使用哈希函数改善分布
    }
};

// 使用示例
std::unordered_map<uint32_t, std::string, SimpleHasher<uint32_t>> map1;
std::unordered_map<uint32_t, std::string, BetterHasher<uint32_t>> map2;

2.3.2. 布隆过滤器哈希

参见《布隆过滤器》

3. std::hash 的实现

3.1. 算法概述

在C++里，C++标准没有规定std::hash的具体算法，只定义了接口。各个平台和编译器可以自行实现。

主流编译器的实现：

GCC 4.8+: 通常使用MurmurHash2变体
Clang: 类似GCC，使用MurmurHash系列
MSVC: 使用FNV-1a或类似算法

3.2. 使用方法

示例代码：

void TestStdHash()
{
    std::string str1("Hello World!");
    std::string str2("我爱中国！");

    std::cout << "std::hash():" << std::endl;
    std::hash<std::string> hasher;
    std::cout << str1 << " --> " << hasher(str1.c_str()) << std::endl;
    std::cout << str2 << " --> " << hasher(str2.c_str()) << std::endl;
}

在各个平台下的实现如下：

平台	编译器	`Hello World!`	`我爱中国！`
Ubuntu	GCC 13.3.0	`10571665718977150164`	`16981257090658025398`
Windows	MSVC 19.44.35214	`10092224619179044402`	`7108835978030019050`
macOS	clang 16.0.0	`18476351241006313`	`16995802077979883537`