HyperLogLog算法(HLL算法)

1. 什么是HyperLogLog？
2. 算法原理详解
- 2.1. 直观理解(抛硬币实验)
- 2.2. 算法步骤
3. 数学推导
4. 完整实现
5. 性能分析
6. 应用场景
7. 总结

1. 什么是HyperLogLog？

1.1. 一句话描述

HyperLogLog是一种 概率统计算法，用于在极小的内存空间下快速估算海量数据的 不重复元素个数 (基数统计)。

1.2. 核心问题

在海量数据的背景下，如何高效、省空间地计算一个集合中不同元素的个数，即“基数估计”问题。(如大型热门网站的UV统计)

1.3. 传统方法的瓶颈

set/unorderded_set： 准确，但需要存储所有元素本身。如果元素是几十亿个用户ID，内存消耗将是巨大的（几十GB甚至更多），不现实。
Bitmap： 将每个元素通过哈希映射到一个位，如果该位是0，则置1并计数+1。比 HashMap 省空间，但当基数(独立用户数)很大时，比如有10亿用户，就需要10亿 / 8 ≈ 125MB的内存，对于多个需要统计的维度来说，内存消耗依然很高。

1.4. HyperLogLog方案

HyperLogLog 是一种概率算法，用于基数估计。它的核心思想是：通过观察数据的随机性来估算基数。

均匀分布假设：好的哈希函数能将输入均匀分布
极值原理：观察哈希值的极端模式（如前导零）
概率估算：通过极端值的概率分布估算基数

它的特点是：

速度极快
占用空间极小（通常只需要十几KB到几十KB的内存）
可容忍一定的误差（标准误差约为 0.81% / √m，其中 m 是桶数）

它牺牲了精确度，换来了极致的空间和速度效率，对于许多大数据场景来说，这个 trade-off 是完全值得的。

1.5. 核心思想

2. 算法原理详解

2.1. 直观理解(抛硬币实验)

抛掷一枚硬币，用1表示正面，用0表示反面。那么抛一次硬币，出现正面(1)和反面(0)的概率都是1/2。

抛硬币说明	概率	抛硬币的次数	硬币正反面的顺序
抛掷1次硬币就出现正面的概率	1/2	1	1
抛掷2次硬币才出现正面的概率	1/4	2	`01`
抛掷3次硬币才出现正面的概率	1/8	3	`001`
抛掷k次硬币才出现正面的概率	1/2^k	k	`0...01`(前面有`k-1`个`0`)

如果我们观察到最长连续反面次数为 k，可以估算抛硬币次数 ≈ 2^k。
同意的，如果我们观察到最长连续反面次数为k，第k+1次出现正面，可以估算抛硬币次数 ≈ 2^(k+1)。

这里基数指的是集合中不重复元素的数量。对应到统计学中，就是采用的样本容量(去重后的)。

2.2. 算法步骤

2.2.1. 步骤 1：哈希处理

1
2
3

// 对每个元素进行哈希
hash = hashFunction(element)
// 得到 32位 整数：01011010010110100101101001011010

2.2.2. 步骤 2：分桶统计

将哈希值分成两部分：

前 p 位：决定桶索引 (0 到 m-1，其中 m = 2^p)
剩余 32-p 位：统计前导零数量

// 示例：p = 4，m = 16 个桶
hash = 0b 01011010 01011010 01011010 01011010
bucket_index = hash >>> (32 - 4)    // 32位hash值的 前 4位(0101 = 5)
remaining_bits = hash & 0x0FFFFFFF  // 32位hash值的 后28位
// 统计前导零(统计后 28位中 从左往右第一次出现1时，前面出现0的位数)
leading_zeros = countLeadingZeros(remaining_bits)  // 0

2.2.3. 步骤 3：更新寄存器

每个桶记录观察到的最大前导零数量：

1
2
3

if (leading_zeros > registers[bucket_index]) {
    registers[bucket_index] = leading_zeros;
}

2.2.4. 步骤 4：基数估算

使用调和平均数估算：

function estimateCardinality(registers, m) {
    let sum = 0;
    for (let i = 0; i < m; i++) {
        sum += Math.pow(2, -registers[i]);
    }
    
    const alpha = getAlpha(m);  // 修正系数
    const estimate = alpha * m * m / sum;
    
    return applyCorrections(estimate, m);
}

3. 数学推导

3.1. 概率模型

在单个桶中，观察到前导零数量 ≥ k 的概率：

1	P(leading_zeros ≥ k) = 1/2^k

在 n 个元素中，某个桶的最大前导零为 k 的概率：

1	P(max_leading_zeros = k) ≈ (1 - 1/2^(k+1))^n - (1 - 1/2^k)^n

3.2. 估算公式推导

令 Z = 调和平均数：

1	Z = (∑ 2^{-register[i]})^{-1}

基数估算：

1	E = α × m × Z

其中 α 是修正系数：

1	α(m) = (m × ∫_0^∞ (log_2((2+u)/(1+u)))^m du)^{-1}

3.3. 修正项

小范围修正（线性计数）：

if (estimate <= 2.5 * m) {
    const zeros = countZeroRegisters(registers);
    if (zeros > 0) {
        estimate = m * Math.log(m / zeros);
    }
}

大范围修正：

1
2
3

if (estimate > 2^32 / 30) {
    estimate = -2^32 * Math.log(1 - estimate / 2^32);
}

4. 完整实现

class HyperLogLog {
    constructor(precision = 14) {
        // 精度 p，寄存器数量 m = 2^p
        this.p = precision;
        this.m = 1 << precision; // 2^p
        this.registers = new Array(this.m).fill(0);
        this.alpha = this._computeAlpha();
    }
    
    _computeAlpha() {
        // 修正系数 α 的预计算值
        switch (this.m) {
            case 16: return 0.673;
            case 32: return 0.697;
            case 64: return 0.709;
            default: return 0.7213 / (1 + 1.079 / this.m);
        }
    }
    
    _hash(value) {
        // 32位哈希函数
        let hash = 0;
        for (let i = 0; i < value.length; i++) {
            hash = ((hash << 5) - hash) + value.charCodeAt(i);
            hash |= 0; // 转换为32位整数
        }
        return hash >>> 0; // 确保无符号
    }
    
    _countLeadingZeros(bits) {
        // 统计后 (32-p) 位的前导零数量
        if (bits === 0) return 32 - this.p;
        
        let count = 1; // 从1开始计数
        while ((bits & (1 << (31 - this.p - count))) === 0) {
            count++;
            if (count > (32 - this.p)) break;
        }
        return count;
    }
    
    add(value) {
        const hash = this._hash(value);
        
        // 前 p 位作为桶索引
        const index = hash >>> (32 - this.p);
        
        // 后 (32-p) 位统计前导零
        const bits = hash & ((1 << (32 - this.p)) - 1);
        const leadingZeros = this._countLeadingZeros(bits);
        
        // 更新寄存器
        if (leadingZeros > this.registers[index]) {
            this.registers[index] = leadingZeros;
        }
    }
    
    count() {
        // 计算调和平均数
        let sum = 0;
        let zeroCount = 0;
        
        for (let i = 0; i < this.m; i++) {
            if (this.registers[i] === 0) {
                zeroCount++;
            }
            sum += Math.pow(2, -this.registers[i]);
        }
        
        // 基础估算
        let estimate = this.alpha * this.m * this.m / sum;
        
        // 小范围修正
        if (estimate <= 2.5 * this.m) {
            if (zeroCount > 0) {
                estimate = this.m * Math.log(this.m / zeroCount);
            }
        }
        // 大范围修正
        else if (estimate > (1 << 30)) { // 2^30
            estimate = -Math.pow(2, 32) * Math.log(1 - estimate / Math.pow(2, 32));
        }
        
        return Math.round(estimate);
    }
    
    merge(other) {
        // 合并两个 HyperLogLog
        if (this.m !== other.m) {
            throw new Error('Precision mismatch');
        }
        
        for (let i = 0; i < this.m; i++) {
            if (other.registers[i] > this.registers[i]) {
                this.registers[i] = other.registers[i];
            }
        }
    }
    
    // 获取内存使用量（字节）
    getMemoryUsage() {
        return this.m; // 每个寄存器1字节，共 m 字节
    }
    
    // 获取相对误差
    getRelativeError() {
        return 1.04 / Math.sqrt(this.m);
    }
}

5. 性能分析

5.1. 内存消耗

精度 p	寄存器数 m	内存占用	标准误差
10	1,024	1 KB	3.2%
12	4,096	4 KB	1.6%
14	16,384	16 KB	0.8%
16	65,536	64 KB	0.4%

5.2. 精度选择

中小网站 (UV < 10万)：p = 12 (4KB, 1.6%误差)
大型网站 (UV 10万-1000万)：p = 14 (16KB, 0.8%误差)
超大型 (UV > 1000万)：p = 16 (64KB, 0.4%误差)

5.3. 精度测试

function testAccuracy() {
    const hll = new HyperLogLog(14);
    const actualCount = 100000;
    
    // 添加10万个唯一元素
    for (let i = 0; i < actualCount; i++) {
        hll.add(`user-${i}-${Date.now()}`);
    }
    
    const estimatedCount = hll.count();
    const error = Math.abs(estimatedCount - actualCount) / actualCount;
    
    console.log(`实际值: ${actualCount}`);
    console.log(`估算值: ${estimatedCount}`);
    console.log(`相对误差: ${(error * 100).toFixed(2)}%`);
    console.log(`理论误差: ${(hll.getRelativeError() * 100).toFixed(2)}%`);
    console.log(`内存使用: ${hll.getMemoryUsage()} 字节`);
}

6. 应用场景

6.1. 网站统计分析

class WebsiteAnalytics {
    constructor() {
        this.dailyStats = new Map(); // date -> HyperLogLog
    }
    
    trackVisit(userId, date = new Date()) {
        const dateStr = date.toISOString().split('T')[0];
        
        if (!this.dailyStats.has(dateStr)) {
            this.dailyStats.set(dateStr, new HyperLogLog(14));
        }
        
        this.dailyStats.get(dateStr).add(userId);
    }
    
    getUV(date) {
        const dateStr = typeof date === 'string' ? date : date.toISOString().split('T')[0];
        const hll = this.dailyStats.get(dateStr);
        return hll ? hll.count() : 0;
    }
    
    // 统计多日总UV（去重）
    getUVRange(startDate, endDate) {
        const merged = new HyperLogLog(14);
        
        let current = new Date(startDate);
        const end = new Date(endDate);
        
        while (current <= end) {
            const dateStr = current.toISOString().split('T')[0];
            const dailyHLL = this.dailyStats.get(dateStr);
            
            if (dailyHLL) {
                merged.merge(dailyHLL);
            }
            
            current.setDate(current.getDate() + 1);
        }
        
        return merged.count();
    }
}

6.2. 分布式统计

class DistributedCounter {
    constructor() {
        this.shards = new Map(); // shardId -> HyperLogLog
    }
    
    // 每个节点维护自己的 HLL
    addToShard(shardId, value) {
        if (!this.shards.has(shardId)) {
            this.shards.set(shardId, new HyperLogLog(14));
        }
        this.shards.get(shardId).add(value);
    }
    
    // 合并所有分片
    getGlobalCount() {
        const globalHLL = new HyperLogLog(14);
        
        for (const shardHLL of this.shards.values()) {
            globalHLL.merge(shardHLL);
        }
        
        return globalHLL.count();
    }
}

6.3. 实时数据流

class StreamProcessor {
    constructor(windowSize = 5 * 60 * 1000) { // 5分钟窗口
        this.windowSize = windowSize;
        this.currentWindow = new HyperLogLog(14);
        this.windowStart = Date.now();
        this.pastWindows = [];
    }
    
    processEvent(userId) {
        const now = Date.now();
        
        // 检查是否需要开新窗口
        if (now - this.windowStart >= this.windowSize) {
            this.pastWindows.push({
                hll: this.currentWindow,
                start: this.windowStart,
                end: now
            });
            
            // 只保留最近1小时的数据
            this.pastWindows = this.pastWindows.filter(
                w => now - w.end <= 3600000
            );
            
            this.currentWindow = new HyperLogLog(14);
            this.windowStart = now;
        }
        
        this.currentWindow.add(userId);
    }
    
    // 获取滑动窗口UV
    getSlidingWindowUV(windowMs = 300000) { // 5分钟
        const merged = new HyperLogLog(14);
        const cutoff = Date.now() - windowMs;
        
        // 合并当前窗口
        merged.merge(this.currentWindow);
        
        // 合并历史窗口
        for (const window of this.pastWindows) {
            if (window.end >= cutoff) {
                merged.merge(window.hll);
            }
        }
        
        return merged.count();
    }
}