Redis HyperLogLog基数统计
Redis 的五种数据结构各有特色,用对了才能发挥它的优势。很多人只用到了 String 和 Hash,却不知道 List、Set、ZSet 在特定场景下更合适。本文从应用场景出发,讲什么时候用什么类型。
一、HyperLogLog原理
#
1.1 问题背景
场景:统计网站UV(独立访客数)
传统方案:使用Set存储每个用户ID - 100万UV → 约100MB内存 - 1亿UV → 约10GB内存
HyperLogLog方案: - 100万UV → 12KB内存 - 1亿UV → 12KB内存 - 100亿UV → 12KB内存
|
#
1.2 核心思想
HyperLogLog基于概率统计:
1. 对每个元素计算哈希值 2. 观察哈希值二进制表示中前导0的个数 3. 前导0越多,说明需要更多不同的元素才能出现
示例: 哈希值 h(x) 的二进制表示: - h(A) = 00101101... → 前导0个数 = 2 - h(B) = 00001011... → 前导0个数 = 4 - h(C) = 00000001... → 前导0个数 = 6
如果发现最大前导0个数为k, 则估算元素数量约为 2^k
上例中最大k=6,估算数量 ≈ 2^6 = 64
|
#
1.3 分桶平均
为减少偶然性,HyperLogLog使用多个桶:
1. 将哈希值分成两部分: - 前b位:决定放入哪个桶(2^b个桶) - 剩余位:计算前导0个数
2. 每个桶记录看到的最大前导0个数
3. 最终估算使用调和平均数: E = α_m * m^2 / Σ(2^(-M[j])) 其中: m = 桶数量(通常为2^14 = 16384) M[j] = 第j个桶的最大前导0数 α_m = 偏差修正系数
|
#
1.4 内存占用
Redis HyperLogLog使用16384个桶(2^14) 每个桶需要6bit存储最大前导0数(最大64个0)
总内存 = 16384 × 6bit = 98304bit = 12288字节 = 12KB
无论统计多少元素,内存固定12KB!
|
二、Redis HyperLogLog命令
#
2.1 基本命令
PFADD page:uv:20240101 user1001 user1002 user1003
PFADD page:uv:20240101 user1004 user1005
PFCOUNT page:uv:20240101
PFMERGE page:uv:week1 page:uv:20240101 page:uv:20240102 page:uv:20240103 PFCOUNT page:uv:week1
|
#
2.2 实际示例
for i in {1..10000}; do redis-cli PFADD hll_test "user_$i" done
redis-cli DEBUG OBJECT hll_test
redis-cli PFCOUNT hll_test
for i in {10001..1000000}; do redis-cli PFADD hll_test "user_$i" done
redis-cli DEBUG OBJECT hll_test
redis-cli PFCOUNT hll_test
|
三、Java中使用HyperLogLog
#
3.1 Spring Data Redis
@Service public class HyperLogLogService { @Autowired private StringRedisTemplate redis;
public Long addVisit(String key, String... userIds) { return redis.opsForHyperLogLog().add(key, userIds); }
public Long getUV(String key) { return redis.opsForHyperLogLog().size(key); }
public Long mergeUV(String destinationKey, String... sourceKeys) { return redis.opsForHyperLogLog().union(destinationKey, sourceKeys); }
public void deleteUV(String key) { redis.delete(key); } }
|
#
3.2 UV统计应用
@Service public class UVStatisticsService { @Autowired private HyperLogLogService hyperLogLogService;
public void recordPageView(Long userId, String page) { String today = LocalDate.now().format(DateTimeFormatter.BASIC_ISO_DATE); String key = "uv:page:" + page + ":" + today; hyperLogLogService.addVisit(key, String.valueOf(userId)); }
public Long getTodayPageUV(String page) { String today = LocalDate.now().format(DateTimeFormatter.BASIC_ISO_DATE); String key = "uv:page:" + page + ":" + today; return hyperLogLogService.getUV(key); }
public Long getWeekPageUV(String page) { LocalDate today = LocalDate.now(); String[] dayKeys = new String[7]; for (int i = 0; i < 7; i++) { String day = today.minusDays(i).format(DateTimeFormatter.BASIC_ISO_DATE); dayKeys[i] = "uv:page:" + page + ":" + day; } String weekKey = "uv:page:" + page + ":week:" + today; return hyperLogLogService.mergeUV(weekKey, dayKeys); }
public Long getMonthPageUV(String page) { LocalDate today = LocalDate.now(); String month = today.format(DateTimeFormatter.ofPattern("yyyyMM")); List<String> dayKeys = new ArrayList<>(); for (int i = 1; i <= today.getDayOfMonth(); i++) { String day = String.format("%s%02d", month, i); dayKeys.add("uv:page:" + page + ":" + day); } String monthKey = "uv:page:" + page + ":month:" + month; return hyperLogLogService.mergeUV(monthKey, dayKeys.toArray(new String[0])); } }
|
#
3.3 搜索去重统计
@Service public class SearchStatisticsService { @Autowired private HyperLogLogService hyperLogLogService;
public void recordSearch(String keyword, Long userId) { hyperLogLogService.addVisit("search:uv:total", String.valueOf(userId)); hyperLogLogService.addVisit("search:uv:" + keyword, String.valueOf(userId)); String today = LocalDate.now().format(DateTimeFormatter.BASIC_ISO_DATE); hyperLogLogService.addVisit("search:uv:daily:" + today, String.valueOf(userId)); }
public Long getKeywordUV(String keyword) { return hyperLogLogService.getUV("search:uv:" + keyword); }
public List<KeywordUV> getTopKeywords(List<String> keywords) { List<KeywordUV> result = new ArrayList<>(); for (String keyword : keywords) { Long uv = hyperLogLogService.getUV("search:uv:" + keyword); result.add(new KeywordUV(keyword, uv)); } result.sort((a, b) -> Long.compare(b.getUv(), a.getUv())); return result; } }
|
#
3.4 广告曝光去重
@Service public class AdStatisticsService { @Autowired private HyperLogLogService hyperLogLogService;
public void recordAdExposure(String adId, Long userId, String deviceId) { String today = LocalDate.now().format(DateTimeFormatter.BASIC_ISO_DATE); hyperLogLogService.addVisit("ad:uv:user:" + adId + ":" + today, String.valueOf(userId)); hyperLogLogService.addVisit("ad:uv:device:" + adId + ":" + today, deviceId); }
public Long getAdDailyUV(String adId, String date) { return hyperLogLogService.getUV("ad:uv:user:" + adId + ":" + date); }
public Long getAdTotalUV(String adId) { Set<String> keys = redis.keys("ad:uv:user:" + adId + ":*"); if (keys == null || keys.isEmpty()) { return 0L; } return hyperLogLogService.mergeUV("ad:uv:user:" + adId + ":total", keys.toArray(new String[0])); } }
|
四、误差分析
#
4.1 误差特性
HyperLogLog的标准误差:约0.81%
这意味着: - 真实值100万,估算值可能在99.19万~100.81万之间 - 真实值1亿,估算值可能在9919万~1.0081亿之间
误差特性: - 小基数时(< 2.5 * m),误差较大,Redis会使用Linear Counting修正 - 中等基数时,标准误差约0.81% - 大基数时,误差稳定
|
#
4.2 误差测试
@Component public class HyperLogLogAccuracyTest { @Autowired private StringRedisTemplate redis; public void testAccuracy() { String key = "hll:accuracy:test"; int[] testSizes = {1000, 10000, 100000, 1000000}; for (int size : testSizes) { redis.delete(key); for (int i = 0; i < size; i++) { redis.opsForHyperLogLog().add(key, "user_" + i); } Long estimate = redis.opsForHyperLogLog().size(key); double error = Math.abs(estimate - size) * 100.0 / size; System.out.printf("真实值: %d, 估算值: %d, 误差: %.2f%%%n", size, estimate, error); } } }
|
五、与Set对比
| 特性 |
HyperLogLog |
Set |
| 内存占用 |
12KB(固定) |
与元素数成正比 |
| 100万元素 |
12KB |
~100MB |
| 1亿元素 |
12KB |
~10GB |
| 精确度 |
概率估算(0.81%误差) |
精确 |
| 获取元素 |
不支持 |
SMEMBERS |
| 判断存在 |
不支持 |
SISMEMBER |
| 合并 |
支持(内存不变) |
SUNION(内存增加) |
六、最佳实践
#
6.1 适用场景
适合使用HyperLogLog: - UV统计(独立访客) - 搜索去重统计 - 广告曝光去重 - IP去重统计 - 大数据量去重计数
不适合使用HyperLogLog: - 需要精确计数的场景(如订单数) - 需要获取具体元素的场景 - 基数很小的场景(< 1000)
|
#
6.2 优化建议
@Service public class HyperLogLogBestPractice {
public void addWithExpire(String key, String element, int days) { redis.opsForHyperLogLog().add(key, element); redis.expire(key, days, TimeUnit.DAYS); }
public void mergeHistory(String pattern, String destination) { Set<String> keys = redis.keys(pattern); if (keys != null && !keys.isEmpty()) { redis.opsForHyperLogLog().union(destination, keys.toArray(new String[0])); redis.delete(keys); } }
public void adaptiveCount(String key, String element) { String setKey = key + ":set"; String hllKey = key + ":hll"; Long setSize = redis.opsForSet().size(setKey); if (setSize != null && setSize < 10000) { redis.opsForSet().add(setKey, element); } else { if (setSize != null && setSize > 0) { Set<String> members = redis.opsForSet().members(setKey); if (members != null) { redis.opsForHyperLogLog().add(hllKey, members.toArray(new String[0])); } redis.delete(setKey); } redis.opsForHyperLogLog().add(hllKey, element); } } }
|
七、总结
| 指标 |
数值 |
| 内存占用 |
12KB(固定) |
| 标准误差 |
~0.81% |
| 最大统计基数 |
约 2^64 |
| 单个桶数 |
16384 |
HyperLogLog的核心价值:
- 极致的空间效率:12KB统计百亿级数据
- 常数查询时间:O(1)获取估算值
- 可合并性:多个HLL可合并统计
使用建议:
- 大数据量去重计数首选
- 可接受约1%误差的场景
- 不需要获取具体元素的场景
- 配合过期时间自动清理
核心要点
String:简单的键值对,适合缓存、计数器
Hash:存储对象属性,适合用户信息、配置
List:有序列表,适合消息队列、最新列表
Set:无序去重,适合共同好友、抽奖
ZSet:有序集合,适合排行榜、积分系统
总结
选择合适的数据结构是使用 Redis 的关键。在实际项目中,根据业务需求选择合适的类型,可以提升性能和开发效率。