Top K Problem Solutions Explained: Heap Sort, Quickselect, Bitmap, and Hash Divide-and-Conquer | 极客日志


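The Top K problem is a common interview scenario for processing massive data sets. Depending on the data size and the memory limit, there are four main approaches. The min-heap approach suits streaming data: it maintains K candidate elements and runs in O(N log K) time. Quickselect suits data that fits entirely in memory, with an average time complexity of O(N). The bitmap approach suits data with a known value range and no duplicates, and it saves a great deal of space. Hash divide-and-conquer combined with a heap counts frequencies and suits very large data sets with many duplicates; partitioning keeps memory usage bounded. Choosing the right combination of algorithms for the workload resolves the performance bottleneck.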

dehua dong · Published 2026/3/28 · Updated 2026/7/2136 views

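The Top K problem

In interviews, the "Top K problem" shows up in many guises. Consider these familiar scenarios:

Given 100 integers, find the 10 largest.
Given 1 billion unsorted integers, find the 10 largest.
Given 1 billion distinct integers, find the 10 largest.
Given 10 files of 10 GB each, all containing IP addresses, find the 10 most frequently seen addresses.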

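These look like variations on one theme, but the right solution differs widely with the data volume, the memory limit, and the query frequency. A careless brute-force full sort can push the system straight into an OOM (Out Of Memory) error and become the performance bottleneck.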
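
To make the OOM risk concrete, here is a rough back-of-the-envelope estimate for the 1 billion integer scenario. It assumes the values are held in a primitive `int[]` at 4 bytes each and ignores JVM overhead and any sorting workspace; the class and method names are illustrative only.

```java
class MemoryEstimate {
    /** Bytes needed to hold `count` primitive ints (4 bytes each), ignoring array headers. */
    static long intArrayBytes(long count) {
        return count * Integer.BYTES;
    }

    public static void main(String[] args) {
        long bytes = intArrayBytes(1_000_000_000L);
        // 4,000,000,000 bytes, roughly 3.7 GiB, before the sort itself needs any memory
        System.out.printf("%d bytes (%.2f GiB)%n", bytes, bytes / (1024.0 * 1024 * 1024));
    }
}
```

Boxed collections such as `List<Integer>` would need considerably more than this, which is why the approaches below avoid holding all the data at once where possible.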

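The sections below summarize the common approaches. Once you understand these building blocks and can combine them, you can adapt to whichever variant you are given.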

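Approach	Time complexity	Space complexity	When to use; pros and cons
Min-heap	`O(N log K)`	`O(K)`	The most general approach. Suits massive streaming data and uses very little memory. Use a min-heap for the K largest and a max-heap for the K smallest.
Quickselect	Average `O(N)`, worst case `O(N^2)`	`O(log N)`	Suits data that can be fully loaded into memory. Very fast, but the resulting Top K is not internally ordered.
Bitmap	`O(N + MaxValue)`	Depends on the value range	Suits massive data with a known value range and no duplicates. Extremely memory-efficient, but a poor fit for sparse data.
Hash divide-and-conquer + heap	`O(N)`	Depends on the partition size	Suits very large data sets with many duplicates (for example, word-frequency counting). The most common choice in industry (a single-machine version of the MapReduce idea).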
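
The table notes that the K smallest elements call for a max-heap. A minimal sketch of that mirrored case follows; the class name `SmallestK` and the sample data are illustrative assumptions, not part of the original article.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class SmallestK {
    /** Returns the k smallest values in ascending order, keeping at most k elements in the heap. */
    static List<Integer> smallestK(int[] data, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Max-heap: the top is the largest of the current candidates, so it is the first to evict.
        PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Collections.reverseOrder());
        for (int num : data) {
            if (maxHeap.size() < k) {
                maxHeap.offer(num);
            } else if (num < maxHeap.peek()) {
                maxHeap.poll();
                maxHeap.offer(num);
            }
        }
        List<Integer> result = new ArrayList<>(maxHeap);
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        int[] data = {4, 1, 5, 8, 7, 2, 3, 0, 6, 9};
        System.out.println(smallestK(data, 3)); // prints [0, 1, 2]
    }
}
```

The heap type is always the opposite of what you are looking for, because the heap top must be the weakest candidate currently kept.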

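Heap sort approach

This section is about the heap approach rather than quicksort or Shell sort. Sorting everything with heap sort or quicksort costs O(n log n) (on average, for quicksort), but for Top K a heap has an extra advantage: you only need to maintain a min-heap holding k numbers. Why a min-heap? Its top is the smallest of the current k candidates, so it is exactly the element to evict. When a new number is larger than the heap top, remove the top and insert the new number. Each update costs O(log k), so the whole pass takes O(N log K) time and O(K) space.

The code looks like this: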

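import java.util.PriorityQueue;

public class TopKExample {
    public static void main(String[] args) {
        // Number of largest elements to keep
        final int topK = 3;
        // Sample input (same values as the quickselect example in this article)
        int[] arr = {4, 1, 5, 8, 7, 2, 3, 0, 6, 9};
        // PriorityQueue is a min-heap by default: the smallest element sits at the top
        PriorityQueue<Integer> pq = new PriorityQueue<>();

        for (int num : arr) {
            pq.offer(num);
            // Once the heap holds more than K elements, evict the smallest one
            if (pq.size() > topK) {
                pq.poll();
            }
        }

        // Drain the heap: prints the Top K in ascending order (7, 8, 9 for this input)
        while (!pq.isEmpty()) {
            System.out.println(pq.poll());
        }
    }
}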
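
The listing above inserts every number and then trims the heap. For a true stream, a common variation compares against the heap top first, so values that cannot make the Top K never touch the heap. The following is a sketch under that assumption; `StreamingTopK`, `add`, and `snapshot` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class StreamingTopK {
    private final int k;
    private final PriorityQueue<Integer> minHeap = new PriorityQueue<>();

    StreamingTopK(int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
    }

    /** Feeds one value from the stream; rejected values cost only a comparison with the heap top. */
    void add(int value) {
        if (minHeap.size() < k) {
            minHeap.offer(value);
        } else if (value > minHeap.peek()) {
            minHeap.poll();        // evict the smallest current candidate
            minHeap.offer(value);  // admit the new, larger value
        }
    }

    /** Returns the current Top K, largest first, without disturbing the heap. */
    List<Integer> snapshot() {
        List<Integer> result = new ArrayList<>(minHeap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        StreamingTopK tracker = new StreamingTopK(3);
        for (int value : new int[]{4, 1, 5, 8, 7, 2, 3, 0, 6, 9}) {
            tracker.add(value);
        }
        System.out.println(tracker.snapshot()); // prints [9, 8, 7]
    }
}
```

Because the tracker never holds more than k values, it can be queried at any point while the stream is still arriving.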

Quickselect approach (quicksort-like)

import java.util.Arrays;
import java.util.Random;

class Solution {
    private static final Random random = new Random(System.currentTimeMillis());

    // Core method: returns the K largest elements
    public static int[] findTopKLargest(int[] nums, int k) {
        // 1. Validate the arguments
        if (k <= 0 || k > nums.length) {
            throw new IllegalArgumentException("Invalid K: must satisfy 0 < K <= array length");
        }
        // 2. Copy the array so the caller's input is not modified
        int[] copyNums = Arrays.copyOf(nums, nums.length);
        // 3. Partition so that the K-th largest element ends up at index `target`
        int target = copyNums.length - k;
        findKthLargest(copyNums, k); // rearranges copyNums in place
        // 4. Everything from `target` to the end is the K largest elements, in no particular order
        int[] topK = Arrays.copyOfRange(copyNums, target, copyNums.length);
        // Optional: sort if the caller needs the Top K in order
        Arrays.sort(topK); // ascending; reverse the array for descending order
        return topK;
    }

    // Finds the K-th largest element (quickselect); rearranges nums in place
    public static int findKthLargest(int[] nums, int k) {
        int left = 0, right = nums.length - 1;
        int target = nums.length - k;
        while (true) {
            int pivotIndex = partition(nums, left, right);
            if (pivotIndex > target) {
                right = pivotIndex - 1;
            } else if (pivotIndex < target) {
                left = pivotIndex + 1;
            } else {
                return nums[pivotIndex];
            }
        }
    }

    // Partitions around a random pivot using two pointers; returns the pivot's final index
    public static int partition(int[] nums, int left, int right) {
        int randomIndex = left + random.nextInt(right - left + 1);
        swap(nums, left, randomIndex);
        int pivot = nums[left];
        int le = left + 1;
        int ge = right;
        while (true) {
            while (le <= ge && nums[le] < pivot) {
                le++;
            }
            while (le <= ge && nums[ge] > pivot) {
                ge--;
            }
            if (le > ge) break;
            swap(nums, le, ge);
            le++;
            ge--;
        }
        swap(nums, left, ge);
        return ge;
    }

    // Swaps two elements of the array
    public static void swap(int[] nums, int index1, int index2) {
        int tmp = nums[index1];
        nums[index1] = nums[index2];
        nums[index2] = tmp;
    }

    // Usage example
    public static void main(String[] args) {
        int[] nums = {4, 1, 5, 8, 7, 2, 3, 0, 6, 9};
        int k = 3;
        int[] topK = findTopKLargest(nums, k);
        System.out.println("Top " + k + " largest (ascending): " + Arrays.toString(topK));
        // Reverse the array for descending output
        reverse(topK);
        System.out.println("Top " + k + " largest (descending): " + Arrays.toString(topK));
    }

    // Helper: reverses an array in place (optional)
    private static void reverse(int[] arr) {
        int left = 0, right = arr.length - 1;
        while (left < right) {
            int tmp = arr[left];
            arr[left] = arr[right];
            arr[right] = tmp;
            left++;
            right--;
        }
    }
}

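Using a bitmap

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class BitmapTopK {
    // Assumes every value lies in [0, MAX_VALUE), that is, 0 up to 100 million
    private static final int MAX_VALUE = 100_000_000;

    public static List<Integer> getTopK(int[] data, int k) {
        BitSet bitmap = new BitSet(MAX_VALUE);
        // 1. Fill the bitmap: O(n). Duplicate values collapse into a single bit.
        for (int num : data) {
            if (num < 0 || num >= MAX_VALUE) {
                throw new IllegalArgumentException("Value out of range: " + num);
            }
            bitmap.set(num);
        }
        // 2. Walk the set bits from high to low until K values are collected: O(value range) in the worst case
        List<Integer> result = new ArrayList<>();
        for (int i = bitmap.previousSetBit(MAX_VALUE - 1); i >= 0 && result.size() < k; i = bitmap.previousSetBit(i - 1)) {
            result.add(i);
        }
        return result;
    }
}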
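
To show what a bitmap does under the hood, here is a small hand-rolled sketch that stores one bit per value in a `long[]` and supports a range that includes negative numbers by subtracting an offset. At one bit per value, a range of 100 million values needs about 12.5 MB. `TinyBitmap` and its methods are hypothetical names for illustration; this is not how `java.util.BitSet` is implemented internally, only the same idea in miniature.

```java
import java.util.ArrayList;
import java.util.List;

class TinyBitmap {
    private final long[] words; // 64 presence flags per long
    private final int min;      // value represented by bit 0
    private final int size;     // number of representable values

    TinyBitmap(int min, int max) {
        long range = (long) max - min + 1;
        if (range <= 0 || range > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("unsupported range");
        }
        this.min = min;
        this.size = (int) range;
        this.words = new long[(int) ((range + 63) >>> 6)];
    }

    /** Marks a value as present; setting it twice has no further effect. */
    void set(int value) {
        int bit = index(value);
        words[bit >>> 6] |= 1L << (bit & 63);
    }

    boolean get(int value) {
        int bit = index(value);
        return (words[bit >>> 6] & (1L << (bit & 63))) != 0;
    }

    /** Returns up to k of the largest values present, largest first, scanning bits from high to low. */
    List<Integer> topK(int k) {
        List<Integer> result = new ArrayList<>();
        for (int bit = size - 1; bit >= 0 && result.size() < k; bit--) {
            if ((words[bit >>> 6] & (1L << (bit & 63))) != 0) {
                result.add(min + bit);
            }
        }
        return result;
    }

    private int index(int value) {
        if (value < min || (long) value - min >= size) {
            throw new IllegalArgumentException("out of range: " + value);
        }
        return value - min;
    }

    public static void main(String[] args) {
        TinyBitmap bitmap = new TinyBitmap(-10, 10);
        for (int value : new int[]{-5, 3, 10, -1, 3}) {
            bitmap.set(value);
        }
        System.out.println(bitmap.topK(2)); // prints [10, 3]
    }
}
```

The memory cost depends only on the width of the value range, never on how many values are inserted, which is why the approach wastes space when the data is sparse.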

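Counting duplicates with a hash, then finding the Top K by count (the k most frequent values)

Solution 1:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;

public class TopKWithHashAndHeap {
    public static void main(String[] args) throws IOException {
        // Simulated large file: big_data.txt holds one integer per line, with repeats, e.g. 1, 3, 5, 3, 9, 5, 5, 7, 9, 9, 9...
        String filePath = "big_data.txt";
        int topK = 3;
        // Step 1: count frequencies with a hash map while streaming the file
        Map<Integer, Integer> freqMap = countFrequency(filePath);
        // Step 2: use a min-heap to find the K most frequent values
        List<Integer> topKResult = findTopKByFrequency(freqMap, topK);
        System.out.println("Top " + topK + " most frequent integers: " + topKResult);
        // Example output: [9, 5, 3] (assuming 9 appears 4 times, 5 appears 3 times, and 3 appears 2 times)
    }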

    // Streams the file and counts how often each value appears (core step: hash-based counting)
    private static Map<Integer, Integer> countFrequency(String filePath) throws IOException {
        Map<Integer, Integer> freqMap = new HashMap<>();
        // Only one line is held at a time, but the map still needs room for every distinct value
        try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;
                int num = Integer.parseInt(line);
                // Add 1 to the existing count, or start at 1 if the value is new
                freqMap.merge(num, 1, Integer::sum);
            }
        }
        return freqMap;
    }

    // Finds the Top K by frequency (min-heap implementation)
    private static List<Integer> findTopKByFrequency(Map<Integer, Integer> freqMap, int k) {
        if (k <= 0 || freqMap.isEmpty()) {
            return Collections.emptyList();
        }
        // Min-heap ordered by frequency: the top is the least frequent candidate kept so far
        // Each heap entry is an array: [frequency, value]
        PriorityQueue<int[]> minHeap = new PriorityQueue<>(Comparator.comparingInt(a -> a[0]));
        for (Map.Entry<Integer, Integer> entry : freqMap.entrySet()) {
            int num = entry.getKey();
            int freq = entry.getValue();
            minHeap.offer(new int[]{freq, num});
            // Beyond K entries, evict the top (the least frequent)
            if (minHeap.size() > k) {
                minHeap.poll();
            }
        }
        // A min-heap pops entries from least to most frequent, so the list must be reversed
        List<Integer> topK = new ArrayList<>();
        while (!minHeap.isEmpty()) {
            topK.add(minHeap.poll()[1]);
        }
        Collections.reverse(topK); // most frequent first
        return topK;
    }
}
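
The listing above keeps every distinct value in a single `HashMap`, so it only works while that map fits in memory. When it does not, the input can be split by hash into partitions that are processed one at a time. A rough sketch of how many partitions that takes follows, assuming a fixed per-partition memory budget; the 100 GiB total (as in the 10 files of 10 GB scenario) and the 512 MiB budget are illustrative figures, and `PartitionSizing` is a hypothetical helper.

```java
class PartitionSizing {
    /** Smallest partition count such that an even split stays within the budget (ceiling division). */
    static long partitionsNeeded(long totalBytes, long budgetBytes) {
        if (totalBytes < 0 || budgetBytes <= 0) {
            throw new IllegalArgumentException("totalBytes must be >= 0 and budgetBytes > 0");
        }
        return totalBytes / budgetBytes + (totalBytes % budgetBytes == 0 ? 0 : 1);
    }

    public static void main(String[] args) {
        long total = 100L << 30;   // 100 GiB of input
        long budget = 512L << 20;  // 512 MiB per partition (assumed budget)
        System.out.println(partitionsNeeded(total, budget)); // prints 200
    }
}
```

Hash partitions are rarely perfectly even, so in practice it is safer to pick a count somewhat above this lower bound.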

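Solution 2:

import java.io.*;
import java.util.*;

/**
 * Divide-and-conquer (hash partitioning) for the Top K by frequency problem on massive data.
 * Core flow: hash partitioning -> local counting -> local Top K -> global merge
 */
public class DistributedTopK {
    // Number of partitions (tune to the machine and the file size, e.g. 10 partitions)
    private static final int PARTITION_NUM = 10;
    // Prefix of the temporary partition files
    private static final String TEMP_FILE_PREFIX = "temp_partition_";
    // Suffix of the temporary partition files
    private static final String TEMP_FILE_SUFFIX = ".txt";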
    // Entry point for testing
    public static void main(String[] args) {
        // Configuration
        String sourceFilePath = "big_data.txt"; // massive data file (one integer per line, with repeats)
        int topK = 3; // the K in Top K
        try {
            // Step 1: hash partitioning
            System.out.println("Starting hash partitioning...");
            hashPartition(sourceFilePath);
            // Steps 2 and 3: process every partition and collect each local Top K
            System.out.println("Processing partitions and computing local Top K...");
            List<List<int[]>> allLocalTopK = new ArrayList<>();
            for (int i = 0; i < PARTITION_NUM; i++) {
                String partitionPath = TEMP_FILE_PREFIX + i + TEMP_FILE_SUFFIX;
                List<int[]> localTopK = processSinglePartition(partitionPath, topK);
                allLocalTopK.add(localTopK);
            }
            // Step 4: global merge to obtain the global Top K
            System.out.println("Merging globally and computing the final Top K...");
            List<Integer> globalTopK = mergeGlobalTopK(allLocalTopK, topK);
            // Print the result
            System.out.println("Global top " + topK + " most frequent integers: " + globalTopK);
            // Clean up the temporary files
            cleanTempFiles();
            System.out.println("Temporary files cleaned up");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Step 1: hash partitioning. Spreads the massive data across N small files
     * (equal values always land in the same file).
     * @param sourceFilePath path of the massive source data file
     * @throws IOException on read or write failure
     */
    public static void hashPartition(String sourceFilePath) throws IOException {
        // Open one writer per partition
        List<BufferedWriter> writers = new ArrayList<>();
        try {
            for (int i = 0; i < PARTITION_NUM; i++) {
                String filePath = TEMP_FILE_PREFIX + i + TEMP_FILE_SUFFIX;
                writers.add(new BufferedWriter(new FileWriter(filePath)));
            }
            // Stream the source file line by line and route each value by hash(x) % N
            try (BufferedReader br = new BufferedReader(new FileReader(sourceFilePath))) {
                String line;
                while ((line = br.readLine()) != null) {
                    line = line.trim();
                    if (line.isEmpty()) continue;
                    int num = Integer.parseInt(line);
                    // Key step: the same num always maps to the same file; floorMod keeps the index non-negative
                    int partitionIndex = Math.floorMod(num, PARTITION_NUM);
                    BufferedWriter writer = writers.get(partitionIndex);
                    writer.write(Integer.toString(num));
                    writer.newLine();
                }
            }
        } finally {
            // Close every writer, even if reading or writing failed
            for (BufferedWriter writer : writers) {
                writer.close();
            }
        }
    }

    /**
     * Steps 2 and 3: local processing. Counts frequencies in one partition file and finds its local Top K.
     * @param partitionFilePath path of the partition file
     * @param k the K for the local Top K
     * @return the partition's Top K as [frequency, value] pairs
     */
    public static List<int[]> processSinglePartition(String partitionFilePath, int k) throws IOException {
        // 2.1 Local counting: a HashMap holds the frequencies for this partition only
        Map<Integer, Integer> freqMap = new HashMap<>();
        try (BufferedReader br = new BufferedReader(new FileReader(partitionFilePath))) {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;
                int num = Integer.parseInt(line);
                freqMap.merge(num, 1, Integer::sum);
            }
        }
        // 2.2 Local Top K: a min-heap keeps the K most frequent values of this partition
        PriorityQueue<int[]> minHeap = new PriorityQueue<>(Comparator.comparingInt(a -> a[0]));
        for (Map.Entry<Integer, Integer> entry : freqMap.entrySet()) {
            int num = entry.getKey();
            int freq = entry.getValue();
            minHeap.offer(new int[]{freq, num});
            if (minHeap.size() > k) {
                minHeap.poll();
            }
        }
        // Convert to a List and return (the local Top K)
        List<int[]> localTopK = new ArrayList<>();
        while (!minHeap.isEmpty()) {
            localTopK.add(minHeap.poll());
        }
        return localTopK;
    }

    /**
     * Step 4: global merge. Combines the local Top K of every partition into the global Top K.
     * @param allLocalTopK the local Top K lists of all partitions
     * @param k the K for the global Top K
     * @return the global Top K values, most frequent first
     */
    public static List<Integer> mergeGlobalTopK(List<List<int[]>> allLocalTopK, int k) {
        // Min-heap ordered by frequency, used to keep the global Top K
        PriorityQueue<int[]> globalHeap = new PriorityQueue<>(Comparator.comparingInt(a -> a[0]));
        // Feed every local Top K entry into the global heap
        for (List<int[]> localTopK : allLocalTopK) {
            for (int[] item : localTopK) {
                int freq = item[0];
                int num = item[1];
                globalHeap.offer(new int[]{freq, num});
                if (globalHeap.size() > k) {
                    globalHeap.poll();
                }
            }
        }
        // The heap pops from least to most frequent; reverse to get most frequent first
        List<Integer> globalTopK = new ArrayList<>();
        while (!globalHeap.isEmpty()) {
            globalTopK.add(globalHeap.poll()[1]);
        }
        Collections.reverse(globalTopK);
        return globalTopK;
    }

    /**
     * Helper: deletes the temporary partition files.
     */
    public static void cleanTempFiles() {
        for (int i = 0; i < PARTITION_NUM; i++) {
            String filePath = TEMP_FILE_PREFIX + i + TEMP_FILE_SUFFIX;
            File file = new File(filePath);
            if (file.exists()) {
                file.delete();
            }
        }
    }
}
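
The listing above partitions integers. For the IP address scenario from the introduction, the keys are strings, so the partition index would come from the string's hash instead. Because equal keys always land in the same partition, each partition's local counts are already global counts, which is what makes merging the local Top K lists valid. A minimal sketch, assuming IPs are kept as plain strings; `KeyPartitioner` is a hypothetical helper and the sample addresses are made up.

```java
class KeyPartitioner {
    /** Maps a key to a partition index in [0, partitions). */
    static int partitionIndex(String key, int partitions) {
        if (partitions <= 0) {
            throw new IllegalArgumentException("partitions must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative;
        // Math.abs(hashCode()) would still be negative for Integer.MIN_VALUE.
        return Math.floorMod(key.hashCode(), partitions);
    }

    public static void main(String[] args) {
        String[] ips = {"10.0.0.1", "192.168.1.7", "10.0.0.1"};
        for (String ip : ips) {
            // The two occurrences of 10.0.0.1 print the same partition number
            System.out.println(ip + " -> partition " + partitionIndex(ip, 10));
        }
    }
}
```

Only one partition's frequency map is in memory at any time, so peak memory is bounded by the largest partition's distinct keys rather than by the total input size.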

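The Top K problem

Heap sort approach

Quickselect approach (quicksort-like)

Using a bitmap

Counting duplicates with a hash, then finding the Top K by count (the k most frequent values)

Solution 1:

Solution 2:

If 100 GB is split into small files, the total is still 100 GB. Won't that still OOM?

Summary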
