Building a Java Application Monitoring System with Prometheus + Grafana | 极客日志


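AI-generated summary: How to build a data monitoring system for Java applications with Prometheus and Grafana. The article first outlines the concept of observability, then walks through installing Prometheus and Grafana and connecting the two. The core part shows how to integrate Actuator and Micrometer into a Spring Boot project and use a custom monitoring context, a metrics collector, and a listener to capture business metrics for AI model calls (request count, token consumption, response time). It closes with the Prometheus configuration and a way to import a Grafana dashboard, completing the loop from data collection to visualization.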

城市逃兵 · Published 2026/3/30 · Updated 2026/5/2225 views

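1. What Is Observability

**Observability** is a modern operations concept: a running system should expose its state in a way that is comprehensive, deep, and understandable. By collecting and analyzing a system's observability data and building an all-round monitoring and analysis setup on top of it, an operations team can see the system's internal health, performance, and causes of failure in real time, even in a complex and dynamic IT environment. With that information the team can make accurate decisions, locate problems quickly, carry out preventive maintenance, and keep optimizing.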

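2. Monitoring with Prometheus + Grafana

Prometheus

Prometheus is an open-source monitoring system designed for collecting, storing, and querying time-series data.

The core idea of Prometheus is to store all monitoring data as time series. In its data model, every time series is uniquely identified by a metric name and a set of labels. For example, http_requests_total{method="POST", handler="/api/xxx"} is the time series that counts the total number of POST requests to that endpoint.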
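To make the data model concrete, here is a minimal sketch in plain Java. It is not how Prometheus is implemented; `SeriesId` is a hypothetical name used only for illustration. It shows that a metric name plus a label set acts as the identity of a series, regardless of the order in which the labels are written.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical illustration: a time series is identified by metric name + label set. */
final class SeriesId {
    private final String name;
    private final TreeMap<String, String> labels; // sorted, so label order does not matter

    SeriesId(String name, Map<String, String> labels) {
        this.name = name;
        this.labels = new TreeMap<>(labels);
    }

    /** Renders the identity in the familiar name{k="v",...} notation. */
    String render() {
        String body = labels.entrySet().stream()
                .map(e -> e.getKey() + "=\"" + e.getValue() + "\"")
                .collect(Collectors.joining(","));
        return name + "{" + body + "}";
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SeriesId)) {
            return false;
        }
        SeriesId other = (SeriesId) o;
        return name.equals(other.name) && labels.equals(other.labels);
    }

    @Override
    public int hashCode() {
        return 31 * name.hashCode() + labels.hashCode();
    }

    public static void main(String[] args) {
        SeriesId a = new SeriesId("http_requests_total", Map.of("method", "POST", "handler", "/api/xxx"));
        SeriesId b = new SeriesId("http_requests_total", Map.of("handler", "/api/xxx", "method", "POST"));
        SeriesId c = new SeriesId("http_requests_total", Map.of("method", "GET", "handler", "/api/xxx"));
        System.out.println(a.render());
        System.out.println(a.equals(b)); // same name and labels: the same series
        System.out.println(a.equals(c)); // one label value differs: a different series
    }
}
```

The practical consequence is that every distinct combination of label values becomes its own series, which is worth keeping in mind when choosing labels.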

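As a result, Prometheus handles the time-range queries typical of monitoring efficiently, such as the average response time of each endpoint over the past hour, or the list of servers whose CPU usage exceeds 80%.

Open the Prometheus download page and pick the build for your operating system and architecture.

Go into the corresponding folder and double-click prometheus.exe to start it.

Take a look at the default configuration file, prometheus.yml: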

# Global configuration
global:
  scrape_interval: 15s # Global scrape interval: scrape every 15 seconds
  evaluation_interval: 15s # Rule evaluation interval
# Alertmanager configuration (optional)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Rule files
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# Scrape configuration
scrape_configs:
  # Prometheus monitoring itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]


Implementation

1. Add the dependencies

<!-- Application monitoring endpoints: health checks, metrics collection, and more -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- Prometheus support: exposes application metrics in the Prometheus format -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

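2. Write the configuration

# Application monitoring endpoint configuration
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus # Expose the health, info, and Prometheus metrics endpoints
  endpoint:
    health:
      show-details: always # Show health check details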

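3. Monitoring context

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

import java.io.Serial;
import java.io.Serializable;

/**
 * Monitoring context.
 * Carries context information (user ID, app ID, and so on) through the monitoring code.
 */
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class MonitorContext implements Serializable {
    /**
     * User ID
     */
    private String userId;
    /**
     * App ID
     */
    private String appId;
    @Serial
    private static final long serialVersionUID = 1L;
}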

import lombok.extern.slf4j.Slf4j;

/**
 * Holder for the monitoring context.
 * Makes the current MonitorContext (user ID, app ID, and so on) available to monitoring code.
 */
@Slf4j
public class MonitorContextHolder {
    /**
     * Thread-local variable that stores the monitoring context of the current thread
     */
    private static final ThreadLocal<MonitorContext> CONTEXT_HOLDER = new ThreadLocal<>();

    /**
     * Set the monitoring context
     */
    public static void setContext(MonitorContext context) {
        CONTEXT_HOLDER.set(context);
    }

    /**
     * Get the current monitoring context
     */
    public static MonitorContext getContext() {
        return CONTEXT_HOLDER.get();
    }

    /**
     * Clear the monitoring context
     */
    public static void clearContext() {
        CONTEXT_HOLDER.remove();
    }
}
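Because the holder is built on a ThreadLocal, the stored context is only visible on the thread that set it. The following JDK-only sketch illustrates that behavior; `ThreadLocalVisibilityDemo` and its method name are hypothetical and not part of the project code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Sketch: a value stored in a ThreadLocal is only visible on the thread that set it. */
final class ThreadLocalVisibilityDemo {
    private static final ThreadLocal<String> HOLDER = new ThreadLocal<>();

    /** Sets a value on the calling thread, then reads the holder from a worker thread. */
    static String readFromOtherThread(String value) {
        HOLDER.set(value);
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Callable<String> task = HOLDER::get; // runs on the worker thread
            return worker.submit(task).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            worker.shutdown();
            HOLDER.remove(); // clear on the same thread that set the value
        }
    }

    public static void main(String[] args) {
        // The worker thread has its own empty slot, so it does not see "user-1".
        System.out.println(readFromOtherThread("user-1"));
    }
}
```

This is why code that may run on another thread needs the context handed to it explicitly instead of reading the holder there.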

@Override
public Flux<String> chatToGen(Long appId, String message, User loginUser) {
    // 1. Validate parameters
    ThrowUtils.throwIf(appId == null || appId <= 0, ErrorCode.PARAMS_ERROR, "App ID must not be empty");
    ThrowUtils.throwIf(StrUtil.isBlank(message), ErrorCode.PARAMS_ERROR, "User message must not be empty");
    // 2. Look up the app
    App app = this.getById(appId);
    ThrowUtils.throwIf(app == null, ErrorCode.NOT_FOUND_ERROR, "App does not exist");
    // 3. Check that the user may access the app; only the owner can generate a conversation
    if (!app.getUserId().equals(loginUser.getId())) {
        throw new BusinessException(ErrorCode.NO_AUTH_ERROR, "No permission to access this app");
    }
    // 4. Resolve the app's chat generation type
    String chatGenType = app.getChatGenType();
    ChatTypeEnum chatTypeEnum = ChatTypeEnum.getEnumByValue(chatGenType);
    if (chatTypeEnum == null) {
        throw new BusinessException(ErrorCode.SYSTEM_ERROR, "Unsupported chat type");
    }
    // 5. Validation passed: add the user message to the chat history
    chatHistoryService.addChatMessage(appId, message, ChatHistoryMessageTypeEnum.USER.getValue(), loginUser.getId());
    // 6. Set the monitoring context (stored in a ThreadLocal on the calling thread)
    MonitorContextHolder.setContext(
        MonitorContext.builder()
            .userId(loginUser.getId().toString())
            .appId(appId.toString())
            .build()
    );
    // 7. Call the AI to generate a reply
    Flux<String> contentFlux = aiChatFacade.generateAndSaveStreamFacade(message, chatTypeEnum, appId);
    // 8. Collect the AI response and save it to the chat history once the stream completes
    StringBuilder aiResponseBuilder = new StringBuilder();
    return contentFlux
        .map(chunk -> {
            // Collect the AI response content
            aiResponseBuilder.append(chunk);
            return chunk;
        })
        .doOnComplete(() -> {
            // After the streaming response completes, add the AI message to the chat history
            String aiResponse = aiResponseBuilder.toString();
            if (StrUtil.isNotBlank(aiResponse)) {
                chatHistoryService.addChatMessage(appId, aiResponse, ChatHistoryMessageTypeEnum.AI.getValue(), loginUser.getId());
            }
        })
        .doOnError(error -> {
            // If the AI reply fails, record the error message as well
            String errorMessage = "AI reply failed: " + error.getMessage();
            chatHistoryService.addChatMessage(appId, errorMessage, ChatHistoryMessageTypeEnum.AI.getValue(), loginUser.getId());
        })
        .doFinally(signalType -> {
            // Clean up when the stream ends (success, failure, or cancel).
            // Note: this runs on whichever thread delivers the terminal signal,
            // which may differ from the thread that called setContext above.
            MonitorContextHolder.clearContext();
        });
}

4. Metrics collector

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import jakarta.annotation.Resource; // assumes Spring Boot 3; on Spring Boot 2 use javax.annotation.Resource
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

import java.time.Duration;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * AI model metrics collector.
 * Collects business data and turns it into Prometheus metrics.
 */
@Component
@Slf4j
public class AiModelMetricsCollector {
    @Resource
    private MeterRegistry meterRegistry;

    // Cache the meters that were already created to avoid re-creating them (one cache per meter type)
    private final ConcurrentMap<String, Counter> requestCountersCache = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, Counter> errorCountersCache = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, Counter> tokenCountersCache = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, Timer> responseTimersCache = new ConcurrentHashMap<>();

    /**
     * Record a request
     */
    public void recordRequest(String userId, String appId, String modelName, String status) {
        String key = String.format("%s_%s_%s_%s", userId, appId, modelName, status);
        Counter counter = requestCountersCache.computeIfAbsent(key, k ->
            Counter.builder("ai_model_requests_total")
                .description("Total number of AI model requests")
                .tag("user_id", userId)
                .tag("app_id", appId)
                .tag("model_name", modelName)
                .tag("status", status)
                .register(meterRegistry)
        );
        counter.increment();
    }

    /**
     * Record an error
     */
    public void recordError(String userId, String appId, String modelName, String errorMessage) {
        // Tag values must not be null, so fall back to a placeholder.
        // Note: every distinct error message becomes its own time series.
        String safeMessage = errorMessage == null ? "unknown" : errorMessage;
        String key = String.format("%s_%s_%s_%s", userId, appId, modelName, safeMessage);
        Counter counter = errorCountersCache.computeIfAbsent(key, k ->
            Counter.builder("ai_model_errors_total")
                .description("Number of AI model errors")
                .tag("user_id", userId)
                .tag("app_id", appId)
                .tag("model_name", modelName)
                .tag("error_message", safeMessage)
                .register(meterRegistry)
        );
        counter.increment();
    }

    /**
     * Record token consumption
     */
    public void recordTokenUsage(String userId, String appId, String modelName, String tokenType, long tokenCount) {
        String key = String.format("%s_%s_%s_%s", userId, appId, modelName, tokenType);
        Counter counter = tokenCountersCache.computeIfAbsent(key, k ->
            Counter.builder("ai_model_tokens_total")
                .description("Total AI model tokens consumed")
                .tag("user_id", userId)
                .tag("app_id", appId)
                .tag("model_name", modelName)
                .tag("token_type", tokenType)
                .register(meterRegistry)
        );
        counter.increment(tokenCount);
    }

    /**
     * Record the response time
     */
    public void recordResponseTime(String userId, String appId, String modelName, Duration duration) {
        String key = String.format("%s_%s_%s", userId, appId, modelName);
        Timer timer = responseTimersCache.computeIfAbsent(key, k ->
            Timer.builder("ai_model_response_duration_seconds")
                .description("AI model response time")
                .tag("user_id", userId)
                .tag("app_id", appId)
                .tag("model_name", modelName)
                .register(meterRegistry)
        );
        timer.record(duration);
    }
}
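The caching pattern used by the collector can be looked at in isolation. The sketch below is a JDK-only stand-in (the class `LabelledCounters` is hypothetical and uses `LongAdder` in place of a Micrometer `Counter`), meant only to show how `computeIfAbsent` creates one counter per label combination and reuses it afterwards.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for the collector's caches, using only the JDK. */
final class LabelledCounters {
    private final ConcurrentMap<String, LongAdder> counters = new ConcurrentHashMap<>();

    /** Adds the given amount to the counter for this label combination, creating it on first use. */
    void increment(String userId, String appId, String modelName, String status, long amount) {
        String key = String.join("|", userId, appId, modelName, status);
        // computeIfAbsent is atomic, so concurrent callers share one counter per key
        counters.computeIfAbsent(key, k -> new LongAdder()).add(amount);
    }

    /** Current value for a label combination, or 0 if it was never incremented. */
    long value(String userId, String appId, String modelName, String status) {
        LongAdder adder = counters.get(String.join("|", userId, appId, modelName, status));
        return adder == null ? 0L : adder.sum();
    }

    /** Number of distinct label combinations seen so far (one series each). */
    int seriesCount() {
        return counters.size();
    }

    public static void main(String[] args) {
        LabelledCounters c = new LabelledCounters();
        c.increment("u1", "a1", "deepseek-chat", "started", 1);
        c.increment("u1", "a1", "deepseek-chat", "started", 1);
        c.increment("u1", "a1", "deepseek-chat", "success", 1);
        System.out.println(c.value("u1", "a1", "deepseek-chat", "started"));
        System.out.println(c.seriesCount());
    }
}
```

Each new combination of label values adds an entry, so labels with many possible values (user IDs, raw error messages) make the number of series grow accordingly.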

5. AI call listener

import dev.langchain4j.model.chat.listener.ChatModelErrorContext;
import dev.langchain4j.model.chat.listener.ChatModelListener;
import dev.langchain4j.model.chat.listener.ChatModelRequestContext;
import dev.langchain4j.model.chat.listener.ChatModelResponseContext;
import dev.langchain4j.model.output.TokenUsage;
import jakarta.annotation.Resource; // assumes Spring Boot 3; on Spring Boot 2 use javax.annotation.Resource
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

import java.time.Duration;
import java.time.Instant;
import java.util.Map;

/**
 * AI model monitoring listener.
 * Listens to the model's request and response events, collects metrics, and records them for Prometheus.
 */
@Component
@Slf4j
public class AiModelMonitorListener implements ChatModelListener {
    // Attribute key for the request start time
    private static final String REQUEST_START_TIME_KEY = "request_start_time";
    // Attribute key for passing the monitoring context along
    // (the request and response events are not fired on the same thread)
    private static final String MONITOR_CONTEXT_KEY = "monitor_context";

    @Resource
    private AiModelMetricsCollector aiModelMetricsCollector;

    /**
     * Handle the model request event
     */
    @Override
    public void onRequest(ChatModelRequestContext requestContext) {
        requestContext.attributes().put(REQUEST_START_TIME_KEY, Instant.now());
        MonitorContext context = MonitorContextHolder.getContext();
        if (context == null) {
            context = MonitorContext.builder()
                .userId("unknown")
                .appId("unknown")
                .build();
        }
        requestContext.attributes().put(MONITOR_CONTEXT_KEY, context);
        String userId = context.getUserId();
        String appId = context.getAppId();
        String modelName = requestContext.chatRequest().modelName();
        aiModelMetricsCollector.recordRequest(userId, appId, modelName, "started");
    }

    /**
     * Handle the model response event
     */
    @Override
    public void onResponse(ChatModelResponseContext responseContext) {
        // Read the monitoring data that onRequest stored in the attributes
        Map<Object, Object> attributes = responseContext.attributes();
        MonitorContext context = (MonitorContext) attributes.get(MONITOR_CONTEXT_KEY);
        if (context == null) {
            log.warn("Monitoring context is missing, skipping response metrics");
            return;
        }
        String userId = context.getUserId();
        String appId = context.getAppId();
        // Model name
        String modelName = responseContext.chatResponse().modelName();
        // Record the successful request
        aiModelMetricsCollector.recordRequest(userId, appId, modelName, "success");
        // Record the response time
        recordResponseTime(attributes, userId, appId, modelName);
        // Record token usage
        recordTokenUsage(responseContext, userId, appId, modelName);
    }

    /**
     * Handle the model error event
     */
    @Override
    public void onError(ChatModelErrorContext errorContext) {
        // Read the monitoring data that onRequest stored in the attributes
        Map<Object, Object> attributes = errorContext.attributes();
        MonitorContext context = (MonitorContext) attributes.get(MONITOR_CONTEXT_KEY);
        if (context == null) {
            log.warn("Monitoring context is missing, skipping error metrics");
            return;
        }
        String userId = context.getUserId();
        String appId = context.getAppId();
        String modelName = errorContext.chatRequest().modelName();
        String errorMessage = errorContext.error().getMessage();
        aiModelMetricsCollector.recordRequest(userId, appId, modelName, "error");
        aiModelMetricsCollector.recordError(userId, appId, modelName, errorMessage);
        recordResponseTime(attributes, userId, appId, modelName);
    }

    /**
     * Record the response time
     */
    private void recordResponseTime(Map<Object, Object> attributes, String userId, String appId, String modelName) {
        Instant startTime = (Instant) attributes.get(REQUEST_START_TIME_KEY);
        if (startTime == null) {
            return; // onRequest did not run, so there is nothing to measure
        }
        Duration responseTime = Duration.between(startTime, Instant.now());
        aiModelMetricsCollector.recordResponseTime(userId, appId, modelName, responseTime);
    }

    /**
     * Record token usage
     */
    private void recordTokenUsage(ChatModelResponseContext responseContext, String userId, String appId, String modelName) {
        TokenUsage tokenUsage = responseContext.chatResponse().metadata().tokenUsage();
        if (tokenUsage == null) {
            return;
        }
        recordTokens(userId, appId, modelName, "input", tokenUsage.inputTokenCount());
        recordTokens(userId, appId, modelName, "output", tokenUsage.outputTokenCount());
        recordTokens(userId, appId, modelName, "total", tokenUsage.totalTokenCount());
    }

    // The individual counts are nullable Integers, so skip the ones the model did not report
    private void recordTokens(String userId, String appId, String modelName, String tokenType, Integer count) {
        if (count != null) {
            aiModelMetricsCollector.recordTokenUsage(userId, appId, modelName, tokenType, count);
        }
    }
}

6. Register the listener on the AI model

/**
 * AI model configuration.
 * Sets the model's API key, model name, maximum token count, and listeners (used to monitor model calls).
 */
@Configuration
public class AiModelConfig {
    @Resource
    private AiModelMonitorListener aiModelMonitorListener;

    @Value("${langchain4j.community.dashscope.streaming-chat-model.api-key}")
    private String apiKey;

    @Value("${langchain4j.community.dashscope.streaming-chat-model.model-name}")
    private String modelName;

    @Value("${langchain4j.community.dashscope.streaming-chat-model.max-tokens:8129}")
    private Integer maxTokens;

    @Bean
    @Primary
    public StreamingChatModel streamingChatModel() {
        return QwenStreamingChatModel.builder()
            .apiKey(apiKey)
            .modelName(modelName)
            .maxTokens(maxTokens)
            .listeners(List.of(aiModelMonitorListener))
            .build();
    }
}

8. Prometheus configuration

# Prometheus configuration file
global:
  scrape_interval: 15s # Global scrape interval
  evaluation_interval: 15s # Rule evaluation interval
# Alertmanager configuration (optional)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Rule files
rule_files:
  # - "alert_rules.yml"
  # Alerting rules can be added here
# Scrape configuration
scrape_configs:
  # Prometheus monitoring itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Spring Boot application monitoring
  - job_name: 'AlzAssistant'
    metrics_path: '/api/actuator/prometheus' # Spring Boot Actuator endpoint
    static_configs:
      - targets: ['localhost:8126'] # Application server address
    scrape_interval: 10s # Scrape every 10 seconds
    scrape_timeout: 10s # Scrape timeout

9. Grafana visualization and monitoring configuration

2) Import a dashboard from a JSON configuration

Based on the analysis requirements, the metrics reporting code, and the sample data collected from Prometheus, please generate the JSON import code for a Grafana dashboard, with everything combined into a single dashboard.
Reference for the format: https://grafana.com/docs/grafana/latest/dashboards/build-dashboards/view-dashboard-json-model/
## Requirements
// ... add the analysis requirements
## Metrics reporting code
// ... add the code of AiModelMetricsCollector.java
## Sample collected data
# HELP ai_model_requests_total Total number of AI model requests
# TYPE ai_model_requests_total counter
ai_model_requests_total{app_id="313129227198590976",model_name="deepseek-chat",status="started",user_id="302588523967918080"} 2.0
ai_model_requests_total{app_id="313129227198590976",model_name="deepseek-chat",status="success",user_id="302588523967918080"} 2.0
# HELP ai_model_response_duration_seconds AI model response time
# TYPE ai_model_response_duration_seconds summary
ai_model_response_duration_seconds_count{app_id="313129227198590976",model_name="deepseek-chat",user_id="302588523967918080"} 2
ai_model_response_duration_seconds_sum{app_id="313129227198590976",model_name="deepseek-chat",user_id="302588523967918080"} 91.285863
# HELP ai_model_tokens_total Total AI model tokens consumed
# TYPE ai_model_tokens_total counter
ai_model_tokens_total{app_id="313129227198590976",model_name="deepseek-chat",token_type="input",user_id="302588523967918080"} 1321.0
ai_model_tokens_total{app_id="313129227198590976",model_name="deepseek-chat",token_type="output",user_id="302588523967918080"} 519.0
ai_model_tokens_total{app_id="313129227198590976",model_name="deepseek-chat",token_type="total",user_id="302588523967918080"} 1840.0
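As a quick sanity check on the sample data, the summary's _sum and _count series can be combined into an average response time. The sketch below is a hypothetical JDK-only helper (`SampleMath` is not part of the project) that reads the value from a labelled sample line and divides sum by count; its line pattern is a simplification that ignores escaping inside label values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for reading labelled sample lines like the ones above. */
final class SampleMath {
    // Shape: metric_name{labels} value (simplified; no escaping inside label values)
    private static final Pattern SAMPLE =
            Pattern.compile("^([a-zA-Z_:][a-zA-Z0-9_:]*)\\{(.*)\\}\\s+(\\S+)$");

    /** Extracts the numeric value at the end of a labelled sample line. */
    static double valueOf(String line) {
        Matcher m = SAMPLE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a labelled sample line: " + line);
        }
        return Double.parseDouble(m.group(3));
    }

    /** Average duration per call: total seconds divided by number of calls. */
    static double average(String sumLine, String countLine) {
        return valueOf(sumLine) / valueOf(countLine);
    }

    public static void main(String[] args) {
        String labels = "{app_id=\"313129227198590976\",model_name=\"deepseek-chat\","
                + "user_id=\"302588523967918080\"}";
        double avg = average(
                "ai_model_response_duration_seconds_sum" + labels + " 91.285863",
                "ai_model_response_duration_seconds_count" + labels + " 2");
        System.out.println("average response time in seconds: " + avg);
    }
}
```

For the two calls in the sample this gives 91.285863 / 2, roughly 45.64 seconds per call. The token series are consistent as well: 1321 input tokens plus 519 output tokens equals the 1840 reported as total.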
