SenseVoice-Small Speech Recognition Model: Gradio Front-End Customization and API Extension | 极客日志

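This article walks through building a speech recognition system on the SenseVoice-Small model. It covers environment setup, loading and testing the model through ModelScope, customizing the Gradio front end (layout and custom styles), extending functionality (batch processing, a RESTful API, real-time audio stream processing), and performance tuning and containerized deployment. The goal is to help developers build a feature-rich speech recognition application with a good user experience.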

Published by 机器人 on 2026/4/6, updated 2026/5/2122 views

SenseVoice-Small Speech Recognition Model, Customizing the Gradio Front End: UI Changes and API Extension Tutorial

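1. Environment Setup and Quick Deployment

Before starting on customization, we need a working base environment. The SenseVoice-Small speech recognition model used here is in ONNX format and quantized, which gives faster inference while keeping accuracy high.

First, make sure your system meets the following requirements:

Python 3.8 or later
At least 4 GB of free memory
Hardware supported by ONNX Runtime

Install the required packages: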

pip install modelscope gradio onnxruntime numpy librosa soundfile

If you need recording support, install the additional audio-processing libraries as well:

pip install pydub webrtcvad
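Before moving on, it can help to confirm that the packages installed above are actually importable. The following is a minimal sketch that uses only the standard library. The helper name `missing_packages` and the two lists are illustrative and not part of the original article.

```python
import importlib.util

# Import names of the packages installed above (assumed to match the pip names)
REQUIRED = ["modelscope", "gradio", "onnxruntime", "numpy", "librosa", "soundfile"]
OPTIONAL = ["pydub", "webrtcvad"]  # only needed for recording support


def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Example usage:
# print("missing required:", missing_packages(REQUIRED))
# print("missing optional:", missing_packages(OPTIONAL))
```

An empty list for `REQUIRED` suggests the base environment is ready; missing optional packages only affect the recording features.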

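With the environment configured, we can load and test the model. ModelScope offers convenient model management, which lets us fetch and deploy pretrained models quickly.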

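2. Loading and Testing the Base Model

2.1 Loading the Model with ModelScope

ModelScope simplifies model loading. The basic loading code looks like this:

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Create the speech recognition pipeline.
# Note: the ID below is a Paraformer model, used here as a stand-in.
asr_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch'
)
# Alternatively, point `model` at SenseVoice-Small directly.
# Adjust the ID to the actual model name (on ModelScope this is believed
# to be 'iic/SenseVoiceSmall'; verify it before use).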

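2.2 Basic Inference Test

Once the model is loaded, we can run a simple inference test:

import gradio as gr
import numpy as np

def transcribe_audio(audio_path):
    """Speech recognition function."""
    if audio_path is None:
        return "Please upload or record audio first"
    # Run recognition through the ModelScope pipeline.
    # This assumes the pipeline returns a dict with a 'text' key.
    result = asr_pipeline(audio_path)
    return result['text']

# Build a simple Gradio interface.
# NOTE: the argument values below were lost from the source page. The ones
# shown are assumptions chosen to match the rest of the article.
demo = gr.Interface(
    fn=transcribe_audio,
    inputs=gr.Audio(type="filepath"),  # assumed: the function expects a file path
    outputs="text",  # assumed
    title="SenseVoice-Small Speech Recognition"  # assumed
)
demo.launch(server_name="0.0.0.0", server_port=7860)  # assumed: matches the port exposed in the Dockerfile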
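The shape of the pipeline's return value can differ between ModelScope and FunASR versions: some return a single dict, others a list with one dict per input. The sketch below normalizes both shapes before reading the text. The helper name `extract_text` is hypothetical, and the list-of-dicts shape is an assumption worth checking against the installed version.

```python
def extract_text(result):
    """Return the transcript from a dict-shaped or list-shaped pipeline result."""
    # List-shaped results: take the first entry (one entry per input file)
    if isinstance(result, list):
        result = result[0] if result else {}
    if isinstance(result, dict):
        return str(result.get("text", ""))
    # Fall back to a string conversion for any other non-empty value
    return str(result) if result is not None else ""
```

Inside `transcribe_audio`, `return extract_text(result)` could then stand in for the direct `result['text']` lookup.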

3. Deep Customization of the Gradio Front End

3.1 Layout Optimization

import gradio as gr

def create_advanced_interface():
    # update_duration and process_audio are callbacks that must be defined elsewhere in the app
    with gr.Blocks(title="SenseVoice Speech Recognition System", theme=gr.themes.Soft()) as demo:
        gr.Markdown("# 🎯 SenseVoice-Small Speech Recognition System")
        gr.Markdown("Multilingual speech recognition, emotion analysis, and audio event detection")
        with gr.Row():
            with gr.Column(scale=1):
                gr.Markdown("## Audio Input")
                audio_input = gr.Audio(
                    sources=["upload", "microphone"],
                    type="filepath",
                    label="Upload or record audio"
                )
                duration_display = gr.Textbox(label="Audio duration", interactive=False)
            with gr.Column(scale=2):
                gr.Markdown("## Recognition Result")
                output_text = gr.Textbox(
                    label="Transcript",
                    lines=5,
                    max_lines=10,
                    interactive=False
                )
                # Keep the action buttons under the result box instead of in a third column
                with gr.Row():
                    clear_btn = gr.Button("Clear result")
                    export_btn = gr.Button("Export text")  # no handler is wired up in this example
        # Advanced options
        with gr.Accordion("Advanced options", open=False):
            with gr.Row():
                language_select = gr.Dropdown(
                    choices=["Auto-detect", "Chinese", "English", "Japanese", "Korean"],
                    value="Auto-detect",
                    label="Language"
                )
                emotion_detection = gr.Checkbox(label="Enable emotion analysis", value=True)
                event_detection = gr.Checkbox(label="Enable event detection", value=True)
        # Wire up component events
        audio_input.change(fn=update_duration, inputs=audio_input, outputs=duration_display)
        submit_btn = gr.Button("Start recognition", variant="primary")
        submit_btn.click(
            fn=process_audio,
            inputs=[audio_input, language_select, emotion_detection, event_detection],
            outputs=output_text
        )
        # Reset the audio input with None and the text box with an empty string
        clear_btn.click(fn=lambda: (None, ""), inputs=[], outputs=[audio_input, output_text])
    return demo
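The layout wires two callbacks, `update_duration` and `process_audio`, whose bodies are not shown in the article. The following is one possible sketch, under stated assumptions: durations are read with the standard-library `wave` module (so only WAV files are measured), the recognizer returns a dict or a one-element list with optional `emotion` and `events` keys, and the extra `recognizer` parameter exists only to make the function easy to try without a loaded model.

```python
import wave


def update_duration(audio_path):
    """Return a readable duration for a WAV file, or 'unknown' for other formats."""
    if not audio_path:
        return ""
    try:
        with wave.open(audio_path, "rb") as wav:
            seconds = wav.getnframes() / float(wav.getframerate())
    except (wave.Error, EOFError, OSError):
        # Not a readable WAV file (for example an MP3 upload)
        return "unknown"
    return f"{seconds:.2f} s"


def process_audio(audio_path, language, enable_emotion, enable_events, recognizer=None):
    """Run recognition and build the text shown in the result box."""
    if not audio_path:
        return "Please upload or record audio first"
    if recognizer is None:
        recognizer = asr_pipeline  # the pipeline created earlier in the article
    result = recognizer(audio_path)
    if isinstance(result, list):  # some versions return a list of dicts
        result = result[0] if result else {}
    lines = [result.get("text", "")]
    # `language` is accepted but not forwarded: the article does not show
    # how the pipeline takes a language hint.
    if enable_emotion and result.get("emotion"):
        lines.append(f"Emotion: {result['emotion']}")
    if enable_events and result.get("events"):
        lines.append("Events: " + ", ".join(map(str, result["events"])))
    return "\n".join(lines)
```

A library such as `soundfile`, which is already in the dependency list, could replace `wave` if compressed formats also need a duration.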

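3.2 Custom Styles and Themes

css = """
.gradio-container { max-width: 900px !important; }
.audio-input { border: 2px dashed #ccc; padding: 20px; border-radius: 10px; }
.result-box { background-color: #f8f9fa; padding: 15px; border-radius: 8px; border-left: 4px solid #007bff; }
"""

def create_custom_interface():
    with gr.Blocks(css=css, title="Customized Speech Recognition Interface") as demo:
        # Interface components go here. Class rules such as .audio-input only
        # apply to components that opt in through elem_classes, for example:
        gr.Audio(type="filepath", elem_classes="audio-input")
        gr.Textbox(label="Transcript", elem_classes="result-box")
    return demo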

4. API Extension and Integration

4.1 Batch Processing

import json
from pathlib import Path

def batch_process_audio(audio_dir, output_format="txt"):
    """Batch-process the audio files in a directory."""
    results = []
    audio_dir = Path(audio_dir)
    if not audio_dir.exists():
        return "Directory does not exist"
    audio_files = list(audio_dir.glob("*.wav")) + list(audio_dir.glob("*.mp3"))
    for audio_file in audio_files:
        try:
            result = asr_pipeline(str(audio_file))
            results.append({
                "file": audio_file.name,
                "text": result['text'],
                "emotion": result.get('emotion', ''),
                "events": result.get('events', [])
            })
        except Exception as e:
            results.append({
                "file": audio_file.name,
                "error": str(e)
            })
    # Save the results
    if output_format == "txt":
        save_txt_results(results, audio_dir / "results.txt")
    elif output_format == "json":
        save_json_results(results, audio_dir / "results.json")
    return f"Done: processed {len(audio_files)} file(s)"

def save_txt_results(results, output_path):
    with open(output_path, 'w', encoding='utf-8') as f:
        for result in results:
            f.write(f"File: {result['file']}\n")
            if 'error' in result:
                f.write(f"Error: {result['error']}\n")
            else:
                f.write(f"Text: {result['text']}\n")
                # Failed entries carry no 'emotion' or 'events' keys, so these
                # checks belong to the success branch and use .get()
                if result.get('emotion'):
                    f.write(f"Emotion: {result['emotion']}\n")
                if result.get('events'):
                    f.write(f"Events: {', '.join(map(str, result['events']))}\n")
            f.write("-" * 50 + "\n")

def save_json_results(results, output_path):
    # Not shown in the original; a minimal version so the "json" branch runs
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
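The batch function collects files with two `glob` patterns, which skips upper-case extensions such as `.WAV` and any nested folders. A slightly more tolerant file discovery step could look like the sketch below. The helper name `find_audio_files` and the extension set are assumptions, not part of the original code.

```python
from pathlib import Path

# Assumed set of extensions; trim it to the formats your decoder supports
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a"}


def find_audio_files(root, extensions=AUDIO_EXTENSIONS, recursive=False):
    """Return audio files under `root`, matching extensions case-insensitively."""
    root = Path(root)
    pattern = "**/*" if recursive else "*"
    return sorted(
        path for path in root.glob(pattern)
        if path.is_file() and path.suffix.lower() in extensions
    )
```

Sorting the result keeps the order of entries in the output file stable between runs.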

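4.2 RESTful API Integration

import os
import tempfile
from pathlib import Path

from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.concurrency import run_in_threadpool
from fastapi.responses import JSONResponse

app = FastAPI(title="SenseVoice Speech Recognition API")

@app.post("/api/recognize")
async def recognize_speech(
    audio_file: UploadFile = File(...),  # file uploads require the python-multipart package
    language: str = "auto",  # accepted but not yet passed to the pipeline
    enable_emotion: bool = True,
    enable_events: bool = True
):
    """Speech recognition API endpoint."""
    # content_type can be None when the client omits it
    if not (audio_file.content_type or "").startswith('audio/'):
        raise HTTPException(status_code=400, detail="Please upload an audio file")
    # Save to a temporary file, keeping the upload's extension so the decoder
    # can detect the format (falls back to .wav)
    suffix = Path(audio_file.filename or "").suffix or ".wav"
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp_file:
        content = await audio_file.read()
        tmp_file.write(content)
        tmp_path = tmp_file.name
    try:
        # Run the blocking recognition call in a worker thread so the
        # event loop stays responsive
        result = await run_in_threadpool(asr_pipeline, tmp_path)
        response_data = {
            "text": result.get('text', ''),
            "language": result.get('language', ''),
            "success": True
        }
        if enable_emotion:
            response_data["emotion"] = result.get('emotion', {})
        if enable_events:
            response_data["events"] = result.get('events', [])
        return JSONResponse(content=response_data)
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Processing failed: {str(e)}")
    finally:
        # Clean up the temporary file
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)

@app.get("/api/health")
async def health_check():
    """Health check endpoint."""
    return {"status": "healthy", "model_loaded": asr_pipeline is not None}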
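The endpoint only checks the declared content type, which clients can omit or set incorrectly. A small validation helper, sketched below, could combine the content type, the file extension, and a size cap before the file reaches the model. The name `validate_upload`, the extension set, and the 25 MB limit are all assumptions chosen for illustration.

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".wav", ".mp3", ".flac", ".m4a", ".ogg"}  # assumed
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # assumed limit of 25 MB


def validate_upload(filename, content_type, size_bytes):
    """Return None when the upload looks acceptable, else a short error message."""
    suffix = Path(filename or "").suffix.lower()
    is_audio_type = (content_type or "").startswith("audio/")
    # Accept the file if either the declared type or the extension says audio
    if not is_audio_type and suffix not in ALLOWED_SUFFIXES:
        return "not an audio file"
    if size_bytes == 0:
        return "empty upload"
    if size_bytes > MAX_UPLOAD_BYTES:
        return "file too large"
    return None
```

In the endpoint, a non-None return value would map naturally onto an `HTTPException` with status code 400.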

5. Advanced Features

5.1 Real-Time Audio Stream Processing

import queue
import threading

import numpy as np

class AudioStreamProcessor:
    def __init__(self, asr_pipeline):
        self.asr_pipeline = asr_pipeline
        self.audio_queue = queue.Queue()
        self.is_processing = False
        self.results = []

    def add_audio_chunk(self, audio_data, sample_rate=16000):
        """Add a chunk of audio data."""
        self.audio_queue.put((audio_data, sample_rate))

    def start_processing(self):
        """Start processing the audio stream."""
        self.is_processing = True
        self.process_thread = threading.Thread(target=self._process_stream)
        self.process_thread.start()

    def stop_processing(self):
        """Stop processing."""
        self.is_processing = False
        if hasattr(self, 'process_thread'):
            self.process_thread.join()

    def _process_stream(self):
        """Worker thread: buffer incoming chunks and recognize them in batches."""
        audio_buffer = []
        buffered_samples = 0
        buffer_duration = 2.0  # buffer 2 seconds of audio
        while self.is_processing:
            try:
                audio_data, sample_rate = self.audio_queue.get(timeout=1.0)
                audio_buffer.append(audio_data)
                # Count samples along the first axis, which also works when
                # chunk sizes vary (the original indexed audio_data[0], which
                # fails for 1-D chunks)
                buffered_samples += len(audio_data)
                # Process once the buffer holds enough audio
                if buffered_samples / sample_rate >= buffer_duration:
                    # Merge the buffered chunks and recognize them.
                    # Assumes the pipeline accepts a NumPy array as input.
                    combined_audio = np.concatenate(audio_buffer, axis=0)
                    result = self.asr_pipeline(combined_audio)
                    self.results.append(result)
                    audio_buffer = []  # reset the buffer
                    buffered_samples = 0
            except queue.Empty:
                continue
            except Exception as e:
                print(f"Processing error: {e}")
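The buffering rule in the worker thread (collect chunks until roughly two seconds of audio are available, then hand them over) can be isolated from threading and from the model, which makes it easier to test. The sketch below does this with plain Python sequences. The class name `ChunkAccumulator` is hypothetical, and it only relies on `len()` and iteration, so NumPy arrays can be passed as chunks too.

```python
class ChunkAccumulator:
    """Collect audio chunks until at least `min_seconds` of audio is buffered."""

    def __init__(self, sample_rate=16000, min_seconds=2.0):
        self.sample_rate = sample_rate
        self.min_samples = int(sample_rate * min_seconds)
        self._chunks = []
        self._count = 0

    def add(self, chunk):
        """Add a chunk; return all buffered samples once the threshold is met, else None."""
        self._chunks.append(chunk)
        self._count += len(chunk)
        if self._count < self.min_samples:
            return None
        return self.flush()

    def flush(self):
        """Return whatever is buffered (possibly an empty list) and reset the buffer."""
        samples = [sample for chunk in self._chunks for sample in chunk]
        self._chunks, self._count = [], 0
        return samples
```

Calling `flush()` when the stream stops avoids losing a final segment shorter than the threshold.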

5.2 Result Post-Processing and Formatting

def format_recognition_result(result, format_type="rich"):
    """Format a recognition result."""
    if format_type == "rich":
        # Rich text format, including emotion and event information
        formatted_text = result.get('text', '')
        if 'emotion' in result and result['emotion']:
            emotion_info = f"\n\nEmotion analysis: {result['emotion']}"
            formatted_text += emotion_info
        if 'events' in result and result['events']:
            events_info = f"\nDetected events: {', '.join(result['events'])}"
            formatted_text += events_info
        return formatted_text
    elif format_type == "json":
        # JSON-style dict, convenient for programmatic use
        return {
            "text": result.get('text', ''),
            "emotion": result.get('emotion', {}),
            "events": result.get('events', []),
            "language": result.get('language', ''),
            "confidence": result.get('confidence', 0.0)
        }
    elif format_type == "plain":
        # Plain text format
        return result.get('text', '')
    else:
        return str(result)

def add_timestamp_to_result(result, audio_duration):
    """Add timestamp information to a recognition result."""
    # Simplified: a real application needs a more precise timestamp calculation.
    # Note that split() relies on whitespace, so unsegmented text such as
    # Chinese yields a single token.
    if 'text' in result:
        words = result['text'].split()
        timestamped_result = []
        for i, word in enumerate(words):
            # Assume the words are spread evenly over the audio duration
            start_time = i * audio_duration / len(words)
            end_time = (i + 1) * audio_duration / len(words)
            timestamped_result.append({
                "word": word,
                "start": round(start_time, 2),
                "end": round(end_time, 2),
                "confidence": 0.9  # placeholder confidence, not a model output
            })
        result['timestamped_words'] = timestamped_result
    return result
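The formatting helpers expect `emotion`, `events`, and `language` keys in the result. SenseVoice models are commonly reported to embed this information as inline tags in the raw transcript, for example `<|en|><|HAPPY|><|Laughter|><|woitn|>hello`. Assuming that tag format, the sketch below splits a raw string into the fields used above. The function name `parse_tagged_text` and the three tag sets are assumptions and may not cover every tag the model emits.

```python
import re

TAG_PATTERN = re.compile(r"<\|([^|]+)\|>")

# Assumed tag vocabularies; extend them to match the model's actual output
LANGUAGES = {"zh", "en", "yue", "ja", "ko"}
EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL", "FEARFUL", "DISGUSTED", "SURPRISED"}
EVENTS = {"BGM", "Applause", "Laughter", "Cry", "Sneeze", "Breath", "Cough"}


def parse_tagged_text(raw):
    """Split a tagged transcript into text, language, emotion, and events."""
    result = {"text": "", "language": "", "emotion": "", "events": []}
    for tag in TAG_PATTERN.findall(raw):
        if tag in LANGUAGES and not result["language"]:
            result["language"] = tag
        elif tag in EMOTIONS and not result["emotion"]:
            result["emotion"] = tag
        elif tag in EVENTS and tag not in result["events"]:
            result["events"].append(tag)
        # Any other tag (for example normalization markers) is ignored
    result["text"] = TAG_PATTERN.sub("", raw).strip()
    return result
```

The returned dict has the same keys that `format_recognition_result` reads, so the two can be chained.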

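6. Deployment and Optimization Tips

6.1 Performance Tuning

def optimize_model_performance():
    """Build ONNX Runtime settings for better model performance."""
    import onnxruntime as ort
    # ONNX Runtime optimization settings
    so = ort.SessionOptions()
    so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    so.intra_op_num_threads = 4  # adjust to the number of CPU cores
    # For GPU environments
    providers = [
        ('CUDAExecutionProvider', {
            'device_id': 0,
            'arena_extend_strategy': 'kNextPowerOfTwo',
            'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # 2 GB
            'cudnn_conv_algo_search': 'EXHAUSTIVE',
            'do_copy_in_default_stream': True,
        }),
        'CPUExecutionProvider',
    ]
    # Or use the CPU only
    # providers = ['CPUExecutionProvider']
    return so, providers

# Load the model with the optimized configuration
def create_optimized_pipeline():
    so, providers = optimize_model_performance()
    from modelscope.pipelines import pipeline
    from modelscope.utils.constant import Tasks
    # Provider entries may be plain names or (name, options) tuples, so compare
    # by name; a direct `in providers` test never matches the tuple entry
    provider_names = [p[0] if isinstance(p, tuple) else p for p in providers]
    return pipeline(
        task=Tasks.auto_speech_recognition,
        model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
        model_revision='v1.0.0',
        device='gpu' if 'CUDAExecutionProvider' in provider_names else 'cpu',
        # Whether the pipeline forwards these two options to ONNX Runtime depends
        # on the model and the ModelScope version; verify before relying on them
        session_options=so,
        providers=providers
    )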
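The provider list above always requests CUDA, even on machines where ONNX Runtime has no GPU support. A sketch of choosing providers from what is actually available follows. The helper names `select_providers` and `detect_available_providers` are hypothetical; `onnxruntime.get_available_providers()` is the real ONNX Runtime call they rely on.

```python
def select_providers(available, prefer_gpu=True, gpu_mem_limit=2 * 1024**3):
    """Build a provider list, using CUDA only if it appears in `available`."""
    providers = []
    if prefer_gpu and "CUDAExecutionProvider" in available:
        providers.append(
            ("CUDAExecutionProvider", {"device_id": 0, "gpu_mem_limit": gpu_mem_limit})
        )
    # The CPU provider is always kept as the fallback
    providers.append("CPUExecutionProvider")
    return providers


def detect_available_providers():
    """Ask ONNX Runtime which providers it supports; assume CPU only if it is missing."""
    try:
        import onnxruntime as ort
    except ImportError:
        return ["CPUExecutionProvider"]
    return ort.get_available_providers()


# Example usage:
# providers = select_providers(detect_available_providers())
```

Passing the available list in as an argument keeps the selection logic testable on machines without a GPU.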

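6.2 Containerized Deployment

FROM python:3.9-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    ffmpeg \
    libsndfile1 \
    && rm -rf /var/lib/apt/lists/*
# Copy the dependency list
COPY requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy the application code
COPY . .
# Expose the port
EXPOSE 7860
# Start the application (app.py must parse --host and --port itself)
CMD ["python", "app.py", "--host", "0.0.0.0", "--port", "7860"]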
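The `CMD` line passes `--host` and `--port` to `app.py`, so the script needs to accept those flags. The article does not show `app.py`; the sketch below is one way its argument handling could look, with the defaults chosen as assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the --host and --port flags used by the container's CMD line."""
    parser = argparse.ArgumentParser(description="SenseVoice Gradio app")
    parser.add_argument("--host", default="127.0.0.1", help="interface to bind")
    parser.add_argument("--port", type=int, default=7860, help="port to listen on")
    return parser.parse_args(argv)


def main(argv=None):
    """Entry point sketch: parse flags, then launch the interface."""
    args = parse_args(argv)
    # In the real app, build and launch the interface here, for example:
    # demo = create_advanced_interface()
    # demo.launch(server_name=args.host, server_port=args.port)
    return args
```

Binding to `0.0.0.0` inside the container is what makes the published port reachable from the host.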

version: '3.8'
services:
  sensevoice-asr:
    build: .
    ports:
      - "7860:7860"
    environment:
      - MODEL_CACHE_DIR=/app/models
      - ENABLE_GPU=true
    volumes:
      - ./models:/app/models
      - ./data:/app/data
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
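The Compose file sets `MODEL_CACHE_DIR` and `ENABLE_GPU`, but environment variables only take effect if the application reads them. A minimal sketch of doing so is shown below. The function name `load_settings`, the default values, and the accepted truthy strings are assumptions.

```python
import os
from pathlib import Path


def load_settings(env=None):
    """Read the deployment settings defined in the Compose file."""
    env = os.environ if env is None else env
    cache_dir = Path(env.get("MODEL_CACHE_DIR", "./models"))
    # Treat common truthy spellings as enabled; anything else disables the GPU
    enable_gpu = env.get("ENABLE_GPU", "false").strip().lower() in {"1", "true", "yes"}
    return {"model_cache_dir": cache_dir, "enable_gpu": enable_gpu}


# Example usage:
# settings = load_settings()
# device = "gpu" if settings["enable_gpu"] else "cpu"
```

Note that the `python:3.9-slim` base image ships without CUDA libraries, so enabling the GPU also requires a CUDA-capable image and GPU builds of the runtime packages.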

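SenseVoice-Small Speech Recognition Model, Customizing the Gradio Front End: UI Changes and API Extension Tutorial

1. Environment Setup and Quick Deployment

2. Loading and Testing the Base Model

2.1 Loading the Model with ModelScope

2.2 Basic Inference Test

3. Deep Customization of the Gradio Front End

3.1 Layout Optimization

3.2 Custom Styles and Themes

4. API Extension and Integration

4.1 Batch Processing

4.2 RESTful API Integration

5. Advanced Features

5.1 Real-Time Audio Stream Processing

5.2 Result Post-Processing and Formatting

6. Deployment and Optimization Tips

6.1 Performance Tuning

6.2 Containerized Deployment

7. Summary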
