Building a Two-Way Voice Dialogue System with Qwen3-TTS and Whisper ASR | 极客日志

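This article walks through the full process of building a two-way voice dialogue system on Qwen3-TTS and Whisper ASR: environment setup, model loading, speech synthesis and recognition, system integration, and connecting a large language model. It also covers performance optimization strategies and troubleshooting for common problems, to help developers get an intelligent voice application running quickly.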

雪落无声, published 2026/4/6, updated 2026/5/2331 views

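This article shows how to build a two-way voice dialogue system with Qwen3-TTS and Whisper ASR, starting from basic deployment and ending with a system that can both listen and speak.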

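1. Preparation and environment setup

Before starting, here is a quick look at the two core tools used in this article.

Qwen3-TTS is a text-to-speech model. It supports 10 major languages, including Chinese, English, and Japanese, and can generate a range of dialects and voice styles. It also picks up on the emotion and intent in the input text and adjusts intonation and speaking rate to match, which makes the output sound natural.

Whisper ASR is OpenAI's open-source speech recognition model. It is highly accurate, supports many languages, and copes well with accented or noisy speech.

Combine the two, with Whisper doing the 'listening' and Qwen3-TTS doing the 'speaking', and you have a complete voice dialogue loop.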

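1.1 Basic environment requirements

Your machine should meet the following requirements:

Operating system: Linux (e.g. Ubuntu 20.04+) or macOS is recommended. Windows also works but may need extra configuration.
Python version: Python 3.8 or later. Recent PyTorch and Transformers releases have dropped 3.8, so 3.9+ is the safer choice.
Memory: at least 16GB RAM; 32GB or more is recommended for larger models or batch processing.
Storage: reserve 10-20GB for models and dependencies.
GPU (optional but recommended): everything runs on CPU, but an NVIDIA GPU with 8GB or more of VRAM is much faster.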

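1.2 Installing dependencies

Open a terminal and install the required packages step by step.

First, create a dedicated Python environment, which avoids dependency conflicts between projects:

# Create a new virtual environment
python -m venv voice_chat_env
# Activate it (Linux/macOS)
source voice_chat_env/bin/activate
# Activate it (Windows)
voice_chat_env\Scripts\activate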

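Once the environment is active, its name appears at the start of the prompt. Next, install the core dependencies:

# Upgrade pip to the latest version
pip install --upgrade pip
# Install PyTorch (run only ONE of the two commands below, depending on your hardware)
# With CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Without a GPU
pip install torch torchvision torchaudio
# Install Transformers (Hugging Face's core library); accelerate is needed for device_map="auto"
pip install transformers accelerate
# Install audio processing libraries
pip install soundfile librosa pydub
# Install web UI libraries (optional, for a visual interface)
pip install gradio streamlit
# Used by later sections: microphone capture, the OpenAI client, 8-bit quantization
pip install pyaudio openai bitsandbytes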

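Installation can take a few minutes depending on your connection. If downloads are slow or fail, try a mirror inside China such as the Tsinghua PyPI mirror:

pip install transformers -i https://pypi.tuna.tsinghua.edu.cn/simple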
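
Before moving on, it can help to confirm that the interpreter and the packages installed above are actually visible from the active environment. The following is a minimal sketch using only the standard library; the helper name `check_environment` and the module list are assumptions made for illustration, not part of any of the libraries involved.

```python
import importlib.util
import sys

# Import names assumed from the pip commands above.
REQUIRED_MODULES = ["torch", "transformers", "soundfile", "librosa", "pydub"]


def check_environment(min_python=(3, 8), modules=REQUIRED_MODULES):
    """Return a small report on the Python version, venv status, and missing packages."""
    report = {
        "python_ok": sys.version_info[:2] >= tuple(min_python),
        # Inside a virtual environment, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        "missing": [],
    }
    for name in modules:
        # find_spec locates a top-level module without importing it.
        if importlib.util.find_spec(name) is None:
            report["missing"].append(name)
    return report


if __name__ == "__main__":
    print(check_environment())
```

An empty `missing` list only shows that the packages can be found; it does not prove that a GPU build of PyTorch is working.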

2. Qwen3-TTS model deployment and first run

With the environment ready, deploy and test the Qwen3-TTS model first. It is the 'mouth' of the system, turning text into sound.

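2.1 Downloading and loading the Qwen3-TTS model

from transformers import AutoModelForTextToSpeech, AutoTokenizer
import torch

# NOTE: AutoModelForTextToSpeech and the tokenizer-based call used below may not
# match the loading API documented on the Qwen3-TTS model card; check the model
# card if this import fails.

# Model name
model_name = "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"
print("Loading the Qwen3-TTS model; this may take a few minutes...")

# Load the tokenizer (text preprocessor)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Load the model
model = AutoModelForTextToSpeech.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision to reduce memory use
    device_map="auto",  # place the model on GPU/CPU automatically
    trust_remote_code=True
)
print("Model loaded!")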

2.2 Your first speech synthesis

def text_to_speech(text, language="中文", voice_description="亲切的女声"):
    """
    Convert text to speech.
    Args:
        text: the text to synthesize
        language: language label, e.g. "中文" (Chinese), "英文" (English), "日文" (Japanese)
        voice_description: free-text voice description, e.g. "亲切的女声"
            (a warm female voice) or "沉稳的男声" (a calm male voice)
    """
    # Build the input string
    input_text = f"{language}|{voice_description}|{text}"

    # Tokenize the text
    inputs = tokenizer(input_text, return_tensors="pt")

    # Move the inputs to the same device as the model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    # Generate speech
    with torch.no_grad():  # no gradients are needed for inference
        speech = model.generate(**inputs)

    # Convert to a float32 numpy array (easier to save)
    speech = speech.float().cpu().numpy()
    return speech

# Quick test
test_text = "你好，我是 Qwen3-TTS，很高兴为你服务。"
audio_data = text_to_speech(test_text, language="中文", voice_description="亲切的女声")
print(f"Speech generated! Audio array shape: {audio_data.shape}")

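2.3 Saving and playing back the generated speech

import soundfile as sf
import numpy as np

def save_audio(audio_data, filename="output.wav", sample_rate=24000):
    """
    Save audio data to a file.
    Args:
        audio_data: audio samples (numpy array)
        filename: output file name
        sample_rate: sample rate; Qwen3-TTS outputs 24000 Hz by default
    """
    # soundfile cannot write float16 arrays, so convert first
    audio_data = np.asarray(audio_data, dtype=np.float32)

    # Drop the leading dimension if there is one, keeping a single mono waveform
    if audio_data.ndim > 1:
        audio_data = audio_data[0]

    # Scale into [-1, 1] if any sample falls outside that range
    peak = np.max(np.abs(audio_data))
    if peak > 1.0:
        audio_data = audio_data / peak

    # Write a WAV file
    sf.write(filename, audio_data, sample_rate)
    print(f"Audio saved to: {filename}")

# Save the speech generated above
save_audio(audio_data, "first_speech.wav")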
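
A quick way to sanity-check a file written by `save_audio` is to read its header back and confirm the sample rate and duration. The sketch below uses only the standard-library `wave` module and assumes the file is 16-bit PCM WAV, which is believed to be soundfile's default for `.wav` output. Both helper names are hypothetical, and `write_test_tone` exists only so the example has a file to inspect without a model.

```python
import math
import struct
import wave


def describe_wav(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        duration = wav_file.getnframes() / rate
    return rate, channels, duration


def write_test_tone(path, seconds=1.0, rate=24000, freq=440.0):
    """Write a mono 16-bit sine tone so describe_wav has something to read."""
    n_frames = int(seconds * rate)
    frames = b"".join(
        struct.pack("<h", int(0.5 * 32767 * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(n_frames)
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav_file.setframerate(rate)
        wav_file.writeframes(frames)


if __name__ == "__main__":
    write_test_tone("tone_check.wav")
    print(describe_wav("tone_check.wav"))
```

Calling `describe_wav("first_speech.wav")` after the step above should report a 24000 Hz mono file if the model output matches the article's description.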

2.4 Exploring different voice styles

# Try different languages and voices
# Voice labels: 专业的女声 = professional female, 可爱的女声 = cute female,
# 欢快的女声 = cheerful female, 严肃的男声 = serious male
test_cases = [
    {"text": "Hello, welcome to our AI voice system.", "language": "英文", "voice": "专业的女声"},
    {"text": "こんにちは、AI 音声システムへようこそ。", "language": "日文", "voice": "可爱的女声"},
    {"text": "今天天气真好，我们出去散步吧。", "language": "中文", "voice": "欢快的女声"},
    {"text": "请注意，系统将在 5 分钟后进行维护。", "language": "中文", "voice": "严肃的男声"},
]

for i, case in enumerate(test_cases):
    print(f"Generating: {case['text']}")
    audio = text_to_speech(case["text"], case["language"], case["voice"])
    save_audio(audio, f"test_{i+1}.wav")
    print(f"Saved: test_{i+1}.wav")

3. Whisper ASR model deployment and testing

3.1 Loading the Whisper model

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Choose a model size (tiny, base, small, medium, large)
model_size = "base"  # a balance between accuracy and speed
print(f"Loading the Whisper-{model_size} model...")

# Load the processor and the model
processor = WhisperProcessor.from_pretrained(f"openai/whisper-{model_size}")
asr_model = WhisperForConditionalGeneration.from_pretrained(f"openai/whisper-{model_size}")

# Move the model to the GPU if one is available
if torch.cuda.is_available():
    asr_model = asr_model.cuda()
print("Whisper model loaded!")

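3.2 Implementing speech recognition

def speech_to_text(audio_path, language="zh"):
    """
    Transcribe an audio file to text.
    Args:
        audio_path: path to the audio file
        language: language code, e.g. zh = Chinese, en = English, ja = Japanese
    """
    # Load the audio, resampled to the 16000 Hz that Whisper expects
    audio, sr = librosa.load(audio_path, sr=16000)

    # Extract features with the processor
    input_features = processor(
        audio, sampling_rate=sr, return_tensors="pt"
    ).input_features

    # Move the features to the model's device
    input_features = input_features.to(asr_model.device)

    # Generate the transcript; passing the language saves Whisper from guessing it
    with torch.no_grad():
        predicted_ids = asr_model.generate(
            input_features, language=language, task="transcribe"
        )
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription

# Quick test (needs an audio file first)
# You can record a short clip on your phone and save it as test_voice.wav
try:
    text = speech_to_text("test_voice.wav", language="zh")
    print(f"Transcript: {text}")
except FileNotFoundError:
    print("Test audio not found; record a clip and save it as test_voice.wav first")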

3.3 Real-time speech recognition

import pyaudio
import numpy as np
from queue import Queue
import threading

class RealTimeASR:
    def __init__(self, model_size="base"):
        """Set up real-time speech recognition."""
        self.processor = WhisperProcessor.from_pretrained(f"openai/whisper-{model_size}")
        self.model = WhisperForConditionalGeneration.from_pretrained(f"openai/whisper-{model_size}")
        if torch.cuda.is_available():
            self.model = self.model.cuda()

        # Audio parameters
        self.CHUNK = 1024  # frames read per call
        self.FORMAT = pyaudio.paInt16  # 16-bit samples
        self.CHANNELS = 1  # mono
        self.RATE = 16000  # sample rate in Hz, as Whisper expects
        self.audio_queue = Queue()
        self.is_recording = False

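    def record_audio(self):
        """Record from the microphone and push raw chunks onto the queue."""
        p = pyaudio.PyAudio()
        stream = p.open(
            format=self.FORMAT,
            channels=self.CHANNELS,
            rate=self.RATE,
            input=True,
            frames_per_buffer=self.CHUNK
        )
        print("Recording... (press Ctrl+C to stop)")
        self.is_recording = True
        try:
            while self.is_recording:
                # Do not raise if the input buffer overflowed while we were busy
                data = stream.read(self.CHUNK, exception_on_overflow=False)
                self.audio_queue.put(data)
        except KeyboardInterrupt:
            print("\nRecording stopped")
        finally:
            stream.stop_stream()
            stream.close()
            p.terminate()
            self.is_recording = False

    def stop(self):
        """Ask the recording loop to exit; Ctrl+C only reaches the main thread."""
        self.is_recording = False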

    def transcribe_audio(self, audio_data):
        """Transcribe raw 16-bit PCM bytes."""
        # Convert the bytes to a float32 numpy array in [-1, 1)
        audio_array = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0

        # Extract features
        input_features = self.processor(
            audio_array, sampling_rate=self.RATE, return_tensors="pt"
        ).input_features
        if torch.cuda.is_available():
            input_features = input_features.cuda()

        # Generate the transcript
        predicted_ids = self.model.generate(input_features)
        transcription = self.processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
        return transcription

# Usage example
# asr = RealTimeASR("base")
# Record in a separate thread
# record_thread = threading.Thread(target=asr.record_audio)
# record_thread.start()
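
The `RealTimeASR` class fills `audio_queue` with raw chunks, but nothing in the usage example drains it. One way to close that gap is to collect a few seconds of chunks at a time and hand each window to `transcribe_audio`. The sketch below shows only the buffering step with the standard library; the function name `collect_window` and the 5-second default are assumptions for illustration.

```python
from queue import Empty, Queue


def collect_window(audio_queue, seconds=5.0, rate=16000, bytes_per_sample=2, timeout=1.0):
    """Pull raw chunks off the queue until about `seconds` of audio is buffered."""
    target_bytes = int(seconds * rate * bytes_per_sample)
    buffered = bytearray()
    while len(buffered) < target_bytes:
        try:
            buffered.extend(audio_queue.get(timeout=timeout))
        except Empty:
            # The recorder stalled or stopped; return whatever was collected.
            break
    return bytes(buffered)


# Hypothetical consumer loop, run in the main thread while recording continues:
# while asr.is_recording:
#     window = collect_window(asr.audio_queue)
#     if window:
#         print(asr.transcribe_audio(window))
```

Fixed-length windows can cut a word in half at the boundary, so a production system would more likely split on silence; this sketch keeps the simplest possible policy.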

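4. Building the two-way voice dialogue system

4.1 System architecture

User speaks → Whisper transcribes to text → text is processed (optionally by a large language model) → Qwen3-TTS synthesizes speech → audio is played back to the user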
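
The pipeline above can be read as three functions composed in order. The following sketch makes that explicit with plain callables so each stage can be swapped or stubbed independently; `run_turn` and its parameter names are hypothetical and chosen only to mirror the listen/think/speak stages.

```python
from typing import Callable, Optional


def run_turn(
    audio_path: str,
    listen: Callable[[str], str],
    think: Callable[[str], str],
    speak: Callable[[str], bool],
) -> Optional[str]:
    """Run one dialogue turn: audio -> text -> reply text -> spoken reply."""
    user_text = listen(audio_path).strip()
    if not user_text:
        # Nothing recognizable was said, so skip the remaining stages.
        return None
    reply = think(user_text)
    speak(reply)
    return reply


if __name__ == "__main__":
    # Stub stages stand in for Whisper, the language model, and Qwen3-TTS.
    reply = run_turn(
        "example.wav",
        listen=lambda path: "hello",
        think=lambda text: f"You said: {text}",
        speak=lambda text: True,
    )
    print(reply)
```

Keeping the stages as separate callables makes it straightforward to test the control flow without loading any model.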

4.2 A basic dialogue system

import time
from pydub import AudioSegment
from pydub.playback import play
import tempfile

class SimpleVoiceChat:
    def __init__(self, tts_model_name="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign", asr_model_size="base"):
        """Set up the simple voice chat system."""
        print("Initializing the voice dialogue system...")

        # Set up TTS (if loading fails, check the model card for the supported API)
        self.tts_tokenizer = AutoTokenizer.from_pretrained(
            tts_model_name, trust_remote_code=True
        )
        self.tts_model = AutoModelForTextToSpeech.from_pretrained(
            tts_model_name,
            torch_dtype=torch.float16,
            device_map="auto",
            trust_remote_code=True
        )

        # Set up ASR
        self.asr_processor = WhisperProcessor.from_pretrained(
            f"openai/whisper-{asr_model_size}"
        )
        self.asr_model = WhisperForConditionalGeneration.from_pretrained(
            f"openai/whisper-{asr_model_size}"
        )
        if torch.cuda.is_available():
            self.asr_model = self.asr_model.cuda()
        print("System initialized!")

    def listen(self, audio_path):
        """Listen: speech to text."""
        try:
            # Load the audio at 16000 Hz
            audio, sr = librosa.load(audio_path, sr=16000)

            # Extract features
            input_features = self.asr_processor(
                audio, sampling_rate=sr, return_tensors="pt"
            ).input_features
            if torch.cuda.is_available():
                input_features = input_features.cuda()

            # Generate the transcript
            predicted_ids = self.asr_model.generate(input_features)
            text = self.asr_processor.batch_decode(
                predicted_ids, skip_special_tokens=True
            )[0]
            return text
        except Exception as e:
            print(f"Speech recognition failed: {e}")
            return ""

    def think(self, user_text):
        """Think: process the user input (here it is simply echoed back)."""
        # A simple placeholder; a large language model can be plugged in here
        response = f"你说的是：{user_text}"  # "You said: ..."
        return response

    def speak(self, text, language="中文", voice="友好的助手"):
        """Speak: text to speech, then play it."""
        try:
            # Build the input ("友好的助手" = a friendly assistant)
            input_text = f"{language}|{voice}|{text}"
            inputs = self.tts_tokenizer(input_text, return_tensors="pt")
            inputs = {k: v.to(self.tts_model.device) for k, v in inputs.items()}

            # Generate speech
            with torch.no_grad():
                speech = self.tts_model.generate(**inputs)

            # Convert to a float32 numpy array (soundfile cannot write float16)
            speech = speech.float().cpu().numpy()

            # Keep a single mono waveform
            if speech.ndim > 1:
                speech = speech[0]

            # Write to a temporary directory that is removed once playback finishes
            with tempfile.TemporaryDirectory() as tmp_dir:
                temp_path = f"{tmp_dir}/reply.wav"
                sf.write(temp_path, speech, 24000)

                # Play the audio
                audio = AudioSegment.from_wav(temp_path)
                play(audio)
            return True
        except Exception as e:
            print(f"Speech synthesis failed: {e}")
            return False

    def chat(self, user_audio_path):
        """Run one full dialogue turn."""
        print("Listening...")

        # 1. Listen: speech to text
        user_text = self.listen(user_audio_path)
        if not user_text:
            print("No usable speech recognized")
            return
        print(f"You said: {user_text}")

        # 2. Think: generate the reply
        response_text = self.think(user_text)
        print(f"Reply: {response_text}")

        # 3. Speak: text to speech
        print("Generating the spoken reply...")
        self.speak(response_text)

# Usage example
# chat_system = SimpleVoiceChat()
# chat_system.chat("your_audio_file.wav")

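4.3 Adding a large language model

# Uses the openai>=1.0 client; the older openai.ChatCompletion interface was removed in 1.0
from openai import OpenAI

class SmartVoiceChat(SimpleVoiceChat):
    def __init__(self, openai_api_key, **kwargs):
        """Set up the smart voice chat system."""
        super().__init__(**kwargs)
        # Create the OpenAI client with the API key
        self.client = OpenAI(api_key=openai_api_key)
        # Conversation history
        self.conversation_history = []

    def think(self, user_text):
        """Think: generate the reply with GPT."""
        # Add the user input to the history
        self.conversation_history.append({"role": "user", "content": user_text})
        try:
            # Call the chat completions API
            response = self.client.chat.completions.create(
                model="gpt-3.5-turbo",  # or "gpt-4"
                messages=[
                    # System prompt: "You are a friendly voice assistant; keep answers
                    # short and natural so they work well when spoken."
                    {"role": "system", "content": "你是一个友好的语音助手，回答要简洁自然，适合用语音表达。"}
                ] + self.conversation_history[-6:],  # keep only the 6 most recent messages
                max_tokens=150,
                temperature=0.7
            )

            # Extract the reply
            assistant_reply = response.choices[0].message.content

            # Add the assistant reply to the history
            self.conversation_history.append({"role": "assistant", "content": assistant_reply})
            return assistant_reply
        except Exception as e:
            print(f"GPT call failed: {e}")
            # Spoken fallback: "Sorry, I can't handle your request right now."
            return "抱歉，我现在无法处理你的请求。"

    def reset_conversation(self):
        """Reset the conversation history."""
        self.conversation_history = []
        print("Conversation history reset")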
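
The `conversation_history[-6:]` slice keeps the request small, but when the history holds an odd number of messages the window opens with an assistant reply whose question has already been cut off. A slightly more careful trimming step is sketched below; the function name `build_messages` and the `max_history` parameter are hypothetical, and the sketch only builds the message list without calling any API.

```python
def build_messages(system_prompt, history, max_history=6):
    """Return the system prompt followed by a recent window of the history."""
    recent = list(history[-max_history:]) if max_history > 0 else []
    # A plain slice can start on an assistant reply whose question was dropped;
    # discard it so the window always opens with a user message.
    while recent and recent[0]["role"] != "user":
        recent.pop(0)
    return [{"role": "system", "content": system_prompt}] + recent


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": "hello"},
        {"role": "user", "content": "what can you do?"},
    ]
    for message in build_messages("You are a voice assistant.", history, max_history=2):
        print(message)
```

The result could be passed as the `messages` argument in `think`; whether the extra care matters in practice depends on how long conversations run.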

4.4 Creating a web interface (optional)

import gradio as gr

def create_web_interface():
    """Create the web interface."""
    # Set up the chat system (the simple version here; the smart version also works)
    chat_system = SimpleVoiceChat()

    def process_audio(audio_file):
        """Handle an uploaded audio file."""
        if audio_file is None:
            return "Please upload an audio file", None

        # Run recognition
        user_text = chat_system.listen(audio_file)
        if not user_text:
            return "No speech recognized", None

        # Generate the reply (kept simple here; call chat_system.think for real use)
        response = f"我听到你说：{user_text}"  # "I heard you say: ..."

        # Synthesize the spoken reply
        input_text = f"中文|友好的助手|{response}"
        inputs = chat_system.tts_tokenizer(input_text, return_tensors="pt")
        inputs = {k: v.to(chat_system.tts_model.device) for k, v in inputs.items()}
        with torch.no_grad():
            speech = chat_system.tts_model.generate(**inputs)
        speech = speech.float().cpu().numpy()
        if speech.ndim > 1:
            speech = speech[0]

        # Reserve a temp path and close the handle before writing;
        # the file is kept because Gradio reads it after this function returns
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
            temp_path = tmp.name
        sf.write(temp_path, speech, 24000)

        return response, temp_path

    # Build the interface
    interface = gr.Interface(
        fn=process_audio,
        inputs=gr.Audio(type="filepath", label="Upload your speech"),
        outputs=[
            gr.Textbox(label="Recognition result"),
            gr.Audio(label="Spoken reply", type="filepath")
        ],
        title="Voice Dialogue System",
        description="Upload an audio file; the system transcribes it and replies"
    )
    return interface

# Launch the web interface
# interface = create_web_interface()
# interface.launch(share=True)
# share=True creates a public link that anyone with the URL can open

5. Optimization tips and common issues

5.1 Performance optimization

# Load the model with 8-bit quantization (relies on the bitsandbytes package)
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,  # 8-bit quantization
    llm_int8_threshold=6.0
)
model = AutoModelForTextToSpeech.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)

# Choosing a Whisper model size
model_sizes = {
    "fastest": "tiny",
    "balanced": "base",
    "accurate": "small",
    "high quality": "medium",
    "best quality": "large"
}
# Pick according to your needs
selected_size = model_sizes["balanced"]
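
When the choice is driven by GPU memory rather than preference, the size can be picked from a rough memory budget. The sketch below uses the approximate VRAM figures listed in the openai/whisper README (about 1 GB for tiny and base, 2 GB for small, 5 GB for medium, 10 GB for large); treat them as rough guides rather than guarantees, since actual use depends on precision and audio length. The function name and the `reserve_gb` parameter are hypothetical.

```python
# Approximate VRAM needs in GB, ordered from smallest to largest model.
APPROX_VRAM_GB = [("tiny", 1), ("base", 1), ("small", 2), ("medium", 5), ("large", 10)]


def largest_model_that_fits(available_gb, reserve_gb=0.0):
    """Return the largest Whisper size whose rough VRAM need fits the budget, or None."""
    budget = available_gb - reserve_gb
    choice = None
    for name, needed_gb in APPROX_VRAM_GB:
        if needed_gb <= budget:
            choice = name  # later entries are larger, so keep overwriting
    return choice


if __name__ == "__main__":
    # Reserve part of an 8 GB card for the TTS model that shares the GPU.
    print(largest_model_that_fits(8, reserve_gb=4))
```

The `reserve_gb` argument is there because Whisper shares the GPU with Qwen3-TTS in this system, so the full card is rarely available to the recognizer.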

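def batch_speech_to_text(audio_paths):
    """Transcribe several audio files in one batch."""
    audios = []
    for path in audio_paths:
        audio, sr = librosa.load(path, sr=16000)
        audios.append(audio)

    # Batch feature extraction; by default the feature extractor pads every
    # clip to Whisper's fixed 30-second window, so no padding argument is needed
    input_features = processor(
        audios, sampling_rate=16000, return_tensors="pt"
    ).input_features

    # Batch generation with the Whisper model (asr_model, not the TTS model)
    input_features = input_features.to(asr_model.device)
    predicted_ids = asr_model.generate(input_features)
    transcriptions = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    return transcriptions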
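
Passing every file to `batch_speech_to_text` at once can exhaust memory on long lists, so a common pattern is to feed it fixed-size groups. A minimal sketch of that grouping step follows; `iter_batches` is a hypothetical helper and the batch size of 8 in the usage comment is an arbitrary starting point, not a tuned value.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical usage with the function above:
# results = []
# for batch in iter_batches(all_audio_paths, 8):
#     results.extend(batch_speech_to_text(batch))

if __name__ == "__main__":
    for batch in iter_batches(["a.wav", "b.wav", "c.wav"], 2):
        print(batch)
```

Lowering the batch size is usually the first thing to try when an out-of-memory error appears during batch transcription.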
