Building a Two-Way Voice Dialogue System with Qwen3-TTS and Whisper ASR | 极客日志


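AI-generated summary: This article explains how to build a two-way voice dialogue system with Qwen3-TTS and Whisper ASR. It first sets up the Python environment and installs dependencies, then deploys the text-to-speech (TTS) and speech recognition (ASR) models in turn. Code handles loading, converting, and playing audio files, and the two models are combined into a complete listen-speak loop. It also shows how to connect a large language model for smarter replies, how to create a web interface, and how to optimize performance. Detailed code examples and solutions to common problems are provided for developers who want to get started quickly with voice interaction.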

RustyLab | Published 2026/4/5 | Updated 2026/5/2539 views

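This article shows how to build a two-way voice dialogue system from two open-source models, Qwen3-TTS and Whisper ASR. It starts with basic environment setup and works up to a complete listen (ASR) and speak (TTS) loop, and is aimed at developers who want to add voice interaction to their applications.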

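1. Preparation and Environment Setup

1.1 Basic Environment Requirements

Operating system: Linux (e.g. Ubuntu 20.04+) or macOS is recommended. Windows also works but needs extra configuration.
Python version: 3.8 or higher.
Memory: at least 16 GB RAM; 32 GB or more is recommended.
Storage: reserve 10-20 GB.
GPU: an NVIDIA GPU with at least 8 GB of VRAM is recommended to speed up inference.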
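The requirements above can be checked from Python before anything is downloaded. The following is a minimal sketch, not part of the original article: the function name `check_environment` and the report keys are illustrative choices, and the GPU fields are only filled in when PyTorch is already installed.

```python
import platform
import sys


def check_environment(min_python=(3, 8)):
    """Collect basic facts about the current machine (hypothetical helper)."""
    report = {
        "os": platform.system(),
        "python": sys.version_info[:3],
        "python_ok": sys.version_info[:2] >= min_python,
        "cuda_available": None,  # None means PyTorch is not installed yet
        "gpu_memory_gb": None,
    }
    try:
        import torch
    except ImportError:
        return report

    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # total_memory is reported in bytes for the first visible GPU
        props = torch.cuda.get_device_properties(0)
        report["gpu_memory_gb"] = round(props.total_memory / 1024**3, 1)
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Running the script prints one line per check, so a missing GPU or an outdated interpreter shows up before the larger model downloads begin.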

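1.2 Installing Dependencies

Create a separate virtual environment to avoid dependency conflicts:

# Create a new virtual environment
python -m venv voice_chat_env
# Activate it (Linux/macOS)
source voice_chat_env/bin/activate
# Activate it (Windows)
voice_chat_env\Scripts\activate

Once the environment is active, install the core dependencies:

pip install --upgrade pip
# The cu118 index installs the CUDA 11.8 build of PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers soundfile librosa pydub gradio streamlit
# Also used by later code: accelerate (device_map="auto"), openai (LLM step), bitsandbytes (8-bit loading)
pip install accelerate openai bitsandbytes

If you run into network problems, use a mirror:

pip install transformers -i https://pypi.tuna.tsinghua.edu.cn/simple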
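After installation it can help to confirm which packages actually landed in the active environment. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_versions` and the `REQUIRED` list are illustrative and simply mirror the packages named in the install commands.

```python
from importlib import metadata

# Distribution names taken from the pip commands in this section
REQUIRED = ["torch", "transformers", "soundfile", "librosa", "pydub"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as missing can be reinstalled individually, with the mirror option if needed.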

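2. Deploying and Trying Out the Qwen3-TTS Model

2.1 Downloading and Loading the Qwen3-TTS Model

# Note: the AutoModelForTextToSpeech class name is used here as given in this article.
# If your transformers version does not export it, follow the loading steps on the model card.
from transformers import AutoModelForTextToSpeech, AutoTokenizer
import torch

model_name = "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"
print("Loading the Qwen3-TTS model...")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForTextToSpeech.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision to reduce GPU memory use
    device_map="auto",          # let accelerate place the weights on available devices
    trust_remote_code=True
)
print("Model loaded!")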

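2.2 Speech Synthesis Test

def text_to_speech(text, language="中文", voice="友好的助手"):
    # Prompt layout "language|voice|text", matching SimpleVoiceChat.speak later in this article
    input_text = f"{language}|{voice}|{text}"
    inputs = tokenizer(input_text, return_tensors="pt")
    if torch.cuda.is_available():
        inputs = {k: v.cuda() for k, v in inputs.items()}
    with torch.no_grad():
        speech = model.generate(**inputs)
        speech = speech.cpu().numpy()
    return speech

# Sample sentence (placeholder); replace it with your own text
test_text = "你好，这是一次语音合成测试。"
audio_data = text_to_speech(test_text)
print("Speech synthesis finished")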

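2.3 Saving and Playing the Generated Speech

import soundfile as sf
import numpy as np

def save_audio(audio_data, filename="output.wav", sample_rate=24000):
    # soundfile cannot write float16, so convert the half-precision output first
    audio_data = np.asarray(audio_data, dtype=np.float32)
    # Keep only the first item when the array has a batch dimension
    if audio_data.ndim > 1:
        audio_data = audio_data[0]
    # Rescale to [-1.0, 1.0] if any sample falls outside that range
    peak = np.max(np.abs(audio_data))
    if peak > 1.0:
        audio_data = audio_data / peak
    # 24000 Hz is the rate used throughout this article; confirm it against the model's config
    sf.write(filename, audio_data, sample_rate)

save_audio(audio_data, "first_speech.wav")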
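To see what the peak normalization in `save_audio` does without loading any model, the same idea can be tried on a synthetic tone. This is a standard-library-only sketch; `peak_normalize`, `write_wav_pcm16`, and `wav_duration_seconds` are hypothetical helper names, and the 24000 Hz rate is carried over from the code above as an assumption.

```python
import math
import os
import struct
import tempfile
import wave


def peak_normalize(samples):
    """Scale samples so the largest absolute value is 1.0, only if it exceeds 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak <= 1.0:
        return list(samples)
    return [s / peak for s in samples]


def write_wav_pcm16(path, samples, sample_rate=24000):
    """Write mono float samples in [-1, 1] as 16-bit PCM."""
    frames = b"".join(struct.pack("<h", int(round(s * 32767))) for s in samples)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(frames)


def wav_duration_seconds(path):
    """Read the file back and compute its length from the frame count."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# A 0.5 s, 440 Hz tone that is deliberately too loud (amplitude 2.0)
tone = [2.0 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(12000)]
normalized = peak_normalize(tone)

path = os.path.join(tempfile.gettempdir(), "normalized_tone.wav")
write_wav_pcm16(path, normalized)
print(f"wrote {path}, duration {wav_duration_seconds(path)} s")
```

Without the normalization step, samples outside [-1, 1] would overflow or clip when converted to 16-bit integers, which is audible as distortion.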

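3. Deploying and Testing the Whisper ASR Model

3.1 Loading the Whisper Model

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# "base" balances speed and accuracy; other checkpoints include tiny, small, medium, and large-v3
model_size = "base"
processor = WhisperProcessor.from_pretrained(f"openai/whisper-{model_size}")
asr_model = WhisperForConditionalGeneration.from_pretrained(f"openai/whisper-{model_size}")
if torch.cuda.is_available():
    asr_model = asr_model.cuda()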

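3.2 Implementing Speech Recognition

def speech_to_text(audio_path, language="zh"):
    # Whisper expects 16 kHz mono audio; librosa resamples on load
    audio, sr = librosa.load(audio_path, sr=16000)
    input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features
    if torch.cuda.is_available():
        input_features = input_features.cuda()
    # Pass the language explicitly so Whisper does not have to auto-detect it
    with torch.no_grad():
        predicted_ids = asr_model.generate(input_features, language=language, task="transcribe")
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription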
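Whisper works on fixed 30-second windows, and with default settings the processor keeps only the first window, so a function like `speech_to_text` will miss anything after the 30-second mark of a long recording. One common workaround is to split the waveform first and transcribe each piece. The sketch below shows only the splitting step; `split_into_chunks` and its parameters are hypothetical names, and it assumes the 16 kHz sample rate used above.

```python
def split_into_chunks(samples, sample_rate=16000, chunk_seconds=30.0, overlap_seconds=0.0):
    """Split a 1-D sequence of samples into windows of at most chunk_seconds.

    Works on lists and NumPy arrays alike because it only uses slicing.
    """
    chunk_len = int(chunk_seconds * sample_rate)
    step = chunk_len - int(overlap_seconds * sample_rate)
    if chunk_len <= 0 or step <= 0:
        raise ValueError("chunk_seconds must be positive and larger than overlap_seconds")

    chunks = []
    for start in range(0, len(samples), step):
        chunks.append(samples[start:start + chunk_len])
        # Stop once a window reaches the end of the recording
        if start + chunk_len >= len(samples):
            break
    return chunks


# 70 seconds of silence at 16 kHz splits into 30 s + 30 s + 10 s
demo = [0.0] * (16000 * 70)
demo_chunks = split_into_chunks(demo)
print([len(c) / 16000 for c in demo_chunks])
```

Each chunk can then go through the same processor and `generate` call, with the resulting texts joined in order. A small overlap reduces the chance of cutting a word in half, at the cost of possible duplicated words at the seams.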

4. Building the Two-Way Voice Dialogue System

4.2 Basic Dialogue System Implementation

# Reuses torch, sf, librosa, and the transformers classes imported in the earlier sections
import os
import tempfile
from pydub import AudioSegment
from pydub.playback import play

class SimpleVoiceChat:
    def __init__(self, tts_model_name="Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign", asr_model_size="base"):
        # Text-to-speech model (the "speak" side)
        self.tts_tokenizer = AutoTokenizer.from_pretrained(tts_model_name, trust_remote_code=True)
        self.tts_model = AutoModelForTextToSpeech.from_pretrained(tts_model_name, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)
        # Speech recognition model (the "listen" side)
        self.asr_processor = WhisperProcessor.from_pretrained(f"openai/whisper-{asr_model_size}")
        self.asr_model = WhisperForConditionalGeneration.from_pretrained(f"openai/whisper-{asr_model_size}")
        if torch.cuda.is_available():
            self.asr_model = self.asr_model.cuda()

    def listen(self, audio_path):
        """Transcribe an audio file to text; returns "" on failure."""
        try:
            audio, sr = librosa.load(audio_path, sr=16000)
            input_features = self.asr_processor(audio, sampling_rate=sr, return_tensors="pt").input_features
            if torch.cuda.is_available():
                input_features = input_features.cuda()
            with torch.no_grad():
                predicted_ids = self.asr_model.generate(input_features)
            text = self.asr_processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
            return text
        except Exception as e:
            print(f"Speech recognition failed: {e}")
            return ""

    def think(self, user_text):
        """Placeholder logic: echo the user's words back."""
        response = f"你说的是：{user_text}"  # "You said: ..."
        return response

    def speak(self, text, language="中文", voice="友好的助手"):
        """Synthesize text and play it; returns True on success."""
        try:
            input_text = f"{language}|{voice}|{text}"
            inputs = self.tts_tokenizer(input_text, return_tensors="pt")
            if torch.cuda.is_available():
                inputs = {k: v.cuda() for k, v in inputs.items()}
            with torch.no_grad():
                speech = self.tts_model.generate(**inputs)
            # Convert to float32 because soundfile cannot write float16
            speech = speech.float().cpu().numpy()
            if speech.ndim > 1:
                speech = speech[0]
            # Close the temp file before writing so the path can be reopened on Windows
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
                temp_path = tmp.name
            try:
                sf.write(temp_path, speech, 24000)
                play(AudioSegment.from_wav(temp_path))
            finally:
                os.remove(temp_path)  # do not leave temp files behind
            return True
        except Exception as e:
            print(f"Speech synthesis failed: {e}")
            return False
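The three methods of `SimpleVoiceChat` form one conversational turn: listen, then think, then speak. A minimal sketch of that loop is shown here with the three steps passed in as plain callables, so the control flow can be exercised with stubs before any model is loaded. `TurnResult` and `run_turn` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TurnResult:
    heard: str    # text recognized from the user's audio
    reply: str    # text produced by the dialogue logic
    spoken: bool  # whether playback succeeded


def run_turn(
    listen: Callable[[str], str],
    think: Callable[[str], str],
    speak: Callable[[str], bool],
    audio_path: str,
) -> TurnResult:
    """Run one listen -> think -> speak cycle."""
    heard = listen(audio_path)
    # Skip the rest of the turn when nothing usable was recognized
    if not heard.strip():
        return TurnResult(heard="", reply="", spoken=False)
    reply = think(heard)
    spoken = speak(reply)
    return TurnResult(heard=heard, reply=reply, spoken=spoken)


# Stub components stand in for the real models
result = run_turn(
    listen=lambda path: "hello",
    think=lambda text: f"You said: {text}",
    speak=lambda text: True,
    audio_path="question.wav",  # hypothetical file name; the stub never opens it
)
print(result)
```

With real models, the same function could be called as `run_turn(chat.listen, chat.think, chat.speak, path)` for a `SimpleVoiceChat` instance named `chat`. Keeping the loop separate from the models makes it easy to swap in a different `think` step.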

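4.3 Adding a Large Language Model for Smarter Replies

from openai import OpenAI

class SmartVoiceChat(SimpleVoiceChat):
    def __init__(self, openai_api_key, **kwargs):
        super().__init__(**kwargs)
        # openai>=1.0 uses a client object; the older openai.ChatCompletion interface was removed
        self.client = OpenAI(api_key=openai_api_key)
        self.conversation_history = []

    def think(self, user_text):
        self.conversation_history.append({"role": "user", "content": user_text})
        try:
            response = self.client.chat.completions.create(
                model="gpt-3.5-turbo",
                # System prompt ("You are a friendly voice assistant.") plus the last 6 messages
                messages=[{"role": "system", "content": "你是一个友好的语音助手。"}] + self.conversation_history[-6:],
                max_tokens=150  # short replies keep spoken answers brief
            )
            assistant_reply = response.choices[0].message.content
            self.conversation_history.append({"role": "assistant", "content": assistant_reply})
            return assistant_reply
        except Exception as e:
            print(f"LLM request failed: {e}")
            # Spoken fallback: "Sorry, I can't handle your request right now."
            return "抱歉，我现在无法处理你的请求。"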
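`SmartVoiceChat` sends only the last six messages to the model, but `conversation_history` itself keeps growing for as long as the session runs. One possible alternative is a fixed-size buffer that discards old messages automatically. The sketch below uses `collections.deque`; the class name `BoundedHistory` and its methods are hypothetical, and the six-message window is carried over from the code above.

```python
from collections import deque


class BoundedHistory:
    """Keep a system prompt plus the most recent chat messages."""

    def __init__(self, max_messages=6, system_prompt="You are a friendly voice assistant."):
        self.system_message = {"role": "system", "content": system_prompt}
        # deque with maxlen drops the oldest entry when a new one is appended
        self.messages = deque(maxlen=max_messages)

    def add(self, role, content):
        if role not in ("user", "assistant"):
            raise ValueError(f"unexpected role: {role}")
        self.messages.append({"role": role, "content": content})

    def as_request_messages(self):
        """Return the list to pass as `messages` in a chat completion request."""
        return [self.system_message] + list(self.messages)


history = BoundedHistory(max_messages=6)
for i in range(8):
    history.add("user" if i % 2 == 0 else "assistant", f"m{i}")

payload = history.as_request_messages()
print(len(payload), payload[1]["content"])
```

The system message is stored outside the deque so it is never evicted, while memory use stays constant no matter how long the conversation lasts.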

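5. Optimization Tips and Common Issues

5.1 Performance Optimization Suggestions

from transformers import BitsAndBytesConfig

# 8-bit weight loading needs the bitsandbytes package and a CUDA GPU
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
# Pass it at load time: from_pretrained(..., quantization_config=quantization_config)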
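A rough calculation shows why 8-bit loading helps on the 8 GB GPUs mentioned in the requirements. The sketch below estimates memory for the weights alone, assuming the "1.7B" in the model name means about 1.7 billion parameters; activations, caches, and framework overhead come on top, so real usage will be higher. `estimate_weight_memory_gb` is a hypothetical helper.

```python
def estimate_weight_memory_gb(num_parameters, bits_per_parameter):
    """Memory needed to hold the weights only, in decimal gigabytes."""
    return num_parameters * bits_per_parameter / 8 / 1e9


params = 1.7e9  # assumption: taken from the "1.7B" in the model name

fp16_gb = estimate_weight_memory_gb(params, 16)
int8_gb = estimate_weight_memory_gb(params, 8)

print(f"float16 weights: about {fp16_gb:.1f} GB")
print(f"int8 weights:    about {int8_gb:.1f} GB")
```

Halving the bits per parameter halves the weight footprint, which leaves more GPU memory for the Whisper model running alongside it.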

