多模态模型开发实战：文本、图像与语音的融合应用 | 极客日志

PythonAI算法

多模态模型开发实战：文本、图像与语音的融合应用

多模态模型融合文本、图像与语音数据，实现更全面的理解与生成。开发全流程，涵盖数据预处理与对齐技术，展示跨模态问答、文生图及语音助手三大实战场景，介绍 LLaVA、Stable Diffusion 与 Whisper 等主流模型选型与部署方案。结合 QLoRA 微调技巧解决行业专属数据适配问题，并提供显存优化与合规建议，助力开发者构建高效的多模态 AI 应用。

性能调优发布于 2026/4/10更新于 2026/7/1334 浏览

多模态模型开发实战：文本、图像与语音的融合应用

多模态架构示意图

导读

随着人工智能技术的发展，单一模态模型已难以满足复杂场景需求。多模态模型通过融合文本、图像、语音等多种数据，实现更全面的理解与生成，成为当前 AI 领域的核心方向。本文将带你掌握多模态模型的核心概念、主流框架及开发流程，涵盖数据预处理、模型选型、训练微调到部署落地的全链路实践。

重点关注多模态数据的对齐与预处理、显存优化、生成内容的一致性，以及不同部署场景下的性能适配。

一、多模态模型基础：概念、技术与生态

1.1 核心概念与关键术语

**模态（Modality）**是数据的存在形式，常见类型包括：

文本模态：自然语言文本（新闻、对话、文档）；
视觉模态：图像、视频；
语音模态：语音信号、环境音；
其他模态：触觉、传感器、表格数据等。

多模态任务分类主要分为跨模态理解和跨模态生成：

任务类型	核心目标	典型场景
跨模态理解	联合分析多种模态，输出结构化结果	图文检索、跨模态问答、图像描述生成
跨模态生成	根据一种或多种模态输入，生成另一种模态输出	文生图、TTS、多模态对话

关键技术术语：

模态对齐：将不同模态映射到统一特征空间，确保语义一致。
特征融合：组合不同模态特征，生成更具表达力的联合特征。
跨模态注意力：让一种模态关注另一种模态的关键信息。
自监督预训练：在大规模无标注数据上学习通用特征表示。

1.2 主流多模态模型架构

工业界与学术界主要基于 Transformer 架构演变，核心包括三类：

统一编码器架构（Unified Encoder）
- 原理：所有模态转换为统一维度特征序列，共享 Transformer 编码。
- 代表：CLIP、ALBEF、FLAVA。
- 适用：跨模态理解任务（如图文检索），生成能力较弱。
编码器 - 解码器架构（Encoder-Decoder）
- 原理：编码器处理输入，解码器生成目标，通过跨模态注意力传递信息。
- 代表：DALL·E、Stable Diffusion、Whisper。
- 适用：生成类任务，结构复杂，资源消耗较高。
混合架构（Hybrid Architecture）
- 原理：结合两者优势，支持理解与生成。

相关免费在线工具

加密/解密文本
使用加密算法（如AES、TripleDES、Rabbit或RC4）加密和解密文本明文。在线工具，加密/解密文本在线工具，online
RSA密钥对生成器
生成新的随机RSA私钥和公钥pem证书。在线工具，RSA密钥对生成器在线工具，online
Mermaid 预览与可视化编辑
基于 Mermaid.js 实时预览流程图、时序图等图表，支持源码编辑与即时渲染。在线工具，Mermaid 预览与可视化编辑在线工具，online
随机西班牙地址生成器
随机生成西班牙地址（支持马德里、加泰罗尼亚、安达卢西亚、瓦伦西亚筛选），支持数量快捷选择、显示全部与下载。在线工具，随机西班牙地址生成器在线工具，online
Gemini 图片去水印
基于开源反向 Alpha 混合算法去除 Gemini/Nano Banana 图片水印，支持批量处理与下载。在线工具，Gemini 图片去水印在线工具，online
curl 转代码
解析常见 curl 参数并生成 fetch、axios、PHP curl 或 Python requests 示例代码。在线工具，curl 转代码在线工具，online

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def preprocess_text(texts, max_seq_len=77):
    """
    文本预处理：Tokenization + 截断/填充
    :param texts: 文本列表
    :param max_seq_len: 最大序列长度
    :return: 预处理后的张量
    """
    inputs = tokenizer(
        texts,
        padding="max_length",
        truncation=True,
        max_length=max_seq_len,
        return_tensors="pt"
    )
    return {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]}

test_texts = ["一只坐在草地上的橘猫", "A red sports car on the road"]
text_features = preprocess_text(test_texts)
print(f"文本 Token ID 形状：{text_features['input_ids'].shape}")

from transformers import CLIPImageProcessor
from PIL import Image
import os

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

def preprocess_image(image_paths, target_size=(224, 224)):
    """
    图像预处理：加载 + 缩放 + 归一化 + 维度转换
    :param image_paths: 图像路径列表
    :param target_size: 目标尺寸
    :return: 预处理后的图像张量
    """
    images = []
    for path in image_paths:
        img = Image.open(path).convert("RGB")
        images.append(img)
    
    inputs = image_processor(
        images,
        resize_size=target_size,
        crop_size=target_size,
        normalize={"mean": [0.48145466, 0.4578275, 0.40821073], "std": [0.26862954, 0.26130258, 0.27577711]},
        return_tensors="pt"
    )
    return inputs["pixel_values"]

test_image_paths = ["./images/cat.jpg", "./images/car.jpg"]
image_features = preprocess_image(test_image_paths)
print(f"预处理后图像张量形状：{image_features.shape}")

import json
from datasets import Dataset

def load_image_text_dataset(data_path, image_dir):
    """
    加载并过滤文本 - 图像配对数据集
    """
    samples = []
    with open(data_path, "r", encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            image_path = os.path.join(image_dir, sample["image_filename"])
            if not os.path.exists(image_path):
                continue
            if len(sample["text"].strip()) < 5:
                continue
            sample["image_path"] = image_path
            samples.append(sample)
    
    dataset = Dataset.from_list(samples)
    dataset = dataset.train_test_split(test_size=0.1, seed=42)
    return dataset["train"], dataset["test"]

train_dataset, val_dataset = load_image_text_dataset("image_text_pairs.jsonl", "./images")
print(f"训练集样本数：{len(train_dataset)}，验证集样本数：{len(val_dataset)}")

import librosa
import numpy as np
import torch

def preprocess_audio(audio_paths, sample_rate=16000, n_mels=80, max_length=3000):
    """
    语音预处理：加载 + 重采样 + 降噪 + 梅尔频谱提取
    """
    features = []
    for path in audio_paths:
        y, sr = librosa.load(path, sr=sample_rate)
        y, _ = librosa.effects.trim(y, top_db=20)
        mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, fmax=8000)
        log_mel_spec = librosa.power_to_db(mel_spec, ref=np.max)
        
        seq_len = log_mel_spec.shape[1]
        if seq_len > max_length:
            log_mel_spec = log_mel_spec[:, :max_length]
        else:
            pad_len = max_length - seq_len
            log_mel_spec = np.pad(log_mel_spec, ((0, 0), (0, pad_len)), mode="constant")
        
        features.append(torch.tensor(log_mel_spec, dtype=torch.float32))
    return torch.stack(features)

test_audio_paths = ["./audios/speech1.wav", "./audios/speech2.mp3"]
audio_features = preprocess_audio(test_audio_paths)
print(f"梅尔频谱特征形状：{audio_features.shape}")

from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer

def align_text_audio(text, audio_path, sample_rate=16000):
    """
    文本 - 语音时序对齐：获取文本每个 Token 对应的语音时间戳
    """
    align_model_name = "facebook/wav2vec2-base-960h"
    align_tokenizer = Wav2Vec2Tokenizer.from_pretrained(align_model_name)
    align_model = Wav2Vec2ForCTC.from_pretrained(align_model_name).to("cuda")
    
    y, sr = librosa.load(audio_path, sr=sample_rate)
    text = text.lower().replace(",", "").replace(".", "").replace("?", "")
    
    inputs = align_tokenizer(y, sampling_rate=sr, return_tensors="pt", padding=True).to("cuda")
    
    with torch.no_grad():
        outputs = align_model(**inputs, output_hidden_states=True, return_dict=True)
        alignment_paths = align_model.wav2vec2.ctc_decoder.align(
            outputs.logits, align_tokenizer(text, return_tensors="pt")["input_ids"].to("cuda")
        )
    
    alignments = alignment_paths[0].alignments
    token_times = []
    frame_duration = 1 / sr
    downsample_rate = align_model.config.conv_stride[-1] * align_model.config.conv_kernel[-1]
    
    for token_idx, (frame_start, frame_end) in enumerate(alignments):
        orig_start_frame = frame_start * downsample_rate
        orig_end_frame = frame_end * downsample_rate
        start_time = orig_start_frame * frame_duration
        end_time = orig_end_frame * frame_duration
        token = align_tokenizer.convert_ids_to_tokens([token_idx])[0]
        
        if token != "<pad>" and token != "<s>" and token != "</s>":
            token_times.append({"token": token, "start_time": round(start_time, 3), "end_time": round(end_time, 3)})
    return token_times

test_text = "Hello, this is a speech recognition test."
test_audio = "./audios/english_speech.wav"
alignment_result = align_text_audio(test_text, test_audio)
print("文本 - 语音对齐结果：")
for item in alignment_result:
    print(f"Token: {item['token']}, 时间戳：{item['start_time']:.3f} - {item['end_time']:.3f}s")

from transformers import LlavaProcessor, LlavaForConditionalGeneration
import torch

model_name = "liuhaotian/LLaVA-7B-v1.5"
processor = LlavaProcessor.from_pretrained(model_name)
model = LlavaForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.float16, load_in_8bit=True, device_map="auto", trust_remote_code=True
)
print("模型加载完成，显存占用：", torch.cuda.memory_allocated() / 1024 / 1024 / 1024, "GB")

from PIL import Image

def multimodal_qa(image_path, question, max_new_tokens=200, temperature=0.3):
    """
    跨模态问答：输入图像和问题，生成回答
    """
    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=question, images=image, return_tensors="pt", padding=True, truncation=True
    ).to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs, max_new_tokens=max_new_tokens, temperature=temperature,
            top_p=0.9, do_sample=True, pad_token_id=processor.tokenizer.eos_token_id
        )
    answer = processor.decode(outputs[0], skip_special_tokens=True)
    answer = answer.split("ASSISTANT:")[-1].strip()
    return answer

test_image_path = "./images/phone_screenshot.jpg"
test_question = "这张截图显示的是什么手机型号？系统版本是多少？"
answer = multimodal_qa(test_image_path, test_question)
print(f"问题：{test_question}")
print(f"回答：{answer}")

from fastapi import FastAPI, UploadFile, File, Query
from fastapi.responses import JSONResponse, HTMLResponse
from fastapi.staticfiles import StaticFiles
import uvicorn
import os
from datetime import datetime

app = FastAPI(title="跨模态问答系统", version="1.0")
UPLOAD_DIR = "./uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)
app.mount("/uploads", StaticFiles(directory=UPLOAD_DIR), name="uploads")

@app.get("/", response_class=HTMLResponse)
async def index():
    html_content = """
    <!DOCTYPE html>
    <html>
    <head><title>跨模态问答系统</title></head>
    <body>
        <h1>跨模态问答系统（图像 + 文本）</h1>
        <div>
            <input type="file" accept="image/*"><br>
            <input type="text" placeholder="请输入你的问题"><br><br>
            <button onclick="submitQuery()">提交查询</button>
            <div id="result"></div>
        </div>
        <script>
            async function submitQuery() {
                const fileInput = document.getElementById("imageUpload");
                const questionInput = document.getElementById("question");
                const resultDiv = document.getElementById("result");
                if (!fileInput.files[0] || !questionInput.value) {
                    resultDiv.innerHTML = "请上传图像并输入问题！";
                    return;
                }
                const formData = new FormData();
                formData.append("image", fileInput.files[0]);
                formData.append("question", questionInput.value);
                try {
                    const response = await fetch("/qa", { method: "POST", body: formData });
                    const data = await response.json();
                    if (data.status === "success") {
                        resultDiv.innerHTML = `<strong>问题：</strong>${data.question}<br><strong>回答：</strong>${data.answer}`;
                    } else {
                        resultDiv.innerHTML = `<strong>错误：</strong>${data.message}`;
                    }
                } catch (error) {
                    resultDiv.innerHTML = "处理失败，请重试！";
                }
            }
        </script>
    </body>
    </html>
    """
    return HTMLResponse(content=html_content)

@app.post("/qa", summary="跨模态问答")
async def qa_endpoint(image: UploadFile = File(...), question: str = Query(...)):
    try:
        timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
        image_filename = f"{timestamp}_{image.filename}"
        image_path = os.path.join(UPLOAD_DIR, image_filename)
        with open(image_path, "wb") as f:
            f.write(await image.read())
        answer = multimodal_qa(image_path, question)
        return JSONResponse(content={"status": "success", "question": question, "answer": answer, "image_url": f"/uploads/{image_filename}"})
    except Exception as e:
        return JSONResponse(content={"status": "error", "message": str(e)}, status_code=500)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000, workers=1)

from transformers import StableDiffusionPipeline
import torch

model_name = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(
    model_name, torch_dtype=torch.float16, use_safetensors=True, device_map="auto"
)
pipe.safety_checker = None
pipe.requires_safety_checker = False
pipe.enable_attention_slicing()
pipe.enable_xformers_memory_efficient_attention()
print("Stable Diffusion 模型加载完成")

from PIL import Image
import os
from datetime import datetime

def text_to_image(prompt, negative_prompt="low quality, blurry, ugly, deformed, watermark", image_size=(512, 512), num_inference_steps=50, guidance_scale=7.5, num_images=1, output_dir="./generated_images"):
    os.makedirs(output_dir, exist_ok=True)
    with torch.no_grad():
        images = pipe(
            prompt=prompt, negative_prompt=negative_prompt, height=image_size[0], width=image_size[1],
            num_inference_steps=num_inference_steps, guidance_scale=guidance_scale, num_images_per_prompt=num_images
        ).images
    save_paths = []
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    for i, img in enumerate(images):
        img_filename = f"gen_{timestamp}_{i+1}.png"
        img_path = os.path.join(output_dir, img_filename)
        img.save(img_path)
        save_paths.append(img_path)
    return images, save_paths

test_prompt = "一片开满向日葵的田野，背景是蓝天白云，油画风格，高分辨率，细节丰富"
test_negative_prompt = "低质量，模糊，变形，水印，文字，暗沉"
generated_images, save_paths = text_to_image(
    prompt=test_prompt, negative_prompt=test_negative_prompt, image_size=(768, 512), num_inference_steps=75, guidance_scale=8.0, num_images=2
)
print(f"图像生成完成，保存路径：{save_paths}")

def optimize_prompt(raw_prompt, style="photorealistic", quality="high resolution"):
    style_templates = {
        "photorealistic": "photorealistic, ultra detailed, 8k, sharp focus, realistic lighting, cinematic",
        "油画": "oil painting style, thick brush strokes, vibrant colors, artistic, painterly",
        "卡通": "cartoon style, flat colors, clean lines, anime influence, cute",
        "水彩": "watercolor painting, soft colors, translucent, gentle brush strokes"
    }
    quality_desc = "high resolution, ultra detailed, sharp, no blur, no noise"
    optimized = f"{raw_prompt}, {style_templates.get(style, style)}, {quality_desc}, {quality}"
    return optimized

test_raw_prompt = "一只猫坐在窗边"
optimized_prompt = optimize_prompt(test_raw_prompt, style="油画", quality="8k")
print(f"原始提示词：{test_raw_prompt}")
print(f"优化后提示词：{optimized_prompt}")

import gradio as gr

def generate_image_interface(prompt, style, image_size, num_images):
    optimized_prompt = optimize_prompt(prompt, style=style)
    negative_prompt = "low quality, blurry, ugly, deformed, watermark, text, noise"
    images, _ = text_to_image(
        prompt=optimized_prompt, negative_prompt=negative_prompt, image_size=image_size,
        num_images=num_images, num_inference_steps=60, guidance_scale=7.5
    )
    return images

with gr.Blocks(title="文生图生成系统") as demo:
    gr.Markdown("# 文本生成图像系统（Stable Diffusion）")
    with gr.Row():
        with gr.Column(scale=1):
            prompt = gr.Textbox(label="文本描述", placeholder="请输入图像描述...", lines=3)
            style = gr.Dropdown(label="图像风格", choices=["photorealistic", "油画", "卡通", "水彩", "素描"], value="photorealistic")
            image_size = gr.Dropdown(label="图像尺寸", choices=[(512, 512), (768, 512), (1024, 768)], value=(512, 512))
            num_images = gr.Slider(label="生成数量", minimum=1, maximum=4, value=1, step=1)
            generate_btn = gr.Button("生成图像")
        with gr.Column(scale=2):
            output_images = gr.Gallery(label="生成结果", columns=2, height="auto")
    generate_btn.click(fn=generate_image_interface, inputs=[prompt, style, image_size, num_images], outputs=output_images)

demo.launch(server_name="0.0.0.0", server_port=7860, share=False)

from transformers import WhisperProcessor, WhisperForConditionalGeneration

asr_model_name = "openai/whisper-small"
asr_processor = WhisperProcessor.from_pretrained(asr_model_name)
asr_model = WhisperForConditionalGeneration.from_pretrained(asr_model_name, device_map="auto", torch_dtype=torch.float16)
asr_model.config.forced_decoder_ids = asr_processor.get_decoder_prompt_ids(language="zh", task="transcribe")

def speech_to_text(audio_path, sample_rate=16000):
    audio, sr = librosa.load(audio_path, sr=sample_rate)
    inputs = asr_processor(audio, sampling_rate=sr, return_tensors="pt", padding=True).to(asr_model.device)
    with torch.no_grad():
        outputs = asr_model.generate(**inputs, max_new_tokens=200)
    text = asr_processor.decode(outputs[0], skip_special_tokens=True)
    return text

test_audio_path = "./audios/chinese_speech.wav"
text = speech_to_text(test_audio_path)
print(f"语音转文字结果：{text}")

from transformers import AutoTokenizer, AutoModelForCausalLM

llm_model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
llm_tokenizer = AutoTokenizer.from_pretrained(llm_model_name)
llm_model = AutoModelForCausalLM.from_pretrained(llm_model_name, device_map="auto", load_in_8bit=True, trust_remote_code=True)
llm_tokenizer.pad_token = llm_tokenizer.eos_token

def generate_answer(text):
    prompt = f"""<<<|begin_of_solution|> 用户问题：{text} 回答要求：简洁明了，口语化，适合语音播报 回答：<<<|end_of_solution|>"""
    inputs = llm_tokenizer(prompt, return_tensors="pt").to(llm_model.device)
    with torch.no_grad():
        outputs = llm_model.generate(**inputs, max_new_tokens=150, temperature=0.4, top_p=0.9, pad_token_id=llm_tokenizer.eos_token_id)
    answer = llm_tokenizer.decode(outputs[0], skip_special_tokens=True)
    answer = answer.split("回答：")[-1].strip()
    return answer

test_text = "今天天气怎么样？推荐一个户外活动"
answer = generate_answer(test_text)
print(f"生成回答：{answer}")

from TTS.api import TTS

tts_model_name = "tts_models/zh-CN/baker/tacotron2-DDC_ph"
tts = TTS(tts_model_name, gpu=True)

def text_to_speech(text, output_path="./output_audio.wav"):
    tts.tts_to_file(text=text, file_path=output_path)
    return output_path

test_answer = "今天天气晴朗，气温 25-30℃，适合去公园散步、野餐或者骑行，注意做好防晒哦~"
audio_path = text_to_speech(test_answer)
print(f"语音生成完成：{audio_path}")

def multimodal_voice_assistant(input_type="speech", input_path=None, text_input=None):
    if input_type == "speech":
        if not input_path:
            raise ValueError("语音输入需提供文件路径")
        user_text = speech_to_text(input_path)
        print(f"用户语音转文字：{user_text}")
    elif input_type == "text":
        if not text_input:
            raise ValueError("文本输入需提供内容")
        user_text = text_input
    else:
        raise ValueError("输入类型仅支持 speech 或 text")
    
    answer_text = generate_answer(user_text)
    print(f"助手回答文本：{answer_text}")
    
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    audio_output_path = f"./assistant_audio/{timestamp}_output.wav"
    os.makedirs("./assistant_audio", exist_ok=True)
    text_to_speech(answer_text, output_path=audio_output_path)
    return answer_text, audio_output_path

test_speech_path = "./audios/chinese_speech.wav"
answer_text, audio_path = multimodal_voice_assistant(input_type="speech", input_path=test_speech_path)
print(f"最终结果：文本={answer_text}，语音路径={audio_path}")

import gradio as gr

def voice_assistant_interface(input_type, audio_file, text_input):
    try:
        if input_type == "语音输入":
            if audio_file is None:
                return "", None, "请录制或上传语音！"
            audio_path = "./temp_audio.wav"
            with open(audio_path, "wb") as f:
                f.write(audio_file)
            answer_text, audio_output = multimodal_voice_assistant(input_type="speech", input_path=audio_path)
        else:
            if not text_input:
                return "", None, "请输入文本！"
            answer_text, audio_output = multimodal_voice_assistant(input_type="text", text_input=text_input)
        return answer_text, audio_output, "处理成功！"
    except Exception as e:
        return "", None, f"处理失败：{str(e)}"

with gr.Blocks(title="多模态语音助手") as demo:
    gr.Markdown("# 多模态语音助手（支持语音/文本交互）")
    with gr.Row():
        with gr.Column(scale=1):
            input_type = gr.Radio(label="输入类型", choices=["语音输入", "文本输入"], value="语音输入")
            audio_file = gr.Audio(label="录制/上传语音", sources=["microphone", "upload"], type="filepath")
            text_input = gr.Textbox(label="文本输入", placeholder="请输入你的问题...", lines=3)
            submit_btn = gr.Button("提交请求")
            status = gr.Textbox(label="状态", interactive=False)
        with gr.Column(scale=1):
            answer_text = gr.Textbox(label="助手回答（文本）", interactive=False, lines=3)
            answer_audio = gr.Audio(label="助手回答（语音）", type="filepath")
    
    def toggle_input_visibility(input_type):
        if input_type == "语音输入":
            return gr.update(visible=True), gr.update(visible=False)
        else:
            return gr.update(visible=False), gr.update(visible=True)
    
    input_type.change(fn=toggle_input_visibility, inputs=input_type, outputs=[audio_file, text_input])
    submit_btn.click(fn=voice_assistant_interface, inputs=[input_type, audio_file, text_input], outputs=[answer_text, answer_audio, status])

demo.launch(server_name="0.0.0.0", server_port=7861, share=False)

{"image_path":"./medical_images/lung1.jpg","question":"这张肺部 CT 影像是否存在结节？","answer":"这张肺部 CT 影像存在结节，位于右肺上叶。"}
{"image_path":"./medical_images/heart1.jpg","question":"左心室射血分数是否正常？","answer":"左心室射血分数为 62%，处于正常范围。"}

from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import TrainingArguments
from trl import SFTTrainer
from bitsandbytes.config import BitsAndBytesConfig

model = LlavaForConditionalGeneration.from_pretrained(
    "liuhaotian/LLaVA-7B-v1.5", torch_dtype=torch.float16, load_in_4bit=True, device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.float16),
    trust_remote_code=True
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(r=8, lora_alpha=32, target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir="./llava-medical-qa-lora", per_device_train_batch_size=4, per_device_eval_batch_size=4,
    gradient_accumulation_steps=4, learning_rate=2e-4, num_train_epochs=5, logging_steps=10,
    evaluation_strategy="epoch", save_strategy="epoch", fp16=True, optim="paged_adamw_8bit",
    lr_scheduler_type="cosine", warmup_ratio=0.05, report_to="none"
)

trainer = SFTTrainer(
    model=model, args=training_args, train_dataset=processed_dataset["train"], eval_dataset=processed_dataset["validation"],
    peft_config=lora_config, tokenizer=processor.tokenizer, max_seq_length=512
)
trainer.train()
trainer.save_model("./llava-medical-qa-lora-final")
print("医疗影像问答模型微调完成")

from peft import PeftModel, PeftConfig

peft_config = PeftConfig.from_pretrained("./llava-medical-qa-lora-final")
base_model = LlavaForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path, torch_dtype=torch.float16, load_in_4bit=True, device_map="auto", trust_remote_code=True
)
fine_tuned_model = PeftModel.from_pretrained(base_model, "./llava-medical-qa-lora-final")

def medical_qa_infer(image_path, question):
    image = Image.open(image_path).convert("RGB")
    prompt = f"USER: {question} ASSISTANT:"
    inputs = processor(text=prompt, images=image, return_tensors="pt", padding=True, truncation=True).to(fine_tuned_model.device)
    with torch.no_grad():
        outputs = fine_tuned_model.generate(**inputs, max_new_tokens=200, temperature=0.3, top_p=0.9, pad_token_id=processor.tokenizer.eos_token_id)
    answer = processor.decode(outputs[0], skip_special_tokens=True)
    answer = answer.split("ASSISTANT:")[-1].strip()
    return answer

test_image_path = "./medical_images/lung2.jpg"
test_question = "这张肺部 CT 影像的结节大小和边界情况如何？"
fine_tuned_answer = medical_qa_infer(test_image_path, test_question)
print(f"微调后回答：{fine_tuned_answer}")

多模态模型开发实战：文本、图像与语音的融合应用

多模态模型开发实战：文本、图像与语音的融合应用

导读

一、多模态模型基础：概念、技术与生态

1.1 核心概念与关键术语

1.2 主流多模态模型架构

更多推荐文章

相关免费在线工具

1.3 多模态开发生态与工具链

二、多模态数据预处理：对齐与标准化

2.1 文本 - 图像数据预处理

1. 文本预处理

2. 图像预处理

3. 模态对齐

2.2 文本 - 语音数据预处理

1. 语音预处理与特征提取

2. 文本 - 语音时序对齐

三、多模态模型开发实战：三大典型场景落地

3.1 场景一：跨模态问答系统（文本 + 图像）

1. 模型选型与加载

2. 跨模态问答核心函数

3. Web 部署（FastAPI + 前端交互）

3.2 场景二：文生图生成系统（文本→图像）

1. 模型加载与配置

2. 文生图核心函数

3. 提示词优化工具

4. Web 部署（Gradio）

3.3 场景三：多模态语音助手（文本 + 语音）

1. ASR 语音转文字模块（基于 Whisper）

2. 文本理解与回答生成模块（基于 LLaMA 3）

3. TTS 文字转语音模块（基于 Coqui TTS）

4. 多模态语音助手整合

5. Web 部署（支持语音录制与播放）

四、多模态模型训练微调与优化

4.1 微调数据准备

4.2 LLaVA 模型微调（QLoRA 低资源微调）

4.3 微调后模型推理与效果验证

五、总结与建议

更多推荐文章

相关免费在线工具

多模态模型开发实战：文本、图像与语音的融合应用

多模态模型开发实战：文本、图像与语音的融合应用

导读

一、多模态模型基础：概念、技术与生态

1.1 核心概念与关键术语

1.2 主流多模态模型架构

微信扫一扫，关注极客日志

更多推荐文章

相关免费在线工具

1.3 多模态开发生态与工具链

二、多模态数据预处理：对齐与标准化

2.1 文本 - 图像数据预处理

1. 文本预处理

2. 图像预处理

3. 模态对齐

2.2 文本 - 语音数据预处理

1. 语音预处理与特征提取

2. 文本 - 语音时序对齐

三、多模态模型开发实战：三大典型场景落地

3.1 场景一：跨模态问答系统（文本 + 图像）

1. 模型选型与加载

2. 跨模态问答核心函数

3. Web 部署（FastAPI + 前端交互）

3.2 场景二：文生图生成系统（文本→图像）

1. 模型加载与配置

2. 文生图核心函数

3. 提示词优化工具

4. Web 部署（Gradio）

3.3 场景三：多模态语音助手（文本 + 语音）

1. ASR 语音转文字模块（基于 Whisper）

2. 文本理解与回答生成模块（基于 LLaMA 3）

3. TTS 文字转语音模块（基于 Coqui TTS）

4. 多模态语音助手整合

5. Web 部署（支持语音录制与播放）

四、多模态模型训练微调与优化

4.1 微调数据准备

4.2 LLaVA 模型微调（QLoRA 低资源微调）

4.3 微调后模型推理与效果验证

五、总结与建议

微信扫一扫，关注极客日志

更多推荐文章

相关免费在线工具