Multimodal Model Development in Practice: Fusing Text, Image, and Speech | 极客日志


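Summary (AI-generated): Multimodal model development involves the fused processing of text, image, and speech data. This article walks through the full workflow, from data preprocessing and model selection (e.g. LLaVA, Stable Diffusion, Whisper) to fine-tuning (QLoRA) and deployment. It covers three scenarios: cross-modal question answering, text-to-image generation, and a voice assistant, with hands-on code based on PyTorch and Hugging Face. The focus is on GPU memory optimization, modality alignment, and prompt engineering, to help developers build efficient multimodal AI applications.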

Author: 鲜活 | Published 2026/3/14 | Updated 2026/6/1119 views


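Core Goals and Key Points

Understand the core concepts and technical principles of multimodal models, including how text, image, and speech data are fused; use the mainstream multimodal frameworks (Hugging Face Transformers, MMEngine, LangChain Multimodal) to build cross-modal understanding and generation tasks; and master the development workflow: data preprocessing, model selection, training and fine-tuning, and deployment.

Key concerns: alignment and preprocessing of multimodal data, GPU memory optimization during training, consistency and accuracy of generated content, and performance tuning for different deployment scenarios.

Basic Concepts and Terminology

As AI has advanced, single-modality models can no longer meet the needs of complex scenarios. Multimodal models fuse text, image, speech, video, and other modalities to understand inputs more completely and generate outputs more flexibly, and they have become a core research direction in AI.

Modalities and Task Categories

A modality is the form in which data exists and the source it comes from. Common types are text, vision (image/video), speech, and other sensor data. The core tasks of multimodal models fall into two groups, cross-modal understanding and cross-modal generation:

Task type	Core goal	Typical scenarios
Cross-modal understanding	Jointly analyze data from several modalities and output structured results	Image-text retrieval, cross-modal QA, image captioning, speech-to-text
Cross-modal generation	Generate output in one modality from input in one or more modalities	Text-to-image, TTS, image-to-text generation, multimodal dialogue

Key technical terms include modality alignment (mapping different modalities into a unified feature space), feature fusion (combining features from different modalities), cross-modal attention (letting one modality attend to the key information in another), and self-supervised pretraining.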
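To make cross-modal attention concrete, here is a minimal, dependency-free sketch in which each text token (the query) attends over image patch features (the keys and values). The function name `cross_modal_attention`, the toy vectors, and the feature size are illustrative assumptions rather than anything taken from a specific model; real implementations add learned projection matrices and multiple heads.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(text_queries, image_keys, image_values):
    """Scaled dot-product attention: text tokens attend over image patches.

    Returns (outputs, weights); weights[i][j] is how strongly
    text token i attends to image patch j.
    """
    dim = len(text_queries[0])
    outputs, weights = [], []
    for query in text_queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in image_keys
        ]
        attn = softmax(scores)
        fused = [
            sum(attn[j] * image_values[j][c] for j in range(len(image_values)))
            for c in range(len(image_values[0]))
        ]
        outputs.append(fused)
        weights.append(attn)
    return outputs, weights

# Toy data (made up): 2 text tokens, 3 image patches, feature size 2
text_q = [[1.0, 0.0], [0.0, 1.0]]
patch_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
patch_v = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
fused, attn = cross_modal_attention(text_q, patch_k, patch_v)
print(attn)
```

Each row of `attn` sums to 1, and each text token puts its largest weight on the patch whose key points in the same direction.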

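Mainstream Architecture Choices

The multimodal models used in industry and academia today are mostly derived from the Transformer architecture. There are three core families:

Unified-encoder architectures, e.g. CLIP and ALBEF. Every modality is encoded into feature sequences of a common dimension and modeled in a shared feature space (CLIP pairs a separate text encoder and image encoder trained with a contrastive objective, while ALBEF adds a Transformer fusion encoder on top). This suits understanding tasks, but generation ability is weak.
Encoder-decoder architectures, e.g. Stable Diffusion and Whisper. An encoder processes the input and a decoder generates the target modality (in Stable Diffusion, a text encoder conditions a latent denoising network, and a VAE decoder turns the result into an image). Generation ability is strong, but resource consumption is higher.
Hybrid architectures, e.g. GPT-4o and LLaVA. They combine the strengths of both, support understanding and generation, and suit complex dialogue scenarios, but are harder to deploy.

Selection advice: for understanding tasks, start with CLIP-type models; for generation tasks, start with Stable Diffusion and similar models; for complex multimodal dialogue, start with a hybrid architecture.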
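For understanding tasks such as image-text retrieval, CLIP-type models score candidates by cosine similarity in the shared feature space and turn the scores into probabilities with a softmax. The sketch below shows that scoring step only. The embeddings are made-up three-dimensional vectors, `rank_captions` is a hypothetical helper, and `logit_scale=100.0` is an assumed value standing in for the learned temperature.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def rank_captions(image_emb, caption_embs, logit_scale=100.0):
    """Return softmax probabilities over captions for one image embedding."""
    img = l2_normalize(image_emb)
    logits = []
    for cap in caption_embs:
        cap = l2_normalize(cap)
        cosine = sum(a * b for a, b in zip(img, cap))
        logits.append(logit_scale * cosine)
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up embeddings; a real system would take them from the two encoders
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a car": [0.1, 0.9, 0.3],
}
probs = rank_captions(image_embedding, list(caption_embeddings.values()))
best = max(zip(caption_embeddings, probs), key=lambda pair: pair[1])[0]
print(best, probs)
```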

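Data Preprocessing: Alignment and Standardization

Because multimodal data is heterogeneous, preprocessing is much harder than for a single modality. The core goals are data standardization and modality alignment.

Text-Image Data Preprocessing

This is the most common combination and applies to image-text retrieval, cross-modal QA, and similar tasks. The workflow consists of text preprocessing, image preprocessing, and modality alignment.

Text Preprocessing

The goal is to turn natural language into tensors the model can consume. The core steps are cleaning, tokenization, length truncation and padding, and feature augmentation.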
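The cleaning, truncation, and padding steps can be illustrated without any library. This is a toy sketch under stated assumptions: the whitespace tokenizer, the vocabulary `toy_vocab`, and the ids `pad_id=0` and `unk_id=1` are invented for illustration, whereas a real pipeline would use the model's own tokenizer.

```python
def pad_and_truncate(token_ids, max_len, pad_id=0):
    """Cut a sequence to max_len, or right-pad it; return (ids, attention_mask)."""
    ids = token_ids[:max_len]
    mask = [1] * len(ids)
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

def encode(text, vocab, max_len, unk_id=1):
    """Clean (lowercase, collapse whitespace), tokenize on spaces, then pad/truncate."""
    cleaned = " ".join(text.lower().split())
    ids = [vocab.get(token, unk_id) for token in cleaned.split(" ")]
    return pad_and_truncate(ids, max_len)

toy_vocab = {"a": 2, "photo": 3, "of": 4, "cat": 5}
ids, mask = encode("  A photo   of a CAT ", toy_vocab, max_len=8)
short_ids, short_mask = encode("a photo of a cat", toy_vocab, max_len=3)
print(ids, mask)
print(short_ids, short_mask)
```

The attention mask marks real tokens with 1 and padding with 0, so the model can ignore the padded positions.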

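from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def preprocess_text(texts, max_seq_len=77):
    """
    Text preprocessing: tokenization + truncation/padding.
    :param texts: list of strings
    :param max_seq_len: maximum sequence length
    :return: dict with input_ids and attention_mask tensors
    """
    # The argument values were lost in the source; the ones below are assumed,
    # chosen so that every sequence comes back with the same fixed length.
    inputs = tokenizer(
        texts,
        padding="max_length",
        truncation=True,
        max_length=max_seq_len,
        return_tensors="pt"
    )
    return {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]}

# Placeholder captions (the original example strings were lost in the source)
test_texts = ["a photo of a cat", "a photo of a car"]
text_features = preprocess_text(test_texts)
print(f"input_ids shape after preprocessing: {text_features['input_ids'].shape}")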


Image Preprocessing

from transformers import CLIPImageProcessor
from PIL import Image

image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

def preprocess_image(image_paths, target_size=(224, 224)):
    """
    Image preprocessing: load + resize + center crop + normalize + channel-first conversion.
    :param image_paths: list of image paths
    :param target_size: target (height, width)
    :return: preprocessed image tensor
    """
    images = []
    for path in image_paths:
        img = Image.open(path).convert("RGB")
        images.append(img)

    # CLIPImageProcessor takes `size` / `crop_size` dicts and
    # `image_mean` / `image_std` lists for normalization.
    inputs = image_processor(
        images=images,
        size={"shortest_edge": min(target_size)},
        crop_size={"height": target_size[0], "width": target_size[1]},
        image_mean=[0.48145466, 0.4578275, 0.40821073],
        image_std=[0.26862954, 0.26130258, 0.27577711],
        return_tensors="pt"
    )
    return inputs["pixel_values"]

test_image_paths = ["./images/cat.jpg", "./images/car.jpg"]
image_features = preprocess_image(test_image_paths)
print(f"Image tensor shape after preprocessing: {image_features.shape}")

Modality Alignment

import json
from datasets import Dataset
import os

def load_image_text_dataset(data_path, image_dir):
    """
    Load and filter a paired text-image dataset.
    :param data_path: path to the dataset JSONL file
    :param image_dir: path to the image folder
    :return: (train, test) Hugging Face Datasets
    """
    samples = []
    with open(data_path, "r", encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            image_path = os.path.join(image_dir, sample["image_filename"])
            # Drop pairs whose image is missing or whose text is too short
            if not os.path.exists(image_path):
                continue
            if len(sample["text"].strip()) < 5:
                continue
            sample["image_path"] = image_path
            samples.append(sample)

    dataset = Dataset.from_list(samples)
    dataset = dataset.train_test_split(test_size=0.1, seed=42)
    return dataset["train"], dataset["test"]

train_dataset, val_dataset = load_image_text_dataset("image_text_pairs.jsonl", "./images")
print(f"Training samples: {len(train_dataset)}, validation samples: {len(val_dataset)}")

Text-Speech Data Preprocessing

Speech Feature Extraction

import librosa
import numpy as np
import torch

def preprocess_audio(audio_paths, sample_rate=16000, n_mels=80, max_length=3000):
    """
    Speech preprocessing: load + resample + silence trimming + log-mel spectrogram extraction.
    :param audio_paths: list of audio file paths
    :param sample_rate: target sample rate
    :param n_mels: number of mel bands
    :param max_length: maximum number of frames
    :return: log-mel feature tensor of shape (batch, n_mels, max_length)
    """
    features = []
    for path in audio_paths:
        y, sr = librosa.load(path, sr=sample_rate)  # loads and resamples
        y, _ = librosa.effects.trim(y, top_db=20)   # trims leading/trailing silence
        mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, fmax=8000)
        log_mel_spec = librosa.power_to_db(mel_spec, ref=np.max)

        seq_len = log_mel_spec.shape[1]
        if seq_len > max_length:
            log_mel_spec = log_mel_spec[:, :max_length]
        else:
            pad_len = max_length - seq_len
            # With ref=np.max, 0 dB is the loudest bin, so pad with the quietest value, not 0
            log_mel_spec = np.pad(log_mel_spec, ((0, 0), (0, pad_len)), mode="constant",
                                  constant_values=log_mel_spec.min())

        features.append(torch.tensor(log_mel_spec, dtype=torch.float32))

    return torch.stack(features)

test_audio_paths = ["./audios/speech1.wav", "./audios/speech2.mp3"]
audio_features = preprocess_audio(test_audio_paths)
print(f"Log-mel feature shape: {audio_features.shape}")
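How much audio does `max_length=3000` actually cover? With a centered short-time Fourier transform, which is librosa's default, the number of frames is `1 + n_samples // hop_length`, and librosa's default `hop_length` is 512. The small sketch below does that arithmetic; the helper names are hypothetical and the 10-second clip is an example value.

```python
def mel_frame_count(duration_s, sample_rate=16000, hop_length=512):
    """Frames produced by a centered STFT: 1 + n_samples // hop_length."""
    n_samples = int(duration_s * sample_rate)
    return 1 + n_samples // hop_length

def max_duration_s(max_frames, sample_rate=16000, hop_length=512):
    """Approximate audio duration covered by max_frames spectrogram frames."""
    return max_frames * hop_length / sample_rate

frames_10s = mel_frame_count(10.0)
window_default = max_duration_s(3000)
window_hop160 = max_duration_s(3000, hop_length=160)
print(frames_10s, window_default, window_hop160)
```

At 16 kHz with the default hop length, 3000 frames span about 96 seconds, so most short utterances are mostly padding. A hop length of 160 would make the same 3000 frames cover 30 seconds.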

Temporal Alignment

import re
import librosa
import torch
import torchaudio.functional as AF
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

def align_text_audio(text, audio_path, sample_rate=16000):
    """
    Text-speech temporal alignment: get the time span of each character token.
    A sketch that assumes torchaudio >= 2.1 (forced_align / merge_tokens).
    :param text: transcript
    :param audio_path: audio file path
    :return: list of {"token", "start_time", "end_time"} dicts
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    align_model_name = "facebook/wav2vec2-base-960h"
    align_processor = Wav2Vec2Processor.from_pretrained(align_model_name)
    align_model = Wav2Vec2ForCTC.from_pretrained(align_model_name).to(device).eval()
    tokenizer = align_processor.tokenizer

    y, sr = librosa.load(audio_path, sr=sample_rate)
    # This checkpoint's vocabulary is uppercase letters, apostrophe and "|" (word boundary)
    text = re.sub(r"[^A-Z' ]", "", text.upper())

    inputs = align_processor(y, sampling_rate=sr, return_tensors="pt").to(device)
    targets = tokenizer(text, return_tensors="pt").input_ids.to(device)

    with torch.no_grad():
        log_probs = torch.log_softmax(align_model(**inputs).logits, dim=-1)
        # The <pad> token doubles as the CTC blank for this model
        frame_tokens, frame_scores = AF.forced_align(log_probs, targets, blank=tokenizer.pad_token_id)

    spans = AF.merge_tokens(frame_tokens[0], frame_scores[0].exp(), blank=tokenizer.pad_token_id)
    # One output frame covers the product of all conv strides in input samples
    seconds_per_frame = align_model.config.inputs_to_logits_ratio / sr

    token_times = []
    for span in spans:
        token = tokenizer.convert_ids_to_tokens(span.token)
        if token not in ["<pad>", "<s>", "</s>", "|"]:
            token_times.append({
                "token": token,
                "start_time": round(span.start * seconds_per_frame, 3),
                "end_time": round(span.end * seconds_per_frame, 3),
            })

    return token_times

test_text = "Hello, this is a speech recognition test."
test_audio = "./audios/english_speech.wav"
alignment_result = align_text_audio(test_text, test_audio)
print("Text-speech alignment result:")
for item in alignment_result:
    print(f"Token: {item['token']}, time: {item['start_time']:.3f} - {item['end_time']:.3f}s")
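Converting an alignment frame index into seconds depends on how much the acoustic model downsamples the waveform. For the wav2vec2-base feature encoder, the total downsampling factor is the product of all convolution strides. The sketch below hard-codes those strides as `CONV_STRIDE` so that it runs without loading the model; `frame_to_seconds` is a hypothetical helper.

```python
from functools import reduce
from operator import mul

# Strides of the seven conv layers in the wav2vec2-base feature encoder
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)

def samples_per_frame(conv_stride=CONV_STRIDE):
    """Total downsampling factor: the product of every layer's stride."""
    return reduce(mul, conv_stride, 1)

def frame_to_seconds(frame_index, sample_rate=16000, conv_stride=CONV_STRIDE):
    """Map an output-frame index back to a time offset in the waveform."""
    return frame_index * samples_per_frame(conv_stride) / sample_rate

hop = samples_per_frame()
one_second = frame_to_seconds(50)
print(hop, one_second)
```

Each output frame therefore advances 320 samples, or 20 ms at 16 kHz, so frame 50 starts at the one-second mark.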

Typical Scenarios in Practice

Scenario 1: Cross-Modal Question Answering System

Model Loading and Inference

from PIL import Image
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration, LlavaProcessor
import torch

# These classes load Transformers-format checkpoints; "llava-hf/llava-1.5-7b-hf"
# is the converted LLaVA-1.5 7B release.
model_name = "llava-hf/llava-1.5-7b-hf"
processor = LlavaProcessor.from_pretrained(model_name)
model = LlavaForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
print(f"Model loaded, GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

def multimodal_qa(image_path, question, max_new_tokens=200, temperature=0.3):
    """
    Cross-modal QA: take an image and a question, generate an answer.
    """
    image = Image.open(image_path).convert("RGB")
    # LLaVA-1.5 expects an <image> placeholder inside a USER/ASSISTANT prompt
    prompt = f"USER: <image>\n{question} ASSISTANT:"
    inputs = processor(
        text=prompt,
        images=image,
        return_tensors="pt"
    ).to(model.device, torch.float16)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=0.9,
            do_sample=True,
            pad_token_id=processor.tokenizer.eos_token_id
        )

    answer = processor.decode(outputs[0], skip_special_tokens=True)
    answer = answer.split("ASSISTANT:")[-1].strip()
    return answer

test_image_path = "./images/phone_screenshot.jpg"
test_question = "What phone model does this screenshot show, and which OS version is it running?"
answer = multimodal_qa(test_image_path, test_question)
print(f"Question: {test_question}")
print(f"Answer: {answer}")
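The prompt format and the answer extraction can be checked in isolation, without loading the model. This sketch assumes the LLaVA-1.5 conversation template `USER: <image>\n... ASSISTANT:`; the helper names and the sample decoded string are made up for illustration.

```python
def build_llava_prompt(question):
    """Wrap a question in the USER/ASSISTANT template with an <image> placeholder."""
    return f"USER: <image>\n{question.strip()} ASSISTANT:"

def extract_answer(decoded_text):
    """Keep only the text after the last 'ASSISTANT:' marker."""
    return decoded_text.split("ASSISTANT:")[-1].strip()

prompt = build_llava_prompt("  What is shown in this image? ")
# Example of decoded output, where the special <image> token has been stripped
decoded = "USER: \nWhat is shown in this image? ASSISTANT: A phone home screen."
answer_only = extract_answer(decoded)
print(repr(prompt))
print(answer_only)
```

Splitting on the last marker means the echoed prompt is discarded even when the question itself contains the word "ASSISTANT:".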

Web Deployment

from fastapi import FastAPI, UploadFile, File, Form
from fastapi.responses import JSONResponse, HTMLResponse
from fastapi.staticfiles import StaticFiles
import uvicorn
import os
from datetime import datetime

app = FastAPI(title="Cross-Modal QA System", version="1.0")
UPLOAD_DIR = "./uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)
app.mount("/uploads", StaticFiles(directory=UPLOAD_DIR), name="uploads")

@app.get("/", response_class=HTMLResponse)
async def index():
    html_content = """
    <!DOCTYPE html>
    <html>
    <head><title>Cross-Modal QA System</title></head>
    <body>
        <h1>Cross-Modal QA System (image + text)</h1>
        <div>
            <input type="file" id="imageUpload" accept="image/*"><br>
            <input type="text" id="question" placeholder="Enter your question"><br><br>
            <button onclick="submitQuery()">Submit</button>
            <div id="result"></div>
        </div>
        <script>
            async function submitQuery() {
                const fileInput = document.getElementById("imageUpload");
                const questionInput = document.getElementById("question");
                const resultDiv = document.getElementById("result");
                if (!fileInput.files[0] || !questionInput.value) {
                    resultDiv.innerHTML = "Please upload an image and enter a question!";
                    return;
                }
                const formData = new FormData();
                formData.append("image", fileInput.files[0]);
                formData.append("question", questionInput.value);
                try {
                    const response = await fetch("/qa", { method: "POST", body: formData });
                    const data = await response.json();
                    if (data.status === "success") {
                        resultDiv.innerHTML = `<strong>Question: </strong>${data.question}<br><strong>Answer: </strong>${data.answer}`;
                    } else {
                        resultDiv.innerHTML = `<strong>Error: </strong>${data.message}`;
                    }
                } catch (error) {
                    resultDiv.innerHTML = "Request failed, please try again!";
                }
            }
        </script>
    </body>
    </html>
    """
    return HTMLResponse(content=html_content)

# The page posts multipart form data, so `question` is read with Form, not Query
@app.post("/qa", summary="Cross-modal QA")
async def qa_endpoint(image: UploadFile = File(...), question: str = Form(...)):
    try:
        timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
        # basename() drops any directory components from the client-supplied name
        image_filename = f"{timestamp}_{os.path.basename(image.filename)}"
        image_path = os.path.join(UPLOAD_DIR, image_filename)
        with open(image_path, "wb") as f:
            f.write(await image.read())
        answer = multimodal_qa(image_path, question)
        return JSONResponse(content={"status": "success", "question": question, "answer": answer, "image_url": f"/uploads/{image_filename}"})
    except Exception as e:
        return JSONResponse(content={"status": "error", "message": str(e)}, status_code=500)

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000, workers=1)
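Because the upload endpoint writes client-supplied filenames to disk, it is worth sanitizing them first. The following is a minimal sketch of one way to do that with the standard library; `safe_upload_name`, the extension allow-list, and the 50-character limit are assumptions for illustration, not part of the service above.

```python
import os
import re
import uuid

def safe_upload_name(original_name, allowed_exts=(".jpg", ".jpeg", ".png", ".webp")):
    """Build a collision-resistant filename with no path components.

    Raises ValueError when the extension is not in the allow-list.
    """
    base = os.path.basename(original_name.replace("\\", "/"))
    stem, ext = os.path.splitext(base)
    ext = ext.lower()
    if ext not in allowed_exts:
        raise ValueError(f"unsupported file type: {ext or 'none'}")
    # Replace anything outside a conservative character set
    stem = re.sub(r"[^A-Za-z0-9_-]", "_", stem)[:50] or "image"
    return f"{uuid.uuid4().hex}_{stem}{ext}"

name = safe_upload_name("../../etc/pass wd.PNG")
print(name)
```

A random prefix also avoids collisions when two uploads share the same name and arrive within the same second.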

Scenario 2: Text-to-Image Generation System

Model Configuration and Generation

import os
from datetime import datetime

import torch
from diffusers import StableDiffusionPipeline  # the pipeline lives in diffusers, not transformers

model_name = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    use_safetensors=True
).to("cuda")
pipe.safety_checker = None
pipe.requires_safety_checker = False
pipe.enable_attention_slicing()
try:
    pipe.enable_xformers_memory_efficient_attention()  # needs the optional xformers package
except Exception:
    print("xformers is not available, continuing without it")
print("Stable Diffusion model loaded")

def text_to_image(prompt, negative_prompt="low quality, blurry, ugly, deformed, watermark", image_size=(512, 512), num_inference_steps=50, guidance_scale=7.5, num_images=1, output_dir="./generated_images"):
    """Generate images; image_size is (height, width), both multiples of 8."""
    os.makedirs(output_dir, exist_ok=True)
    with torch.no_grad():
        images = pipe(
            prompt=prompt,
            negative_prompt=negative_prompt,
            height=image_size[0],
            width=image_size[1],
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            num_images_per_prompt=num_images
        ).images

    save_paths = []
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    for i, img in enumerate(images):
        img_filename = f"gen_{timestamp}_{i+1}.png"
        img_path = os.path.join(output_dir, img_filename)
        img.save(img_path)
        save_paths.append(img_path)
    return images, save_paths

# Prompts are written in English because the SD 1.5 text encoder was trained on English captions
test_prompt = "a field full of sunflowers, blue sky and white clouds in the background, oil painting style, high resolution, rich details"
test_negative_prompt = "low quality, blurry, deformed, watermark, text, dark"
generated_images, save_paths = text_to_image(
    prompt=test_prompt,
    negative_prompt=test_negative_prompt,
    image_size=(768, 512),
    num_inference_steps=75,
    guidance_scale=8.0,
    num_images=2
)
print(f"Image generation finished, saved to: {save_paths}")
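Stable Diffusion pipelines require the height and width to be divisible by 8, since the latent space is downsampled by that factor. A small guard like the one below can normalize user-supplied sizes before they reach the pipeline. `snap_size` and the `max_side=1024` cap are assumptions for this sketch.

```python
def snap_size(height, width, multiple=8, max_side=1024):
    """Clamp each side to [multiple, max_side] and round down to a multiple of `multiple`."""
    def snap(value):
        value = min(max(value, multiple), max_side)
        return (value // multiple) * multiple
    return snap(height), snap(width)

requested = snap_size(770, 515)
extreme = snap_size(2000, 4)
print(requested, extreme)
```

Rounding down rather than up keeps memory usage at or below what the caller asked for.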

Prompt Optimization and Gradio Deployment

import gradio as gr

def optimize_prompt(raw_prompt, style="photorealistic", quality="high resolution"):
    style_templates = {
        "photorealistic": "photorealistic, ultra detailed, 8k, sharp focus, realistic lighting, cinematic",
        "oil painting": "oil painting style, thick brush strokes, vibrant colors, artistic, painterly",
        "cartoon": "cartoon style, flat colors, clean lines, anime influence, cute",
        "watercolor": "watercolor painting, soft colors, translucent, gentle brush strokes"
    }
    quality_desc = "high resolution, ultra detailed, sharp, no blur, no noise"
    # Styles without a template (e.g. "sketch") fall back to the style name itself
    optimized = f"{raw_prompt}, {style_templates.get(style, style)}, {quality_desc}, {quality}"
    return optimized

def generate_image_interface(prompt, style, image_size, num_images):
    optimized_prompt = optimize_prompt(prompt, style=style)
    negative_prompt = "low quality, blurry, ugly, deformed, watermark, text, noise"
    # The dropdown delivers a "HEIGHTxWIDTH" string; Gradio reads tuple choices as (label, value) pairs
    height, width = (int(v) for v in image_size.split("x"))
    images, _ = text_to_image(
        prompt=optimized_prompt,
        negative_prompt=negative_prompt,
        image_size=(height, width),
        num_images=int(num_images),
        num_inference_steps=60,
        guidance_scale=7.5
    )
    return images

with gr.Blocks(title="Text-to-Image System") as demo:
    gr.Markdown("# Text-to-Image System (Stable Diffusion)")
    with gr.Row():
        with gr.Column(scale=1):
            prompt = gr.Textbox(label="Text description", placeholder="Describe the image...", lines=3)
            style = gr.Dropdown(label="Image style", choices=["photorealistic", "oil painting", "cartoon", "watercolor", "sketch"], value="photorealistic")
            image_size = gr.Dropdown(label="Image size (height x width)", choices=["512x512", "768x512", "1024x768"], value="512x512")
            num_images = gr.Slider(label="Number of images", minimum=1, maximum=4, value=1, step=1)
            generate_btn = gr.Button("Generate")
        with gr.Column(scale=2):
            output_images = gr.Gallery(label="Results", columns=2, height="auto")

    generate_btn.click(
        fn=generate_image_interface,
        inputs=[prompt, style, image_size, num_images],
        outputs=output_images
    )

demo.launch(server_name="0.0.0.0", server_port=7860, share=False)

Scenario 3: Multimodal Voice Assistant

ASR and TTS Modules

import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

asr_model_name = "openai/whisper-small"
asr_processor = WhisperProcessor.from_pretrained(asr_model_name)
asr_model = WhisperForConditionalGeneration.from_pretrained(
    asr_model_name, device_map="auto", torch_dtype=torch.float16
)

def speech_to_text(audio_path, sample_rate=16000):
    audio, sr = librosa.load(audio_path, sr=sample_rate)
    # The processor pads or truncates to Whisper's fixed 30 s window by default;
    # the features are cast to fp16 to match the model weights.
    inputs = asr_processor(audio, sampling_rate=sr, return_tensors="pt").to(asr_model.device, torch.float16)
    with torch.no_grad():
        outputs = asr_model.generate(**inputs, language="zh", task="transcribe", max_new_tokens=200)
    text = asr_processor.decode(outputs[0], skip_special_tokens=True)
    return text

test_audio_path = "./audios/chinese_speech.wav"
text = speech_to_text(test_audio_path)
print(f"Speech-to-text result: {text}")
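Whisper processes audio in fixed 30-second windows, so a recording longer than that needs to be split before transcription, otherwise everything past the first window is dropped. The sketch below computes overlapping chunk boundaries in samples. `chunk_bounds` and the one-second overlap are assumptions; merging the per-chunk transcripts is left out.

```python
def chunk_bounds(n_samples, sample_rate=16000, window_s=30.0, overlap_s=1.0):
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    bounds, start = [], 0
    while start < n_samples:
        bounds.append((start, min(start + window, n_samples)))
        if start + window >= n_samples:
            break  # this window already reaches the end of the audio
        start += step
    return bounds

# A 70-second recording at 16 kHz
bounds = chunk_bounds(70 * 16000)
print(bounds)
```

Each slice `audio[start:end]` can then be passed to the transcription function in turn. The overlap reduces the chance of cutting a word exactly at a boundary.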

from TTS.api import TTS
import gradio as gr

# Coqui's released zh-CN Baker voice is listed as tacotron2-DDC-GST;
# check `tts --list_models` for the names available in your version.
tts_model_name = "tts_models/zh-CN/baker/tacotron2-DDC-GST"
tts = TTS(tts_model_name, gpu=True)

def text_to_speech(text, output_path="./output_audio.wav"):
    tts.tts_to_file(text=text, file_path=output_path)
    return output_path

Integration and Deployment

def voice_assistant_interface(input_type, audio_file, text_input):
    try:
        if input_type == "Voice input":
            if audio_file is None:
                return "", None, "Please record or upload audio!"
            # gr.Audio(type="filepath") already passes a path to the recording
            user_text = speech_to_text(audio_file)
        else:
            if not text_input:
                return "", None, "Please enter some text!"
            user_text = text_input
        # Placeholder reply; swap in a real generate_answer(user_text) here.
        # The prefix stays in Chinese because the TTS model above is a zh-CN voice.
        answer_text = "模拟回答：" + user_text
        audio_output = text_to_speech(answer_text)
        return answer_text, audio_output, "Done!"
    except Exception as e:
        return "", None, f"Processing failed: {str(e)}"

with gr.Blocks(title="Multimodal Voice Assistant") as demo:
    gr.Markdown("# Multimodal Voice Assistant (voice and text input)")
    with gr.Row():
        with gr.Column(scale=1):
            input_type = gr.Radio(label="Input type", choices=["Voice input", "Text input"], value="Voice input")
            audio_file = gr.Audio(label="Record/upload audio", sources=["microphone", "upload"], type="filepath")
            text_input = gr.Textbox(label="Text input", placeholder="Enter your question...", lines=3)
            submit_btn = gr.Button("Submit")
            status = gr.Textbox(label="Status", interactive=False)
        with gr.Column(scale=1):
            answer_text = gr.Textbox(label="Assistant reply (text)", interactive=False, lines=3)
            answer_audio = gr.Audio(label="Assistant reply (audio)", type="filepath")

    submit_btn.click(
        fn=voice_assistant_interface,
        inputs=[input_type, audio_file, text_input],
        outputs=[answer_text, answer_audio, status]
    )

demo.launch(server_name="0.0.0.0", server_port=7861, share=False)

Model Fine-Tuning and Optimization

Fine-Tuning Data Preparation

from datasets import Dataset
import json
import os
from PIL import Image
from transformers import LlavaProcessor

def load_medical_qa_dataset(data_path):
    """Load image/question/answer records from JSONL, skipping missing images."""
    samples = []
    with open(data_path, "r", encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            if not os.path.exists(sample["image_path"]):
                continue
            samples.append(sample)
    dataset = Dataset.from_list(samples)
    # Returns a DatasetDict with "train" and "test" splits
    dataset = dataset.train_test_split(test_size=0.1, seed=42)
    return dataset

def preprocess_medical_qa(examples, processor, max_length=1024):
    images = [Image.open(path).convert("RGB") for path in examples["image_path"]]
    prompts = [f"USER: <image>\n{q} ASSISTANT:" for q in examples["question"]]
    texts = [f"{p} {a}" for p, a in zip(prompts, examples["answer"])]
    # max_length is an assumed budget: recent Transformers releases expand <image>
    # into several hundred placeholder tokens, which a 512-token limit would truncate.
    inputs = processor(
        text=texts,
        images=images,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=max_length
    )
    labels = inputs["input_ids"].clone()
    labels[inputs["attention_mask"] == 0] = -100  # ignore padding in the loss
    for i, (prompt, image) in enumerate(zip(prompts, images)):
        # Tokenize the prompt alone (with its image) to find where the answer starts
        prompt_len = processor(text=prompt, images=image, return_tensors="pt")["input_ids"].shape[1]
        labels[i, :prompt_len] = -100
    inputs["labels"] = labels
    return dict(inputs)

dataset = load_medical_qa_dataset("medical_qa_pairs.jsonl")
processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor.tokenizer.padding_side = "right"  # the prefix masking above assumes right padding
processed_dataset = dataset.map(lambda x: preprocess_medical_qa(x, processor), batched=True, batch_size=8, remove_columns=dataset["train"].column_names)
print("Data preprocessing finished")
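The label-masking idea is easier to verify on plain lists. The loss should only be computed on answer tokens, so prompt positions and padding positions are set to -100, the default `ignore_index` of PyTorch's cross-entropy loss. The token ids below are invented, and `mask_labels` is a hypothetical helper that assumes right padding.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def mask_labels(input_ids, attention_mask, prompt_len):
    """Copy input_ids to labels, masking the prompt prefix and any padding."""
    labels = []
    for pos, (token_id, is_real) in enumerate(zip(input_ids, attention_mask)):
        if pos < prompt_len or is_real == 0:
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token_id)
    return labels

# Made-up example: 3 prompt tokens, 2 answer tokens, 2 padding tokens
input_ids = [11, 12, 13, 21, 22, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 0, 0]
labels = mask_labels(input_ids, attention_mask, prompt_len=3)
print(labels)
```

Only the two answer tokens keep their ids, so only they contribute to the training loss.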

QLoRA Fine-Tuning

import torch
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import (
    BitsAndBytesConfig,
    LlavaForConditionalGeneration,
    Trainer,
    TrainingArguments,
    default_data_collator,
)

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
    # 4-bit loading is configured only here; also passing load_in_4bit=True conflicts with it
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16
    )
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir="./llava-medical-qa-lora",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=5,
    logging_steps=10,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    fp16=True,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    remove_unused_columns=False,  # keep pixel_values in the batches
    report_to="none"
)

# The dataset is already tokenized and the model is already wrapped with LoRA,
# so this listing uses the plain Trainer in place of SFTTrainer.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset["train"],
    eval_dataset=processed_dataset["test"],  # train_test_split names the held-out split "test"
    data_collator=default_data_collator
)

trainer.train()
trainer.save_model("./llava-medical-qa-lora-final")
print("Medical imaging QA model fine-tuning finished")
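A quick back-of-the-envelope calculation shows why LoRA with `r=8` is cheap. Each adapted linear layer of shape (in, out) gains `r * (in + out)` trainable parameters. The sketch below counts only the language-model attention projections, assuming the 7B LLaMA-style backbone (hidden size 4096, 32 layers). Any vision-tower modules whose names also match `target_modules` would add to the total, so the figure is a lower bound.

```python
def lora_param_count(r, in_features, out_features):
    """LoRA adds A (r x in) and B (out x r) to one linear layer."""
    return r * (in_features + out_features)

HIDDEN = 4096         # assumed hidden size of the 7B language model
NUM_LAYERS = 32       # assumed number of decoder layers
TARGET_MODULES = 4    # q_proj, k_proj, v_proj, o_proj

per_module = lora_param_count(8, HIDDEN, HIDDEN)
total_lora = NUM_LAYERS * TARGET_MODULES * per_module

# Effective batch size per device = per-device batch size x accumulation steps
effective_batch = 4 * 4
print(per_module, total_lora, effective_batch)
```

That is roughly 8.4 million trainable parameters against about 7 billion frozen ones, a little over 0.1 percent, with each optimizer step seeing 16 samples per device.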

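Evaluation

import torch
from peft import PeftModel
from PIL import Image
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration, LlavaProcessor

base_model_name = "llava-hf/llava-1.5-7b-hf"
processor = LlavaProcessor.from_pretrained(base_model_name)
base_model = LlavaForConditionalGeneration.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
)
# Attach the LoRA adapter written by trainer.save_model(...)
fine_tuned_model = PeftModel.from_pretrained(base_model, "./llava-medical-qa-lora-final")
fine_tuned_model.eval()

def medical_qa_infer(image_path, question):
    image = Image.open(image_path).convert("RGB")
    prompt = f"USER: <image>\n{question} ASSISTANT:"
    inputs = processor(
        text=prompt,
        images=image,
        return_tensors="pt"
    ).to(fine_tuned_model.device, torch.float16)
    with torch.no_grad():
        outputs = fine_tuned_model.generate(
            **inputs,
            max_new_tokens=200,
            do_sample=True,  # temperature and top_p only take effect when sampling
            temperature=0.3,
            top_p=0.9,
            pad_token_id=processor.tokenizer.eos_token_id
        )
    answer = processor.decode(outputs[0], skip_special_tokens=True)
    answer = answer.split("ASSISTANT:")[-1].strip()
    return answer

test_image_path = "./medical_images/lung2.jpg"
test_question = "How large is the nodule in this lung CT image, and what do its margins look like?"
fine_tuned_answer = medical_qa_infer(test_image_path, test_question)
print(f"Answer after fine-tuning: {fine_tuned_answer}")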

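Contents

Core Goals and Key Points
Basic Concepts and Terminology
Modalities and Task Categories
Mainstream Architecture Choices
Data Preprocessing: Alignment and Standardization
Text-Image Data Preprocessing
Text Preprocessing
Image Preprocessing
Modality Alignment
Text-Speech Data Preprocessing
Speech Feature Extraction
Temporal Alignment
Typical Scenarios in Practice
Scenario 1: Cross-Modal Question Answering System
Model Loading and Inference
Web Deployment
Scenario 2: Text-to-Image Generation System
Model Configuration and Generation
Prompt Optimization and Gradio Deployment
Scenario 3: Multimodal Voice Assistant
ASR and TTS Modules
Integration and Deployment
Model Fine-Tuning and Optimization
Fine-Tuning Data Preparation
QLoRA Fine-Tuning
Evaluation
Summary and Recommendations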
