Recommended VAEs for Stable Diffusion Base Models and Configuration Optimization
Introduction: The Core Role of the VAE in the Stable Diffusion Ecosystem
The variational autoencoder (VAE) is an indispensable component of the Stable Diffusion generation architecture, responsible for converting between the latent-space representation and pixel space. Although often overlooked, the quality of the VAE directly affects the detail, color accuracy, and overall visual quality of generated images. This article analyzes the optimal VAE configuration for different Stable Diffusion base models, covering the selection strategy from technical principles to practical application.
The core functions of the VAE in Stable Diffusion include:
- Encoding: compressing the input image into a latent representation
- Decoding: reconstructing a high-quality image from the latent representation
- Regularization: keeping the latent space close to a Gaussian distribution so the diffusion process can sample from it
1. Technical Principles of the VAE
1.1 Mathematical Foundations of Variational Autoencoders
A variational autoencoder learns a latent representation of the data; its mathematical foundation is variational inference. Given input data x, the VAE maximizes the evidence lower bound (ELBO):
$$\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{\mathrm{KL}}(q(z|x)\,\|\,p(z))$$
where $q(z|x)$ is the approximate posterior (the encoder), $p(x|z)$ is the generative distribution (the decoder), and $p(z)$ is the prior, usually a standard normal distribution.
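For a diagonal Gaussian posterior $q(z|x) = \mathcal{N}(\mu, \sigma^2)$ and a standard normal prior, the KL term in the ELBO has the closed form $D_{\mathrm{KL}} = -\tfrac{1}{2}\sum(1 + \log\sigma^2 - \mu^2 - \sigma^2)$. A minimal PyTorch sketch of this term (the function name `gaussian_kl` is ours, for illustration only):

```python
import torch

def gaussian_kl(mean: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    # Closed-form KL(N(mean, exp(log_var)) || N(0, I)), summed over latent dims
    return -0.5 * torch.sum(1 + log_var - mean.pow(2) - log_var.exp())

# Sanity check: when the posterior equals the prior (mean 0, variance 1),
# the KL divergence is zero
kl = gaussian_kl(torch.zeros(4), torch.zeros(4))
```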
In Stable Diffusion, each spatial dimension of the latent space is typically 1/8 of the original image, so a 512×512 image maps to a 64×64×4 latent representation, which greatly reduces computational cost.
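The savings from the 8× spatial compression can be checked with simple arithmetic (variable names here are illustrative):

```python
# A 512x512 RGB image vs. its 64x64x4 latent (spatial downsampling factor 8)
image_elems = 512 * 512 * 3
latent_elems = (512 // 8) * (512 // 8) * 4
compression_ratio = image_elems / latent_elems  # 48x fewer values to denoise
```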
1.2 VAE Architecture Design
The VAE used by Stable Diffusion is based on an improved VQ-GAN architecture; its key building blocks include:
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(ResidualBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.activation = nn.SiLU()
        # 1x1 projection on the skip path when channel counts differ
        if in_channels != out_channels:
            self.skip = nn.Conv2d(in_channels, out_channels, 1)
        else:
            self.skip = nn.Identity()

    def forward(self, x):
        skip = self.skip(x)
        x = self.activation(self.conv1(x))
        x = self.conv2(x)
        return self.activation(x + skip)

class VAEEncoder(nn.Module):
    def __init__(self, in_channels=3, channels=(128, 256, 512), latent_channels=4):
        super(VAEEncoder, self).__init__()
        self.initial_conv = nn.Conv2d(in_channels, channels[0], 3, padding=1)
        self.down_blocks = nn.ModuleList()
        self.down_samples = nn.ModuleList()
        for i in range(len(channels) - 1):
            self.down_blocks.append(ResidualBlock(channels[i], channels[i]))
            # Strided convolution halves the spatial resolution at each stage
            self.down_samples.append(nn.Conv2d(channels[i], channels[i + 1], 3, stride=2, padding=1))
        self.mid_block = ResidualBlock(channels[-1], channels[-1])
        # Twice latent_channels: one half for the mean, one for the log-variance
        self.final_conv = nn.Conv2d(channels[-1], latent_channels * 2, 3, padding=1)

    def forward(self, x):
        x = self.initial_conv(x)
        for block, sample in zip(self.down_blocks, self.down_samples):
            x = block(x)
            x = sample(x)
        x = self.mid_block(x)
        x = self.final_conv(x)
        mean, log_var = torch.chunk(x, 2, dim=1)
        return mean, log_var

class VAEDecoder(nn.Module):
    def __init__(self, out_channels=3, channels=(512, 256, 128), latent_channels=4):
        super(VAEDecoder, self).__init__()
        self.initial_conv = nn.Conv2d(latent_channels, channels[0], 3, padding=1)
        self.mid_block = ResidualBlock(channels[0], channels[0])
        self.up_blocks = nn.ModuleList()
        self.up_samples = nn.ModuleList()
        for i in range(len(channels) - 1):
            self.up_blocks.append(ResidualBlock(channels[i], channels[i]))
            # Transposed convolution doubles the spatial resolution at each stage
            self.up_samples.append(nn.ConvTranspose2d(channels[i], channels[i + 1], 4, stride=2, padding=1))
        self.final_block = ResidualBlock(channels[-1], channels[-1])
        self.final_conv = nn.Conv2d(channels[-1], out_channels, 3, padding=1)

    def forward(self, z):
        x = self.initial_conv(z)
        x = self.mid_block(x)
        for block, sample in zip(self.up_blocks, self.up_samples):
            x = block(x)
            x = sample(x)
        x = self.final_block(x)
        x = self.final_conv(x)
        return torch.sigmoid(x)
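The encoder above returns `mean` and `log_var` rather than a latent directly; during training, a latent is sampled with the reparameterization trick so that gradients flow through the sampling step. A minimal sketch (the `reparameterize` helper is our illustration, not part of the architecture above):

```python
import torch

def reparameterize(mean: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    # z = mean + sigma * eps with eps ~ N(0, I): stochastic, yet differentiable
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mean + std * eps

# Latent layout used by Stable Diffusion: (batch, 4, H/8, W/8)
z = reparameterize(torch.zeros(1, 4, 64, 64), torch.zeros(1, 4, 64, 64))
```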

