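Picking the Right VAE Can Pay Off More Than Swapping the Base Model
In the Stable Diffusion generation pipeline, the VAE (variational autoencoder) connects latent space and pixel space. Its quality directly affects detail sharpness, color accuracy, and the overall look of the image. Many people spend all their time tweaking base models and never once swap the VAE, yet sometimes a different VAE changes the picture immediately.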
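What the VAE actually does
In short, the VAE does two jobs: encoding (compressing an image into a latent representation) and decoding (reconstructing an image from that representation). During training it also pushes the latent distribution toward a standard normal, which makes sampling easier for the diffusion model.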
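Mathematically, it is trained to maximize the evidence lower bound (ELBO):
log p(x) ≥ E_{q(z|x)}[log p(x|z)] - D_KL(q(z|x) || p(z))
The first term rewards accurate reconstruction. The KL divergence term regularizes the latent space by penalizing how far the approximate posterior q(z|x) drifts from the prior p(z).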
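When q(z|x) is a diagonal Gaussian and p(z) is a standard normal (an assumption here, though it matches an encoder that outputs a mean and a log-variance), the KL term has a closed form: 0.5 * sum(mean^2 + exp(log_var) - 1 - log_var). Below is a minimal sketch in plain Python; the function name `kl_to_standard_normal` is illustrative and not taken from any library.

```python
import math


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    `mean` and `log_var` are equal-length sequences of floats.
    Assumes a diagonal Gaussian posterior and a standard normal prior.
    """
    if len(mean) != len(log_var):
        raise ValueError("mean and log_var must have the same length")
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv
        for m, lv in zip(mean, log_var)
    )


# A posterior that already equals the prior costs nothing.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
# Shifting the mean by 1 in one dimension costs 0.5 nats.
print(kl_to_standard_normal([1.0], [0.0]))  # 0.5
```

The penalty grows as the mean moves away from zero or as the variance moves away from one, which is what keeps the latents close to the prior.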
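In Stable Diffusion, the VAE typically downsamples the input by a factor of 8 along each spatial dimension: a 512×512 image becomes a 64×64×4 latent tensor, which cuts the compute cost of the diffusion steps substantially.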
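The arithmetic behind that saving is easy to check. The sketch below assumes a 3-channel RGB input, an 8x spatial downsampling factor, and 4 latent channels; the helper names are hypothetical.

```python
def latent_shape(height, width, downsample_factor=8, latent_channels=4):
    """Return the (channels, height, width) of the latent for a given image size."""
    if height % downsample_factor or width % downsample_factor:
        raise ValueError("image sides must be divisible by the downsample factor")
    return (
        latent_channels,
        height // downsample_factor,
        width // downsample_factor,
    )


def compression_ratio(height, width, image_channels=3,
                      downsample_factor=8, latent_channels=4):
    """How many image values there are per latent value."""
    c, h, w = latent_shape(height, width, downsample_factor, latent_channels)
    return (image_channels * height * width) / (c * h * w)


print(latent_shape(512, 512))       # (4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

So the diffusion model works on 16,384 values per image instead of 786,432, a 48x reduction under these assumptions.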
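Architecturally, it builds on a modified VQ-GAN design; the key components are residual blocks and down/upsampling layers. Below is a simplified encoder and decoder to give a feel for the structure: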
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection."""

    def __init__(self, in_channels, out_channels):
        super(ResidualBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.activation = nn.SiLU()
        # Use a 1x1 convolution on the skip path when the channel count changes.
        if in_channels != out_channels:
            self.skip = nn.Conv2d(in_channels, out_channels, 1)
        else:
            self.skip = nn.Identity()

    def forward(self, x):
        skip = self.skip(x)
        x = self.activation(self.conv1(x))
        x = self.conv2(x)
        return self.activation(x + skip)
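

class VAEEncoder(nn.Module):
    """Maps an image to the mean and log-variance of a latent Gaussian."""

    # Default values are illustrative assumptions: four channel stages give
    # three stride-2 steps, i.e. the 8x downsampling described above.
    def __init__(self, in_channels=3, latent_channels=4,
                 channels=(128, 256, 512, 512)):
        super(VAEEncoder, self).__init__()
        self.initial_conv = nn.Conv2d(in_channels, channels[0], 3, padding=1)
        self.down_blocks = nn.ModuleList()
        self.down_samples = nn.ModuleList()
        for i in range(len(channels) - 1):
            self.down_blocks.append(ResidualBlock(channels[i], channels[i]))
            # A stride-2 convolution halves the height and width.
            self.down_samples.append(
                nn.Conv2d(channels[i], channels[i + 1], 3, stride=2, padding=1))
        self.mid_block = ResidualBlock(channels[-1], channels[-1])
        # Twice the latent channels: one half for the mean, one for log-variance.
        self.final_conv = nn.Conv2d(channels[-1], latent_channels * 2, 3, padding=1)

    def forward(self, x):
        x = self.initial_conv(x)
        for block, sample in zip(self.down_blocks, self.down_samples):
            x = block(x)
            x = sample(x)
        x = self.mid_block(x)
        x = self.final_conv(x)
        mean, log_var = torch.chunk(x, 2, dim=1)
        return mean, log_var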
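

class VAEDecoder(nn.Module):
    """Maps a latent tensor back to an image with values in [0, 1]."""

    # Default values are illustrative assumptions that mirror the encoder.
    def __init__(self, latent_channels=4, out_channels=3,
                 channels=(512, 512, 256, 128)):
        super(VAEDecoder, self).__init__()
        self.initial_conv = nn.Conv2d(latent_channels, channels[0], 3, padding=1)
        self.mid_block = ResidualBlock(channels[0], channels[0])
        self.up_blocks = nn.ModuleList()
        self.up_samples = nn.ModuleList()
        for i in range(len(channels) - 1):
            self.up_blocks.append(ResidualBlock(channels[i], channels[i]))
            # Kernel 4, stride 2, padding 1 doubles the height and width exactly.
            self.up_samples.append(
                nn.ConvTranspose2d(channels[i], channels[i + 1], 4, stride=2, padding=1))
        self.final_block = ResidualBlock(channels[-1], channels[-1])
        self.final_conv = nn.Conv2d(channels[-1], out_channels, 3, padding=1)

    def forward(self, z):
        x = self.initial_conv(z)
        x = self.mid_block(x)
        for block, sample in zip(self.up_blocks, self.up_samples):
            x = block(x)
            x = sample(x)
        x = self.final_block(x)
        x = self.final_conv(x)
        return torch.sigmoid(x)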
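
The encoder returns a mean and a log-variance, while the decoder expects a concrete latent z. The usual bridge between the two is the reparameterization trick: z = mean + exp(0.5 * log_var) * eps, with eps drawn from a standard normal. The sketch below is written in plain Python on lists so it runs without PyTorch; the function name and the `rng` parameter are illustrative assumptions, not part of the code above.

```python
import math
import random


def reparameterize(mean, log_var, rng=None):
    """Sample z ~ N(mean, diag(exp(log_var))) via z = mean + std * eps.

    `mean` and `log_var` are equal-length sequences of floats.
    `rng` is an optional random.Random instance for reproducible draws.
    """
    if len(mean) != len(log_var):
        raise ValueError("mean and log_var must have the same length")
    rng = rng or random
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


# With PyTorch tensors the same formula is applied elementwise, e.g.
#   z = mean + torch.exp(0.5 * log_var) * torch.randn_like(mean)
z = reparameterize([0.5, -1.0, 2.0], [0.0, -2.0, -4.0], random.Random(0))
print(len(z))  # 3
```

Because the randomness lives entirely in eps, gradients can flow through the mean and the log-variance during training. At inference time a common simplification is to decode from the mean directly.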

