
[Paper Review] Diffusion Models already have a Semantic Latent Space

💡 Key Summary

  1. Proposes a way to solve a problem that arises when controlling the generative process of existing diffusion models
    • Asyrp is proposed to resolve the issue of intermediate changes canceling each other out
  2. Discovers h-space, a semantic latent space in diffusion models that can be used to control image generation
    • Just as images are controlled in GANs by editing the latent space, images can be controlled the same way in diffusion models
  3. Splits the generative process into three stages (editing with Asyrp, standard denoising, and quality boosting) so that diverse, high-quality images can be generated

Introduction

This section describes how image generation is controlled in existing diffusion models.

Untitled

(a) Image guidance combines the latent variable of a guide image with the unconditional latent variable. However, it is ambiguous which attributes of the guide and of the unconditional result will be reflected, and the magnitude of the change is hard to control intuitively.

(b) Classifier guidance adds a classifier to the diffusion model and manipulates the image by applying the classifier's gradient to the latent variables during the reverse process so that the result matches the target class. This requires training an additional classifier, and computing gradients through the classifier during sampling is costly.

The paper proposes the asymmetric reverse process (Asyrp), which uncovers a semantic latent space in a frozen diffusion model. The semantic latent space found this way is called h-space. According to the authors, this is the first time a semantic latent space has been discovered in a pretrained, frozen diffusion model.

2. Background

Before discussing the semantic latent space, we start by looking at the DDIM reverse process. DDIM uses a non-Markovian process to redefine the DDPM forward process as follows.

DDPM, DDIM

  • DDPM forward process

Untitled

  • DDIM forward process
\[q_{\sigma}(x_{t-1}|x_t,x_0) = \mathcal{N}(\sqrt{\alpha_{t-1}}x_0 + \sqrt{1-\alpha_{t-1}-\sigma_t^2} \cdot \cfrac{x_t - \sqrt{\alpha_t}x_0}{\sqrt{1-\alpha_t}}, \sigma_t^2I)\]
  • DDIM reverse process

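\[x_{t-1} = \sqrt{\alpha_{t-1}}\underbrace{\left(\cfrac{x_t - \sqrt{1-\alpha_t}\,\epsilon_t^{\theta}(x_t)}{\sqrt{\alpha_t}}\right)}_{\text{predicted } x_0} + \underbrace{\sqrt{1-\alpha_{t-1}-\sigma_t^2} \cdot \epsilon_t^{\theta}(x_t)}_{\text{direction pointing to } x_t} + \sigma_t z_t\]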

Here $\sigma_t = \eta\sqrt{(1-\alpha_{t-1}) / (1-\alpha_t)} \sqrt{1-\alpha_t/\alpha_{t-1}}$. With $\eta = 1$ the process becomes DDPM and is stochastic; with $\eta = 0$ it becomes the deterministic DDIM.

The paper abbreviates the DDIM reverse process as shown below. The "predicted $x_0$" part is written $\mathrm{P}_t(\epsilon_t^{\theta}(x_t))$ and the "direction pointing to $x_t$" part is written $\mathrm{D}_t(\epsilon_t^{\theta}(x_t))$.

\[x_{t-1} = \sqrt{\alpha_{t-1}}\mathrm{P}_t(\epsilon_t^{\theta}(x_t)) + \mathrm{D}_t(\epsilon_t^{\theta}(x_t)) + \sigma_t z_t\]

For brevity, $\mathrm{P}_t(\epsilon_t^{\theta}(x_t))$ is written $P_t$ and $\mathrm{D}_t(\epsilon_t^{\theta}(x_t))$ is written $D_t$, and the $\sigma_t z_t$ term is omitted except when $\eta\ne0$.
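The decomposition into $P_t$ and $D_t$ can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the standard DDIM forms $P_t = (x_t - \sqrt{1-\alpha_t}\,\epsilon)/\sqrt{\alpha_t}$ and $D_t = \sqrt{1-\alpha_{t-1}-\sigma_t^2}\,\epsilon$, uses plain floats in place of image tensors, and the function names are hypothetical.

```python
import math

def sigma_t(alpha_t, alpha_prev, eta):
    """Noise scale from the text: eta = 0 is deterministic DDIM, eta = 1 is DDPM-like."""
    return eta * math.sqrt((1 - alpha_prev) / (1 - alpha_t)) * math.sqrt(1 - alpha_t / alpha_prev)

def predicted_x0(x_t, eps, alpha_t):
    """P_t: the 'predicted x_0' term (assumed standard DDIM form)."""
    return (x_t - math.sqrt(1 - alpha_t) * eps) / math.sqrt(alpha_t)

def direction_to_xt(eps, alpha_prev, sigma):
    """D_t: the 'direction pointing to x_t' term (assumed standard DDIM form)."""
    return math.sqrt(1 - alpha_prev - sigma ** 2) * eps

def ddim_step(x_t, eps, alpha_t, alpha_prev, eta=0.0, z=0.0):
    """One reverse step: sqrt(alpha_{t-1}) * P_t + D_t + sigma_t * z_t."""
    sigma = sigma_t(alpha_t, alpha_prev, eta)
    p_t = predicted_x0(x_t, eps, alpha_t)
    d_t = direction_to_xt(eps, alpha_prev, sigma)
    return math.sqrt(alpha_prev) * p_t + d_t + sigma * z
```

If `eps` is exactly the noise that produced `x_t`, `predicted_x0` recovers the clean value, which is why the first term is called the predicted $x_0$.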

Image Manipulation with CLIP

CLIP learns a multimodal embedding with an image encoder $E_I$ and a text encoder $E_T$, and similarity in this embedding reflects how well an image and a text match. A directional loss, based on the cosine distance between the change in the image embedding (source to edited image) and the change in the text embedding (source to target description), gives uniform editing without mode collapse.

\[\mathcal{L}_{direction} (x^{edit}, y^{target};x^{source},y^{source}) := 1 - \cfrac{\Delta I \cdot \Delta T}{\parallel\Delta I\parallel \parallel\Delta T\parallel}\]

$\Delta T = \mathrm{E}_T(y^{target}) - \mathrm{E}_T(y^{source})$

$\Delta I = \mathrm{E}_I(x^{edit}) - \mathrm{E}_I(x^{source})$

  • $x^{edit}$: edited image
  • $y^{target}$: target description
  • $x^{source}$: original image
  • $y^{source}$: source description
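As a rough sketch of the formula above, the directional loss can be computed from four embedding vectors. The embeddings here are plain Python lists standing in for the outputs of $E_I$ and $E_T$; the function name and the zero-norm check are illustrative assumptions, not part of the paper.

```python
import math

def directional_loss(img_edit, img_src, txt_target, txt_src):
    """1 - cos(delta_I, delta_T), with delta_I and delta_T as defined in the text."""
    delta_i = [e - s for e, s in zip(img_edit, img_src)]    # E_I(x_edit) - E_I(x_source)
    delta_t = [t - s for t, s in zip(txt_target, txt_src)]  # E_T(y_target) - E_T(y_source)
    dot = sum(a * b for a, b in zip(delta_i, delta_t))
    norm_i = math.sqrt(sum(a * a for a in delta_i))
    norm_t = math.sqrt(sum(b * b for b in delta_t))
    if norm_i == 0.0 or norm_t == 0.0:
        # Assumption for this sketch: treat an unchanged embedding as an error.
        raise ValueError("direction is undefined when an embedding does not change")
    return 1.0 - dot / (norm_i * norm_t)
```

The loss is 0 when the image moves in exactly the direction the text moves, 1 when the two directions are orthogonal, and 2 when they are opposite.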

3. Discovering Semantic Latent Space In Diffusion Models

This part explains why existing methods fail to control the reverse process well, and introduces a technique that can control the generative process.

3.1. Problem

The first approach to semantic latent manipulation is, as described in Section 2, to update $x_T$ so as to optimize the CLIP loss for a given text prompt, thereby adjusting $x_0$. However, this leads to distorted images or incorrect manipulation.

Another approach is to shift the noise $\epsilon_t^{\theta}$ predicted by the network at each sampling step in the desired direction. However, the intermediate changes in $P_t$ and $D_t$ cancel each other out, so the result is no different from the original latent variable.

The proof is given in Appendix C of the paper.

  • Proof (Appendix C)

    Untitled

    Untitled
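The cancellation can also be seen informally from the DDIM step itself. The following is a sketch, not the proof from Appendix C; it assumes $\eta = 0$ and the standard forms $\mathrm{P}_t(\epsilon) = (x_t - \sqrt{1-\alpha_t}\,\epsilon)/\sqrt{\alpha_t}$ and $\mathrm{D}_t(\epsilon) = \sqrt{1-\alpha_{t-1}}\,\epsilon$. Shifting the predicted noise by $\Delta\epsilon$ in both terms gives

\[\tilde{x}_{t-1} = \sqrt{\alpha_{t-1}}\mathrm{P}_t(\epsilon_t^{\theta} + \Delta\epsilon) + \mathrm{D}_t(\epsilon_t^{\theta} + \Delta\epsilon) = x_{t-1} + \left(\sqrt{1-\alpha_{t-1}} - \cfrac{\sqrt{\alpha_{t-1}}\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\right)\Delta\epsilon\]

For adjacent timesteps $\alpha_{t-1} \approx \alpha_t$, so the two terms in the parentheses nearly cancel and the shift barely changes $x_{t-1}$.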

3.2 Asymmetric Reverse Process(Asyrp)

To solve the problem described above, the paper proposes Asyrp. Existing approaches fail to produce the desired effect because the intermediate changes in $P_t$ and $D_t$ cancel out, so Asyrp makes $P_t$ and $D_t$ behave asymmetrically. $\mathrm{P}_t$, which predicts $x_0$, uses the shifted epsilon $\tilde{\epsilon}_t^{\theta}(x_t)$, while $\mathrm{D}_t$, which returns to the latent variable, uses the non-shifted epsilon $\epsilon_t^{\theta}$. Asyrp is written as follows.

\[x_{t-1} = \sqrt{\alpha_{t-1}}\mathrm{P}_t(\tilde{\epsilon}_t^{\theta}(x_t)) + \mathrm{D}_t(\epsilon_t^{\theta}(x_t))\]

The loss is rebuilt from the $\mathcal{L}_{direction}$ introduced in Section 2. It uses the unedited $\mathrm{P}_t^{source}$ and the edited $\mathrm{P}_t^{edit}$. The loss is as follows.

\[\mathcal{L}^{(t)} = \lambda_{CLIP}\mathcal{L}_{direction}(\mathrm{P}_t^{edit}, y^{ref};\mathrm{P}_t^{source},y^{source}) + \lambda_{recon}|\mathrm{P}_t^{edit} - \mathrm{P}_t^{source}|\]

The overall reverse process is illustrated in the figure below.

Untitled

The original DDIM noise is used for the direction pointing to $x_t$, while the shifted epsilon is used to predict $x_0$.
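To make the asymmetry concrete, the sketch below compares shifting epsilon in both terms with shifting it only in $P_t$. It assumes $\eta = 0$ and the standard DDIM forms of $P_t$ and $D_t$, works on scalars rather than images, and the $\alpha$ values and $\Delta\epsilon$ are made-up toy numbers.

```python
import math

def p_t(x_t, eps, alpha_t):
    """Predicted x_0 (assumed standard DDIM form)."""
    return (x_t - math.sqrt(1 - alpha_t) * eps) / math.sqrt(alpha_t)

def d_t(eps, alpha_prev):
    """Direction pointing to x_t, with eta = 0."""
    return math.sqrt(1 - alpha_prev) * eps

def symmetric_step(x_t, eps, delta_eps, alpha_t, alpha_prev):
    """Shifted epsilon in both P_t and D_t (the approach whose changes cancel)."""
    shifted = eps + delta_eps
    return math.sqrt(alpha_prev) * p_t(x_t, shifted, alpha_t) + d_t(shifted, alpha_prev)

def asyrp_step(x_t, eps, delta_eps, alpha_t, alpha_prev):
    """Asyrp: shifted epsilon in P_t only, original epsilon in D_t."""
    return math.sqrt(alpha_prev) * p_t(x_t, eps + delta_eps, alpha_t) + d_t(eps, alpha_prev)

# Toy values for two adjacent timesteps (alpha_t close to alpha_prev).
x_t, eps, delta_eps = 1.0, 0.3, 0.5
alpha_t, alpha_prev = 0.50, 0.51

baseline = symmetric_step(x_t, eps, 0.0, alpha_t, alpha_prev)  # unedited DDIM step
sym_change = abs(symmetric_step(x_t, eps, delta_eps, alpha_t, alpha_prev) - baseline)
asy_change = abs(asyrp_step(x_t, eps, delta_eps, alpha_t, alpha_prev) - baseline)
```

With these numbers `sym_change` is far smaller than `asy_change`: the symmetric shift is almost entirely canceled, while the asymmetric one actually moves $x_{t-1}$.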

3.3 h-space

[Figure: U-Net structure and h-space]

In the U-Net, $h_t$ (yellow box), the deepest feature map of the encoder, is chosen to control $\epsilon_t^{\theta}$. $h_t$ has a small spatial resolution and carries high-level semantics.

The sampling equation using $h_t$ becomes the following.

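\[x_{t-1} = \sqrt{\alpha_{t-1}}\mathrm{P}_t(\epsilon_t^{\theta}(x_t|\Delta{h_t})) + \mathrm{D}_t(\epsilon_t^{\theta}(x_t))\]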

In the equation above, \(\epsilon_t^{\theta}(x_t|\Delta{h_t})\) denotes the noise predicted after adding $\Delta{h_t}$ to the original feature map $h_t$.

h-space has the following properties.

  • The same $\Delta{h_t}$ has the same effect on different samples.
  • Linearly scaling $\Delta{h_t}$ controls the magnitude of the attribute change, even for negative scales.
  • Adding several $\Delta{h_t}$ manipulates the corresponding attributes at the same time.
  • $\Delta{h_t}$ preserves the quality of the resulting image without degradation.
  • $\Delta{h_t}$ is roughly consistent across different timesteps t.
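The scaling and composition properties listed above amount to simple additive arithmetic on the bottleneck feature. The sketch below is a toy illustration only: the feature map is flattened to a short list, and the direction names `smile` and `young` are hypothetical placeholders, not learned directions.

```python
def edit_bottleneck(h_t, directions, scales):
    """Return h_t + sum_i scale_i * delta_h_i, element-wise over flat feature lists."""
    edited = list(h_t)
    for delta_h, scale in zip(directions, scales):
        edited = [h + scale * d for h, d in zip(edited, delta_h)]
    return edited

h_t = [1.0, 2.0]      # toy bottleneck feature
smile = [0.5, 0.0]    # hypothetical direction for one attribute
young = [0.0, -1.0]   # hypothetical direction for another attribute

stronger = edit_bottleneck(h_t, [smile], [2.0])              # larger scale, larger change
reverse = edit_bottleneck(h_t, [smile], [-1.0])              # negative scale
combined = edit_bottleneck(h_t, [smile, young], [1.0, 1.0])  # two attributes at once
```

Because the same `smile` vector can be added to any `h_t`, the sketch also mirrors the first property: one direction is reused across samples.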

3.4 Implicit Neural Directions

Directly optimizing $\Delta{h_t}$ for many timesteps requires many training iterations and a careful choice of learning rate and schedule. To address this, the paper defines an implicit function $f_t(h_t)$ that produces $\Delta{h_t}$ given $h_t$ and $t$. It is implemented as two 1x1 convolutions combined with the timestep t.
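A rough picture of such a function, in plain Python: a 1x1 convolution is the same linear map applied to every pixel's channel vector, so two of them with a nonlinearity in between form a tiny per-pixel network. How the timestep enters is an assumption in this sketch (a per-channel embedding added after the first convolution), as are the swish activation, the weights, and all names.

```python
import math

def conv1x1(pixels, weight, bias):
    """Apply the same channel-mixing linear map to every pixel (a 1x1 convolution)."""
    return [
        [sum(w * c for w, c in zip(row, px)) + b for row, b in zip(weight, bias)]
        for px in pixels
    ]

def swish(v):
    return v / (1.0 + math.exp(-v))

def implicit_direction(h_t, t_emb, w1, b1, w2, b2):
    """f_t(h_t): conv1x1 -> add timestep embedding (assumed) -> swish -> conv1x1."""
    hidden = conv1x1(h_t, w1, b1)
    hidden = [[swish(v + e) for v, e in zip(px, t_emb)] for px in hidden]
    return conv1x1(hidden, w2, b2)  # delta_h_t, same spatial size as h_t
```

Here `h_t` is a list of pixels, each a list of channel values, so the output has one $\Delta h_t$ vector per spatial position.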

[Figure]

4. Generative Process Design

This part describes the overall editing process, which consists of three stages.

[Figure: the three stages of the generative process]

  1. Editing with Asyrp
  2. Standard denoising
  3. Quality boosting

The paper designs formulas that quantify the length of each stage.

4.1. Editing process with Asyrp

This is the early stage, in which the generative process is modified to change the semantics. The editing strength over the interval [T, t] is defined by the equation below.

[Equation image: definition of the editing strength $\xi_t$]

The shorter the editing interval, the lower $\xi_t$; the longer the interval, the more the resulting image changes. The best way to choose $t_{edit}$ is to find the shortest editing interval that still allows sufficient change. Based on experiments, the authors set $t_{edit}$ to the timestep t at which \(\mathrm{LPIPS}(x, \mathrm{P}_t) = 0.33\), the shortest editing interval they found that still gives enough change.
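Picking the timestep for a given threshold can be sketched as a scan from $t = T$ downward. The sketch assumes the LPIPS distances have already been computed with some LPIPS implementation and stored per timestep, and that they decrease as t decreases; the dictionary values are made-up toy numbers and the function name is hypothetical.

```python
def pick_timestep(lpips_by_t, threshold):
    """Scan from the largest t downward and return the first t whose distance <= threshold."""
    for t in sorted(lpips_by_t, reverse=True):
        if lpips_by_t[t] <= threshold:
            return t
    return None  # the threshold is never reached on this grid of timesteps

# Toy values of LPIPS(x, P_t) on a coarse grid of timesteps.
lpips_x_pt = {1000: 0.70, 800: 0.52, 600: 0.40, 500: 0.33, 400: 0.25}
t_edit = pick_timestep(lpips_x_pt, threshold=0.33)
```

On a discrete grid the distance rarely equals the threshold exactly, so the sketch takes the first timestep at or below it.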

The figure below shows generation results for various values of $\mathrm{LPIPS}(x, \mathrm{P}_{t_{edit}})$.

Generation results for various values of $\mathrm{LPIPS}(x, \mathrm{P}_{t_{edit}})$

4.2. Quality Boosting With Stochastic Noise Injection

DDIM achieves nearly perfect inversion by removing stochasticity, but there is evidence that stochasticity improves image quality. The paper therefore injects stochastic noise during the boosting interval to improve image quality.

A longer boosting interval gives higher quality, but an excessively long interval can alter the content. The authors therefore look for the shortest interval that provides sufficient quality boosting while minimizing changes to the content. The paper treats the noise in the image as the capacity for quality boosting and defines the quality deficiency, which measures the amount of noise in $x_t$ relative to the original image, as follows.

\[\gamma_t = \mathrm{LPIPS}(x, x_t)\]

Based on experiments, the authors set $t_{boost}$ to the timestep t at which $\gamma_t = 1.2$.

The figure below compares the results with and without quality boosting.

Results with and without quality boosting

4.3 Overall Process of Image Editing

The overall generative process using $t_{edit}$ and $t_{boost}$ is summarized below.

[Equation image: overall generative process with $t_{edit}$ and $t_{boost}$]
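The three-stage schedule can be sketched as a function of the timestep. This is an illustration under stated assumptions rather than the paper's exact equation: timesteps count down from T to 0, the boundaries $t_{edit}$ and $t_{boost}$ belong to the earlier stage, the boosting stage uses $\eta = 1$, and the values 500 and 200 are toy numbers.

```python
def plan_step(t, t_edit, t_boost):
    """Return (stage, use_shifted_eps, eta) for timestep t, counting down from T to 0."""
    if t >= t_edit:
        return ("editing", True, 0.0)      # Asyrp: shifted epsilon in P_t only
    if t >= t_boost:
        return ("denoising", False, 0.0)   # plain deterministic DDIM step
    return ("boosting", False, 1.0)        # stochastic noise injection

# Toy thresholds; in the text they come from the LPIPS criteria.
schedule = [plan_step(t, t_edit=500, t_boost=200) for t in (900, 500, 499, 200, 199, 0)]
```

A sampler would call `plan_step` once per timestep and pass the returned `eta` and the shifted or original epsilon into its reverse step.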

5. Experiments

  • Datasets and models
    • DDPM++ trained on the CelebA-HQ and LSUN-bedroom/-church datasets
    • iDDPM trained on the AFHQ-dog dataset
    • ADM with P2-weighting trained on the METFACES dataset

    → All models use pretrained checkpoints and are kept frozen.

5.1 Versatility of h-space with Asyrp

Editing results of Asyrp on various datasets.

Untitled

As the figure above shows, the images are edited in a way that reflects a wide range of attributes well. Even attributes not included in training, {department, factory, temple}, could be synthesized. Most remarkably, these results come from shifting only the bottleneck feature maps in h-space with Asyrp at inference time, without fine-tuning the model.

5.2 Quantitative Comparison

Given that Asyrp can be combined with a variety of diffusion models without fine-tuning, no directly comparable competitor could be found. The comparison is therefore made against DiffusionCLIP, which edits images by fine-tuning the whole model. 80 participants were shown 40 sets of original images together with the Asyrp and DiffusionCLIP results and asked to compare them on quality, naturalness, and overall preference. The results are shown below.

[Image: user study results, Asyrp vs. DiffusionCLIP]

Asyrp outperforms DiffusionCLIP on every criterion.

Untitled

5.3 Analysis on h-space

Homogeneity

The figure below shows the homogeneity of h-space compared with $\epsilon$-space. A $\Delta h_t$ optimized for one image produces the same attribute change in other input images. In contrast, a $\Delta \epsilon_t$ optimized for one image distorts other input images.

Untitled

Linearity

The figure below shows that linearly scaling $\Delta h_t$ is reflected in the amount of change in the visual attribute. Surprisingly, this generalizes even to negative scales, which are never seen during training.

Untitled

In addition, as the figure below shows, combining different $\Delta h_t$ yields the combined semantic changes in the resulting image.

Untitled

Robustness

The figure below compares the results of injecting random noise in h-space and in $\epsilon$-space. In h-space, adding random noise does not change the image much, and even with a large amount of noise there is almost no distortion, only semantic change. In $\epsilon$-space, by contrast, adding random noise distorts the image severely.

Untitled

Consistency across time steps

$\Delta h_t$ is homogeneous across samples, and replacing it with the mean $\Delta h^{mean}$ gives similar results. $\Delta h_t$ is used for the best quality and manipulation, but for simplicity one can use $\Delta h^{mean}$, or $\Delta h^{global}$ as a slight compromise. These still give results similar to those obtained with $\Delta h_t$.

Untitled

6. Conclusion

The paper proposes Asyrp, a new generative process for pretrained diffusion models that makes image editing easy in h-space, a latent semantic space. Like the latent space of GANs, h-space has desirable properties such as homogeneity, linearity, robustness, and consistency across timesteps. The overall editing process is designed to achieve versatile editing and high quality by measuring the editing strength and the quality deficiency at each timestep.

This article is licensed under CC BY 4.0 by the author.