The goal of this project is to explore and understand the problem of image-to-image translation. Two approaches to this topic will be analyzed: CycleGAN and FreeControl. An implementation of CycleGAN is also discussed.

Introduction

Image-to-image translation describes a class of vision and graphics tasks that aims to discover a mapping from an input image $\text{A}$ to an output image $\text{B}$ in a way that preserves the input picture’s structure and content. In certain tasks, the mapping algorithm can also be given an object or a criterion that pre-defines the transformation.

img2img_ex [1]

Image-to-image translation has many real-life applications. For example, collection style transfer combines stylistic components from a set of styled images and transforms a natural image to display a new style, such as changing the season from summer to winter and vice versa. Some algorithms can produce photo enhancements by changing the brightness, contrast, sharpness, or coloration of an image. Additionally, object transfiguration can be achieved by changing an object inside the input image.

Deep Learning Methods

In recent years, deep learning has revolutionized image-to-image translation by leveraging the power of neural networks to learn complex mappings directly from data. One notable breakthrough came with Generative Adversarial Networks (GANs), which provide a framework for training generative models by simultaneously training a generator network to produce realistic images and a discriminator network to distinguish between real and generated images.

Since then, numerous deep learning architectures have been proposed for various image-to-image translation tasks, including conditional GANs, Pix2Pix, CycleGAN, UNIT, and FreeControl. These models have demonstrated remarkable capabilities in tasks such as image style transfer, image-to-image translation between different domains (e.g., day to night, horse to zebra), and even the generation of photorealistic images from semantic label maps. In the sections below, we will discuss two different methods that address the topic of image-to-image translation.

CycleGAN

CycleGAN is a model that addresses image-to-image translation with unpaired data. The model attempts to learn a mapping both ways between a pair of image distributions through unsupervised learning, which is achieved using a GAN. Therefore, to understand CycleGAN, we should first understand how a GAN works.

GAN Model Architecture

The Generative Adversarial Network, or GAN, is a classical approach to many generative tasks in computer vision. At its core, a GAN consists of two neural networks: a generator and a discriminator. The generator aims to create realistic images from random noise, typically sampled from a Gaussian distribution, while the discriminator’s job is to distinguish between real images and those produced by the generator. Through training, the generator becomes better at generating new images that cause the discriminator to classify incorrectly, i.e., it learns to “trick” the discriminator.

Generator

Below is a PyTorch implementation of the Generator module.

# Generator Model Implementation
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, in_channels, out_channels, map_size):
        super(Generator, self).__init__()
        self.main = nn.Sequential(
            # input is Z, going into a convolution
            nn.ConvTranspose2d(in_channels, map_size * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(map_size * 8),
            nn.ReLU(True),
            # state size. ``(map_size*8) x 4 x 4``
            nn.ConvTranspose2d(map_size * 8, map_size * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(map_size * 4),
            nn.ReLU(True),
            # state size. ``(map_size*4) x 8 x 8``
            nn.ConvTranspose2d(map_size * 4, map_size * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(map_size * 2),
            nn.ReLU(True),
            # state size. ``(map_size*2) x 16 x 16``
            nn.ConvTranspose2d(map_size * 2, map_size, 4, 2, 1, bias=False),
            nn.BatchNorm2d(map_size),
            nn.ReLU(True),
            # state size. ``(map_size) x 32 x 32``
            nn.ConvTranspose2d(map_size, out_channels, 4, 2, 1, bias=False),
            nn.Tanh()
            # state size. ``(out_channels) x 64 x 64``
        )

    def forward(self, input):
        return self.main(input)

The generator takes a latent vector $z$ of size in_channels as input and transforms it into image pixel space. To create an output image of size $\left( 3 \times 64 \times 64 \right)$, this implementation uses five 2D transposed convolutional layers that progressively upsample the feature maps from the $1 \times 1$ latent input to a $64 \times 64$ image. A $\text{ReLU}$ activation is applied after each upsampling layer. Batch Normalization is also applied after each upsampling layer to stabilize training, one of the key architectural changes introduced in DCGAN, an improvement over the original GAN. Finally, the output is passed through a $\text{Tanh}$ activation to produce pixel values between -1 and 1 for the discriminator to classify.
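
As a quick usage sketch (the latent size of 100 and the batch size are illustrative choices, not fixed by the model), a batch of fake images can be generated as follows:

import torch

latent_dim = 100                        # size of the latent vector z (assumed)
netG = Generator(in_channels=latent_dim, out_channels=3, map_size=64)

z = torch.randn(16, latent_dim, 1, 1)   # 16 latent vectors, shaped for ConvTranspose2d
fake_images = netG(z)                   # -> (16, 3, 64, 64), values in [-1, 1] from Tanh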

Discriminator

Below is a PyTorch implementation of the Discriminator module.

class Discriminator(nn.Module):
    def __init__(self, in_channels, map_size):
        super(Discriminator, self).__init__()
        self.main = nn.Sequential(
            # input is ``(in_channels) x 64 x 64``
            nn.Conv2d(in_channels, map_size, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. ``(map_size) x 32 x 32``
            nn.Conv2d(map_size, map_size * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(map_size * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. ``(map_size*2) x 16 x 16``
            nn.Conv2d(map_size * 2, map_size * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(map_size * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. ``(map_size*4) x 8 x 8``
            nn.Conv2d(map_size * 4, map_size * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(map_size * 8),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. ``(map_size*8) x 4 x 4``
            nn.Conv2d(map_size * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
        )

    def forward(self, input):
        return self.main(input)

The discriminator takes as input either a real image or a generated fake image and transforms it into a probability for classification. In this implementation, five 2D convolutional layers progressively reduce the spatial size of the feature maps while increasing the number of channels, capturing different levels of information within the input image. Batch Normalization and $\text{LeakyReLU}$ activations are used, similar to the generator, to improve training stability and prevent dying gradients. The final $\text{Sigmoid}$ activation converts the output to a probability that the input image is real.
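
Continuing the hypothetical usage sketch above, the discriminator maps a batch of real or generated images to per-image probabilities:

netD = Discriminator(in_channels=3, map_size=64)

probs = netD(fake_images).view(-1)      # -> (16,), each value in (0, 1) from Sigmoid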

GAN Loss Function

The GAN loss function is expressed as the sum of two terms that jointly drive the updates of the generator and discriminator weights. From the original paper [1], the GAN objective is

\[\min_{G}\max_{D}\mathbb{E}\_{x\sim p_{\text{data}}(x)}\left[\log{D(x)}\right] + \mathbb{E}\_{z\sim p_{\text{z}}(z)}\left[\log{(1 - D(G(z)))}\right]\]

In this function:

  • $D(x) =$ probability that real image $x$ is real
  • $G(z) =$ generator’s output given noise $z$
  • $D(G(z)) =$ probability that generated fake image is real
  • $\mathbb{E}_z =$ expected value over all random inputs to the generator

Both terms in the formula are derived from the cross-entropy loss. The discriminator maximizes $\log D(x)$ and $\log(1 - D(G(z)))$, while the generator minimizes $\log(1 - D(G(z)))$. In practice, the generator and the discriminator are updated in alternating training steps, so one network’s weights are frozen while the other’s are updated. For example, when the generator is updated, the discriminator’s weights are frozen, and the $\log D(x)$ term is not directly affected while optimizing the generator.
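
A minimal sketch of this alternating update is shown below, reusing the netG and netD modules from the usage examples above; the optimizer settings and the dataloader are assumptions, and the generator step uses the common non-saturating variant that maximizes $\log D(G(z))$ instead of minimizing $\log(1 - D(G(z)))$.

import torch
import torch.nn as nn

criterion = nn.BCELoss()
optD = torch.optim.Adam(netD.parameters(), lr=2e-4, betas=(0.5, 0.999))
optG = torch.optim.Adam(netG.parameters(), lr=2e-4, betas=(0.5, 0.999))

for real in dataloader:                          # `dataloader` yields batches of real images (assumed)
    b = real.size(0)
    real_labels = torch.ones(b)
    fake_labels = torch.zeros(b)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z))); detach() keeps G fixed here
    z = torch.randn(b, latent_dim, 1, 1)
    fake = netG(z)
    lossD = criterion(netD(real).view(-1), real_labels) + \
            criterion(netD(fake.detach()).view(-1), fake_labels)
    optD.zero_grad(); lossD.backward(); optD.step()

    # Generator step: D is not updated; labels are flipped to "real" (non-saturating loss)
    lossG = criterion(netD(fake).view(-1), real_labels)
    optG.zero_grad(); lossG.backward(); optG.step()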

CycleGAN Specific Features

The CycleGAN approach modifies the original GAN by introducing the concept of cycle consistency. For a translation task to be “cycle consistent”, the function $G$ that maps images from domain $\mathbb{X}$ to $\mathbb{Y}$ and the function $F$ that maps domain $\mathbb{Y}$ to $\mathbb{X}$ should be inverses of each other. Furthermore, both mappings should be bijections, ideally achieving a one-to-one correspondence that covers the entirety of both distributions. This constraint is critical because the adversarial loss alone cannot guarantee that the learned mapping takes an individual input to its desired output (e.g., many inputs might collapse to the same output). In detail, CycleGAN introduces two cycle consistencies:

  • forward cycle-consistency: $\,x \rightarrow G(x) \rightarrow F(G(x)) \approx x$
  • backward cycle-consistency: $\,y \rightarrow F(y) \rightarrow G(F(y)) \approx y$

The idea is that passing an input $x$ through the transformation $G$ and then its inverse $F$ should recover $x$; the same should hold in the reverse direction, applying $F$ to $y$ and then $G$.

CycleGAN utilizes two separate GAN networks to represent the transformations $G$ and $F$. The loss function for the transformation $G:X \rightarrow Y$, directly derived from the GAN objective, is expressed as:

\[\mathcal{L}\_{GAN} \left( G,D_Y,X,Y \right) = \mathbb{E}\_{y\sim p_{data} (y)} \left[\log D_Y(y) \right] + \mathbb{E}\_{x\sim p_{data}(x)}\left[\log(1-D_Y(G(x)))\right]\]

where the generator $G$ takes samples from the input domain $\mathbb{X}$ as input instead of drawing from a latent distribution. The inverse transformation, $F: Y \rightarrow X$, is formulated similarly.

To exploit the properties of cycle consistency, the model also introduces two L1 reconstruction terms that together form the cycle-consistency loss:

\[\mathcal{L}\_{cyc}(G,F) = \mathbb{E}\_{x\sim p_{data}(x)} \left[||F(G(x))-x||\_1\right] + \mathbb{E}\_{y\sim p_{data}(y)}\left[||G(F(y))-y||\_1\right]\]

Together, the full objective function of CycleGAN is:

\[\mathcal{L} (G,F,D_X,D_Y) = \mathcal{L}\_{GAN}(G,D_Y,X,Y) + \mathcal{L}\_{GAN}(F,D_X,Y,X) + \lambda \mathcal{L}\_{cyc}(G,F)\]

where $\lambda$ controls the relative importance of the cycle-consistency loss. Both generators in the CycleGAN architecture are influenced by the cycle-consistency loss and are updated to preserve the bijective correspondence between the underlying distributions.
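
A hedged sketch of the generator-side objective is shown below; the module names (G, F, D_X, D_Y) and the data tensors real_x, real_y are assumptions, and in practice the CycleGAN paper replaces the log-likelihood adversarial term with a least-squares loss for stability.

import torch
import torch.nn as nn

bce = nn.BCELoss()        # the official implementation uses a least-squares GAN loss instead
l1  = nn.L1Loss()
lam = 10.0                # lambda, weight of the cycle-consistency term (value used in the paper)

fake_y = G(real_x)        # G : X -> Y
fake_x = F(real_y)        # F : Y -> X

# Adversarial terms (generator side): make D_Y(G(x)) and D_X(F(y)) look "real"
loss_gan = bce(D_Y(fake_y), torch.ones_like(D_Y(fake_y))) + \
           bce(D_X(fake_x), torch.ones_like(D_X(fake_x)))

# Cycle-consistency terms: F(G(x)) should recover x, and G(F(y)) should recover y
loss_cyc = l1(F(fake_y), real_x) + l1(G(fake_x), real_y)

loss_total = loss_gan + lam * loss_cyc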

photo2painting [2]

CycleGAN produces impressive results translating between input photos and various artistic painting styles, without requiring a paired image in the target image domain. CycleGAN also performs well on season transfers and object transfiguration tasks, and can effectively modify the depth of field within an image.

Understanding Diffusion Models

Diffusion models are at the forefront of image generation in deep learning. They are trained by gradually adding noise to the training data and learning to reverse the process, i.e., to denoise, so that new samples resembling the original data can be generated.

Training a diffusion model involves two main steps. First, noise (usually Gaussian) is added to the image over $T$ timesteps. Second, a neural network is trained to remove the added noise over those $T$ timesteps and restore clarity to the image. The more steps used, the better the results tend to be, because each step adds only a small dose of noise that is easier for the model to predict and remove, leading to clearer images. The same network is used across all steps and, combined with a predefined noise schedule, this keeps the denoising process consistent. By encoding temporal information this way, the model can effectively restore earlier, less noisy states of the image.

A defined loss function guides the model’s improvement through gradient descent. By quantifying the difference between the model’s predictions and the true targets (for example, the noise that was added), this function helps refine the model.
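
A minimal sketch of this training objective in the common noise-prediction (DDPM-style) form is shown below; the model(x_t, t) interface and the linear noise schedule are assumptions made for illustration.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # amount of noise added at each timestep
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative signal fraction at each t

def diffusion_loss(model, x0):
    b = x0.size(0)
    t = torch.randint(0, T, (b,))                               # random timestep per sample
    a = alpha_bar[t].view(b, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise                # forward (noising) process
    return torch.nn.functional.mse_loss(model(x_t, t), noise)   # predict the added noise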

Introducing FreeControl

Text-to-image diffusion models have been successful in generating high-quality images from text descriptions. However, relying solely on text is often insufficient for conveying a user’s preferences in content creation. Recent advancements like ControlNet allow users to guide the image composition by providing an additional guidance image along with the text. Yet these methods require training a separate module for each type of spatial condition, which is costly and impractical given the variety of control signals and the constantly evolving model architectures.

FC_architecture [3]

To address these issues, FreeControl is introduced as a training-free method for controlling text-to-image diffusion. It utilizes the spatial structure inherently captured by feature maps during the generation process to align with a guidance image while preserving the appearance of the concept described in the input text. FreeControl eliminates the need for additional training on pretrained models because it is able to extract spatial and semantic information directly from different input conditions; it also supports various control conditions, model architectures, and customized checkpoints, which allows for versatile and portable applications. FreeControl outperforms prior training-free methods, excels in challenging control conditions, and achieves image quality competitive with training-based methods, while maintaining stronger image-text alignment and supporting a wider range of control signals.

FreeControl Specific Features

Analysis Stage

During the analysis stage, seed images $I^s$ are generated with $\epsilon_{\theta}$ using a text prompt $\tilde{c}$ that is modified from $c$ to be more generic. This allows $I^s$ to cover diverse object shapes, poses, and appearances as well as varied image composition and style.

Next, DDIM inversion is applied to $I^s$ to obtain time-dependent diffusion features $\mathbf{F}_t^s$ from $\epsilon_{\theta}$. PCA is then applied to obtain the time-dependent semantic bases $\mathbf{B}_t$, which consist of the first $N_b$ principal components:

\[\mathbf{B}_t = \left[ \mathbf{p}_t^{(1)}, \mathbf{p}_t^{(2)}, \ldots, \mathbf{p}_t^{(N_b)} \right] \sim \text{PCA}\left( \{ \mathbf{F}_t^s \} \right)\]

DDIM inversion is also applied to the guidance image $I^g$ to obtain its diffusion features $\mathbf{F}_t^g$, which are then projected onto $\mathbf{B}_t$ to obtain the semantic coordinates $\mathbf{S}_t^g$. For local control of foreground structure, a mask $\mathbf{M}$ is further derived from the cross-attention maps of the concept tokens.
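
Below is a hedged sketch of the basis construction and projection at a single timestep; the feature shapes and the use of torch.pca_lowrank are assumptions made for illustration, and the actual method repeats this for every timestep $t$.

import torch

def semantic_basis(feats_seed, num_components):
    # feats_seed: diffusion features of the seed images at timestep t, shape (N, C, H, W)
    c = feats_seed.size(1)
    x = feats_seed.permute(0, 2, 3, 1).reshape(-1, c)     # each spatial location is a sample
    _, _, basis = torch.pca_lowrank(x, q=num_components)  # (C, N_b); data is centered internally
    return basis                                          # columns are the principal components

def project(feats, basis):
    # feats: diffusion features of one image at timestep t, shape (1, C, H, W)
    n, c, h, w = feats.shape
    x = feats.permute(0, 2, 3, 1).reshape(-1, c)
    s = x @ basis                                         # semantic coordinates S_t
    return s.reshape(n, h, w, -1).permute(0, 3, 1, 2)     # (1, N_b, H, W)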

Synthesis Stage

Structure Guidance:

At each denoising step $t$, the semantic coordinates $\mathbf{S}_t$ are obtained by projecting the diffusion features $\mathbf{F}_t$ onto $\mathbf{B}_t$. The structure-guidance energy function $g_s$ is defined as follows:

\[g_s \left(S_t;S_t^g,\mathbf{M} \right) = \underbrace{ \frac{\sum_{i,j} m_{i,j} \| [s_t]^{i,j} - [s_t^g]^{i,j} \|^2_2 }{ \sum_{i,j} m_{i,j} } }_{\text{forward guidance}} + w\cdot \underbrace{ \frac{ \sum _{i,j}(1-m _{i,j})\|\max([\mathbf{s}_t] _{i,j}- \boldsymbol{\tau}_t, 0)\|^2_2}{\sum _{i,j}(1-m _{ij})} } _\text{backward guidance}\]

The forward guidance term encourages the structure of $I$ to align with that of $I^g$ in the foreground of the input condition, while the backward guidance term, weighted by $w$, helps separate the foreground by suppressing structure that appears in the background of the input condition.
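
A sketch of this energy in code is given below; the tensor shapes, the threshold $\boldsymbol{\tau}_t$, and the default weight $w$ are illustrative assumptions.

import torch

def structure_energy(s_t, s_t_g, mask, tau, w=0.1):
    # s_t, s_t_g: semantic coordinates of the current sample and the guidance image, (N_b, H, W)
    # mask: foreground mask M with values in [0, 1], (H, W); tau: per-component thresholds (N_b, 1, 1)
    fg = mask.sum().clamp(min=1e-6)
    bg = (1 - mask).sum().clamp(min=1e-6)
    # forward guidance: pull foreground structure toward the guidance image
    forward = (mask * (s_t - s_t_g).pow(2).sum(dim=0)).sum() / fg
    # backward guidance: suppress structure exceeding tau in the background
    backward = ((1 - mask) * (s_t - tau).clamp(min=0).pow(2).sum(dim=0)).sum() / bg
    return forward + w * backward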

Appearance Guidance:

Here the paper introduces an energy function $g_a$ that penalizes differences in the appearance representations, facilitating appearance transfer between $I$ and its sibling image $\overline{I}$, which is generated without structure guidance.

\[g_a \left( \{\mathbf{v}_t^{(k)}\} ; \{\overline{\mathbf{v}}_t^{(k)}\} \right) = \frac{ \sum_{k=1}^{N_a} \left\| \mathbf{v}_t^{(k)} - \overline{\mathbf{v}}_t^{(k)} \right\|_2^2 }{N_a}\]
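
As a sketch, assuming the appearance representations are stacked into tensors of shape $(N_a, d)$ (an illustrative layout):

import torch

def appearance_energy(v_t, v_bar_t):
    # v_t, v_bar_t: appearance vectors of the current image and its sibling, shape (N_a, d)
    return (v_t - v_bar_t).pow(2).sum(dim=-1).mean()   # mean over the N_a representations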

Finally, the modified score estimate function, which combines the structure guidance and appearance guidance energy functions to calculate the final objective, guides the T2I generation process. The function is defined as:

\[\hat{\epsilon}_t = \left( 1+s \right) \epsilon_\theta \left(\mathbf{x}_t; t, c \right) - s\, \epsilon_\theta \left(\mathbf{x}_t ; t, \varnothing\right) + \lambda_s g_s + \lambda_a g_a\]

where $s$, $\lambda_s$, and $\lambda_a$ control the relative importance of each term in the objective.
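
Finally, a sketch of the combined update is shown below; the unet call signature is an assumed interface, and guide_s / guide_a stand for the structure and appearance guidance terms from the equations above, assumed here to be tensors broadcastable with the noise prediction.

import torch

def modified_score(unet, x_t, t, cond, s, lam_s, guide_s, lam_a, guide_a):
    eps_cond = unet(x_t, t, cond)      # text-conditioned noise prediction
    eps_uncond = unet(x_t, t, None)    # unconditional (null-prompt) prediction
    # classifier-free guidance combined with the structure and appearance guidance terms
    return (1 + s) * eps_cond - s * eps_uncond + lam_s * guide_s + lam_a * guide_a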

Extending FreeControl for Image-To-Image Translation

FreeControl can be extended for image-to-image translation problems. In this domain, the challenge becomes having FreeControl preserve the background content of the input image. The authors propose two modifications:

  1. Remove the mask $\mathbf{M}$ in structure guidance
  2. Start inference from the inverted latent $\mathbf{x}_T^g$ of the condition image

Comparisons of the results of various image-to-image translation approaches, including the modified version of FreeControl, are shown below:

FC_img2img [3]

Comparing both qualitatively and quantitatively in text-guided image-to-image translation, FreeControl offers versatile manipulation of image composition and style. It strikes a good balance between preserving the structure of the original image (measured by self-similarity distance) as well as aligning images with text (measured by CLIP score) when compared to strong baseline methods such as PnP, P2P, Pix2Pix-zero and SDEdit.

Comparing CycleGAN and FreeControl

Both CycleGAN and FreeControl perform considerably well on the task of image-to-image translation. However, several factors differentiate the two methods. The sections below compare the advantages and disadvantages of using each method for image-to-image translation tasks.

Architecture

  • The CycleGAN architecture consists of two GAN networks, each with its own generator. Through training, the generators learn to transform one input domain into the other, forming a two-way translation channel between the two image distributions; in effect, CycleGAN trains two different generation models at once. However, the generators are conditioned only on the provided datasets, so they cannot accurately translate from or to a novel image distribution. Hence, the generalization ability of the CycleGAN architecture is quite limited.
  • FreeControl builds on text-to-image diffusion models that are pretrained on a wide variety of image distributions. It can translate a given input image in drastically different ways through a single diffusion model, which highlights the generalizability and flexibility of diffusion models compared to classic GAN-based generation methods. FreeControl also accepts semantic input to guide the diffusion process, producing a diverse range of output images across different distributions.

Efficiency

  • Being directly derived from Generative Adversarial Networks, CycleGAN inherits the original GAN architecture and generates an image in the output domain $\mathbb{Y}$ from a sample in the input domain $\mathbb{X}$ in a single forward pass through the generator. The model is therefore very efficient at generating a new sample given a new input image $x$.
  • As a diffusion-based method, FreeControl runs the reverse (denoising) process for $T$ timesteps, where $T$ is a predefined hyperparameter, feeding the intermediate result back into the model at each step. Since more denoising steps typically yield better outputs, generating a single image requires many passes through the model, so image generation with FreeControl is much slower than with CycleGAN.

Performance

  • As mentioned in the “Efficiency” section, CycleGAN translates an input image to the output domain through one generator pass. Although this method is very efficient, the quality of the generated images is heavily dependent on the underlying input data and model complexity, which might vary significantly across different image datasets. In addition, GANs, including CycleGAN, are prone to mode collapse, which might result in limited diversity of generated samples.
  • FreeControl is able to generate high-quality images with realistic details and textures. By gradually denoising the input and transforming it toward the target distribution, the underlying diffusion model captures the data distribution well and is less prone to mode collapse than CycleGAN.

Experiments

We acquired an open-source implementation of the CycleGAN model.

Google colab for our implementations and ideas: CycleGAN_Implementation

Another Google colab for running existing codebase: CycleGAN2

Code

CycleGAN Codebases: Github1, Github2

DCGAN: Website

References

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, vol. 27, 2014.

[2] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of ICCV, pp. 2223–2232, 2017.

[3] S. Mo, F. Mu, K. H. Lin, Y. Liu, B. Guan, Y. Li, and B. Zhou, “FreeControl: Training-free spatial control of any text-to-image diffusion model with any condition,” arXiv preprint arXiv:2312.07536, 2023.