GANs and Stable Diffusion models are widely used for image generation tasks, with many different goals that can be pursued depending on the specific application. Here are some common goals for image generation using GANs and Stable Diffusion:
- To generate new images that are similar to the existing training data.
- To synthesize new images that do not exist in the original dataset.
- To apply the style of one image to another.
- To upscale low-resolution images into high-resolution counterparts.
- To fill in missing or damaged parts of an image.
- To generate images of a certain object category or images with specific attributes.
4.1 System Architecture
Figure 1 illustrates the proposed approach, comprising the dataset, data preprocessing, the generator, the discriminator, adversarial training, and an information feedback loop for optimization.
Generative Adversarial Networks (GANs) represent a powerful paradigm in the field of deep learning, enabling the generation of high-quality synthetic data samples across diverse domains. At the core of GANs lies a competitive training framework involving two neural networks: a generator and a discriminator. Through an iterative process, the generator learns to produce data samples that closely resemble those from a target distribution, while the discriminator learns to distinguish between real and fake samples. This adversarial relationship drives the continual improvement of both networks, leading to the generation of increasingly realistic data. GANs have demonstrated remarkable success in various applications, including image generation, text-to-image synthesis, and style transfer. However, training GANs can be challenging due to issues such as mode collapse and instability. Nonetheless, ongoing research efforts continue to advance GAN architectures and training techniques, paving the way for their widespread deployment in real-world scenarios. The respective training algorithm is described next.
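Before turning to that algorithm, the adversarial dynamic itself can be made concrete with a minimal PyTorch sketch. The tiny MLP generator and discriminator, the 64-dimensional latent space, and the random stand-in for a real data batch are illustrative assumptions only, not the models used in this work:

```python
import torch
import torch.nn as nn

# Minimal sketch of adversarial training; architecture and data are placeholders.
latent_dim, img_dim = 64, 28 * 28
generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                          nn.Linear(256, img_dim), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
                              nn.Linear(256, 1), nn.Sigmoid())
criterion = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

for step in range(1000):
    real = torch.randn(32, img_dim)  # stand-in for a real data batch
    ones, zeros = torch.ones(32, 1), torch.zeros(32, 1)

    # Discriminator step: score real images as 1 and generated images as 0.
    fake = generator(torch.randn(32, latent_dim))
    d_loss = (criterion(discriminator(real), ones)
              + criterion(discriminator(fake.detach()), zeros))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: update the generator so its samples are scored as real.
    g_loss = criterion(discriminator(fake), ones)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

Detaching the generated batch during the discriminator step keeps the generator's computation graph intact, so the subsequent generator update can backpropagate through the same samples.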
The algorithm outlines the training function for a Stable Diffusion model, incorporating components such as a text encoder, a variational autoencoder (VAE), and a UNet architecture. The goal is to train a model capable of generating stable and high-quality samples. Key steps in the training function include setting hyperparameters such as the training batch size, the number of gradient accumulation steps, the learning rate (lr), the maximum number of training steps, the output directory, and gradient checkpointing. If gradient checkpointing is enabled, specific components such as the text encoder and the UNet are configured accordingly. A dataloader is created for the training dataset, and if learning-rate scaling is enabled, the learning rate is adjusted to the effective batch size. The optimizer, typically AdamW, is initialized, and the model components are prepared for training using the accelerator. Training progresses through epochs and batches, with the model updated based on accumulated gradients. Input data is encoded by the VAE, and noise is injected using a noise scheduler. The UNet processes the latent representations, along with conditioning information from the text encoder, to predict the noise for diffusion. Overall, this procedure lays the groundwork for training a Stable Diffusion model using the provided components and hyperparameters, aimed at stable and effective sample generation.
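As a rough illustration of this setup, the sketch below uses the Hugging Face diffusers and accelerate stack. The base checkpoint name, hyperparameter values, and the dummy dataset are placeholders rather than the exact configuration used in this work; the loss and update steps continue in the snippets that follow.

```python
import torch
from accelerate import Accelerator
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

# Hypothetical base checkpoint; all hyperparameter values are placeholders.
pretrained = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(pretrained, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(pretrained, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(pretrained, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(pretrained, subfolder="unet")
noise_scheduler = DDPMScheduler.from_pretrained(pretrained, subfolder="scheduler")

train_batch_size, grad_accum_steps, lr, max_train_steps = 4, 4, 5e-4, 3000
accelerator = Accelerator(gradient_accumulation_steps=grad_accum_steps)

gradient_checkpointing = True
if gradient_checkpointing:  # configure the text encoder and UNet accordingly
    unet.enable_gradient_checkpointing()
    text_encoder.gradient_checkpointing_enable()

# Dummy dataset of random pixels and a fixed prompt, for illustration only.
ids = tokenizer("a photo", padding="max_length",
                max_length=tokenizer.model_max_length,
                return_tensors="pt").input_ids[0]
dummy = [{"pixel_values": torch.randn(3, 512, 512), "input_ids": ids}
         for _ in range(8)]
train_dataloader = torch.utils.data.DataLoader(dummy, batch_size=train_batch_size)

# Scale the learning rate with the effective batch size if enabled.
lr = lr * grad_accum_steps * train_batch_size * accelerator.num_processes
optimizer = torch.optim.AdamW(text_encoder.get_input_embeddings().parameters(), lr=lr)
text_encoder, optimizer, train_dataloader = accelerator.prepare(
    text_encoder, optimizer, train_dataloader)

for batch in train_dataloader:
    with accelerator.accumulate(text_encoder):
        # Encode images to latents, then inject noise at random timesteps.
        latents = vae.encode(batch["pixel_values"]).latent_dist.sample()
        latents = latents * vae.config.scaling_factor
        noise = torch.randn_like(latents)
        timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                                  (latents.shape[0],), device=latents.device)
        noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
        # The UNet predicts the injected noise, conditioned on the text encoding.
        encoder_hidden_states = text_encoder(batch["input_ids"])[0]
        model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
```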
This algorithm pertains to the calculation of the loss and the gradient updates based on the prediction type specified in the noise scheduler configuration. If the prediction type is "epsilon", the target value for the loss calculation is the noise itself. If the prediction type is "v_prediction", the target value is obtained from the noise scheduler's velocity-prediction function. If neither condition is met, a ValueError is raised indicating an unknown prediction type.
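Assuming the variables from the sketch above, this branch mirrors the diffusers convention, where the scheduler's get_velocity method supplies the v-prediction target:

```python
# Select the regression target according to the scheduler's prediction type.
if noise_scheduler.config.prediction_type == "epsilon":
    target = noise  # the model regresses the injected noise directly
elif noise_scheduler.config.prediction_type == "v_prediction":
    target = noise_scheduler.get_velocity(latents, noise, timesteps)
else:
    raise ValueError(
        f"Unknown prediction type {noise_scheduler.config.prediction_type}")
```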
The loss is calculated as the mean squared error (MSE) between the predicted noise and the target value. The resulting loss is then propagated back through the model using the accelerator's backward function. If multiple processes are used, the gradients are obtained from the wrapped module's parameters; otherwise they are taken directly from the model. Additionally, the gradient values corresponding to the placeholder token ID are zeroed out to prevent updates to those tokens.
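Continuing the sketch inside the training loop, these steps might look as follows; placeholder_token_id is an assumed variable holding the token ID in question, and the masking follows the description above:

```python
import torch.nn.functional as F

# MSE loss between predicted and target noise, backpropagated via the accelerator.
loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
accelerator.backward(loss)

# Under multi-process training the model is wrapped, so gradients are
# fetched from `.module`; otherwise directly from the model.
module = text_encoder.module if accelerator.num_processes > 1 else text_encoder
grads = module.get_input_embeddings().weight.grad
# Zero the gradient row for the placeholder token ID so that embedding
# is not updated, as described above.
grads.data[placeholder_token_id, :] = 0.0
```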
After the optimizer step, the gradients are reset, and when gradient synchronization occurs, the progress counter is advanced. The global step counter is incremented, and once the specified number of steps for saving progress is reached, the model's current state is checkpointed. The procedure continues until the maximum number of training steps is reached, at which point the training loop terminates and training is complete. Finally, the trained model is saved to the specified output directory.
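The remaining bookkeeping, still inside the training loop, might look as follows; global_step, save_steps, and output_dir are illustrative names, and accelerator.save_state is one way to snapshot training state:

```python
optimizer.step()        # apply the accumulated gradients
optimizer.zero_grad()   # reset gradients for the next accumulation window

if accelerator.sync_gradients:  # gradients were synchronized this step
    global_step += 1
    if global_step % save_steps == 0:
        accelerator.save_state(output_dir)  # periodic checkpoint
    if global_step >= max_train_steps:
        break  # maximum steps reached; the trained model is saved afterwards
```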
4.2 Experimental setup
This section provides detailed information about the hardware and software prerequisites, the datasets utilized, and the evaluation metrics employed.
1. Hardware and Software Table 3 summarizes the essential requirements for implementing image generation using Generative Adversarial Networks (GANs) and the Stable Diffusion model. It specifies an Intel i5 11th generation processor for computational power, a 1TB hard disk drive (HDD) for storage, and 8GB of RAM for efficient data processing. The operating system needed is Windows 11, with Python chosen as the programming language for its versatility and extensive machine learning libraries. These specifications ensure a capable setup for successful image generation tasks with GANs and Stable Diffusion.
Table 3: Hardware and Software Specifications for the implementation of image generation using GANs and Stable Diffusion

Processor            | Intel i5 11th generation
HDD                  | 1TB
RAM                  | 8GB
Operating System     | Windows 11
Programming Language | Python
2. Dataset Table 4 summarizes the datasets employed in the implementation of image generation using GANs and Stable Diffusion. It includes FFHQ with 3000 images, Flicker with 6000 images, and Celeb with 100 images. These diverse datasets offer a rich variety of images crucial for training and evaluating the effectiveness of the GANs and Stable Diffusion model.
Table 4: Datasets used for the implementation of image generation using GANs and Stable Diffusion

Dataset | Count | Type
--------|-------|-------
FFHQ    | 3000  | Images
Flicker | 6000  | Images
Celeb   | 100   | Images
3. Evaluation Metrics The Inception Score (IS) and the Fréchet Inception Distance (FID) are appropriate evaluation metrics for GANs according to [4].
Inception Score (IS):
This metric assesses the quality and diversity of generated images by comparing their class distribution with that of a pre-trained classifier network. The IS is defined as:

$$\mathrm{IS} = \exp\Big(\mathbb{E}_{x}\big[\mathrm{KL}\big(p(y \mid x)\,\|\,p(y)\big)\big]\Big)$$

In this equation, $p(y \mid x)$ denotes the class distribution of a generated image $x$ as predicted by the classifier network, $p(y)$ is the dataset's marginal class distribution, and KL is the Kullback-Leibler divergence.
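Read off directly from this definition, the IS could be computed as follows, assuming a matrix of classifier softmax outputs is already available (obtaining p(y|x) from a pre-trained Inception network is omitted here):

```python
import numpy as np

def inception_score(p_yx: np.ndarray) -> float:
    """IS from an (N, num_classes) matrix of softmax outputs p(y|x)."""
    p_y = p_yx.mean(axis=0, keepdims=True)  # marginal class distribution p(y)
    # Per-image KL(p(y|x) || p(y)), then exp of the mean over images.
    kl = (p_yx * (np.log(p_yx + 1e-12) - np.log(p_y + 1e-12))).sum(axis=1)
    return float(np.exp(kl.mean()))
```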
Fréchet Inception Distance (FID):
This metric evaluates the disparity between the feature representations of the generated images and those of the real images, computed with a pre-trained classifier network. The FID is defined as:

$$\mathrm{FID} = \lVert \mu_{\mathrm{real}} - \mu_{\mathrm{fake}} \rVert^{2} + \mathrm{Tr}\Big(S_{\mathrm{real}} + S_{\mathrm{fake}} - 2\big(S_{\mathrm{real}} S_{\mathrm{fake}}\big)^{1/2}\Big)$$

In this equation, $\mu_{\mathrm{real}}$ and $\mu_{\mathrm{fake}}$ denote the mean feature representations of the real and fake images, while $S_{\mathrm{real}}$ and $S_{\mathrm{fake}}$ denote the covariance matrices of the real and fake feature representations, respectively.
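Likewise, a direct numerical reading of the FID definition, assuming (N, d) feature matrices extracted by the pre-trained classifier are already available:

```python
import numpy as np
from scipy import linalg

def fid(feat_real: np.ndarray, feat_fake: np.ndarray) -> float:
    """FID from (N, d) feature matrices of real and generated images."""
    mu_r, mu_f = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    s_r = np.cov(feat_real, rowvar=False)
    s_f = np.cov(feat_fake, rowvar=False)
    covmean = linalg.sqrtm(s_r @ s_f)  # matrix square root of S_real * S_fake
    if np.iscomplexobj(covmean):       # drop tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(((mu_r - mu_f) ** 2).sum()
                 + np.trace(s_r + s_f - 2.0 * covmean))
```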