Generative AI Playground: Text-to-Image Stable Diffusion with Stability AI*, Stable Diffusion* XL, and CompVis on the Latest Intel® GPU



This article was originally published on Medium*

Figure 1. Examples generated by Stable Diffusion* XL base 1.0. Image source: Podell et al. (2023)

Oct. 15, 2023—Stable Diffusion models have been making headway in the generative AI space with the ability to generate photo-realistic images from text prompts (Figure 1). These models are not only of interest and accessible to AI developers, but also to creative people from every walk of life including authors, artists, and teachers. Intel released their new powerful Intel® Data Center GPU Max Series, and I ran predictions from these Stable Diffusion models on this GPU.

There are numerous open source Stable Diffusion models out there. The three models I am demonstrating are hosted on Hugging Face*:

All of these models take text as input and generate an image from that text prompt. I generated images like the one shown in Figure 2 with the prompt Horse eating a carrot on the Empire State Building, as well as Figure 3 with the prompt Pecan tree growing on the moon.

Figure 2. An image resulting from a text prompt of Horse eating a carrot on the Empire State Building to a Stable Diffusion model on Intel’s latest GPU.

Figure 3. Prompted the Stable Diffusion model with Pecan tree growing on the moon.

Stability AI Stable Diffusion v2–1 Model

The Stability AI Stable Diffusion v2–1 model was trained on an impressive cluster of 32 x 8 x A100 GPUs (256 GPU cards total). It was fine-tuned from a Stable Diffusion v2 model. The original dataset was a subset of the LAION-5B dataset, created by the DeepFloyd team at Stability AI. The LAION-5B dataset is the largest text-image pair dataset known to date as of the time of writing, with over 5.85 billion text-image pairs. Figure 3 shows a few samples from the dataset.

Figure 4. Samples from the LAION-5B dataset of examples of cats. Image Source

The sample images shown reveal that the original images do come in a variety of pixel sizes; however, training these models in practice usually involves cropping, padding, or resizing of the images to have a consistent pixel size for the model architecture.

The breakdown of the dataset is as follows:

  • Laion2B-en: 2.32 billion text-image pairs in English
  • Laion2B-multi: 2.26 billion text-image pairs from over 100 other languages
  • Laion1B-nolang 1.27 billion text-image pairs with an undetectable language

Tracing the path of training these models is a bit convoluted, but here is the path:

  • Stable Diffusion 2-Base was trained from scratch for 550 K steps on 256 x 256 pixel images, filtered to remove pornographic material, and then trained for 850 K more steps on 512 x 512 pixel images.
  • Stable Diffusion v2 picks up training where Stable Diffusion 2-Base left off and was trained for 150 K more steps on 512 x 512 pixel images, followed by 140 K more steps on 768 x 768 pixel images.
  • Stability AI Stable Diffusion v2–1 was further fine-tuned on 768 x 768 pixel images from Stable Diffusion v2 first with 55 K steps followed by 155 K steps with two different explicit material filters.

More details on the training can be found on the Stability AI Stable Diffusion v2–1 Hugging Face model card.

Stability AI with Stable Diffusion XL Base 1.0 Model

The Stability AI Stable Diffusion XL Base 1.0 model is another text-to-image stable diffusion model, but with some improvements on the back end. It takes advantage of a 3x larger U-NET backbone architecture, as well as a second text encoder. Figure 1 shows a sample of the image outputs from the Stable Diffusion XL 1.0 model.

The order of training for Stable Diffusion XL 1.0 is as follows:

  1. A base model was trained for 600 K steps on 256 x 256 resolution images.
  2. Training was continued for 200 K steps on 512 x 512 pixel images.
  3. Interestingly, the final stage of fine-tuning involves a variety of rectangular sizes, but are composed of approximately a 1024 x 1024 pixel area.

A study of user preference showed that the Stable Diffusion XL 1.0 outperformed the Stable Diffusion v2–1 model (Podell et al., 2023). The fact that the Stable Diffusion XL 1.0 model uses a wide variety of image sizes for training in the final stage, as well as the model’s architectural changes, are key to its adoption and success. Figure 4 illustrates the Stable Diffusion XL 1.0 improvement over previous models.

Figure 5. A comparison of images from the same prompt between previous Stable Diffusion models and the Stable Diffusion XL 1.0 model. Image source: Podell et al., 2023

CompVis with Stable Diffusion v1–4 Model

The CompVis with Stable Diffusion v1–4 model picks up training from the Stable Diffusion v1-2 model and was fine-tuned on 512 x 512 pixel images for 225 K steps on LAION-Aesthetics v2 5+ data. The dataset is a subset of the previously mentioned LAION-5B dataset with high visual quality as a selection criterion (Figure 5).

Figure 6. Samples from the LAION-Aesthetics high visual quality dataset. Image Source

Intel GPU Hardware

The particular GPU that I used for my inference test is the Intel Data Center GPU Max Series 1100, which has 48 GB of memory, 56 Xe-cores, and 300 W of thermal design power. On the command line, I can first verify that I indeed do have the GPUs that I expect by running:

clinfo -l

And I get an output showing that I have access to four Intel GPUs on the current node:

Platform #0: Intel(R) OpenCL Graphics
 +-- Device #0: Intel(R) Data Center GPU Max 1100
 +-- Device #1: Intel(R) Data Center GPU Max 1100
 +-- Device #2: Intel(R) Data Center GPU Max 1100
 `-- Device #3: Intel(R) Data Center GPU Max 1100

Similar to the nvidia-smi function, you can run the xpu-smi in the command line with a few options selected to get the statistics you want on GPU use.

xpu-smi dump -d 0 -m 0,5,18

The result is a printout every 1 s of important GPU use for the device 0:

getpwuid error: Success

Timestamp, DeviceId, GPU Utilization (%), GPU Memory Utilization (%), GPU Memory Used (MiB)
13:34:51.000,    0, 0.02, 0.05, 28.75
13:34:52.000,    0, 0.00, 0.05, 28.75
13:34:53.000,    0, 0.00, 0.05, 28.75
13:34:54.000,    0, 0.00, 0.05, 28.75

Run the Stable Diffusion Example

My colleague, Rahul Nair, wrote a Stable Diffusion text-to-image Jupyter* Notebook that is hosted directly on the Intel® Developer Cloud. Here are the steps you can take to get started:

  1. Go to Intel Developer Cloud.
  2. Register as a standard user.
  3. Once you are signed in, go to the Training and Workshops section.
  4. Select the GenAI Launch Jupyter Notebook option. You can find the text-to-image Stable Diffusion for Jupyter Notebook and run it there.

In the Jupyter Notebook, to speed up inference, Intel® Extension for PyTorch* was used. One of the key functions is _optimize_pipeline where ipex.optimize is called to optimize the DiffusionPipeline object.

def _optimize_pipeline(self, pipeline: DiffusionPipeline) -> DiffusionPipeline:
        Optimizes the model for inference using ipex.

        - pipeline: The model pipeline to be optimized.

        - pipeline: The optimized model pipeline.

        for attr in dir(pipeline):
            if isinstance(getattr(pipeline, attr), nn.Module):
                        getattr(pipeline, attr).eval(),
        return pipeline

The notebook should only take a minute or so to run. Once the model is loaded into memory, each image should only take seconds to generate. Just be sure that the pytorch-gpu environment is selected when you open up the Jupyter kernel so that you don’t have to install any packages. There is a mini user interface (Figure 6) in the notebook using the package ipywidgets. Select the desired model, enter a prompt, and select the number of images to output.

Figure 7. A mini user interface for the prompt-to-image within the Jupyter Notebook.

I am glad to report that the latest Intel Data Center GPU Max Series performed well and will definitely be a contender in the generative AI space. Please let me know if you have any questions or would like help on getting started with trying out stable diffusion.

You can reach me on:

Disclaimer for Using Stable Diffusion Models

The Stable Diffusion models provided here are powerful tools for high-resolution image synthesis, including text-to-image and image-to-image transformations. While they are designed to produce high-quality results, users should be aware of potential limitations:

  • Quality Variation: The quality of generated images may vary based on the complexity of the input text or image, and the alignment with the model’s training data.
  • Licensing and Usage Constraints: Carefully review the licensing information associated with each model to ensure compliance with all terms and conditions.
  • Ethical Considerations: Consider the ethical implications of the generated content, especially in contexts that may involve sensitive or controversial subjects.

For detailed information on each model’s capabilities, limitations, and best practices, refer to the respective model cards.