Stable Diffusion: Exploring trained models and unique tokens for more accurate text-to-image results
Before I get started, I’ve got a preamble to this post. I’m only in the early discovery process of AI-powered text-to-image technology. There’s a good chance you know more than I do about this topic! I’m writing this to document what I’ve learned. If you’re reading this, you might be on the same journey. If so, I hope you find this helpful.
If you peeked ahead in this post, you’d see all kinds of images based on the prompt “a tiger sitting on a log”, and some results are not that impressive. I’ve used this basic prompt intentionally because it helps illustrate two thoughts I wanted to share.
The first takeaway is that the prompt you give these tools is incredibly important. For example, take a peek at a collection of sea otter images created with DALL‑E 2. If you hover over the images on that page, you’ll see the text that generated the images.
The text you provide will significantly impact the image you get out. That makes sense, and this “sea otter” page offers ample proof.
The second takeaway I want to address here is that the model is even more important. This is a concept I didn’t grasp until I started experimenting with the same basic prompt and different data models. Stable Diffusion, which I will get to in a moment, is how I’ve been experimenting with models.
In short, there’s a lot I don’t know, but I’m learning. That’s the end of the preamble. Let’s get started.
Before Stable Diffusion, DALL‑E
In early 2022, text-to-image creation tools seemed to be everywhere online. I couldn’t wait to try them myself. The first one I tried was Craiyon (formerly known as DALL‑E Mini), an open-source model inspired by OpenAI’s original DALL‑E. You can easily try it in your browser at https://www.craiyon.com/. If you type in “a tiger sitting on a log,” you will get something like what you see below.
Technology has raced beyond these early models, but the images are still impressive. Seemingly like magic, images appear that are rough representations of what you typed. Yes, they look a little wonky, but they didn’t exist until you typed in that phrase. You are a powerful wizard. 🧙‍♂️
In 2022, OpenAI released DALL‑E’s successor. DALL‑E 2 was a giant leap forward, with improved image quality and a better text interpreter. The same prompt, “a tiger sitting on a log,” produced greatly improved results.
Stable Diffusion
Last week I had time to experiment with Stable Diffusion. What’s the difference? I’ll quote from the Stable Diffusion article on Wikipedia.
Unlike models like DALL‑E, Stable Diffusion makes its source code available, along with pretrained weights. […] The user owns the rights to their generated output images, and is free to use them commercially.
In other words, you can run Stable Diffusion on your computer and tinker with it. Stable Diffusion claims no copyright on the images you create.
During the DALL‑E 2 private beta, the images were not available for commercial use, but when DALL‑E 2 moved into public beta, OpenAI updated its terms of use, giving users ownership of the images they create. Their press release said, “users get full usage rights to commercialize the images they create with DALL·E, including the right to reprint, sell, and merchandise.”
DALL‑E 2 has an easy-to-use web interface and generates wonderful results, but I don’t see any way to manipulate the model. (Could I be wrong about that? Absolutely!)
Stable Diffusion, being open, has a growing community of hackers pushing it forward. I installed Stable Diffusion on my Mac to explore some of these experiments.
Installing Stable Diffusion
There are a variety of ways of installing Stable Diffusion on your computer. As of November 2022, the moment I’m writing this, my advice is to not do this unless you are comfortable working on your computer’s command line. Installing this is experimental at best, and you could break things. Seriously. Proceed with caution. If you can wait, these tools will become easier to use over time.
If you want to experiment without installing this on your machine, a search engine called Lexica will allow you to search for images others have created, and if you make a free account, you can use Stable Diffusion on their site. You won’t be able to use any custom models, but you won’t have to install anything on your machine.
Ok, now that you’ve decided to install this stuff anyway, good luck! 😜
I’ve installed several versions of Stable Diffusion during my exploration, and InvokeAI is the one that has the easiest installation process. There are instructions for many platforms. Once installed, Invoke can be used from the command line or via the web interface it provides. (Since writing this post, I’ve heard about Charl‑e, for the Mac. I’ve not tried it, though. Thanks for the tip, Rudy.)
What does Stable Diffusion do with the same prompt we tried with DALL‑E? Here are three results.
I think these are impressive results. Notice I’ve included an additional note about the “model” above. The model here, sd-v1-5.ckpt, is the checkpoint file holding what the AI learned from its text and image training data. This model lives at https://huggingface.co/runwayml/stable-diffusion-v1-5, where you can see exactly what it is supposed to do.
Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
It’s trained for photo-realism. That’s probably why the basic prompt I’ve used is returning something that looks like a photo instead of an illustration.
If you want a deep dive into what models do, check out Stable Diffusion with Diffusers, where the authors “explain how the model works and finally dive a bit deeper into how diffusers allow one to customize the image generation pipeline.” I’ll let them explain the technical details because they’re good at it.
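If you’d rather script this than use a web interface, here’s a minimal sketch using Hugging Face’s diffusers library rather than the InvokeAI install described above. It assumes you’ve installed diffusers, transformers, and torch, and that you have access to the runwayml/stable-diffusion-v1-5 weights; the output filename is just an example.

```python
# Minimal text-to-image sketch with the diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,   # drop this on CPU-only machines
)
pipe = pipe.to("cuda")           # "mps" on Apple Silicon, "cpu" otherwise

image = pipe("a tiger sitting on a log").images[0]
image.save("tiger.png")
```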
Adding style detail to the prompt
As you saw earlier with the sea otter example from DALL‑E, you can tailor your prompts with more descriptive words to get better results. If we don’t want photo realism as the style of our images, we need to be more detailed with our request.
Let’s say that instead of photo-realistic tigers, you wanted your tigers to look as if they were drawn by Hollie Mengert. Below is an example of her character design.
Ownership and ethics
Before we go any further, we must acknowledge this artwork is Hollie’s profession. Creating her artwork is how she makes a living. For context, please read Invasive Diffusion: How one unwilling illustrator found herself turned into an AI model. I’m writing another piece regarding the topic, which I plan to post to my blog. I will update this post with a link when it’s ready.
Adding style detail to the prompt (cont.)
Using the same photo-realistic model we’ve been using so far, changing our prompt to “hollie mengert artstyle tiger sitting on a log” creates a very different outcome.
Although I like that first tiger’s hand-sketched style, these images do not look like Hollie’s work. The sd-v1-5.ckpt model doesn’t know what Hollie’s style is beyond the fact that it’s not a photographic style. That leads to an obvious question. Can you teach an old model new tricks? Yes, you can.
DreamBooth and training models
There’s a project called DreamBooth that offers “a new approach for ‘personalization’ of text-to-image diffusion models” to “fine-tune a pre-trained text-to-image model.”
The Illustration-Diffusion model was trained on some of Hollie’s work so it could mimic her style. The documentation for this model says, “the correct token is holliemengert artstyle.”
You might wonder if there is a typo in the token. Isn’t it missing a space? The missing space is intentional. The token is holliemengert and not hollie mengert. This model contains data derived from images tagged with the made-up word holliemengert. You’ll see many of these made-up words if you use custom models. Using a unique word keeps the new data isolated so it doesn’t get polluted with other data. Imagine the images had been tagged with the word drawing, which the model already associates with lots of data. You’d have no way to tell Stable Diffusion that you meant the “hollie mengert” style without it being polluted with many other drawing styles.
In the images above, where we asked for “hollie mengert artstyle”, I’m a little surprised we didn’t end up with some holly (not Hollie) in the images.
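To make the unique-token idea concrete, here’s a purely conceptual sketch, not the actual Illustration-Diffusion training setup; the folder path and prompt are hypothetical.

```python
# Conceptual sketch only -- not the actual Illustration-Diffusion training code.
# DreamBooth-style fine-tuning ties a folder of example images to a prompt
# built around the made-up token, so the new style attaches to a "clean" word
# instead of polluting a common one like "drawing".
instance_images_dir = "training-images/hollie-samples"         # hypothetical path
instance_prompt = "an illustration in holliemengert artstyle"  # the unique token

# Every image in the folder is trained against that prompt. Later, including
# "holliemengert artstyle" in your own prompt recalls the fine-tuned style.
# (The diffusers DreamBooth example script exposes this idea as its
# --instance_prompt argument.)
```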
Let’s run our new prompt, “holliemengert artstyle tiger sitting on a log”, through Stable Diffusion using the Illustration-Diffusion model, aka hollie-mengert.ckpt.
The results are remarkably different. These images are not the quality Hollie herself would produce, but they have a much more distinctive style. These results are heavily influenced by the new model we’re using.
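For reference, here’s roughly how that run looks if you continue the diffusers sketch from earlier. It assumes you’ve downloaded hollie-mengert.ckpt locally; recent versions of diffusers can load a .ckpt file directly, while older versions need it converted to the diffusers folder format first.

```python
# Sketch of swapping in the fine-tuned checkpoint (assumes a recent
# diffusers release; older ones need the .ckpt converted first).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_single_file(
    "hollie-mengert.ckpt",       # the Illustration-Diffusion checkpoint
    torch_dtype=torch.float16,
).to("cuda")                     # or "mps" / "cpu"

prompt = "holliemengert artstyle tiger sitting on a log"
pipe(prompt).images[0].save("tiger-holliemengert.png")
```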
Using the model with the wrong “token”
How important is the token “holliemengert artstyle” in the prompt now that we’re using this newly trained model? What if you wanted the tiger to be influenced by classic Disney movies instead of Hollie’s work?
Here’s what happens if we include the token “classic disney style” in the prompt but still use the Illustration-Diffusion model trained on Hollie Mengert’s art.
The results are not exactly in the style of Hollie Mengert, but they’re not really in the Disney style either. The model picks up some Disney influence, yet the images don’t look quite right.
Classic Animation Diffusion
Let’s take a look at a different model. The classic-anim-diffusion model, a “model trained on screenshots from a popular animation studio” (cough, cough), uses the token “classic disney style” as a style prompt. We’ll switch to this Disney-trained model while keeping the same prompt that generated the previous images.
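If you’re scripting these comparisons, they boil down to a small loop over (checkpoint, token) pairs; the checkpoint filenames below are placeholders for wherever you saved the downloaded models.

```python
# Same base prompt, different (checkpoint, token) pairs.
import torch
from diffusers import StableDiffusionPipeline

base_prompt = "tiger sitting on a log"
experiments = [
    ("hollie-mengert.ckpt", "holliemengert artstyle"),
    ("classic-anim-diffusion.ckpt", "classic disney style"),
]

for ckpt, token in experiments:
    pipe = StableDiffusionPipeline.from_single_file(
        ckpt, torch_dtype=torch.float16
    ).to("cuda")
    image = pipe(f"{token} {base_prompt}").images[0]
    image.save(f"{token.replace(' ', '-')}-tiger.png")
```

Adding another pair, such as the Superhero-Diffusion checkpoint with its token, extends the same comparison.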
Once again, there is an obvious shift in style. The first two images are not keepers, and the third is only acceptable, but they do look like Disney tigers, so the experiment worked.
Superhero Diffusion
To wrap up these experiments, we’ll try another model called Superhero-Diffusion, trained on Pepe Larraz’s work (you can see real examples of his work with a Google Image search), which uses the token “comicmay artstyle”. Again, we’re keeping the same basic prompt and changing only the token and the model.
Whoa. These look cool to me. I notice that the tiger we asked to be sitting is actually standing, but at least it’s perched on a log. I’m sure Pepe Larraz could create a sitting tiger on his first attempt, but that’s a different discussion entirely.
What about Midjourney?
There is another text-to-image project that I’m aware of called Midjourney. I have seen impressive results posted on Twitter with the hashtag #midjourney, but I have not used Midjourney during my exploration. If you’re exploring, you may want to check it out also.
Wrapping up
I would be surprised if you’ve made it this far and haven’t felt the urge to try making some images of your own. Lexica is probably the easiest place to begin. Experiment with your prompt, and you’ll be amazed at your results even with the base model. Also, realize that version 1.5 of the model is current as I write this, but these models are evolving incredibly quickly, and the results will only improve with time.
If you want to experiment with dead artists whose work has seeped deeply into the culture, and probably into the data models, try adding “Leonardo da Vinci art style” or “frida kahlo art style” and see what happens.
Lastly, I would encourage you to think about living artists whose work influences data models. Are you comfortable creating art based on a living artist’s style? Does it matter if it’s only for personal exploration? Does it cross an ethical line for you if you use that artwork in a way that makes you money? Would you pay for access to a model?