NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation
Despite impressive recent advances in text-to-image diffusion models, obtaining high-quality images often requires prompt engineering by humans who have developed expertise in using them. In this work, we present NeuroPrompts, an adaptive framework that automatically enhances a user's prompt to improve the quality of generations produced by text-to-image models. Our framework utilizes constrained text decoding with a pre-trained language model that has been adapted to generate prompts similar to those produced by human prompt engineers. This approach enables higher-quality text-to-image generations and provides user control over stylistic features via constraint set specification. We demonstrate the utility of our framework by creating an interactive application for prompt enhancement and image generation using Stable Diffusion.
Transcript
NeuroPrompts is an interactive application for automatically optimizing user prompts for text-to-image diffusion models. To get started with the application, the user can simply enter an input prompt into the top-left text field. So let's try the example "a spotted frog on a bicycle." This is the only required input, and if the user does not want to specify any custom constraints that must be satisfied, they can begin the optimization process by clicking the orange Submit button in the bottom left of the screen.
After a few seconds, we'll see the optimized prompt calculated by our approach, which consists of the original input prompt plus many additional modifier keywords that help improve the overall aesthetics of the image that's produced. The image generated for this optimized prompt by Stable Diffusion then appears in the center of the screen below. So here you can see we've generated a very nice, aesthetically pleasing image of a spotted frog on a bicycle, as we requested.
So let's go ahead and try another example, and this time we'll specify some constraints that we want to be satisfied. We'll use the prompt "lion wearing a spacesuit." Below the input prompt field, you'll see there are several dropdown menus with pre-specified terms that can be selected as constraints for different artists, styles, formats, perspectives, boosters, and vibes. The user can also manually enter their own term in these fields if they do not see it in the dropdown. So we'll go ahead and select Greg Rutkowski, cyberpunk, and illustration as our constraints, and we'll hit the Submit button.
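To make the constraint-set idea concrete, here is a minimal sketch (all names are hypothetical, not the application's actual code) of how the dropdown selections might be collected into a set of positive constraints before decoding:

```python
# Hypothetical constraint-set structure mirroring the UI's six dropdowns.
constraint_set = {
    "artist": "Greg Rutkowski",   # from the artist dropdown
    "style": "cyberpunk",         # from the style dropdown
    "format": "illustration",     # from the format dropdown
    "perspective": None,          # left unselected
    "booster": None,
    "vibe": None,
}

# Only the categories the user actually filled in become constraints.
positive_constraints = [term for term in constraint_set.values() if term]
```

Unselected categories simply contribute nothing, so the decoder only ever receives the terms the user explicitly chose.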
And what's happening behind the scenes is that our algorithm, which uses neurosymbolic constrained text decoding, is adding constraints that ensure these particular terms will appear in the optimized prompt. The user interface also highlights the optimized prompt to show which of its terms satisfy those constraints. So you can see that the format we selected, illustration, is highlighted in red; the artist, Greg Rutkowski, is highlighted in green; and the style, cyberpunk, is highlighted in blue. We then again see a nice optimized image generated from Stable Diffusion for this prompt displayed below.
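The full method performs constrained decoding with an adapted language model, as described in the paper; as a rough illustration of the lexical-constraint idea only (a toy stand-in, not the actual algorithm), the sketch below greedily appends high-scoring modifier keywords and then guarantees that every positive-constraint phrase ends up in the output:

```python
def optimize_prompt(base_prompt, constraints, scored_modifiers, max_extra=6):
    """Toy stand-in for lexically constrained prompt optimization.

    constraints: phrases that MUST appear in the result.
    scored_modifiers: (modifier, score) pairs, e.g. keywords a model ranks.
    """
    parts = [base_prompt]
    # Greedily take the highest-scoring modifier keywords.
    for mod, _score in sorted(scored_modifiers, key=lambda p: -p[1]):
        if len(parts) - 1 >= max_extra:
            break
        if mod not in parts:
            parts.append(mod)
    # Enforce any positive constraints that are still unsatisfied.
    for phrase in constraints:
        if phrase not in parts:
            parts.append(phrase)
    return ", ".join(parts)
```

A real constrained decoder interleaves constraint satisfaction with token-by-token generation rather than appending phrases at the end, but the invariant is the same: every constraint term is guaranteed to appear in the final prompt.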
The user can also specify negative constraints, which are words or phrases that cannot appear in the optimized prompt. For example, perhaps the user does not want to see the terms "digital painting" and "artstation". They can enter one or more terms in the negative constraints field, separated by commas. So we'll go ahead and enter "digital painting" and "artstation" as our negative constraints.
And when we hit Submit, this re-optimizes the prompt, but this time with added constraints that prevent "digital painting" and "artstation" from appearing. So now we see a slightly different optimized prompt above, and this also produces a correspondingly slightly changed image from Stable Diffusion below.
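Putting the two kinds of constraints together, a simple checker like the hypothetical one below (illustrative only, using case-insensitive substring matching) captures what the interface verifies and highlights: every positive constraint must appear, and no negative constraint may appear.

```python
def check_constraints(optimized_prompt, positive=(), negative=()):
    """Report missing positive constraints and violated negative ones."""
    text = optimized_prompt.lower()
    missing = [p for p in positive if p.lower() not in text]
    violated = [n for n in negative if n.lower() in text]
    return {"missing": missing, "violated": violated,
            "ok": not missing and not violated}
```

For the example above, a prompt containing "cyberpunk" and "illustration" but neither "digital painting" nor "artstation" would pass this check.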
Let's go ahead and try one more example. We'll use the prompt "a robot is playing the piano", and we'll go ahead and select a few different terms here as our constraints. This time we'll also select a booster, "octane render", and hit Submit. One additional feature of the application is the upvote and downvote buttons: after the image and the optimized prompt appear, we can record whether the result is satisfying or not for use in further analysis.
So in this case, you can see that all four of the constraints we specified appear in the optimized prompt and are highlighted in the field at the top. And we also produced a very nice image of this robot playing a piano with Stable Diffusion.
One additional feature the interface provides is the ability to select a side-by-side comparison. The user can click on this tab, which automatically triggers the generation of an image from Stable Diffusion using the original input prompt, i.e., without our prompt optimization approach. This provides a side-by-side comparison of what our prompt engineering approach is doing.
And we can see in this example that the image generated from the original input prompt is much less aesthetically pleasing than the one generated with our optimized prompt. The application also calculates an aesthetic score and a PickScore, which are described further in our paper; they are shown below to illustrate that the image from our optimized prompt has much higher aesthetics and would be more likely to be preferred by human users than the image generated from the original prompt.
Thank you for watching this demo of NeuroPrompts. We hope you enjoy using the application to optimize your future prompts for text-to-image diffusion models.