My last article about generating images using PyTorch feels like it was written three years ago rather than three months ago. At that time the source code of tools like Dall-E and Midjourney was not available to developers like myself. That made image generation feel like an area I could not explore, because reverse engineering those tools was far beyond my ability.
And then CompVis released Stable Diffusion as open source, allowing anyone with a copy of the code to experiment and extend it.
Fast forward 3–4 months since Stable Diffusion came out and an entire ecosystem has formed around it. Anyone with the skills to run a few commands (which is to say, anyone) can set up a personal Stable Diffusion server, most popularly the Gradio-based AUTOMATIC1111 web UI. That UI is so popular that people are writing their own plugins for it. We are using a variant of AUTOMATIC1111 with our own little tweaks and changes on top.
Within Open Studios, having our own self-hosted Stable Diffusion has been a huge boon: we have basically gained an art department that can pump out hundreds of options for banners, animation and whatever else needs visual creative direction. As a writer, I was able to quickly render a classroom scene I was working on, realised it looked boring, and moved the scene to a museum. This kind of back-and-forth iteration usually happens between at least a few people with different fields of expertise, for example an art director and sketch artists.
As an advocate, I wish I could say the images Stable Diffusion generates are good to go. Most of the time they are not, and it takes a bit of trial and error, re-running and manual correction in Photoshop.
Discovering what a concept can be
Imagine a cloud of post-it notes grouped together based on their labels. That is roughly how the memory of a machine learning system works. Within the study of machine learning this cloud of post-its is known as “latent space”.
The same beach scene can look very different depending on whether it is taken from the POV of a person or a bird, or whether the interpretation of light is realistic or stylised. The results can be a little unpredictable, which in this use case is exactly what we want, because as a workflow it’s like asking a team of concept artists to go off and find ideas for you.
Sometimes we don’t even know what something needs to look like. We use prompts to generate images, then go through iterative stages of picking our favourites and making more until we get to places we like.
When exploring an unknown area it’s often helpful to draw a map and with Stable Diffusion it’s actually quite similar!
Charting latent space and prompt research
Using our studio workstations equipped with powerful RTX 30 Series graphics cards, we generally use the maximum batch size and sample count. The number of results is the product of these two settings, so 10 batches of 4 samples give 40 images. If XY grids are turned on, this is further multiplied by the number of X options times the number of Y options.
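As a quick sanity check before kicking off a long run, the arithmetic can be sketched in a few lines of Python. The function and parameter names here are our own shorthand for illustration, not settings names taken from the web UI.

```python
def total_images(batches: int, samples_per_batch: int,
                 x_options: int = 1, y_options: int = 1) -> int:
    """Images produced by one run: batches x samples x grid cells."""
    return batches * samples_per_batch * x_options * y_options


# 10 batches of 4 samples, no grid
print(total_images(10, 4))

# The same run with a hypothetical grid of 3 X options by 5 Y options
print(total_images(10, 4, x_options=3, y_options=5))
```

The second call shows how quickly a grid inflates a run: the same 40 images become 600 once a 3 by 5 grid is switched on.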
When using Stable Diffusion it’s common practice to generate thousands of XY grids as a kind of sample palette and then pick the ones that match the creative vision and style. The AUTOMATIC1111 UI has extensive customisation options for what the X and Y axes of your grid represent, allowing you to vary data models, text prompts and other settings to see how they all come together in different ways.
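Conceptually, an XY grid is just every combination of two lists of options. The sketch below is not how the web UI implements it; it only illustrates the idea, and the axis names and values (CFG scale on X, sampler on Y) are examples we picked.

```python
from itertools import product


def xy_grid(x_name, x_values, y_name, y_values):
    """Return one settings dict per grid cell, row by row (Y outer, X inner)."""
    return [{x_name: x, y_name: y} for y, x in product(y_values, x_values)]


# Illustrative axes: three CFG scales across, two samplers down
cells = xy_grid("cfg_scale", [5, 7.5, 10], "sampler", ["Euler a", "DDIM"])

for cell in cells:
    print(cell)
```

Each dict corresponds to one image in the grid, so a 3 by 2 grid like this one yields six cells to compare side by side.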
Prompts can take a while to figure out, sometimes requiring weeks of trial and error rearranging words and even looking online at places like Lexica when we can’t get the AI to draw what we want. In our world we don’t have things like demons or cyborgs, so it takes a bit of finesse to tell the system what we want to see. For example, a demon might be a “giant humanoid bat with horns and red fur” and a cyborg could be a “man with cybernetic humanoid prosthetics”.
There are external factors that affect the text prompt’s effectiveness as well: settings like the seed value, sampler, CFG scale (how strictly to conform to your text prompt) and even the resolution can wildly change the result.
Adding new detail to low resolution images
The standard output size for Stable Diffusion art is 512x512, but we generally want images designed for full-screen display at a minimum of 1080p and preferably 4K. The technique we use is called img2img, which can take any existing image as an input. For upscaling we either use an upscale-specific script or re-feed the image with variations of the parameters used to create it, to ensure highly detailed results at the higher resolutions.
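To get a feel for how far a 512x512 image has to stretch, here is a small sketch of the scale factor involved. It assumes the image is enlarged until it covers the whole screen and then cropped, which is our simplification for illustration rather than a description of any particular upscaling script.

```python
def upscale_factor(source, target):
    """Smallest scale at which the source covers the target in both dimensions."""
    (src_w, src_h), (tgt_w, tgt_h) = source, target
    return max(tgt_w / src_w, tgt_h / src_h)


SD_OUTPUT = (512, 512)
FULL_HD = (1920, 1080)   # 1080p
ULTRA_HD = (3840, 2160)  # 4K

print(upscale_factor(SD_OUTPUT, FULL_HD))
print(upscale_factor(SD_OUTPUT, ULTRA_HD))
```

Under that assumption a square 512x512 image needs to grow 3.75 times to fill a 1080p screen and 7.5 times to fill a 4K one, which is why the upscaling stage needs to add detail and not just stretch pixels.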
Sometimes upscaling an image can be more involved than creating it in the first place. See below an original image vs its various upscaled forms.
In the grand scheme of things it took almost no effort to create the original image, but the effort to then upscale it was quite massive. All told it was an informative experiment, at least teaching us to clean images up in Photoshop before upscaling!
Workflow and file management
We work with thousands of images in every Stable Diffusion project, often paired with metadata sheets that explain how we built those assets in case we need to rebuild or change them in the future.
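One simple way to keep that kind of record is a JSON file saved next to each image. The sketch below is a minimal illustration and not our actual sheet format: the field names, the example values and the file name are all made up for the example.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(image_path, settings):
    """Save the settings next to the image as <name>.json and return that path."""
    sidecar = Path(image_path).with_suffix(".json")
    sidecar.write_text(json.dumps(settings, indent=2, sort_keys=True), encoding="utf-8")
    return sidecar


def read_metadata(image_path):
    """Load the settings that were saved alongside an image."""
    sidecar = Path(image_path).with_suffix(".json")
    return json.loads(sidecar.read_text(encoding="utf-8"))


# Hypothetical settings for one generated image
example = {
    "prompt": "giant humanoid bat with horns and red fur",
    "seed": 1234,
    "sampler": "Euler a",
    "cfg_scale": 7.5,
    "width": 512,
    "height": 512,
}

# Demonstrate the round trip in a throwaway folder
with tempfile.TemporaryDirectory() as tmp:
    image = Path(tmp) / "demon_concept_0001.png"
    sidecar = write_metadata(image, example)
    restored = read_metadata(image)

print(sidecar.name)
print(restored == example)
```

Because the sidecar shares the image's file name, the two stay together when files are moved, archived or committed to Git.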
Cloud drives and Git definitely help, but anyone familiar with media production will know that there tends to be quite a bit of discarded data in these types of projects. Processes for disposing of and archiving things for reference can be just as important as storing that data in the first place.
Hopefully this gives you a bit of insight into the work going on behind the scenes at Open Studios. I used to envy older tech professionals for being there for the dot-com boom, but I feel like we’ve just been given ours.
References and links:
- Lexica — https://lexica.art/
- Automatic1111 web UI — https://github.com/AUTOMATIC1111/stable-diffusion-webui
- Model database — https://upscale.wiki/wiki/Model_Database#Drawn_Material