Dall-E image generated by Aaron Hertzmann for this blog post. Aaron started with a naive prompt by me which did not generate particularly interesting images, but he then creatively explored a few alternative prompts before reaching the above results.
I will confess that I used to be skeptical of, and uninterested in, research on image captioning and image synthesis from captions. But tools such as Dall-E 2, Imagen and CogView and their creative use by people such as Aaron Hertzmann have convinced me that this is probably the most transformational change in image making and the visual arts since the invention of photography. And yes, I mean photography 200 years ago, not just digital photography.
Yes, these tools are still limited, but so was photography at first: multiple minutes of exposure for a grainy and blurry picture. Yes, these tools can be perceived as a mockery of what it means for a human artist to create a picture and they completely bypass much of what we consider to be the creative image-making process. But the same can be and was said of photography.
Fundamentally, both photography and AI image synthesis completely change how humans envision and create images. With photography, all it takes is to point a device at a real scene and you’ve got an image. You can complain that this is a crude and uncreative way of making an image. You can complain that you lose a ton of flexibility and artistic control. But artists quickly figured out all the creativity that can be put into making photos and how to control the output. Furthermore, making images is not just about artistic endeavors; it also enhances the human experience, whether it can be called art or not, from illustrating important events to anchoring dear memories. Photography never completely replaced “manual” art. But it allowed us to gather images that would never be possible with older manual techniques, from casual pictures of our families to photojournalism.
In the case of AI art with tools such as Dall-E 2, Imagen or CogView, all you need to do to create a picture is to write a text prompt. Similar to the switch from paintbrushes and human artistry to pressing a button on a device, this appears to be a crude interface for visual creativity with unacceptably little control over the final result. But like with photography, people are already figuring out how to creatively craft prompts to get creative results. And like with photography, it will allow scenarios that were impossible before. For example, rather than use stock footage, you can write a prompt to illustrate an article, like I did for this post. That’s certainly not less creative than searching an image bank, and it probably gives you more artistic control. I will confess that I am still giddy about how professional I feel it makes my post look. You can write a story and get it automatically illustrated. You can author new interactive narratives where user decisions affect the storyline and get illustrated automatically. We’ll be able to synthesize images of memories we failed to photograph at the time. That’s just scratching the surface.
These tools raise ethical issues for sure, including their use for offensive, nefarious, or non-consensual imagery, as well as the biases and stereotypes that they may inherit from their training data. But it doesn’t make them less transformational.
This post does not discuss whether these methods are also big advances in AI and what level of intelligence they may exhibit. I am only looking at it from the perspective of a human who wants to make images, and from that perspective, this is a transformational step.
Aaron Hertzmann’s CACM article Computers Do Not Make Art, People Do
This AI can produce stunning images with just a few words of description, but is it art? by Aaron
Dall-e hashtag on twitter
update 7/13: now you even have how-to ebooks.
update 9/23: video by Steve Seitz
update 10/26: http://www.argmin.net/2022/10/26/ai-image-search/
More images to illustrate this post by Aaron Hertzmann using Dall-E 2
One of the first prompts I gave Aaron did not yield useful results, though: “An AI machine is painting an image based on text input.”