San Francisco-based artificial intelligence research firm OpenAI has created DALL-E, a program that generates images from text descriptions. The program utilizes a 12-billion parameter version of the Generative Pre-trained Transformer 3 (GPT-3) autoregressive language model, also developed by OpenAI.
DALL-E creates illustrations, paintings, photos, renders, sketches and more of basically anything you can describe using text. In OpenAI’s paper about DALL-E, numerous examples are showcased. For example, a text prompt of ‘the exact same cat on the top as a sketch on the bottom’ produced a photo of a gray cat and five accompanying sketches of the cat in different styles. Given another prompt, ‘an armchair in the shape of an avocado,’ DALL-E produced five different realistic renders of, well, an armchair shaped like an avocado.
|In this instance, the prompt is ‘the exact same cat on the top as a sketch on the bottom.’ Click to enlarge. Image credit: OpenAI|
OpenAI describes DALL-E as a ‘simple decoder-only transformer that receives both the text and the image as a single stream of 1280 tokens – 256 for the text and 1024 for the image – and models all of them autoregressively. The attention mask at each of its 64 self-attention layers allows each image token to attend to all text tokens. DALL·E uses the standard causal mask for the text tokens, and sparse attention for the image tokens with either a row, column, or convolutional attention pattern, depending on the layer.’ Further details about DALL-E’s architecture and how OpenAI trained the program will be available in an upcoming paper.
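The token layout OpenAI describes can be sketched in a few lines of numpy. This is an illustrative reconstruction, not OpenAI’s code: the 256 + 1024 = 1280 token counts come from the quote above, while the 32-token row width (treating the 1024 image tokens as a 32×32 grid) and the specific ‘row’ sparse pattern are assumptions for the example.

```python
import numpy as np

# Illustrative sketch of DALL-E's single-stream attention layout.
# Token counts are from OpenAI's description: 256 text + 1024 image = 1280.
TEXT_TOKENS = 256
IMAGE_TOKENS = 1024
TOTAL = TEXT_TOKENS + IMAGE_TOKENS  # 1280

def causal_mask(n):
    """Standard causal mask: position i may attend to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

# Start from the full causal mask over the combined stream. Because all
# text tokens precede all image tokens, causality already lets every
# image token attend to every text token, as the quote notes.
mask = causal_mask(TOTAL)

# One example of a sparse pattern among image tokens: a 'row' pattern
# where an image token attends only to earlier tokens in its own
# 32-token row (the 32x32 grid is an assumption for illustration).
ROW = 32
row_index = np.arange(IMAGE_TOKENS) // ROW
same_row = row_index[:, None] == row_index[None, :]
mask[TEXT_TOKENS:, TEXT_TOKENS:] &= same_row
```

In a real model a mask like this would be applied per layer, with row, column, or convolutional patterns alternating across the 64 self-attention layers the quote mentions.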
|An armchair in the shape of an avocado might not be the most practical or comfortable furniture, but DALL-E clearly understood the prompt. Click to enlarge. Image credit: OpenAI|
What DALL-E does is not in itself new, but OpenAI’s new program operates with fairly good success and can handle considerable variation in its input. The foundation for its abilities is the GPT-3 language model. When producing hundreds or thousands of results for a given prompt, many results will be fine, but some will be weird, for lack of a better term, and many don’t hold up to scrutiny when viewed up close. For example, a generated image of an animal will not have the same quality or sharpness as a genuine image captured by a modern digital camera.
Further, looking at an array of images generated by the ‘a cube made of porcupine. a cube with the texture of a porcupine’ prompt can be interesting, but DALL-E can’t account for how disturbing some such images can be.
|The text prompt ‘a cube made of porcupine. a cube with the texture of a porcupine’ produces these disturbing, albeit rather impressive, results. What stands out here is the variety in results. Click to enlarge. Image credit: OpenAI|
OpenAI writes further, ‘We’ve found that [DALL-E] has a diverse set of capabilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in plausible ways, rendering text, and applying transformations to existing images.’ Consider the example below, showing the results for ‘a penguin made of comfort.’ Comfort is a somewhat abstract concept and it’s fascinating to see how DALL-E interpreted the prompt.
|Click to enlarge. Image credit: OpenAI|
Named as a portmanteau of Salvador Dalí and the Pixar character WALL-E, DALL-E is quite adept at handling variations in natural language. However, it does have some limitations. OpenAI notes that while DALL-E can control multiple objects, their attributes, and the spatial relationships between them, it can confuse the associations between objects and their attributes, and is ‘brittle with respect to rephrasing of the caption’ in challenging scenarios.
OpenAI writes, ‘[DALL-E] can independently control the attributes of a small number of objects, and to a limited extent, how many there are, and how they are arranged with respect to one another. It can also control the location and angle from which a scene is rendered, and can generate known objects in compliance with precise specifications of angle and lighting conditions.’ OpenAI continues, ‘Unlike a 3D rendering engine, whose inputs must be specified unambiguously and in complete detail, DALL·E is often able to “fill in the blanks” when the caption implies that the image must contain a certain detail that is not explicitly stated.’
While the results generated from different text inputs run the gamut in terms of quality and plausibility, DALL-E’s ability to combine real-world objects into wholly fictional results, such as a snail in the shape of a harp, is impressive. The program can also infer and create details that weren’t mentioned in the text prompt. In the case of the prompt ‘a painting of a capybara sitting on a field at sunrise,’ DALL-E can draw necessary details, such as shadows, that weren’t mentioned in the prompt. OpenAI found that ‘DALL-E is able to render the same scene in a variety of styles, and can adapt the lighting, shadows, and environment based on the time of day or season’ even when these details are not specified.
In OpenAI’s paper, you can not only learn about DALL-E and see many results, but also click on any underlined text or phrase in a given prompt to see different results, which OpenAI states have not been cherry-picked by a human, aside from thumbnail selections. The samples displayed are the top 32 results after being ranked by the company’s new CLIP neural network. You can read about CLIP too, although the paper is a bit more technical than the one about DALL-E.
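The reranking step can be sketched at a high level: CLIP scores how well each candidate image matches the text prompt, and the 32 best-scoring samples are kept. The sketch below is a hypothetical illustration, not OpenAI’s implementation: the embedding dimension, candidate count, and the `rerank` helper are all made up for the example, with cosine similarity standing in for CLIP’s image-text score.

```python
import numpy as np

def rerank(image_embeddings, text_embedding, top_k=32):
    """Rank candidate images by cosine similarity to the text embedding
    and return the indices of the top_k best matches, best first."""
    imgs = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    txt = text_embedding / np.linalg.norm(text_embedding)
    scores = imgs @ txt          # cosine similarity per candidate
    order = np.argsort(-scores)  # sort descending by score
    return order[:top_k]

# e.g. 512 generated candidates with 64-dim embeddings (both made up)
rng = np.random.default_rng(0)
candidates = rng.normal(size=(512, 64))
prompt_emb = rng.normal(size=64)
best = rerank(candidates, prompt_emb)  # indices of the 32 best samples
```

In practice the embeddings would come from CLIP’s image and text encoders rather than random vectors, but the filtering logic — score every sample against the prompt, keep the top 32 — is the same idea.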
OpenAI is making great progress, and DALL-E’s results, ranging from odd to impressive, represent an incredible improvement over what was possible only a few years ago. As recently as last September, the Allen Institute for AI published results, using OpenAI’s GPT-3, that look quite different from what DALL-E produces. AI’s ability to generate realistic depictions of natural-language text inputs is very much a work in progress, but progress is nonetheless being made.