DALL·E is an artificial intelligence program that visualises concepts by generating images of both realistic and unrealistic objects from short, phrase-like natural language prompts or text descriptions. The program, announced by OpenAI in early 2021, takes its name from a portmanteau¹ of WALL-E, Pixar’s film about an eco-friendly robot, and the surrealist painter Salvador Dalí, whose work merges the imagination with the rational world.
DALL·E uses a 12-billion-parameter version of the GPT-3 Transformer model² to interpret natural language input, and it works alongside a second model called CLIP (Contrastive Language-Image Pre-training)³. The transformer was trained on text–image pairs collected from the Internet, which teaches it to turn a text input into pixel output. For each text description, DALL·E generates multiple candidate images, and CLIP, an image recognition model, ranks them so that the best matches can be selected. More broadly, DALL·E shows that AI models can acquire a better understanding of how humans interpret the world through reference and relation, and can use that understanding to create new ideas and concepts, a step towards more general artificial intelligence.
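The ranking step can be sketched as follows. This is a minimal illustration, not OpenAI’s implementation: it assumes that the caption and each candidate image have already been turned into embedding vectors (CLIP does this with its text and image encoders) and simply orders the candidates by cosine similarity to the caption. The function names `cosine_similarity` and `rerank` and the toy three-dimensional vectors are hypothetical; real CLIP embeddings come from trained neural encoders and have hundreds of dimensions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rerank(text_embedding: Sequence[float],
           image_embeddings: List[Sequence[float]],
           top_k: int) -> List[int]:
    """Return the indices of the top_k candidate images, best match first.

    Hypothetical helper: scores each candidate against the caption embedding
    and keeps the highest-scoring ones, as a CLIP-style reranker would.
    """
    scores = [cosine_similarity(text_embedding, img) for img in image_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


# Toy embeddings standing in for real CLIP outputs (hypothetical values).
caption = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],  # unrelated image
    [0.9, 0.1, 0.0],  # close match
    [0.5, 0.5, 0.0],  # partial match
]
print(rerank(caption, candidates, top_k=2))  # → [1, 2]
```

With these toy vectors the second candidate points almost the same way as the caption and is ranked first, while the unrelated one is dropped; keeping only the top `top_k` results mirrors how similarity scores can be used to pick the best of many generated images.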
DALL·E⁴ has shown that it can create anthropomorphised animals and objects, combine unrelated concepts into plausible images and apply transformations to existing images, among many other capabilities. Two example prompts illustrate these abilities:
Applying a transformation to an existing image: “The exact same cat on the top as a sketch on the bottom”
Combining unrelated concepts: “An armchair in the shape of an avocado”
DALL·E’s strengths lie in its ability to understand natural language, to grasp the role of relation and reference in human understanding, and to generate images in styles ranging from photographs to paintings and emoji. It also displayed capabilities that surprised even its creators at OpenAI. One of the most notable is that DALL·E learned visual reasoning skills⁶ that are said to be sufficient to solve Raven’s Matrices⁷. Its intelligence is also reflected in how it manipulates and places objects within the generated images. Another striking feature is a form of creativity, closely resembling human imagination, that allows it to blend unrelated concepts coherently. DALL·E can also infer appropriate contextual details that a prompt leaves unstated, and its understanding of visual and design trends lets it create images appropriate to specific periods of time. Together, these capabilities are a step towards general artificial intelligence.
OpenAI has said that it did not have a specific application in mind when creating DALL·E, but the program could have many. VentureBeat has called DALL·E “a visual idea generator”, which may be the most apt description of it. OpenAI has acknowledged that the model carries a potential for bias and raises ethical challenges and, equally importantly, that it may have a widespread impact on society, including on work processes and professions. It may therefore be a while before DALL·E is used in practice, but the range of possible applications is broad, including but not limited to the following:
Fashion design: “A female mannequin dressed in a black leather jacket and gold pleated skirt”
Interior design: “A loft bedroom with a white bed next to a nightstand. There is a fish tank standing beside the bed”
As the examples above show, DALL·E has great potential for a wide range of applications. It has demonstrated many capabilities that seem to edge close to general artificial intelligence and to human imagination and creativity. In summary, DALL·E is a leap towards the future of general artificial intelligence; once OpenAI has addressed its potential for bias and the ethical challenges it presents, further development could make it very useful.
_____________________________________________________________________________________________
¹ A portmanteau is a word that blends the sounds and meanings of two other words.
² DALL·E uses a scaled-down version of GPT-3; the original GPT-3 has 175 billion parameters.
³ CLIP is an AI model that, in this setting, ranks DALL·E’s image outputs so that the highest-quality images for a prompt are presented. CLIP was trained on 400 million image–text pairs.
⁴ With DALL·E, OpenAI has adapted GPT-3 to work with visual concepts expressed through language.
⁵ As the number of objects in a prompt increases, DALL·E begins to get confused.
⁶ DALL·E performs these tasks zero-shot: it was not trained specifically on them and is given no examples of the task, so it is tackling each one for the first time.
⁷ Raven’s Matrices is an intelligence test commonly used to measure abstract reasoning.
__________________________________________________________________________________________
References:
1. https://openai.com/blog/dall-e/