CLIPDraw: Exploring Text-to-Drawing Synthesis

This work presents CLIPDraw, an algorithm that synthesizes novel drawings based on natural language input. CLIPDraw does not require any training; rather a pre-trained CLIP language-image encoder is used as a metric for maximizing similarity between the given description and a generated drawing.
By Kevin Frans
July 6, 2021
Figure: three animated GIFs showing CLIPDraw synthesizing stroke-based drawings from natural language input, progressing from a few strokes to a more complete image.

AI-assisted art has always intrigued me. Visual art is a very human thing -- I can't imagine a computer rediscovering our cultural concepts without some kind of experience living in our world. So when AI methods are able to learn from us, and produce artwork that we as humans might make ourselves, something cool is definitely happening.

This post presents CLIPDraw, a text-to-drawing synthesis method that I’ve been playing with nonstop for the past few weeks. At its core, CLIPDraw is quite simple, yet it is able to produce drawings that display a whole range of interesting behaviors and connections. So for this article, I want to focus on showing off behaviors of CLIPDraw that I personally found intriguing and wanted to learn more about. For more detailed analysis and technical detail, check out the CLIPDraw paper, or play around with the Colab notebook yourself!

Image Synthesis -> Drawing Synthesis

The field of text-to-image synthesis has a broad history, and recent methods have shown stunningly realistic image generation through GAN-like methods. Realism, however, is a double-edged sword -- there's a lot of overhead in generating photorealistic renderings, when often all we want are simple drawings. With CLIPDraw, I took inspiration from the web game Skribbl, where players only have a few seconds to draw out a word for other players to guess. What if an AI could play? What would it draw? Are simple shapes enough to represent increasingly complex concepts?

How does CLIPDraw work?

CLIPDraw's knowledge is powered through a pre-trained CLIP model, a recent release which I definitely recommend reading about. In short, a CLIP model consists of an image encoder and a text encoder, which both map onto the same representational space. This setup allows us to measure the similarities between images and text. And if we can measure similarities, we can also try to discover images that maximize that similarity, therefore matching a given textual prompt.
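To make the shared representational space concrete, here is a toy sketch of the similarity measure involved. The low-dimensional "embeddings" below are made up for illustration; real CLIP embeddings are 512-dimensional outputs of the text and image encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": in real CLIP, both encoders map into the same
# space, so text and image vectors can be compared directly.
text_embedding = [0.2, 0.9, 0.1]
image_embedding = [0.3, 0.8, 0.0]
print(cosine_similarity(text_embedding, image_embedding))
```

A high score means the image and the description lie close together in CLIP's space; CLIPDraw's job is then to push a drawing's embedding toward the prompt's embedding.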

The basic CLIPDraw loop follows this principle of synthesis-through-optimization. First, start with a human-given description prompt and a random set of Bezier curves. Then, gradually adjust those curves through gradient descent so that the drawing best matches the given prompt. There are a few tricks that also help, as detailed in the paper, but this loop is basically all there is to it.
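The loop above can be sketched in miniature: below, a plain parameter vector stands in for the Bezier curve parameters, and a fixed target vector stands in for the prompt's CLIP embedding. The real method backpropagates through a differentiable rasterizer and the CLIP image encoder; the finite-difference gradient here is purely for a self-contained illustration.

```python
import math, random

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def clipdraw_toy(target, steps=200, lr=0.1, eps=1e-4):
    """Toy synthesis-through-optimization: start from random parameters,
    then gradient-ascend on similarity to the target embedding."""
    random.seed(0)
    params = [random.uniform(-1, 1) for _ in target]
    for _ in range(steps):
        # finite-difference gradient of the similarity objective
        grad = []
        for i in range(len(params)):
            bumped = params[:]
            bumped[i] += eps
            grad.append((cos_sim(bumped, target) - cos_sim(params, target)) / eps)
        params = [p + lr * g for p, g in zip(params, grad)]
    return params

target = [0.5, -0.3, 0.8]
result = clipdraw_toy(target)
print(round(cos_sim(result, target), 3))
```

The random start converges toward the target direction, just as CLIPDraw's random curves gradually converge toward a drawing that CLIP scores as matching the prompt.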

Now, let's examine what CLIPDraw comes up with in practice.

What visual techniques does CLIPDraw use?

A recurring theme is that CLIPDraw tends to interpret the description prompts in multiple, unexpected ways. A great example of this is "A painting of a starry night sky", which shows a painterly-styled sky with a moon and stars, an actual painting canvas and painter in the foreground, and black and blue swirls resembling Van Gogh's "The Starry Night".

CLIPDraw also likes to use abstract symbols, the most prominent being when it writes out the literal word inside the image itself. Sometimes, CLIPDraw will use tangentially related concepts, like the Google Maps screenshot when asked for "自転車 (Bicycle in Japanese)". A fun result is the prompt for "Fast Food", which shows hamburgers and a McDonald's logo, but also has a bunch of joggers racing in the background.

How does CLIPDraw react to different styles?

An experiment I really enjoyed was to synthesize images of cats, but in different artistic styles. With CLIPDraw, this was as easy as changing the descriptor adjectives of the textual prompt. Surprisingly, CLIPDraw is quite robust at handling different styles, contrary to the initial intent to synthesize scribble-like drawings.

One interesting feature is that CLIPDraw adjusts not only the textures of drawings, à la Style Transfer methods, but also the structure of the underlying content. In the cat experiments, asking for "a drawing" produces a simplified cartoonish cat, while prompts like "a 3D wireframe" produce a cat in perspective, with depth and shadows.

Does stroke count affect the drawings that CLIPDraw produces?

A key aspect of CLIPDraw is that drawings are represented as a set of Bezier curves rather than a matrix of pixels. This feature gives us a nice parameter to tweak: the number of curves a drawing is composed of.
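For illustration, initializing such a set of random curves might look like the sketch below. The field names, point counts, and value ranges here are illustrative stand-ins, not the exact ones from the paper (which uses a differentiable vector-graphics rasterizer to render these paths).

```python
import random

def init_random_curves(num_paths, canvas_size=224, segments=2):
    """Sketch of CLIPDraw-style initialization: a drawing is a set of
    cubic Bezier paths with random control points, colors, and widths."""
    random.seed(0)
    paths = []
    for _ in range(num_paths):
        # one starting point, plus 3 control points per cubic segment
        points = [(random.uniform(0, canvas_size), random.uniform(0, canvas_size))
                  for _ in range(1 + 3 * segments)]
        color = tuple(random.random() for _ in range(4))  # RGBA
        width = random.uniform(1.0, 3.0)
        paths.append({"points": points, "color": color, "width": width})
    return paths

curves = init_random_curves(num_paths=16)
```

Because the whole drawing is just this small list of parameters, `num_paths` directly controls the complexity budget explored in the experiments below.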

Empirically, drawings with low stroke counts tend to result in a more cartoonish or abstract representation of the prompt, such as the 16-stroke version of "The Eiffel Tower" being basically made of a few straight lines. As stroke count increases, CLIPDraw begins to target more structured shapes: our Eiffel Tower gains 3D perspective, lights, and finally a background.

What happens if abstract words are given as a prompt?

A fun way to push the limits is to give CLIPDraw abstract descriptions and see what it does with them. As a human artist, even I would have to stop and think about how to convey these concepts through visuals, so it's interesting to see how the AI approaches things.

In most cases, CLIPDraw likes to use symbols to showcase concepts that are culturally related to the given phrase, like the fireworks and smiles in "Happiness" or the Japanese and English-like letters in "Translation".

My favorite here is the drawing for "Self", which features a body holding up multiple heads. The drawing can almost be seen as a metaphor: perhaps the idea that a person's self contains multiple outward personalities, or that a self is actually a sum of many cognitive processes. This piece is definitely the most "art-like" example I came across; there's a lot of room for individual interpretation, and it almost feels like CLIP knows something that I don't.

Can drawings be fine-tuned via negative prompts?

A final experiment was to see if CLIPDraw behavior could be finely adjusted by introducing additional optimization objectives. In the normal process, drawings are optimized to best match a textual prompt. What if we also tried to optimize dissimilarity with a set of negative prompts?
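As a sketch, the combined objective might look like the following. The function names and the `neg_weight` parameter are my own illustrative choices, not the paper's exact formulation; the embeddings stand in for CLIP encoder outputs.

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def clipdraw_objective(image_emb, pos_emb, neg_embs, neg_weight=0.3):
    """Hypothetical combined objective: reward similarity to the positive
    prompt, penalize similarity to each negative prompt."""
    score = cos_sim(image_emb, pos_emb)
    for neg in neg_embs:
        score -= neg_weight * cos_sim(image_emb, neg)
    return score
```

Gradient ascent on this score then pushes the drawing toward the positive prompt while steering it away from the negative ones.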

In a few situations, this works! By penalizing similarity to the prompt "Words and text", the "Hashtag" drawing features less prominent words and instead draws a set of selfie-like faces. Negative prompts can also adjust the color of images, or force drawings to contain a single subject rather than many.

In practice, the examples above were pretty hard to achieve, and most of the time negative prompts don't change much about the final drawing at all. I think there's still a lot of room for experimentation on how to improve this technique. One thing I'd love to see is a panacea prompt, like "A messy drawing", that consistently improves drawing quality when used as a negative objective, regardless of context.

CLIPDraw: Parting Thoughts

The CLIPDraw algorithm isn't particularly novel; people have been doing synthesis-through-optimization for a while through activation-maximization methods, and recently through CLIP-matching objectives. However, I do believe biasing towards drawings rather than photorealism gives images more freedom of expression, and optimizing Bezier curves is a nice way to do this efficiently. I also personally love this art style and I think the drawings are quite similar to what an artist would produce ;).

That being said, I do think the behaviors showcased here should be pretty generalizable to any CLIP-based optimization method. Already a few extensions come to mind – Can we synthesize videos? 3D models? Can an AI play Skribbl or Broken Picture Phone with itself? I'm sure you as a reader have a bunch of ideas I haven't even considered. So please, feel free to take this method and go wherever you feel is exciting. And then tell me about it!

You can experiment with the Colab notebook here, and play with CLIPDraw in the browser. Results are generally pretty quick (within a minute) unless you crank up stroke count and iterations. You can also check out the full paper, which contains a bit of a deeper analysis and details on the technical implementation.

Thanks for reading!

About the Author

Kevin Frans is a third-year undergraduate at MIT and researcher at Cross Labs investigating open-endedness, multi-agent worlds, and AI-powered generative design.