In this blog post I investigate how natural language representations are used in computer vision. I start with the predominant language architecture, the Transformer, and then connect it to tasks in image captioning and art generation.

Relevant Papers

[1] Attention is All You Need


This seminal paper introduces the Transformer architecture, which uses attention mechanisms to capture long-range dependencies between words in a sequence.
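To make the idea of attention concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It is an illustration under simplifying assumptions, not the paper's full model: there are no learned projections, no multiple heads, and no masking, and the three toy token vectors are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for single-head attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy self-attention: three tokens with made-up 2-dimensional embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Every token attends to every other token in a single step, regardless of how far apart they sit in the sequence, which is why attention handles long-range dependencies well.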

[2] Learning Transferable Visual Models From Natural Language Supervision


OpenAI’s paper on CLIP (Contrastive Language-Image Pre-training) demonstrates the zero-shot potential of pretraining on image-caption pairs.
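As a rough sketch of what zero-shot classification looks like once images and captions live in the same embedding space: embed one text prompt per candidate label, then pick the label whose embedding is most similar to the image embedding. The vectors below are made-up stand-ins for encoder outputs, and `zero_shot_classify` and its `temperature` value are hypothetical names and settings chosen for illustration, not CLIP's actual API.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_classify(image_embedding, text_embeddings, temperature=0.01):
    """Return (best_label, probabilities) from cosine similarity to each prompt."""
    img = normalize(image_embedding)
    logits = {}
    for label, emb in text_embeddings.items():
        txt = normalize(emb)
        # Cosine similarity, sharpened by an assumed temperature.
        logits[label] = sum(a * b for a, b in zip(img, txt)) / temperature
    # Softmax over the candidate labels (shifted by the max for stability).
    peak = max(logits.values())
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return max(probs, key=probs.get), probs

# Made-up embeddings standing in for the outputs of a text and an image encoder.
text_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.2],
    "a photo of a cat": [0.1, 0.9, 0.2],
    "a photo of a car": [0.1, 0.2, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

best_label, probs = zero_shot_classify(image_embedding, text_embeddings)
```

No classifier is trained for these three labels. Changing the task only means changing the list of prompts, which is what makes the approach zero-shot.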

[3] Hierarchical Text-Conditional Image Generation with CLIP Latents


This paper uses CLIP’s joint text-image embedding space as an encoder, combined with a diffusion-based decoder, to generate images conditioned on text prompts.