In this blog post I investigate natural language representations as they are used in computer vision. A foray into the predominant language architecture, the Transformer, is then linked to tasks in image captioning and art generation.

Relevant Papers

[1] Attention is All You Need

  • https://arxiv.org/abs/1706.03762
  • https://huggingface.co/docs/transformers/index

The seminal paper on the Transformer architecture introduces attention mechanisms for capturing long-range dependencies between tokens in a sequence.
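
To make the mechanism concrete, below is a minimal PyTorch sketch of scaled dot-product attention, the core operation from [1]. The tensor shapes and the toy input are illustrative only.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.

    q, k, v: (batch, seq_len, d_k) tensors. Each output position is a
    weighted average over all value vectors, so distant tokens can
    interact in a single step.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)            # attention weights
    return weights @ v

# Toy self-attention: one sequence of 5 tokens with 8-dim embeddings.
x = torch.randn(1, 5, 8)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # torch.Size([1, 5, 8])
```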

[2] Learning Transferable Visual Models From Natural Language Supervision

  • https://arxiv.org/pdf/2103.00020v1.pdf
  • https://github.com/openai/CLIP

OpenAI’s paper on CLIP (Contrastive Language-Image Pre-training) demonstrates the zero-shot potential of pretraining on image-caption pairs.
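
As a concrete example of this zero-shot capability, here is a small classification sketch following the usage shown in the linked openai/CLIP repository; the image path and candidate captions are placeholders.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "cat.jpg" is a placeholder image; the captions are arbitrary examples.
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
captions = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    # CLIP scores the image against every caption in a shared
    # embedding space; no task-specific training is involved.
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print(dict(zip(captions, probs[0].tolist())))
```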

[3] Hierarchical Text-Conditional Image Generation with CLIP Latents

  • https://arxiv.org/abs/2204.06125
  • https://huggingface.co/spaces/multimodalart/latentdiffusion

This paper (unCLIP, the model behind DALL·E 2) leverages CLIP’s shared text-image embedding space as its encoder: a prior maps a text prompt to a CLIP image embedding, and a diffusion-based decoder then generates an image conditioned on that embedding.
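
unCLIP itself was not publicly released, but the diffusers library ships Karlo, an open reimplementation of the same two-stage prior-plus-decoder design. A minimal sketch, assuming diffusers is installed and a CUDA GPU is available:

```python
import torch
from diffusers import UnCLIPPipeline  # pip install diffusers transformers accelerate

# Karlo follows the unCLIP recipe: a prior maps the text prompt to a
# CLIP image embedding, and a diffusion decoder renders the image.
pipe = UnCLIPPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

image = pipe("a watercolor painting of a lighthouse at dusk").images[0]
image.save("lighthouse.png")
```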