ViPE: Visualise Pretty-much Everything

ViPE is the first language model designed to assist in text-to-image generation. It translates any arbitrary piece of text into a visualizable prompt, helping any text-to-image model render figurative or non-lexical language. Below is a comparison between SDXL with and without ViPE, given "infinity" as the prompt.

Below is another example using DALL·E 2 with and without ViPE on a highly abstract prompt. The image on the left shows the prompt and the generated image. The images on the right show ViPE's interpretations of how the initial prompt could be visualized, along with the generated images.
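The pattern described above, elaborating the text first and then handing the result to the image model, can be sketched as a small pipeline. The `elaborate` function below is a hypothetical stand-in for a call to a trained ViPE checkpoint (the stub dictionary is illustrative only); the point is the chaining, where the elaboration replaces the original text as the image prompt.

```python
# Sketch of chaining ViPE-style prompt elaboration in front of a
# text-to-image model. `elaborate` is a stand-in for a real ViPE call.
def elaborate(text: str) -> str:
    # Placeholder: a trained ViPE model would generate a concrete,
    # visual description of the (possibly abstract) input text.
    stub = {
        "infinity": "an endless mirrored corridor of glowing rings "
                    "receding into a starfield",
    }
    return stub.get(text, text)

def build_t2i_prompt(text: str) -> str:
    """Turn arbitrary text into a prompt a text-to-image model can draw."""
    # The elaboration replaces, rather than decorates, the original text.
    return elaborate(text)

prompt = build_t2i_prompt("infinity")
# `prompt` would then be passed to SDXL / DALL·E 2 instead of "infinity".
print(prompt)
```

Already-visual text passes through unchanged in this sketch; a real ViPE model instead rewrites every input into a more depictable form.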

How ViPE Is Built

Building ViPE involves three main steps:

  • Data Collection: scraping all English lyrics from the Genius platform, followed by preprocessing and noise removal
  • Synthetic Label Generation: applying GPT-3.5 Turbo to generate visual translations (elaborations) for the lyrics, guided by human instructions and the context of each song, and compiling the LyricCanvas dataset comprising 10M samples
  • Training: obtaining a robust, lightweight model by training GPT-2 on LyricCanvas with a causal language modeling objective conditioned on the lyrics
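The conditioning in the training step can be illustrated with a toy loss computation: standard next-token causal language modeling, but with the loss masked over the lyric tokens so that only the elaboration tokens contribute, i.e. the model learns p(elaboration | lyrics). The token ids and the random logits below are stand-ins for a tokenizer and GPT-2 forward pass; the masking convention (`-100` as the ignored label) follows PyTorch's `cross_entropy`.

```python
import torch
import torch.nn.functional as F

# Toy illustration of a causal LM objective conditioned on the lyrics:
# the lyric line and its visual elaboration are concatenated into one
# sequence, and the loss is computed only on the elaboration positions.
lyric_ids = torch.tensor([5, 8, 3])           # tokenized lyric line (condition)
elab_ids = torch.tensor([7, 2, 9, 1])         # tokenized elaboration (target)
input_ids = torch.cat([lyric_ids, elab_ids])  # one training sequence

vocab_size = 16
logits = torch.randn(len(input_ids), vocab_size)  # stand-in for GPT-2 outputs

# Standard next-token shift: position t predicts token t+1.
shift_logits = logits[:-1]
shift_labels = input_ids[1:].clone()

# Conditioning: ignore positions whose *target* is still a lyric token,
# so gradients flow only through the elaboration tokens.
shift_labels[: len(lyric_ids) - 1] = -100  # -100 is ignored by cross_entropy

loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
print(float(loss))
```

In real training the logits would come from GPT-2's forward pass over batches of LyricCanvas sequences; everything else, the concatenation, the shift, and the label masking, stays the same.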

Downstream Applications

ViPE's robust generalization capabilities offer a wide range of applications, including:

ViPE Paper

ViPE won the Outstanding Paper Award at the main conference of EMNLP 2023.


List of outstanding papers: