Artificial Intelligence Advancements in Text-to-Image Generation
Artificial intelligence has made significant progress in text-to-image generation in recent years. This technology has various applications, including content creation, aiding the visually impaired, and storytelling. However, researchers have faced two major obstacles: a lack of high-quality data and copyright issues related to internet-scraped datasets.
In a recent study, a team of researchers proposed building an image dataset under a Creative Commons (CC) license and using it to train open diffusion models that can outperform Stable Diffusion 2 (SD2). To achieve this, they needed to overcome two major challenges:
Absence of Captions
Although high-resolution CC photos are openly licensed, they often lack the textual descriptions (captions) required for training text-to-image generative models. Without captions, a model cannot learn to produce visuals from textual input.
Scarcity of CC Photos
Compared to larger proprietary datasets like LAION, CC photos are scarce despite being a valuable resource. The scarcity raises concerns about whether there is enough data to successfully train high-quality models.
To tackle the first challenge, the team employed transfer learning: they generated synthetic captions with a pre-trained image-captioning model and paired them with a carefully curated collection of CC photos. The resulting dataset of photos and synthetic captions could then be used to train generative models that translate text into images.
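The pairing step above can be sketched as follows. This is a hypothetical illustration, not the authors' code: `generate_caption` is a stand-in for the pre-trained captioning model, which the article does not name.

```python
# Sketch: pairing uncaptioned CC photos with synthetic captions.
# `generate_caption` is a stand-in for a pre-trained image-captioning
# model; a real pipeline would run each image through that model.

def generate_caption(image_path: str) -> str:
    # Placeholder caption; real output would describe the image content.
    return f"a photo ({image_path})"

def build_captioned_dataset(image_paths):
    """Pair each CC photo with a synthetic caption for text-to-image training."""
    return [{"image": p, "caption": generate_caption(p)} for p in image_paths]

dataset = build_captioned_dataset(["cc_photo_001.jpg", "cc_photo_002.jpg"])
```

In a real pipeline, the resulting image–caption pairs would be serialized (e.g., as records alongside the image files) and fed to the diffusion model's training loop.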
To tackle the second challenge, the team developed a compute- and data-efficient training recipe designed to match the quality of current SD2 models with far less data: only about 70 million examples, roughly 3% of the data used to train SD2, are required. This suggests there are enough CC photos available to train high-quality models efficiently.
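As a rough sanity check on that ratio (the LAION-scale figure of ~2.3 billion pairs below is an assumption for illustration, not a number from the article):

```python
# Back-of-the-envelope check of the "roughly 3%" data-efficiency claim.
commoncatalog_size = 70_000_000       # CC examples used (from the article)
laion_scale_size = 2_300_000_000      # assumed LAION-scale dataset size (~2.3B pairs)

ratio = commoncatalog_size / laion_scale_size
print(f"{ratio:.1%}")  # → 3.0%
```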
The team trained several text-to-image models using the data and the effective training procedure. Together, these models form the CommonCanvas family and represent a significant advancement in generative models. They can generate visual outputs of similar quality to SD2.
The largest model in the CommonCanvas family, trained on a CC dataset less than 3% the size of the LAION dataset, achieves performance comparable to SD2 in human evaluations. Despite the limitations in dataset size and the use of artificial captions, the method effectively generates high-quality results.
The team summarized their primary contributions as follows:
- They used transfer learning to produce high-quality synthetic captions for Creative Commons (CC) photos that initially lacked them.
- They provided a dataset called CommonCatalog, consisting of approximately 70 million CC photos released under an open license.
- The CommonCatalog dataset was used to train a family of latent diffusion models (LDMs). Collectively known as CommonCanvas, these models compete qualitatively and quantitatively with the SD2-base baseline.
- The study incorporated various training optimizations, making SD2-base training almost three times faster.
- To encourage collaboration and further research, the team made the trained CommonCanvas model, CC photos, artificial captions, and the CommonCatalog dataset freely available on GitHub.
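The latent diffusion training mentioned above can be illustrated with a toy sketch of the epsilon-prediction objective. This is a minimal stand-in with random tensors and a tiny network, assuming a simplified linear noising step; real LDMs use a VAE encoder, a U-Net conditioned on text embeddings, and a proper noise scheduler.

```python
import torch
import torch.nn.functional as F

# Tiny stand-in for the denoising U-Net (real LDMs are far larger
# and conditioned on text embeddings).
unet = torch.nn.Sequential(
    torch.nn.Conv2d(4, 64, 3, padding=1),
    torch.nn.SiLU(),
    torch.nn.Conv2d(64, 4, 3, padding=1),
)

latents = torch.randn(2, 4, 8, 8)      # stand-in for VAE-encoded image latents
noise = torch.randn_like(latents)
t = torch.rand(2, 1, 1, 1)             # stand-in for a per-sample noise level

# Simplified noising: interpolate between clean latents and noise
# (real schedulers use specific alpha/sigma schedules).
noisy = (1 - t) * latents + t * noise

pred = unet(noisy)
loss = F.mse_loss(pred, noise)         # epsilon-prediction objective
loss.backward()
```

The training recipe in the paper amounts to running updates like this far more efficiently, so that competitive quality is reached with a fraction of the usual data and compute.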
For more information, please refer to the paper. All credit for this research goes to the researchers involved in this project.
About the Author
Tanya Malhotra is a final year undergraduate student at the University of Petroleum & Energy Studies, Dehradun. She is pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning. Tanya is a Data Science enthusiast with strong analytical and critical thinking skills. She has a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.