This article discusses recent breakthroughs in machine learning that have enabled multi-modal applications, i.e., systems that can process multiple types of data (such as text and images) simultaneously. It focuses on the use case of multi-modal image search and walks through a practical implementation using a model from the Hugging Face library. The article also explains the concept of multi-modal systems and introduces GPT-4V, an advanced model that accepts both text and image inputs.
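To make the image-search use case concrete before diving in, here is a minimal sketch of text-to-image search using a CLIP checkpoint from the Hugging Face `transformers` library. The specific model name (`openai/clip-vit-base-patch32`), the example file names, and the query string are illustrative assumptions, not necessarily the exact setup used later in the article:

```python
# Minimal text-to-image search sketch with CLIP via Hugging Face transformers.
# Model name, image files, and query below are assumptions for illustration.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical image collection to search over.
image_paths = ["cat.jpg", "beach.jpg", "city.jpg"]
images = [Image.open(p) for p in image_paths]

# Embed the text query and all images into a shared vector space.
inputs = processor(
    text=["a photo of a cat"],
    images=images,
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text holds the query's similarity to each image;
# the highest-scoring image is the best match.
scores = outputs.logits_per_text.softmax(dim=-1)
best = scores.argmax().item()
print(f"Best match: {image_paths[best]} (score {scores[0, best]:.3f})")
```

The key idea, which the rest of the article builds on, is that a multi-modal model maps text and images into the same embedding space, so a text query can be compared directly against image embeddings.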