Image Search in 5 Minutes
Cutting-edge image search, simply and quickly

In this post we’ll implement text-to-image search (allowing us to search for an image via text) and image-to-image search (allowing us to search for an image based on a reference image) using a lightweight pre-trained model. The model we’ll be using to calculate image and text similarity is inspired by Contrastive Language-Image Pre-Training (CLIP), which I discuss in another article.

Who is this useful for? Any developers who want to implement image search, data scientists interested in practical applications, or non-technical readers who want to learn about A.I. in practice.
How advanced is this post? This post will walk you through implementing image search as quickly and simply as possible.
Pre-requisites: Basic coding experience.
What We’re Doing, and How We’re Doing it
This article is a companion piece to my article on “Contrastive Language-Image Pre-Training”. Feel free to check it out if you want a more thorough understanding of the theory:
CLIP, Intuitively and Exhaustively Explained: Creating strong image and language representations for general machine learning tasks. (towardsdatascience.com)
CLIP models are trained to predict if an arbitrary caption belongs with an arbitrary image. We’ll be using this general functionality to create our image search system. Specifically, we’ll be using the image and text encoders from CLIP to condense inputs into a vector, called an embedding, which can be thought of as a summary of the input.

The whole idea behind CLIP is that similar text and images will have similar vector embeddings.

The specific model we’ll be using is called uform. uform is a permissively licensed, pre-trained, and resource-efficient model which promises superior performance to CLIP. uform comes in three flavors; we’ll be using the “late fusion” variant, which is conceptually the most similar to CLIP.

Actual similarity between embeddings will be calculated using cosine similarity. The essence of cosine similarity is that two things can be considered “similar” if the angle between their embedding vectors is small. Thus, we can calculate how similar text and images are to each other by first embedding both the text and the images, then calculating the cosine similarity between the embeddings.
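To make that concrete, here’s a minimal sketch of cosine similarity on two toy vectors using PyTorch. The numbers are made up purely for illustration; in our system the vectors will be uform embeddings.

import torch
import torch.nn.functional as F

# two toy embedding vectors (values made up for illustration)
a = torch.tensor([[0.9, 0.1, 0.3]])
b = torch.tensor([[0.8, 0.2, 0.25]])

# cosine similarity is the dot product of the normalized vectors,
# ranging from -1 (opposite directions) to 1 (same direction)
print(F.cosine_similarity(a, b))  # ~0.99, i.e. very similar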

And that’s the idea in a nutshell: we download the CLIP-inspired model (uform), use its encoders to embed images and text, then use cosine similarity to find the closest matches. Feel free to refer to the companion article for a deeper dive on the theory. Now we just need to put it into practice.
Implementation
I’ll be skipping through some of the unimportant stuff. The full code can be found here:
MLWritingAndResearch/ImageSearch.ipynb at main · DanielWarfield1/MLWritingAndResearch: notebook examples used in machine learning writing and research (github.com)
Downloading the Model
This is super easy: just pip install the uform module, then use it to download the model from Hugging Face. We’ll be using the English version, but versions in other languages are also available.
!pip install uform

import uform

model = uform.get_model('unum-cloud/uform-vl-english')
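As a quick sanity check (not part of the original notebook), you can embed a short test string and look at the shape of the result; the exact embedding dimensionality depends on the uform variant you downloaded.

#hypothetical sanity check: embed a short string and inspect the result
test_data = model.preprocess_text('a quick test sentence')
test_embedding = model.encode_text(test_data)
print(test_embedding.shape)  # (1, embedding_dim)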

Defining a Database of Images to Search
I downloaded a few images from a dataset for us to play with, which is a derivative of a dataset from the Harvard Dataverse (licensed under Creative Commons), and put them in a public GitHub repo. The following pseudocode downloads those images into the list images. This list of images is what we’ll ultimately be searching through.
#List all files
urls = get_image_urls_from_github()

#Download each file
images = download_images(urls)

#Render out a few examples
render_examples(images)
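If you don’t want to dig through the notebook, here’s a rough sketch of how the download step could look using requests and PIL. The URL list and helper are placeholders for illustration; the real notebook builds its list from the MLWritingAndResearch repository.

import io
import requests
from PIL import Image

def download_images(urls):
    #download each image URL and decode it into a PIL image
    images = []
    for url in urls:
        response = requests.get(url)
        response.raise_for_status()
        images.append(Image.open(io.BytesIO(response.content)).convert('RGB'))
    return images

#placeholder URLs; point these at wherever your images actually live
urls = [
    'https://raw.githubusercontent.com/<user>/<repo>/main/images/img1.jpg',
    'https://raw.githubusercontent.com/<user>/<repo>/main/images/img2.jpg',
]
images = download_images(urls)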

Implementing Text-to-Image Search
Here’s where the rubber meets the road. First we’ll define some search text, in this example “a rainbow by the water”. Then we can embed that text and compare it to the embeddings of all images. We can then sort by cosine similarity to display the five images most similar to the search text. Keep in mind, a CLIP-style model has separate image and text encoders, so text gets encoded with the text encoder and images get encoded with the image encoder.
"""Implementing text to image searchusing the uform model to encode text and all images. Then using cosinesimilarity to find images which match the specified text. Rendering out thetop 5 results"""import torch.nn.functional as F#defining search phrasetext = "a rainbow by the water"print(f'search text: "{text}"')#embedding texttext_data = model.preprocess_text(text)text_embedding = model.encode_text(text_data)#calculating cosine similaritysort_ls = []print('encoding and calculating similarity...')for image in tqdm(images): #encoding image image_data = model.preprocess_image(image) image_embedding = model.encode_image(image_data) #calculating similarity sim = F.cosine_similarity(image_embedding, text_embedding) #appending to list for later sorting sort_ls.append((sim, image))#sorting by similaritysort_ls.sort(reverse=True, key = lambda t: t[0])print('top 5 most similar results:')_, axs = plt.subplots(1, 5, figsize=(12, 8))axs = axs.flatten()for img, ax in zip([im for sim, im in sort_ls][:5], axs): ax.imshow(img) ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False)plt.show()

Implementing Image-to-Image Search
Image-to-image search behaves similarly to the text-to-image search previously discussed: we embed the image we’re searching with, as well as all other images. The embedding of our search image is compared to the embedding of every other image (using cosine similarity), allowing us to find the images most similar to our search image. Naturally, the most similar image in this example is the search image itself.
"""Implementing image to image searchsimilar to previous approach, except all images are compared to an input image.Rendering out the top 5 results"""#defining search imageinput_image = images[15]#rendering search imageprint('input image:')fig = plt.figure(figsize=(4,4))ax = fig.add_subplot(111)ax.imshow(input_image)ax.get_xaxis().set_visible(False)ax.get_yaxis().set_visible(False)plt.show()#embedding imageimage_data = model.preprocess_image(input_image)search_image_embedding = model.encode_image(image_data)#calculating cosine similaritysort_ls = []print('encoding and calculating similarity...')for image in tqdm(images): #encoding image image_data = model.preprocess_image(image) image_embedding = model.encode_image(image_data) #calculating similarity sim = F.cosine_similarity(image_embedding, search_image_embedding) #appending to list for later sorting sort_ls.append((sim, image))#sorting by similaritysort_ls.sort(reverse=True, key = lambda t: t[0])print('top 5 most similar results:')_, axs = plt.subplots(1, 5, figsize=(12, 8))axs = axs.flatten()for img, ax in zip([im for sim, im in sort_ls][:5], axs): ax.imshow(img) ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False)plt.show()

Conclusion
And that’s it! We successfully used a CLIP-style model’s image and text encoders to implement two types of image search: one based on input text, and one based on an input image. We did this by using the text encoder to calculate an embedding of the text, the image encoder to calculate embeddings of the images, and sorting by the cosine similarity between embeddings to find the best matches.
Feel free to check out the companion article for a deeper dive on CLIP.
Follow For More!
I describe papers and concepts in the ML space, with an emphasis on practical and intuitive explanations.
Attribution: All of the resources in this document were created by Daniel Warfield, unless a source is otherwise provided. You can use any resource in this post for your own non-commercial purposes, so long as you reference this article, https://danielwarfield.dev, or both. An explicit commercial license may be granted upon request.