Fiddling with the OpenAI CLIP Neural Network
Having experimented quite a bit with LLMs, GenAI and "Retrieval-Augmented Generation" and found it all somewhat frustrating, I noodled around looking for other interesting technology to explore and came across OpenAI's CLIP neural network.
So What Is CLIP?
OpenAI CLIP (source code at https://github.com/openai/CLIP) is a neural network that can understand both images and text in the same "conceptual semantic space". What that means, in more human terms, is that it has learned how images and their natural language descriptions relate. Instead of being trained on one narrow task (e.g. "detect cats"), it was trained on hundreds of millions of image–caption pairs, so it can handle far broader tasks that connect visual content with open-ended human concepts, e.g. "classify photos of healthy leaves versus photos of diseased leaves", without having been specifically trained on healthy and unhealthy leaves.
Two Things I'm Particularly Interested In:
1: Labelling an image's content: Based on the descriptions at https://openai.com/index/clip/ and https://github.com/openai/CLIP it seems that CLIP should be able to say, for example, whether a dog has black fur, or what type of bicycle is in a picture.
2: Comparing images: It turns out that a side effect of how CLIP works is that it's actually very good at comparing the contents of images, e.g. "are these two dogs similar?"
Let's Try It Out!
We have four test images:
Bike 1 and Bike 2:


Dog 1 and Dog 2:


We can use the "Sentence Transformers" Python library to work with the catchily titled 'clip-ViT-B-32' model ('the Image & Text model that maps text and images to a shared vector space') to compare and label images. FYI when comparing images, the CLIP model gives a similarity score that in practice lands between 0 and 1, with 1 indicating that two images are identical. So, what do we get?
Similarity Scores:
- Bike1 and Bike2 have a similarity score of 0.8743 - indicating that they're possibly the same bike.
- Dog1 and Dog2 have a similarity score of 0.7134 - indicating they're both the same thing (i.e. dogs), but they're not the same dog, and not the same type.
- Just to check - Bike1 and Dog1 have a similarity score of 0.5532 - indicating they're just not the same thing at all.
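A minimal sketch of how scores like these could be computed. It assumes the score is cosine similarity between CLIP embeddings (a common choice, though the exact scoring behind the numbers above isn't stated) and that the `sentence-transformers` and `Pillow` packages are installed. The helper names `cosine_similarity` and `embed_images` are made up for illustration, and the toy vectors merely stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed_images(paths, model_name="clip-ViT-B-32"):
    """Return one CLIP embedding (a list of floats) per image path.

    Hypothetical helper: the third-party imports are done lazily so the
    rest of this file still runs without those packages installed.
    """
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    images = [Image.open(path) for path in paths]
    return [vector.tolist() for vector in model.encode(images)]


# Toy 3-dimensional vectors standing in for real CLIP embeddings.
toy_bike_a = [0.9, 0.1, 0.1]
toy_bike_b = [0.8, 0.2, 0.1]
toy_dog = [0.1, 0.9, 0.2]

print(round(cosine_similarity(toy_bike_a, toy_bike_b), 4))
print(round(cosine_similarity(toy_bike_a, toy_dog), 4))
```

With real images, something like `embed_images(["bike1.jpg", "bike2.jpg"])` (hypothetical file names) would supply the vectors to pass to `cosine_similarity`.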
Labelling:
- Bike1 got the labels: "commuter bike", "nearly new" and "dark grey"
- And Bike2 was labelled as: "commuter bike", "nearly new" and "light grey"
- Dog1: "brown fur", and Dog2: "white fur"
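Labels like the ones above can be produced by embedding a handful of candidate phrases as text and picking the one closest to the image embedding. The sketch below shows that idea under assumptions: the candidate labels, the toy vectors and the helper names `best_label` and `embed_labels` are all made up for illustration, and it may not match exactly how the labels above were generated.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_label(image_vector, label_vectors):
    """Return (label, score) for the label whose text embedding is closest."""
    scores = {
        label: cosine_similarity(image_vector, vector)
        for label, vector in label_vectors.items()
    }
    winner = max(scores, key=scores.get)
    return winner, scores[winner]


def embed_labels(labels, model_name="clip-ViT-B-32"):
    """Hypothetical helper: embed candidate label texts with the CLIP model.

    Needs the sentence-transformers package; imported lazily so the toy
    example below runs without it.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return {
        label: vector.tolist()
        for label, vector in zip(labels, model.encode(labels))
    }


# Toy 3-dimensional vectors standing in for real CLIP embeddings.
toy_dog_image = [0.9, 0.1, 0.2]
toy_fur_labels = {
    "brown fur": [1.0, 0.0, 0.1],
    "white fur": [0.0, 1.0, 0.1],
    "black fur": [0.1, 0.1, 1.0],
}

label, score = best_label(toy_dog_image, toy_fur_labels)
print(label)  # brown fur
```

To get several labels per image (type, condition, colour), one option is to call `best_label` once per group of candidate phrases.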
Based on this very small test, the model seems to have potential and is worth investigating further.
Comparing Book and Album Covers:
This got me thinking about how else OpenAI's CLIP could be applied. So, I downloaded around 100 album cover images from https://musicbrainz.org/ and Wikipedia's list of best-selling music artists, and around 300 book cover images from OpenLibrary and Wikipedia's list of best-selling books, and wrote some code to explore how CLIP could be used to browse collections of books and albums.
Could CLIP be used to browse these books and albums based purely on the images? Would this work and make sense? Might it throw up something interesting? Or would it be completely nonsensical?
So, I ended up writing something that allows browsing by:
- Semantic similarity according to CLIP: i.e. how close CLIP perceives the semantic content or meaning of one book or album cover to be to another.
- Colour: i.e. find books / albums with similar colours
- "Perceptual Hash": i.e. via a compact representation of how an image looks — not its exact pixels, but its structure, tone, and general layout.
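To make the "perceptual hash" idea a bit more concrete, here is a minimal sketch of an average hash, a simpler relative of the DCT-based pHash, so not necessarily what was used for the results below. It takes a plain grid of grayscale values rather than a real image file, and the function names are made up for illustration.

```python
def average_hash(pixels, hash_size=8):
    """Hash a 2D grid of grayscale values (0-255) into hash_size**2 bits.

    The grid is shrunk to hash_size x hash_size by averaging blocks, then
    each cell becomes 1 if it is brighter than the overall mean, else 0.
    Assumes the grid is at least hash_size pixels in each direction.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for row in range(hash_size):
        y0, y1 = row * height // hash_size, (row + 1) * height // hash_size
        for col in range(hash_size):
            x0, x1 = col * width // hash_size, (col + 1) * width // hash_size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of differing bits; 0 means the hashes are identical."""
    return bin(hash_a ^ hash_b).count("1")


# Toy 16x16 "images": bright left half, a dimmer copy, and bright top half.
left_bright = [[200 if x < 8 else 30 for x in range(16)] for y in range(16)]
dimmer_copy = [[max(0, value - 20) for value in row] for row in left_bright]
top_bright = [[200 if y < 8 else 30 for x in range(16)] for y in range(16)]

print(hamming_distance(average_hash(left_bright), average_hash(dimmer_copy)))  # 0
print(hamming_distance(average_hash(left_bright), average_hash(top_bright)))  # 32
```

The dimmer copy hashes identically because only the pattern of brighter-than-average cells matters, which is why this kind of hash tracks layout and tone rather than subject matter.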
Let's have a look at a few examples:
Books:
Images semantically similar to Michelle Obama's book cover (according to CLIP):
I'd say CLIP has done fairly well here identifying similar books by or about Michelle and Barack Obama. I wonder whether the Mariah Carey book crept in due to the similarity between "Becoming" and "Meaning". The other two books ... er, not too sure about those. Maybe the structure was similar. 🤷
Images similar by colour to Michelle Obama's book cover:
Certainly found images that are similar in colour - how useful that is, that's a different question 😜
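How the colour comparison works isn't spelled out here, so the following is only one plausible approach: bucket each image's pixels into a coarse RGB histogram and measure how much two histograms overlap. The function names and toy pixel lists are made up for illustration.

```python
def colour_histogram(pixels, bins_per_channel=4):
    """Normalised coarse RGB histogram for a list of (r, g, b) tuples (0-255)."""
    step = 256 / bins_per_channel
    counts = [0] * bins_per_channel ** 3
    total = 0
    for r, g, b in pixels:
        # Map each channel to a bin index, clamped to the last bin.
        ri = min(int(r / step), bins_per_channel - 1)
        gi = min(int(g / step), bins_per_channel - 1)
        bi = min(int(b / step), bins_per_channel - 1)
        counts[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1
        total += 1
    return [count / total for count in counts]


def histogram_intersection(hist_a, hist_b):
    """1.0 means identical colour distributions, 0.0 means no overlap."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


# Toy pixel lists standing in for real cover images.
mostly_red = [(220, 30, 30)] * 90 + [(250, 250, 250)] * 10
also_red = [(200, 10, 40)] * 80 + [(10, 10, 10)] * 20
mostly_blue = [(20, 40, 210)] * 100

red_hist = colour_histogram(mostly_red)
print(histogram_intersection(red_hist, colour_histogram(also_red)))  # 0.8
print(histogram_intersection(red_hist, colour_histogram(mostly_blue)))  # 0.0
```

A measure like this knows nothing about what is on the cover, which fits the results: the matches share a palette and little else.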
Images similar by 'PHash' to Michelle Obama's book cover:
OK, so "Perceptual Hash" is apparently matching "structure, tone, and general layout". Hmmm. 🤔
Albums:
Images semantically similar to the cover of Bob Marley's Greatest Hits (according to CLIP):
Again CLIP has done a pretty good job here pulling in a bunch of other Bob Marley albums. Quite how "Sgt Pepper" is related is a bit of a mystery though. 🤷
Images similar by colour to the cover of Bob Marley's Greatest Hits:
Browsing by colour hasn't worked quite as well as with Michelle Obama's book cover; it seems to have got fixated on red rather a lot.
Images similar by 'PHash' to the cover of Bob Marley's Greatest Hits:
"Perceptual Hash" is just rather surreal really. 🤪
My Conclusions:
- Finding images that CLIP perceives to be semantically similar seems to work reasonably well.
- Labelling of images e.g. type of bike or dog fur - shows promise.
- Finding images with similar colours might work - if one could find a use.
- "Perceptual Hash" might be interesting to pursue if one could find a way to explain what it is actually matching on (it's a hashing algorithm rather than a neural network, so that may be feasible). But it does seem rather surreal at present.
Couple of Real-World Uses of CLIP:
If you're interested in reading a bit more about how others have used CLIP, here are a couple of examples:
- lupasearch.com: Hack your E-commerce Search Accuracy: Introduce OpenAI’s CLIP
- roboflow.com: Zero-Shot Content Moderation with OpenAI's New CLIP Model
What's Next?
I'm going to look into whether OpenAI’s CLIP could be useful in identifying stolen bikes. We have:
- A list of stolen bikes at Bike Register
- And many bikes for sale on sites such as Cash Converters and of course Facebook Marketplace
Could we download a bunch of images and compare them and maybe find stolen bikes for sale? All will be revealed in the next blog post ... 🥳
