Related blog posts
To find posts related to a given post, the idea is pretty simple:
- Calculate the embedding vectors for all posts, including the one we want to find related posts for.
- Use a function to compare our post’s vector with all the others, and grab the top N most similar vectors.
A common way of calculating the similarity between embedding vectors is to use cosine similarity (wikipedia).
Here’s an implementation of such a function in C#:
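using System;
using System.Collections.Generic;

Func<IReadOnlyList<float>, IReadOnlyList<float>, float> CosineSimilarity = (V1, V2) =>
{
    // Only the dimensions both vectors share are compared.
    int N = Math.Min(V1.Count, V2.Count);
    double dot = 0.0;   // running dot product of V1 and V2
    double mag1 = 0.0;  // running sum of squares of V1
    double mag2 = 0.0;  // running sum of squares of V2
    for (int n = 0; n < N; n++)
    {
        dot += V1[n] * V2[n];
        mag1 += V1[n] * V1[n];
        mag2 += V2[n] * V2[n];
    }
    // Note: the result is NaN if either vector has zero magnitude.
    return (float)(dot / (Math.Sqrt(mag1) * Math.Sqrt(mag2)));
};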
The function CosineSimilarity calculates the cosine similarity between two vectors. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.
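Written out for two vectors $A$ and $B$ with $n$ components each, the quantity the code computes is:

$$
\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
           = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}
$$

In the code, dot is the numerator, and mag1 and mag2 are the two sums under the square roots.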
The cosine of two identical vectors is 1 because the angle between them is 0. As the angle increases, the cosine decreases.
On the other hand, if the cosine similarity is close to -1, it means that the angle between the two vectors is close to 180 degrees, which in turn means that the two vectors are very dissimilar.
So, in the context of this code, if the value returned by CosineSimilarity is close to 1, it means that the two vectors V1 and V2 are very similar.
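To make these values concrete, here is a small sketch that runs three hand-picked 2D vectors through the same calculation. It restates the calculation as a local function named Cosine (a name chosen for this example) so it can run on its own, for instance as a Program.cs with top-level statements. The vectors are made up for illustration; real embeddings have many more dimensions.

```csharp
using System;
using System.Collections.Generic;

// Same calculation as CosineSimilarity, restated so this sketch is self-contained.
float Cosine(IReadOnlyList<float> v1, IReadOnlyList<float> v2)
{
    int n = Math.Min(v1.Count, v2.Count);
    double dot = 0.0, mag1 = 0.0, mag2 = 0.0;
    for (int i = 0; i < n; i++)
    {
        dot += v1[i] * v2[i];
        mag1 += v1[i] * v1[i];
        mag2 += v2[i] * v2[i];
    }
    return (float)(dot / (Math.Sqrt(mag1) * Math.Sqrt(mag2)));
}

float[] a = { 1f, 0f };
float[] sameDirection = { 2f, 0f };  // longer than a, but points the same way
float[] orthogonal = { 0f, 3f };     // 90 degrees from a
float[] opposite = { -1f, 0f };      // 180 degrees from a

Console.WriteLine(Cosine(a, sameDirection)); // 1
Console.WriteLine(Cosine(a, orthogonal));    // 0
Console.WriteLine(Cosine(a, opposite));      // -1
```

Note that the length of a vector does not matter, only its direction: sameDirection is twice as long as a and still scores 1.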
Some example data to play with
First, it would be nice to have some more blog posts to test this with. Download a bunch of generated blog posts from here: CowPress.GenerateContent.zip.
Heads up: Chrome will probably warn you about this file, because it hasn’t seen it many times yet.
Danger
Make sure you unzip the file in the correct directory, as described here:
- Unzip in your repository directory (the one with the sln file). This can be done by placing the zip file in that directory and double-clicking it.
- Then cd CowPress.GenerateContent and run the import using dotnet run import
In the Embedding.cs file, the Embedding class has a Vector property that provides a float array representing a vector. To store this in the database using EF, the class handles the conversion between the float array and a byte array automatically.
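How the Embedding class does this conversion is not shown here. As a rough sketch of one common approach, and not necessarily what Embedding.cs does, Buffer.BlockCopy can copy the raw bytes of a float array into a byte array and back. The helper names ToBytes and ToFloats are made up for this example.

```csharp
using System;

// Hypothetical helper: copies the raw 4-byte representation of each float.
byte[] ToBytes(float[] vector)
{
    var bytes = new byte[vector.Length * sizeof(float)];
    Buffer.BlockCopy(vector, 0, bytes, 0, bytes.Length); // count is in bytes
    return bytes;
}

// Hypothetical helper: the reverse copy, from bytes back into floats.
float[] ToFloats(byte[] bytes)
{
    var vector = new float[bytes.Length / sizeof(float)];
    Buffer.BlockCopy(bytes, 0, vector, 0, vector.Length * sizeof(float));
    return vector;
}

float[] original = { 0.25f, -1.5f, 3f };
byte[] stored = ToBytes(original);    // 3 floats x 4 bytes each
float[] restored = ToFloats(stored);

Console.WriteLine(stored.Length);     // 12
Console.WriteLine(string.Join(", ", restored));
```

A raw copy like this uses the byte order of the machine it runs on, which is fine as long as the data is written and read on the same kind of machine.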
Building it
The previous team already implemented a related posts feature that you can find at the bottom of a blog post view. Unfortunately our customers have been complaining the results aren’t very good. Let’s see if we can get better results with embeddings.
The current implementation can be found in the class RelatedBlogPosts.
Now add code to calculate the embedding when saving the post. Here you can find docs on how to do it using the Azure OpenAI SDK.
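The SDK call itself is best taken from the linked docs, since its exact shape depends on the SDK version. The sketch below only shows where such a call could sit in the save path. Everything in it is a placeholder: embed is a stand-in delegate for the real SDK call, and SavePostAsync and the in-memory store dictionary are hypothetical names, not part of CowPress.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Stand-in for the real embedding call (assumption): returns a fake 3-dimensional vector.
Func<string, Task<float[]>> embed = text =>
    Task.FromResult(new float[] { text.Length, 1f, 0f });

// Hypothetical in-memory store mapping post id -> embedding vector.
var store = new Dictionary<int, float[]>();

async Task SavePostAsync(int id, string title, string body)
{
    // Embed title and body together so both influence the vector.
    float[] vector = await embed(title + "\n" + body);
    store[id] = vector; // in the real app this would be persisted alongside the post
}

await SavePostAsync(1, "Happy cows", "Cows like grass.");
Console.WriteLine(store[1].Length); // 3
```

Computing the embedding at save time means the related-posts lookup only has to compare stored vectors, instead of calling the embedding service for every post on every page view.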
After doing that, you can fetch the embeddings of other posts and compare their similarity in RelatedBlogPosts.
I need more help on how to do this.
The current related-posts algorithm is implemented in RelatedBlogposts.cs. You could probably implement and use GenerateEmbedding from ContentGenerator for this lookup, along with the previously mentioned CosineSimilarity.
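As a rough sketch of that lookup, the following ranks candidate posts by cosine similarity to the current post and keeps the top N. The FindRelated helper, the (Id, Vector) tuples, and the tiny 2D vectors are invented for illustration. In CowPress the vectors would come from the stored embeddings, and the real RelatedBlogPosts class may be shaped differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Same calculation as CosineSimilarity, restated so this sketch is self-contained.
float Cosine(IReadOnlyList<float> v1, IReadOnlyList<float> v2)
{
    int n = Math.Min(v1.Count, v2.Count);
    double dot = 0.0, mag1 = 0.0, mag2 = 0.0;
    for (int i = 0; i < n; i++)
    {
        dot += v1[i] * v2[i];
        mag1 += v1[i] * v1[i];
        mag2 += v2[i] * v2[i];
    }
    return (float)(dot / (Math.Sqrt(mag1) * Math.Sqrt(mag2)));
}

// Hypothetical helper: ids of the `count` posts most similar to `current`.
List<int> FindRelated(
    (int Id, float[] Vector) current,
    IEnumerable<(int Id, float[] Vector)> all,
    int count)
{
    return all
        .Where(p => p.Id != current.Id)                            // skip the post itself
        .OrderByDescending(p => Cosine(current.Vector, p.Vector))  // most similar first
        .Take(count)
        .Select(p => p.Id)
        .ToList();
}

// Made-up posts with 2D "embeddings" so the ranking is easy to check by hand.
var posts = new List<(int Id, float[] Vector)>
{
    (1, new[] { 1.0f, 0.0f }),
    (2, new[] { 0.9f, 0.1f }),
    (3, new[] { 0.0f, 1.0f }),
    (4, new[] { 0.7f, 0.7f }),
};

List<int> related = FindRelated(posts[0], posts, 2);
Console.WriteLine(string.Join(", ", related)); // 2, 4
```

Post 2 points almost the same way as post 1, post 4 sits at 45 degrees, and post 3 is orthogonal, so it is the one that gets dropped.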
I need more help than that on how to do this.
The written instructions for that aren’t here yet. Please find someone from the supervisor team, or ask on Slack.
When you have a working implementation
Fantastic job! You are well on your way to becoming an AI Engineer. We hope you’ve gotten some new insights & learnings, and have had an enjoyable day with your team!
This was all we hoped for you to complete on your first day, but if you’re done early or eager to try some more challenges, you can take a look at Extra learning.
Created : November 17, 2023