9. Building a recommender system for subjects


Finally, we'll consider building a recommender system using data from Wellcome Collection. These machine learning models work slightly differently to the ones we've seen so far. Rather than being trained to predict a single value, they're trained to predict a whole matrix of interactions between two kinds of entity.

Classically, these entities are 'users' and 'items', where an item might be a film, a book, a song, etc. However, because we don't have data about the interactions between users of wellcomecollection.org and its works, we're instead going to train a recommender system to predict the interactions between works and the subjects they're tagged with.

The hope is that we'll then be able to make recommendations for subjects which could appear on another work, based on the other subjects which it has been tagged with. We might also be able to use some of the features learned along the way to explore the similarity between works, or between subjects.

9.1 Model architecture

This class of problem is known as a collaborative filtering problem, and there are a number of different approaches to solving it. We're going to use a relatively simple technique called matrix factorisation.

First, we'll create a matrix to represent interactions between works and subjects. The interaction matrix should be binary, where the value of each element (i, j) is 1 if the work i is tagged with the subject j, and 0 otherwise. This interaction matrix will be what our model is trained to predict. Because we're representing the interaction between all possible works and all possible subjects, the interaction matrix will be very large, and very sparse.

Target interaction matrix shape: (n_works, n_subjects)
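As a rough illustration of that layout, here is a minimal sketch which builds a tiny interaction matrix. The work and subject IDs are made up for the example, and a dense NumPy array stands in for the sparse matrix we'll build later.

```python
import numpy as np

# Hypothetical toy data: three works, each tagged with a few subjects.
toy_subjects = {
    "work_a": ["anatomy", "surgery"],
    "work_b": ["surgery"],
    "work_c": ["botany", "anatomy"],
}

toy_work_ids = list(toy_subjects)
toy_subject_ids = sorted(
    {s for subjects in toy_subjects.values() for s in subjects}
)
toy_subject_index = {s: j for j, s in enumerate(toy_subject_ids)}

# Element (i, j) is 1 if work i is tagged with subject j, and 0 otherwise.
toy_interactions = np.zeros((len(toy_work_ids), len(toy_subject_ids)), dtype=int)
for i, work_id in enumerate(toy_work_ids):
    for subject in toy_subjects[work_id]:
        toy_interactions[i, toy_subject_index[subject]] = 1

print(toy_subject_ids)
print(toy_interactions)
print(f"Fraction of non-zero elements: {toy_interactions.mean():.2f}")
```

With only three works and three subjects the matrix is fairly full, but with the real catalogue the fraction of non-zero elements will be tiny.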

Then, we'll create two sets of randomly-initialised embeddings, one for works and one for subjects, with a shared dimensionality.

Work embedding matrix shape: (n_works, embedding_dim)

Subject embedding matrix shape: (n_subjects, embedding_dim)

We'll multiply these two matrices together to get a matrix of predicted interactions between works and subjects.

Predicted interaction matrix shape: (n_works, n_subjects)
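The shapes above can be sketched with a few lines of NumPy. This is only an illustration under assumed toy sizes (5 works, 4 subjects, 3 embedding dimensions), not the model we'll train below.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy sizes, chosen arbitrarily for illustration.
toy_n_works, toy_n_subjects, toy_embedding_dim = 5, 4, 3

# Randomly-initialised embeddings: one row per work, one row per subject.
toy_work_embeddings = rng.normal(size=(toy_n_works, toy_embedding_dim))
toy_subject_embeddings = rng.normal(size=(toy_n_subjects, toy_embedding_dim))

# Multiplying by the transposed subject embeddings gives one raw score for
# every (work, subject) pair.
toy_scores = toy_work_embeddings @ toy_subject_embeddings.T

# A sigmoid squashes each raw score into the range (0, 1), so that it can be
# compared with the binary targets.
toy_predicted_interactions = 1 / (1 + np.exp(-toy_scores))

print(toy_predicted_interactions.shape)  # (5, 4)

# Each score is just the dot product of one work vector and one subject vector.
print(np.isclose(toy_scores[1, 2], toy_work_embeddings[1] @ toy_subject_embeddings[2]))
```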

We'll then train the model by incrementally tweaking the embeddings for all of our works and subjects, making slightly better predictions about the likely interactions between works and subjects each time. We'll use the binary cross entropy loss function to measure our progress along the way.

import json
import pandas as pd
from pathlib import Path
import numpy as np
from tqdm.auto import tqdm

9.2 Building the target interaction matrix

First, we need to load all of the works and keep track of which subjects exist on each one. We'll do this by loading the local works.json snapshot (see the notebook on snapshots to download this if you haven't already!). We'll then iterate over the works and subjects, and create a dictionary mapping from work ID to a list of subject IDs.

works_path = Path("./data/works.json")
# count the lines in the snapshot, closing the file once we're done
with open(works_path) as f:
    n_works = sum(1 for _ in f)

n_works

subject_dict = {}

with open(works_path, "r") as f:
    for line in tqdm(f, total=n_works):
        work = json.loads(line)
        # collect the IDs of every identified concept on the work, skipping
        # any subjects or concepts which are missing the expected fields
        concept_ids = [
            concept["id"]
            for subject in work.get("subjects", [])
            for concept in subject.get("concepts", [])
            if "id" in concept
        ]
        # only keep works which have at least one identified concept
        if concept_ids:
            subject_dict[work["id"]] = concept_ids

len(subject_dict)
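To see what the resulting dictionary looks like without loading the full snapshot, here is a minimal sketch of the same idea run over a few made-up lines. The IDs are hypothetical, and the sketch assumes only the id, subjects and concepts fields used above; it skips anything which is missing them.

```python
import io
import json

# A few made-up lines in the same shape as the snapshot: one JSON object per
# line, each with a list of subjects, which in turn contain concepts.
toy_snapshot = io.StringIO(
    '{"id": "work_a", "subjects": [{"concepts": [{"id": "c1"}, {"id": "c2"}]}]}\n'
    '{"id": "work_b", "subjects": []}\n'
    '{"id": "work_c", "subjects": [{"concepts": [{"id": "c2"}]}, {"concepts": [{"id": "c3"}]}]}\n'
)

toy_subject_dict = {}
for line in toy_snapshot:
    work = json.loads(line)
    concept_ids = [
        concept["id"]
        for subject in work.get("subjects", [])
        for concept in subject.get("concepts", [])
        if "id" in concept
    ]
    # Works without any identified concepts are left out.
    if concept_ids:
        toy_subject_dict[work["id"]] = concept_ids

print(toy_subject_dict)
```

Note that work_b is dropped, because it has no subjects to learn from.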

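Now we can start building the target interaction matrix. We'll use the scipy.sparse module to create a sparse matrix with the correct shape, using the dictionary we just created to work out the row (work) and column (subject) of every element which should be set to 1. The MultiLabelBinarizer from scikit-learn collects the unique subjects for us in mlb.classes_, which gives each subject a stable column position.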

from scipy.sparse import csr_matrix
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
mlb.fit(subject_dict.values())

len(mlb.classes_)
mlb.classes_

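# map each subject to a column index, using the classes found by the binarizer
subject_to_column = {subject: j for j, subject in enumerate(mlb.classes_)}

# each work gets one row; repeat its row index once for each of its subjects
row_indexes = np.repeat(
    np.arange(len(subject_dict)), [len(v) for v in subject_dict.values()]
)
column_indexes = np.array(
    [subject_to_column[s] for v in subject_dict.values() for s in v]
)

subject_matrix = csr_matrix(
    (np.ones(len(row_indexes)), (row_indexes, column_indexes)),
    shape=(len(subject_dict), len(mlb.classes_)),
)
# repeated (work, subject) pairs are summed on construction, so reset them to 1
subject_matrix.data[:] = 1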
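The (values, (rows, columns)) form of the csr_matrix constructor expects integer row and column indexes. As a small sketch of how it behaves, here is a toy 3-work, 4-subject example with hypothetical coordinates:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical coordinates of the non-zero elements.
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([0, 2, 2, 1, 3])
values = np.ones(len(rows))

toy_matrix = csr_matrix((values, (rows, cols)), shape=(3, 4))

print(toy_matrix.toarray())

# Only the non-zero elements are stored.
n_elements = toy_matrix.shape[0] * toy_matrix.shape[1]
print(toy_matrix.nnz, "stored values out of", n_elements)

# The column indexes of the subjects on the first work.
print(toy_matrix[0].indices)
```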

9.3 Building our embedding layers

Now that we've got a sparse matrix of interactions, we can begin modelling the features which will go into the calculation of our predictions. We'll start by creating two embedding layers, one for works and one for subjects. We'll use the torch.nn.Embedding class to do this. We'll set the size of the embedding to 10, which means that each work and subject will be represented by a vector of 10 floating point numbers. We'll create one embedding for each unique work, and one for each unique subject.

import torch
from torch.nn import Embedding

unique_work_ids = list(subject_dict.keys())
unique_subjects = list(
    set([subject for subjects in subject_dict.values() for subject in subjects])
)

len(unique_work_ids), len(unique_subjects)

work_embeddings = Embedding(
    num_embeddings=len(unique_work_ids), embedding_dim=10
)

subject_embeddings = Embedding(
    num_embeddings=len(unique_subjects), embedding_dim=10
)

work_embeddings.weight.shape, subject_embeddings.weight.shape

These embedding layers can be multiplied together to produce a matrix of predicted interactions between works and subjects. We'll use torch.matmul to do the matrix multiplication.

a = torch.matmul(
    work_embeddings.weight[:100], subject_embeddings.weight[:100].T
)
a

a.shape

That's the core of our model! The predictions might not be meaningful at the moment (we're just multiplying two random matrices together), but we can train the model to make better predictions by tweaking the values of the embeddings.

9.4 Grouping our layers into a model

Before we begin training our model, we should wrap our embedding layers up into a single model class. We'll use the torch.nn.Module class to do this, giving us access to all sorts of pytorch magic with very little effort.

We'll need to add a .forward() method to the class, which will be used to calculate predictions during each training step, and at inference time.

from torch.nn import Module

class Recommender(Module):
    def __init__(self, n_works, n_subjects, embedding_dim):
        super().__init__()
        self.work_embeddings = Embedding(
            num_embeddings=n_works, embedding_dim=embedding_dim
        )
        self.subject_embeddings = Embedding(
            num_embeddings=n_subjects, embedding_dim=embedding_dim
        )

    def forward(self, work_ids, subject_ids):
        work_embeddings = self.work_embeddings(work_ids)
        subject_embeddings = self.subject_embeddings(subject_ids)
        predictions = torch.matmul(work_embeddings, subject_embeddings.T)
        return torch.sigmoid(predictions)

model = Recommender(
    n_works=len(unique_work_ids), 
    n_subjects=len(unique_subjects), 
    embedding_dim=10
)

9.5 Training the model

Now that we've got a model, we need to come up with a way of training it. Typically, machine learning models are trained using a technique called stochastic gradient descent. This involves taking a small batch of training examples, calculating the error for each one, and then using the average error to calculate the gradient of the error function with respect to each of the model's parameters. The parameters are then updated by a small amount in the opposite direction to the gradient (the direction which reduces the error), and the process is repeated.
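The update rule can be sketched on a toy problem with a single parameter. This is plain gradient descent on a made-up error function; the 'stochastic' part, where the gradient is estimated from a random batch of examples, is left out to keep the example short.

```python
# Minimise a toy error function, error(w) = (w - 3) ** 2, by gradient descent.
def gradient(w):
    # Derivative of (w - 3) ** 2 with respect to w.
    return 2 * (w - 3)


w = 0.0              # arbitrary starting value for the parameter
learning_rate = 0.1  # controls the size of each update

for step in range(50):
    # Step against the gradient, i.e. downhill on the error surface.
    w = w - learning_rate * gradient(w)

print(round(w, 3))
```

After enough steps, w settles close to 3, the value which minimises the error.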

Our interaction matrix is huge, and sparsely populated, so training on the whole thing at once would be very slow. Instead, we'll randomly sample a small batch of interactions from the matrix, and train on those, incrementally updating the weights. We'll do this repeatedly, until we've seen every interaction in the matrix at least once.

This process will be wrapped up into a custom BatchGenerator class.

class BatchGenerator:
    def __init__(self, subject_dict, batch_size):
        self.subject_dict = subject_dict
        self.batch_size = batch_size
        self.n_batches = len(subject_dict) // batch_size
        self.work_ids = list(self.subject_dict.keys())
        self.work_id_to_index = {
            work_id: i for i, work_id in enumerate(self.work_ids)
        }
        self.index_to_work_id = {
            i: work_id for i, work_id in enumerate(self.work_ids)
        }

        self.unique_subjects = list(
            set(
                [
                    subject
                    for subjects in subject_dict.values()
                    for subject in subjects
                ]
            )
        )
        self.subject_to_index = {
            subject: i for i, subject in enumerate(self.unique_subjects)
        }
        self.index_to_subject = {
            i: subject for i, subject in enumerate(self.unique_subjects)
        }

    def __len__(self):
        return self.n_batches

    def __iter__(self):
        """
        Yields a tuple of work_ids and subject_ids, and the target
        adjacency matrix for each batch.
        """
        # split the work ids into randomly shuffled batches
        work_ids_batches = np.array_split(
            np.random.permutation(self.work_ids), self.n_batches
        )
        for work_ids_batch in work_ids_batches:
            # get the work ids for each work in the batch
            work_ids = [
                self.work_id_to_index[work_id] for work_id in work_ids_batch
            ]

            # get the subset of subjects which appear on the works in the batch.
            # this is the set of subjects we want to predict against for each work
            # in the batch. each subject is included once, so that every column
            # of the target matrix corresponds to a distinct subject.
            subject_ids = sorted(
                {
                    self.subject_to_index[subject]
                    for work_id in work_ids_batch
                    for subject in self.subject_dict[work_id]
                }
            )
            subject_id_to_column = {
                subject_id: j for j, subject_id in enumerate(subject_ids)
            }

            # create the target adjacency matrix using the work ids and subject
            # ids
            target_adjacency_matrix = torch.zeros(
                len(work_ids), len(subject_ids)
            )

            for i, work_id in enumerate(work_ids_batch):
                for subject in self.subject_dict[work_id]:
                    j = subject_id_to_column[self.subject_to_index[subject]]
                    target_adjacency_matrix[i, j] = 1

            yield (
                torch.tensor(work_ids),
                torch.tensor(subject_ids),
                target_adjacency_matrix,
            )
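To make the shape of a single batch more concrete, here is a minimal standalone sketch of the target matrix for a hypothetical batch of two works, in which each distinct subject in the batch gets one column. It uses NumPy rather than torch, and the IDs are made up.

```python
import numpy as np

# Hypothetical batch of two works and the subjects they are tagged with.
toy_batch = {
    "work_a": ["c1", "c2"],
    "work_b": ["c2", "c3"],
}

# Each subject in the batch gets exactly one column, even if it appears on
# several works.
toy_batch_subjects = sorted(
    {s for subjects in toy_batch.values() for s in subjects}
)
toy_column = {s: j for j, s in enumerate(toy_batch_subjects)}

toy_target = np.zeros((len(toy_batch), len(toy_batch_subjects)))
for i, subjects in enumerate(toy_batch.values()):
    for s in subjects:
        toy_target[i, toy_column[s]] = 1

print(toy_batch_subjects)
print(toy_target)
```

The zeros in each row act as negative examples: subjects which appear elsewhere in the batch, but not on that work.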

We'll use the Adam optimiser to update the weights of our model at each step. Adam is a variant of stochastic gradient descent which adapts the size of each parameter's updates using running averages of its recent gradients, and it often converges faster than plain SGD.

from torch.optim import Adam

optimizer = Adam(params=model.parameters(), lr=0.001)
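For the curious, the update rule can be sketched for a single scalar parameter. This follows the standard formulation of Adam and leaves out extras such as weight decay, so treat it as an illustration rather than a description of PyTorch's implementation. The adam_step function and the toy error function are made up for this example.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a single scalar parameter."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Correct the bias towards zero caused by initialising m and v at 0.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Scale the step by the running estimate of the gradient's magnitude.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


param, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    grad = 2 * param  # gradient of the toy error function param ** 2
    param, m, v = adam_step(param, grad, m, v, t)
    print(t, param)
```

Because each step is divided by the running gradient magnitude, the first update has a size of roughly lr, however large or small the gradient is.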

Our optimiser will be guided by a binary cross entropy loss function (PyTorch's BCELoss), which will calculate the error between our predictions and the target interactions.

from torch.nn import BCELoss

binary_cross_entropy = BCELoss()
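To get a feel for what this loss measures, here is a small NumPy sketch of mean binary cross entropy applied to made-up predictions. The toy_bce function is written for this example only; it is not what PyTorch runs internally.

```python
import numpy as np


def toy_bce(predictions, targets):
    """Mean binary cross entropy between predicted probabilities and 0/1 targets."""
    predictions = np.clip(predictions, 1e-7, 1 - 1e-7)  # avoid taking log(0)
    return -np.mean(
        targets * np.log(predictions)
        + (1 - targets) * np.log(1 - predictions)
    )


toy_targets = np.array([1.0, 0.0, 1.0, 0.0])
confident_and_right = np.array([0.9, 0.1, 0.8, 0.2])
confident_and_wrong = np.array([0.1, 0.9, 0.2, 0.8])

print(toy_bce(confident_and_right, toy_targets))
print(toy_bce(confident_and_wrong, toy_targets))
```

The loss is small when confident predictions agree with the targets, and grows quickly when they disagree.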

We can now start training our model, using a batch size of 512, and training for 10 epochs. As usual, we'll keep track of the loss at each step, and plot it at the end.

n_epochs = 10
batch_size = 512

batch_generator = BatchGenerator(subject_dict, batch_size=batch_size)

losses = []
# the progress bar counts batches across all epochs
progress_bar = tqdm(total=n_epochs * len(batch_generator), unit="batches")
for epoch in range(n_epochs):
    progress_bar.set_description(f"Epoch {epoch}")
    # iterating over the generator reshuffles the works for each epoch
    for work_ids, subject_ids, target_adjacency_matrix in batch_generator:
        predictions = model(work_ids, subject_ids)
        loss = binary_cross_entropy(predictions, target_adjacency_matrix)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses.append(loss.item())
        progress_bar.set_postfix({
            "BCE": np.mean(losses[-100:]),
        })
        progress_bar.update()
progress_bar.close()

from matplotlib import pyplot as plt

plt.plot(losses)

# plot the log of the loss to see the trend more clearly
plt.plot(np.log(losses))

9.6 Making predictions

We can now use our trained model to make predictions about the interactions between works and subjects. If we select a random work (and find its trained embedding) we can multiply it by the subject embeddings to get a vector of predicted interactions between that work and each subject. We can then sort the subjects by their predicted interaction, and select the 10 highest scoring subjects. These are the subjects which our model thinks are most likely to be relevant to the work.

model.eval()

work_embeddings = model.work_embeddings.weight.detach().numpy()
subject_embeddings = model.subject_embeddings.weight.detach().numpy()

random_work_id = np.random.choice(unique_work_ids)
random_work_index = unique_work_ids.index(random_work_id)

Let's use the API to find out what this work is

base_url = "https://api.wellcomecollection.org/catalogue/v2/"
work_url = f"{base_url}works/{random_work_id}"

import requests

requests.get(work_url).json()["title"]

Let's find its embedding

random_work_embedding = work_embeddings[random_work_index]
random_work_embedding

And make some predictions about which subjects it's likely to be tagged with

predictions = np.matmul(
    random_work_embedding, subject_embeddings.T
)

top_predicted_subject_indexes = predictions.argsort()[::-1][:10]

predicted_concept_ids = [
    unique_subjects[index] for index in top_predicted_subject_indexes
]

predicted_concept_ids
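The sorting step can be checked on a handful of made-up scores. The sketch below also shows that ranking by raw scores gives the same order as ranking by the sigmoid probabilities the model was trained on, because the sigmoid never changes the order of its inputs.

```python
import numpy as np

# Hypothetical raw scores for six subjects.
toy_scores = np.array([0.2, 1.5, -0.3, 0.9, 2.1, 0.0])

# argsort returns indexes in ascending order of score, so reverse it and keep
# the first three to get the highest-scoring subjects.
toy_top_3 = toy_scores.argsort()[::-1][:3]
print(toy_top_3)  # [4 1 3]

# Applying a sigmoid turns the raw scores into probabilities without
# changing their order.
toy_probabilities = 1 / (1 + np.exp(-toy_scores))
print(toy_probabilities.argsort()[::-1][:3])  # [4 1 3]
```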

Let's have a look at the top 10 predicted subjects' labels

for concept_id in predicted_concept_ids:
    concept_url = f"{base_url}concepts/{concept_id}"
    label = requests.get(concept_url).json()["label"]
    print(label)

9.7 Visualising the embeddings

We can also visualise the similarity of the embeddings we've learned, in the same way as we did for the text embeddings in the previous notebook. Again, we'll use the UMAP algorithm to reduce the dimensionality of the embeddings to 2, and then plot them on a scatter plot.

from umap import UMAP

reducer = UMAP(n_components=2)
subject_embeddings_2d = reducer.fit_transform(subject_embeddings)

subject_embeddings_2d.shape

from matplotlib import pyplot as plt

plt.figure(figsize=(20, 20))
plt.scatter(
    subject_embeddings_2d[:, 0], 
    subject_embeddings_2d[:, 1], 
    alpha=0.2, 
    c="k",
    s=5
)
plt.show()
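A scatter plot gives an overall picture, but similarity can also be explored directly in the original embedding space. As one possible approach, the sketch below ranks subjects by cosine similarity to a chosen subject. Random vectors stand in for the trained subject embeddings here, so the neighbours it finds are meaningless; cosine similarity is also an assumption, and other measures could be used instead.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for the trained subject embeddings: 100 subjects, 10 dimensions.
toy_embeddings = rng.normal(size=(100, 10))

# Normalise each row to unit length, so that a dot product between two rows
# is their cosine similarity.
toy_normalised = toy_embeddings / np.linalg.norm(
    toy_embeddings, axis=1, keepdims=True
)

query_index = 0
toy_similarities = toy_normalised @ toy_normalised[query_index]

# The query is most similar to itself, so skip the first result and keep the
# next five.
toy_nearest = toy_similarities.argsort()[::-1][1:6]
print(toy_nearest)
```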

Exercises

Try training the model for longer, or with a different batch size. How does this affect the loss, and the quality of the corresponding predictions?
Try changing the size of the embedding. How does this affect the loss, and the quality of the corresponding predictions?
Can you think of a way of incorporating more prior information into the embeddings? Can you make these constraints trainable in the model's backward pass?
