Contrastive learning encompasses a variety of methods that learn a constrained embedding space to solve a task. The embedding space is constrained such that a chosen metric, a function that measures the distance between two embeddings, satisfies some desired properties, usually that small distances imply a shared class. Contrastive learning underlies many self-supervised methods, such as MoCo.
In contrastive learning, there are two components that determine the constraints on the learned embedding space: the similarity function and the contrastive loss. The similarity function takes a pair of embedding vectors and quantifies how similar they are as a scalar. The contrastive loss determines which pairs of embeddings have their similarity evaluated and how the resulting set of similarity values is used to measure error with respect to a task, such as classification. Backpropagating to minimize this error causes a model to learn embeddings that best satisfy the constraints induced by the similarity function and contrastive loss.
This blog post examines how similarity functions and contrastive losses affect the learned embedding spaces. We first examine the different choices for similarity functions and contrastive losses. Then we conclude with a brief case study investigating the effects of different similarity functions on supervised contrastive learning.
A similarity function \(s(z_1, z_2): \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}\) maps a pair of \(d\)-dimensional embedding vectors \(z_1\) and \(z_2\) to a real similarity value, with greater values indicating greater similarity. A temperature hyperparameter \(0 < \tau \leq 1\) is often included, via \(\frac{s(z_1, z_2)}{\tau}\), to scale a similarity function. If the similarity function has a bounded range, dividing by \(\tau\) expands that range. We omit \(\tau\) below for simplicity.
A common similarity function is cosine similarity:
\[s(z_1, z_2) = \frac{z_1 \cdot z_2}{||z_1|| \cdot ||z_2||}\]
This function measures the cosine of the angle between \(z_1\) and \(z_2\), giving a scalar in \([-1, 1]\). The associated cosine distance, \(1 - s(z_1, z_2)\), violates the triangle inequality, making cosine similarity the only similarity function discussed here that is not derived from a distance metric.
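As a brief illustration of the temperature scaling mentioned above (the value \(\tau = 0.1\) is chosen only for this example), dividing a cosine similarity of 0.5 by \(\tau\) yields a logit of 5, and the bounded range \([-1, 1]\) is stretched to \([-10, 10]\):

\[
\frac{s(z_1, z_2)}{\tau} = \frac{0.5}{0.1} = 5
\]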
The recently proposed negative arc length similarity function instead measures similarity along the surface of the unit hypersphere by negating the geodesic (arc length) distance between \(z_1\) and \(z_2\).
This function assumes that \(||z_1|| = ||z_2|| = 1\), a common normalization that restricts embeddings to the unit hypersphere.
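As a sketch of the core idea, assuming unit-norm embeddings so that the geodesic distance reduces to the angle between the vectors (in practice the negated arc length is typically rescaled to a bounded range, such as \([-1, 1]\), and the exact scaling may differ from the proposed formulation):

\[
s(z_1, z_2) = -\arccos(z_1 \cdot z_2)
\]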
The negative Euclidean distance similarity function is simply:
\[s(z_1, z_2) = -||z_1 - z_2||_2\]
Euclidean distance measures the length of the shortest path in Euclidean space, making it the geodesic distance when \(z_1\) and \(z_2\) can take any value in \(\mathbb{R}^d\). In this case the similarity function has range \((-\infty, 0]\).
The negative Euclidean distance can also be used with embeddings restricted to the unit hypersphere, resulting in range \([-2, 0]\). However, this is not the geodesic distance for the hypersphere, as the straight-line path being measured passes through the interior of the sphere rather than along its surface. The Euclidean distance will therefore be less than the arc length unless \(z_1 = z_2\), in which case both equal 0.
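The sketch below implements the three similarity functions discussed above for unit-normalized embeddings, along with optional temperature scaling. The function names and the arc length rescaling are illustrative choices, not the exact implementations from the cited papers.

```python
import numpy as np

def cosine_similarity(z1, z2):
    """Cosine of the angle between z1 and z2; range [-1, 1]."""
    return np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))

def negative_arc_length(z1, z2):
    """Negated arc length between unit-norm z1 and z2, rescaled affinely to
    [-1, 1] here for comparability with cosine similarity; the exact scaling
    in the original proposal may differ."""
    cos = np.clip(np.dot(z1, z2), -1.0, 1.0)  # guard against rounding error
    return 1.0 - 2.0 * np.arccos(cos) / np.pi

def negative_euclidean(z1, z2):
    """Negative Euclidean (chord) distance; range [-2, 0] for unit-norm inputs."""
    return -np.linalg.norm(z1 - z2)

# Example usage with temperature scaling.
rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=8), rng.normal(size=8)
z1, z2 = z1 / np.linalg.norm(z1), z2 / np.linalg.norm(z2)  # project to the unit sphere
tau = 0.1
for sim in (cosine_similarity, negative_arc_length, negative_euclidean):
    print(sim.__name__, sim(z1, z2) / tau)
```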
A contrastive loss function maps a set of embeddings and a similarity function to a scalar value. Losses are written such that derivatives for backpropagation are taken with respect to the embedding \(z\).
The original contrastive loss operates on a single pair of embeddings at a time: pairs that share a class have their similarity maximized, while pairs that do not share a class are penalized whenever their similarity exceeds a margin \(m\).
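A minimal sketch of a margin-based pair loss consistent with this description, written here with the similarity function \(s\) and an indicator \(y = 1\) when the pair shares a class (the original formulation is typically written with Euclidean distances and squared terms, so details may differ):

\[
\mathcal{L}(z_1, z_2, y) = -\, y \, s(z_1, z_2) + (1 - y)\, \max\bigl(0,\; s(z_1, z_2) - m\bigr)
\]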
The structure of this loss implies that \(z_1\) and \(z_2\) do not share a class if \(s(z_1, z_2) < m\), and that they may share a class otherwise. The margin hyperparameter \(m\) can be challenging to tune for efficiency throughout the training process because it needs to be satisfiable while still providing \(z^-\) samples within the margin in order to backpropagate error.
The triplet loss compares an anchor \(z\) with both a positive \(z^+\) and a negative \(z^-\) at once, requiring the similarity to the positive to exceed the similarity to the negative by at least a margin \(m\).
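A standard similarity-based sketch of the triplet loss (the original is usually written with Euclidean distances):

\[
\mathcal{L}(z, z^+, z^-) = \max\bigl(0,\; s(z, z^-) - s(z, z^+) + m\bigr)
\]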
The triplet loss only updates a network when its loss is positive, so finding triplets that satisfy that condition is important for learning efficiency.
The Lifted Structured loss extends this idea to a batch, comparing each positive pair against the negative pairs that involve either of its embeddings.
The Batch Hard loss instead forms one triplet per anchor in the batch, using the least similar positive and the most similar negative for that anchor.
The decision to compute the loss based on comparisons between \(z\), a single \(z^+\), and a single \(z^-\) comes with advantages and disadvantages. These methods can be easier to adapt for learning with varying levels of supervision because complete knowledge of whether similarity should be maximized or minimized for each pair in the dataset is not required. However, these methods also make efficient training difficult and provide relatively loose constraints on the embedding space.
A common contrastive loss is the Information Noise Contrastive Estimation (InfoNCE) loss.
InfoNCE is a cross entropy loss whose logits are similarities for \(z\). \(z^+\) is a single embedding whose similarity with \(z\) should be maximized while \(z^-_1, z^-_2, \ldots, z^-_n\) are a set of \(n\) embeddings whose similarity with \(z\) should be minimized. The structure of this loss implies that \(z_1\) shares a class with \(z_2\) if no other embedding has greater similarity with \(z_1\).
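Concretely, a standard form of the InfoNCE loss (with the temperature again omitted) is:

\[
\mathcal{L}(z, z^+, z^-_1, \ldots, z^-_n) = -\log \frac{\exp\bigl(s(z, z^+)\bigr)}{\exp\bigl(s(z, z^+)\bigr) + \sum_{i=1}^{n} \exp\bigl(s(z, z^-_i)\bigr)}
\]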
The choice of \(z^+\) and \(z^-\) sets varies across methods. The self-supervised InfoNCE loss chooses \(z^+\) to be an embedding of an augmentation of the input that produced \(z\) and the \(z^-\) set to be the embeddings of the other inputs and their augmentations in the batch. This is called instance discrimination because only augmentations of the same input instance have their similarity maximized.
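The sketch below illustrates this instance discrimination setup for a batch of embeddings, assuming two augmentations per input and cosine similarity. It is a simplified version of typical implementations (for example, the NT-Xent loss used by SimCLR), not the exact code of any particular method.

```python
import torch
import torch.nn.functional as F

def info_nce_instance_discrimination(z_a, z_b, tau=0.1):
    """Self-supervised InfoNCE (instance discrimination) over a batch.

    z_a, z_b: (N, d) embeddings of two augmentations of the same N inputs.
    Each embedding's positive is its counterpart in the other view; the
    remaining 2N - 2 embeddings in the batch act as negatives.
    """
    n = z_a.shape[0]
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)   # (2N, d), unit norm
    sim = z @ z.t() / tau                                   # cosine similarity logits
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))  # drop self-pairs
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])  # index of each positive
    return F.cross_entropy(sim, targets)

# Example usage with random embeddings standing in for an encoder's output.
z_a, z_b = torch.randn(4, 16), torch.randn(4, 16)
print(info_nce_instance_discrimination(z_a, z_b).item())
```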
Supervised methods expand the definition of \(z^+\) to also include embeddings that share a class with \(z\). The expectation of the InfoNCE loss over choices of \(z^+\) is used to jointly maximize their similarity to \(z\). The Supervised Contrastive (SupCon) loss is a widely used loss of this form.
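A sketch of this expectation over the positive set \(P(z)\), written as a uniform average; published supervised losses differ in details such as whether the other positives also appear in each denominator:

\[
\mathcal{L}(z) = \frac{1}{|P(z)|} \sum_{z^+ \in P(z)} -\log \frac{\exp\bigl(s(z, z^+)\bigr)}{\exp\bigl(s(z, z^+)\bigr) + \sum_{i=1}^{n} \exp\bigl(s(z, z^-_i)\bigr)}
\]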
Considering a set of similarities during loss calculation allows the loss to implicitly perform hard negative mining: the negatives most similar to \(z\) receive the largest weight in the gradient, so no explicit mining procedure is needed.
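This can be seen from the derivative of the InfoNCE loss with respect to each negative's similarity, which equals that negative's softmax probability, so the hardest (most similar) negatives contribute the largest gradients:

\[
\frac{\partial \mathcal{L}}{\partial s(z, z^-_i)} = \frac{\exp\bigl(s(z, z^-_i)\bigr)}{\exp\bigl(s(z, z^+)\bigr) + \sum_{j=1}^{n} \exp\bigl(s(z, z^-_j)\bigr)}
\]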
Many modern contrastive learning techniques build off of the combination of cosine similarity and cross entropy losses. However, few papers have explored changing similarity functions and losses outside of the context of a more complex model.
Koishekenov et al. propose the negative arc length similarity function discussed above as a drop-in replacement for cosine similarity in contrastive learning.
We utilize the methodology of Feeney and Hughes, who introduced the SINCERE loss, to train supervised contrastive models on CIFAR-10 and CIFAR-100 with either cosine similarity or negative arc length similarity, and we evaluate the learned embeddings with k-nearest neighbor (kNN) classification on the test set.
Test accuracy (%) for 1-nearest neighbor (1NN) and 5-nearest neighbor (5NN) classification:

| Similarity | CIFAR-10 1NN | CIFAR-10 5NN | CIFAR-100 1NN | CIFAR-100 5NN |
| --- | --- | --- | --- | --- |
| Cosine | 95.88 | 95.91 | 76.23 | 76.13 |
| Negative Arc Length | 95.66 | 95.65 | 75.81 | 76.41 |
We find no statistically significant difference between the two similarity functions based on the 95% confidence interval of the accuracy difference.
We also visualize the learned embedding space for each CIFAR-10 model. For each test set image, the similarity value is plotted for the closest training set image that shares a class (“Target”) and that does not share a class (“Noise”). This visualizes the 1-nearest neighbor decision process. Both similarity functions are plotted for each model, with the title denoting the similarity function used during training.
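A sketch of how such a plot could be produced, assuming arrays of unit-normalized train/test embeddings and integer labels (the array and function names are illustrative):

```python
import numpy as np

def target_and_noise_similarities(test_z, test_y, train_z, train_y):
    """For each test embedding, return the cosine similarity of the closest
    same-class training embedding ("target") and the closest different-class
    training embedding ("noise"). Assumes unit-normalized embeddings, so dot
    products equal cosine similarities."""
    sims = test_z @ train_z.T                       # (num_test, num_train)
    same_class = test_y[:, None] == train_y[None, :]
    target = np.where(same_class, sims, -np.inf).max(axis=1)
    noise = np.where(~same_class, sims, -np.inf).max(axis=1)
    return target, noise

# Tiny example with random unit-norm embeddings and labels.
rng = np.random.default_rng(0)
train_z = rng.normal(size=(100, 16)); train_z /= np.linalg.norm(train_z, axis=1, keepdims=True)
test_z = rng.normal(size=(20, 16)); test_z /= np.linalg.norm(test_z, axis=1, keepdims=True)
train_y, test_y = rng.integers(0, 10, 100), rng.integers(0, 10, 20)
target, noise = target_and_noise_similarities(test_z, test_y, train_z, train_y)
# The histograms in the figures would then be plotted from `target` and `noise`,
# e.g. with matplotlib's plt.hist, split by each test image's class.
```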
The model trained with cosine similarity maximizes the similarity to target images well. There are a small number of noise images with near-maximal similarity, but the majority are below 0.3 cosine similarity. Interestingly, the multiple peaks in the noise similarity reflect the fact that individual classes have different modes in their noise histograms.
The model trained with negative arc length similarity does a better job of forcing target similarity values very close to 1, but also has a notable number of target similarities near 0.5. The noise distribution again reflects the fact that individual classes have different modes in their noise histograms, but in this case the modes are spread across more similarity values. Notably, the noise peak for the horse class is very close to the maximum similarity due to a high similarity to the dog class, although these noise similarities are still separated enough from the target similarities to not affect accuracy.
The choice of similarity function clearly has an effect on the learned embedding space despite the lack of statistically significant changes in accuracy. The cosine similarity histogram most cleanly aligns with the intuition that contrastive losses should be maximizing and minimizing similarities. In contrast, the negative arc length similarity histogram suggests that similarity minimization is sacrificed for very consistent maximization, producing small differences in similarity between some target classes and noise examples. I hypothesize that this change in behavior arises from the difference in similarity function behavior at small angles described in Koishekenov et al.
These differences in the learned embedding spaces could affect performance on downstream tasks such as transfer learning. I hypothesize that the larger difference between target and noise similarities seen in the cosine similarity model would improve transfer learning performance, similar to the improvement of SINCERE over SupCon loss reported in Feeney and Hughes.