The Mathematics of Latent Prediction & Collapse in SSL
Overview
This note develops the mathematical theory underlying Self-Supervised Learning, with a focus on predictive SSL — methods that learn representations by predicting one view from another in latent space. We cover:
- What we are actually estimating and why it is fundamentally difficult
- The information-theoretic foundation of SSL objectives
- The collapse problem: complete, dimensional, and its spectral characterization
- A catalog of mathematical remedies (exact loss functions and why they work)
- JEPA's mathematical distinctiveness
- Open theoretical questions
Part 1: The Fundamental Problem Setup
1.1 What Are We Estimating?
Let $\mathcal{X}$ be the input space (images) and let $\mathcal{Z} \subseteq \mathbb{R}^d$ be a latent representation space. SSL learns an encoder $f_\theta : \mathcal{X} \to \mathcal{Z}$ such that $z = f_\theta(x)$ captures semantic content (information about objects, scenes, and their relationships) while being invariant to nuisance factors (lighting, viewpoint, texture variations).
In predictive SSL specifically, we have two views $x_1$ and $x_2$ of the same underlying data point (e.g., two crops of the same image, two frames of the same video, two blocks of the same image). The goal is to learn representations $z_1 = f_\theta(x_1)$ and $z_2 = f_\theta(x_2)$ that are predictable from each other, i.e., there exists a predictor $g_\phi$ such that $g_\phi(z_1) \approx z_2$.
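Formally: we want to find $f_\theta$ such that the conditional expectation $\mathbb{E}[f_\theta(x_2) \mid x_1]$ preserves the information in $x_1$ about the underlying semantic state $s$, while discarding view-specific noise $\epsilon$:
$$x_i = T(s, \epsilon_i), \qquad i \in \{1, 2\}$$
Here $T$ is a stochastic transformation (data augmentation, cropping, masking) that produces a view $x_i$ from the latent state $s$ and noise $\epsilon_i$. The representation $z_i = f_\theta(x_i)$ should capture $s$ but not $\epsilon_i$.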
1.2 Why Is This Difficult?
Difficulty 1: The invariance-objective contradiction. The natural objective (make $z_1$ and $z_2$ close) has a trivial minimum: make $f_\theta$ constant. This is the collapse problem.
Difficulty 2: What should be invariant vs. what should be preserved? There is no ground-truth signal telling us which variations are semantic (should be preserved) and which are nuisances (should be discarded). Data augmentations encode human assumptions about this.
Difficulty 3: The abstraction hierarchy. Semantics exist at multiple scales — a car wheel and a car are both "semantic" but at different levels. No single objective automatically discovers this hierarchy.
Difficulty 4: The noise-information trade-off. To discard noise, the representation must lose information. But how much and what kind? Too much information loss → collapse. Too little → overfitting to nuisance.
Difficulty 5: The evaluation gap. SSL objectives optimize for a surrogate (contrastive, reconstruction, prediction) but we evaluate on downstream tasks (classification, detection). The gap between surrogate and downstream is not formally characterized.
Part 2: Information-Theoretic Foundations
2.1 InfoNCE and Mutual Information Lower Bound
Source: Representation Learning with Contrastive Predictive Coding (van den Oord et al., 2018) — arXiv:1807.03748
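The InfoNCE loss for a set $X = \{x_1, \dots, x_N\}$ of $N$ samples with one positive $x^+$ and $N-1$ negatives:
$$\mathcal{L}_N = -\mathbb{E}_X\left[\log \frac{h(x^+, c)}{\sum_{x_j \in X} h(x_j, c)}\right]$$
where $c$ is a context (e.g., in CPC, this is the representation of the past), and $h(x, c) > 0$ is a "critic" scoring compatibility.
Key theoretical result: The optimal critic satisfies:
$$h^*(x, c) \propto \frac{p(x \mid c)}{p(x)}$$
which is a density ratio. Substituting this back:
$$\mathcal{L}_N^{\text{opt}} = \mathbb{E}_X\left[\log\left(1 + \frac{p(x^+)}{p(x^+ \mid c)} \sum_{x_j \ne x^+} \frac{p(x_j \mid c)}{p(x_j)}\right)\right] \approx \mathbb{E}_X\left[\log\left(1 + \frac{p(x^+)}{p(x^+ \mid c)}\,(N-1)\right)\right]$$
This yields a lower bound on mutual information:
$$I(x; c) \ge \log N - \mathcal{L}_N$$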
Proof sketch:
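- Let $x^+$ be the positive sample and $X$ be the set containing 1 positive + $N-1$ negatives, all drawn independently from $p(x)$ except the positive, which is drawn from $p(x \mid c)$.
- InfoNCE is the cross-entropy of an $N$-way classification problem: the critic must identify the positive among the $N$ candidates. For the optimal critic, each negative's density ratio $p(x_j \mid c)/p(x_j)$ has expectation 1 under $p(x)$, so the sum over the $N-1$ negatives is approximately $N-1$.
- Jensen's inequality applied to the concave $\log$ gives the bound.
Important caveat: The bound tightens as $N \to \infty$, but for finite $N$ it is limited to $\log N$.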
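To make the $\log N$ ceiling concrete, here is a minimal NumPy sketch of the InfoNCE loss and the resulting $\log N - \mathcal{L}_N$ estimate. It assumes the critic outputs are supplied as a matrix of log-scores with the positive in column 0; the names `info_nce` and `mi_lower_bound` are illustrative and do not come from the CPC paper.

```python
import numpy as np

def info_nce(scores: np.ndarray) -> float:
    """InfoNCE loss for a (B, N) matrix of critic log-scores.

    Assumption: column 0 holds the positive's log-score and columns
    1..N-1 hold the negatives. The loss is the cross-entropy of picking
    column 0 among the N candidates.
    """
    # Numerically stable log-sum-exp over the N candidates of each row.
    m = scores.max(axis=1, keepdims=True)
    log_z = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    return float(np.mean(log_z - scores[:, 0]))

def mi_lower_bound(scores: np.ndarray) -> float:
    """The estimate log N - L_N, in nats."""
    n_candidates = scores.shape[1]
    return float(np.log(n_candidates) - info_nce(scores))

N = 256
# A critic that cannot tell candidates apart: every log-score is equal.
uninformative = np.zeros((8, N))
# A critic that scores the positive far above every negative.
confident = np.zeros((8, N))
confident[:, 0] = 50.0

print("uninformative:", info_nce(uninformative), mi_lower_bound(uninformative))
print("confident:    ", info_nce(confident), mi_lower_bound(confident))
print("ceiling log N:", np.log(N))
```

Even a perfectly confident critic cannot push the estimate above $\log N$, which is the saturation effect described in the caveat.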
2.2 Alignment and Uniformity Decomposition
Source: Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere (Wang & Isola, 2020) — arXiv:2005.10242
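For normalized representations on the unit hypersphere (i.e., $\lVert f_\theta(x) \rVert_2 = 1$), as the number of negatives grows ($N \to \infty$), the InfoNCE loss with temperature $\tau$ decomposes into two terms:
$$\mathcal{L}_{\text{InfoNCE}} - \log N \;\longrightarrow\; \underbrace{-\frac{1}{\tau}\,\mathbb{E}_{(x, x^+)}\left[f_\theta(x)^\top f_\theta(x^+)\right]}_{\text{alignment}} + \underbrace{\mathbb{E}_{x}\left[\log \mathbb{E}_{x^-}\left[e^{f_\theta(x)^\top f_\theta(x^-)/\tau}\right]\right]}_{\text{uniformity}}$$
Where this comes from: For unit vectors, $\lVert u - v \rVert_2^2 = 2 - 2u^\top v$, so maximizing the inner product of a positive pair is the same as minimizing its squared distance. As $N \to \infty$, the sum over negatives converges to an expectation, and $\log N$ becomes a constant additive shift.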
Alignment encourages positive pairs to have similar representations. Uniformity encourages the overall distribution of representations to be uniform on the hypersphere — this is the force that prevents collapse.
Properties:
- Alignment alone is minimized by collapse (all points → same location → alignment = 0)
- Uniformity alone is minimized by a uniform distribution on $\mathcal{S}^{d-1}$
- Together, they balance: representations of the same object should be close, but different objects should be well-separated
Empirical finding: Directly optimizing alignment + uniformity (without InfoNCE) achieves comparable or better downstream performance than full contrastive learning in the reported experiments, which suggests these are the essential properties.
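The two quantities can be measured directly on a batch of embeddings. The sketch below follows the commonly used definitions (mean positive-pair distance raised to a power `alpha`, and the log of the mean Gaussian potential with scale `t`); the default values `alpha=2` and `t=2` and the helper names are assumptions made for illustration.

```python
import numpy as np

def l2_normalize(z: np.ndarray) -> np.ndarray:
    """Project each row onto the unit hypersphere."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def alignment(z1: np.ndarray, z2: np.ndarray, alpha: float = 2.0) -> float:
    """Mean distance**alpha between positive pairs (row i of z1 and z2)."""
    return float(np.mean(np.linalg.norm(z1 - z2, axis=1) ** alpha))

def uniformity(z: np.ndarray, t: float = 2.0) -> float:
    """Log of the mean Gaussian potential over all distinct pairs."""
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(len(z), k=1)       # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))

rng = np.random.default_rng(0)
n, d = 256, 16

# Collapsed encoder: every sample maps to the same unit vector.
collapsed = np.tile(l2_normalize(np.ones((1, d))), (n, 1))

# Spread-out encoder: directions drawn uniformly on the sphere,
# with a slightly perturbed second view as the positive.
spread = l2_normalize(rng.normal(size=(n, d)))
spread_pos = l2_normalize(spread + 0.05 * rng.normal(size=(n, d)))

print("collapsed:", alignment(collapsed, collapsed), uniformity(collapsed))
print("spread:   ", alignment(spread, spread_pos), uniformity(spread))
```

The collapsed batch has perfect alignment but the worst possible uniformity value (zero), while the spread-out batch trades a small alignment cost for a much lower uniformity value.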
2.3 The Tschannen Critique: Is InfoNCE Really Maximizing MI?
Source: On Mutual Information Maximization for Representation Learning (Tschannen et al., 2019) — arXiv:1907.13625
Three fundamental problems with the MI-maximization interpretation:
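1. Bound saturation. For $N$ candidates (one positive plus $N-1$ negatives), the bound is $I(x; c) \ge \log N - \mathcal{L}_N$. Since $\mathcal{L}_N \ge 0$, the maximum lower bound is $\log N$. With $N = 256$ (a batch size used in SimCLR experiments), $\log 256 \approx 5.55$ nats. The actual MI could be much higher, in which case the bound is uninformative.
2. The critic, not the encoder, is optimized. The bound is a statement about the critic $h$: it holds for any $h$ and is tight only at the optimal one. It is not a direct function of the encoder $f_\theta$, whose representations influence the bound only through the critic. This creates a gap.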
3. Architecture-driven success. The paper argues that contrastive learning succeeds because of:
- The data augmentation design (which defines the prediction task)
- The projection head (which peels off unwanted information)
- The embedding geometry (hypersphere concentration)
...not because of MI maximization. The MI bound is loose and the actual optimization dynamics are better understood through the alignment-uniformity lens.
2.4 Noise Contrastive Estimation (NCE) Connection
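InfoNCE is closely related to Noise Contrastive Estimation. In NCE, a binary classifier separates data samples from samples of a known noise distribution $p_n$:
$$P(D = 1 \mid x) = \frac{p_\theta(x)}{p_\theta(x) + k\,p_n(x)}$$
where $k$ is the number of noise samples per data sample.
InfoNCE differs by:
- Learned density ratio: the critic learns $p(x \mid c)/p(x)$ rather than assuming a parametric form
- Self-normalization: the denominator over all $N$ candidates normalizes the scores
- Multi-class formulation: $N$-way classification rather than binary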
The key distinction: NCE is designed for density estimation (the critic is the output), while InfoNCE is designed for representation learning (the critic is discarded after training, only the encoder is kept).
Part 3: The Collapse Problem — Mathematical Characterization
3.1 Complete Collapse
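Definition: All inputs map to the same constant vector $z_0$:
$$f_\theta(x) = z_0 \quad \text{for all } x \in \mathcal{X}$$
Properties:
- Covariance matrix: $\Sigma = \operatorname{Cov}[f_\theta(x)] = 0$ (zero matrix)
- All eigenvalues are zero: $\lambda_1 = \dots = \lambda_d = 0$
- Rank: $\operatorname{rank}(\Sigma) = 0$
- Entropy: $H(z) = 0$ (the representation is a point mass)
- Loss under $\mathcal{L}_{\text{inv}} = \mathbb{E}\,\lVert f_\theta(x_1) - f_\theta(x_2) \rVert_2^2$: $\mathcal{L}_{\text{inv}} = 0$ (global minimum)
Why it happens: The invariance objective alone attains its global minimum of zero at $f_\theta \equiv z_0$. There is no gradient signal pushing representations apart.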
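Linear network dynamics: For a linear encoder $f_\theta(x) = Wx$ trained on the invariance loss $\tfrac{1}{2}\,\mathbb{E}\,\lVert W(x_1 - x_2) \rVert_2^2$, gradient flow gives:
$$\frac{dW}{dt} = -W\,\Sigma_\Delta, \qquad \Sigma_\Delta = \mathbb{E}\left[(x_1 - x_2)(x_1 - x_2)^\top\right]$$
where $\Sigma_\Delta$ is the positive-semidefinite difference covariance. The solution:
$$W(t) = W(0)\,e^{-\Sigma_\Delta t}$$
When $\Sigma_\Delta$ is positive definite (the views differ along every input direction), every singular value of $W$ decays exponentially to zero; components of $W$ in the null space of $\Sigma_\Delta$ are left unchanged.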
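The decay can be checked numerically. The sketch below integrates the gradient flow of the pure invariance loss for a linear encoder with a simple Euler scheme; the synthetic data, the noise level, the step size `lr`, and the dimensions are all assumptions chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n = 8, 4, 2000

# Two noisy views of the same underlying samples (synthetic assumption).
x = rng.normal(size=(n, d_in))
x1 = x + 0.5 * rng.normal(size=(n, d_in))
x2 = x + 0.5 * rng.normal(size=(n, d_in))

# Empirical difference covariance; PSD by construction.
diff = x1 - x2
sigma_delta = diff.T @ diff / n

# Linear encoder z = W x, trained on 0.5 * E||W (x1 - x2)||^2 only.
W = rng.normal(size=(d_out, d_in))
lr = 0.05
history = []                       # singular values of W before each step
for step in range(400):
    history.append(np.linalg.svd(W, compute_uv=False))
    W = W - lr * W @ sigma_delta   # Euler step of dW/dt = -W Sigma_Delta

print("initial singular values:", np.round(history[0], 3))
print("final singular values:  ", history[-1])
```

With isotropic view noise, `sigma_delta` is full rank, so every singular value shrinks geometrically: the encoder drifts to the constant map without any force opposing it.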
3.2 Dimensional Collapse
Source: Understanding Dimensional Collapse in Contrastive Self-supervised Learning (Jing et al., 2021) — arXiv:2110.09348
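Definition: Representations span a low-dimensional subspace but are not all identical. The empirical covariance $\Sigma$ has rank $r$ where $0 < r < d$ (often $r \ll d$).
Properties:
- Eigenvalues: $\lambda_1 \ge \dots \ge \lambda_r > 0$, $\lambda_{r+1} = \dots = \lambda_d \approx 0$
- Effective rank: $\operatorname{erank}(\Sigma) = \exp\left(-\sum_{i=1}^{d} p_i \log p_i\right)$ where $p_i = \lambda_i / \sum_{j} \lambda_j$
- Under collapse: $\operatorname{erank}(\Sigma) \ll d$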
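Effective rank is easy to monitor during training. The following sketch computes the entropy-based effective rank of the embedding covariance; the function name `effective_rank` and the stabilizer `eps` are illustrative assumptions.

```python
import numpy as np

def effective_rank(z: np.ndarray, eps: float = 1e-12) -> float:
    """exp(entropy) of the normalized covariance spectrum of z (n, d)."""
    zc = z - z.mean(axis=0, keepdims=True)
    cov = zc.T @ zc / (len(z) - 1)
    # eigvalsh can return tiny negative values for a PSD matrix; clip them.
    lam = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = lam / (lam.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
full = rng.normal(size=(4096, 16))   # all 16 dimensions carry variance

low = full.copy()
low[:, 2:] = 0.0                     # only 2 dimensions carry variance

print("full:", effective_rank(full))
print("dimensionally collapsed:", effective_rank(low))
```

One caveat of this estimator: under complete collapse all eigenvalues are zero and the function returns 1 rather than 0, so it is best read alongside the total variance.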
Spectral dynamics under InfoNCE: Consider how each eigenvalue $\lambda_i$ of the embedding covariance evolves during training.
Key insight: The exponential creates a rich-get-richer dynamic. Larger eigenvalues get stronger positive gradients (via the softmax numerator), while the negative-pair repulsion is approximately uniform across dimensions. This amplifies dominant modes and suppresses minor ones.
Three phases of collapse:
- Alignment phase: All eigenvalues grow as representations spread out from initialization.
- Spectral separation: Top eigenvalues keep growing; smaller ones plateau or begin to shrink.
- Dimensional collapse: Effective rank plummets — the model "gives up" on minor dimensions and concentrates information in a low-dimensional subspace.
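Gradient decomposition for a linear encoder $f_\theta(x) = Wx$: the gradient with respect to $W$ splits into an attraction term from positive pairs and a repulsion term from negative pairs, with each negative weighted by its softmax weight $\alpha_{ij}$.
The weight $\alpha_{ij}$ concentrates on the nearest negatives, which lie along the dominant eigendirections. Thus negatives push hardest along directions where representations already have high variance, further amplifying the imbalance.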
3.3 Why Invariance Alone Collapses (BUT Non-Contrastive Methods Don't Necessarily)
Source: Understanding Self-supervised Learning Dynamics without Contrastive Pairs (Tian et al., 2021) — arXiv:2102.06810
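The core result: in a simplified linear network setting, the asymmetric architecture (encoder + predictor + stop-gradient) of methods like BYOL and SimSiam creates dynamics that admit stable non-collapsed fixed points, so collapse is no longer the inevitable outcome.
Setup: Online encoder $W$, predictor $W_p$, target encoder $W_a$ (stop-gradient, optionally EMA).
Gradient flow (with weight decay $\eta$, predictor learning-rate ratio $\alpha_p$, and EMA rate $\beta$):
$$\dot{W}_p = \alpha_p\left(-W_p W (X + X') + W_a X\right) W^\top - \eta W_p$$
$$\dot{W} = W_p^\top\left(-W_p W (X + X') + W_a X\right) - \eta W$$
$$\dot{W}_a = \beta\,(-W_a + W)$$
where $X = \mathbb{E}[\bar{x}\bar{x}^\top]$ is the correlation of the augmentation-averaged input $\bar{x}$, and $X' = \mathbb{E}_x\left[\mathbb{V}_{x' \mid x}[x']\right]$ is the augmentation covariance averaged over the data.
Why collapse is avoided: Setting $\dot{W}_p = 0$ with $\eta = 0$ shows that the predictor learns the least-squares map from online to target features (assuming the inverse exists):
$$W_p = W_a X W^\top \left(W (X + X') W^\top\right)^{-1}$$
This creates an asymmetry between the two branches: the update of $W$ is routed through $W_p^\top$, so it is no longer the plain gradient of the symmetric invariance loss that drives $W$ to zero.
Per-eigenmode result: When the cross-correlation $X$ is full rank, the analysis decouples into independent eigenmodes, and under the paper's assumptions each mode can converge to a nonzero value, giving a full-rank $W$. In that regime neither complete nor dimensional collapse occurs.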
Three necessary conditions for non-collapse:
- Predictor present: without $W_p$, the loss becomes symmetric and collapse is a fixed point
- Stop-gradient applied: without it, the target chases the online encoder $W$ and both collapse
- Cross-correlation $X$ is full rank: if views are too similar, $X$ is low-rank and collapse occurs despite asymmetry
Part 4: Mathematical Remedies for Collapse — Catalog with Exact Loss Functions
4.1 SimCLR — Temperature-Controlled Negative Repulsion
Source: arXiv:2002.05709
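Loss (NT-Xent), for a batch of $N$ images giving $2N$ augmented views, with $(i, j)$ a positive pair:
$$\ell_{i,j} = -\log \frac{\exp(\operatorname{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \ne i]} \exp(\operatorname{sim}(z_i, z_k)/\tau)}$$
where $\operatorname{sim}(u, v) = u^\top v / (\lVert u \rVert \lVert v \rVert)$, $z_i = g(f_\theta(x_i))$ ($g$ is the projection head), and $\tau$ is temperature.
Why it prevents collapse: The denominator forces every sample to be distinguishable from every other sample. At collapse ($\operatorname{sim}(z_i, z_k) = 1$ for all pairs):
$$\ell_{i,j} = -\log \frac{e^{1/\tau}}{(2N - 1)\,e^{1/\tau}} = \log(2N - 1)$$
This equals the loss of a uniform guess over the $2N - 1$ candidates, so collapse is far from a minimum: any update that separates an anchor from its negatives while keeping the positive close lowers the loss.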
Role of temperature :
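- $\tau \to 0$: Softmax → argmax. Hardest negative dominates → strongest uniformity.
- $\tau \to \infty$: All exponentiated similarities ≈ 1 → no discrimination.
- Intermediate $\tau$ (values between roughly 0.07 and 0.5 are common in the contrastive literature): balances alignment strength with uniformity.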
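A minimal NumPy sketch of NT-Xent makes the collapse value easy to verify. It assumes the two views arrive as row-aligned arrays `z_a` and `z_b`, so that row `i` of one is the positive of row `i` of the other; the function name and the default `tau=0.5` are illustrative choices.

```python
import numpy as np

def nt_xent(z_a: np.ndarray, z_b: np.ndarray, tau: float = 0.5) -> float:
    """NT-Xent averaged over all 2N anchors; z_a[i] and z_b[i] are positives."""
    n = len(z_a)
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)         # drop k = i from the denominator
    # Index of each anchor's positive: i <-> i + n.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    m = sim.max(axis=1, keepdims=True)
    log_z = m[:, 0] + np.log(np.exp(sim - m).sum(axis=1))
    return float(np.mean(log_z - sim[np.arange(2 * n), pos]))

rng = np.random.default_rng(0)
n, d = 64, 32

collapsed = np.ones((n, d))                # every embedding identical
spread = rng.normal(size=(n, d))           # distinct embeddings, views agree

loss_collapsed = nt_xent(collapsed, collapsed)
loss_spread = nt_xent(spread, spread)
print(loss_collapsed, np.log(2 * n - 1))   # collapse gives log(2N - 1)
print(loss_spread)                         # separated embeddings score lower
```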
4.2 VICReg — Variance + Invariance + Covariance
Source: arXiv:2105.04906
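Loss, for batches of embeddings $Z$ and $Z'$ of the two views:
$$\mathcal{L}(Z, Z') = \lambda\, s(Z, Z') + \mu\,[v(Z) + v(Z')] + \nu\,[c(Z) + c(Z')]$$
$$s(Z, Z') = \frac{1}{n}\sum_{i=1}^{n} \lVert z_i - z'_i \rVert_2^2, \qquad v(Z) = \frac{1}{d}\sum_{j=1}^{d} \max\left(0,\, \gamma - S(z^j, \epsilon)\right), \qquad c(Z) = \frac{1}{d}\sum_{i \ne j} [C(Z)]_{i,j}^2$$
where $S(x, \epsilon) = \sqrt{\operatorname{Var}(x) + \epsilon}$, $\gamma = 1$ (target std), $z^j$ is the $j$-th feature across the batch, and $C(Z)$ is the covariance matrix of the embeddings.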
Why it prevents collapse — three distinct forces:
- Invariance pulls positive pairs together (would alone collapse).
- Variance hinge forces each feature dimension's standard deviation to be at least $\gamma$. This directly prevents dimensional collapse: if one dimension collapses to a constant, that dimension's std = 0, triggering the hinge loss.
- Covariance penalizes off-diagonal entries of the covariance matrix. This decorrelates features, preventing redundancy and encouraging each dimension to capture distinct information.
Connection to uniform hypersphere: For uniformly distributed points on $\mathcal{S}^{d-1}$, the covariance is $\frac{1}{d} I_d$: diagonal with equal variance. VICReg pushes toward the same kind of structure (decorrelated features with comparable variance), albeit in $\mathbb{R}^d$ rather than on the hypersphere.
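The three forces can be computed separately, which is useful for seeing which one reacts to collapse. This is a minimal NumPy sketch of the VICReg terms; the helper names are illustrative, and the default weights (`lam=25`, `mu=25`, `nu=1`) and `eps=1e-4` are assumed values rather than a tuned configuration.

```python
import numpy as np

def vicreg_terms(z_a, z_b, gamma=1.0, eps=1e-4):
    """Return the (invariance, variance, covariance) terms for two views."""
    n, d = z_a.shape

    # Invariance: mean squared distance between paired embeddings.
    inv = np.mean(np.sum((z_a - z_b) ** 2, axis=1))

    def var_term(z):
        # Hinge on the per-dimension std: active when std < gamma.
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, gamma - std))

    def cov_term(z):
        # Sum of squared off-diagonal covariance entries, scaled by 1/d.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off = cov - np.diag(np.diag(cov))
        return np.sum(off ** 2) / d

    return inv, var_term(z_a) + var_term(z_b), cov_term(z_a) + cov_term(z_b)

def vicreg_loss(z_a, z_b, lam=25.0, mu=25.0, nu=1.0):
    inv, var, cov = vicreg_terms(z_a, z_b)
    return float(lam * inv + mu * var + nu * cov)

rng = np.random.default_rng(0)
collapsed = np.ones((64, 8))                       # constant embeddings
healthy_a = rng.normal(size=(1024, 8))             # unit-variance features
healthy_b = healthy_a + 0.1 * rng.normal(size=(1024, 8))

print("collapsed:", vicreg_terms(collapsed, collapsed))
print("healthy:  ", vicreg_terms(healthy_a, healthy_b))
```

On the collapsed batch the invariance and covariance terms are zero and only the variance hinge is active, which is the term that supplies the anti-collapse signal.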
4.3 Barlow Twins — Cross-Correlation Minimization
Source: arXiv:2103.03230
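Loss:
$$\mathcal{L}_{BT} = \sum_{i} (1 - C_{ii})^2 + \lambda \sum_{i} \sum_{j \ne i} C_{ij}^2, \qquad C_{ij} = \frac{\sum_b z^A_{b,i}\, z^B_{b,j}}{\sqrt{\sum_b (z^A_{b,i})^2}\,\sqrt{\sum_b (z^B_{b,j})^2}}$$
where $z^A$ and $z^B$ are the batch-centered embeddings of the two views and $b$ indexes samples in the batch.
Why it prevents collapse (redundancy reduction):
- Diagonal ($C_{ii} \to 1$): Each feature dimension must be perfectly correlated across the two views (invariance for that dimension).
- Off-diagonal ($C_{ij} \to 0$, $i \ne j$): Different dimensions must be decorrelated (no redundancy).
Mechanism: At complete collapse every feature is constant across the batch. After mean-centering, each feature is identically zero, so the normalized cross-correlation is degenerate ($0/0$); with any numerical stabilizer in the denominator, $C_{ii} \approx 0$ and the diagonal term contributes about $d$ to the loss, far from its minimum. Driving $C_{ii} \to 1$ therefore requires each feature to vary across the batch. The off-diagonal term then guards against dimensional collapse: if two dimensions carry the same signal, $C_{ij} \approx \pm 1$ for that pair, and the penalty stays active until the network spreads information across decorrelated dimensions.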
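A short NumPy sketch of the Barlow Twins objective shows how the two terms respond to collapse. The function name, the stabilizer `eps` added to the per-feature standard deviation, and the default `lam=5e-3` are assumptions made for illustration.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3, eps=1e-6):
    """Return (loss, C) where C is the d x d cross-correlation matrix."""
    n, d = z_a.shape

    def standardize(z):
        # Center and scale each feature over the batch; eps avoids 0/0.
        return (z - z.mean(axis=0)) / (z.std(axis=0) + eps)

    c = standardize(z_a).T @ standardize(z_b) / n
    on_diag = np.sum((1.0 - np.diag(c)) ** 2)            # invariance term
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)  # redundancy term
    return float(on_diag + lam * off_diag), c

rng = np.random.default_rng(0)
n, d = 2048, 16

collapsed = np.ones((n, d))            # every feature constant
healthy = rng.normal(size=(n, d))      # independent features, views agree

loss_collapsed, c_collapsed = barlow_twins_loss(collapsed, collapsed)
loss_healthy, c_healthy = barlow_twins_loss(healthy, healthy)
print(loss_collapsed)                  # equals d: every C_ii is 0
print(loss_healthy)                    # close to 0
```

With the stabilizer in place, constant features standardize to zero, so the whole matrix $C$ is zero and the diagonal term alone costs $d$.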
4.4 BYOL — Predictor Asymmetry + EMA Target
Source: arXiv:2006.07733
Architecture:
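- Online: encoder $f_\theta$, projector $g_\theta$, predictor $q_\theta$
- Target: encoder $f_\xi$, projector $g_\xi$, updated by EMA: $\xi \leftarrow \tau \xi + (1 - \tau)\theta$
- No gradients flow through the target (stop-gradient)
Loss (symmetrized), with $z_\theta = g_\theta(f_\theta(v))$ and $z'_\xi = g_\xi(f_\xi(v'))$ for views $v, v'$, and bars denoting L2 normalization:
$$\mathcal{L}_{\text{BYOL}} = \lVert \bar{q}_\theta(z_\theta) - \bar{z}'_\xi \rVert_2^2 + \lVert \bar{q}_\theta(z'_\theta) - \bar{z}_\xi \rVert_2^2$$
After L2 normalization, each term is equivalent to:
$$\lVert \bar{q}_\theta(z_\theta) - \bar{z}'_\xi \rVert_2^2 = 2 - 2\,\frac{\langle q_\theta(z_\theta),\, z'_\xi \rangle}{\lVert q_\theta(z_\theta) \rVert_2\, \lVert z'_\xi \rVert_2}$$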
Why it prevents collapse (the asymmetry argument):
- Without a predictor: online and target would be symmetric → MSE has trivial collapse solution.
- With a predictor: the online network must learn to predict the target's representation. The predictor is a learnable mapping that maps the online representation to the target representation space.
- EMA target: provides stable, slowly-changing regression targets. If collapse tries to happen, the target (from an older, stable encoder) still produces diverse representations, creating a persistent error signal.
- Formal result (Tian et al., 2021): Without a predictor, variance goes to zero. The predictor forces information-preserving representations because it must learn a non-trivial mapping.
In the PMAX/JEPA context, BYOL operates in the regime with no explicit anti-collapse regularizer, relying entirely on architectural asymmetry. This is the "hard" regime that BYOL was the first to make work at ImageNet scale.
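The two ingredients of the BYOL update, the normalized regression loss and the EMA target, can be sketched in a few lines of NumPy. Parameters are represented as a plain dictionary of arrays, which is an assumption made for illustration; the function names and the default `tau=0.996` are likewise assumed, and no gradient computation is shown.

```python
import numpy as np

def byol_regression_loss(pred: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Per-sample squared distance between L2-normalized rows.

    Algebraically equal to 2 - 2 * cosine_similarity(pred, target).
    The target is treated as a constant (stop-gradient) during training.
    """
    p = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    return np.sum((p - t) ** 2, axis=1)

def ema_update(target_params: dict, online_params: dict, tau: float = 0.996) -> dict:
    """xi <- tau * xi + (1 - tau) * theta, applied to every parameter array."""
    return {
        name: tau * target_params[name] + (1.0 - tau) * online_params[name]
        for name in target_params
    }

rng = np.random.default_rng(0)
pred = rng.normal(size=(4, 8))
target = rng.normal(size=(4, 8))

cosine = np.sum(pred * target, axis=1) / (
    np.linalg.norm(pred, axis=1) * np.linalg.norm(target, axis=1)
)
loss = byol_regression_loss(pred, target)
print(np.allclose(loss, 2.0 - 2.0 * cosine))

online = {"w": np.ones((2, 2))}
target_net = {"w": np.zeros((2, 2))}
target_net = ema_update(target_net, online)
print(target_net["w"])                 # moved 0.4% of the way toward online
```

Because `tau` is close to 1, the target network trails the online network slowly, which is what makes it a stable regression target.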
4.5 SimSiam — Stop-Gradient as Implicit EM
Source: arXiv:2011.10566
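Loss:
$$\mathcal{L} = \tfrac{1}{2}\, D(p_1, \operatorname{sg}(z_2)) + \tfrac{1}{2}\, D(p_2, \operatorname{sg}(z_1))$$
where $z_i = f_\theta(x_i)$ is the encoder output for view $i$, $p_i = q(z_i)$ is the predictor output, $\operatorname{sg}$ = stop-gradient, and $D(p, z) = -\frac{p}{\lVert p \rVert_2} \cdot \frac{z}{\lVert z \rVert_2}$ (negative cosine similarity).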
Why it prevents collapse — implicit EM interpretation:
Without stop-gradient, collapse is immediate — both branches predict each other's constant vector, loss = -1. With stop-gradient, SimSiam implements alternating optimization:
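- E-step: Given encoder parameters $\theta$, compute targets $\eta_x = \mathbb{E}_T[f_\theta(T(x))]$ (expected representation under augmentations $T$). Approximated by a single augmentation, $\eta_x \approx f_\theta(T'(x))$.
- M-step: Given targets, update $\theta$ to minimize MSE: $\theta \leftarrow \arg\min_\theta \mathbb{E}_{x, T}\,\lVert f_\theta(T(x)) - \eta_x \rVert_2^2$.
Why collapse is unstable: At exact collapse, the MSE form of the loss is 0 (a stationary point). However, under this interpretation, any perturbation away from collapse produces E-step targets that differ across samples. The M-step then tries to match these different targets, pulling representations apart. In this view collapse is a stationary point but not an attractor: the alternating dynamics tend to preserve or increase diversity.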
4.6 MoCo — Momentum Encoder + Large Queue
Source: arXiv:1911.05722
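Loss (InfoNCE with queue):
$$\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\exp(q \cdot k_+ / \tau) + \sum_{i=1}^{K} \exp(q \cdot k_i / \tau)}$$
where $q = f_q(x^q)$, $k_+ = f_k(x^k)$ (positive), $\{k_i\}_{i=1}^{K}$ are negatives from the queue.
Momentum update: $\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q$, with $m = 0.999$.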
Why it prevents collapse:
- Momentum encoder provides near-consistent targets across training steps. If the query encoder collapses, the positive key from the stable momentum encoder still produces a diverse target — maintaining high loss.
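- Large queue ($K = 65{,}536 \gg$ batch size) provides abundant negatives. At collapse, $q \cdot k_i = 1$ for all $i$, so $\mathcal{L}_q = \log(K + 1) \approx 11.09$, a very high loss that pushes the encoder away from collapse.
- Queue decouples dictionary from batch size: unlike SimCLR's limit of $2(N - 1)$ in-batch negatives, MoCo provides temporal diversity from past iterations.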
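The queue-based loss and the FIFO dictionary update can be sketched as follows. The function names are illustrative, embeddings are assumed to be L2-normalized already, and a small queue (`K = 1024`) is used here only to keep the example light; the collapse value $\log(K + 1)$ holds for any $K$.

```python
import numpy as np

def moco_loss(q, k_pos, queue, tau=0.07):
    """InfoNCE with one positive key per query and K queued negatives."""
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True) / tau   # (B, 1)
    l_neg = q @ queue.T / tau                                # (B, K)
    logits = np.concatenate([l_pos, l_neg], axis=1)          # positive first
    m = logits.max(axis=1, keepdims=True)
    log_z = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float(np.mean(log_z - logits[:, 0]))

def enqueue(queue, keys):
    """FIFO update: drop the oldest len(keys) rows, append the new keys."""
    return np.concatenate([queue[len(keys):], keys], axis=0)

K, d, batch = 1024, 16, 4
unit = np.zeros(d)
unit[0] = 1.0

# Collapse: queries, positive keys and every queued key are the same vector.
q = np.tile(unit, (batch, 1))
queue = np.tile(unit, (K, 1))
collapse_loss = moco_loss(q, q, queue)
print(collapse_loss, np.log(K + 1))

# Dictionary size is unchanged by an update, and the newest keys sit last.
rng = np.random.default_rng(0)
new_keys = rng.normal(size=(batch, d))
new_keys /= np.linalg.norm(new_keys, axis=1, keepdims=True)
queue = enqueue(queue, new_keys)
print(queue.shape)
```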
4.7 Summary: The Collapse Prevention Zoo
| Method | Loss Type | Negatives? | Collapse Prevention | Math Mechanism |
|---|---|---|---|---|
| SimCLR | NT-Xent | Yes (batch) | Negative repulsion | Softmax uniformity + temperature |
| MoCo | InfoNCE + Queue | Yes (queue) | Large queue + momentum encoder | $K = 65{,}536$ negatives, EMA target stability |
| VICReg | Var+Inv+Cov | No | Variance hinge | $\operatorname{std}(z^j) \ge \gamma$ per dimension |
| Barlow Twins | Cross-correlation | No | Redundancy reduction | Off-diagonal penalty |
| BYOL | MSE + Predictor + EMA | No | Architectural asymmetry | Predictor + stop-gradient breaks symmetry |
| SimSiam | Neg Cosine + Stop-Grad | No | Implicit EM | E-step/M-step alternation; collapse not attractor |
Part 5: JEPA's Mathematical Distinctiveness
5.1 The JEPA Formulation
Source: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (Assran et al., 2023) — arXiv:2301.08243
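JEPA defines a joint-embedding predictive objective. Given an image $x$, sample:
- A context block $x_{\text{ctx}}$ (large, spatially distributed)
- Target blocks $y_1, \dots, y_K$ (smaller, semantically meaningful regions)
The encoder encodes the context: $s_x = f_\theta(x_{\text{ctx}})$. The predictor predicts target representations from the context:
$$\mathcal{L}_{\text{JEPA}} = \frac{1}{K} \sum_{k=1}^{K} \left\lVert g_\phi^{(k)}(s_x) - s_{y_k} \right\rVert_2^2$$
where $s_{y_k} = f_{\bar{\theta}}(y_k)$ is the target representation from a momentum encoder (EMA), and $g_\phi^{(k)}$ are multi-scale prediction heads. (In I-JEPA as published, a single predictor network is applied once per target block, conditioned on positional mask tokens for that block, rather than using separate heads.)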
Key mathematical choices:
- Non-generative — predicts in latent space, not pixel space. No decoder.
- Multi-scale prediction — different predictor heads operate at different block sizes, naturally creating a semantic hierarchy.
- EMA target — prevents the trivial collapse by decoupling the prediction target from the online encoder.
- Block-wise masking — unlike MAE's random per-patch masking, JEPA masks entire semantic blocks.
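Block-wise masking can be sketched on a patch grid as below. This is an illustrative sampler rather than the reference implementation: the helper names are hypothetical, and the grid size, the number of target blocks, and the scale and aspect-ratio ranges are assumed values in the spirit of the I-JEPA setup.

```python
import numpy as np

def sample_block(rng, grid, scale, aspect):
    """Sample a rectangular boolean mask on a (grid x grid) patch lattice.

    scale  : (min, max) fraction of the grid area covered by the block
    aspect : (min, max) height / width ratio (assumed convention)
    """
    area = rng.uniform(*scale) * grid * grid
    ratio = rng.uniform(*aspect)
    h = int(np.clip(np.round(np.sqrt(area * ratio)), 1, grid))
    w = int(np.clip(np.round(np.sqrt(area / ratio)), 1, grid))
    top = rng.integers(0, grid - h + 1)
    left = rng.integers(0, grid - w + 1)
    mask = np.zeros((grid, grid), dtype=bool)
    mask[top:top + h, left:left + w] = True
    return mask

def sample_context_and_targets(rng, grid=14, n_targets=4):
    """Sample target blocks, then a large context block with targets removed."""
    targets = [
        sample_block(rng, grid, scale=(0.15, 0.2), aspect=(0.75, 1.5))
        for _ in range(n_targets)
    ]
    context = sample_block(rng, grid, scale=(0.85, 1.0), aspect=(1.0, 1.0))
    for t in targets:
        context &= ~t      # the context must not see any target patch
    return context, targets

rng = np.random.default_rng(0)
context, targets = sample_context_and_targets(rng)
print("context patches:", int(context.sum()))
print("target patches: ", [int(t.sum()) for t in targets])
```

Removing target patches from the context is what makes the prediction task non-trivial: the predictor cannot simply copy the target's own representation.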
5.2 Why This Avoids Collapse (Theoretical)
JEPA combines three collapse-prevention mechanisms, building on the BYOL/PMAX lineage:
- Architectural asymmetry: predictor + EMA target creates the same dynamics as BYOL (Section 4.4). Without this, the MSE between context and target representations would collapse.
- Multi-task prediction: the multiple prediction heads at different scales create a variety of prediction tasks. This naturally constrains the representation: the context representation must be informative enough to predict targets at multiple scales, preventing it from discarding too much information.
- Spatial masking strategy: target blocks are sampled at large semantic scale (not individual patches). This forces the representation to capture high-level semantics rather than local pixel statistics. The context block being spatially distributed ensures it contains sufficient information for non-trivial prediction.
5.3 Relation to Energy-Based Models
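JEPA can be understood through LeCun's Energy-Based Model (EBM) framework. The energy of a configuration $(x, y)$ is:
$$E(x, y) = \left\lVert g_\phi(f_\theta(x)) - f_{\bar{\theta}}(y) \right\rVert_2^2$$
where $x$ is the context and $y$ is the target block.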
The goal is to learn an energy function that assigns low energy to compatible (context, target) pairs and high energy to incompatible pairs. In JEPA, this is achieved by:
- Pulling down energy for positive pairs (ground-truth context-target combinations)
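- Relying on the EMA target and predictor asymmetry, rather than an explicit push-up term on incompatible pairs, to keep the energy surface from collapsing to a constant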
More broadly, JEPA is an instance of a regularized latent-variable energy model where the latent variable is the representation itself, and regularization comes from the architectural asymmetry.
5.4 Comparison: Predictive vs. Generative vs. Contrastive
| Aspect | Generative (MAE) | Contrastive (SimCLR) | Predictive (JEPA) |
|---|---|---|---|
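| Target | Raw pixels | Embedding of the positive view, contrasted against negatives | Embedding of positive target |
| Loss | $\lVert \hat{x} - x \rVert_2^2$ on masked pixels | NT-Xent / InfoNCE | $\lVert \hat{s}_y - s_y \rVert_2^2$ in latent space |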
| Abstraction level | Low (pixel) | High (embedding) | High (embedding) |
| View generation | Random masking | Hand-crafted augmentations | Spatial block structure |
| Collapse risk | Low (pixel space well-constrained) | Low (negatives push apart) | Moderate (asymmetry + EMA) |
5.5 The Multi-Scale Prediction Hierarchy
A unique mathematical aspect of JEPA is its multi-scale prediction heads. Each predictor predicts the representation of a target block of a specific size. This creates an implicit curriculum:
- Small target blocks: predict fine-grained, local content (textures, edges)
- Large target blocks: predict coarse, global content (object shapes, scene layout)
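Mathematically, for block sizes $b_1 < b_2 < \dots < b_K$:
$$\mathcal{L} = \frac{1}{K} \sum_{k=1}^{K} \left\lVert g_\phi^{(k)}(s_x) - s_{y_k} \right\rVert_2^2$$
where $y_k$ is a target block of size $b_k$ and $s_{y_k}$ is its target representation.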
The context representation must simultaneously be predictive at all scales. This forces it to encode information that is simultaneously local and global — a form of multi-resolution representation learning without explicit multi-resolution architecture.
Part 6: Open Theoretical Questions
6.1 Why Does BYOL Work Without Negatives?
This remains one of the central open questions in SSL theory. BYOL achieved state-of-the-art results at publication with an apparently easy-to-collapse objective (MSE between two views). Current explanations:
- Predictor asymmetry (Tian et al., 2021) — the linear network analysis shows collapse is avoided, but the extension to deep nonlinear networks is heuristic.
- Implicit regularization — batch normalization, weight decay, and the specific initialization create implicit collapse-prevention forces that are not captured by simplified analyses.
- Feature diversity via EMA — the target encoder provides a consistent but drifting target; this prevents the slow collapse seen in symmetric architectures.
6.2 Why Does Predictor Strength Affect BYOL and SimSiam Differently?
- In BYOL: a stronger predictor (2-layer vs. 1-layer MLP) hurts performance.
- In SimSiam: a stronger predictor helps or is neutral.
- The theoretical reason is unknown. Hypothesis: in BYOL, the EMA target is already a good predictor, so an elaborate predictor overfits; in SimSiam (no EMA), the predictor must do more work.
6.3 The Relationship Between Batch Size and Collapse
- SimCLR: larger batches → more negatives → stronger uniformity → better representations
- BYOL: batch size has much less effect (no negatives needed)
- Barlow Twins: batch size determines the quality of the cross-correlation estimate
- Why batch size affects some methods dramatically and others barely is not fully characterized.
6.4 The Role of Data Augmentation
- Stronger augmentations improve contrastive and non-contrastive methods alike
- But augmentations define the invariance the model learns — there's no principled way to choose them
- JEPA's use of spatial structure rather than augmentations is an attempt to avoid this, but it introduces its own hyperparameters (block size, aspect ratio, spatial distribution)
6.5 The Spectral Gap Between SSL and Supervised Representations
SSL representations (even from strong methods) differ systematically from supervised ones:
- SSL representations have higher effective rank (more dimensions are used)
- SSL representations are less specialized — individual dimensions don't align with semantic concepts
- Supervised representations have sharper spectral decay
- The theoretical implications of this spectral gap for downstream task performance are not fully understood.
References
Information-Theoretic Foundations
- van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation Learning with Contrastive Predictive Coding. — arXiv:1807.03748 — InfoNCE loss derivation, MI lower bound, density ratio estimation.
- Wang, T., & Isola, P. (2020). Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. ICML. — arXiv:2005.10242 — Alignment + uniformity decomposition of contrastive loss.
- Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., & Lucic, M. (2019). On Mutual Information Maximization for Representation Learning. — arXiv:1907.13625 — Critique of InfoNCE as MI maximization.
Collapse Analysis
- Jing, L., et al. (2021). Understanding Dimensional Collapse in Contrastive Self-supervised Learning. — arXiv:2110.09348 — Mathematical characterization of dimensional collapse, spectral dynamics.
- Tian, Y., et al. (2021). Understanding Self-supervised Learning Dynamics without Contrastive Pairs. — arXiv:2102.06810 — Linear network analysis of BYOL/SimSiam non-collapse dynamics.
Mathematical Remedies
- Chen, T., et al. (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML. — arXiv:2002.05709 — NT-Xent loss.
- He, K., et al. (2020). Momentum Contrast for Unsupervised Visual Representation Learning. CVPR. — arXiv:1911.05722 — MoCo momentum encoder + queue.
- Grill, J.-B., et al. (2020). Bootstrap Your Own Latent. NeurIPS. — arXiv:2006.07733 — BYOL: predictor + EMA.
- Chen, X., & He, K. (2021). Exploring Simple Siamese Representation Learning. CVPR. — arXiv:2011.10566 — SimSiam: stop-gradient as implicit EM.
- Zbontar, J., et al. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. ICML. — arXiv:2103.03230 — Cross-correlation minimization.
- Bardes, A., et al. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR. — arXiv:2105.04906 — Explicit variance regularization.
JEPA & Related
- Assran, M., et al. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR. — arXiv:2301.08243 — I-JEPA: multi-scale latent prediction.
- Bardes, A., et al. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. — arXiv:2404.08471 — V-JEPA: temporal prediction.
- LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview. — https://openreview.net/forum?id=BZ5a1r-kVsf — EBM and world model vision for JEPA.
Ancillary
- Matrix Information Theory for Self-Supervised Learning. — arXiv:2305.17326 — Unifies SimSiam, Barlow Twins, MEC under maximum entropy encoding.