R
Published on

The Mathematics of Latent Prediction & Collapse in SSL

Overview

This note develops the mathematical theory underlying Self-Supervised Learning, with a focus on predictive SSL — methods that learn representations by predicting one view from another in latent space. We cover:

  1. What we are actually estimating and why it is fundamentally difficult
  2. The information-theoretic foundation of SSL objectives
  3. The collapse problem: complete, dimensional, and its spectral characterization
  4. A catalog of mathematical remedies (exact loss functions and why they work)
  5. JEPA's mathematical distinctiveness
  6. Open theoretical questions

Part 1: The Fundamental Problem Setup

1.1 What Are We Estimating?

Let XXX \in \mathcal{X} be the input space (images) and let ZRdZ \in \mathbb{R}^d be a latent representation space. SSL learns an encoder fθ:XRdf_\theta: \mathcal{X} \to \mathbb{R}^d such that Z=fθ(X)Z = f_\theta(X) captures semantic content — information about objects, scenes, and their relationships — while being invariant to nuisance factors (lighting, viewpoint, texture variations).

In predictive SSL specifically, we have two views xx and x+x^+ of the same underlying data point (e.g., two crops of the same image, two frames of the same video, two blocks of the same image). The goal is to learn representations z=fθ(x)z = f_\theta(x) and z+=fθ(x+)z^+ = f_\theta(x^+) that are predictable from each other — i.e., there exists a predictor gϕg_\phi such that gϕ(z)z+g_\phi(z) \approx z^+.

Formally: We want to find fθf_\theta such that the conditional expectation E[Z+Z]\mathbb{E}[Z^+ | Z] preserves the information in Z+Z^+ about the underlying semantic state SS, while discarding view-specific noise NN:

Z=fθ(X)h(S)where X=T(S,N)Z = f_\theta(X) \approx h(S) \quad \text{where } X = T(S, N)

Here TT is a stochastic transformation (data augmentation, cropping, masking) that produces a view from the latent state SS and noise NN. The representation ZZ should capture SS but not NN.

1.2 Why Is This Difficult?

Difficulty 1: The invariance-objective contradiction. The natural objective — make zz and z+z^+ close — has a trivial minimum: make fθf_\theta constant. This is the collapse problem.

Lnaive=E[fθ(x)fθ(x+)2]fθ(x)=c is global minimum\mathcal{L}_{\text{naive}} = \mathbb{E}\left[\|f_\theta(x) - f_\theta(x^+)\|^2\right] \quad \Rightarrow \quad f_\theta(x) = c \text{ is global minimum}

Difficulty 2: What should be invariant vs. what should be preserved? There is no ground-truth signal telling us which variations are semantic (should be preserved) and which are nuisances (should be discarded). Data augmentations encode human assumptions about this.

Difficulty 3: The abstraction hierarchy. Semantics exist at multiple scales — a car wheel and a car are both "semantic" but at different levels. No single objective automatically discovers this hierarchy.

Difficulty 4: The noise-information trade-off. To discard noise, the representation must lose information. But how much and what kind? Too much information loss → collapse. Too little → overfitting to nuisance.

Difficulty 5: The evaluation gap. SSL objectives optimize for a surrogate (contrastive, reconstruction, prediction) but we evaluate on downstream tasks (classification, detection). The gap between surrogate and downstream is not formally characterized.


Part 2: Information-Theoretic Foundations

2.1 InfoNCE and Mutual Information Lower Bound

Source: Representation Learning with Contrastive Predictive Coding (van den Oord et al., 2018) — arXiv:1807.03748

The InfoNCE loss for a set of NN samples {xi}i=1N\{x_i\}_{i=1}^N with one positive x+x^+ and N1N-1 negatives {xj}j=1N1\{x_j^-\}_{j=1}^{N-1}:

LNCE=E[logf(x+,c)j=1Nf(xj,c)]\mathcal{L}_{\text{NCE}} = -\mathbb{E}\left[\log \frac{f(x^+, c)}{\sum_{j=1}^N f(x_j, c)}\right]

where cc is a context (e.g., in CPC, this is the representation of the past), and f(x,c)=exp(zxzc/τ)f(x, c) = \exp(z_x^\top z_c / \tau) is a "critic" scoring compatibility.

Key theoretical result: The optimal critic satisfies:

f(x,c)=p(xc)p(x)f^*(x, c) = \frac{p(x|c)}{p(x)}

which is a density ratio. Substituting this back:

LNCE=E[logp(x+c)/p(x+)p(x+c)/p(x+)+j=1N1p(xjc)/p(xj)]\mathcal{L}_{\text{NCE}}^* = -\mathbb{E}\left[\log \frac{p(x^+|c)/p(x^+)}{p(x^+|c)/p(x^+) + \sum_{j=1}^{N-1} p(x_j^-|c)/p(x_j^-)}\right]

This yields a lower bound on mutual information:

I(X;C)log(N)LNCEI(X; C) \geq \log(N) - \mathcal{L}_{\text{NCE}}

Proof sketch:

  • Let XX be the positive sample and {Xj}j=1N\{X_j\}_{j=1}^N be the set containing 1 positive + N1N-1 negatives, all drawn independently from p(x)p(x|) except the positive which depends on cc.
  • The probability that the critic correctly identifies the positive among NN candidates is bounded by p(x+c)p(x+)\frac{p(x^+|c)}{p(x^+)}.
  • Jensen's inequality on log(1+N1NE[p(xc)p(x)/p(x+c)p(x+)])\log(1 + \frac{N-1}{N} \mathbb{E}[\frac{p(x^-|c)}{p(x^-)} / \frac{p(x^+|c)}{p(x^+)}]) gives the bound.

Important caveat: The bound tightens as NN \to \infty, but for finite NN it is limited to logN\log N.

2.2 Alignment and Uniformity Decomposition

Source: Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere (Wang & Isola, 2020) — arXiv:2005.10242

For normalized representations on the unit hypersphere Sd1\mathbb{S}^{d-1} (i.e., z=1\|z\| = 1), as the number of negatives KK \to \infty, the InfoNCE loss decomposes into two terms:

LcontrastKEx,x+[f(x)f(x+)2]alignment  +  logEx,y[ef(x)f(y)2/2τ2]uniformity\mathcal{L}_{\text{contrast}} \xrightarrow{K\to\infty} \underbrace{\mathbb{E}_{x,x^+}\left[\|f(x) - f(x^+)\|^2\right]}_{\text{alignment}} \;+\; \underbrace{\log \mathbb{E}_{x,y}\left[e^{-\|f(x)-f(y)\|^2 / 2\tau^2}\right]}_{\text{uniformity}}

Where this comes from: For unit vectors, f(x)f(x+)=112f(x)f(x+)2f(x)^\top f(x^+) = 1 - \frac{1}{2}\|f(x)-f(x^+)\|^2. As KK \to \infty, the sum over negatives converges to an expectation, and log(K)\log(K) becomes a constant additive shift.

Alignment encourages positive pairs to have similar representations. Uniformity encourages the overall distribution of representations to be uniform on the hypersphere — this is the force that prevents collapse.

Properties:

  • Alignment alone is minimized by collapse (all points → same location → alignment = 0)
  • Uniformity alone is minimized by a uniform distribution on Sd1\mathbb{S}^{d-1}
  • Together, they balance: representations of the same object should be close, but different objects should be well-separated

Empirical finding: Directly optimizing alignment + uniformity (without InfoNCE) achieves comparable or better downstream performance than full contrastive learning, confirming these are the essential properties.

2.3 The Tschannen Critique: Is InfoNCE Really Maximizing MI?

Source: On Mutual Information Maximization for Representation Learning (Tschannen et al., 2019) — arXiv:1907.13625

Three fundamental problems with the MI-maximization interpretation:

1. Bound saturation. For NN negative samples, the bound is I(X;C)log(N)LNCEI(X;C) \geq \log(N) - \mathcal{L}_{\text{NCE}}. Since LNCE0\mathcal{L}_{\text{NCE}} \geq 0, the maximum lower bound is logN\log N. With batch size 256 (SimCLR default), logN5.5\log N \approx 5.5 nats. The actual MI could be much higher — the bound is uninformative.

2. The critic, not the encoder, is optimized. The bound is actually on I(fϕ(X);C)I(f_\phi(X); C) where fϕf_\phi is the critic — not on I(fθ(X);C)I(f_\theta(X); C) where fθf_\theta is the encoder. The encoder representations only influence the bound through the critic. This creates a gap.

3. Architecture-driven success. The paper argues that contrastive learning succeeds because of:

  • The data augmentation design (which defines the prediction task)
  • The projection head (which peels off unwanted information)
  • The embedding geometry (hypersphere concentration)

...not because of MI maximization. The MI bound is loose and the actual optimization dynamics are better understood through the alignment-uniformity lens.

2.4 Noise Contrastive Estimation (NCE) Connection

InfoNCE is closely related to Noise Contrastive Estimation. In NCE:

LNCE=E[logp(xc)p(xc)+p(x)]\mathcal{L}_{\text{NCE}} = -\mathbb{E}\left[\log \frac{p(x|c)}{p(x|c) + p(x)}\right]

InfoNCE differs by:

  1. Learned density ratio — the critic f(x,c)f(x,c) learns p(xc)/p(x)p(x|c)/p(x) rather than assuming a parametric form
  2. Self-normalization — the denominator over all {xj}\{x_j\} partitions the space
  3. Multi-class formulationNN-way classification rather than binary

The key distinction: NCE is designed for density estimation (the critic is the output), while InfoNCE is designed for representation learning (the critic is discarded after training, only the encoder is kept).


Part 3: The Collapse Problem — Mathematical Characterization

3.1 Complete Collapse

Definition: All inputs map to the same constant vector cRdc \in \mathbb{R}^d:

xX,fθ(x)=c\forall x \in \mathcal{X}, \quad f_\theta(x) = c

Properties:

  • Covariance matrix: Σ=E[(zzˉ)(zzˉ)]=0\Sigma = \mathbb{E}[(z - \bar{z})(z - \bar{z})^\top] = 0 (zero matrix)
  • All dd eigenvalues are zero: λ1==λd=0\lambda_1 = \cdots = \lambda_d = 0
  • Rank: rank(Σ)=0\text{rank}(\Sigma) = 0
  • Entropy: H(Z)=0H(Z) = 0
  • Loss under Lalign\mathcal{L}_{\text{align}}: L=0\mathcal{L} = 0 (global minimum)

Why it happens: The invariance objective Linv=Efθ(x)fθ(x+)2\mathcal{L}_{\text{inv}} = \mathbb{E}\|f_\theta(x) - f_\theta(x^+)\|^2 alone has zero as its global minimum at fθ(x)=cf_\theta(x) = c. There is no gradient signal pushing representations apart.

Linear network dynamics: For a linear encoder fθ(x)=Wxf_\theta(x) = Wx:

dWdt=2ηWE[(xx+)(xx+)]S\frac{dW}{dt} = -2\eta \, W \, \underbrace{\mathbb{E}[(x - x^+)(x - x^+)^\top]}_{S}

where SS is the positive-semidefinite difference covariance. The solution:

W(t)=W(0)e2ηStW(t) = W(0) e^{-2\eta S t}

Every singular value of WW decays exponentially to zero.

3.2 Dimensional Collapse

Source: Understanding Dimensional Collapse in Contrastive Self-supervised Learning (Jing et al., 2021) — arXiv:2110.09348

Definition: Representations span a low-dimensional subspace rdr \ll d but are not all identical. The empirical covariance Σ\Sigma has rank rr where 1r<d1 \leq r < d.

Properties:

  • Eigenvalues: λ1λr>0\lambda_1 \geq \cdots \geq \lambda_r > 0, λr+1λd0\lambda_{r+1} \approx \cdots \approx \lambda_d \approx 0
  • Effective rank: eff-rank(Σ)=exp(i=1dpilogpi)\text{eff-rank}(\Sigma) = \exp\left(-\sum_{i=1}^d p_i \log p_i\right) where pi=λi/jλjp_i = \lambda_i / \sum_j \lambda_j
  • Under collapse: eff-rank(Σ)d\text{eff-rank}(\Sigma) \ll d

Spectral dynamics under InfoNCE: The eigenvalue λk\lambda_k evolves as:

dλkdt=λk(1τeλk/τjeλj/τNτE[alignment])\frac{d\lambda_k}{dt} = \lambda_k \left(\frac{1}{\tau}\frac{e^{\lambda_k/\tau}}{\sum_j e^{\lambda_j/\tau}} - \frac{N}{\tau} \mathbb{E}[\text{alignment}]\right)

Key insight: The exponential eλk/τe^{\lambda_k/\tau} creates a rich-get-richer dynamic. Larger eigenvalues get stronger positive gradients (via the softmax numerator), while the negative-pair repulsion is approximately uniform across dimensions. This amplifies dominant modes and suppresses minor ones.

Three phases of collapse:

  1. Alignment phase: All eigenvalues grow as representations spread out from initialization.
  2. Spectral separation: Top eigenvalues keep growing; smaller ones plateau or begin to shrink.
  3. Dimensional collapse: Effective rank plummets — the model "gives up" on minor dimensions and concentrates information in a low-dimensional subspace.

Gradient decomposition for linear encoder WW:

WL=1τEx,x+[(zz+)x]pull positive1τEx,x[wij(zz)x]push negatives\nabla_W \mathcal{L} = \underbrace{\frac{1}{\tau}\mathbb{E}_{x,x^+}[(z - z^+)x^\top]}_{\text{pull positive}} - \underbrace{\frac{1}{\tau}\mathbb{E}_{x,x^-}[w_{ij}(z - z^-)x^\top]}_{\text{push negatives}}

where wij=ezizj/τ/kezizk/τw_{ij} = e^{z_i^\top z_j/\tau} / \sum_k e^{z_i^\top z_k/\tau} concentrates on the nearest negatives — which lie along the dominant eigendirections. Thus negatives push hardest along directions where representations already have high variance, further amplifying the imbalance.

3.3 Why Invariance Alone Collapses (BUT Non-Contrastive Methods Don't Necessarily)

Source: Understanding Self-supervised Learning Dynamics without Contrastive Pairs (Tian et al., 2021) — arXiv:2102.06810

The core result: in a simplified linear network setting, the asymmetric architecture (encoder + predictor + stop-gradient) of methods like BYOL and SimSiam creates dynamics where collapse is not a fixed point.

Setup: Online encoder WW, predictor PP, target encoder Wˉ\bar{W} (stop-gradient, optionally EMA).

L=EPWxWˉx+2\mathcal{L} = \mathbb{E}\|P W x - \bar{W} x^+\|^2

Gradient flow:

dWdt=2η(PPWAPWˉC)\frac{dW}{dt} = -2\eta (P^\top P W A - P^\top \bar{W} C^\top) dPdt=2η(PWAWWˉCW)\frac{dP}{dt} = -2\eta (P W A W^\top - \bar{W} C^\top W^\top)

where A=E[xx]A = \mathbb{E}[xx^\top], C=E[x(x+)]C = \mathbb{E}[x(x^+)^\top].

Why collapse is avoided: The predictor learns:

P=WˉCW(WAW)1P^* = \bar{W} C^\top W^\top (W A W^\top)^{-1}

This creates an asymmetry that prevents the collapse fixed point. The stationary condition PWA=WˉCP W A = \bar{W} C^\top is not satisfied by W=0W = 0.

Per-eigenmode result: When the cross-correlation CC is full rank, the system converges to:

WW=CA1CA1W^\top W = C^\top A^{-1} C A^{-1}

which is full rank (r=dr = d). Neither complete nor dimensional collapse occurs.

Three necessary conditions for non-collapse:

  1. Predictor present — without PP, the loss becomes symmetric and collapse is a fixed point
  2. Stop-gradient applied — without sg(.)sg(.), the target Wˉ\bar{W} chases WW and both collapse
  3. Cross-correlation CC is full rank — if views are too similar, CC is low-rank and collapse occurs despite asymmetry

Part 4: Mathematical Remedies for Collapse — Catalog with Exact Loss Functions

4.1 SimCLR — Temperature-Controlled Negative Repulsion

Source: arXiv:2002.05709

Loss (NT-Xent):

Li,j=logexp(sim(zi,zj)/τ)k=12N1[ki]exp(sim(zi,zk)/τ)\mathcal{L}_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\text{sim}(z_i, z_k)/\tau)}

where sim(u,v)=uv/(uv)\text{sim}(u,v) = u^\top v / (\|u\|\|v\|), z=g(f(x))z = g(f(x)) (projection head), and τ>0\tau > 0 is temperature.

Why it prevents collapse: The denominator forces every sample to be distinguishable from every other sample. At collapse (sim=1\text{sim}=1 for all pairs):

L=log12N1>0\mathcal{L} = -\log\frac{1}{2N-1} > 0

This generates gradients that break collapse.

Role of temperature τ\tau:

  • τ0+\tau \to 0^+: Softmax → argmax. Hardest negative dominates → strongest uniformity.
  • τ\tau \to \infty: All similarities ~1 → no discrimination.
  • Optimal τ\tau (0.07–0.5 for SimCLR): balances alignment strength with uniformity.

4.2 VICReg — Variance + Invariance + Covariance

Source: arXiv:2105.04906

Loss:

L=λ1Nizizi22invariance  +  μ1djmax(0,γVar(z:,j)+ϵ)variance  +  ν1dij[C(Z)]ij2covariance\mathcal{L} = \underbrace{\lambda \cdot \frac{1}{N}\sum_i \|z_i - z'_i\|_2^2}_{\text{invariance}} \;+\; \underbrace{\mu \cdot \frac{1}{d}\sum_j \max(0, \gamma - \sqrt{\text{Var}(z_{:,j}) + \epsilon})}_{\text{variance}} \;+\; \underbrace{\nu \cdot \frac{1}{d}\sum_{i \neq j} [C(Z)]_{ij}^2}_{\text{covariance}}

where C(Z)=1N1i(zizˉ)(zizˉ)C(Z) = \frac{1}{N-1}\sum_i (z_i - \bar{z})(z_i - \bar{z})^\top, γ=1\gamma = 1 (target std).

Why it prevents collapse — three distinct forces:

  1. Invariance pulls positive pairs together (would alone collapse).
  2. Variance hinge forces each feature dimension's standard deviation γ\geq \gamma. This directly prevents dimensional collapse: if one dimension collapses to a constant, that dimension's std = 0, triggering the hinge loss.
  3. Covariance penalizes off-diagonal correlation matrix entries. This decorrelates features, preventing redundancy and ensuring each dimension captures distinct information.

Connection to uniform hypersphere: For uniformly distributed points on Sd1\mathbb{S}^{d-1}, the expected covariance is approximately 1dId\frac{1}{d} I_d — diagonal with equal variance. VICReg enforces exactly this structure, albeit in Rd\mathbb{R}^d rather than on the hypersphere.

4.3 Barlow Twins — Cross-Correlation Minimization

Source: arXiv:2103.03230

Loss:

Cij=bzb,iAzb,jBb(zb,iA)2b(zb,jB)2C_{ij} = \frac{\sum_b z_{b,i}^A z_{b,j}^B}{\sqrt{\sum_b (z_{b,i}^A)^2} \sqrt{\sum_b (z_{b,j}^B)^2}}

L=i(1Cii)2+λijiCij2\mathcal{L} = \sum_i (1 - C_{ii})^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2

Why it prevents collapse — redundancy reduction:

  • Diagonal (Cii1C_{ii} \to 1): Each feature dimension must be perfectly correlated across the two views (invariance for that dimension).
  • Off-diagonal (Cij0C_{ij} \to 0, iji \neq j): Different dimensions must be decorrelated (no redundancy).

Mechanism: At collapse, all samples map to the same point. After batch normalization (which shifts and scales), each dimension is a constant. The cross-correlation between different dimensions of two constant vectors equals the product of the constants' normalized values, which is generally non-zero. This activates the off-diagonal penalty — which produces gradients that force different dimensions to carry different information. But at a single point, there's no variation to differentiate. The network must learn to spread representations to minimize the off-diagonal term.

4.4 BYOL — Predictor Asymmetry + EMA Target

Source: arXiv:2006.07733

Architecture:

  • Online: encoder fθf_\theta, projector gθg_\theta, predictor qθq_\theta
  • Target: encoder fξf_\xi, projector gξg_\xi — updated by EMA: ξτξ+(1τ)θ\xi \leftarrow \tau \xi + (1 - \tau)\theta
  • No gradients flow through the target (stop-gradient)

Loss (symmetrized):

Lθ=qθ(zθ)qθ(zθ)2zξzξ222predictortarget projection  +  symmetric term\mathcal{L}_\theta = \underbrace{\left\|\frac{q_\theta(z_\theta)}{\|q_\theta(z_\theta)\|_2} - \frac{z'_\xi}{\|z'_\xi\|_2}\right\|_2^2}_{\text{predictor} \to \text{target projection}} \;+\; \text{symmetric term}

After L2 normalization, this is equivalent to:

L=22qθ(zθ)qθ(zθ),zξzξ\mathcal{L} = 2 - 2 \left\langle \frac{q_\theta(z_\theta)}{\|q_\theta(z_\theta)\|}, \frac{z'_\xi}{\|z'_\xi\|} \right\rangle

Why it prevents collapse (the asymmetry argument):

  1. Without a predictor: online and target would be symmetric → MSE has trivial collapse solution.
  2. With a predictor: the online network must learn to predict the target's representation. The predictor is a learnable mapping qθq_\theta that maps the online representation to the target representation space.
  3. EMA target: provides stable, slowly-changing regression targets. If collapse tries to happen, the target (from an older, stable encoder) still produces diverse representations, creating a persistent error signal.
  4. Formal result (Tian et al., 2021): Without a predictor, variance goes to zero. The predictor forces information-preserving representations because it must learn a non-trivial mapping.

The ε\varepsilon parameter (from PMAX/JEPA context): BYOL operates in the ε=1\varepsilon = 1 regime — no explicit anti-collapse regularizer, relying entirely on architectural asymmetry. This is the "hard" regime that BYOL was the first to make work at ImageNet scale.

4.5 SimSiam — Stop-Gradient as Implicit EM

Source: arXiv:2011.10566

Loss:

L=12D(p1,sg(z2))+12D(p2,sg(z1))\mathcal{L} = \frac{1}{2} D(p_1, \text{sg}(z_2)) + \frac{1}{2} D(p_2, \text{sg}(z_1))

where sg()\text{sg}(\cdot) = stop-gradient, D(p,z)=p/p,z/zD(p, z) = -\langle p/\|p\|, z/\|z\|\rangle (negative cosine similarity).

Why it prevents collapse — implicit EM interpretation:

Without stop-gradient, collapse is immediate — both branches predict each other's constant vector, loss = -1. With stop-gradient, SimSiam implements alternating optimization:

  • E-step: Given encoder parameters θ\theta, compute targets ηx=E[fθ(x+)x]\eta_x = \mathbb{E}[f_\theta(x^+) | x] (expected representation under augmentations). Approximated by sg(fθ(x2))\text{sg}(f_\theta(x_2)).
  • M-step: Given targets, update θ\theta to minimize MSE: θargminθEfθ(x1)ηx2\theta \leftarrow \arg\min_\theta \mathbb{E}\|f_\theta(x_1) - \eta_x\|^2.

Why collapse is unstable: At exact collapse, loss = 0 (stationary point). However, any perturbation away from collapse produces E-step targets that differ across samples. The M-step then tries to match these different targets, pulling representations apart. Collapse is a stationary point but not an attractor — the EM dynamics naturally increase diversity.

4.6 MoCo — Momentum Encoder + Large Queue

Source: arXiv:1911.05722

Loss (InfoNCE with queue):

Lq=logexp(qk+/τ)i=0Kexp(qki/τ)\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^K \exp(q \cdot k_i / \tau)}

where q=fq(xquery)q = f_q(x^{\text{query}}), k+=fk(xkey)k_+ = f_k(x^{\text{key}}) (positive), {k0,...,kK}\{k_0, ..., k_K\} are KK negatives from the queue.

Momentum update: fkmfk+(1m)fqf_k \leftarrow m f_k + (1 - m) f_q, with m=0.999m = 0.999.

Why it prevents collapse:

  1. Momentum encoder provides near-consistent targets across training steps. If the query encoder collapses, the positive key from the stable momentum encoder still produces a diverse target — maintaining high loss.
  2. Large queue (K=65536K = 65536 \gg batch size) provides abundant negatives. At collapse, qki1q \cdot k_i \approx 1 for all ii, so Llog(K+1)11.09\mathcal{L} \approx \log(K+1) \approx 11.09 — very high loss driving strong gradients.
  3. Queue decouples dictionary from batch size — unlike SimCLR's 2N12N-1 limit, MoCo provides temporal diversity from past iterations.

4.7 Summary: The Collapse Prevention Zoo

MethodLoss TypeNegatives?Collapse PreventionMath Mechanism
SimCLRNT-XentYes (batch)Negative repulsionSoftmax uniformity + temperature
MoCoInfoNCE + QueueYes (queue)Large queue + momentum encoderK=65536K=65536 negatives, EMA target stability
VICRegVar+Inv+CovNoVariance hingemax(0,γstd(z:,j))\max(0, \gamma - \text{std}(z_{:,j})) per dimension
Barlow TwinsCross-correlationNoRedundancy reductionOff-diagonal Cij2C_{ij}^2 penalty
BYOLMSE + Predictor + EMANoArchitectural asymmetryPredictor + stop-gradient breaks symmetry
SimSiamNeg Cosine + Stop-GradNoImplicit EME-step/M-step alternation; collapse not attractor

Part 5: JEPA's Mathematical Distinctiveness

5.1 The JEPA Formulation

Source: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (Assran et al., 2023) — arXiv:2301.08243

JEPA defines a joint-embedding predictive objective. Given an image xx, sample:

  • A context block BcB_c (large, spatially distributed)
  • Target blocks {Bt(k)}k=1K\{B_t^{(k)}\}_{k=1}^K (smaller, semantically meaningful regions)

The encoder fθf_\theta encodes the context: sc=fθ(xBc)s_c = f_\theta(x|_{B_c}). The predictor gϕg_\phi predicts target representations from the context:

LJEPA=k=1KE[gϕ(k)(sc)sˉt(k)2]\mathcal{L}_{\text{JEPA}} = \sum_{k=1}^K \mathbb{E}\left[\|g_\phi^{(k)}(s_c) - \bar{s}_t^{(k)}\|^2\right]

where sˉt(k)=fˉξ(xBt(k))\bar{s}_t^{(k)} = \bar{f}_\xi(x|_{B_t^{(k)}}) is the target representation from a momentum encoder (EMA), and gϕ(k)g_\phi^{(k)} are multi-scale prediction heads.

Key mathematical choices:

  1. Non-generative — predicts in latent space, not pixel space. No decoder.
  2. Multi-scale prediction — different predictor heads operate at different block sizes, naturally creating a semantic hierarchy.
  3. EMA target — prevents the trivial collapse by decoupling the prediction target from the online encoder.
  4. Block-wise masking — unlike MAE's random per-patch masking, JEPA masks entire semantic blocks.

5.2 Why This Avoids Collapse (Theoretical)

JEPA combines two collapse-prevention mechanisms inherited from the BYOL/PMAX lineage:

  1. Architectural asymmetry — predictor gϕg_\phi + EMA target fˉξ\bar{f}_\xi creates the same dynamics as BYOL (Section 4.4). Without this, the MSE between context and target representations would collapse.

  2. Multi-task prediction — the multiple prediction heads gϕ(k)g_\phi^{(k)} at different scales create a variety of prediction tasks. This naturally constrains the representation: scs_c must be informative enough to predict targets at multiple scales, preventing it from discarding too much information.

  3. Spatial masking strategy — target blocks are sampled at large semantic scale (not individual patches). This forces the representation to capture high-level semantics rather than local pixel statistics. The context block being spatially distributed ensures it contains sufficient information for non-trivial prediction.

5.3 Relation to Energy-Based Models

JEPA can be understood through LeCun's Energy-Based Model (EBM) framework. The energy of a configuration (x,y)(x, y) is:

E(x,y)=gϕ(fθ(x))fˉξ(y)2E(x, y) = \|g_\phi(f_\theta(x)) - \bar{f}_\xi(y)\|^2

where xx is the context and yy is the target block.

The goal is to learn an energy function that assigns low energy to compatible (context, target) pairs and high energy to incompatible pairs. In JEPA, this is achieved by:

  • Pulling down energy for positive pairs (ground-truth context-target combinations)
  • The EMA target implicitly prevents pushing-up from collapsing

More broadly, JEPA is an instance of a regularized latent-variable energy model where the latent variable is the representation itself, and regularization comes from the architectural asymmetry.

5.4 Comparison: Predictive vs. Generative vs. Contrastive

AspectGenerative (MAE)Contrastive (SimCLR)Predictive (JEPA)
TargetRaw pixelsEmbedding of negative samplesEmbedding of positive target
Lossxx^2\|x - \hat{x}\|^2logesim(z,z+)/τesim(z,z)/τ-\log\frac{e^{\text{sim}(z,z^+)/\tau}}{\sum e^{\text{sim}(z,z^-)/\tau}}g(sc)st2\|g(s_c) - s_t\|^2
Abstraction levelLow (pixel)High (embedding)High (embedding)
View generationRandom maskingHand-crafted augmentationsSpatial block structure
Collapse riskLow (pixel space well-constrained)Low (negatives push apart)Moderate (asymmetry + EMA)

5.5 The Multi-Scale Prediction Hierarchy

A unique mathematical aspect of JEPA is its multi-scale prediction heads. Each predictor gϕ(k)g_\phi^{(k)} predicts the representation of a target block of a specific size. This creates an implicit curriculum:

  • Small target blocks: predict fine-grained, local content (textures, edges)
  • Large target blocks: predict coarse, global content (object shapes, scene layout)

Mathematically, for block sizes {s1,...,sK}\{s_1, ..., s_K\}:

scgϕ(1)sˉt(1)(s1)s_c \xrightarrow{g_\phi^{(1)}} \bar{s}_t^{(1)}(s_1) scgϕ(2)sˉt(2)(s2)s_c \xrightarrow{g_\phi^{(2)}} \bar{s}_t^{(2)}(s_2) \vdots scgϕ(K)sˉt(K)(sK)s_c \xrightarrow{g_\phi^{(K)}} \bar{s}_t^{(K)}(s_K)

The context representation scs_c must simultaneously be predictive at all scales. This forces it to encode information that is simultaneously local and global — a form of multi-resolution representation learning without explicit multi-resolution architecture.


Part 6: Open Theoretical Questions

6.1 Why Does BYOL Work Without Negatives?

This remains the most important open question in SSL theory. BYOL achieves state-of-the-art results with an apparently easy-to-collapse objective (MSE between two views). Current explanations:

  • Predictor asymmetry (Tian et al., 2021) — the linear network analysis shows collapse is avoided, but the extension to deep nonlinear networks is heuristic.
  • Implicit regularization — batch normalization, weight decay, and the specific initialization create implicit collapse-prevention forces that are not captured by simplified analyses.
  • Feature diversity via EMA — the target encoder provides a consistent but drifting target; this prevents the slow collapse seen in symmetric architectures.

6.2 Why Does Predictor Strength Affect BYOL and SimSiam Differently?

  • In BYOL: a stronger predictor (2-layer vs. 1-layer MLP) hurts performance.
  • In SimSiam: a stronger predictor helps or is neutral.
  • The theoretical reason is unknown. Hypothesis: in BYOL, the EMA target is already a good predictor, so an elaborate predictor overfits; in SimSiam (no EMA), the predictor must do more work.

6.3 The Relationship Between Batch Size and Collapse

  • SimCLR: larger batches → more negatives → stronger uniformity → better representations
  • BYOL: batch size has much less effect (no negatives needed)
  • Barlow Twins: batch size determines the quality of the cross-correlation estimate
  • Why batch size affects some methods dramatically and others barely is not fully characterized.

6.4 The Role of Data Augmentation

  • Stronger augmentations improve contrastive and non-contrastive methods alike
  • But augmentations define the invariance the model learns — there's no principled way to choose them
  • JEPA's use of spatial structure rather than augmentations is an attempt to avoid this, but it introduces its own hyperparameters (block size, aspect ratio, spatial distribution)

6.5 The Spectral Gap Between SSL and Supervised Representations

SSL representations (even from strong methods) differ systematically from supervised ones:

  • SSL representations have higher effective rank (more dimensions are used)
  • SSL representations are less specialized — individual dimensions don't align with semantic concepts
  • Supervised representations have sharper spectral decay
  • The theoretical implications of this spectral gap for downstream task performance are not fully understood.

References

Information-Theoretic Foundations

  1. van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation Learning with Contrastive Predictive Coding.arXiv:1807.03748 — InfoNCE loss derivation, MI lower bound, density ratio estimation.
  2. Wang, T., & Isola, P. (2020). Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. ICML. — arXiv:2005.10242 — Alignment + uniformity decomposition of contrastive loss.
  3. Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., & Lucic, M. (2019). On Mutual Information Maximization for Representation Learning.arXiv:1907.13625 — Critique of InfoNCE as MI maximization.

Collapse Analysis

  1. Jing, L., et al. (2021). Understanding Dimensional Collapse in Contrastive Self-supervised Learning.arXiv:2110.09348 — Mathematical characterization of dimensional collapse, spectral dynamics.
  2. Tian, Y., et al. (2021). Understanding Self-supervised Learning Dynamics without Contrastive Pairs.arXiv:2102.06810 — Linear network analysis of BYOL/SimSiam non-collapse dynamics.

Mathematical Remedies

  1. Chen, T., et al. (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML. — arXiv:2002.05709 — NT-Xent loss.
  2. He, K., et al. (2020). Momentum Contrast for Unsupervised Visual Representation Learning. CVPR. — arXiv:1911.05722 — MoCo momentum encoder + queue.
  3. Grill, J.-B., et al. (2020). Bootstrap Your Own Latent. NeurIPS. — arXiv:2006.07733 — BYOL: predictor + EMA.
  4. Chen, X., & He, K. (2021). Exploring Simple Siamese Representation Learning. CVPR. — arXiv:2011.10566 — SimSiam: stop-gradient as implicit EM.
  5. Zbontar, J., et al. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. NeurIPS. — arXiv:2103.03230 — Cross-correlation minimization.
  6. Bardes, A., et al. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR. — arXiv:2105.04906 — Explicit variance regularization.
  1. Assran, M., et al. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. ICCV. — arXiv:2301.08243 — I-JEPA: multi-scale latent prediction.
  2. Bardes, A., et al. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video.arXiv:2404.08471 — V-JEPA: temporal prediction.
  3. LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview. — https://openreview.net/forum?id=BZ5a1r-kVsf — EBM and world model vision for JEPA.

Ancillary

  1. Matrix Information Theory for Self-Supervised Learning. — arXiv:2305.17326 — Unifies SimSiam, Barlow Twins, MEC under maximum entropy encoding.