Ryan Warner
- BSc (University of Calgary, 2011)
- BEng (University of Victoria, 2019)
Topic
Hyperbolic Vision–Language Embeddings and Loss Functions for Multimodal Meme Classification
Department of Computer Science
Date & location
- Tuesday, January 13, 2026
- 1:00 P.M.
- Virtual Defence
Examining Committee
Supervisory Committee
- Dr. Alex Thomo, Department of Computer Science, University of Victoria (Supervisor)
- Dr. Kwang Moo Yi, Department of Computer Science, UBC (Non-Unit Member)
External Examiner
- Dr. Ashery Mbilinyi, Department of Computer Science, UVic
Chair of Oral Examination
- Dr. Karena Shaw, School of Environmental Studies, UVic
Abstract
This thesis investigates whether hyperbolic geometry can improve multimodal vision–language model (VLM) performance on the Facebook Hateful Memes dataset. The challenge of this benchmark stems partly from subtle semantic hierarchies in how images and text combine. For example, the phrase “you smell great” paired with a skunk conveys a fundamentally different meaning than the same text paired with a rose. Such shifts in meaning are often attributed to entailment relationships within and between the visual and textual components of a meme.
We hypothesize that hyperbolic geometry, with its natural capacity to represent hierarchical structure, may capture these entailment relationships more effectively than flat Euclidean space. To explore this idea, we introduce Hyperbolic Flamingo, to our knowledge the first Flamingo-style architecture implemented in hyperbolic geometry. The model combines frozen MERU/CLIP encoders with hyperbolic gated cross-attention layers. We adopt Flamingo’s frozen-encoder design because the benchmark requires rapid iteration, and lightweight adapters allow efficient experimentation across different geometric configurations.
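The frozen-encoder design described above inserts lightweight, gated cross-attention adapters between a frozen language pathway and frozen visual features. As a rough illustration of the Flamingo-style gating mechanism (a minimal single-head, pure-Python sketch, not the thesis implementation; the function names and the scalar `gate` parameter are illustrative assumptions), the tanh gate is initialised at zero so that, at the start of training, each adapter layer is an identity map and the frozen pathway is undisturbed:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(query, keys, values):
    # One text-token query attends over visual tokens (keys == values here).
    d = len(query)
    weights = softmax(
        [sum(q * k for q, k in zip(query, kv)) / math.sqrt(d) for kv in keys]
    )
    # Weighted sum of value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def gated_cross_attention(text_token, visual_tokens, gate):
    # Flamingo-style tanh-gated residual: with gate = 0, the output equals
    # the input, so the frozen language pathway is initially untouched.
    attended = cross_attention(text_token, visual_tokens, visual_tokens)
    return [t + math.tanh(gate) * a for t, a in zip(text_token, attended)]
```

With `gate = 0.0` the layer passes the text token through unchanged; as the gate is learned away from zero, visual information is progressively blended in.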
Initial experiments revealed a core difficulty for hyperbolic VLMs: boundary collapse, where embeddings drift toward the edge of the Poincaré disk and gradients vanish. Under these conditions, angle-based losses saturate and performance collapses toward randomness. To address this, our main methodological contribution is a discriminative prototype loss (L_proto). Instead of classifying via token-likelihood, the model predicts labels by geodesic distance to learnable class prototypes. This shift from generative prediction to geometric separation prevents boundary collapse and enables stable hyperbolic training where previous approaches failed. Experiments with centroid-regularised prototypes (L_proto-reg) show mixed, dimension-dependent effects: regularisation helps at high dimensionalities (e.g., +1.15% at 256d) but reduces performance in lower-dimensional settings.
An ablation across embedding dimensions shows that the Euclidean baseline (67.32% ± 0.56% AUROC) remains competitive with most hyperbolic variants, with the hyperbolic 128-dimensional model achieving a modest improvement (68.84%). Importantly, all prototype-based models outperform the token-likelihood baseline (61.64%), confirming the value of the discriminative pivot. The architecture is also unexpectedly efficient: our 3.8M-parameter adapter surpasses Flamingo-9B (62.7%) regardless of geometry. We view the Lorentzian distance losses explored here as a starting point; more advanced formulations (such as Accept-the-Gap’s exterior-angle losses or HyCoCLIP’s compositional entailment cones) may better realise the theoretical benefits of hyperbolic space. Overall, these findings establish Hyperbolic Flamingo as a practical and extensible foundation for further research on geometric inductive bias in vision–language models.
Code availability: https://github.com/rkwarnerwsslskunkworx/Hyperbolic-Flamingo