Event Details
Object-wise Metric Distance Estimation from a Single RGB Image via Semantic and Geometric Reasoning
Presenter: Abida Sultana
Supervisor:
Date: Wed, April 15, 2026
Time: 09:00:00 - 00:00:00
Place: Zoom - see below.
Join Zoom Meeting
https://uvic.zoom.us/j/4534002368?pwd=2Yu38RsexfniB1auf3UBNsVqw8ByMH.1&omn=87889111010
Meeting ID: 453 400 2368
Password: 658037
One tap mobile
+16475580588,,4534002368# Canada
+17789072071,,4534002368# Canada
Dial by your location
+1 647 558 0588 Canada
+1 778 907 2071 Canada
Find your local number: https://uvic.zoom.us/u/kbY9u23Uyd
Abstract: Estimating the metric distance to an object from a single RGB image is challenging because monocular depth does not provide an absolute scale. Existing solutions either require additional sensors such as LiDAR or stereo cameras, rely on monocular depth that remains scale-ambiguous, or use implicit vision-language reasoning that can be unstable for precise measurement. This thesis proposes a semantic–geometric pipeline that recovers metric scale by combining open-vocabulary object grounding and segmentation, label normalization, monocular depth, and camera cues. Object-centric 3D points are reconstructed from the predicted depth, an oriented 3D bounding box is fitted to estimate object dimensions, and real-world size priors are used to compute a scale factor that converts relative depth into absolute distance. The proposed method is evaluated on HOT3D, ScanNet, ARKitScenes, and a custom iPhone dataset, achieving Multi-Threshold Relative Accuracy (MRA) values of 68.85%, 88.30%, 75.12%, and 89.85%, respectively, under the per-frame average mean-distance strategy. The results show that frame-level averaging improves stability by reducing the influence of instance-level outliers. The main limitations of the approach are its dependence on segmentation and depth quality, its sensitivity to canonical size priors for categories with high size variation, possible instability under occlusion or truncation, and relatively high processing time. Future work includes more robust scale estimation, adaptive size priors, improved object fitting, the use of consecutive frames for temporal consistency, and pipeline optimization for lower latency.
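For readers unfamiliar with the idea, the scale-recovery step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. It assumes a pinhole camera with known intrinsics (fx, fy, cx, cy), a relative depth map that differs from metric depth only by a single scale factor, a PCA-aligned box standing in for the oriented bounding-box fit, and a size prior given as the object's longest side in metres. All function and parameter names are hypothetical.

```python
import numpy as np

def backproject(depth, mask, fx, fy, cx, cy):
    """Lift masked pixels to camera-frame 3D points with a pinhole model."""
    v, u = np.nonzero(mask)          # pixel rows (v) and columns (u) of the object
    z = depth[v, u]                  # relative (scale-ambiguous) depth
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return np.stack([x, y, z], axis=1)

def obb_extents(points):
    """Side lengths of a PCA-aligned bounding box, longest first.

    Assumption: PCA axes stand in for whatever box-fitting method is actually used.
    """
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt.T      # coordinates along the principal axes
    extents = projected.max(axis=0) - projected.min(axis=0)
    return np.sort(extents)[::-1]

def metric_distance(depth, mask, intrinsics, prior_longest_m):
    """Return (mean metric distance to the object, recovered scale factor)."""
    points = backproject(depth, mask, *intrinsics)
    # Back-projection is linear in depth, so box sizes shrink or grow with the
    # same unknown factor as the depth; the size prior fixes that factor.
    scale = prior_longest_m / obb_extents(points)[0]
    mean_range = np.linalg.norm(points, axis=1).mean()
    return scale * mean_range, scale

# Synthetic check: a flat, 0.4 m wide object at 2 m whose depth was shrunk by 4.
depth = np.full((480, 640), 0.5)
mask = np.zeros((480, 640), dtype=bool)
mask[215:266, 270:371] = True
distance_m, scale = metric_distance(depth, mask, (500.0, 500.0, 320.0, 240.0), 0.4)
print(distance_m, scale)
```

In this synthetic case the recovered scale is about 4 and the distance is close to 2 m. With real data the estimate would depend on segmentation quality, depth quality, and how well the size prior matches the object, which are the limitations noted in the abstract.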
