Evaluation
This page describes how submissions to the FRAME Track are evaluated and ranked. The evaluation philosophy follows the Metrics Reloaded recommendations [Maier-Hein et al., 2024], balancing interpretability of the ranking against methodological precision of the detailed performance analyses. The same evaluation protocol applies to the FRAME, SEGMENT, and PROCEDURE Tracks, with track-specific differences noted where relevant.
| At a glance. Each test case is a Visual Question Answering (VQA) instance. Submissions are scored with Accuracy, aggregated into capability- and robustness-stratified buckets, and combined into a final ranking using the Copeland method. Ties within the top three places are resolved by bootstrap-based head-to-head win rates. |
Why this evaluation design
| Interpretability | A single, transparent primary metric, Accuracy, makes leaderboard positions easy to read and verify, while a richer stratified analysis is reported alongside for nuance. |
| Fairness across categories | Equal-weight aggregation across capabilities and robustness buckets prevents large question categories from dominating the ranking. |
| Robustness first | Out-of-distribution (OOD) questions, including unseen procedure types and unseen question formulations, are weighted equally with in-distribution (ID) questions to reward generalization. |
| Theoretical guarantees | Because no aggregation scheme satisfies every desirable property simultaneously (Arrow's impossibility theorem), we adopt the Copeland method, which has favorable properties relative to alternatives such as Borda counts [Rofin et al., 2023]. |
Primary metric: accuracy
The primary metric used for ranking is Accuracy, defined as the proportion of correctly answered VQA cases. We chose Accuracy because it directly operationalizes the assessment goal of correctness, is the de facto standard for VQA benchmarking, and yields results that are comparable across categorical and open-ended question types.
The well-known disadvantages of Accuracy, including sensitivity to class prevalence and threshold choice, are mitigated by the stratified data splits and hierarchical aggregation described below, rather than by replacing the metric.
Closed-ended questions
Closed-ended questions are scored by comparing the parsed response with the reference answer after format-specific verification and parsing: by exact match for most formats, and by tolerance-aware match for the time and percentage formats. Each question declares its answer format during dataset generation. A response that does not pass the verification step of its format is counted as incorrect.
The full set of formats is implemented in the `focus.data.formats` module of the `orena-focus` Python package and includes:
| Format | Accepted response | Comparison rule |
| `binary` | yes / no, case-insensitive | Exact match on parsed boolean |
| `number` | Non-negative integer | Exact match on parsed integer |
| `percentage` | Non-negative number, optional % suffix | Tolerance-aware match on parsed float |
| `fo_class` | A registered foreign-object class name or none | Case-insensitive match on canonical class name |
| `time` | hh:mm:ss timestamp | Tolerance-aware match |
| `multiple_choice` / `open_ended` / `matching` | One of a predefined option set, or free-form text up to 300 characters | LLM-as-a-judge |
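As an illustration of the verify, parse, and compare flow for two of the formats above, here is a minimal sketch. It is not the code in `focus.data.formats`: the function names, the whitespace stripping, and the restriction to ASCII digits are assumptions made for this example.

```python
import re
from typing import Callable, Optional


def parse_binary(response: str) -> Optional[bool]:
    """Return a boolean for a yes/no response, or None if verification fails."""
    text = response.strip().lower()  # assumption: surrounding whitespace is ignored
    if text == "yes":
        return True
    if text == "no":
        return False
    return None


def parse_number(response: str) -> Optional[int]:
    """Return a non-negative integer, or None if verification fails."""
    text = response.strip()
    if re.fullmatch(r"[0-9]+", text):
        return int(text)
    return None


def score_closed_ended(
    response: Optional[str],
    reference: str,
    parser: Callable[[str], Optional[object]],
) -> int:
    """Return 1 for an exact match on the parsed values, otherwise 0.

    A missing response (None) or a response that fails verification
    is counted as incorrect.
    """
    if response is None:
        return 0
    parsed = parser(response)
    if parsed is None:
        return 0
    return int(parsed == parser(reference))
```

For instance, `score_closed_ended("Yes", "yes", parse_binary)` returns 1, while `score_closed_ended("-3", "3", parse_number)` returns 0 because the response fails verification.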
Open-ended questions: LLM-as-a-judge
For questions whose responses cannot be checked by exact match, namely formats `multiple_choice`, `open_ended`, and `matching`, semantic correctness is assessed using an LLM-as-a-judge protocol. A sample judge implementation is provided in `focus.evaluation.judges`.
Up to three independent judge LLMs evaluate every open-ended response; the final verdict is determined by majority vote. The voting routine short-circuits as soon as an absolute majority is reached, which keeps inference cost low without changing the outcome. This paradigm has become common practice in modern VLM benchmarks and tolerates clinically meaningful linguistic variability; internal pilot experiments showed only minor disagreement across state-of-the-art judge LLMs.
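The short-circuiting majority vote can be sketched as follows. This is an illustrative sketch, not the routine in `focus.evaluation.judges`: the `Judge` callable signature and the name `majority_verdict` are hypothetical, and treating an unresolved vote (possible only with an even or empty panel) as incorrect is an assumption.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: (question, reference, response) -> verdict
Judge = Callable[[str, str, str], bool]


def majority_verdict(
    judges: Sequence[Judge], question: str, reference: str, response: str
) -> bool:
    """Query judges in order and stop once an absolute majority is reached."""
    needed = len(judges) // 2 + 1  # votes required for an absolute majority
    votes_correct = 0
    votes_incorrect = 0
    for judge in judges:
        if judge(question, reference, response):
            votes_correct += 1
        else:
            votes_incorrect += 1
        # Short-circuit: remaining judges cannot change the outcome.
        if votes_correct >= needed:
            return True
        if votes_incorrect >= needed:
            return False
    # Assumption: with no absolute majority, the response is counted as incorrect.
    return False
```

With three judges, the third is queried only when the first two disagree, which is why the short-circuit saves inference cost without affecting the verdict.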
To prevent tuning towards specific judges, the exact set of judge models used for the official evaluation will not be disclosed until after the challenge concludes. The evaluation code is released openly, but the judge identities will remain redacted until then.
| Anti-gaming policy. Any attempt to manipulate the LLM-as-a-judge through adversarial prompting, jailbreaking, or other techniques aimed at unfairly influencing evaluation will result in immediate disqualification of the team. |
Tolerance-aware accuracy
For questions where the reference annotation has inherent uncertainty, most notably temporal references in the `time` format and numeric estimates in the `percentage` format, a prediction is counted as correct if it falls within a predefined tolerance window, configured per question via the format's `threshold_seconds` or `threshold_pp` parameter. Tolerance thresholds were derived from inter-rater variability and clinical input, similar to the application-dependent accuracy formulation of [Dergachyova et al., 2016].
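A minimal sketch of the tolerance check is shown below. The parameter names `threshold_seconds` and `threshold_pp` come from the text above; the function names, the inclusive boundary, and the handling of malformed input are assumptions for illustration.

```python
def parse_hms(text: str) -> int:
    """Convert an hh:mm:ss timestamp into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def time_correct(prediction: str, reference: str, threshold_seconds: float) -> bool:
    """True if the predicted timestamp lies within the tolerance window."""
    try:
        delta = abs(parse_hms(prediction) - parse_hms(reference))
    except ValueError:
        return False  # malformed timestamps fail verification
    return delta <= threshold_seconds  # assumption: the boundary is inclusive


def percentage_correct(prediction: str, reference: float, threshold_pp: float) -> bool:
    """True if the prediction is within threshold_pp percentage points."""
    try:
        value = float(prediction.strip().rstrip("%").strip())
    except ValueError:
        return False
    if value < 0:
        return False  # the format accepts non-negative numbers only
    return abs(value - reference) <= threshold_pp
```

For example, with `threshold_seconds=2` a prediction of 00:01:32 against a reference of 00:01:30 is counted as correct, whereas 00:01:35 is not.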
Missing submissions
Any VQA case for which a submission does not produce a response, including timeouts exceeding the per-question time budget, is treated as incorrect.
Aggregation: stratified buckets
Each test case carries two pieces of meta-information used for aggregation:
| 1. Robustness level | In-distribution (ID) or out-of-distribution (OOD) with respect to procedure type and question formulation. OOD cases come from procedure types not represented in the training set and from question phrasings not seen during training. |
| 2. Primary capability | Each case is mapped to exactly one primary capability based on its question intent, using the FOCUS taxonomy. |
Capability mapping
| Capability | Question intent | Example |
| Object recognition and instance matching | Which object? In which state? Where? | Which type of foreign object is visible in the lower right of the frame? |
| Temporal grounding | When? How long? | When in the segment is the sponge first introduced? |
| Aggregation | How many? | How many distinct needles appear in this segment? |
| Event and procedural understanding | Which action? | Is the clip being applied or removed in this segment? |
| Complex reasoning | Why? What happens if? | Why is the surgeon repositioning the specimen bag? |
For every model, mean Accuracy is computed within each capability × robustness bucket. This yields up to 2 × 5 = 10 bucket scores per model for the SEGMENT and PROCEDURE Tracks. The FRAME Track uses a reduced taxonomy, focused on object recognition and aggregation, and therefore has fewer buckets.
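The bucket computation can be sketched as follows, assuming each case is available as a hypothetical (capability, robustness, score) tuple with a 0/1 score. The sketch also shows why equal-weight averaging over buckets differs from pooled Accuracy when bucket sizes are unbalanced.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical case record: (capability, robustness level, score in {0, 1})
Case = Tuple[str, str, int]
Bucket = Tuple[str, str]


def bucket_accuracy(cases: Iterable[Case]) -> Dict[Bucket, float]:
    """Mean Accuracy within each capability x robustness bucket."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [correct, count]
    for capability, robustness, score in cases:
        entry = totals[(capability, robustness)]
        entry[0] += score
        entry[1] += 1
    return {bucket: correct / count for bucket, (correct, count) in totals.items()}


def mean_over_buckets(bucket_scores: Dict[Bucket, float]) -> float:
    """Equal-weight mean across buckets, independent of bucket size."""
    return sum(bucket_scores.values()) / len(bucket_scores)


# Toy data: a large, easy bucket and a small, hard bucket.
cases = [("Aggregation", "ID", 1)] * 8 + [
    ("Complex reasoning", "OOD", 1),
    ("Complex reasoning", "OOD", 0),
]
per_bucket = bucket_accuracy(cases)
equal_weight = mean_over_buckets(per_bucket)           # 0.75
pooled = sum(score for _, _, score in cases) / len(cases)  # 0.9
```

In this toy example the pooled Accuracy is dominated by the large bucket, while the equal-weight mean gives the small OOD bucket the same influence.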
Ranking procedure
The ranking is computed in three steps.
Step 1 — Per-bucket ranking with significance adjustment
Within each bucket, models are ranked by mean Accuracy. Pairwise significance tests using cluster-aware bootstrapping are then used to collapse ranks when performance differences are not significant. Two models with statistically indistinguishable bucket-level Accuracy receive the same rank, so that irrelevant differences within the noise floor of the test set do not propagate into the final ranking.
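One way such a rank collapse could work is sketched below. The official test statistic, significance level, and collapsing rule are not specified on this page, so everything here is an assumption: the sketch resamples whole source videos (the clusters), calls model A significantly better than model B if the lower bound of a percentile bootstrap interval for the Accuracy difference is above zero, and assigns each model the rank 1 plus the number of models that are significantly better than it.

```python
import random
from typing import Dict, List

# Hypothetical layout: scores[model][video_id] is the list of 0/1 case scores
# for one bucket; all models are assumed to be scored on the same cases.
Scores = Dict[str, Dict[str, List[int]]]


def significantly_better(
    a: str, b: str, scores: Scores,
    n_boot: int = 1000, alpha: float = 0.05, seed: int = 0,
) -> bool:
    """Paired cluster bootstrap on the Accuracy difference between a and b."""
    rng = random.Random(seed)
    videos = sorted(scores[a])
    diffs = []
    for _ in range(n_boot):
        sample = rng.choices(videos, k=len(videos))  # resample whole videos
        cases_a = [s for v in sample for s in scores[a][v]]
        cases_b = [s for v in sample for s in scores[b][v]]
        diffs.append(sum(cases_a) / len(cases_a) - sum(cases_b) / len(cases_b))
    diffs.sort()
    lower = diffs[int(alpha / 2 * n_boot)]  # lower percentile bound
    return lower > 0


def collapsed_ranks(scores: Scores, **kwargs) -> Dict[str, int]:
    """Rank = 1 + number of models that are significantly better."""
    models = sorted(scores)
    return {
        m: 1 + sum(significantly_better(o, m, scores, **kwargs)
                   for o in models if o != m)
        for m in models
    }
```

Under this assumed rule, two models whose difference is not significant in either direction, and which are beaten by the same set of models, receive the same rank.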
Step 2 — Aggregating buckets via the Copeland method
The per-bucket rankings are combined into a single overall ranking using the Copeland method [Rofin et al., 2023]:
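- For every pair of models A and B, count the number of buckets in which A is ranked strictly higher than B, and the number of buckets in which B is ranked strictly higher than A.
- Model A dominates model B if A is ranked higher than B in more buckets than B is ranked higher than A. If the two counts are equal, neither model dominates the other.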
- Each model's Copeland score is the number of models it dominates minus the number of models that dominate it.
- Higher Copeland scores are better; models are ordered by descending Copeland score.
This approach was chosen over linear schemes such as the Borda rule (equivalent to averaging ranks) because it is more robust to irrelevant alternatives. Each pairwise comparison uses only the relative order of the two models involved, so adding or removing a weak submission does not arbitrarily perturb the ordering at the top.
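The counting rule of Step 2 can be sketched in a few lines. The input layout (one dictionary of model ranks per bucket, lower rank is better) and the function name are assumptions for this example; the final aggregation code may differ.

```python
from itertools import combinations
from typing import Dict, List


def copeland_scores(bucket_ranks: List[Dict[str, int]]) -> Dict[str, int]:
    """Copeland score: models dominated minus models dominating."""
    models = sorted(bucket_ranks[0])
    scores = {model: 0 for model in models}
    for a, b in combinations(models, 2):
        # Count buckets in which one model is ranked strictly higher (lower number).
        a_wins = sum(ranks[a] < ranks[b] for ranks in bucket_ranks)
        b_wins = sum(ranks[b] < ranks[a] for ranks in bucket_ranks)
        if a_wins > b_wins:    # a dominates b
            scores[a] += 1
            scores[b] -= 1
        elif b_wins > a_wins:  # b dominates a
            scores[b] += 1
            scores[a] -= 1
        # Equal counts: neither model dominates, both scores are unchanged.
    return scores


# Toy example with three buckets; ranks may be shared after Step 1.
bucket_ranks = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 1, "B": 1, "C": 2},
    {"A": 2, "B": 1, "C": 3},
]
scores = copeland_scores(bucket_ranks)  # {'A': 1, 'B': 1, 'C': -2}
```

In the toy example, A and B each win one bucket against the other and share the third, so neither dominates; both dominate C. The resulting tie between A and B is the kind of situation that Step 3 resolves.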
Step 3 — Tie-breaking via bootstrap win rate
Ties within the top three positions of the Copeland ranking are resolved by directly comparing the tied models:
- Within every bucket, draw K bootstrap samples with replacement of the case-level scores, respecting the clustering of cases within source videos.
- For each bootstrap sample, identify the tied model with the highest Accuracy in that bucket. This is one win.
- The win rate of a model in a bucket is the fraction of bootstrap samples in which it wins. The model's overall win rate is the mean win rate across all buckets.
- Tied models are ordered by descending overall win rate.
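The tie-breaking steps above can be sketched as follows. The data layout and function names are hypothetical, whole source videos are resampled to respect clustering, and splitting a win equally when several tied models reach the same bootstrap Accuracy is an assumption that the procedure above does not specify.

```python
import random
from typing import Dict, List, Sequence

# Hypothetical layout: bucket[model][video_id] is the list of 0/1 case scores;
# all tied models are assumed to be scored on the same cases.
BucketScores = Dict[str, Dict[str, List[int]]]


def bucket_win_rates(
    bucket: BucketScores, tied: Sequence[str], k: int = 1000, seed: int = 0
) -> Dict[str, float]:
    """Fraction of K cluster-bootstrap samples won by each tied model."""
    rng = random.Random(seed)
    videos = sorted(bucket[tied[0]])
    wins = {model: 0.0 for model in tied}
    for _ in range(k):
        sample = rng.choices(videos, k=len(videos))  # resample whole videos
        accuracy = {}
        for model in tied:
            cases = [s for v in sample for s in bucket[model][v]]
            accuracy[model] = sum(cases) / len(cases)
        best = max(accuracy.values())
        winners = [m for m in tied if accuracy[m] == best]
        for model in winners:
            wins[model] += 1.0 / len(winners)  # assumption: split shared wins
    return {model: wins[model] / k for model in tied}


def overall_win_rates(
    buckets: Sequence[BucketScores], tied: Sequence[str], k: int = 1000, seed: int = 0
) -> Dict[str, float]:
    """Mean of the per-bucket win rates across all buckets."""
    per_bucket = [bucket_win_rates(b, tied, k, seed) for b in buckets]
    return {m: sum(w[m] for w in per_bucket) / len(per_bucket) for m in tied}
```

Tied models would then be ordered by descending value of the result, for example with `sorted(rates, key=rates.get, reverse=True)`.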
Outperforming the baselines
Two baseline submissions are provided:
| 1. Frontier closed-source VLM | A state-of-the-art frontier model, for example GPT-class or Gemini-class, applied zero-shot and selected by best validation performance. |
| 2. Fine-tuned open-source VLM | A strong open-source VLM fine-tuned on the challenge training data by the organizers. |
During the pre-evaluation phase, a team is considered to outperform a baseline if its mean Accuracy across all buckets is higher than the baseline's. Only teams that beat both baselines during pre-evaluation are admitted to the final test phase. To be eligible for a prize in the final phase, a team must also rank above both baselines in the final ranking.
The full Copeland-based ranking, not just mean Accuracy, is applied in the final test phase.
Open-source evaluation code
The evaluation code for the pre-evaluation phase is released publicly as part of the `orena-focus` Python package, so that participants can reproduce per-case scoring locally on the released training data. The aggregation and ranking code for the final phase will be released soon.
The specific judge LLMs used for open-ended scoring remain undisclosed during the challenge to prevent over-fitting to a particular judge.