Resources

  • ORena SAVE FOCUS structured Challenge Design - FRAME Track: LINK
  • ORena SAVE FOCUS Submission Template Code: Will be released by 2026/06/15
  • ORena SAVE FOCUS Evaluation and Ranking Toolkit: Will be released by 2026/06/15
  • ORena SAVE FOCUS first data batch: HeiCo-FOCUS-VQA

All Training Data

The complete ORena SAVE FOCUS training data will consist of two batches. The VQA pairs are generated and quality-controlled by expert annotators.

First training data batch
30 colorectal surgery videos
Second training data batch
170 laparoscopic cholecystectomy videos
Total training videos
200 videos
Total VQA pairs
50,000 questions

The data will be hosted via Hugging Face and can be accessed conveniently through the ORena SAVE FOCUS Python package.


First Training Data Batch

The first training data batch is based on the HeiCo-FOCUS dataset, built on {Hei}delberg {Co}lorectal surgeries. HeiCo-FOCUS contains 30 colorectal surgery videos and 15,000 VQA pairs covering five core capability groups: object recognition and identity matching, temporal grounding, aggregation, event and procedural understanding, and complex reasoning.

For the FRAME Track, the first training data batch contains 6,000 single-image VQA pairs. These questions focus on information that can be inferred from a single laparoscopic image, such as visible foreign object identification, counting, attributes, and spatial localization.

As also can be seen in Figure 1, HeiCo-FOCUS was constructed through a rigorous multi-stage annotation pipeline involving large-scale annotation and 39 domain experts to ensure high quality and clinical relevance. To systematically probe model behavior, the benchmark uses a multi-track evaluation framework that progressively increases temporal and contextual demands from single frames to full procedures.

Extensive experiments demonstrate that HeiCo-FOCUS pushes current VLMs to their limits: although leading models substantially outperform blind baselines, top overall Accuracy remains below 60% across all context lengths and drops to 37% on long-horizon PROCEDURE tasks. We therefore expect HeiCo-FOCUS to serve as a catalyst for the development of models capable of reliable, temporally consistent reasoning over hours-long videos.


Figure 1: Overview of the HeiCo-FOCUS benchmark. a) Clinical motivation. Ensuring the retrieval of foreign objects at the end of a surgery is critical, as retained objects can lead to severe complications. b) Dataset overview. The dataset comprises 96 hours of surgical video annotated with 15,000 visual question answering (VQA) pairs spanning five core capabilities.

Our taxonomy for surgical VQA is depicted in Figure 2. For question categorization, each question is assigned to one primary capability category and may optionally be tagged with additional secondary sub-categories to reflect its multi-faceted nature. The primary category is determined by the dominant reasoning requirement of the question. For example, “Where” questions are assigned to object recognition and identity matching ➡ spatial localization, “How many” questions to aggregation, and “When” questions to temporal grounding ➡ temporal localization. This primary assignment enables the balancing of VQA pairs across capabilities while secondary labels allow fine-grained stratified evaluation. The complete taxonomy is provided further below.


Figure 2: HeiCo-FOCUS taxonomy with sample questions.


Second Data Batch

The second training data batch will consist of 170 videos of laparoscopic cholecystectomies. These videos have not been made public before and therefore represent new challenge data.

The second data batch will contain:

FRAME
14,000 VQA pairs
SEGMENT
14,000 VQA pairs
PROCEDURE
7,000 VQA pairs

Together with the first training data batch, this results in 50,000 training VQA pairs.


Validation and Test Data

Test data will include 200 videos from a broad range of procedures, including cholecystectomies and additional procedure types. These test videos will not be conveyed to the participants. Leaderboard validation data will include 20 additional videos representative of the test data.


Taxonomy

The following taxonomy defines the capability groups and sub-capabilities used to categorize VQA pairs across the ORena SAVE FOCUS Challenge. Not all sub-capabilities are used in every track: the FRAME Track focuses on sub-capabilities that can be evaluated from a single image, while the SEGMENT and PROCEDURE tracks additionally cover temporal and long-context reasoning.

1. Object recognition and identity matching: Recognition of object instances, their semantic type, attributes, spatial context, and identity consistency across time. Subcategories:

  • 1a. Object identification: semantic classification of visible object instances
  • 1b. Object instance identity matching: correspondence of object instances across time
  • 1c. Object attributes and state: recognition of observable object properties or handling states
  • 1d. Object spatial localization (camera): object localization relative to the image plane
  • 1e. Object spatial localization (situs): object localization relative to anatomical structures

2. Temporal grounding: Localization of events or object occurrences within the video timeline, including their temporal position and duration. Subcategories:

  • 2a. Temporal localization: identification of the time point or interval at which an event or object occurrence takes place
  • 2b. Duration estimation: estimation of the temporal extent for which an event or object occurrence persists

3. Aggregation: Combination of information on foreign objects and events across multiple objects, instances, and/or time points. Subcategories:

  • 3a. Object aggregation: aggregation of information across multiple foreign object instances and/or categories
  • 3b. Event aggregation: aggregation of events that involve foreign objects across time

4. Event and procedural understanding: Recognition and interpretation of actions, events, and their procedural structure over time. Subcategories:

  • 4a. Foreign object interaction recognition: identification of actions or manipulations involving a foreign object
  • 4b. Foreign object usage purpose: identification of the general procedural role of a foreign object at a given time. Answers are grounded in what is visually demonstrated; no inference about hypothetical outcomes or hidden intent
  • 4c. Temporal ordering: determination of the relative order of events or procedural steps, including the identification of the n-th event in a series of events

5. Complex reasoning: Inference of functional, causal, relational, or outcome-related information beyond direct observation, and reasoning over interactions, dependencies, comparative patterns, and temporally extended evidence across objects and events. Subcategories:

  • 5a. Functional reasoning: inference of the intended function or role of an object or action beyond what is directly shown
  • 5b. Causal and consequence reasoning: inference of cause–effect relationships between actions, events, or object states and inference of potential or actual outcomes resulting from actions or events
  • 5c. Multi-step compositional reasoning: inference that requires the execution of a sequence of interdependent reasoning steps across objects, space, and time, involving the decomposition of a query into multiple sub-tasks to derive an answer that cannot be attributed to a single dominant capability