Task

The goal of the OmniLabel benchmark is to detect all objects in an image given any natural language description.

We define the task as building a model M that takes as input an image along with a label space D and outputs bounding boxes according to that label space. With free-form text descriptions of objects, the label space is virtually infinite in size. The OmniLabel dataset therefore provides a relevant set of object descriptions for each image, where each description can refer to zero, one, or many objects. Note that the label space D is hence different for each image. An example is the label space D = ["Person", "Donuts", ..., "Chocolate donut", "Donut with green glaze"] shown in the figure below.

Compared to standard object detection benchmarks, OmniLabel evaluates detection ability based on rich, free-form text descriptions that go beyond plain categories. Note, however, that both plain categories and specific object descriptions are part of OmniLabel's label space. Our benchmark is also more challenging than traditional referring expression benchmarks: the text descriptions are often more complex and can refer not only to a single instance, but to zero, one, or many.

Okay ... and how exactly can I evaluate my V&L-Detection model on this benchmark?

Inputs:

Each sample in the dataset is a pair of an image I and a label space D. Each image I has a unique image_id. The label space D is a list of object descriptions, each with a unique description_id and a text, e.g., D = [{id=321, text="Person"}, {id=223, text="Donuts"}, {id=4321, text="Chocolate donut"}, {id=12, text="Donut with green glaze"}].
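
To make the input concrete, here is a minimal sketch in Python of one such sample. The variable names, the image_id value, and the run_model wrapper are hypothetical; how you actually load the data depends on your pipeline.

from typing import Dict, List, Tuple

# One sample: an image_id paired with its image-specific label space D.
# The IDs and texts below mirror the example above.
Sample = Tuple[int, List[Dict]]

sample: Sample = (
    42,  # image_id, unique per image (value chosen for illustration)
    [    # label space D: one entry per object description
        {"id": 321,  "text": "Person"},
        {"id": 223,  "text": "Donuts"},
        {"id": 4321, "text": "Chocolate donut"},
        {"id": 12,   "text": "Donut with green glaze"},
    ],
)

def run_model(image, label_space: List[Dict]):
    """Hypothetical wrapper around a V&L-Detection model M: for the given
    image, return (bbox, score, description_id) triplets for every object
    that matches some description in the per-image label space. A
    description that matches no object simply yields no triplets."""
    ...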

Outputs:

The expected output of a model M is a set of triplets (bbox, score, description_id). Each triplet consists of a bounding box, a confidence score, and an index linking the prediction to an object description in D. A bounding box bbox = (x,y,w,h) consists of 4 coordinates in the image space and defines the extent of an object. The confidence score is a real-valued scalar expressing the confidence in the model's prediction. Finally, the index description_id points to an ID in the label space D and indicates that the box is described by the corresponding text. Note that one object in the image may be referred to by multiple object descriptions (e.g., “person” and “woman in red shirt”), in which case the output should be a single bounding box that points to both descriptions.

To submit your results on the test set, please use the following format:

[

    {

        image_id        ... the image id this predicted box belongs to

        bbox            ... the bounding box coordinates of the object (x,y,w,h)

        description_ids ... list of description IDs that refer to this object

        scores          ... list of confidences, one for each description

    },

    ...

]
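
As an illustration, the following Python sketch turns raw (bbox, score, description_id) triplets into the submission format above. The input triplets, the image_id, and the output file name are made up for this example, and grouping by exact bounding-box equality is a simplification of whatever box-merging logic your model uses.

import json
from collections import defaultdict

# Hypothetical raw predictions for one image (image_id 42):
# (bbox, score, description_id) triplets as defined above.
raw_predictions = [
    ((12.0, 40.5, 110.0, 220.0), 0.91, 321),   # "Person"
    ((300.2, 180.0, 64.0, 60.5), 0.84, 223),   # "Donuts"
    ((300.2, 180.0, 64.0, 60.5), 0.80, 4321),  # same box, "Chocolate donut"
]
image_id = 42

# Group triplets that share a bounding box so that each predicted box is
# reported once, with a list of description IDs and a matching list of scores.
grouped = defaultdict(lambda: {"description_ids": [], "scores": []})
for bbox, score, desc_id in raw_predictions:
    grouped[bbox]["description_ids"].append(desc_id)
    grouped[bbox]["scores"].append(score)

results = [
    {
        "image_id": image_id,
        "bbox": list(bbox),
        "description_ids": entry["description_ids"],
        "scores": entry["scores"],
    }
    for bbox, entry in grouped.items()
]

with open("omnilabel_predictions.json", "w") as f:
    json.dump(results, f)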

Evaluation metric

To evaluate a model M on our task, we follow the basic evaluation protocol of standard object detection with the Average Precision (AP) metric. This metric computes precision-recall curves at various Intersection-over-Union (IoU) thresholds to evaluate both the classification and the localization ability of a model M. Compared to traditional object detection, we adjust the evaluation protocol to account for the complex, free-form text object descriptions, which make the label space virtually infinite in size.
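
To recall the localization part of the metric, here is a minimal, self-contained sketch of IoU for boxes in (x,y,w,h) format. The threshold and the print-out are only illustrative; the official computation is the one implemented in our toolkit.

def iou_xywh(box_a, box_b):
    """Intersection-over-Union for two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Intersection rectangle
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A prediction can only count as a true positive for a description if it is
# matched to a ground-truth box of that description with IoU above the
# threshold (e.g., 0.5); AP then summarizes precision over recall levels
# and over several IoU thresholds.
print(iou_xywh((10, 10, 100, 100), (20, 20, 100, 100)))  # ~0.68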

Our evaluation toolkit is available on GitHub.
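
For orientation, here is a usage sketch assuming the toolkit exposes a pycocotools-style interface. The class and method names below are assumptions, so please check the repository README for the actual API and file paths.

# Assumed, pycocotools-style interface; consult the toolkit README for the
# exact class and method names before use.
from omnilabeltools import OmniLabel, OmniLabelEval

gt = OmniLabel("omnilabel_val.json")            # ground-truth annotations (hypothetical path)
dt = gt.load_res("omnilabel_predictions.json")  # predictions in the format above
evaluator = OmniLabelEval(gt, dt)
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()                           # prints the AP results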

How to train a model?

The OmniLabel dataset is an evaluation-only benchmark. We encourage the use of multiple existing datasets with different forms of annotation to train a model. To provide a common playground and a fair comparison between models in the official challenge, we define a few tracks, each with a different set of allowed training datasets and pre-trained models.

Track A

Track B

Track C

Disallowed: Do not use the validation/test sets of COCO, Objects365, or OpenImages for training. We also discourage the use of the OmniLabel validation set for training or fine-tuning.

Relation to existing tasks

Difference to object detection benchmarks: The main difference is the label space, which is more complex (natural-text object descriptions) as well as dynamic (its size changes for every test image). Standard object detectors fail at this task because of their fixed label space assumption. To reflect these differences, our results format also differs from the conventional detection format, as the comparison below shows: it allows multiple descriptions for each predicted box, with corresponding scores.

[

    {

        image_id

        bbox

        category_id     ... a single category ID

        score           ... real-valued scalar

    },

    ...

]

Results format for detection evaluation using pycocotools

[

    {

        image_id

        bbox

        description_ids  ... list of description IDs

        scores           ... list of confidence scores

    },

    ...

]

Results format for OmniLabel using our evaluation tool
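
If your pipeline already produces pycocotools-style results, the conversion amounts to merging entries that share an image and a bounding box. Below is a minimal sketch with made-up data, assuming the category IDs have already been mapped to the description IDs of the per-image label space.

# pycocotools-style detections (hypothetical values).
coco_style = [
    {"image_id": 42, "bbox": [300.2, 180.0, 64.0, 60.5], "category_id": 223,  "score": 0.84},
    {"image_id": 42, "bbox": [300.2, 180.0, 64.0, 60.5], "category_id": 4321, "score": 0.80},
]

# Merge detections that share (image_id, bbox) into one OmniLabel entry with
# lists of description IDs and scores.
merged = {}
for det in coco_style:
    key = (det["image_id"], tuple(det["bbox"]))
    entry = merged.setdefault(key, {"image_id": det["image_id"],
                                    "bbox": det["bbox"],
                                    "description_ids": [],
                                    "scores": []})
    entry["description_ids"].append(det["category_id"])
    entry["scores"].append(det["score"])

omnilabel_style = list(merged.values())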

Difference to referring expression benchmarks: While the task definition is similar, there are important differences in the data and the evaluation. First, the object descriptions in D range from plain categories to highly specific descriptions. Second, in most referring expression benchmarks, a description refers to exactly one instance in the image; PhraseCut is the only exception, where expressions can refer to multiple instances. Our task and evaluation data are defined more broadly: any object description can refer to zero, one, or multiple instances in the image. Even a specific description like “person wearing red shirt and sunglasses” may not be present in an image, in which case the model M needs to output an empty set. Third, the object descriptions we collect are more challenging and also include negations, e.g., a negated attribute like “cup NOT on the table”.