Task
The goal of the OmniLabel benchmark is to detect all objects in an image given any natural language description.
We define the task as building a model M that takes as input an image along with a label space D and outputs bounding boxes according to the label space D. With free-form text descriptions of objects, the label space is virtually infinite in size. The OmniLabel dataset therefore provides a relevant set of object descriptions for each image, where each description can refer to zero, one, or multiple objects. Note that the label space D is thus different for each image. The figure below shows an example of the label space D = ["Person", "Donuts", ..., "Chocolate donut", "Donut with green glaze"].

Compared to standard object detection benchmarks, OmniLabel evaluates detection ability based on rich free-form text descriptions that go beyond plain categories. Note, however, that both plain categories and specific object descriptions are part of OmniLabel's label space. The benchmark is also more challenging than traditional referring expression benchmarks: the text descriptions are often more complex and refer not only to a single instance, but to zero, one, or multiple instances.
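To make the input/output contract concrete, below is a minimal Python sketch of the task interface. The names (`Detection`, `detect`), the box format, and the callable `model` are illustrative assumptions for this sketch, not the benchmark's actual API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    bbox: List[float]       # hypothetical box format: [x, y, width, height] in pixels
    description_index: int  # index into this image's label space D
    score: float            # confidence that the box matches the description

def detect(model, image, label_space: List[str]) -> List[Detection]:
    """Run model M on one image with its image-specific label space D.

    The label space mixes plain categories ("Person") and free-form
    descriptions ("Donut with green glaze"). Each entry of D may match
    zero, one, or many objects, so the returned list can contain any
    number of detections per description (including none).
    """
    return model(image, label_space)

# Example label space for one image (see the figure above):
D = ["Person", "Donuts", "Chocolate donut", "Donut with green glaze"]
```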