Kosmos Class Documentation

The Kosmos class is a Python class that provides a set of methods for performing various tasks related to multimodal grounding, referring expression comprehension, referring expression generation, grounded visual question answering, and grounded image captioning. The class uses the transformers library to load a pretrained model and processor for performing these tasks.

Class Definition

class Kosmos:
    def __init__(self, model_name="ydshieh/kosmos-2-patch14-224")
    def get_image(self, url)
    def run(self, prompt, image)
    def __call__(self, prompt, image)
    def multimodal_grounding(self, phrase, image_url)
    def referring_expression_comprehension(self, phrase, image_url)
    def referring_expression_generation(self, phrase, image_url)
    def grounded_vqa(self, question, image_url)
    def grounded_image_captioning(self, image_url)
    def grounded_image_captioning_detailed(self, image_url)
    def draw_entity_boxes_on_image(image, entities, show=False, save_path=None)
    def generate_boxees(self, prompt, image_url)

Class Methods

`init(self, model_name="ydshieh/kosmos-2-patch14-224")`

This method initializes the Kosmos class. It loads the pretrained model and processor from the transformers library.

Parameters

model_name (str): The name of the pretrained model to load. Default is "ydshieh/kosmos-2-patch14-224".

`get_image(self, url)`

This method retrieves an image from a given URL.

Parameters

url (str): The URL of the image to retrieve.

Returns

Image: The retrieved image.

`run(self, prompt, image)`

This method runs the model with a given prompt and image.

Parameters

prompt (str): The prompt to use for the model.
image (Image): The image to use for the model.

`call(self, prompt, image)`

This method allows the Kosmos class to be called like a function. It runs the model with a given prompt and image.

Parameters

prompt (str): The prompt to use for the model.
image (Image): The image to use for the model.

`multimodal_grounding(self, phrase, image_url)`

This method performs multimodal grounding with a given phrase and image URL.

Parameters

phrase (str): The phrase to use for multimodal grounding.
image_url (str): The URL of the image to use for multimodal grounding.

`referring_expression_comprehension(self, phrase, image_url)`

This method performs referring expression comprehension with a given phrase and image URL.

Parameters

phrase (str): The phrase to use for referring expression comprehension.
image_url (str): The URL of the image to use for referring expression comprehension.

`referring_expression_generation(self, phrase, image_url)`

This method generates referring expressions with a given phrase and image URL.

Parameters

phrase (str): The phrase to use for referring expression generation.
image_url (str): The URL of the image to use for referring expression generation.

`grounded_vqa(self, question, image_url)`

This method performs grounded visual question answering with a given question and image URL.

Parameters

question (str): The question to use for grounded visual question answering.
image_url (str): The URL of the image to use for grounded visual question answering.

`grounded_image_captioning(self, image_url)`

This method generates a grounded image caption for a given image URL.

Parameters

image_url (str): The URL of the image to caption.

`grounded_image_captioning_detailed(self, image_url)`

This method generates a detailed grounded image caption for a given image URL.

Parameters

image_url (str): The URL of the image to caption.

`draw_entity_boxes_on_image(image, entities, show=False, save_path=None)`

This method draws bounding boxes around entities in an image.

Parameters

image (Image or str or torch.Tensor): The image or image path or image tensor on which to draw bounding boxes.
entities (list): A list of entities to draw bounding boxes around.
show (bool): Whether to display the image. Default is False.
save_path (str): The path to save the image. If None, the image

Returns

new_image (numpy.ndarray): The image with bounding boxes drawn around entities.

`generate_boxees(self, prompt, image_url)`

This method generates bounding boxes for entities in an image based on a given prompt.

Parameters

prompt (str): The prompt to use for generating bounding boxes.
image_url (str): The URL of the image to use for generating bounding boxes.

Usage Examples

Example 1: Multimodal Grounding

from exa import Kosmos

kosmos = Kosmos()
kosmos.multimodal_grounding("Find the red apple in the image.", "https://example.com/apple.jpg")

Example 2: Referring Expression Comprehension

from exa import Kosmos

kosmos = Kosmos()
kosmos.referring_expression_comprehension("Show me the green bottle.", "https://example.com/bottle.jpg")

Example 3: Grounded Visual Question Answering

from exa import Kosmos

kosmos = Kosmos()
kosmos.grounded_vqa("What is the color of the car?", "https://example.com/car.jpg")

Additional Information

The Kosmos class uses the transformers library to load a pretrained model and processor. The model is used to generate responses based on the given prompts and images, and the processor is used to process the inputs and outputs of the model.

The Kosmos class provides a set of methods for performing various tasks related to multimodal grounding, referring expression comprehension, referring expression generation, grounded visual question answering, and grounded image captioning. These tasks involve generating responses based on the given prompts and images, and drawing bounding boxes around entities in the images.

The Kosmos class also provides a method for retrieving an image from a given URL, and a method for running the model with a given prompt and image. These methods are used internally by the other methods of the class.

The Kosmos class can be used in a variety of applications, such as image captioning, visual question answering, and object detection. It provides a simple and intuitive interface for performing these tasks, making it easy to use for both beginners and experienced developers.