Cloudinary AI Vision (Beta)

Last updated: Jan-17-2025

Important
The Cloudinary AI Vision add-on is currently in development and is available as a Public Beta. We value your feedback, so please feel free to share any thoughts with us.

Cloudinary is a cloud-based service that provides solutions for image and video management. These include server or client-side upload, on-the-fly image and video transformations, fast CDN delivery, and a variety of asset management options.

The Cloudinary AI Vision add-on combines LLM (Large Language Model) capabilities, specialized models, advanced algorithms, prompt engineering, and Cloudinary's knowledge to interpret visual content and respond to queries about it. It answers questions (e.g., "Are there flowers?") and requests (e.g., "Describe this image") about an image's content. By combining visual and textual data, AI Vision provides a more adaptable understanding of content, which lets businesses tailor solutions to their own brand and customer expectations.

AI Vision is designed for a variety of needs across different industries. It streamlines content moderation, media classification, and content understanding by automating the analysis, tagging, and moderation of visual content.

Note
AI Vision uses the Analyze API and doesn't require the image to be stored in your Cloudinary account. The AI Vision methods accept either the asset_id of an image in your Cloudinary account, or a valid uri to an image.

Getting started

Before you can use the Cloudinary AI Vision add-on:

  • You must have a Cloudinary account. If you don't already have one, you can sign up for a free account.

  • Register for the add-on: make sure you're logged in to your account and then go to the Add-ons page. For more information about add-on registrations, see Registering for add-ons.

  • Keep in mind that many of the examples on this page use our SDKs. For SDK installation and configuration details, see the relevant SDK guide.

  • If you're new to Cloudinary, you may want to take a look at the Developer Kickstart for a hands-on, step-by-step introduction to Programmable Media features.

Overview

AI Vision offers scalable solutions for handling large volumes of media assets. It is designed as a ready-to-use experience that you can integrate without complex customization or prompt engineering. The add-on supports the following modes:

  • Tagging - Automatically tag images based on provided definitions.
  • Moderation - Evaluate images against specific moderation questions.
  • General - Gain insights from images by asking open-ended questions.

Tagging mode

The Tagging mode accepts a list of tag names along with their corresponding descriptions. If the image matches a description, which may cover several elements, the response includes the corresponding tag. This approach lets you align with your own brand taxonomy, offering a flexible and open method for image classification.

To return the tags for an image based on the provided definitions, call the ai_vision_tagging method with the following parameters:

  • source: The image to be analyzed. Either a uri or an asset_id can be specified.
  • tag_definitions: A list of tag definitions containing names and descriptions (max 10).
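
As a minimal sketch of how the two parameters above fit together, the following Python builds a request body and enforces the limit of 10 tag definitions before anything is sent. The JSON layout (a source object holding either uri or asset_id, and tag_definitions as a list of name/description objects) is an assumption inferred from the parameter descriptions, and the helper names build_source and build_tagging_request, the sample tag, and the URL are hypothetical.

```python
import json

MAX_TAG_DEFINITIONS = 10  # limit stated in the parameter list


def build_source(uri=None, asset_id=None):
    """Return the source object; exactly one of uri or asset_id must be given."""
    if (uri is None) == (asset_id is None):
        raise ValueError("specify exactly one of uri or asset_id")
    return {"uri": uri} if uri is not None else {"asset_id": asset_id}


def build_tagging_request(tag_definitions, uri=None, asset_id=None):
    """Build a body for ai_vision_tagging (assumed layout, not confirmed by the docs)."""
    if not 1 <= len(tag_definitions) <= MAX_TAG_DEFINITIONS:
        raise ValueError("tag_definitions must contain 1 to 10 entries")
    definitions = []
    for name, description in tag_definitions:
        if not name or not description:
            raise ValueError("each tag needs a name and a description")
        definitions.append({"name": name, "description": description})
    return {
        "source": build_source(uri=uri, asset_id=asset_id),
        "tag_definitions": definitions,
    }


body = build_tagging_request(
    [("bag", "Does the image show a handbag, backpack, or similar bag?")],
    uri="https://example.com/images/sample.jpg",  # placeholder URL
)
print(json.dumps(body, indent=2))
```

Validating on the client side keeps an oversized list from consuming a request. The SDK guides remain the reference for how the body is actually submitted.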

Example Request:

Example Response:

Moderation mode

The Moderation mode accepts multiple questions about an image and answers each one with "yes," "no," or "unknown." This lets you evaluate whether the image adheres to specific content policies, creative specs, or aesthetic criteria.

To evaluate images against specific moderation questions, call the ai_vision_moderation method with the following parameters:

  • source: The image to be analyzed. Either a uri or an asset_id can be specified.
  • rejection_questions: A list of yes/no questions to ask (max 10).
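
The sketch below shows one way to prepare a moderation request and act on the three possible answers. It assumes the same JSON layout as the parameter list suggests, and it assumes that a "yes" answer to a rejection question means the image should be rejected. The helper names and the reject_on_unknown switch are hypothetical, and the answers are passed in as plain strings because the response format is not described in this section.

```python
MAX_QUESTIONS = 10  # limit stated in the parameter list
VALID_ANSWERS = {"yes", "no", "unknown"}


def build_moderation_request(rejection_questions, uri=None, asset_id=None):
    """Build a body for ai_vision_moderation (assumed layout)."""
    if (uri is None) == (asset_id is None):
        raise ValueError("specify exactly one of uri or asset_id")
    questions = [q.strip() for q in rejection_questions]
    if not 1 <= len(questions) <= MAX_QUESTIONS or not all(questions):
        raise ValueError("provide 1 to 10 non-empty questions")
    source = {"uri": uri} if uri is not None else {"asset_id": asset_id}
    return {"source": source, "rejection_questions": questions}


def should_reject(answers, reject_on_unknown=False):
    """Decide on rejection from a list of answer strings, one per question."""
    normalized = [a.strip().lower() for a in answers]
    for answer in normalized:
        if answer not in VALID_ANSWERS:
            raise ValueError(f"unexpected answer: {answer!r}")
    if "yes" in normalized:
        return True  # assumption: "yes" to a rejection question means reject
    return reject_on_unknown and "unknown" in normalized


request_body = build_moderation_request(
    ["Does the image contain alcohol?", "Does the image contain nudity?"],
    asset_id="sample-asset-id",  # placeholder asset ID
)
print(request_body)
print(should_reject(["no", "unknown"]))
```

Whether "unknown" should count as a rejection is a policy decision, which is why the sketch leaves it as an explicit switch.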

Example Request:

Example Response:

General mode

The General mode serves a wide array of applications by providing detailed answers to diverse questions about an image. Users can inquire about any aspect of an image, such as identifying objects, understanding scenes, or interpreting text within the image.

To ask general questions, call the ai_vision_general method with the following parameters:

  • source: The image to be analyzed. Either a uri or an asset_id can be specified.
  • prompts: A list of questions or requests to ask (max 10).
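
Because prompts is capped at 10 entries, a longer list of questions has to be split across several calls. The following sketch shows one way to do that, under the same assumed JSON layout as the parameter list suggests. The function name build_general_requests and the sample values are hypothetical.

```python
MAX_PROMPTS = 10  # limit stated in the parameter list


def build_general_requests(prompts, uri=None, asset_id=None):
    """Split prompts into ai_vision_general bodies of at most 10 prompts each."""
    if (uri is None) == (asset_id is None):
        raise ValueError("specify exactly one of uri or asset_id")
    prompts = list(prompts)
    if not prompts:
        raise ValueError("at least one prompt is required")
    source = {"uri": uri} if uri is not None else {"asset_id": asset_id}
    return [
        {"source": dict(source), "prompts": prompts[i:i + MAX_PROMPTS]}
        for i in range(0, len(prompts), MAX_PROMPTS)
    ]


requests_to_send = build_general_requests(
    [f"Question {n}" for n in range(1, 24)],  # 23 placeholder prompts
    asset_id="sample-asset-id",  # placeholder asset ID
)
print([len(r["prompts"]) for r in requests_to_send])  # prints [10, 10, 3]
```

Each request in the list is analyzed separately, so splitting a long prompt list this way means the image is submitted once per batch.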

Example Request:

Example Response:

Tokens

Your AI Vision Add-on quota is based on tokens. A token is a unit of measurement, similar to a word, used to quantify the processing required. Tokens can represent both text and images, with pricing based on the number of tokens processed.

  • Input tokens: Data sent to AI Vision, like text or images. Images are treated as input and converted into tokens.
  • Output tokens: Data generated by AI Vision in response, like text descriptions.

Consolidating input and output into a single token count gives a clear picture of the total tokens used.
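
To make the arithmetic concrete, here is a small sketch that adds input and output tokens per operation and subtracts the running total from a quota. The TokenUsage class, the remaining_quota function, and all of the numbers are hypothetical and for illustration only; they do not reflect actual token costs or the add-on's response format.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Token counts for one operation (hypothetical bookkeeping record)."""
    input_tokens: int   # text and image data sent for analysis
    output_tokens: int  # text generated in response

    @property
    def total(self):
        return self.input_tokens + self.output_tokens


def remaining_quota(quota, usages):
    """Return the quota left after subtracting the total of all operations."""
    return quota - sum(usage.total for usage in usages)


operations = [TokenUsage(900, 50), TokenUsage(1200, 80)]  # made-up counts
print([op.total for op in operations])      # prints [950, 1280]
print(remaining_quota(10_000, operations))  # prints 7770
```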

Every response also includes a limits node with the number of tokens used by the operation. For example:
