Last updated: Jan-17-2025
Cloudinary is a cloud-based service that provides solutions for image and video management. These include server or client-side upload, on-the-fly image and video transformations, fast CDN delivery, and a variety of asset management options.
The Cloudinary AI Vision add-on is a service that combines LLM (Large Language Model) capabilities, specialized models, advanced algorithms, prompt engineering, and Cloudinary's own knowledge to interpret visual content. It answers questions (e.g., "Are there flowers?") and fulfills requests (e.g., "Describe this image") about an image's content. Because it works with visual and textual data together, AI Vision lets businesses tailor its output to their own brand and customer expectations.
AI Vision serves a variety of needs across different industries, including content moderation, media classification, and content understanding, by automating the analysis, tagging, and moderation of visual content.
The image to analyze can be specified as either the asset_id of an image in your Cloudinary account or a valid uri to an image.
Getting started
Before you can use the Cloudinary AI Vision add-on:
You must have a Cloudinary account. If you don't already have one, you can sign up for a free account.
Register for the add-on: make sure you're logged in to your account and then go to the Add-ons page. For more information about add-on registrations, see Registering for add-ons.
Keep in mind that many of the examples on this page use our SDKs. For SDK installation and configuration details, see the relevant SDK guide.
If you're new to Cloudinary, you may want to take a look at the Developer Kickstart for a hands-on, step-by-step introduction to Programmable Media features.
Overview
AI Vision is built to handle large volumes of media assets and to work out of the box, so you can integrate it without complex customization or prompt engineering of your own. The add-on supports the following modes:
- Tagging - Automatically tag images based on provided definitions.
- Moderation - Evaluate images against specific moderation questions.
- General - Gain insights from images by asking open-ended questions.
Tagging mode
The Tagging mode accepts a list of tag names along with their corresponding descriptions. If the image matches a description, which may cover several elements, the corresponding tag is returned in the response. This approach lets customers align tagging with their own brand taxonomy, offering a flexible and open method for image classification.
To return the tags for an image based on the provided definitions, call the ai_vision_tagging method with the following parameters:
- source: The image to be analyzed. Either a uri or an asset_id can be specified.
- tag_definitions: A list of tag definitions containing names and descriptions (max 10).
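As an illustration of the parameters above, here is a minimal Python sketch that assembles a tagging request body. The nested shape of source (an object holding either uri or asset_id) and the name/description keys are assumptions inferred from the parameter descriptions, and build_source and build_tagging_request are hypothetical helpers, not part of any Cloudinary SDK. The sketch only builds and validates the body; sending it is left out.

```python
import json

MAX_TAG_DEFINITIONS = 10  # "max 10" as stated in the parameter list


def build_source(uri=None, asset_id=None):
    """Return the source object; exactly one of uri / asset_id must be given."""
    if (uri is None) == (asset_id is None):
        raise ValueError("specify exactly one of uri or asset_id")
    return {"uri": uri} if uri is not None else {"asset_id": asset_id}


def build_tagging_request(tag_definitions, uri=None, asset_id=None):
    """Assemble a tagging request body (assumed shape, hypothetical helper).

    tag_definitions is a list of (name, description) pairs.
    """
    if len(tag_definitions) > MAX_TAG_DEFINITIONS:
        raise ValueError("at most 10 tag definitions are allowed")
    return {
        "source": build_source(uri=uri, asset_id=asset_id),
        "tag_definitions": [
            {"name": name, "description": description}
            for name, description in tag_definitions
        ],
    }


if __name__ == "__main__":
    body = build_tagging_request(
        [("flowers", "Does the image contain flowers?")],
        uri="https://example.com/sample.jpg",  # placeholder image URL
    )
    print(json.dumps(body, indent=2))
```

Checking the limit on the client side means an oversized list fails before any request is made.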
Example Request:
Example Response:
Moderation mode
The Moderation mode accepts multiple questions about an image, to which the response provides concise answers of "yes," "no," or "unknown." This functionality allows for a nuanced evaluation of whether the image adheres to specific content policies, creative specs, or aesthetic criteria.
To evaluate images against specific moderation questions, call the ai_vision_moderation method with the following parameters:
- source: The image to be analyzed. Either a uri or an asset_id can be specified.
- rejection_questions: A list of yes/no questions to ask (max 10).
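Once the answers come back, you still need a policy for acting on them. The following Python sketch shows one possible way to turn "yes" / "no" / "unknown" answers into a list of questions that count against an image. It assumes the answers have already been extracted from the response as (question, answer) pairs and that a "yes" to a rejection question means the image fails that check. Both are assumptions, and questions_triggering_rejection is a hypothetical helper.

```python
from typing import Iterable, List, Tuple

VALID_ANSWERS = {"yes", "no", "unknown"}  # the three answers described above


def questions_triggering_rejection(
    answers: Iterable[Tuple[str, str]],
    treat_unknown_as_rejection: bool = False,
) -> List[str]:
    """Return the questions whose answers count against the image.

    answers: (question, answer) pairs, assumed to be already pulled
    out of the moderation response by the caller.
    """
    flagged = []
    for question, answer in answers:
        normalized = answer.strip().lower()
        if normalized not in VALID_ANSWERS:
            raise ValueError(f"unexpected answer {answer!r} for {question!r}")
        if normalized == "yes":
            flagged.append(question)
        elif normalized == "unknown" and treat_unknown_as_rejection:
            flagged.append(question)
    return flagged
```

Whether "unknown" should count as a rejection is a policy decision. A strict pipeline might send such images to manual review instead of approving them automatically.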
Example Request:
Example Response:
General mode
The General mode serves a wide array of applications by providing detailed answers to diverse questions about an image. Users can inquire about any aspect of an image, such as identifying objects, understanding scenes, or interpreting text within the image.
To ask general questions, call the ai_vision_general method with the following parameters:
- source: The image to be analyzed. Either a uri or an asset_id can be specified.
- prompts: A list of questions or requests to ask (max 10).
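Because a single call accepts at most 10 prompts, a longer list of questions has to be split across several requests. The Python sketch below shows one way to do that batching. The request body shape (a source object plus a prompts list) is an assumption inferred from the parameter descriptions, and batch_general_requests is a hypothetical helper.

```python
from typing import Dict, Iterator, List

MAX_PROMPTS = 10  # "max 10" as stated in the parameter list


def batch_general_requests(prompts: List[str], asset_id: str) -> Iterator[Dict]:
    """Yield request bodies (assumed shape), each with at most 10 prompts."""
    for start in range(0, len(prompts), MAX_PROMPTS):
        yield {
            "source": {"asset_id": asset_id},
            "prompts": prompts[start:start + MAX_PROMPTS],
        }
```

For example, 23 prompts would produce three request bodies holding 10, 10, and 3 prompts. Each request carries the image again, so splitting prompts across calls is likely to use more input tokens than a single call would.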
Example Request:
Example Response:
Tokens
Your AI Vision Add-on quota is based on tokens. A token is a unit of measurement, similar to a word, used to quantify the processing required. Tokens can represent both text and images, with pricing based on the number of tokens processed.
- Input tokens: Data sent to AI Vision, like text or images. Images are treated as input and converted into tokens.
- Output tokens: Data generated by AI Vision in response, like text descriptions.
Consolidating input and output tokens into a single token count gives a clear picture of the total tokens used.
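The arithmetic behind a consolidated count can be sketched in Python as follows. This is purely illustrative: the TokenUsage class, its field names, and the per-operation split into input and output counts are hypothetical and are not fields of the AI Vision response. The numbers are invented for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TokenUsage:
    """Hypothetical per-operation record of tokens consumed."""
    input_tokens: int   # text and image data sent for analysis
    output_tokens: int  # text generated in response

    @property
    def total(self) -> int:
        # The consolidated count is the sum of both directions.
        return self.input_tokens + self.output_tokens


def remaining_quota(quota: int, operations: Iterable[TokenUsage]) -> int:
    """Subtract the consolidated usage of all operations from a quota."""
    return quota - sum(op.total for op in operations)


if __name__ == "__main__":
    ops = [TokenUsage(300, 40), TokenUsage(280, 25)]  # made-up figures
    print(remaining_quota(1000, ops))  # 1000 - (340 + 305) = 355
```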
Every response also includes a limits node with the number of tokens used by the operation. For example: