Multimodality
Introduction
Multimodality refers to the use of multiple modes of conveying information, such as text, images, audio, and video, to communicate in a more effective and engaging way.
Multimodality in DIAL
DIAL taps into this by connecting to Large Language Models (LLMs) that handle various media types. You can create applications for specific modality tasks, or even comprehensive solutions (orchestrators) that blend applications together for more complex scenarios.
Support for different types of media is built on working with files. Users and applications in DIAL can input and output files, which are stored in a dedicated bucket and accessed according to a flexible permissions model. Files can be provided as input to multimodal models and generated by them as output.
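As a minimal sketch of this flow, the snippet below uploads a local image to the caller's file bucket and then passes it to a vision-capable model as an attachment. The address `https://dial.example.com`, the API key, and the `gpt-4o` deployment name are placeholders; the `/v1/bucket`, `/v1/files/...`, and chat completions paths follow the DIAL Core API but may need adjusting to your installation.

```python
import requests

DIAL_URL = "https://dial.example.com"  # assumed DIAL Core address
HEADERS = {"Api-Key": "dial_api_key"}  # assumed API key

# 1. Resolve the caller's file bucket.
bucket = requests.get(f"{DIAL_URL}/v1/bucket", headers=HEADERS).json()["bucket"]

# 2. Upload a local image into the bucket.
with open("photo.png", "rb") as f:
    upload = requests.put(
        f"{DIAL_URL}/v1/files/{bucket}/photo.png",
        headers=HEADERS,
        files={"file": ("photo.png", f, "image/png")},
    )
upload.raise_for_status()
file_url = upload.json()["url"]  # DIAL-relative URL of the stored file

# 3. Pass the stored file to a multimodal model as an attachment.
body = {
    "messages": [{
        "role": "user",
        "content": "Describe this image.",
        "custom_content": {"attachments": [{"type": "image/png", "url": file_url}]},
    }]
}
resp = requests.post(
    f"{DIAL_URL}/openai/deployments/gpt-4o/chat/completions",  # assumed deployment name
    headers=HEADERS,
    params={"api-version": "2024-02-01"},
    json=body,
)
print(resp.json()["choices"][0]["message"]["content"])
```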
Models
The DIAL Chat application offers a user interface for communicating with the Supported Models.
Connections to LLMs are implemented via so-called adapters. Refer to the OpenAI, Bedrock, and Vertex adapters to learn more about them and the models they support. You can use the DIAL SDK to create custom model adapters.
Working with text
DIAL has adapters for a variety of text-to-text LLMs. Refer to Supported Models to view the list of supported models.
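Because every adapter exposes the same OpenAI-compatible unified API, a text-to-text call is an ordinary chat completion. The sketch below uses the `openai` Python client in Azure mode; the endpoint, API key, and `gpt-4o` deployment name are assumptions to replace with the values of your DIAL installation.

```python
from openai import AzureOpenAI

# Assumed values: replace the endpoint, key, and deployment name with your own.
client = AzureOpenAI(
    azure_endpoint="https://dial.example.com",  # DIAL Core exposes Azure-style /openai/deployments/... paths
    api_key="dial_api_key",
    api_version="2024-02-01",
)

completion = client.chat.completions.create(
    model="gpt-4o",  # deployment name registered in DIAL
    messages=[{"role": "user", "content": "Summarize the idea of multimodality in two sentences."}],
)
print(completion.choices[0].message.content)
```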
Working with images
- For image-to-text tasks, DIAL has adapters for GPT-4o, Claude 4, and Gemini 2.5 models; see the sketch after this list.
- For text-to-image tasks, DIAL has adapters for DALL-E-3, Gemini 2.5 Flash Image (Nano Banana), Google Imagen, and Stability AI diffusion models.
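The sketch below shows an image-to-text call: a local image is sent to a vision-capable deployment using OpenAI-style content parts. The endpoint, API key, and `gpt-4o` deployment name are assumptions to adjust for your installation.

```python
import base64
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://dial.example.com",  # assumed DIAL Core address
    api_key="dial_api_key",                     # assumed API key
    api_version="2024-02-01",
)

# Encode a local image as a data URL understood by vision models.
with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

completion = client.chat.completions.create(
    model="gpt-4o",  # assumed vision-capable deployment name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this chart show?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(completion.choices[0].message.content)
```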
Working with audio and video
- For audio/video-to-text tasks, DIAL has adapters for Gemini 2.0/2.5. Refer to Vertex Adapter to view all supported models.
- For text-to-video, image-to-video, and video-to-video tasks, DIAL supports the Sora model by OpenAI. Refer to OpenAI Adapter to learn more.
- For text-to-audio and audio-to-text tasks, the DIAL OpenAI adapter supports models connected via the Azure Audio API, such as GPT-4o mini TTS and Whisper; see the sketch after this list.
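As a sketch of an audio-to-text call, the snippet below transcribes a local file through the OpenAI-compatible audio API. The endpoint, key, and `whisper-1` deployment name are assumptions, and availability depends on how your DIAL installation is configured.

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://dial.example.com",  # assumed DIAL Core address
    api_key="dial_api_key",                     # assumed API key
    api_version="2024-06-01",
)

# Transcribe a local audio file with a Whisper deployment exposed through DIAL.
with open("meeting.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # assumed deployment name
        file=audio,
    )
print(transcript.text)
```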
Applications
You can use the DIAL SDK to create custom applications compatible with the DIAL Unified API. Refer to Tutorials to learn how to create a simple application or watch a demo video.
Such applications can be designed and configured to use multimodal LLMs for specific tasks, or even to form an ecosystem of applications that interact with each other.
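As a minimal sketch of such an application, the snippet below uses the DIAL SDK (the `aidial-sdk` package) to expose an echo-style chat completion endpoint. The deployment name `echo` is arbitrary; a real application would call multimodal models instead of echoing the prompt.

```python
import uvicorn
from aidial_sdk import DIALApp
from aidial_sdk.chat_completion import ChatCompletion, Request, Response

class EchoApplication(ChatCompletion):
    """A trivial application: replies with the user's last message."""

    async def chat_completion(self, request: Request, response: Response) -> None:
        with response.create_single_choice() as choice:
            last_message = request.messages[-1]
            choice.append_content(last_message.content or "")

app = DIALApp()
app.add_chat_completion("echo", EchoApplication())  # served at /openai/deployments/echo/chat/completions

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=5000)
```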
The Cookbook section contains several examples of such applications.
Orchestrator
Besides creating applications that solve specific multimodal tasks, you can create orchestrators that use the available AI models as tools to solve a given task within a workflow.
DIAL ChatHub is an example of an orchestrator that combines several applications and models into one unified access point. ChatHub can automatically route prompts to one of several agents (text-to-text, text-to-image, and vision-to-text applications) depending on the task at hand. For example, if a user asks about the weather, the Web RAG agent is engaged; if a user wants to generate an image from a text prompt, a dedicated application connected to a corresponding model handles the task. All of this happens while the user interacts with a single ChatHub solution.
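The sketch below illustrates the routing idea only; it is not ChatHub itself. It asks a text model to classify the request and then dispatches it to an assumed text or image deployment, both via chat completions. The endpoint, key, and deployment names are placeholders, and the image deployment is assumed to return the generated picture in its response.

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://dial.example.com",  # assumed DIAL Core address
    api_key="dial_api_key",                     # assumed API key
    api_version="2024-02-01",
)

ROUTES = {"text": "gpt-4o", "image": "dall-e-3"}  # assumed deployment names

def route(prompt: str) -> str:
    """Ask a text model which kind of agent should handle the prompt."""
    verdict = client.chat.completions.create(
        model=ROUTES["text"],
        messages=[{
            "role": "user",
            "content": f"Answer with one word, 'text' or 'image': "
                       f"which kind of output does this request need?\n{prompt}",
        }],
    ).choices[0].message.content.strip().lower()
    return "image" if "image" in verdict else "text"

def orchestrate(prompt: str):
    """Send the prompt to whichever deployment the router picked."""
    deployment = ROUTES[route(prompt)]
    completion = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
    )
    # For an image deployment, the generated picture is assumed to arrive
    # as part of the response message (for example, as an attachment or URL).
    return completion.choices[0].message
```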
Refer to Quick Apps 2.0 to learn how to create AI agent orchestrators.