Phi 4 Multimodal

by Microsoft

Phi-4 Multimodal is a lightweight, 5.6-billion-parameter open-weight foundation model that unifies speech, vision, and text processing in a single architecture using a mixture-of-LoRAs design. It supports a 128K-token context window and accepts text in 23 languages, images (with visual understanding in English), and audio in eight languages. Trained on 5 trillion text tokens, 2.3 million hours of speech, and 1.1 trillion image-text tokens, the model handles image understanding, optical character recognition, chart and table reasoning, speech recognition and translation, and multi-image analysis. It ranks first on Hugging Face's OpenASR leaderboard and is the first open-source model offering speech summarization. Released under the MIT license in February 2025, Phi-4 Multimodal is designed for memory- and compute-constrained environments while maintaining competitive performance on vision benchmarks against larger multimodal models.
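To make the mixture-of-LoRAs idea concrete, here is a toy sketch of a single linear layer with one shared, frozen base weight and a separate low-rank adapter per modality. It is not Microsoft's implementation: the class name `MixtureOfLoRALinear`, the tiny dimensions, the random weights, and the choice to route text through the base weights alone are all assumptions made for illustration.

```python
import random


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_matrix(rows, cols, rng, scale=0.1):
    """Build a small random matrix (stand-in for trained weights)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class MixtureOfLoRALinear:
    """Hypothetical toy layer: frozen base weight plus one LoRA pair per modality."""

    def __init__(self, d_in, d_out, rank, modalities, seed=0):
        rng = random.Random(seed)
        # Shared base weight, used for every input and never modified.
        self.base = make_matrix(d_out, d_in, rng)
        # One low-rank pair per modality: A is (rank x d_in), B is (d_out x rank).
        self.adapters = {
            m: (make_matrix(rank, d_in, rng), make_matrix(d_out, rank, rng))
            for m in modalities
        }

    def forward(self, x, modality="text"):
        y = matvec(self.base, x)
        # Assumption: modalities without an adapter (here, text) use the base only.
        if modality in self.adapters:
            a, b = self.adapters[modality]
            delta = matvec(b, matvec(a, x))  # low-rank update B(Ax)
            y = [yi + di for yi, di in zip(y, delta)]
        return y


layer = MixtureOfLoRALinear(d_in=4, d_out=3, rank=2, modalities=["vision", "speech"])
x = [1.0, 0.5, -0.5, 2.0]
text_out = layer.forward(x, modality="text")
vision_out = layer.forward(x, modality="vision")
speech_out = layer.forward(x, modality="speech")
```

The point of the design, as sketched, is that each adapter adds only `rank * (d_in + d_out)` parameters on top of the shared weights, so supporting another modality does not require a second full copy of the model.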

Key info

Input: text, images, audio
Output: text
Context window: 128K tokens
Max output: 4K tokens

Available routes

No routes currently available — Phi 4 Multimodal isn't routed through the Opper gateway right now. It may return.

