Phi-4 Multimodal

by Microsoft

Phi-4 Multimodal is a lightweight, 5.6-billion-parameter open-weight foundation model that unifies speech, vision, and text processing in a single architecture using a mixture-of-LoRAs design. It supports a 128K-token context window and accepts text in 23 languages, images (with visual understanding in English), and audio in eight languages. Trained on 5 trillion text tokens, 2.3 million hours of speech, and 1.1 trillion image-text tokens, the model handles image understanding, optical character recognition, chart and table reasoning, speech recognition and translation, and multi-image analysis. It ranks first on Hugging Face's OpenASR leaderboard and is the first open-source model to offer speech summarization. Released under the MIT license in February 2025, Phi-4 Multimodal is designed for memory- and compute-constrained environments while remaining competitive with larger multimodal models on vision benchmarks.
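To make the "mixture-of-LoRAs" idea concrete, here is a minimal sketch in plain Python. It assumes the usual reading of the term: a shared, frozen base projection plus one small low-rank adapter per non-text modality, chosen by the modality of the input. The class name, method names, dimensions, and routing rule are all hypothetical and chosen for illustration; this is not the model's actual implementation.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for trained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class MixtureOfLoRAsLinear:
    """Hypothetical layer: frozen base weights plus one LoRA adapter per modality."""

    def __init__(self, d_in, d_out, rank, modalities, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        self.base = random_matrix(d_out, d_in, rng)  # shared across modalities, never updated
        self.adapters = {}
        for name in modalities:
            down = random_matrix(rank, d_in, rng)        # A: d_in -> rank
            up = [[0.0] * rank for _ in range(d_out)]    # B: rank -> d_out, zero-initialised
            self.adapters[name] = (down, up)

    def adapter_param_count(self):
        """Parameters added per modality: rank * (d_in + d_out)."""
        return self.rank * (self.d_in + self.d_out)

    def forward(self, x, modality=None):
        """Base projection, plus the low-rank update of the selected modality, if any."""
        y = matvec(self.base, x)
        if modality is None:  # assumption: text-only input uses the base weights alone
            return y
        down, up = self.adapters[modality]
        delta = matvec(up, matvec(down, x))
        return [a + b for a, b in zip(y, delta)]


layer = MixtureOfLoRAsLinear(d_in=8, d_out=4, rank=2, modalities=["vision", "speech"])
x = [1.0] * 8
text_out = layer.forward(x)
vision_out = layer.forward(x, modality="vision")
```

Because each adapter adds only rank * (d_in + d_out) parameters per layer, supporting another modality costs far less than duplicating the base weights, which is in keeping with the model's focus on memory-constrained deployment. In this sketch the zero-initialised up-projection means an untrained adapter leaves the base output unchanged.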