Phi-4 + FastViT-HD VLM

This talk explains how to combine the text-only Phi-4 LLM with the FastViT-HD image encoder to build and fine-tune an efficient open-source Vision-Language Model.

Overview

I’ll walk through how I merged Microsoft’s 3.8B-parameter text-only Phi-4-mini-reasoning LLM with Apple’s high-speed FastViT-HD image encoder to create Friday-VLM, a fine-tuned Vision-Language Model (VLM) that can caption, reason over, and chat about high-resolution images. I’ll cover the end-to-end pipeline (pre-training, instruction fine-tuning, and image encoding) and demo live inference.
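As a rough illustration of what "merging" a text-only LLM with an image encoder can mean, here is a minimal pure-Python sketch of one common approach: project each image patch feature into the LLM's embedding size, then splice the projected vectors into the text embedding sequence at a placeholder token. This is an assumption for illustration only, not necessarily how Friday-VLM is wired. The `<image>` placeholder, the function names, and the toy dimensions are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical marker token; real VLMs each define their own placeholder.
IMAGE_PLACEHOLDER = "<image>"


def project(patch: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Linear map from the vision feature size to the LLM embedding size."""
    return [sum(w * x for w, x in zip(row, patch)) + b
            for row, b in zip(weight, bias)]


def splice_image_tokens(
    tokens: List[str],
    text_embed: Dict[str, Vector],
    patch_features: List[Vector],
    weight: List[Vector],
    bias: Vector,
) -> List[Vector]:
    """Build the LLM input sequence: text tokens are looked up in the
    embedding table, and each placeholder expands to one projected
    vector per image patch."""
    sequence: List[Vector] = []
    for token in tokens:
        if token == IMAGE_PLACEHOLDER:
            sequence.extend(project(p, weight, bias) for p in patch_features)
        else:
            sequence.append(text_embed[token])
    return sequence


# Toy setup: 3-dimensional vision features, 2-dimensional LLM embeddings.
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 1.0]]
bias = [0.0, 0.5]
patches = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]  # stand-in encoder output
text_embed = {"Describe": [0.1, 0.2], "this": [0.3, 0.4]}

sequence = splice_image_tokens(
    ["Describe", IMAGE_PLACEHOLDER, "this"], text_embed, patches, weight, bias
)
print(len(sequence))  # 2 text tokens + 2 image patches
```

In a setup like this, the projection weights are typically the first thing trained during pre-training, since they are the only part that has never seen either modality's counterpart.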

Links

https://github.com/krohling/friday-vlm
Friday-VLM: PyTorch Phi-4/FastViT VLM for efficient instruction-tuned multimodal learning.
https://huggingface.co/kevin510/friday
Friday-VLM: multimodal LLM fine-tuned for image-text instruction following.

Tech stack