Phi-4 + FastViT-HD VLM
This talk explains how to combine the text-only Phi-4 LLM with the FastViT-HD image encoder to build and fine-tune an efficient open-source Vision-Language Model.
I’ll walk through how I merged Microsoft’s 2.7B-parameter text-only Phi-4-mini-reasoning LLM with Apple’s high-speed FastViT-HD image encoder to create Friday-VLM, a fine-tuned Vision-Language Model (VLM) that can caption, reason over, and chat about high-resolution images. I’ll cover the end-to-end pipeline (pre-training, instruction fine-tuning, and image encoding) and demo live inference.
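The abstract does not say how the image encoder's output reaches the LLM. One common pattern, assumed here and not confirmed as Friday-VLM's design, is to project each image patch feature into the LLM's embedding space and splice the results into the text sequence at a placeholder token. The sketch below shows that idea with plain Python lists and toy sizes so it runs without dependencies; in a PyTorch model these would be tensors and a learned projection layer. The `<image>` placeholder, the function names, and all dimensions are hypothetical.

```python
IMAGE_TOKEN = "<image>"  # hypothetical placeholder marking where image features go


def project(patch, weight):
    """Linear map from vision feature space to LLM embedding space (no bias)."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weight]


def splice_image_features(tokens, text_embeddings, patch_features, weight):
    """Replace the single IMAGE_TOKEN slot with the projected patch features."""
    if tokens.count(IMAGE_TOKEN) != 1:
        raise ValueError("expected exactly one image placeholder")
    if len(tokens) != len(text_embeddings):
        raise ValueError("one embedding per token is required")
    pos = tokens.index(IMAGE_TOKEN)
    projected = [project(p, weight) for p in patch_features]
    # Text before the placeholder, then image patches, then the remaining text.
    return text_embeddings[:pos] + projected + text_embeddings[pos + 1:]


# Toy sizes (assumptions): vision features have 3 dims, LLM embeddings have 2.
tokens = ["Describe", IMAGE_TOKEN, "please"]
text_embeddings = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
patch_features = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # shape: (llm_dim, vision_dim)

sequence = splice_image_features(tokens, text_embeddings, patch_features, weight)
print(len(sequence))  # 3 text slots - 1 placeholder + 2 patches
```

The resulting sequence is what the language model would consume in place of ordinary token embeddings; in designs of this kind, the projection weights are typically what gets trained first during pre-training.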
Friday-VLM: PyTorch Phi-4/FastViT VLM for efficient instruction-tuned multimodal learning.
Friday-VLM: multimodal LLM fine-tuned for image-text instruction following.