Llama 3.2 Vision

· Source: Ollama Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, quick

Summary

Llama 3.2 Vision, a new multimodal large language model, was released on November 6, 2024, and is now available for local execution via Ollama. It comes in two sizes: an 11B parameter model requiring at least 8GB of VRAM, and a larger 90B parameter model needing a minimum of 64GB of VRAM. Users can download Ollama 0.4 and run the models directly from the command line. The model supports various vision-language tasks, including handwriting recognition, Optical Character Recognition (OCR), analysis of charts and tables, and general image question-answering. Integration is also provided through Ollama's Python and JavaScript libraries, as well as via cURL for API access.

Key takeaway

For Computer Vision Engineers developing local AI applications, Llama 3.2 Vision's availability in Ollama provides a robust, accessible option. You should consider its VRAM requirements (8GB for 11B, 64GB for 90B) when selecting a model size. This enables you to quickly prototype and deploy multimodal features like OCR or image Q&A without cloud dependencies, using familiar Python or JavaScript libraries.

Key insights

Llama 3.2 Vision offers local multimodal AI capabilities via Ollama, supporting diverse image-text tasks.

Principles

Method

Download Ollama 0.4, then use `ollama run llama3.2-vision` (or `:90b`) to interact with the model, providing images via drag-and-drop or file path.

In practice

Topics

Code references

Best for: Computer Vision Engineer, Machine Learning Engineer, AI Engineer, Data Scientist

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Ollama Blog.