Vision models

· Source: Ollama Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, quick

Summary

Ollama has updated its LLaVA (Large Language-and-Vision Assistant) model collection to version 1.6, introducing significant enhancements for vision models. These updates include support for up to 4x higher image resolution, enabling the models to process more intricate visual details. The LLaVA 1.6 models also feature improved text recognition and reasoning capabilities, achieved through training on additional document, chart, and diagram datasets. The new models are distributed under more permissive licenses, specifically Apache 2.0 or the LLaMA 2 Community License. Available in 7B, 13B, and a new 34B parameter size, they can be run via the `ollama run` command-line interface or integrated using Ollama's Python and JavaScript libraries, supporting both file paths and base64-encoded images.

Key takeaway

For AI Engineers building multimodal applications, the LLaVA 1.6 update provides significantly improved image resolution and text recognition. You should consider migrating to these new models, especially the 34B version, to enhance detail comprehension and document analysis in your vision-language tasks. Leverage the more permissive Apache 2.0 license for broader deployment flexibility.

Key insights

LLaVA 1.6 models offer enhanced vision capabilities, higher resolution, and improved text recognition under permissive licenses.

Principles

Method

Integrate LLaVA 1.6 models via `ollama run` CLI or Ollama's Python/JavaScript libraries, providing image paths or base64-encoded files for vision tasks.

In practice

Topics

Code references

Best for: AI Engineer, Machine Learning Engineer, Software Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Ollama Blog.