Accelerating LLM fine-tuning with unstructured data using SageMaker Unified Studio and S3

2026-03-26 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Data Science & Analytics · Depth: Intermediate, long

Summary

AWS announced an integration between Amazon SageMaker Unified Studio and Amazon S3 general purpose buckets that lets teams use unstructured S3 data for machine learning and data analytics. This post demonstrates the integration by fine-tuning the Llama 3.2 11B Vision Instruct model for visual question answering (VQA). The workflow accesses the model through SageMaker JumpStart, uses the DocVQA dataset from Hugging Face, and produces three fine-tuned model versions trained on 1,000, 5,000, and 10,000 images. Experiments and evaluations are tracked with the fully managed, serverless MLflow capability in Amazon SageMaker, and performance is measured with the Average Normalized Levenshtein Similarity (ANLS) score. The model fine-tuned on 10,000 images achieved an ANLS score of 0.902, a 4.9 percentage point increase over the base model's 0.853.
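To make the metric concrete, here is a minimal sketch of how ANLS is commonly computed, assuming the conventional definition: each prediction is scored against every accepted answer as 1 minus the normalized edit distance, scores whose normalized distance reaches a threshold are set to 0, the best score per question is kept, and the results are averaged. The 0.5 threshold and the lowercase/strip normalization are assumptions based on common practice, not details confirmed by the original post, and the function names are illustrative.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete, and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(prediction: str, truth: str, threshold: float = 0.5) -> float:
    """1 - normalized edit distance, or 0 if the distance reaches the threshold."""
    pred = prediction.strip().lower()  # assumed normalization
    gt = truth.strip().lower()
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 1.0  # two empty strings are treated as a perfect match
    normalized = levenshtein(pred, gt) / longest
    return 1.0 - normalized if normalized < threshold else 0.0


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Average, over questions, of the best similarity to any accepted answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, accepted in zip(predictions, references):
        total += max((similarity(pred, gt, threshold) for gt in accepted),
                     default=0.0)
    return total / len(predictions)
```

Because near misses earn partial credit, a prediction with a small OCR-style slip still scores close to 1, while an unrelated answer scores 0.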

Key takeaway

For MLOps engineers building VQA solutions, integrating Amazon SageMaker Unified Studio with S3 general purpose buckets simplifies data access and model fine-tuning. In the post's experiment, fine-tuning Llama 3.2 11B Vision Instruct on 10,000 images raised ANLS by 4.9 percentage points (from 0.853 to 0.902), which shows that unstructured data reached through this integration can improve model performance. Consider the integration if you want to streamline your data-to-model pipeline and improve collaboration between data and ML teams.
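The reported gain can be read two ways, and the short calculation below separates them. It uses only the two ANLS scores quoted in the summary; the variable names are illustrative.

```python
base_anls = 0.853   # base model score reported in the summary
tuned_anls = 0.902  # score after fine-tuning on 10,000 images

absolute_gain = tuned_anls - base_anls                # difference on the 0-1 scale
percentage_points = absolute_gain * 100               # same difference in points
relative_gain_pct = absolute_gain / base_anls * 100   # gain relative to the base

print(f"{percentage_points:.1f} percentage points")
print(f"{relative_gain_pct:.1f}% relative improvement")
```

The absolute difference is 4.9 percentage points, which corresponds to a relative improvement of roughly 5.7% over the base score.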

Key insights

Integrating SageMaker Unified Studio with S3 streamlines access to unstructured training data; fine-tuning on that data is what improved VQA model performance.

Principles

In this experiment, fine-tuning on more images was associated with better VQA performance; the 10,000-image model reached an ANLS of 0.902.
Unified platforms simplify ML data discovery and access.

Method

Create SageMaker Unified Studio projects for data producers and data consumers, catalog the S3 data, fine-tune a JumpStart LLM on datasets of varying size, and track the results with MLflow.
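One way to prepare training sets of varying size is sketched below. It assumes nested subsets drawn from a single seeded shuffle, so each larger set contains the smaller ones and the comparison isolates the effect of added data. The original post does not say how its subsets were sampled; the function name, the seed, and the pool size of 12,000 are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def nested_subsets(example_ids: Sequence[int],
                   sizes: Sequence[int],
                   seed: int = 42) -> Dict[int, List[int]]:
    """Return one subset per size, each a prefix of the same seeded shuffle."""
    ordered_sizes = sorted(sizes)
    if ordered_sizes and ordered_sizes[-1] > len(example_ids):
        raise ValueError("largest subset exceeds the number of available examples")
    shuffled = list(example_ids)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global state is untouched
    return {size: shuffled[:size] for size in ordered_sizes}


# Hypothetical pool of 12,000 example indices; the real dataset split may differ.
subsets = nested_subsets(range(12_000), sizes=[1_000, 5_000, 10_000])
```

Each list of indices can then be used to select the images and question-answer pairs for the corresponding fine-tuning run.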

In practice

Use `ml.p4de.24xlarge` instances for Llama 3.2 11B Vision Instruct training.
Set JupyterLab idle time to 6 hours for long training jobs.
Use S3 Access Grants for secure data access in SageMaker.

Topics

Amazon SageMaker Unified Studio
LLM Fine-tuning
Amazon S3
Visual Question Answering
Llama 3.2 11B Vision Instruct

Best for: Machine Learning Engineer, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.