Accelerating LLM fine-tuning with unstructured data using SageMaker Unified Studio and S3
Summary
AWS announced an integration between Amazon SageMaker Unified Studio and Amazon S3 general purpose buckets, enabling teams to use unstructured S3 data for machine learning and data analytics. This post demonstrates fine-tuning the Llama 3.2 11B Vision Instruct model for visual question answering (VQA) using this integration. The process involves accessing the model via SageMaker JumpStart, using the DocVQA dataset from Hugging Face, and creating three fine-tuned model versions trained on 1,000, 5,000, and 10,000 images. Experiments and evaluations are tracked with the fully managed, serverless MLflow capability in Amazon SageMaker, and performance is measured with the Average Normalized Levenshtein Similarity (ANLS) score. The model fine-tuned on 10,000 images achieved an ANLS score of 0.902, an absolute gain of 0.049 (4.9 percentage points) over the base model's 0.853.
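To make the ANLS metric concrete, here is a minimal sketch of how it is commonly computed for DocVQA-style evaluation. It is not the evaluation code from the original post: the function names are hypothetical, and the lowercase/strip normalization and the 0.5 threshold are assumptions based on the standard definition of the metric.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete, and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Average over questions of the best similarity to any accepted answer.

    Similarity is 1 - normalized edit distance, or 0 when the normalized
    distance reaches the threshold (assumed 0.5, the usual default).
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for prediction, answers in zip(predictions, references):
        pred = prediction.strip().lower()  # assumed normalization
        best = 0.0
        for answer in answers:
            ref = answer.strip().lower()
            longest = max(len(pred), len(ref))
            distance = levenshtein(pred, ref) / longest if longest else 0.0
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    # Illustrative strings only; these are not DocVQA samples.
    print(anls(["invoce", "xyz"], [["invoice"], ["invoice"]]))
```

Because near-miss answers earn partial credit and answers that are mostly wrong earn none, ANLS is more forgiving of small OCR-style slips than exact-match accuracy.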
Key takeaway
For MLOps engineers building VQA solutions, integrating Amazon SageMaker Unified Studio with S3 general purpose buckets simplifies data access and model fine-tuning. In the post's experiment, fine-tuning Llama 3.2 11B Vision Instruct raised ANLS from 0.853 to 0.902, an absolute gain of 0.049 (about 5.7% relative), which suggests that making unstructured data easier to reach can translate into better model performance. Explore this integration to streamline your data-to-model pipeline and improve collaboration between data and ML teams.
Key insights
Integrating SageMaker Unified Studio with S3 streamlines ML workflows, improving VQA model performance via fine-tuning.
Principles
- In this experiment, VQA model performance improved as the fine-tuning dataset grew.
- Unified platforms simplify ML data discovery and access.
Method
The method involves creating SageMaker Unified Studio projects for data producers and data consumers, cataloging S3 data, fine-tuning a JumpStart LLM on datasets of varying size, and tracking results with MLflow.
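The summary does not say how the three training sets were drawn. As one possible approach, the sketch below builds nested subsets from a single seeded shuffle, so that each larger set contains the smaller ones and differences between runs are easier to attribute to dataset size. The function name, the seed, and the `img_...` identifiers are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def nested_subsets(items: Sequence[str],
                   sizes: Sequence[int],
                   seed: int = 42) -> Dict[int, List[str]]:
    """Return one subset per requested size, each a prefix of one shuffle."""
    ordered_sizes = sorted(sizes)
    if ordered_sizes and ordered_sizes[-1] > len(items):
        raise ValueError("largest requested size exceeds the number of items")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, reproducible order
    return {size: shuffled[:size] for size in ordered_sizes}


if __name__ == "__main__":
    # Hypothetical image identifiers standing in for DocVQA records.
    image_ids = [f"img_{i:05d}" for i in range(12000)]
    subsets = nested_subsets(image_ids, [1000, 5000, 10000])
    for size, subset in subsets.items():
        print(size, len(subset), subset[0])
```

Each subset could then be written to its own S3 prefix and used for a separate fine-tuning run, with the subset size logged as a parameter alongside the resulting ANLS score.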
In practice
- Use `ml.p4de.24xlarge` instances for Llama 3.2 11B Vision Instruct training.
- Set the JupyterLab idle time to 6 hours so that long training jobs are not interrupted.
- Utilize S3 Access Grants for secure data access in SageMaker.
Topics
- Amazon SageMaker Unified Studio
- LLM Fine-tuning
- Amazon S3
- Visual Question Answering
- Llama 3.2 11B Vision Instruct
Code references
- aws-samples/sample-finetuning-sagemaker-unified-studio-with-s3
- aws/boto3-s3-access-grants-plugin
- aws/aws-s3-accessgrants-plugin-java-v2
Best for: Machine Learning Engineer, MLOps Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.