Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Almanac is a human collaboration dataset designed to guide Large Language Model (LLM) agents toward process-level collaborative competence rather than task completion alone. Built on the classic Map Task, it contains 2,987 collaboration actions from 50 participants across 25 dyadic sessions. Each action is paired with theory-informed mental model annotations that capture the participant's self-reasoning, perceived partner intent, and perceived team goal. The researchers benchmarked six LLMs (Qwen3.6-35B-A3B, Llama 3.3 70B, GPT-5.5, Claude 4.6 Sonnet, Qwen3-4B, Qwen3-30B-A3B) on next-turn behavior prediction and mental model prediction. The results indicate that Almanac is useful for evaluating how well models simulate human collaborative behavior and infer the underlying mental models, and that fine-tuned smaller models approach the performance of larger proprietary models.

Key takeaway

For AI Scientists and Machine Learning Engineers developing collaborative LLM agents, Almanac offers process-level supervision signals. Consider fine-tuning models on this dataset to improve their ability to infer a human partner's mental states, especially the shared goal and partner intent, which are more predictable than private self-reasoning. This can help agents move beyond task-solving toward being genuine collaborative partners.

Key insights

The Almanac dataset provides action-level mental model annotations for training LLM agents to collaborate effectively with humans.

Principles

Effective collaboration requires continuous mental model alignment.
Observable behavior and mental models offer complementary signals.

Method

A two-step annotation framework combines in-session checkpoints (25%, 50%, 75% progress) with post-session retrospective labeling, using memory anchors to capture action-level mental models.
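
To make the annotation structure concrete, here is a minimal Python sketch of what one annotated action and the three in-session checkpoints might look like. The class name, field names, and helper functions are hypothetical and are not taken from the dataset's actual schema. The sketch also assumes that "progress" can be approximated by turn count, which the source does not confirm.

```python
from dataclasses import dataclass
from typing import List, Optional

# Checkpoint positions named in the method description.
CHECKPOINT_FRACTIONS = (0.25, 0.50, 0.75)


@dataclass
class AnnotatedAction:
    """One collaboration action with its three mental model labels (hypothetical schema)."""
    session_id: str
    turn_index: int
    speaker: str
    utterance: str
    self_reasoning: Optional[str] = None
    perceived_partner_intent: Optional[str] = None
    perceived_team_goal: Optional[str] = None


def checkpoint_turns(num_turns: int) -> List[int]:
    """Zero-based turn indices at 25%, 50%, and 75% of a session.

    Assumption: progress is measured in turns; the study may define it differently.
    """
    if num_turns <= 0:
        raise ValueError("num_turns must be positive")
    return [min(num_turns - 1, int(num_turns * f)) for f in CHECKPOINT_FRACTIONS]


def nearest_anchor(turn_index: int, anchors: List[int]) -> int:
    """Pick the checkpoint closest to a turn, as a stand-in for a memory anchor."""
    return min(anchors, key=lambda a: abs(a - turn_index))
```

For a 100-turn session, `checkpoint_turns(100)` gives `[25, 50, 75]`, and `nearest_anchor(40, [25, 50, 75])` gives `50`. In a retrospective pass, a labeler could use the nearest checkpoint answer as a cue when filling in the three annotation fields for each action.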

In practice

Benchmark LLMs on next-turn behavior prediction.
Evaluate LLMs on mental model inference capabilities.
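
The two evaluation tasks above can be sketched as a single scoring loop. This is an illustration under stated assumptions, not the paper's evaluation protocol: token-level F1 is swapped in as a simple stand-in metric, and the example keys (`history`, `next_turn`, `self_reasoning`, `partner_intent`, `team_goal`) and the `predict` callable are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical field names for the three annotated mental model dimensions.
MENTAL_MODEL_FIELDS = ("self_reasoning", "partner_intent", "team_goal")


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (stand-in metric, not the paper's)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(
    examples: List[Dict[str, object]],
    predict: Callable[[object], Dict[str, str]],
) -> Dict[str, float]:
    """Average F1 for next-turn behavior and for each mental model field."""
    if not examples:
        raise ValueError("examples must not be empty")
    fields = ("next_turn",) + MENTAL_MODEL_FIELDS
    totals = {field: 0.0 for field in fields}
    for example in examples:
        # `predict` wraps the model under test and sees only the dialogue history.
        prediction = predict(example["history"])
        for field in fields:
            totals[field] += token_f1(prediction.get(field, ""), str(example[field]))
    return {field: total / len(examples) for field, total in totals.items()}
```

Reporting the score per field keeps the comparison between dimensions visible, for example whether team goal and partner intent are predicted more reliably than self-reasoning.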

Topics

LLM Agents
Human-Agent Collaboration
Mental Models
Collaboration Datasets
Map Task
Behavior Prediction

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.