The Agent Read the Whole Spec. It Didn't Need To.

Part 6 of the KCP series. Previous: What Happens When Your Agent Needs Knowledge From Five Teams?

A 42,000-token specification. An agent with 8,000 tokens of remaining budget. The agent loads the document. The context overflows. The session fails, truncates, or clears — taking everything the agent had accumulated with it.

What makes this worse: the knowledge base also contained a 600-token TL;DR that answered the same question. The agent had no way to know it existed.

RFC-0006 proposes a hints block that gives agents the information they need to make loading decisions before fetching content. How large is this unit? Should it be loaded eagerly or only on demand? Is there a shorter version? If context fills, which units should be evicted first?

KCP manifests tell agents what a unit is and what question it answers. They do not tell agents how expensive it is to load. An agent navigating a large knowledge base currently has three bad choices:

- Load everything and hope it fits: fails for large corpora
- Load nothing and ask the user what to load: defeats the point of a manifest
- Guess from filenames: CHANGELOG.md is probably large; README.md is probably not; SPEC.md is anybody's guess

The result is that context overflow is discovered retroactively. The fetch has already happened. The tokens are already spent.

Two `hints` blocks

RFC-0006 adds a hints block at two levels: on individual units, and at the manifest root.

Unit-level hints

units:
  - id: full-specification
    path: SPEC.md
    intent: "What are the normative rules for a knowledge.yaml manifest?"
    scope: global
    audience: [human, agent, developer, architect]
    hints:
      token_estimate: 42000
      token_estimate_method: measured     # measured | estimated
      load_strategy: lazy                 # eager | lazy | never
      priority: supplementary             # critical | supplementary | reference
      density: dense                      # dense | standard | verbose
      summary_available: true
      summary_unit: spec-summary

  - id: spec-summary
    path: SPEC-tldr.md
    intent: "What are the key points of the spec in 500 words?"
    hints:
      token_estimate: 600
      load_strategy: eager
      priority: critical
      summary_of: full-specification

The three enum fields do the work.

load_strategy advises when to load:

- eager: load immediately when the manifest is processed. For short, high-signal units: an overview, a schema index, a TL;DR.
- lazy: load on demand, when the agent determines the unit is relevant. The default for most content.
- never: do not load proactively. Only if explicitly requested. For raw data dumps, full changelogs, large archives where the agent should read the summary instead.

priority advises what to evict first when context fills:

- critical: evict last. Essential facts the agent must retain.
- supplementary: standard priority. May be evicted if budget is tight.
- reference: evict first. API specs, changelogs, raw data used for spot lookups, not sustained reasoning.

density advises whether to compress before loading:

- dense: nearly every sentence is load-bearing. Compression risks information loss. Load the full text.
- standard: normal prose. Some compression acceptable.
- verbose: high token count relative to information content. Tutorials, narrative explanations, marketing copy. Summarisation before loading is likely worthwhile.
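As a rough illustration of how an agent might act on priority, the sketch below ranks loaded units for eviction. It is not part of RFC-0006: the function names, the default of supplementary for units without a priority, and the largest-first tie-break within a tier are all assumptions made for the example. Units are plain dicts shaped like the manifest entries above.

```python
# Hypothetical sketch: order loaded units for eviction using the `priority` hint.
# Lower rank = evicted earlier. The mapping mirrors the three enum values.
EVICTION_RANK = {"reference": 0, "supplementary": 1, "critical": 2}


def eviction_order(loaded_units):
    """Return unit ids in eviction order (first id = evict first)."""
    def key(unit):
        hints = unit.get("hints", {})
        # Assumption: a unit with no priority is treated as supplementary.
        rank = EVICTION_RANK.get(hints.get("priority", "supplementary"), 1)
        # Assumption: within a tier, evict the largest unit first.
        return (rank, -hints.get("token_estimate", 0))
    return [u["id"] for u in sorted(loaded_units, key=key)]


def evict_until_fits(loaded_units, tokens_needed):
    """Pick ids to evict until at least `tokens_needed` tokens are freed."""
    by_id = {u["id"]: u for u in loaded_units}
    freed, evicted = 0, []
    for unit_id in eviction_order(loaded_units):
        if freed >= tokens_needed:
            break
        evicted.append(unit_id)
        freed += by_id[unit_id].get("hints", {}).get("token_estimate", 0)
    return evicted
```

Whether eviction should also weigh recency or density is a design choice the hints leave to the agent; this sketch uses priority and size only.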

The summary relationship

The most immediately useful part of RFC-0006 is the summary pairing. When a short summary of a large unit exists, both sides declare the relationship:

  - id: architecture
    path: architecture.md
    intent: "What is the system architecture and how do the components relate?"
    hints:
      token_estimate: 18000
      load_strategy: lazy
      priority: supplementary
      density: dense
      summary_available: true
      summary_unit: architecture-summary    # → points to the short version

  - id: architecture-summary
    path: architecture-tldr.md
    intent: "What are the key architectural decisions in 400 words?"
    hints:
      token_estimate: 500
      load_strategy: eager
      priority: critical
      summary_of: architecture              # ← points back to the full version

An agent with a constrained budget loads architecture-summary first (eager, critical, 500 tokens). It reaches for architecture only when it needs the normative detail (lazy, 18,000 tokens). Without the hints, it would have had to load all 18,000 tokens to find the 400-word answer it actually needed.
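The fallback described above can be sketched in a few lines. This is an illustrative reading of the summary pairing, not code from the RFC: resolve_unit is a hypothetical helper, and treating a missing token_estimate as "too large to load blindly" is an assumption.

```python
# Hypothetical sketch: choose between a unit and its declared summary
# based on the agent's remaining token budget.
def resolve_unit(units, unit_id, remaining_budget):
    """Return the id of the unit to load for `unit_id`, or None if nothing fits."""
    by_id = {u["id"]: u for u in units}
    hints = by_id[unit_id].get("hints", {})

    # Assumption: an unknown size is treated as infinitely large.
    if hints.get("token_estimate", float("inf")) <= remaining_budget:
        return unit_id

    # The full unit does not fit: fall back to the declared summary, if any.
    summary_id = hints.get("summary_unit")
    if hints.get("summary_available") and summary_id in by_id:
        summary_hints = by_id[summary_id].get("hints", {})
        if summary_hints.get("token_estimate", float("inf")) <= remaining_budget:
            return summary_id

    return None
```

With the architecture pair from the manifest above and 8,000 tokens remaining, this returns architecture-summary; with 20,000 tokens remaining it returns architecture.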

Chunked documents

For large documents that have natural sections, RFC-0006 proposes explicit chunk relationships:

  - id: api-reference
    path: api/reference.md
    intent: "What endpoints, parameters, and response schemas does the API expose?"
    hints:
      token_estimate: 62000
      load_strategy: never      # never load the full reference proactively
      priority: reference
      density: dense
      chunked: true
      chunk_count: 5

  - id: api-ref-auth
    path: api/reference-auth.md
    intent: "What are the authentication endpoints and token schemas?"
    hints:
      token_estimate: 9400
      chunk_of: api-reference
      chunk_index: 1
      total_chunks: 5
      chunk_topic: "Authentication and token management"

  - id: api-ref-resources
    path: api/reference-resources.md
    intent: "What are the resource CRUD endpoints and their schemas?"
    hints:
      token_estimate: 18600
      chunk_of: api-reference
      chunk_index: 2
      total_chunks: 5
      chunk_topic: "Resource management endpoints"

The agent knows the 62,000-token full reference exists (never — do not load it). When it needs the authentication endpoints specifically, it loads api-ref-auth (9,400 tokens) directly, without touching the rest. The chunk_topic field on each chunk is the key: the agent selects the right section from the manifest without reading any of the content first.
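To make the selection step concrete, here is a deliberately naive sketch that picks a chunk by word overlap between the agent's question and each chunk_topic. The pick_chunk helper and the overlap scoring are assumptions for illustration; a real agent would more likely let the model read the topics and choose, and RFC-0006 does not prescribe a matching method.

```python
import re


def _words(text):
    """Lowercase alphanumeric word set for a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Hypothetical sketch: select a chunk of `parent_id` using only manifest metadata.
def pick_chunk(units, parent_id, question):
    """Return the chunk unit whose chunk_topic shares the most words with `question`."""
    wanted = _words(question)
    best, best_score = None, 0
    for unit in units:
        hints = unit.get("hints", {})
        if hints.get("chunk_of") != parent_id:
            continue  # not a chunk of the requested document
        score = len(wanted & _words(hints.get("chunk_topic", "")))
        if score > best_score:
            best, best_score = unit, score
    return best  # None when no chunk topic overlaps the question
```

The point of the sketch is that nothing in it fetches content: the decision uses chunk_of and chunk_topic alone.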

Root-level hints

At the manifest level, a hints block provides aggregate information before any unit is loaded:

kcp_version: "0.3"
project: platform-docs
version: 3.0.0

hints:
  total_token_estimate: 128400
  unit_count: 94
  recommended_entry_point: overview
  has_summaries: true
  has_chunks: true

total_token_estimate: 128400 tells an agent with 16,000 tokens of remaining budget that it cannot load this entire corpus — before it loads a single unit. It can then plan: load the recommended_entry_point, follow the eager units, use summaries where available, and reach for chunks only when needed.

has_summaries: true and has_chunks: true are flags that signal the corpus is navigable at lower cost. An agent that knows summaries exist will look for them. An agent that does not know they exist will not.
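One way an agent might turn the root-level hints into an initial plan is sketched below. The helper names and the policy (entry point first, then eager units in manifest order, skipping anything that would exceed the budget or has no estimate) are assumptions for illustration, not behaviour defined by the RFC.

```python
# Hypothetical sketch: plan the first loads from root-level and unit-level hints.
def corpus_fits(manifest, remaining_budget):
    """True only when the manifest declares a total that fits the budget."""
    total = manifest.get("hints", {}).get("total_token_estimate")
    return total is not None and total <= remaining_budget


def plan_initial_load(manifest, remaining_budget):
    """Return (unit ids to load up front, their estimated token cost)."""
    units = manifest.get("units", [])
    entry = manifest.get("hints", {}).get("recommended_entry_point")

    # Entry point first, then every unit marked eager, in manifest order.
    candidates = [u for u in units if u["id"] == entry]
    candidates += [
        u for u in units
        if u["id"] != entry and u.get("hints", {}).get("load_strategy") == "eager"
    ]

    plan, used = [], 0
    for unit in candidates:
        cost = unit.get("hints", {}).get("token_estimate")
        # Assumption: skip units with no estimate rather than risk an overflow.
        if cost is None or used + cost > remaining_budget:
            continue
        plan.append(unit["id"])
        used += cost
    return plan, used
```

Lazy units, summaries, and chunks are left for later, on-demand decisions once the agent knows what question it is answering.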

The tokenizer problem

One genuinely hard open question: token counts vary across models. A 42,000-token document by GPT-4's tokenizer may be 38,000 tokens by Claude's tokenizer and 46,000 by Llama's.

RFC-0006 currently proposes model-agnostic estimates: a single integer approximating the count a typical LLM tokenizer would produce. The alternative is a model-keyed map:

hints:
  token_estimates:
    cl100k_base: 42000    # GPT-4
    claude: 38500
    llama: 46000

Precise, but a maintenance burden that few publishers would actually keep current. The RFC leaves this open — the model-agnostic single integer is the current proposal, with the map as a possible extension.
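If the map were adopted as an extension, an agent could prefer a tokenizer-specific figure and fall back to the single integer. The sketch below assumes the token_estimates key shown above sits alongside token_estimate in the same hints block; that combination, and the token_estimate_for helper, are illustrative assumptions, since the RFC leaves the map open.

```python
# Hypothetical sketch: prefer a tokenizer-specific estimate, fall back to the
# model-agnostic integer, and return None when neither is declared.
def token_estimate_for(hints, tokenizer=None):
    """Return the best available token estimate for `tokenizer`."""
    per_model = hints.get("token_estimates", {})
    if tokenizer in per_model:
        return per_model[tokenizer]
    return hints.get("token_estimate")
```

A publisher who maintains only the single integer loses nothing under this lookup, which is one argument for treating the map as optional.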

What this changes in practice

The benchmark that motivated the KCP composability work showed a 40% reduction in tool calls when agents had a structured manifest. Context hints extend that result: agents with size metadata load the right unit the first time rather than discovering overflow after the fact, backing out, and retrying.

The more fundamental shift is that the hints block makes context budgeting something the publisher can inform. Today, context management is entirely the agent's problem: it has no data to work with other than what it discovers by loading. With hints, the publisher declares the cost profile of their knowledge base. Agents can plan. Manifests become navigable not just by content but by cost.

Open questions

Token estimate staleness. token_estimate is a snapshot that drifts as content changes. Is the unit's validated date sufficient as a freshness proxy, or should there be a separate token_estimate_updated field?

never and search. Should units with load_strategy: never appear in manifest query results at all? The agent should know they exist — but returning them in a search result risks the agent loading them anyway. Should they be returned with a flag, or filtered out unless explicitly queried by id?

Chunk navigation. Chunks are selected by chunk_topic today. Should KCP support topic-based chunk selection server-side — where an agent declares a topic and the publisher returns the matching chunk id — or is that a query concern beyond the manifest format?

Comment on Issue #9.

Full RFC: RFC-0006-Context-Window-Hints.md

Spec and all RFCs: github.com/cantara/knowledge-context-protocol

Series: Knowledge Context Protocol
