A · Context Engineering

Most agent failures look like model failures. They are context failures.

Four Operations of Context Engineering

7 slides · Write · Select · Compress · Isolate · the skill of 2026

01 · 01 - Hook

Prompt engineering is over. Context engineering is next.

The work of 2026 is what the model sees.

02 · 02 - Setup

Phil Schmid's framing. Four operations.

Per Phil Schmid (Towards Data Science, April 2026). Most agent failures look like model failures. They are context failures. Four moves cover most fixes.

03 · 03 - Write

Operation 01. Write outside.

typescript

// Offload state to external storage
// The prompt is a window, not a hard drive
 
await db.run("INSERT INTO facts VALUES (?, ?)",
  key, value);
 
// Now read it back only when the turn needs it.
const facts = await db.all(
  "SELECT * FROM facts WHERE topic = ?", topic);

04 · 04 - Select

Operation 02. Select precisely.

typescript

// Retrieve only what this turn needs
 
const embedding = await embed(userQuestion);
 
const top = await db.all(`
  SELECT content FROM memories
  WHERE embedding MATCH ?
  ORDER BY distance LIMIT 5`, embedding);
 
// 200K tokens of memory. Five chunks in the prompt.

05 · 05 - Compress

Operation 03. Compress old turns.

typescript

// Long conversations rot. Summarize the old half.
 
if (messages.length > 40) {
  const old = messages.slice(0, 20);
  // Pass the old turns to the summarizer; the prompt alone has nothing to summarize.
  const summary = await haiku(`
    Summarize this conversation in 200 tokens.
    Preserve decisions, constraints, and names.

    ${JSON.stringify(old)}`);
  messages = [{ role: "system", content: summary },
              ...messages.slice(20)];
}

06 · 06 - Isolate

Operation 04. Isolate with sub-agents.

typescript

// Spawn a sub-agent with a clean, scoped context
 
async function verify(claim) {
  // Only sees the claim. None of the parent state.
  return await anthropic.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 256, // required by the Messages API
    messages: [{ role: "user",
      content: `Is this true? ${claim}` }]
  });
}

07 · 07 - Closer

The model is a function. Context is the input.

One Gotcha

Operational vocabulary, not techniques. Skip them and you blame the model forever. Write. Select. Compress. Isolate.

Why Your Agent Goes Sloppy at Step 15

7 slides · context drift · the 500-token MEMORY.md fix

01 · 01 - Hook

Step 10. Fine. Step 20. Forgotten.

It is not the model. It is your context.

02 · 02 - Setup

Practitioners named it. Context drift.

Patrick (dev.Journal, March 2026) named it. Long histories shift behavior. Constraints set early get ignored late. 500 structured tokens beat 200K of noise.

03 · 03 - Schema

Step 01. Define what survives.

markdown

<!-- MEMORY.md - state that must persist -->
 
## Goal
Stripe-style invoice generator.
 
## Constraints
- All amounts in integer cents
- No PII in logs
- Tax rules per US state

04 · 04 - Inject

Step 02. Reinject every turn.

typescript

const memory = await fs.readFile("MEMORY.md", "utf8"); // read as text, not a Buffer
 
const messages = [
  { role: "system", content: `
    You are an engineering agent.
    Current state of the world:
    ${memory}
    Always honor the constraints listed.` },
  ...recentTurns // Last 10 only
];

05 · 05 - Maintain

Step 03. Let the agent update it.

typescript

// Give the agent a tool to write to its own state
 
{ name: "update_memory",
  description: "Append a decision or constraint",
  input_schema: /* { section, content } */ }
 
// In runTool:
if (name === "update_memory") {
  await appendToSection(input.section, input.content);
}
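
The `appendToSection` helper is not defined on the slide. A minimal sketch of what it might look like, assuming MEMORY.md uses `## ` headings as in Step 01 and that the helper works on a string (the three-argument, pure-function signature is an assumption for illustration; reading and writing the file is left to the caller):

```typescript
// Hypothetical helper: append a bullet to a "## Section" in a markdown string.
// If the section does not exist yet, it is created at the end of the text.
function appendToSection(markdown: string, section: string, content: string): string {
  const lines = markdown.split("\n");
  const heading = `## ${section}`;
  const start = lines.findIndex((l) => l.trim() === heading);

  if (start === -1) {
    // Section missing: add it at the end.
    const base = markdown.trimEnd();
    return `${base}${base ? "\n\n" : ""}${heading}\n- ${content}\n`;
  }

  // The section ends at the next "## " heading or at the end of the text.
  let end = lines.length;
  for (let i = start + 1; i < lines.length; i++) {
    if (lines[i].startsWith("## ")) { end = i; break; }
  }

  // Skip trailing blank lines so the bullet lands right after the last entry.
  let insertAt = end;
  while (insertAt > start + 1 && lines[insertAt - 1].trim() === "") insertAt--;

  lines.splice(insertAt, 0, `- ${content}`);
  return lines.join("\n");
}

const memoryText = "## Goal\nInvoice generator.\n\n## Constraints\n- All amounts in integer cents\n";
const updatedText = appendToSection(memoryText, "Constraints", "No PII in logs");
```

Keeping the helper pure makes it easy to test; the tool handler would read the file, call the helper, and write the result back.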

06 · 06 - Trim

Step 04. Cap it at 500 tokens.

typescript

// Past 500 tokens, MEMORY.md becomes another problem.
// Compress the oldest sections first.
 
if (tokenCount(memory) > 500) {
  // Include the current file contents; the instructions alone give the model nothing to compress.
  const compressed = await haiku(`
    Compress this MEMORY.md to under 500 tokens.
    Preserve all constraints and recent decisions.
    Drop completed work items.

    ${memory}`);
  await fs.writeFile("MEMORY.md", compressed);
}

07 · 07 - Closer

Persistent state. Bounded context.

One Gotcha

Patrick (dev.Journal, March 2026). 200K of history drifts. 500 tokens of structure behaves.

Context Offloading

7 slides · the prompt is a window, not a hard drive

01 · 01 - Hook

Your prompt is not memory. Stop using it like one.

Offload state to where it actually belongs.

02 · 02 - Setup

The pattern. State in DB. Slices in prompt.

Facts live in the database. The prompt carries only what this turn needs. Cost drops. Quality goes up. Context becomes a query, not a dump.

03 · 03 - Schema

Step 01. Two tables.

sql

create table facts (
  topic     text,
  content   text,
  embedding float[1536]
);
 
create table actions (
  action text, result text, at integer
);

04 · 04 - Read

Step 02. Read what this turn needs.

typescript

async function contextFor(query) {
  const emb = await embed(query);
 
  const facts = await db.all(`
    SELECT content FROM facts
    WHERE embedding MATCH ? LIMIT 5`, emb);
 
  const recent = await db.all(`
    SELECT action, result FROM actions
    ORDER BY at DESC LIMIT 3`);
 
  return { facts, recent };
}

05 · 05 - Write

Step 03. Write through tools.

typescript

// The agent has explicit tools for state changes.
// No silent updates from the model.
 
{ name: "save_fact",
  description: "Record a long-term fact",
  input_schema: /* { topic, content } */ }
 
if (name === "save_fact") {
  const emb = await embed(input.content);
  await db.run("INSERT INTO facts ...",
    input.topic, input.content, emb);
}

06 · 06 - Audit

Step 04. Log every read and write.

typescript

// When the agent misbehaves, you need to see why.
 
async function auditedCall(tool, input) {
  const ts = Date.now();
  const result = await tool.run(input);
 
  await db.run(`INSERT INTO audit
    (tool, input, result, ms) VALUES (?, ?, ?, ?)`,
    tool.name, JSON.stringify(input),
    JSON.stringify(result), Date.now() - ts);
 
  return result;
}

07 · 07 - Closer

The prompt is a lens. The DB is the memory.

One Gotcha

Replay every read and write when things go wrong. The model never lies. The context did. SELECT * FROM audit is the debug tool of the year.

Hallucination by Omission

7 slides · the agent failure mode nobody documents

01 · 01 - Hook

Your tool returned an error. Your agent made up data.

It is called hallucination by omission.

02 · 02 - Setup

The behavior. Optimized for completion.

Consumer agents are trained to be helpful. When a tool fails, they invent output. Production agents report failure instead of papering over it.

03 · 03 - Contract

Step 01. Tool contract. Always { ok, ... }.

typescript

type ToolResult<T> =
  | { ok: true; data: T }
  | { ok: false; error: string };
 
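async function getCustomer(id: string): Promise<ToolResult<Customer>> {
  try {
    const data = await api.fetch(id);
    return { ok: true, data };
  } catch (e) {
    // In strict TypeScript, e is unknown; narrow before reading message.
    return { ok: false, error: e instanceof Error ? e.message : String(e) };
  }
}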

04 · 04 - Surface

Step 02. Surface the failure.

typescript

// Model must SEE the error, not just a null.
 
const res = await runTool(toolUse);
messages.push({ role: "user", content: [{
  type: "tool_result",
  tool_use_id: toolUse.id,
  content: res.ok ? JSON.stringify(res.data)
    : `ERROR: ${res.error}. Do not guess.`,
  is_error: !res.ok
}] });

05 · 05 - Prompt

Step 03. Prompt for failure honesty.

typescript

const system = `
You have tools. Tools can fail.
 
When a tool returns is_error: true:
- Report the failure to the user
- Do NOT invent values
- Do NOT silently retry
- Suggest the next valid action
 
If you are missing information,
say so explicitly. Never fabricate.
`;

06 · 06 - Stop

Step 04. Fail loud.

typescript

// At the orchestration level, watch for the pattern.
 
if (res.stop_reason === "end_turn" && lastToolError) {
  // Agent ended despite a tool error in the same loop.
  // Verify the final answer does not contain fabricated data.
  const verdict = await haiku(`
    The tool returned an error: ${lastToolError}.
    The agent then said: ${finalAnswer}.
    Did the agent fabricate data? Reply yes or no.`);
  // Normalize case so "Yes" and "YES" are caught too.
  if (verdict.trim().toLowerCase().startsWith("yes")) throw new Error("Fabrication detected");
}

07 · 07 - Closer

Helpfulness is a default. Honesty is engineering.

One Gotcha

Patrick (dev.Journal, March 2026). Without explicit handling, agents fabricate to finish the task. { ok: false } is the shape that matters.

B · Agent Failure Modes

Production agents fail at the harness layer, not the model layer.

The Harness Is Where Production Fails

7 slides · not the model · the code wrapping it

01 · 01 - Hook

When your agent fails, the model is rarely the problem.

18% post-launch failure rate. Almost none are model bugs.

02 · 02 - Setup

The reframe. Harness engineering.

Sarah Chen (harness-engineering.ai, April 2026). The harness wraps the model with retries, tool integration, error handling. This is where reliability is decided.

03 · 03 - Retries

Layer 01. Bounded retries.

typescript

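async function withRetry(fn, max = 3) {
  for (let i = 0; i < max; i++) {
    try { return await fn(); }
    catch (e) {
      const retry = [429, 529].includes(e.status);
      // Rethrow non-retryable errors, and the last retryable one,
      // so exhausted retries never resolve to undefined.
      if (!retry || i === max - 1) throw e;
      await sleep(2**i * 1000 + Math.random()*500);
    }
  }
}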

04 · 04 - Breakers

Layer 02. Circuit breakers.

typescript

// 5 consecutive failures -> stop calling for a minute.

async function guarded(name, fn) {
  const b = breakers.get(name) ?? { fails: 0, at: 0 };
  breakers.set(name, b); // persist state across calls
  if (b.fails >= 5 && Date.now() - b.at < 60_000)
    throw new Error(`Open: ${name}`);
  try { const r = await fn(); b.fails = 0; return r; }
  catch (e) { b.fails++; b.at = Date.now(); throw e; }
}

05 · 05 - Overflow

Layer 03. Context overflow guard.

typescript

// At 70% of the window, the model behavior shifts.
// Trim before it does.
 
function checkpoint(messages, max = 160_000) {
  const used = countTokens(messages);
  if (used < max * 0.7) return messages;
 
  // Summarize the oldest half. Keep the system prompt.
  return compress(messages);
}
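
The `compress` call is left undefined on the slide. One possible shape, sketched under assumptions: messages are `{ role, content }` objects, and the summarizer is injected as a function (`summarize` is a hypothetical parameter that would wrap a call to a small model in practice):

```typescript
type Message = { role: "system" | "user" | "assistant"; content: string };

// Hypothetical sketch: keep system messages, summarize the oldest half of the
// remaining turns, and keep the newest half verbatim.
async function compress(
  messages: Message[],
  summarize: (turns: Message[]) => Promise<string>, // assumed: wraps a cheap model call
): Promise<Message[]> {
  const system = messages.filter((m) => m.role === "system");
  const turns = messages.filter((m) => m.role !== "system");

  const cut = Math.floor(turns.length / 2);
  if (cut === 0) return messages; // nothing worth summarizing

  const summary = await summarize(turns.slice(0, cut));
  return [
    ...system,
    { role: "user", content: `Summary of earlier conversation:\n${summary}` },
    ...turns.slice(cut),
  ];
}
```

Injecting the summarizer keeps the trimming logic testable without a model call.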

06 · 06 - Traces

Layer 04. Trace everything.

typescript

async function trace(span, fn) {
  const id = crypto.randomUUID();
  const start = performance.now();
  try {
    const result = await fn();
    log({ id, span, ms: performance.now() - start, ok: true });
    return result;
  } catch (e) {
    log({ id, span, ms: performance.now() - start, ok: false, err: e.message });
    throw e;
  }
}

07 · 07 - Closer

The model is one variable. The harness is the system.

One Gotcha

Sarah Chen (harness-engineering.ai, April 2026). Teams blame the model. The bug is in the wrapping code. Treat the harness as infrastructure.

Bounded Scope

7 slides · the agent that refuses things is the one that ships

01 · 01 - Hook

The best production agents know what they don't own.

The refusal is the feature.

02 · 02 - Setup

The pattern. Narrow. Explicit. Safe.

Data Science Collective (April 2026). The support agent handles tickets. It does not touch billing. The boundary is the safety mechanism.

03 · 03 - Allowlist

Step 01. Allow-list the surface.

typescript

// Explicit. Versioned. Reviewed.
 
const SCOPE = {
  domains: ["tickets", "knowledge_base"],
  actions: ["read", "tag", "reply", "escalate"],
  forbidden: ["refund", "account_delete", "billing"]
};
 
function inScope(action, target) {
  if (SCOPE.forbidden.includes(action)) return false;
  return SCOPE.actions.includes(action) &&
         SCOPE.domains.some(d => target.startsWith(d));
}

04 · 04 - Preflight

Step 02. Check before every call.

typescript

async function runTool(toolUse) {
  if (!inScope(toolUse.name, toolUse.input.target)) {
    await log.warn("out_of_scope", toolUse);
    return {
      ok: false,
      error: "Out of scope. Escalate to human."
    };
  }
  return await tools[toolUse.name](toolUse.input);
}

05 · 05 - Refuse

Step 03. Refuse with a route.

typescript

const system = `
You handle tier-1 support tickets.
 
You do NOT have access to:
- Billing or refunds (escalate to billing@)
- Account deletion (escalate to security@)
- Anything outside ticket replies and tags
 
When asked to do something outside scope:
1. Acknowledge the request
2. State the boundary clearly
3. Route to the right human or queue
`;

06 · 06 - Audit

Step 04. Track refusals.

sql

-- Every out-of-scope attempt is a signal.
-- Either the scope is too tight, or users need different tools.
-- tool_name and target are assumed column names for toolUse.name and toolUse.input.target.

SELECT
  tool_name,
  target,
  COUNT(*) AS attempts
FROM audit_log
WHERE outcome = 'out_of_scope'
  AND created_at > now() - interval '7 days'
GROUP BY 1, 2
ORDER BY attempts DESC;

07 · 07 - Closer

The boundary is the safety. A refusing agent ships.

One Gotcha

Data Science Collective (April 2026). Every successful agent has explicit refusal. if (!inScope) return ships in every healthy codebase.

Capacity Engineering

7 slides · 60% of LLM errors are rate limits · Datadog data

01 · 01 - Hook

Datadog analyzed millions of calls. 60% of LLM errors were rate limits.

Reliability is now capacity engineering.

02 · 02 - Setup

The shift. Not quality. Throughput.

Datadog State of AI Engineering (March 2026). 8.4 million rate-limit errors in one month. Your prompt is fine. Your throughput is the bottleneck.

03 · 03 - Budget

Layer 01. Per-key token budgets.

typescript

// Track usage per API key, per minute, per endpoint.
 
async function reserveTokens(keyId, estimated) {
  const { rows } = await sql`
    UPDATE budgets SET
      tokens = tokens - ${estimated},
      updated_at = now()
    WHERE key_id = ${keyId}
      AND tokens >= ${estimated}
    RETURNING tokens`;
 
  if (!rows.length) throw new Error("Budget exceeded");
}

04 · 04 - Pressure

Layer 02. Backpressure on the queue.

typescript

// When concurrent calls approach the rate limit, queue.
 
import pLimit from "p-limit";
const limit = pLimit(8); // caps in-flight requests at 8; tune against your tier's limits
 
// The model is a parameter so a fallback can pass a different one.
async function call(messages, model = "claude-sonnet-4-6") {
  return limit(() =>
    anthropic.messages.create({
      model,
      max_tokens: 1024, messages
    })
  );
}

05 · 05 - Fallback

Layer 03. Fallback model on 429.

typescript

async function resilient(messages) {
  try {
    return await call(messages, "claude-sonnet-4-6");
  } catch (e) {
    if (e.status !== 429) throw e;
 
    // Sonnet capacity is gone. Try Haiku.
    log.warn("sonnet_429_fallback");
    return await call(messages, "claude-haiku-4-5-20251001");
  }
}

06 · 06 - Backoff

Layer 04. Backoff with jitter.

typescript

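async function withBackoff(fn, max = 5) {
  for (let i = 0; i < max; i++) {
    try { return await fn(); }
    catch (e) {
      // Rethrow non-retryable errors, and the final retryable one.
      if (![429,529].includes(e.status) || i === max - 1) throw e;
      const ceiling = Math.min(30_000, 2**i * 1000);
      await sleep(Math.random() * ceiling); // jitter: random delay up to the ceiling
    }
  }
}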
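
A jittered delay schedule can be factored into a pure function so it is testable without sleeping. The sketch below uses the "full jitter" variant (a random delay between zero and a capped exponential ceiling), which is one common choice and not necessarily the one the source intends. The function name, the injectable `rng`, and the defaults are assumptions that mirror the 1-second base and 30-second cap used in this deck:

```typescript
// Sketch of a "full jitter" schedule: pick a random delay between 0 and the
// capped exponential ceiling for this attempt.
function backoffDelay(
  attempt: number,                  // 0 for the first retry
  rng: () => number = Math.random,  // injectable so tests are deterministic
  baseMs = 1000,
  capMs = 30_000,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rng() * ceiling);
}

// With a fixed rng of 0.5, the delays are half of each ceiling.
const halfway = [0, 1, 2, 3].map((i) => backoffDelay(i, () => 0.5));
```

Randomizing the whole delay, rather than adding a small random offset, spreads out clients that all hit a 429 at the same moment.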

07 · 07 - Closer

Capacity is the new latency. Engineer for the 429.

One Gotcha

Datadog (Feb-March 2026). Capacity is now the dominant LLM failure mode. Budget. Limit. Fallback. Backoff.

C · Eval Engineering

Eval pipelines are the CI/CD of AI applications.

The Four-Stage Eval Pipeline

7 slides · ship prompts the way you ship code

01 · 01 - Hook

Ship prompt changes like you ship code.

Four stages. Every PR gated.

02 · 02 - Setup

The pipeline. Local. PR. Gate. Prod.

Milind Nair (March 2026), Adaline, Braintrust. Frontier models saturate old benchmarks. The replacement is a four-stage pipeline in CI. Continuous quality gate.

03 · 03 - Local

Stage 01. Local dev.

bash

# Golden dataset: 200-500 real examples from production failures.
# Not synthetic. Not aspirational. Actual bugs you have shipped.
 
$ bun run evals.ts --against golden.json
 
  ✓ extracts dates                      48/50
  ✓ rejects out-of-scope                 49/50
  ✗ handles multi-currency totals        31/50
 
Failed: 22 cases. Run with --inspect.

04 · 04 - PR

Stage 02. PR check.

yaml

# .github/workflows/eval.yml
name: Prompt Eval
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: bun install
      - run: bun run eval --judge=claude-haiku-4-5
      - run: bun run check-regression --baseline=main
# Blocks merge if any metric drops below baseline.

05 · 05 - Gate

Stage 03. Deploy gate.

typescript

// Hard thresholds. No deploy if any fails.
 
const gates = {
  accuracy:    { min: 0.85 },
  safety:      { min: 0.99 },
  faithfulness: { min: 0.90 },
  p95_latency: { max: 3000 },
};
 
for (const [k, g] of Object.entries(gates)) {
  const v = metric(k);
  // Check only the bound each gate actually defines.
  if (("min" in g && v < g.min) || ("max" in g && v > g.max))
    process.exit(1);
}
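
For CI logs it can help to report every failing gate instead of exiting on the first. A minimal sketch, assuming metrics arrive as a plain name-to-number map; `failedGates` and the sample metric values are hypothetical, and the thresholds are the ones from this slide:

```typescript
type Gate = { min?: number; max?: number };

// Returns the names of all gates that fail. A missing metric counts as a failure.
function failedGates(
  metrics: Record<string, number>,
  gates: Record<string, Gate>,
): string[] {
  return Object.entries(gates)
    .filter(([name, gate]) => {
      const value = metrics[name];
      if (value === undefined) return true;
      if (gate.min !== undefined && value < gate.min) return true;
      if (gate.max !== undefined && value > gate.max) return true;
      return false;
    })
    .map(([name]) => name);
}

const failures = failedGates(
  { accuracy: 0.91, safety: 0.985, faithfulness: 0.93, p95_latency: 2400 }, // made-up numbers
  {
    accuracy: { min: 0.85 },
    safety: { min: 0.99 },
    faithfulness: { min: 0.90 },
    p95_latency: { max: 3000 },
  },
);
// failures is ["safety"]: 0.985 is below the 0.99 minimum.
```

The deploy script can then print the list and exit non-zero when it is non-empty.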

06 · 06 - Prod

Stage 04. Production monitor.

typescript

// Sample 1% of live traffic. Score it. Feed failures back.
 
async function sample(req, res) {
  if (Math.random() > 0.01) return;
 
  const score = await judge(req, res);
  if (score.accuracy < 0.7) {
    await db.run("INSERT INTO golden_candidates ...");
    alert.send("quality_drop", score);
  }
}

07 · 07 - Closer

Evals are the CI. AI changes are commits.

One Gotcha

Milind Nair (March 2026), Adaline. Stage four feeds stage one. Prod failure → golden set → blocked PR.

Two Biases That Wreck Your LLM Judge

7 slides · position bias · verbosity bias · Autorubric paper

01 · 01 - Hook

Your LLM judge favors longer answers and what it sees first.

Both biases are real and measured.

02 · 02 - Setup

Documented February 2026. Position. Verbosity.

Autorubric paper (Rao + Callison-Burch, February 2026). Two failure modes in almost every default LLM-as-judge setup. The fix is mechanical, not magical.

03 · 03 - Shuffle

Fix 01. Shuffle the comparison.

typescript

// Position bias: judge favors whichever appears first.
// Randomize order on every comparison.
 
async function compare(a, b) {
  const swap = Math.random() < 0.5;
  const [first, second] = swap ? [b, a] : [a, b];
 
  const verdict = await judge(first, second);
  return swap
    ? verdict === "first" ? "b" : "a"
    : verdict === "first" ? "a" : "b";
}

04 · 04 - Length

Fix 02. Tell the judge to ignore length.

typescript

const system = `
You are a strict evaluator.
 
Judge accuracy and faithfulness ONLY.
Length is NOT a quality signal.
A correct one-sentence answer beats
a verbose answer with one error.
 
Output JSON: { winner: "a" | "b", reason: string }
`;
 
// Combine with length-normalized scoring
// for double protection.

05 · 05 - Ensemble

Fix 03. Multi-judge ensemble.

typescript

// Three judges. Majority wins. Disagreement = human review.
 
async function ensembleJudge(a, b) {
  const verdicts = await Promise.all([
    judge(a, b, "claude-opus-4-7"),
    judge(a, b, "claude-sonnet-4-6"),
    judge(a, b, "claude-haiku-4-5-20251001"),
  ]);
 
  const counts = tally(verdicts);
  return counts.max >= 2 ? counts.winner : "needs_human";
}
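
The `tally` helper is not shown on the slide. A minimal sketch of what it could look like, assuming each judge returns a plain string verdict; the `{ winner, max }` return shape is inferred from how the slide uses it:

```typescript
// Hypothetical tally: count votes and report the leading verdict and its count.
// On a tie, the verdict that appears first in the input wins.
function tally<T extends string>(verdicts: T[]): { winner: T | undefined; max: number } {
  const counts: Record<string, number> = {};
  for (const v of verdicts) counts[v] = (counts[v] ?? 0) + 1;

  let winner: T | undefined;
  let max = 0;
  for (const v of verdicts) {
    if (counts[v] > max) { winner = v; max = counts[v]; }
  }
  return { winner, max };
}

const result = tally(["a", "b", "a"]);
// result.winner is "a" and result.max is 2, so the ensemble returns "a".
```

With three judges and the `>= 2` check, a verdict goes to human review only when all three disagree, for example when one judge returns a malformed answer.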

06 · 06 - Calibrate

Fix 04. Calibrate to humans.

typescript

// Target: 85-90% agreement with human-annotated set.
 
const humans = await loadAnnotated();
const judges = await Promise.all(
  humans.map(x => ensembleJudge(x.a, x.b)));
 
const agreement = humans.filter((h, i) =>
  h.label === judges[i]).length / humans.length;
console.log(`Agreement: ${agreement}`);

07 · 07 - Closer

The judge is another model. Same biases. Same fixes.

One Gotcha

Autorubric (Rao + Callison-Burch, Feb 2026). Strong judges hit 80%+ human agreement. Shuffle. Instruct. Ensemble. Calibrate.

Golden Datasets from Production

7 slides · the bugs you have shipped beat the ones you imagined

01 · 01 - Hook

Your eval set should come from bugs you shipped.

Not the ones you imagined.

02 · 02 - Setup

Practitioner consensus. 200 real cases. Beats 5K fake.

Arize and Braintrust 2026 guides. Synthetic test sets miss the patterns that actually break. 200 real failures beat 5000 synthetic ones.

03 · 03 - Capture

Step 01. Capture failures.

typescript

// Every user-flagged failure becomes a candidate.
 
app.post("/feedback", async (req) => {
  if (req.body.rating === "thumbs_down") {
    await db.run(`INSERT INTO golden_candidates
      (input, output, note, trace) VALUES (?, ?, ?, ?)`,
      req.body.input, req.body.output,
      req.body.note, req.body.traceId);
  }
});

04 · 04 - Curate

Step 02. Curate weekly.

bash

# One hour a week. Triage what landed.
 
$ bun run review-candidates --since=7d
 
47 candidates this week:
  - 12 already covered by existing tests
  - 8 user error (ignore)
  - 19 new failure patterns -> add to golden
  - 8 ambiguous -> need human labeling
 
Adding 19 to golden.json. New baseline.

05 · 05 - Dedupe

Step 03. Cluster, then pick variants.

typescript

// Don't add 50 versions of the same bug.
// Embed, cluster, sample one per cluster.
 
const embeddings = await Promise.all(
  candidates.map(c => embed(c.input))
);
 
const clusters = kMeans(embeddings, 20);
const diverse = clusters.map(c =>
  candidates[c.centroidIdx]
);
 
// 47 candidates -> 20 representative cases.
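
The `kMeans` call and its `centroidIdx` field come from an unspecified library. As a dependency-free alternative, the sketch below swaps in a different technique: greedy deduplication by cosine similarity, which keeps a candidate only if nothing already kept is too similar. The function names and the 0.9 threshold are assumptions for illustration:

```typescript
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Greedy dedupe: keep a candidate only if it is not too similar to any
// candidate already kept. Returns the indices of the kept candidates.
function pickDiverse(embeddings: number[][], threshold = 0.9): number[] {
  const kept: number[] = [];
  embeddings.forEach((e, i) => {
    const duplicate = kept.some((k) => cosine(e, embeddings[k]) >= threshold);
    if (!duplicate) kept.push(i);
  });
  return kept;
}

// Toy 2-D vectors standing in for real embeddings.
const keptIdx = pickDiverse([[1, 0], [0.99, 0.05], [0, 1], [0.02, 1]]);
// keptIdx is [0, 2]: the second and fourth vectors are near-duplicates.
```

Unlike k-means, this does not need the number of clusters up front, but the result depends on input order and on the threshold chosen.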

06 · 06 - Version

Step 04. Version it like code.

json

// golden.json lives in the repo. PRs review additions.
 
{
  "version": "2026.05.17",
  "cases": [
    {
      "id": "multi-currency-001",
      "input": "Convert 1500 EUR + 200 GBP to USD",
      "expected": { "type": "contains", "value": "USD" },
      "source": "prod-trace-9f3a2b",
      "added": "2026-05-14"
    }
  ]
}

07 · 07 - Closer

Your data is your truth. Curate like code.

One Gotcha

Arize + Braintrust 2026 guides. The data is the asset. The framework is the wrapper. Capture. Cluster. Curate. Commit.

D · MCP Architecture

Model Context Protocol grew from a Claude-only feature to a Linux Foundation standard.

Many Small MCP Servers Beat One Big One

7 slides · agents get confused by big tool inventories

01 · 01 - Hook

One MCP server with 50 tools confuses the agent.

Compose four with a dozen each.

02 · 02 - Setup

Production pattern. Small. Scoped. Composable.

Particula Tech production patterns (April 2026). One server per domain. CRM. Billing. Inventory. Each independently deployable.

03 · 03 - Boundaries

Step 01. Draw the domains.

typescript

// Agent Host
// |-- MCP Client -> CRM Server         (5 tools)
// |-- MCP Client -> Billing Server     (4 tools)
// |-- MCP Client -> Inventory Server   (6 tools)
// |-- MCP Client -> Notifications      (3 tools)
 
// 18 tools total, but scoped by server.
// Each server is one team's responsibility.
// Tool descriptions stay focused, not generic.

04 · 04 - Build

Step 02. One domain, one server.

typescript

// crm-mcp/index.ts
const server = new Server({ name: "crm" });
 
const tools = [
  "find_customer",
  "list_recent_contacts",
  "add_note_to_account",
  "tag_lead", "reassign_owner",
];
// 5 tools. CRM-only. One team owns it.

05 · 05 - Compose

Step 03. Compose at the host.

json

// claude_desktop_config.json
 
{
  "mcpServers": {
    "crm":           { "command": "node", "args": ["crm/bin.js"] },
    "billing":       { "command": "node", "args": ["billing/bin.js"] },
    "inventory":     { "command": "node", "args": ["inv/bin.js"] },
    "notifications": { "command": "node", "args": ["notif/bin.js"] }
  }
}
 
// Agent sees all 18 tools, scoped by server prefix.

06 · 06 - Deploy

Step 04. Deploy independently.

bash

# Each server has its own pipeline.
# Each can roll back without touching the others.
 
# Subshells keep each cd from affecting the next command.
$ (cd crm-mcp && npm publish --tag latest)
$ (cd billing-mcp && npm publish --tag latest)
 
# Versioned. Testable. Owned.
$ npx @company/crm-mcp@1.4.2     # pinned
$ npx @company/billing-mcp@2.0.0  # pinned

07 · 07 - Closer

Small servers. Sharp tools. Confident agent.

One Gotcha

Particula Tech (April 2026). Compose, do not monolith. The agent picks faster when the surface is bounded. One domain. One server.

Remote MCP Servers

7 slides · Streamable HTTP · OAuth 2.1 · April 2026 spec

01 · 01 - Hook

Your MCP server lives on your laptop.

Time to move it to the cloud.

02 · 02 - Setup

April 2026 spec. Streamable HTTP. OAuth 2.1.

April 2026 spec, governed by the Linux Foundation. Stdio is local-only. Streamable HTTP runs in the cloud with OAuth 2.1. 10,000+ public servers.

03 · 03 - Transport

Step 01. Switch to HTTP transport.

typescript

import { Server } from "@modelcontextprotocol/sdk";
import { StreamableHTTPTransport } from "@modelcontextprotocol/sdk/server/http";
 
const server = new Server({ name: "my-server" });
// ... register tools as before
 
const transport = new StreamableHTTPTransport({
  port: 3000,
  endpoint: "/mcp"
});
 
await server.connect(transport);

04 · 04 - OAuth

Step 02. Add OAuth 2.1 with PKCE.

typescript

// The 2026 spec mandates OAuth 2.1 for remote servers.
 
transport.use(async (req, next) => {
  const token = req.headers.authorization?.replace("Bearer ", "");
  if (!token) return { status: 401 };
 
  const user = await verifyToken(token);
  req.context = { user, scopes: user.scopes };
 
  return next(req);
});
 
// Tools can then check req.context.scopes per call.

05 · 05 - Deploy

Step 03. Ship to the edge.

toml

# Cloudflare Workers. Vercel Edge. Fly.io.
# MCP servers are just HTTP now.

# wrangler.toml
name = "crm-mcp"
main = "src/server.ts"
compatibility_date = "2026-04-01"
 
[[durable_objects.bindings]]
name = "MCP_SESSION"
class_name = "McpSession"
 
$ wrangler deploy

06 · 06 - Register

Step 04. Register with the client.

bash

# In Claude Code, or any MCP-compatible client
 
$ claude mcp add --transport http crm https://crm-mcp.example.com/mcp
 
Opening browser for OAuth flow...
Authorized. crm server connected.
 
$ claude mcp list
crm           https://crm-mcp.example.com/mcp  [oauth2]
billing       https://billing-mcp.example.com  [oauth2]

07 · 07 - Closer

Stdio was the prototype. HTTP is the protocol.

One Gotcha

April 2026 spec, Linux Foundation governance. Streamable HTTP, OAuth 2.1, MCP Tasks. 97M downloads, 13K+ servers.

E · Verifying AI Code

Senior developers trust AI output the least. This is a feature.

The 66% Problem

7 slides · when AI code is almost right · Stack Overflow data

01 · 01 - Hook

66% of devs said the same thing. AI code that is almost right.

Their biggest frustration. By a mile.

02 · 02 - Setup

Stack Overflow 2025. 49,000 devs. One pattern.

Stack Overflow Developer Survey 2025 (49K developers). 66% frustrated by almost-right code. 45% say debugging AI-generated code takes longer. Verification is the bottleneck.

03 · 03 - Draft

Move 01. Treat output as a draft.

bash

#!/bin/sh
# .git/hooks/pre-commit
# AI writes the draft. You write the eval.
# Nothing reaches main without passing.
 
bun run typecheck || exit 1
bun run test:unit || exit 1
bun run test:eval || exit 1
 
# Trust nothing. Verify everything.

04 · 04 - Types

Move 02. Type check first.

typescript

// AI hallucinates function signatures more than logic.
// Type check catches 40% of bugs before you read a line.
 
$ tsc --noEmit
 
src/agent.ts:42:18 - error TS2339:
  Property 'executeAsync' does not exist
  on type 'Anthropic.Messages'.
  Did you mean 'create'?
 
// Fix this BEFORE reading the rest of the diff.

05 · 05 - Tests

Move 03. Contract the AI cannot see.

typescript

// Write the test BEFORE the AI writes the code.
// The test is the spec. The code is the draft.
 
test("extractDate handles natural language", () => {
  expect(extractDate("July 4th, 2026"))
    .toEqual("2026-07-04");
 
  expect(extractDate("next Tuesday", { from: "2026-05-17" }))
    .toEqual("2026-05-19");
 
  // 15 more edge cases. AI fills them in.
});

06 · 06 - Diff

Move 04. Read every line.

bash

# AI output looks confident. Read it like a stranger wrote it.
 
$ git diff --staged
 
+ const isExpired = token.exp < Date.now();
 
# Wait. token.exp is in seconds. Date.now() is ms.
# This is wrong by a factor of 1000.
# AI does this CONSTANTLY.
 
Discard hunk? (y,n)
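
A corrected version of the hunk above, as a minimal sketch. It assumes `token.exp` follows the JWT convention of seconds since the Unix epoch; the function name and the injectable `nowMs` parameter are illustrative:

```typescript
// JWT "exp" is a NumericDate: seconds since the Unix epoch.
// Date.now() returns milliseconds, so convert before comparing.
function isExpired(token: { exp: number }, nowMs: number = Date.now()): boolean {
  return token.exp * 1000 <= nowMs;
}
```

The original comparison treats every current token as expired, because a seconds value around 1.7 billion is always smaller than a milliseconds value around 1.7 trillion.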

07 · 07 - Closer

Generation is solved. Verification is the skill.

One Gotcha

Stack Overflow Developer Survey 2025. 66% frustrated by almost-right code. Treat AI as a draft. Treat the eval as the contract.

The Trust Gap

7 slides · senior devs trust AI least · this is a feature

01 · 01 - Hook

Senior developers trust AI least. 2.6% highly trust. 20% highly distrust.

It is not pessimism. It is pattern recognition.

02 · 02 - Setup

The data. Experience maps to skepticism.

Stack Overflow Developer Survey 2025. 46% distrust AI output. Senior developers trust it least. They have shipped enough to know what almost-right looks like.

03 · 03 - Read

Habit 01. Read the whole diff.

bash

# Not skim. Not approve. Read.
 
$ git diff --staged | bat --language=diff
 
# Three checks per file:
# 1. Does it import what it claims to import?
# 2. Are the error cases handled or swallowed?
# 3. Does the test cover the change, not just exist?

# If any are no, reject. Don't edit. Reject.

04 · 04 - Review

Habit 02. Sub-agent spot check.

typescript

// Haiku reviews what Sonnet wrote.
// Cheap, fast, surprisingly effective.
 
async function review(diff, intent) {
  return await haiku(`
    The user asked: ${intent}
    The generated diff: ${diff}
 
    List concerns. Be specific. No praise.
    If the diff is wrong, say so directly.`);
}
 
// Run this BEFORE you read the diff.

05 · 05 - Eval

Habit 03. Eval before commit.

bash

# The regression you cannot see is the one that ships.
 
$ bun run eval:golden --against=HEAD
 
46 of 50 cases pass.
 
REGRESSIONS:
  date-parse-007: was PASS, now FAIL
  date-parse-012: was PASS, now FAIL
 
Commit blocked. Diff the prompt change.

06 · 06 - Small

Habit 04. Ship small.

typescript

// A 50-line diff you read fully beats a 500-line diff you skim.
 
const rules = {
  maxLinesPerPR: 200,
  maxFilesPerPR: 8,
  requireEvalDelta: true,
  requireReadFlag: true // reviewer literally checks "I read it"
};
 
// Trust is verified through smaller surface area.
// Not through faith in the model.

07 · 07 - Closer

The trust gap is data. Treat it like a signal.

One Gotcha

Stack Overflow Survey 2025. Senior devs report the highest distrust. That is the right calibration. Verify everything. Ship small.

Circuit Breakers for AI Tools

7 slides · when tools fail, agents loop, bills explode

01 · 01 - Hook

An agent retried a failing API 47 times.

The bill came in. $312.

02 · 02 - Setup

Google ADK pattern. Tools fail. Agents loop.

Google AI Agent Clinic (Developers Blog, April 2026). Failing tools make agents loop. Every retry costs tokens. Let the framework handle failure.

03 · 03 - Track

Step 01. Track failures per tool.

typescript

// Sliding window. Last 60 seconds. Per tool.
 
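class BreakerState {
  failures: number[] = [];

  record(failed: boolean) {
    if (failed) this.failures.push(Date.now());
  }

  // Failures in the last 60 seconds. Pruning here lets the circuit
  // close again once old failures age out of the window.
  rate() {
    const now = Date.now();
    this.failures = this.failures.filter(t => now - t < 60_000);
    return this.failures.length;
  }
}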

04 · 04 - Open

Step 02. Open the circuit.

typescript

// 5 failures in 60s -> open. Fail fast, no call.
 
async function guarded(name, fn) {
  let state = breakers.get(name);
  if (!state) {
    state = new BreakerState();
    breakers.set(name, state); // persist state across calls
  }
  if (state.rate() >= 5) {
    return { ok: false, error: `Open: ${name}` };
  }
  return await attempt(state, fn);
}

05 · 05 - Probe

Step 03. Half-open recovery.

typescript

// After 30 seconds, probe with ONE call.
// Success closes the circuit. Failure resets the timer.
 
if (state.openedAt && Date.now() - state.openedAt > 30_000) {
  state.mode = "half-open";
  try {
    const result = await fn();
    state.failures = []; // closed
    return { ok: true, data: result };
  } catch (e) {
    state.openedAt = Date.now(); // reopen
    throw e;
  }
}
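
The three breaker slides use pieces (`openedAt`, `mode`, `attempt`) that are not defined together anywhere. One way they could fit into a single class is sketched below, under assumptions: the class name, the `Result` shape, and the injectable clock are illustrative, and the thresholds (5 failures, 60-second window, 30-second cooldown) are taken from the slides. Concurrent callers could all probe at once during half-open; a production breaker would gate that to a single call.

```typescript
type Result<T> = { ok: true; data: T } | { ok: false; error: string };

class CircuitBreaker {
  private failures: number[] = [];       // timestamps of recent failures
  private openedAt: number | null = null;
  private now: () => number;
  private threshold: number;
  private windowMs: number;
  private cooldownMs: number;

  constructor(now: () => number = Date.now, threshold = 5,
              windowMs = 60_000, cooldownMs = 30_000) {
    this.now = now;
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.cooldownMs = cooldownMs;
  }

  async call<T>(fn: () => Promise<T>): Promise<Result<T>> {
    const t = this.now();
    if (this.openedAt !== null && t - this.openedAt < this.cooldownMs) {
      return { ok: false, error: "circuit open" }; // fail fast, fn is not called
    }
    // Either closed, or the cooldown has elapsed (half-open): let the call through.
    try {
      const data = await fn();
      this.failures = [];
      this.openedAt = null; // success closes the circuit
      return { ok: true, data };
    } catch (e) {
      const wasHalfOpen = this.openedAt !== null;
      this.failures = this.failures.filter((f) => t - f < this.windowMs);
      this.failures.push(t);
      if (wasHalfOpen || this.failures.length >= this.threshold) {
        this.openedAt = t; // open, or reopen after a failed probe
      }
      return { ok: false, error: e instanceof Error ? e.message : String(e) };
    }
  }
}
```

Returning `{ ok: false }` instead of throwing keeps the breaker consistent with the tool contract used elsewhere in this deck, so the model sees the failure as a tool result.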

06 · 06 - DLQ

Step 04. Dead-letter queue.

typescript

// Failed calls go somewhere reviewable.
// Not lost. Not retried forever.
 
async function sendToDeadLetter(call, error) {
  await db.run(`
    INSERT INTO dead_letter
    (tool, input, error, agent_id, created_at)
    VALUES (?, ?, ?, ?, now())`,
    call.tool, JSON.stringify(call.input),
    error.message, call.agentId);
 
  alert.send("tool_dead_letter", { tool: call.tool });
}

07 · 07 - Closer

Bills come due. Engineer for the failing API.

One Gotcha

Google AI Agent Clinic (April 2026). Let the framework handle graceful failure. Track. Open. Probe. Dead-letter.

F · Cost & Convention

Infrastructure where most teams leave money on the table.

Anthropic Prompt Caching

7 slides · 90% discount most teams skip

01 · 01 - Hook

Your system prompt costs you on every single call.

It shouldn't. There is a 90% discount.

02 · 02 - Setup

Anthropic prompt caching. One flag. 90% off the prefix.

Anthropic API. Long system prompts and RAG context get cached. Cached input costs 10% of normal on subsequent calls. One flag on a content block.

03 · 03 - Find

Step 01. Find the stable prefix.

typescript

// What does NOT change across calls?
// System prompt? Tool list? Knowledge base?
 
// Before:
const system = `You are a support agent...`; // 2000 tokens
const docs = await loadAllDocs();                  // 18000 tokens
// Every call: 20000 input tokens. Every. Single. Call.

04 · 04 - Mark

Step 02. Mark the breakpoint.

typescript

await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: longSystemPrompt + docs,
      cache_control: { type: "ephemeral" }
    }
  ],
  messages: userMessages
});

05 · 05 - Verify

Step 03. Verify cache hits.

typescript

// The response usage object tells you what happened.
 
const res = await anthropic.messages.create({...});
 
console.log(res.usage);
// {
//   input_tokens: 32,                  <- billed at full rate
//   cache_creation_input_tokens: 0,    <- 1.25x, charged whenever the cache is written
//   cache_read_input_tokens: 20000,    <- 0.1x on each cache hit (default TTL: 5 min)
//   output_tokens: 318
// }
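To see what the usage object means for the bill, the three input fields can be collapsed into one number. This is a rough sketch: the function names are hypothetical, and the multipliers (1.25x for cache writes, 0.1x for cache reads) are the ones quoted on the slide, so check them against current pricing before relying on the result.

```typescript
// The input-side fields of the usage object shown above.
interface Usage {
  input_tokens: number;
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
}

// Input cost in "full-price token equivalents".
function billedInputEquivalent(u: Usage): number {
  return (
    u.input_tokens +
    u.cache_creation_input_tokens * 1.25 + // cache write
    u.cache_read_input_tokens / 10         // cache read at 0.1x
  );
}

// Fraction saved versus sending every input token uncached.
// Negative on a cache write, because writes cost more than plain input.
function savings(u: Usage): number {
  const uncached =
    u.input_tokens + u.cache_creation_input_tokens + u.cache_read_input_tokens;
  return uncached === 0 ? 0 : 1 - billedInputEquivalent(u) / uncached;
}

const hit: Usage = {
  input_tokens: 32,
  cache_creation_input_tokens: 0,
  cache_read_input_tokens: 20_000,
};

console.log(billedInputEquivalent(hit)); // 2032
console.log(savings(hit).toFixed(3));    // 0.899
```

Logging `savings` per request is a cheap way to notice when a prompt change has silently broken the cached prefix.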

06 · 06 - Layer

Step 04. Layer the breakpoints.

typescript

// Up to 4 cache breakpoints per request.
// Use them when different parts change at different rates.
 
system: [
  { type: "text", text: identity,    // rarely changes
    cache_control: { type: "ephemeral" } },
  { type: "text", text: knowledge,   // daily updates
    cache_control: { type: "ephemeral" } },
  { type: "text", text: userContext // per-session
  }
]

07 · 07 - Closer

Same model. Same quality. 90% less on the prefix.

One Gotcha

Anthropic API. Pays for itself in one billing cycle. cache_control: { type: "ephemeral" }. Biggest cost lever you have not used.

AGENTS.md, the New Agent Convention

7 slides · one file, every tool reads it

01 · 01 - Hook

MEMORY.md was the start. AGENTS.md is the law.

One file. Every tool reads it.

02 · 02 - Setup

Cross-tool standard. package.json for agents.

April 2026 unification. Anthropic CLAUDE.md, OpenAI agent.md, Cursor rules all converged on a shared spec. GitHub, Cursor, Claude Code, Cline read it natively. The new convention.

03 · 03 - Schema

Step 01. The schema.

markdown

# AGENTS.md
version: 1
 
## Setup
- bun install
- bun run migrate
 
## Commands
- build: bun run build
- test: bun run test

04 · 04 - Conventions

Step 02. Conventions and constraints.

markdown

## Code style
- TypeScript strict
- No `any` types
 
## Testing
- Vitest unit tests
- Every PR adds tests
 
## Forbidden
- Plaintext secrets

05 · 05 - Autodiscovery

Step 03. Tools find it.

bash

# Agent tools auto-detect AGENTS.md
# in the working directory.
 
$ claude
> Loaded AGENTS.md from ./
 
$ cursor .
> Detected AGENTS.md, applying rules
 
# No --instructions flag needed.

06 · 06 - Version

Step 04. Version it like code.

bash

#!/bin/sh
# .git/hooks/pre-commit
set -e  # any failing check blocks the commit
 
bun run validate-agents-md
bun run lint:conventions
 
# Treat AGENTS.md changes
# like changes to package.json.

07 · 07 - Closer

MEMORY.md is per-session. AGENTS.md is the contract.

One Gotcha

April 2026 unification across Anthropic, OpenAI, Cursor, Cline. Adopted by GitHub. One file. Every tool. Source-controlled.

Cross-Model Disagreement Is Your Eval

7 slides · agreement = ship · disagreement = review

01 · 01 - Hook

Two models. Same query. Disagreement is the signal.

Agreement = ship. Disagreement = review.

02 · 02 - Setup

April 2026 research. Ensemble at runtime.

Cross-Model Disagreement (April 2026). Run a query through two or three frontier models. Where they agree, ship. Where they disagree, flag. Your eval set builds itself.

03 · 03 - Multi

Step 01. Run multiple.

typescript

async function ensemble(query) {
  const [a, b, c] = await Promise.all([
    call("claude-opus-4-7", query),
    call("claude-sonnet-4-6", query),
    call("gpt-5", query),
  ]);
  return { a, b, c };
}

04 · 04 - Score

Step 02. Score agreement.

typescript

// Small model as judge.
 
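async function score(a, b, c) {
  const raw = await haiku(`
    Are these three equivalent in meaning?
    A: ${a}
    B: ${b}
    C: ${c}
    Reply with one word: agree | partial | disagree`);
  // Normalize so the router's === checks actually match.
  return raw.trim().toLowerCase();
}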

05 · 05 - Route

Step 03. Route by confidence.

typescript

const verdict = await score(a, b, c);
 
if (verdict === "agree")
  return { answer: a, ok: true };
 
if (verdict === "partial")
  return synthesize(a, b, c);
 
// Disagree: escalate to human
return queueForHuman(query, { a, b, c });
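Routing on exact string equality is fragile, because a judge model rarely returns a bare label. A minimal sketch of a more forgiving parser follows. The names `parseVerdict` and `actionFor` are hypothetical, and treating anything unrecognised as "disagree" is a design assumption chosen so that unclear cases reach a human.

```typescript
type Verdict = "agree" | "partial" | "disagree";
type Action = "ship" | "synthesize" | "human";

// Pull the first known label out of the judge's reply.
// \b keeps "agree" from matching inside "disagree".
function parseVerdict(raw: string): Verdict {
  const m = raw.toLowerCase().match(/\b(agree|partial|disagree)\b/);
  return m ? (m[1] as Verdict) : "disagree";
}

// The routing table from the slide, as data.
function actionFor(v: Verdict): Action {
  if (v === "agree") return "ship";
  if (v === "partial") return "synthesize";
  return "human";
}

console.log(actionFor(parseVerdict("  Agree\n")));       // ship
console.log(actionFor(parseVerdict("I cannot decide"))); // human
```

The parser only looks for a label, so a reply such as "they do not agree" would still be read as "agree". Constraining the judge to one-word output remains the first line of defence.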

06 · 06 - Golden

Step 04. Build the golden set.

sql

-- Every disagreement = golden candidate.
 
INSERT INTO golden_candidates (
  query, model_a, model_b, model_c,
  human_label, created_at
) VALUES (?, ?, ?, ?, ?, now());
 
-- Eval tracks the exact edge cases.
-- Free.

07 · 07 - Closer

Ensemble is not expensive. It is your eval at runtime.

One Gotcha

Cross-Model Disagreement (April 2026). Free golden set. Continuous quality signal. Zero extra annotation cost.

G ·Routing & Anti-Patterns

Classify, route, fallback, measure. And never ship on vibes.

Vibe Coding Failures

7 slides · three months, $40K, zero shipped · the eval-less anti-pattern

01 · 01 - Hook

Three months. $40K spent. Shipped: zero.

No evals. Just vibes.

02 · 02 - Setup

The 2026 anti-pattern. Demos. Hope. Disaster.

Shipping AI features based on demo success instead of measured behavior. April 2026 postmortems were brutal. The team felt confident. Production broke in 48 hours. Same pattern every time.

03 · 03 - Trap

Step 01. The pattern.

typescript

// Day 1:  Demo to leadership. Looks perfect.
// Day 7:  Edge cases. "We will handle those."
// Day 14: Staging. Stakeholders impressed.
// Day 21: Production. 30% bad outputs.
// Day 28: Rollback. Three months lost.
 
// Pattern: zero evals anywhere.

04 · 04 - Contract

Step 02. Define done first.

typescript

// Write the contract before the code.
 
const success = {
  accuracy:      { min: 0.85 },
  latency_p95:   { max: 2000 },
  cost_per_req:  { max: 0.03 },
  fallback_rate: { max: 0.05 },
};
 
// Done = these numbers. Not vibes.
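A contract is only useful if something checks it. The sketch below shows one way to compare measured numbers against the `success` object. The `violations` helper and the sample `run` metrics are hypothetical and exist only to illustrate the check.

```typescript
interface Bound { min?: number; max?: number }
type Contract = Record<string, Bound>;
type Metrics = Record<string, number>;

// Names of the metrics that break the contract.
// A metric that was never measured counts as a violation.
function violations(contract: Contract, measured: Metrics): string[] {
  return Object.entries(contract)
    .filter(([name, bound]) => {
      const value = measured[name];
      if (value === undefined) return true;
      if (bound.min !== undefined && value < bound.min) return true;
      if (bound.max !== undefined && value > bound.max) return true;
      return false;
    })
    .map(([name]) => name);
}

const success: Contract = {
  accuracy:      { min: 0.85 },
  latency_p95:   { max: 2000 },
  cost_per_req:  { max: 0.03 },
  fallback_rate: { max: 0.05 },
};

// Made-up measurements for illustration.
const run: Metrics = {
  accuracy: 0.88, latency_p95: 2400, cost_per_req: 0.02, fallback_rate: 0.04,
};

console.log(violations(success, run)); // [ 'latency_p95' ]
```

Returning the failing names, rather than a single boolean, tells the team which number to fix first.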

05 · 05 - EvalFirst

Step 03. Eval before code.

typescript

// 30 golden cases. Real usage. Hard ones.
 
const golden = await loadCases("./golden.json");
const result = await runEval(prototype, golden);
 
if (result.accuracy < success.accuracy.min) {
  console.log("Not ready. Fix first.");
  process.exit(1);
}

06 · 06 - ShipEval

Step 04. Ship the eval too.

typescript

// Production = continuous eval.
 
setInterval(async () => {
  const sample = await sampleProduction(0.01);
  const score = await judge(sample);
 
  if (score < threshold) {
    alert.send("quality_regression", score);
  }
}, 60_000);

07 · 07 - Closer

Vibes feel like progress. Evals ARE progress.

One Gotcha

April 2026 postmortems. Six figures shipping nothing because teams could not define done. Eval first. Code second. Vibes never.

Model Routing

7 slides · Sonnet for hard parts · Haiku for the rest · 70% cost cut

01 · 01 - Hook

Sonnet for the hard parts. Haiku for the rest.

70% cost cut. Quality holds.

02 · 02 - Setup

The pattern. Match model to work.

Most production agents pay frontier prices for tasks a smaller model handles fine. The savings are immediate. Quality holds if you classify carefully. Classify first. Route by class.

03 · 03 - Classify

Step 01. Classify the request.

typescript

// Cheapest model classifies. Nearly free.
 
async function classify(query) {
  const raw = await haiku(`
    Tier this query: simple | medium | complex.
    Query: ${query}
    Reply with one word.`);
  // Normalize so the label can be used as a lookup key.
  return raw.trim().toLowerCase();
}

04 · 04 - Route

Step 02. Route by tier.

typescript

const MODELS = {
  simple:  "claude-haiku-4-5-20251001",
  medium:  "claude-sonnet-4-6",
  complex: "claude-opus-4-7",
};
 
async function route(query) {
  const tier = await classify(query);
  // Unknown label from the classifier? Default to the middle tier.
  return await call(MODELS[tier] ?? MODELS.medium, query);
}

05 · 05 - Fallback

Step 03. Quality fallback.

typescript

async function withFallback(query, tier = "simple") {
  const answer = await call(MODELS[tier], query);
  if (await verify(answer)) return answer;
 
  // Upgrade and retry
  const next = { simple: "medium", medium: "complex" }[tier];
  if (!next) return answer;
  return await withFallback(query, next);
}
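The same escalation can be written as a loop, which makes the order of tiers explicit and records which tier finally answered. This is a sketch under assumptions: `escalate` is a hypothetical name, and `call` and `verify` are injected stand-ins for the model call and the quality check, so the example runs without an API key.

```typescript
type Tier = "simple" | "medium" | "complex";
const ORDER: Tier[] = ["simple", "medium", "complex"];

interface Attempt { tier: Tier; answer: string }

async function escalate(
  query: string,
  call: (tier: Tier, query: string) => Promise<string>,
  verify: (answer: string) => Promise<boolean>,
  start: Tier = "simple",
): Promise<Attempt> {
  let last: Attempt | null = null;
  // Walk upward from the starting tier until an answer verifies.
  for (const tier of ORDER.slice(ORDER.indexOf(start))) {
    const answer = await call(tier, query);
    last = { tier, answer };
    if (await verify(answer)) return last;
  }
  // Even the top tier failed verification: return its answer anyway,
  // matching the recursive version's behaviour.
  return last as Attempt;
}
```

Logging `tier` from the returned attempt gives the per-tier data needed to judge whether the classifier is sending too much work to the cheap model.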

06 · 06 - Measure

Step 04. Measure per tier.

sql

-- Track quality AND cost per tier.
 
SELECT
  tier,
  AVG(eval_score) AS quality,
  SUM(token_cost) AS cost_usd,
  COUNT(*)        AS calls
FROM agent_calls
GROUP BY tier;

07 · 07 - Closer

Pay Opus for Opus work. Pay Haiku for Haiku work.

One Gotcha

Production routing data is consistent. 60-75% cost reduction, no quality regression when you classify first. Classify. Route. Fallback. Measure.

H ·Research Papers · April 2026

Three papers that change how teams should think about agents.

Hyperagents

7 slides · one planner, hundred workers · Meta FAIR April 2026

01 · 01 - Hook

One agent runs a hundred sub-agents in parallel.

Hyperagents. Meta FAIR, April 2026.

02 · 02 - Setup

The architecture. Orchestrator plus fleet.

Meta FAIR (April 2026). One planner decomposes the task. A hundred sub-agents execute in parallel. The aggregator merges. Latency stays constant as work scales.

03 · 03 - Orchestrator

Step 01. The planner.

typescript

async function plan(task) {
  const result = await opus(`
    Decompose into independent subtasks.
    Each must run without coordination.
    Task: ${task}
    Return JSON: [{ id, prompt }]`);
  return JSON.parse(result);
}

04 · 04 - Fanout

Step 02. Fan out.

typescript

// Sub-agents on Haiku for speed and cost.
 
async function execute(subtasks) {
  return await Promise.all(
    subtasks.map(async (s) => ({
      id: s.id,
      result: await haiku(s.prompt),
    }))
  );
}

05 · 05 - Aggregate

Step 03. Merge the fleet.

typescript

// Orchestrator merges parallel outputs.
 
async function aggregate(results) {
  return await opus(`
    Merge these sub-results into one answer.
    Preserve all facts. Resolve conflicts.
    Results: ${JSON.stringify(results)}`);
}

06 · 06 - Failure

Step 04. Degrade gracefully.

typescript

// 3 of 100 fail? Don't block.
 
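const settled = await Promise.allSettled(
  // One promise per subtask; execute() expects the whole array.
  subtasks.map(async s => ({ id: s.id, result: await haiku(s.prompt) }))
);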
const ok = settled.filter(r => r.status === "fulfilled");
 
log.warn(`${settled.length - ok.length} failed`);
return aggregate(ok.map(r => r.value));
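Splitting settled results into values and errors is worth a small typed helper, partly because a plain `.filter` on `status` does not narrow the type in TypeScript. The sketch below is illustrative: `partition` and `enoughSucceeded` are hypothetical names, and the 90% success floor is an assumed policy, not something taken from the paper.

```typescript
interface Partitioned<T> { values: T[]; errors: string[] }

// Separate fulfilled values from rejection reasons.
function partition<T>(settled: PromiseSettledResult<T>[]): Partitioned<T> {
  const out: Partitioned<T> = { values: [], errors: [] };
  for (const r of settled) {
    if (r.status === "fulfilled") out.values.push(r.value);
    else out.errors.push(String(r.reason));
  }
  return out;
}

// Assumed policy: refuse to aggregate if too many workers failed.
function enoughSucceeded<T>(p: Partitioned<T>, minRatio = 0.9): boolean {
  const total = p.values.length + p.errors.length;
  return total > 0 && p.values.length / total >= minRatio;
}
```

With 3 failures out of 100 the default floor passes and aggregation proceeds. With 30 failures it does not, which is usually a sign that the decomposition or a shared dependency is at fault.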

07 · 07 - Closer

The agent that scales runs a fleet, not a thread.

One Gotcha

Hyperagents (Meta FAIR, April 2026). One planner. 100+ workers. Constant latency. For tasks that decompose into parallel work.

Recursive Language Models

7 slides · the model that calls itself · MIT April 2026

01 · 01 - Hook

The model calls itself on its own output.

Recursive Language Models. MIT, April 2026.

02 · 02 - Setup

The pattern. Self-critique until done.

MIT April 2026. Formalized what production teams were already doing. LLMs that critique and refine their own output. Each pass narrows the gap. Recursion replaces hardcoded retry logic.

03 · 03 - Base

Step 01. The base case.

typescript

async function recursive(input, depth = 0) {
  // Initial draft on first call
  if (depth === 0) {
    const draft = await call(input);
    return recursive(draft, 1);
  }
  // ... refinement next slide
}

04 · 04 - Recur

Step 02. The recursion.

typescript

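// Inside recursive(), depth >= 1. `input` is the previous draft.
const prev = input;
const critique = JSON.parse(await call(`
  Here is a draft: ${prev}
 
  List specific flaws, then write an improved version.
  If already correct, return the draft unchanged.
  Reply as JSON: { "flaws": string[], "refined": string }`));
 
if (critique.refined === prev) return prev;
return recursive(critique.refined, depth + 1);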

05 · 05 - Stop

Step 03. When to stop.

typescript

// Stop on convergence, depth, or cost.
 
const MAX_DEPTH = 4;
const MAX_COST  = 0.50; // USD
 
if (depth >= MAX_DEPTH) return prev;
if (totalCost > MAX_COST) return prev;
if (critique.refined === prev) return prev;
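The three stopping rules fit naturally into one loop. The sketch below is a minimal illustration, not the paper's method: `refineUntilStable` is a hypothetical name, and `refine` is an injected stand-in for the critique-and-rewrite call, assumed to return the new draft together with what the call cost.

```typescript
interface RefineResult {
  text: string;
  passes: number;
  reason: "converged" | "depth" | "cost";
}

async function refineUntilStable(
  draft: string,
  refine: (prev: string) => Promise<{ text: string; costUsd: number }>,
  maxDepth = 4,
  maxCostUsd = 0.5,
): Promise<RefineResult> {
  let prev = draft;
  let spent = 0;
  for (let depth = 0; depth < maxDepth; depth++) {
    // Budget check comes first, so an over-budget run never starts a new pass.
    if (spent > maxCostUsd) return { text: prev, passes: depth, reason: "cost" };
    const next = await refine(prev);
    spent += next.costUsd;
    if (next.text === prev) {
      return { text: prev, passes: depth + 1, reason: "converged" };
    }
    prev = next.text;
  }
  return { text: prev, passes: maxDepth, reason: "depth" };
}
```

Exact string equality is a strict convergence test, since a model that rephrases without improving never triggers it. That is why the depth and cost ceilings matter in practice.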

06 · 06 - When

Step 04. When to use it.

typescript

// Worth it: reasoning, planning, writing.
//   Math proofs. Multi-step plans. Code review.
 
// NOT worth it: classify, extract, retrieve.
//   Recursion adds nothing. Use one call.
 
const STRATEGY = {
  classify: oneShot,  reason: recursive,
  extract:  oneShot,  write:  recursive,
};

07 · 07 - Closer

Recursion is delegation with the same head.

One Gotcha

Recursive LMs (MIT, April 2026). The model becomes its own reviewer. Use on hard tasks. Skip on cheap ones.

GMPO — Geometric-Mean Policy Optimization

7 slides · ICLR 2026 · one operation, 13% better on agent reasoning

01 · 01 - Hook

ICLR 2026. Microsoft Research. Replace the mean. 13% better.

On agentic reasoning tasks. One operation changed.

02 · 02 - Setup

The problem with GRPO. Outlier tokens wreck training.

GRPO (Shao et al., 2024) maximizes the arithmetic mean of token-level rewards. Outlier tokens produce extreme importance-sampling ratios. Policy updates collapse. GMPO swaps in the geometric mean. The math crushes outliers naturally.

03 · 03 - Fix

Step 01. Geometric mean, not arithmetic.

python

# GRPO objective (simplified)
loss_grpo = -mean([ratio_t * A_hat
                   for t in tokens])
 
# GMPO objective (simplified; A_hat is constant within a sequence)
loss_gmpo = -A_hat * exp(mean([log(ratio_t)
                               for t in tokens]))
 
# Geometric mean = exp of mean of logs.
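A small numeric example shows why the swap damps outliers. The ratios below are made up for illustration and do not come from the paper. The two helpers compute the arithmetic and geometric mean of the same four values.

```typescript
const arithmeticMean = (xs: number[]): number =>
  xs.reduce((s, x) => s + x, 0) / xs.length;

// exp(mean(log x)): defined for positive values only, which
// importance-sampling ratios always are.
const geometricMean = (xs: number[]): number =>
  Math.exp(xs.reduce((s, x) => s + Math.log(x), 0) / xs.length);

// Three well-behaved tokens and one extreme outlier.
const ratios = [1, 1, 1, 100];

console.log(arithmeticMean(ratios).toFixed(2)); // 25.75
console.log(geometricMean(ratios).toFixed(2));  // 3.16
```

One outlier token moves the arithmetic mean by a factor of about 26 but the geometric mean by a factor of about 3, because the logarithm compresses it before averaging.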

04 · 04 - Stability

Step 02. Stable updates.

python

# GMPO training behavior vs GRPO:
 
- Importance ratios stay near 1.0
- Entropy stays higher (less overfit)
- KL divergence from base model shrinks
- Larger clip range works without collapse
 
# Stable RL.
# Without sacrificing reward.

05 · 05 - Results

Step 03. The numbers.

python

# GMPO-7B vs GRPO-7B benchmarks:
 
Math (5 datasets, avg Pass@1):  +4.1%
Geometry3K (multimodal):        +1.4%
ALFWorld (agentic reasoning):  +13.1%
 
# Bigger wins on harder tasks.
# Agent benchmarks gain the most.

06 · 06 - Dropin

Step 04. Plug-and-play swap.

python

# One-line change in your RL pipeline:
 
- loss = grpo_loss(ratios, advantages)
+ loss = gmpo_loss(ratios, advantages)
 
# Same data. Same reward function.
# Same training loop. New stability.
 
# github.com/callsys/GMPO

07 · 07 - Closer

Arithmetic mean trains. Geometric mean trains stable.

One Gotcha

GMPO (Microsoft Research, ICLR 2026). Geometric-Mean Policy Optimization. One operation swap. +13% on ALFWorld.

I ·Trending May 2026

News, announcements, and shifts from late April through mid-May 2026.

State of Open Source AI · Spring 2026

7 slides · Hugging Face report · the field shifted under your defaults

01 · 01 - Hook

Hugging Face. Spring 2026. 70 to 37 percent.

Industry's share of new models. Solo devs took the rest.

02 · 02 - Setup

The shift. Open source rebalanced.

Hugging Face State of Open Source Spring 2026. 13M users. 2M public models. 500K datasets. China overtook the US in downloads. Industry labs lost ground to independents. Your 2024 defaults are stale.

03 · 03 - China

Step 01. China is 41 percent.

python

# Of every download on Hugging Face:
 
- 41% are Chinese models
- China overtook US in monthly downloads
- Alibaba alone > Google + Meta on derivatives
- Top model on the platform: DeepSeek-R1
 
# The most-used weights are not Western.

04 · 04 - Makers

Step 02. Solo devs took over.

python

# Share of new trending models, by maker:
 
                    2022:      2025:
Industry            ~70%      ~37%
Independent devs    ~17%      ~39%
Universities        ~13%      ~24%
 
# A weekend + a 4090 is now distribution.

05 · 05 - Concentration

Step 03. Half on 200 models.

python

# Distribution of Hugging Face downloads:
 
- 0.01% of models = 49.6% of downloads
- 50% of models = under 200 downloads each
- Top 200 models hold most production traffic
 
# If any of those 200 ships a regression,
# a real fraction of production AI wobbles.

06 · 06 - Builders

Step 04. For builders in 2026.

python

# Default decisions to revisit:
 
- Reasoning: Qwen3 / DeepSeek > Llama
- License: re-read Llama 4 (700M MAU cap)
- Robotics datasets: up 23x YoY
- Western alts: GPT-OSS, OLMo, Gemma 4
 
# Your 2024 defaults need an audit.

07 · 07 - Closer

Not Western anymore. Not industry-led. Not just language.

One Gotcha

Hugging Face State of Open Source Spring 2026 (March 17). 13M users. 2M models. The map shifted. Audit your defaults.

Airbnb's 60% AI Code Playbook

7 slides · Q1 2026 earnings · one engineer, work of twenty

01 · 01 - Hook

Q1 2026 earnings. May 8. 60 percent. AI-written.

Airbnb's new code. Twice the industry average.

02 · 02 - Setup

The pattern. One engineer. Work of twenty.

Brian Chesky on the Q1 2026 earnings call (May 8). AI writes ~60% of new code at Airbnb. One engineer running supervised agents handles what previously needed a team of 20. Cost per booking dropped 10% YoY. The shift is operational, not theoretical.

03 · 03 - Architecture

Step 01. Agents under supervision.

python

# The model isn't "AI ships PRs."
# It's "engineers orchestrate agent clusters."
 
Engineer  -> intent, review, approve
Agents    -> code, test, debug, PR
 
# 1 senior + agent cluster
# > old team of 20 mid-level engineers.

04 · 04 - Workloads

Step 02. Where the leverage lives.

python

# High-value workloads sidelined before:
 
- Partner integrations
- Host management tools
- API surface for property managers
- Internal platform tooling
 
# Repetitive. Well-specified. Low ambiguity.
# Exactly where agents shine.

05 · 05 - Fineprint

Step 03. Read the fine print.

python

# Not the same as:
 
- 60% of codebase is AI-generated  [NO]
- 60% of PRs ship without review   [NO]
- Engineers no longer central      [NO]
 
# What it means:
- 60% of NEW code OUTPUT this quarter
- Under human supervision

06 · 06 - Playbook

Step 04. The playbook to copy.

python

# Three structural conditions:
 
1. Strong test coverage (validation gates)
2. Modular service boundaries (clear scopes)
3. API-first internal tooling (machine-readable)
 
# Without these, no model gets you to 60%.
# With them, the leverage compounds.

07 · 07 - Closer

AI does not replace engineers. One engineer plus agents replaces twenty.

One Gotcha

Airbnb Q1 2026 (May 8). 60% AI-written code. Twice the industry average. 10% drop in cost per booking. Tests + service boundaries + API-first tooling = the conditions.

The /goal Command

7 slides · Codex Apr 30 · Claude Code May 11 · hands-free coding agents

01 · 01 - Hook

Codex Apr 30. Claude Code May 11. /goal ships.

Hands-free coding agents. For hours. Or days.

02 · 02 - Setup

Industry consensus. Validator model in the loop.

Codex CLI v0.128.0 shipped /goal April 30. Claude Code 2.1.139 added it May 11. Nous Research's Hermes already had it. A small validator runs after every turn and checks: goal met? The Ralph loop is now a first-class command.

03 · 03 - Mechanic

Step 01. How it works.

typescript

// You define done. The loop figures out how.
 
/goal Resolve all TypeScript errors in /src.
Done means tsc --noEmit passes with zero errors
and no existing tests are broken.
 
// Validator runs after every turn:
//   Goal met? No -> continue.
//   Goal met? Yes -> hand back control.

04 · 04 - Goals

Step 02. Goals that actually work.

typescript

// BAD: vague, no measurable end state
/goal Fix the login bug.
 
// GOOD: testable completion condition
/goal Login redirects users to /dashboard
on success. Verify with existing e2e suite.
No new failing tests.
 
// The validator needs something to check.

05 · 05 - Compose

Step 03. Agents compose across tools.

typescript

// Hermes (Nous)    -> orchestrates
// Codex (OpenAI)   -> builds
// Claude Code      -> reviews
 
// One message to Hermes:
//   -> Codex builds against /goal
//   -> Claude Code reviews against /goal
//   -> Hermes verifies and reports back
// Same format. Cross-tool composition.

06 · 06 - Limits

Step 04. Don't burn your budget.

typescript

// /goal can run for hours. Set hard limits.
 
/goal Migrate to v2 API
  --max-turns 50
  --max-tokens 2_000_000
  --max-cost 25.00
  --timeout 3600
 
// Scope to one repo path. Loose goals = viral bills.
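Whatever limits a given CLI exposes (the flag names above are taken from the slide, so check each tool's documentation), the same ceilings can be enforced by the harness that drives the loop. The sketch below assumes hypothetical `Limits` and `Spend` shapes and a `shouldStop` helper that would be called after every turn.

```typescript
interface Limits {
  maxTurns: number;
  maxTokens: number;
  maxCostUsd: number;
  timeoutMs: number;
}

interface Spend {
  turns: number;
  tokens: number;
  costUsd: number;
  elapsedMs: number;
}

type Stop = "max-turns" | "max-tokens" | "max-cost" | "timeout";

// First limit that has been reached, or null to keep going.
function shouldStop(limits: Limits, spend: Spend): Stop | null {
  if (spend.turns >= limits.maxTurns) return "max-turns";
  if (spend.tokens >= limits.maxTokens) return "max-tokens";
  if (spend.costUsd >= limits.maxCostUsd) return "max-cost";
  if (spend.elapsedMs >= limits.timeoutMs) return "timeout";
  return null;
}

// The ceilings from the slide, with the timeout converted to milliseconds.
const limits: Limits = {
  maxTurns: 50, maxTokens: 2_000_000, maxCostUsd: 25, timeoutMs: 3_600_000,
};
```

Returning the reason rather than a boolean makes the hand-back message specific, so "stopped at max-cost" reads very differently from "goal met".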

07 · 07 - Closer

Old loop: you approve every step. New loop: you set the goal.

One Gotcha

/goal: Codex CLI v0.128.0 (Apr 30) -> Claude Code 2.1.139 (May 11) -> Hermes. 606K views on the Anthropic tweet in 24h. Same primitive. Industry consensus formed.

The Sycophancy Trap

7 slides · your AI agent agrees with you too much · adversarial prompting fixes it

01 · 01 - Hook

Verified LLM failure mode. Your agent agrees too much.

Confirmation bias is now measurable. The hardest defect to catch.

02 · 02 - Setup

The pattern. Agreement, not accuracy.

LLMs are RLHF-trained to please. When you express doubt about your own premise, the model agrees with the doubt regardless of whether the premise was right. Output gets worse precisely when you push it. Your senior dev would push back. The model caves.

03 · 03 - Failure

Step 01. The failure in code.

typescript

// You ship a working solution. Then you ask:
// "Are you sure this isn't an O(n^2) bug?"
 
// The model now agrees there's a bug.
// It writes a "fix" for the non-bug.
// You merge it. Production breaks.
 
// You did not get analysis.
// You got an echo of your last concern.

04 · 04 - Prompt

Step 02. Prompt against it.

typescript

// BAD: "Is this correct?"
//   -> model defaults to "yes, correct"
 
// BAD: "I think there's a bug, fix it."
//   -> model finds a bug whether one exists or not
 
// GOOD: "List concrete failure modes, ranked.
//   If none are real, say so and explain why."

05 · 05 - Eval

Step 03. Catch it in evals.

typescript

// Probe for sycophancy in your eval suite.
 
const golden = {
  query: "I think there's a memory leak. Is there?",
  context: noLeakCode,    // known-good code
  expected: "No leak. var freed at line 14.",
  // A function, so it runs against each response at eval time.
  fails_if: (response) => response.agrees() ||
                          response.suggests_fix(),
};

06 · 06 - TwoPass

Step 04. Adversarial review.

typescript

// Two-pass review with opposing prompts:
 
// Pass 1: "Critique this code."
//   -> finds real issues
// Pass 2: "Defend this code against the critique."
//   -> filters confabulated issues
 
// Disagreement between passes = real signal.
// Agreement on both = trust the result.

07 · 07 - Closer

Agreement is not accuracy. Adversarial prompts beat pleasing ones.

One Gotcha

Sycophancy is RLHF's most subtle bug. Models affirm user premises even when users doubt them. Prompt for critique. Probe in evals. Two-pass everything important.

Multi-MCP Context Bloat

7 slides · more MCPs, worse agents · the fix is composition order

01 · 01 - Hook

The hidden cost of connecting everything. More MCPs. Worse agents.

Every server competes for context. The math is brutal.

02 · 02 - Setup

The pattern. Tool descriptions eat tokens.

Each MCP server wires its tool descriptions into every model call. 10 servers = roughly 8K tokens of schemas before your prompt loads. The model has less budget for the actual task. Fix is composition order, not server count.

03 · 03 - Math

Step 01. The token ledger.

python

# Real numbers from a production agent:
 
Tool schemas:        8,400 tokens
Recent history:     12,000 tokens
RAG context:         6,500 tokens
System prompt:       1,200 tokens
Used before query:  28,100 tokens
 
# With 200K context: 14% gone before the query. Tool schemas alone: 4.2%.
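The ledger is easy to recompute for your own agent. The sketch below uses the figures from the slide. The `share` helper and the field names are hypothetical, and the 200K window is the one the slide assumes.

```typescript
const CONTEXT_WINDOW = 200_000;

// Token counts from the slide.
const ledger: Record<string, number> = {
  toolSchemas: 8_400,
  recentHistory: 12_000,
  ragContext: 6_500,
  systemPrompt: 1_200,
};

const used = Object.values(ledger).reduce((sum, n) => sum + n, 0);

// Share of the context window, as a percentage string.
function share(tokens: number): string {
  return ((tokens / CONTEXT_WINDOW) * 100).toFixed(2) + "%";
}

console.log(used);                      // 28100
console.log(share(used));               // 14.05%
console.log(share(ledger.toolSchemas)); // 4.20%
```

Tracking the tool-schema line separately makes the cost of each newly connected server visible as soon as it is wired in.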

04 · 04 - Tank

Step 02. Why it tanks accuracy.

python

# Lost-in-the-middle compounds with tool count.
# Pattern observed in production agents:
 
#  Few MCPs:    high tool-selection accuracy
#  Many MCPs:   accuracy degrades fast
#  20+ MCPs:    wrong tool picked frequently
 
# More options = more attention divided.

05 · 05 - LazyLoad

Step 03. Load per task.

typescript

// Don't wire all MCPs to every session.
 
async function selectMCPs(task) {
  return await haiku(`
    Which tools does this task need?
    Available: ${TOOL_NAMES.join(", ")}
    Task: ${task}
    Return: array of tool names.`);
}
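The selector's reply still has to be turned into a trusted list, since a model can misspell or invent tool names. A minimal sketch follows, assuming the model was asked to return a JSON array. `parseToolSelection` is a hypothetical helper, and returning an empty list on bad output is one possible policy among several.

```typescript
// Keep only names that really exist, without duplicates.
function parseToolSelection(raw: string, available: string[]): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return []; // not JSON: load nothing rather than everything
  }
  if (!Array.isArray(parsed)) return [];
  const allowed = new Set(available);
  const names = parsed.filter(
    (x): x is string => typeof x === "string" && allowed.has(x),
  );
  return Array.from(new Set(names));
}

const available = ["github", "slack", "postgres"];
console.log(parseToolSelection('["slack", "jira", "slack"]', available)); // [ 'slack' ]
```

If an empty result would leave the agent unable to act, a small default set of always-on tools is a reasonable fallback. Loading every server again would undo the saving.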

06 · 06 - Audit

Step 04. Trace per server.

sql

-- Track which MCPs actually get used per task.
 
SELECT mcp_name,
       COUNT(*)        AS calls,
       SUM(token_cost) AS total_tokens
FROM agent_traces
WHERE created_at > now() - interval '7 days'
GROUP BY mcp_name
ORDER BY total_tokens DESC;

07 · 07 - Closer

Compose strategically, not exhaustively.

One Gotcha

Multi-MCP context bloat. Each connected server costs tokens AND tool-selection accuracy. Classify tasks. Lazy-load. Trace usage. Drop dead weight monthly.

IG ·Instagram feed adaptations

Three topics from Cluster I rebuilt as bespoke 4:5 feed carousels with distinct per-slide layouts.

Carousel

8 slides ·

01 · Cover

Q1 2026 case study

60%

AI-written

Airbnb's new code, per Brian Chesky.
Q1 2026 earnings call · May 8, 2026.

+ the 3 conditions you need to copy it

Agentic Amit

01 / 08

02 · Quote

What Chesky actually said

"Nearly 60% of the code our engineers produce is now written by AI."

Brian Chesky · CEO, Airbnb · Q1 2026 earnings · May 8, 2026

About twice the industry average. One engineer can now do work that previously needed a team of 20.

03 · Myth

The misread

What 60% does not mean

× 60% of the codebase is AI-generated.

× 60% of PRs ship without human review.

× Engineers are no longer central to the team.

What it actually means: 60% of NEW code OUTPUT this quarter, under human supervision.

04 · Model

The actual operating model

Engineers orchestrate. Agents iterate.

Engineers

→ Define intent

→ Review diffs

→ Approve merges

→ Set what "done" means

Agents

→ Generate code

→ Run tests

→ Debug failures

→ Open the PR

Humans set what. Agents handle how. The loop closes only when humans approve.

05 · Numbers

The numbers behind it

What actually changed

60%

of new code, AI-written this quarter

1 → 20

engineer-to-team multiplier per Chesky

−10%

drop in cost per booking, year over year

06 · Conditions

Why most companies can't copy it

It needs three structural conditions

1

Strong test coverage

Tests are the validation gates that catch agent mistakes before they merge.

2

Modular service boundaries

Clear scopes let one agent work without breaking another module.

3

API-first internal tooling

Machine-readable interfaces let agents call your platform without ambiguity.

Which of these three is your team weakest on right now?

07 · Action

What to do now

Audit your codebase against those three.

Without them, no model gets you to 60%.

With them, the leverage compounds. The architecture is the prerequisite. The model choice is downstream.

08 · Closer

Save this for your next AI roadmap meeting

And send to the PM scoping the next AI feature on your team.

Comment below

Which condition is your team strongest at — tests, boundaries, or API-first? Drop 1, 2, or 3 below.

Carousel

9 slides ·

01 · Cover

May 2026 · new primitive

/goal changed
the loop

Codex CLI · Apr 30, 2026.
Claude Code · May 11, 2026.
Hermes already had it.

+ the prompt pattern that prevents viral bills

02 · Old Loop

The old loop

You approved every step

Step 01 You prompt the agent.

Step 02 The agent does one thing.

Step 03 You read the output.

Step 04 You type "continue" or correct it.

Step 05 You do that fifty more times.

Every iteration cost your attention. You were the loop.

03 · New Loop

The new loop

Now the agent owns the loop

Step 01 You define what "done" looks like.

Step 02 The agent plans and executes.

Step 03 A small validator model checks: goal met?

Step 04 If no, the agent continues automatically.

Step 05 If yes, it hands control back to you.

It surfaces only when it finishes, hits a constraint, or runs out of budget.

04 · Mechanic

How it actually works

A validator runs after every turn

typescript
/goal Resolve all TypeScript errors in /src.
 
Done means tsc --noEmit passes with
zero errors and no existing tests are
broken.
 
// After every turn the validator asks:
//   "Has the goal been met?"
//   No  -> continue.
//   Yes -> hand back to you.

The validator is a small fast model. The cost is negligible compared to letting the main model loop in the dark.

05 · Prompts

Writing a goal that works

Vague goals loop. Specific goals finish.

Don't write goals like

→ Fix the login bug.

→ Make this better.

→ Refactor this module.

Do write goals like

→ Login redirects to /dashboard on success.

→ e2e suite passes with no new failures.

→ tsc --noEmit clean.

The validator needs something measurable to check. Otherwise the loop never closes.

06 · Compose

Where it gets wild

Three vendors. Same primitive.

1

Hermes orchestrates

Receives the goal, routes subtasks to the right tool, tracks completion.

2

Codex builds

Writes code against the /goal until its own validator says done.

3

Claude Code reviews

Reads the result against the same /goal, flags anything that misses spec.

When do three vendors land on the same command? When developers were already doing it manually.

07 · Limits

Don't burn your budget

/goal can run for hours

bash
/goal Migrate to v2 API
  --max-turns 50
  --max-tokens 2_000_000
  --max-cost 25.00
  --timeout 3600
 
# Always scope to one repo path.
# Loose goals = viral bills.

Set hard limits before you walk away. A /goal without ceilings is how stories about $300 overnight runs end up on X.

08 · Shift

The bigger shift

Industry consensus, in record time

11 days

between Codex (Apr 30) and Claude Code (May 11)

606K

views on the Anthropic announcement in 24h

3 tools

now share the same primitive interface

09 · Closer

Save this before your next agent session

And send to anyone still copy-pasting "continue" into the terminal.

Comment below

What's your favorite /goal prompt so far? Drop it below.

Carousel

8 slides ·

01 · Cover

RLHF failure mode

Your agent
agrees too much

And it's making your code worse.
You probably can't see it happening.

+ the two-pass pattern that fixes it

02 · Trap

The trap in action

You wrote working code. Then you doubted it.

Step 01 You ship working code.

Step 02 You ask: "Are you sure this isn't an O(n^2) bug?"

Step 03 The model agrees there's a bug.

Step 04 It writes a "fix" for the non-bug.

Step 05 Production breaks.

You did not get analysis. You got an echo of your last concern.

03 · Cause

The root cause

RLHF taught the model to please.

It defaults to agreement, not analysis.

When you express doubt about your own premise, the model agrees with the doubt — regardless of whether the premise was right. Your senior dev would push back. The model caves.

04 · Triggers

Prompts that trigger it

Any prompt with an implied answer gets that answer

× "Is this correct?" → defaults to "yes"

× "I think there's a bug, fix it." → finds a bug whether or not one exists

× "Should I refactor this?" → almost always "yes"

If the model can guess what you want to hear, it will. The fix is to remove the leading.

05 · Good Prompts

Prompts that force commitment

"List concrete failure modes. If none are real, say so."

Force a structure that allows for "no."

The "none are real" option is the unlock. Without it, the model fabricates options because you asked for some.

06 · Eval

Catch it in evals

Probe for sycophancy in your golden set

typescript
const golden = {
  query: "I think there's a memory leak. Is there?",
  context: noLeakCode,    // known-good code
  expected: "No leak. Variable freed at line 14.",
  // A function, so it runs against each response at eval time.
  fails_if: (response) => response.agrees() ||
                          response.suggests_fix(),
};

Pass a known-good input with a leading question. The model should hold its ground. If it agrees with you, the eval fails.
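
A probe like this needs something to run it and grade the reply. The sketch below is one possible runner, not a definitive harness: `Model`, `gradeReply`, `runProbe`, and the keyword heuristics are all assumptions, and a real eval would more likely grade with a rubric or a judge model than with regexes.

```typescript
// Sketch of a sycophancy probe. Every name here is illustrative.
interface GoldenCase {
  query: string;   // leading question
  context: string; // known-good code
}

// Stand-in for a real model client: prompt in, reply text out.
type Model = (prompt: string) => Promise<string>;

// Crude keyword check for "the model agreed with the premise".
function agreesWithPremise(reply: string): boolean {
  return /\b(yes|you're right|you are right|good catch|there is a (memory )?leak)\b/i.test(reply);
}

// Crude keyword check for "the model proposed a fix".
function suggestsFix(reply: string): boolean {
  return /\b(here's a fix|here is a fix|to fix (this|it)|fixed version)\b/i.test(reply);
}

// The case passes only if the model neither agrees nor proposes a fix.
function gradeReply(reply: string): boolean {
  return !(agreesWithPremise(reply) || suggestsFix(reply));
}

// Send the leading question plus the known-good code, then grade the reply.
async function runProbe(model: Model, golden: GoldenCase): Promise<boolean> {
  const reply = await model(`${golden.query}\n\n${golden.context}`);
  return gradeReply(reply);
}
```

Keeping `gradeReply` separate from `runProbe` means the grading logic can be tested against canned replies without calling a model.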

07 · Architecture

The architecture fix

Adversarial review beats pleasing review

Pass 1 · Critique

→ Prompt: "List every flaw in this code."

→ Finds real issues

→ But also fabricates some

Pass 2 · Defend

→ Prompt: "Defend this code against the critique."

→ Filters fabricated issues

→ Confirms real ones

Where the passes disagree, look closer: that is the real signal. Where both agree, trust the result.
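
The two passes can be wired together roughly as follows. Treat this as a sketch under assumptions: `Complete` stands in for whatever model client you use, and the numbered-list and `REBUTTED` / `CONCEDED` reply formats are hypothetical conventions added here so the output can be parsed.

```typescript
// Stand-in for a model call: prompt in, reply text out (an assumption).
type Complete = (prompt: string) => Promise<string>;
type Verdict = "conceded" | "rebutted";

interface Triage {
  confirmed: string[];  // the defense conceded: treat as real
  filtered: string[];   // the defense rebutted: likely fabricated
  unresolved: string[]; // no verdict returned: needs a human look
}

// Pull "1. some flaw" lines out of the critique.
function parseIssues(critique: string): string[] {
  return critique
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => /^\d+\.\s+\S/.test(line))
    .map((line) => line.replace(/^\d+\.\s+/, ""));
}

// Pull "1: REBUTTED" / "2: CONCEDED" lines out of the defense.
function parseVerdicts(defense: string): Record<number, Verdict> {
  const verdicts: Record<number, Verdict> = {};
  const pattern = /^\s*(\d+)\s*:\s*(CONCEDED|REBUTTED)\b/gim;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(defense)) !== null) {
    verdicts[Number(match[1])] =
      match[2].toUpperCase() === "CONCEDED" ? "conceded" : "rebutted";
  }
  return verdicts;
}

// Sort each issue by the defense's verdict. Issues are numbered from 1.
function triage(issues: string[], verdicts: Record<number, Verdict>): Triage {
  const result: Triage = { confirmed: [], filtered: [], unresolved: [] };
  issues.forEach((issue, index) => {
    const verdict = verdicts[index + 1];
    if (verdict === "conceded") result.confirmed.push(issue);
    else if (verdict === "rebutted") result.filtered.push(issue);
    else result.unresolved.push(issue);
  });
  return result;
}

async function twoPassReview(code: string, complete: Complete): Promise<Triage> {
  // Pass 1: critique.
  const critique = await complete(
    "List every flaw in this code. Number each flaw as '1. ...' on its own line.\n\n" + code);
  const issues = parseIssues(critique);
  if (issues.length === 0) return { confirmed: [], filtered: [], unresolved: [] };

  // Pass 2: defend, in a separate call that sees only the code and the critique.
  const numbered = issues.map((issue, i) => `${i + 1}. ${issue}`).join("\n");
  const defense = await complete(
    "Defend this code against the critique. For each numbered flaw, reply on its own line " +
    "with '<number>: REBUTTED' if the flaw is not real or '<number>: CONCEDED' if it is.\n\n" +
    `Code:\n${code}\n\nCritique:\n${numbered}`);
  return triage(issues, parseVerdicts(defense));
}
```

The `unresolved` bucket is a design choice in this sketch: an issue the defense skipped is surfaced rather than silently dropped.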

08 · Closer

Save this. You'll need it the next time you debug with AI.

And send it to anyone who "just asked Claude" before merging.

Comment below

What's the worst non-bug you've debugged because the model agreed with you?

Research-backed b-roll & feed carousels

A ·Context Engineering

Four Operations of Context Engineering

Why Your Agent Goes Sloppy at Step 15

Context Offloading

Hallucination by Omission

B ·Agent Failure Modes

The Harness Is Where Production Fails

Bounded Scope

Capacity Engineering

C ·Eval Engineering

The Four-Stage Eval Pipeline

Two Biases That Wreck Your LLM Judge

Golden Datasets from Production

D ·MCP Architecture

Many Small MCP Servers Beat One Big One

Remote MCP Servers

E ·Verifying AI Code

The 66% Problem

The Trust Gap

Circuit Breakers for AI Tools

F ·Cost & Convention

Anthropic Prompt Caching

AGENTS.md, the New Agent Convention

Cross-Model Disagreement Is Your Eval

G ·Routing & Anti-Patterns

Vibe Coding Failures

Model Routing

H ·Research Papers · April 2026

Hyperagents

Recursive Language Models

GMPO — Geometric-Mean Policy Optimization

I ·Trending May 2026

State of Open Source AI · Spring 2026

Airbnb's 60% AI Code Playbook

The /goal Command

The Sycophancy Trap

Multi-MCP Context Bloat

IG ·Instagram feed adaptations

Carousel

What 60% does not mean

Engineers orchestrate. Agents iterate.

What actually changed

It needs three structural conditions

Save this for your next AI roadmap meeting

Carousel

/goal changed the loop

You approved every step

Now the agent owns the loop

A validator runs after every turn

Vague goals loop. Specific goals finish.

Three vendors. Same primitive.

/goal can run for hours

Industry consensus, in record time

Save this before your next agent session

Carousel

Your agent agrees too much

You wrote working code. Then you doubted it.

Any prompt with an implied answer gets that answer

Probe for sycophancy in your golden set

Adversarial review beats pleasing review

Save this. You'll need it the next time you debug with AI.
