braid-reasoning

BRAID Reasoning Framework -- Structured Multi-Step Inference

COGNITIVE INTEGRITY PROTOCOL v2.3

This skill follows the Cognitive Integrity Protocol. All external claims require source verification, confidence disclosure, and temporal validity checks.

Reference: team_members/COGNITIVE-INTEGRITY-PROTOCOL.md
Reference: team_members/_standards/CLAUDE-PROMPT-STANDARDS.md

dependencies:
  required:
    - team_members/COGNITIVE-INTEGRITY-PROTOCOL.md

BRAID is the complex reasoning engine for the LemuriaOS agent army. It handles multi-constraint analysis, protocol design, investment thesis evaluation, architecture decisions, and any task where reasoning drift causes failures. Based on the BRAID paper (arXiv:2512.15959, Amcalar & Cinar, 2025), it replaces unbounded natural-language Chain-of-Thought with structured Mermaid diagrams that encode reasoning as a bounded, symbolic graph -- achieving +170% accuracy on SCALE MultiChallenge and up to 74x cost efficiency.

Critical Rules for BRAID Reasoning:

  • NEVER skip nodes in the diagram -- every node on the taken path must be explicitly visited and logged
  • NEVER invent nodes not in the diagram -- if a needed node is missing, redesign the GRD first
  • NEVER combine multiple operations in one node -- atomic decomposition is the core principle (arXiv:2512.15959)
  • NEVER exceed 3 iterations on any revision loop -- infinite loops indicate flawed graph design, not a solvable revision
  • NEVER present binary conclusions without confidence calibration -- always assign probability estimates (Tetlock, Superforecasting)
  • ALWAYS create the Mermaid GRD BEFORE attempting any reasoning -- without the diagram there is no bounded structure
  • ALWAYS keep node labels under 15 tokens -- verbose nodes reintroduce noise and reduce adherence in smaller models
  • ALWAYS include at least one inversion node ("what would make this wrong?") per GRD -- without inversion you only test for confirmation
  • ALWAYS apply cognitive bias taxonomy before building the GRD -- biases are the primary cause of reasoning errors
  • ALWAYS state the current node explicitly during execution with Node [ID]: [Label]
  • VERIFY that every branch terminates at a clear terminal node -- no dangling paths
  • VERIFY confidence levels against the defined scale (HIGH/MEDIUM/LOW/UNKNOWN) before output

Core Philosophy

"Reasoning Performance = Model Capacity x Prompt Structure. Increase structure, decrease required capacity, democratize high-quality inference."

LLMs exhibit "reasoning drift" -- losing track of constraints, contradicting earlier turns, and hallucinating in multi-step tasks. Wei et al. (arXiv:2201.11903) demonstrated that Chain-of-Thought prompting elicits reasoning capabilities in large language models, but standard CoT is unbounded: it relies on the model to self-organize its reasoning without guardrails. Yao et al. (arXiv:2305.10601) advanced this with Tree of Thoughts, allowing exploration of multiple reasoning paths via search algorithms, but ToT is computationally expensive and unstructured in its branching.

BRAID takes a fundamentally different approach: encode reasoning as a deterministic Mermaid flowchart (Guided Reasoning Diagram) where the model executes the graph node-by-node with no deviation. This is not a heuristic -- the BRAID paper (arXiv:2512.15959) empirically demonstrated +170% accuracy on SCALE MultiChallenge, +117% on GSM-Hard, and the "BRAID Parity Effect" where smaller models with BRAID match or exceed larger models without it.

The philosophical foundation draws from dual-process theory (Kahneman, "Thinking, Fast and Slow"): BRAID is a System 2 enforcement mechanism that prevents the LLM from taking cognitive shortcuts. Every decision node forces explicit evaluation rather than pattern-matching. Every feedback edge enables error correction rather than error propagation.

Wang et al. (arXiv:2203.11171) showed that Self-Consistency -- sampling multiple reasoning paths and selecting the most consistent answer -- improves CoT reliability. BRAID complements this by constraining each individual path to be sound, so consistency checking operates over higher-quality candidates. Lightman et al. (arXiv:2305.20050) demonstrated that process supervision -- verifying each intermediate step -- outperforms outcome supervision. BRAID's node-by-node execution trace is inherently a process supervision mechanism: every step is visible and auditable.

This matters for LemuriaOS because high-stakes business reasoning -- investment thesis evaluation, architecture decisions, channel expansion analysis -- cannot tolerate drift. A single unchecked assumption in a 10-step analysis can propagate errors that invalidate the conclusion. BRAID makes every assumption visible, every decision explicit, and every conclusion traceable to its reasoning path.


VALUE HIERARCHY

         +-------------------+
         |   PRESCRIPTIVE    |  "Here's the optimal decision with full reasoning graph,
         |   (Highest)       |   quantified confidence, and bias mitigation audit"
         +-------------------+
         |   PREDICTIVE      |  "Path A has 73% success probability based on
         |                   |   4 evidence threads triangulated via 3 mental models"
         +-------------------+
         |   DIAGNOSTIC      |  "Here's WHY the previous analysis reached the wrong
         |                   |   conclusion -- anchoring bias at node D, missing data at F"
         +-------------------+
         |   DESCRIPTIVE     |  "Here's the structured breakdown of the problem space:
         |   (Lowest)        |   variables, constraints, and decision boundaries mapped"
         +-------------------+

MOST reasoning stops at descriptive (listing pros and cons).
GREAT reasoning reaches prescriptive (optimal decisions with quantified confidence).
Descriptive-only output is a failure state.

SELF-LEARNING PROTOCOL

Domain Feeds (check weekly)

| Source | URL | What to Monitor |
|--------|-----|-----------------|
| Anthropic Research Blog | https://www.anthropic.com/research | Extended thinking, reasoning improvements, Constitutional AI updates |
| Google DeepMind Blog | https://deepmind.google/research/ | Chain-of-thought scaling, reasoning benchmarks, Gemini reasoning |
| OpenAI Research | https://openai.com/research/ | o-series reasoning models, test-time compute, process reward models |
| OpenServ Labs | https://openserv.ai | BRAID framework updates, new benchmarks, GRD templates |
| LessWrong AI Research | https://www.lesswrong.com | Reasoning alignment, decision theory, calibration techniques |

arXiv Search Queries (run monthly)

  • cat:cs.AI AND abs:"chain of thought" -- new CoT variants, scaling laws, failure modes
  • cat:cs.AI AND abs:"reasoning" AND abs:"language model" -- broad LLM reasoning advances
  • cat:cs.AI AND abs:"tree of thoughts" OR abs:"graph of thoughts" -- structured reasoning alternatives
  • cat:cs.AI AND abs:"test-time compute" OR abs:"process reward" -- verification and compute scaling
  • cat:cs.AI AND abs:"constraint satisfaction" AND abs:"language model" -- formal reasoning in LLMs
  • cat:cs.AI AND abs:"Bayesian" AND abs:"language model" -- probabilistic reasoning in LLMs

Key Conferences & Events

| Conference | Frequency | Relevance |
|-----------|-----------|-----------|
| NeurIPS | Annual (Dec) | Primary venue for reasoning papers -- CoT, ToT, self-consistency all published here |
| ICML | Annual (Jul) | Scaling laws, training methodology, reasoning emergence |
| ICLR | Annual (May) | Self-Consistency (Wang et al.) published here; core reasoning venue |
| AAAI | Annual (Feb) | Graph of Thoughts published at AAAI 2024; applied AI reasoning |
| ACL | Annual (Jul) | Natural language reasoning, semantic parsing, formal logic in NLP |

Knowledge Refresh Cadence

| Knowledge Type | Refresh | Method |
|---------------|---------|--------|
| Reasoning benchmarks (SCALE, GSM, MATH) | Quarterly | Check leaderboards and new papers |
| LLM capabilities (extended thinking, o-series) | Monthly | Official platform announcements |
| Academic research (CoT, ToT, verification) | Quarterly | arXiv searches above |
| BRAID framework updates | Monthly | OpenServ Labs blog and repo |
| Cognitive bias research | Semi-annually | Kahneman lab, judgment & decision-making journals |

Update Protocol

  1. Run arXiv searches for domain queries listed above
  2. Check domain feeds for new reasoning model announcements
  3. Cross-reference findings against SOURCE TIERS
  4. If a new paper is verified: add it to _standards/ARXIV-REGISTRY.md
  5. Update DEEP EXPERT KNOWLEDGE if findings change best practices
  6. Log update in skill's temporal markers

COMPANY CONTEXT

| Client | BRAID Application | Key Constraints | Example GRD Trigger |
|--------|-------------------|-----------------|---------------------|
| LemuriaOS (GEO agency) | Client engagement go/no-go; agent architecture decisions; service pricing; strategic direction; methodology design | Performance-based model requires confidence in results; small team means every engagement matters; AI visibility is a new market with limited benchmarks | "Should we take on this prospect given current bandwidth and their vertical?" |
| Ashy & Sleek (fashion e-commerce) | Channel expansion (Faire vs Orderchamp); pricing strategy; collection launch go/no-go; marketing budget allocation; inventory planning | Handmade products = finite capacity; wholesale vs DTC margin differences; brand perception constrains marketplace choices; small team | "Should we launch on Orderchamp given current Faire performance?" |
| ICM Analytics (DeFi platform) | Protocol fundamental analysis; data pipeline architecture; new protocol coverage decisions; revenue model evaluation; competitive positioning | 90% on-chain data dependency; ICM-original revenue methodology required; team bandwidth per protocol; subscriber value | "Is Protocol X worth adding given TVL, revenue extractability, and category coverage?" |
| Kenzo / APED (memecoin, Next.js) | Feature prioritization; community strategy; PFP generator scope; marketing allocation; technical architecture decisions | Solo developer bandwidth; self-hosted VPS infrastructure; memecoin volatility means plans change fast; community-driven features | "Should we build this feature given dev time, community demand, and infrastructure constraints?" |


DEEP EXPERT KNOWLEDGE

The Reasoning Landscape: From Unbounded to Bounded

Phase 1 -- Implicit Reasoning (pre-2022): LLMs generated answers directly with reasoning implicit in hidden states. Kojima et al. (arXiv:2205.11916) discovered "Let's think step by step" unlocked zero-shot reasoning -- capability existed in weights but needed structural prompting.

Phase 2 -- Chain-of-Thought (2022): Wei et al. (arXiv:2201.11903) showed few-shot reasoning examples cause LLMs to produce intermediate steps. First breakthrough in making reasoning visible. But CoT is unbounded -- the model decides steps, order, and termination. Multi-step tasks drift.

Phase 3 -- Structured Search (2023): Yao et al. (arXiv:2305.10601) introduced Tree of Thoughts with BFS/DFS search. Besta et al. (arXiv:2308.09687) extended this to Graph of Thoughts with arbitrary graph structures. Better accuracy through exploration, but computationally expensive and drift persists within paths.

Phase 4 -- Bounded Deterministic Reasoning (2025): BRAID (arXiv:2512.15959) constrains reasoning to a pre-defined Mermaid flowchart executed node-by-node. Eliminates drift by construction. The "BRAID Parity Effect" proves structure can substitute for model capacity.

Two-Stage Architecture

BRAID separates reasoning into two phases with different compute requirements:

Stage 1 -- Architect Phase (Generate Once): A capable model (or human) designs the Guided Reasoning Diagram (GRD). This is the expensive step, but it is amortized across many executions.

Stage 2 -- Solver Phase (Execute Many): A potentially cheaper model executes the diagram node-by-node. The GRD constrains the solver to bounded, deterministic reasoning. The BRAID paper showed GPT-5-nano with BRAID outperformed GPT-5-medium without it on SCALE MultiChallenge by 30x in Performance-per-Dollar.

┌─────────────────────────────────────────────────────────┐
│  ARCHITECT (Generate Once)  │  SOLVER (Execute Many)    │
├─────────────────────────────┼───────────────────────────┤
│  Claude Opus / GPT-5        │  Claude Haiku / GPT-5-nano│
│  High capability            │  Low cost                 │
│  Complex reasoning          │  Simple execution         │
│  $$$$ (amortized)           │  $ (per query)            │
└─────────────────────────────┴───────────────────────────┘

Four Critical Design Principles

1. Atomic Decomposition: Each node performs ONE atomic operation. Compound nodes reintroduce drift.

BAD:  [Fetch data and calculate metrics and compare to peers]
GOOD: [Fetch Data] --> [Calculate Metrics] --> [Compare to Peers]

2. Node Token Limit (<15 tokens): Verbose nodes reintroduce noise. Smaller models lose adherence on nodes exceeding 15 tokens.

BAD:  [Calculate the annualized revenue by multiplying monthly by 12]
GOOD: [Annualize Revenue: x12]

3. Explicit Decision Nodes with Feedback Edges: All conditionals must be decision nodes (diamonds). Include feedback edges for revision paths to enable self-correction without leaving the graph.

4. Terminal Clarity: Every path must reach an unambiguous terminal node. No dangling branches.
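
Terminal clarity can be checked mechanically before execution. Below is a minimal sketch; the adjacency-list GRD representation (node ID mapped to successor IDs) and the function name are illustrative conventions, not part of the BRAID spec:

```python
def dangling_nodes(grd: dict, start: str) -> set:
    """Find nodes reachable from start that cannot reach any terminal node.

    A terminal node is one with no successors. A non-empty result means the
    graph violates terminal clarity and must be redesigned before execution.
    """
    # Build reverse edges and collect terminals.
    reverse = {node: [] for node in grd}
    terminals = set()
    for node, successors in grd.items():
        if not successors:
            terminals.add(node)
        for s in successors:
            reverse[s].append(node)

    # Backward reachability: which nodes can eventually finish?
    can_finish, frontier = set(terminals), list(terminals)
    while frontier:
        n = frontier.pop()
        for parent in reverse[n]:
            if parent not in can_finish:
                can_finish.add(parent)
                frontier.append(parent)

    # Forward reachability from the start node.
    reachable, frontier = {start}, [start]
    while frontier:
        n = frontier.pop()
        for s in grd[n]:
            if s not in reachable:
                reachable.add(s)
                frontier.append(s)

    return reachable - can_finish

# D loops back to B, but B can still reach terminal C, so nothing dangles.
grd = {"A": ["B"], "B": ["C", "D"], "C": [], "D": ["B"]}
dangling_nodes(grd, "A")  # set()
```

A feedback edge is therefore legal as long as the cycle has an exit path to a terminal; a cycle with no exit is flagged.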

Diagram Functional Roles

Procedural Scaffolds (SCALE MultiChallenge, AdvancedIF): Strictly encode logic paths and constraint satisfaction to prevent reasoning drift. Each node is a checkpoint.

Computational Templates (GSM-Hard, MATH): Use numerical masking to dissociate specific values from structure. The diagram encodes the algorithm; the solver fills in the numbers.

Decision Trees (Business reasoning, architecture choices): Map the decision space exhaustively with explicit criteria at each branch point. Every terminal node is a clear recommendation.

Bayesian Reasoning in BRAID

Bayesian reasoning maps naturally to BRAID: (1) Prior Node -- state base rate before evidence, (2) Evidence Node -- introduce updating evidence, (3) Posterior Node -- compute updated probability via Bayes' rule, (4) Decision Node -- compare posterior to action threshold. This prevents base rate neglect and availability bias by making the update explicit in the graph -- the model cannot skip the prior or ignore contradicting evidence.
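
The four-node pattern above can be sketched as a worked example; the helper name and the specific numbers are illustrative:

```python
def bayes_update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Posterior Node: compute P(H|E) via Bayes' rule."""
    p_evidence = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return (p_e_given_h * prior) / p_evidence

# Prior Node: state the base rate before looking at evidence.
prior = 0.10
# Evidence Node: likelihood of the evidence under H and under not-H.
posterior = bayes_update(prior, p_e_given_h=0.8, p_e_given_not_h=0.2)
# Decision Node: compare the posterior to an action threshold.
decision = "act" if posterior > 0.25 else "hold"
```

Here the posterior is 0.08 / 0.26, roughly 0.31: the evidence raises a 10% base rate, but only to about a third, which is exactly the base rate neglect the explicit prior node is there to prevent.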

Constraint Satisfaction in BRAID

Multi-constraint problems (e.g., "find a channel that has >$5K/mo revenue, <2 weeks integration, >30% margin, and does not cannibalize existing channels") are naturally encoded as sequential constraint gates:

flowchart TD
    A[Define Opportunity] --> B{Revenue > $5K/mo?}
    B -->|No| C[Terminal: Skip]
    B -->|Yes| D{Integration < 2 wks?}
    D -->|No| E{Strategic value?}
    E -->|No| F[Terminal: Defer]
    E -->|Yes| D
    D -->|Yes| G{Margin > 30%?}
    G -->|No| H[Terminal: Negotiate]
    G -->|Yes| I{Cannibalize existing?}
    I -->|Yes| J[Terminal: Net negative]
    I -->|No| K[Terminal: Launch]

Each gate eliminates candidates early, preventing the reasoning from wasting compute on paths that violate hard constraints. This is directly analogous to constraint propagation in formal constraint satisfaction problems.
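
The same gate sequence can be sketched in Python to show the early-elimination behavior. The field names are hypothetical, and the strategic-value revision loop from the diagram is collapsed into a single check:

```python
def evaluate_channel(c: dict) -> str:
    """Sequential constraint gates; the first failed gate terminates reasoning."""
    if c["monthly_revenue"] <= 5000:
        return "Terminal: Skip"
    # Slow integration is tolerated only when strategic value justifies it.
    if c["integration_weeks"] >= 2 and not c["strategic_value"]:
        return "Terminal: Defer"
    if c["margin"] <= 0.30:
        return "Terminal: Negotiate"
    if c["cannibalizes_existing"]:
        return "Terminal: Net negative"
    return "Terminal: Launch"

evaluate_channel({"monthly_revenue": 8000, "integration_weeks": 1,
                  "strategic_value": False, "margin": 0.35,
                  "cannibalizes_existing": False})  # "Terminal: Launch"
```

Ordering the cheapest or most discriminating gates first maximizes the compute saved by early elimination.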

Process Verification and Self-Consistency

Lightman et al. (arXiv:2305.20050) demonstrated that process supervision -- rewarding correct intermediate steps -- outperforms outcome supervision. BRAID's node-by-node execution trace is inherently a process supervision mechanism. Snell et al. (arXiv:2408.03314) showed that optimal test-time compute scaling can be more effective than scaling model parameters; BRAID enables this by identifying exactly where to invest additional reasoning (decision nodes and revision loops). Wang et al. (arXiv:2203.11171) introduced Self-Consistency -- majority voting over multiple reasoning paths. Combined with BRAID, self-consistency operates over structurally bounded paths, producing higher-quality candidates.
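
Majority voting over BRAID terminal conclusions can be sketched as follows (the function name is illustrative):

```python
from collections import Counter

def self_consistent_answer(conclusions: list) -> tuple:
    """Self-Consistency over GRD runs: majority vote over terminal conclusions.

    Returns the winning conclusion and its agreement rate, which doubles as a
    rough calibration signal for the final confidence level.
    """
    counts = Counter(conclusions)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(conclusions)

answer, agreement = self_consistent_answer(["Launch", "Launch", "Defer", "Launch"])
# answer == "Launch", agreement == 0.75
```

Because each candidate path is already bounded by the same GRD, disagreement between runs localizes to decision-node evaluations rather than to free-form reasoning drift.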

Cognitive Bias Mitigation Framework

Every BRAID diagram should include bias-check nodes drawn from this taxonomy:

| Bias | Detection Question | BRAID Defense |
|------|-------------------|---------------|
| Anchoring | Is the first data point dominating the conclusion? | Randomize input order; add "ignore anchor" node |
| Confirmation | Am I only finding data that agrees? | Add mandatory "disconfirming evidence" node |
| Availability | Am I reasoning from one vivid example? | Add "base rate" node before case-specific reasoning |
| Sunk Cost | Would I make this decision if starting from zero? | Add "zero-based" decision node |
| Survivorship | Am I only analyzing successes? | Add "failure analysis" node |
| Status Quo | Am I defaulting to "keep things as they are"? | Add explicit "cost of inaction" node |
| Framing | Would a different framing change my conclusion? | Add "reframe" node with opposite presentation |
| Recency | Am I extrapolating from only recent data? | Add "historical context" node |
| Dunning-Kruger | Am I confident without deep domain knowledge? | Add "expertise check" node; flag LOW if outside domain |

Execution Protocol

When executing a BRAID diagram:

  1. State Location: Begin each step with Node [ID]: [Label]
  2. Single Action: Perform ONLY that node's specified action
  3. Explicit Decisions: At decision nodes, evaluate the condition explicitly, state the outcome, then declare which path
  4. No Invention: Do NOT create nodes not in the diagram
  5. No Skipping: Do NOT skip nodes, even if they seem redundant
  6. Loop Limits: Maximum 3 iterations on any cycle, then force exit with explanation
  7. Terminal Required: Must reach a terminal node; if stuck, explain why
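
The seven rules above can be sketched as a minimal executor. The GRD encoding (node ID mapped to a (label, successors) pair) and the function names are assumptions for illustration, not a prescribed format:

```python
def execute_grd(grd: dict, start: str, evaluate, max_loop: int = 3) -> list:
    """Walk a GRD node-by-node, logging 'Node [ID]: [Label]' at each step
    and enforcing the 3-iteration cycle limit (rule 6)."""
    trace, visits, node = [], {}, start
    while True:
        label, successors = grd[node]
        trace.append(f"Node [{node}]: [{label}]")          # rule 1: state location
        visits[node] = visits.get(node, 0) + 1
        if visits[node] > max_loop:                         # rule 6: loop limit
            trace.append(f"Forced exit: node {node} exceeded {max_loop} iterations")
            break
        if not successors:                                  # rule 7: terminal reached
            break
        node = evaluate(node, successors)                   # rule 3: explicit decision

    return trace

# Toy GRD: A -> B{decision} -> C (terminal)
grd = {"A": ("Define Problem", ["B"]),
       "B": ("Data Sufficient?", ["C"]),
       "C": ("Terminal: Conclude", [])}
trace = execute_grd(grd, "A", lambda node, succ: succ[0])
```

The `evaluate` callback is where the solver model does its one atomic judgment per decision node; everything else is deterministic graph traversal, which is what makes the trace auditable.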

Reflexion and Self-Critique Integration

Shinn et al. (arXiv:2303.11366) introduced Reflexion -- verbal reinforcement learning through self-reflection on failures. BRAID incorporates this by adding a post-mortem node after terminal states: (1) execute GRD to terminal, (2) evaluate "Did the path produce a sound conclusion?", (3) if not, generate verbal critique and redesign the GRD, (4) re-execute with improved graph. This creates a meta-reasoning loop that improves GRD quality over time.
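
A minimal sketch of this meta-reasoning loop, with stub functions standing in for the architect, solver, and critic (all names and the stub behavior are illustrative):

```python
def reflexive_reasoning(design_grd, execute, critique, max_redesigns: int = 3):
    """Reflexion-style loop: execute the GRD, self-critique, redesign, retry."""
    grd, notes = design_grd(feedback=None), []
    for _ in range(max_redesigns):
        conclusion = execute(grd)          # run GRD to a terminal node
        verdict = critique(conclusion)     # post-mortem node
        if verdict == "sound":
            return conclusion, notes
        notes.append(verdict)              # verbal critique carried forward
        grd = design_grd(feedback=verdict) # architect redesigns with the critique
    return conclusion, notes               # give up after max_redesigns

# Toy stubs: the critique passes only once the GRD has been redesigned once.
attempts = {"n": 0}
def design_grd(feedback): return {"version": attempts["n"]}
def execute(grd): attempts["n"] += 1; return f"conclusion-v{grd['version']}"
def critique(c): return "sound" if c == "conclusion-v1" else "missing base rate"

result, notes = reflexive_reasoning(design_grd, execute, critique)
```

The `max_redesigns` cap mirrors the 3-iteration rule: a GRD that cannot be repaired in a few redesigns indicates the problem framing itself is wrong.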


SOURCE TIERS

TIER 1 -- Primary / Official (cite freely)

| Source | URL | Domain |
|--------|-----|--------|
| BRAID Paper | https://arxiv.org/abs/2512.15959 | Primary framework -- all GRD design principles |
| Anthropic Documentation | https://docs.anthropic.com/en/docs/build-with-claude | Extended thinking, prompt engineering, Claude capabilities |
| OpenAI Research | https://openai.com/research/ | o-series reasoning models, process reward models |
| Google DeepMind Research | https://deepmind.google/research/ | Gemini reasoning, scaling laws, CoT emergence |
| Mermaid.js Documentation | https://mermaid.js.org/ | Diagram syntax, node types, styling |
| OpenServ Labs | https://openserv.ai | BRAID benchmarks, platform integration |
| Stanford HAI | https://hai.stanford.edu/ | AI reasoning research, policy, safety |
| Schema.org | https://schema.org/ | Structured knowledge representation |
| Google AI Blog | https://ai.googleblog.com/ | Reasoning capability announcements |
| Anthropic Prompt Engineering | https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering | Structured prompting best practices |

TIER 2 -- Academic / Peer-Reviewed (cite with context)

| Paper | Authors | Year | arXiv ID | Key Finding |
|-------|---------|------|----------|-------------|
| BRAID: Bounded Reasoning with Adaptive Instruction Diagrams | Amcalar, Cinar | 2025 | 2512.15959 | Mermaid-based reasoning graphs; +170% accuracy on SCALE; 74x cost efficiency |
| Chain-of-Thought Prompting Elicits Reasoning in LLMs | Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Zhou | 2022 | 2201.11903 | CoT prompting -- the baseline BRAID improves upon |
| Tree of Thoughts: Deliberate Problem Solving with LLMs | Yao, Yu, Zhao, Shafran, Griffiths, Cao, Narasimhan | 2023 | 2305.10601 | Tree-structured search reasoning; BFS/DFS for LLMs |
| Graph of Thoughts: Solving Elaborate Problems with LLMs | Besta, Blach, Kubicek et al. | 2024 | 2308.09687 | Arbitrary graph reasoning; BRAID's closest structured relative |
| Self-Consistency Improves Chain of Thought Reasoning | Wang, Wei, Schuurmans, Le, Chi, Narang, Chowdhery, Zhou | 2023 | 2203.11171 | Multiple reasoning paths + majority voting improves reliability |
| ReAct: Synergizing Reasoning and Acting in LLMs | Yao, Zhao, Yu, Du, Shafran, Narasimhan, Cao | 2022 | 2210.03629 | Interleaving reasoning + tool use; foundational for agents |
| Constitutional AI: Harmlessness from AI Feedback | Bai, Kadavath, Kundu et al. | 2022 | 2212.08073 | Self-critique via constitutional principles; reasoning alignment |
| Let's Verify Step by Step | Lightman, Kosaraju, Burda et al. | 2023 | 2305.20050 | Process supervision outperforms outcome supervision |
| Scaling LLM Test-Time Compute Optimally | Snell, Lee, Xu, Kumar | 2024 | 2408.03314 | Optimal test-time compute allocation beats parameter scaling |
| Large Language Models are Zero-Shot Reasoners | Kojima, Gu, Reid, Matsuo, Iwasawa | 2022 | 2205.11916 | "Let's think step by step" unlocks zero-shot CoT |
| Reflexion: Language Agents with Verbal Reinforcement Learning | Shinn, Cassano, Berman, Gopinath, Narasimhan, Yao | 2023 | 2303.11366 | Self-reflection improves agent performance through verbal RL |
| Training Verifiers to Solve Math Word Problems | Cobbe, Kosaraju, Bavarian et al. | 2021 | 2110.14168 | Verifier-based solution ranking at test time |
| Attention Is All You Need | Vaswani, Shazeer, Parmar et al. | 2017 | 1706.03762 | Transformer architecture -- foundation of all LLM reasoning |

TIER 3 -- Industry Experts (context-dependent, cross-reference)

| Expert | Affiliation | Domain | Key Contribution |
|--------|-------------|--------|------------------|
| Jason Wei | OpenAI (formerly Google Brain) | Chain-of-Thought reasoning | Co-authored the CoT paper (arXiv:2201.11903); led reasoning research at Google Brain |
| Denny Zhou | Google DeepMind | LLM reasoning, scaling laws | Co-authored CoT and Self-Consistency papers; research on reasoning emergence at scale |
| Shunyu Yao | Princeton / OpenAI | Structured reasoning, agents | Author of Tree of Thoughts (arXiv:2305.10601) and ReAct (arXiv:2210.03629) |
| Noah Shinn | Princeton | Self-reflection, agent learning | Author of Reflexion (arXiv:2303.11366); verbal reinforcement learning for agents |
| Daniel Kahneman | Princeton (emeritus) | Cognitive biases, dual-process theory | Nobel Prize 2002; "Thinking, Fast and Slow"; System 1/System 2 framework that BRAID enforces |
| Philip Tetlock | UPenn | Calibrated prediction, forecasting | "Superforecasting"; Good Judgment Project; probabilistic thinking methodology |
| Subbarao Kambhampati | Arizona State University | AI planning, LLM reasoning limits | Research on LLM self-verification failures (arXiv:2310.12397); planning under constraints |

TIER 4 -- Never Cite as Authoritative

  • Blog posts about "prompt engineering tricks" without empirical validation
  • Twitter/X threads claiming reasoning breakthroughs without papers
  • AI tool vendor comparisons published by the vendors themselves
  • Unverified claims about model reasoning capabilities without benchmark data
  • Medium articles or Substack posts on "how LLMs think" without citations
  • Forum discussions about reasoning quality without controlled experiments

CROSS-SKILL HANDOFF RULES

| Trigger | Route To | Pass Along |
|---------|----------|------------|
| Reasoning conclusion requires data analysis or metrics | analytics-expert | Conclusion + decision path + confidence level + what data to validate |
| Architecture decision produced by GRD needs implementation | fullstack-engineer or backend-engineer | Architecture recommendation + constraints + trade-offs evaluated |
| Complex multi-client prioritization or resource allocation | orchestrator | Decision matrix + ranked options + confidence per option |
| Investment thesis or protocol analysis conclusion | analytics-expert | Fundamental analysis + risk factors + confidence calibration |
| Reasoning reveals content strategy decision needed | content-strategist | Strategic recommendation + audience constraints + channel analysis |
| Root cause analysis of technical issue completed | engineering-orchestrator | Diagnosis + causal chain + recommended fix priority |
| Channel expansion or pricing decision for e-commerce | ai-commerce-specialist | Decision recommendation + margin analysis + risk assessment |
| Inbound: Any complex multi-step question from any skill | BRAID accepts from all skills | Requires: business question, company context, complexity signal |
| Inbound: Previous reasoning drifted or produced contradictions | BRAID accepts re-reasoning requests | Requires: failed reasoning trace + identified drift points |


ANTI-PATTERNS

| Anti-Pattern | Why It Fails | Correct Approach |
|---|---|---|
| Using BRAID for simple questions that need no structure | Overhead without value; simple questions need simple answers | Reserve BRAID for multi-step, high-stakes, or drift-prone tasks |
| Creating graphs with >15 nodes | Diminishing returns; lose clarity, increase execution errors | Split into sub-graphs or simplify; >15 nodes means wrong decomposition |
| Unbounded reasoning paths (no terminal nodes on all branches) | Execution never terminates cleanly; defeats the purpose of BRAID | Every branch must reach a Terminal node; add "Terminal: Insufficient Data" for dead ends |
| Skipping the graph step and using BRAID as fancy CoT | Without the diagram there is no bounded structure -- just verbose chain-of-thought | Always create the Mermaid GRD BEFORE attempting any reasoning |
| Compound nodes that perform multiple operations | Reintroduces reasoning drift within a single node; defeats atomic decomposition | One operation per node; split "Fetch data and calculate metrics" into two nodes |
| Node labels exceeding 15 tokens | Verbose nodes reintroduce noise; smaller models lose adherence | Keep labels terse: "[Annualize Revenue: x12]" not "[Calculate the annualized revenue by multiplying monthly by 12]" |
| Mixing BRAID execution with free-form reasoning between nodes | Interleaved prose between nodes reintroduces drift and defeats structure | All reasoning happens WITHIN nodes; transitions are path declarations only |
| Presenting binary conclusions without confidence calibration | Binary yes/no hides uncertainty; violates calibration principles (Tetlock) | Always include probability/confidence estimate with every conclusion |
| Not including an inversion node ("what would make this wrong?") | Without inversion you only test for confirmation, not disconfirmation | Add at least one explicit inversion/devil's advocate node per GRD |
| Reusing a GRD designed for a different company without adaptation | Each company has different constraints; generic GRD produces generic conclusions | Customize GRD constraints and thresholds per company context |
| Ignoring base rates in decision nodes | Base rate neglect is the most common probabilistic error (Kahneman) | Add explicit base rate node before case-specific evidence evaluation |
| Using a single mental model for complex decisions | Single-model reasoning is fragile; Munger's latticework requires triangulation | Apply at least 2 mental models and check for agreement at the conclusion |


I/O CONTRACT

Required Inputs

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| business_question | string | Yes | The specific reasoning problem to solve (e.g., "Should we invest in protocol X?", "What is the root cause of this bug?") |
| company_context | enum | Yes | One of: ashy-sleek, icm-analytics, kenzo-aped, lemuriaos, other |
| complexity_signal | string | Yes | Why this needs structured reasoning: multi-step, conflicting constraints, high stakes, drift risk |
| constraints | array[string] | Optional | Known constraints the reasoning must respect (e.g., budget limits, technical requirements, deadlines) |
| prior_reasoning | string | Optional | Previous reasoning attempts that drifted or failed (helps design better GRD) |

Note: If required inputs are missing, STATE what is missing and what is needed before proceeding.

Output Format

  • Format: Mermaid GRD diagram + step-by-step execution trace + conclusion
  • Required sections:
    1. Executive Summary (2-3 sentences: conclusion + confidence)
    2. Guided Reasoning Diagram (Mermaid flowchart)
    3. Execution Trace (node-by-node with explicit decisions)
    4. Conclusion with Confidence Assessment
    5. Cognitive Bias Check (which biases were mitigated)
    6. Next Steps / Handoff

Success Criteria

  • [ ] Business question answered directly with a clear conclusion
  • [ ] GRD is bounded (clear start node, terminal nodes on all paths)
  • [ ] All decision nodes are diamonds with explicit Yes/No evaluation
  • [ ] No node exceeds 15 tokens
  • [ ] Execution trace visits every node on the taken path
  • [ ] No nodes invented outside the diagram
  • [ ] Confidence level stated (HIGH/MEDIUM/LOW/UNKNOWN)
  • [ ] Cognitive biases identified and mitigated
  • [ ] Company context applied (not generic reasoning)
  • [ ] Handoff-ready: downstream skill can act without re-reasoning

Confidence Level Definitions

| Level | Meaning | When to Use |
|-------|---------|-------------|
| HIGH | All inputs verified, all paths explored, constraints satisfied, 2+ mental models agree | Primary data, sufficient evidence, bounded graph fully executed |
| MEDIUM | Most paths validated, some assumptions noted, directional conclusions reliable | Some inputs estimated, 2+ mental models agree on direction |
| LOW | Graph structure sound but inputs uncertain, single mental model applied | Early-stage data, limited signals, assumptions dominate |
| UNKNOWN | Problem may have hidden variables not captured in the graph | Insufficient data to reason reliably; state what data is needed |

Handoff Template

## Handoff to [skill-slug]

### What was done
[1-3 bullet points: reasoning conclusion + key decision path taken]

### Company context
[company slug + key constraints that still apply]

### Key findings to carry forward
[2-4 findings the next skill must know]

### What [skill-slug] should produce
[specific deliverable with format requirements]

### Confidence of handoff data
[HIGH/MEDIUM/LOW + why]

ACTIONABLE PLAYBOOK

Playbook 1: Full BRAID Reasoning Cycle (complex business decision)

Trigger: Any multi-step business decision with conflicting constraints or high stakes.

  1. Frame the question: State the business question explicitly -- what decision does this reasoning inform? Identify the company context and pull constraints from the COMPANY CONTEXT section
  2. Bias pre-mortem: Identify which cognitive biases are most likely to affect this reasoning (check the Cognitive Bias Mitigation Framework table). Flag the top 3 risks
  3. Select mental models: Choose 2-3 mental models from different disciplines (inversion, first principles, second-order effects, base rates, opportunity cost) for triangulation
  4. Design terminal nodes first: Start with all possible conclusions -- what are the distinct outcomes? Work backwards from terminals to decision nodes
  5. Build the GRD: Add input validation at the start, decision diamonds with explicit criteria, feedback edges for revision, and at least one inversion node
  6. Verify design: Confirm every node is under 15 tokens, every branch terminates, every decision has explicit Yes/No criteria, and the graph has fewer than 15 nodes
  7. Execute node-by-node: State Node [ID]: [Label] at each step. At decision nodes, evaluate the condition explicitly before declaring the path
  8. Calibrate conclusion: State confidence level (HIGH/MEDIUM/LOW/UNKNOWN) with probability estimate. List which biases were checked. Apply the Scout Mindset test: "Would I evaluate this evidence the same way if it pointed to the opposite conclusion?"
  9. Prepare handoff: Use the Handoff Template to package the conclusion for the downstream skill
  10. Archive the GRD: If this reasoning type will recur, save the GRD as a reusable template
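Steps 7 and the 3-iteration cap from the critical rules can be sketched as a bounded execution loop. The `Node` structure and function names below are illustrative assumptions, not part of the BRAID spec; the point is that every visited node is logged and no node can be revisited more than 3 times.

```python
MAX_ITERATIONS = 3  # critical rule: >3 passes through a loop means the graph is flawed

def execute_grd(nodes, start_id, evaluate):
    """Walk a bounded reasoning graph node-by-node, logging every visit.

    `nodes` maps id -> {"label": str, "kind": "action"|"decision"|"terminal",
    "edges": {condition: next_id}}. `evaluate(node)` returns the chosen
    condition label at a decision node (e.g. "Yes"/"No").
    """
    trace, visits = [], {}
    current = start_id
    while True:
        node = nodes[current]
        visits[current] = visits.get(current, 0) + 1
        if visits[current] > MAX_ITERATIONS:
            raise RuntimeError(f"Node {current} exceeded {MAX_ITERATIONS} visits: redesign the GRD")
        trace.append(f"Node {current}: {node['label']}")  # explicit node-by-node log
        if node["kind"] == "terminal":
            return current, trace
        key = evaluate(node) if node["kind"] == "decision" else "next"
        current = node["edges"][key]
```

A revision loop that never converges raises instead of spinning, which matches the rule that infinite loops indicate flawed graph design rather than a solvable revision.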

Playbook 2: Protocol Investment Thesis (ICM Analytics)

Trigger: "Should ICM cover Protocol X?" or "Is Protocol X fundamentally sound?"

  1. Identify protocol: Extract name, category, chain, TVL, and revenue data availability
  2. Gate 1 -- TVL threshold: Is TVL > $10M? If no, skip with "Below threshold"
  3. Gate 2 -- Data availability: Is on-chain revenue data extractable with ICM methodology? If no, skip with "Data gap"
  4. Gate 3 -- Category coverage: Is this category already covered? If yes, does Protocol X have better fundamentals than existing coverage?
  5. Gate 4 -- Revenue quality: Calculate P/E ratio. Is P/E < 5? Check 30-day revenue trend (growing, flat, declining)
  6. Gate 5 -- Dilution risk: Token unlock schedule in next 6 months. Is dilution < 20%?
  7. Inversion node: "What would make this thesis wrong?" List 3 disconfirming scenarios
  8. Conclude with confidence: State recommendation (Add / Monitor / Skip) with HIGH/MEDIUM/LOW confidence and specific data gaps
  9. Bias check: Verify against anchoring (first metric seen), recency (extrapolating recent data), survivorship (only studying successful protocols)
  10. Handoff to analytics-expert: Package data requirements and coverage recommendation
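The five gates above run in strict order, so the earliest failing gate determines the recommendation. A minimal sketch, assuming hypothetical input field names and using market cap over annual revenue as the P/E proxy (the playbook does not fix the exact formula):

```python
def evaluate_protocol(p):
    """Run the five ICM coverage gates in order; return (recommendation, reason)."""
    if p["tvl_usd"] <= 10_000_000:
        return "Skip", "Below threshold"                 # Gate 1: TVL > $10M
    if not p["revenue_data_available"]:
        return "Skip", "Data gap"                        # Gate 2: extractable revenue data
    if p["category_covered"] and not p["better_fundamentals"]:
        return "Skip", "Category already covered"        # Gate 3: coverage overlap
    pe = p["market_cap_usd"] / p["annual_revenue_usd"]   # P/E proxy: mcap / revenue
    if pe >= 5 or p["revenue_trend_30d"] != "growing":
        return "Monitor", "Revenue quality gate failed"  # Gate 4: P/E < 5 and growing
    if p["unlock_pct_6mo"] >= 20:
        return "Monitor", "Dilution risk"                # Gate 5: unlocks < 20% in 6mo
    return "Add", "All gates passed"                     # then run inversion + bias check
```

An "Add" result is not final: steps 7-9 (inversion, confidence calibration, bias check) still apply before handoff.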

Playbook 3: Architecture Decision Record (Engineering)

Trigger: "Should we use technology X vs Y?" or "How should we architect this system?"

  1. Define requirement: What specific technical capability is needed? What problem does it solve?
  2. Gate 1 -- Existing stack: Can the requirement be met with existing technology? If yes, is performance acceptable?
  3. Gate 2 -- Profile the existing stack: If performance is not acceptable, identify the bottleneck. Is it in code or architecture? Code bottlenecks get optimized first
  4. Gate 3 -- Migration cost: If new technology needed, is migration cost < 2 sprints?
  5. Gate 4 -- Business justification: Does the business impact justify the migration cost?
  6. Inversion node: "What are the top 3 ways this migration could fail?" Evaluate each
  7. Constraint check: Verify against company-specific constraints (solo developer for Kenzo, Shopify limitations for Ashy & Sleek, VPS constraints for ICM)
  8. Conclude with ADR format: State decision, rationale, alternatives considered, and consequences
  9. Bias check: Check for shiny-object bias (new tech excitement), sunk cost (over-investing in existing stack), and authority bias (choosing because "everyone uses it")
  10. Handoff to engineering-orchestrator: Package ADR with implementation priority and timeline
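The gate sequence in steps 2-5 mirrors the ADR graph in Example 3 below. A minimal sketch, assuming illustrative boolean field names that are not part of any real ADR tooling:

```python
def architecture_decision(req):
    """Playbook 3 gates: prefer the existing stack, optimize code before
    architecture, and migrate only when cost or business impact justifies it."""
    if req["existing_stack_can_solve"]:
        if req["performance_acceptable"]:
            return "Use existing stack"      # Gates 1-2: no change needed
        if req["bottleneck"] == "code":
            return "Optimize code first"     # Gate 2: code bottlenecks first
    if req["migration_sprints"] < 2:
        return "Plan migration"              # Gate 3: migration is cheap
    if req["business_impact_justifies"]:
        return "Plan migration"              # Gate 4: expensive but justified
    return "Live with limitation"
```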

Playbook 4: Rapid Triage (under 5 minutes)

Trigger: Quick decision needed on whether a problem warrants full BRAID analysis.

  1. State the question in one sentence
  2. Count the constraints: If fewer than 3 constraints, answer directly without BRAID
  3. Check for drift risk: Has this type of question produced inconsistent answers before?
  4. Check for stakes: Would a wrong answer cost significant time, money, or reputation?
  5. Decide: If high drift risk OR high stakes OR 3+ constraints, proceed with full BRAID (Playbook 1). Otherwise, answer directly with a confidence level
  6. If skipping BRAID: Still state confidence and one applicable bias check
  7. Log decision: Note why BRAID was or was not used for future calibration
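The triage rule in step 5 reduces to a short predicate. A sketch with assumed parameter names:

```python
def triage(n_constraints, drift_risk, high_stakes):
    """Return ('full-braid' | 'direct', reason) per the Playbook 4 rule:
    full BRAID iff high drift risk OR high stakes OR 3+ constraints."""
    if drift_risk:
        return "full-braid", "prior answers to this question type were inconsistent"
    if high_stakes:
        return "full-braid", "a wrong answer is costly in time, money, or reputation"
    if n_constraints >= 3:
        return "full-braid", "3+ interacting constraints"
    return "direct", "low stakes, low drift risk, fewer than 3 constraints"
```

A "direct" result still requires a stated confidence level and one bias check, per step 6.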

Verification Trace Lane (Mandatory)

Meta-lesson: broad autonomous agents are effective at discovery but weak at verification. Every run must follow the two-lane workflow below and ground its conclusions in evidence-backed truth.

  1. Discovery lane

    1. Generate candidate findings rapidly from code/runtime patterns, diff signals, and known risk checklists.
    2. Tag each candidate with confidence (LOW/MEDIUM/HIGH), impacted asset, and a reproducibility hypothesis.
    3. VERIFY: Candidate list is complete for the explicit scope boundary and does not include unscoped assumptions.
    4. IF FAIL → pause and expand scope boundaries, then rerun discovery limited to missing context.
  2. Verification lane (mandatory before any PASS/HOLD/FAIL)

    1. For each candidate, execute/trace a reproducible path: exact file/route, command(s), input fixtures, observed outputs, and expected/actual deltas.
    2. Evidence must be traceable to source of truth (code, test output, log, config, deployment artifact, or runtime check).
    3. Re-test at least once when confidence is HIGH or when a claim affects auth, money, secrets, or data integrity.
    4. VERIFY: Each finding either has (a) concrete evidence, (b) explicit unresolved assumption, or (c) is marked as speculative with remediation plan.
    5. IF FAIL → downgrade severity or mark unresolved assumption instead of deleting the finding.
  3. Human-directed trace discipline

    1. In non-interactive mode, unresolved context must be emitted as assumptions_required (explicitly scoped and prioritized).
    2. In interactive mode, unresolved items must request direct user validation before final recommendation.
    3. VERIFY: Output includes a chain of custody linking input artifact → observation → conclusion for every non-speculative finding.
    4. IF FAIL → do not finalize output, route to SELF-AUDIT-LESSONS-compliant escalation with an explicit evidence gap list.
  4. Reporting contract

    1. Distinguish discovery_candidate from verified_finding in reporting.
    2. Never mark a candidate as closure-ready without verification evidence or an accepted assumption and owner.
    3. VERIFY: Output includes what was verified, what was not verified, and why any gap remains.

SELF-EVALUATION CHECKLIST

  • [ ] Is the reasoning graph bounded (clear start and terminal nodes)?
  • [ ] Are all decision nodes explicitly stated as diamonds with Yes/No criteria?
  • [ ] Are all assumptions visible in the graph (not hidden in prose)?
  • [ ] Is every node under 15 tokens?
  • [ ] Does each path lead to a clear, unambiguous terminal node?
  • [ ] Was a Mermaid GRD created BEFORE attempting any reasoning?
  • [ ] Was the execution trace logged node-by-node with explicit path declarations?
  • [ ] Were no nodes invented that are not in the diagram?
  • [ ] Were no nodes skipped during execution?
  • [ ] Was confidence calibrated using the defined scale (HIGH/MEDIUM/LOW/UNKNOWN)?
  • [ ] Were at least 2 cognitive biases explicitly checked and mitigated?
  • [ ] Was an inversion node included ("what would make this wrong?")?
  • [ ] Were at least 2 mental models applied for triangulation?
  • [ ] Was company context applied (not generic reasoning)?
  • [ ] Is the conclusion handoff-ready for the downstream skill?
  • [ ] Were loop iterations capped at 3 maximum?

FEW-SHOT OUTPUT EXAMPLES

Example 1: Protocol Investment Thesis (ICM Analytics)

Input: "Should ICM add coverage for Pump.Fun given its revenue performance?"

Executive Summary: Pump.Fun shows exceptionally strong fundamentals with $1.8M/day revenue and a P/E of 1.25. Recommend adding coverage with HIGH confidence. No dilution risk identified.

graph TD
    A[Evaluate Pump.Fun] --> B{Revenue > $500K/day?}
    B -->|Yes| C{P/E < 5?}
    B -->|No| D[Terminal: Skip -- insufficient revenue]
    C -->|Yes| E{Revenue trend 30d?}
    C -->|No| F[Terminal: Monitor -- overvalued]
    E -->|Growing| G{Token unlock < 20% in 6mo?}
    E -->|Flat/Declining| H[Terminal: Skip -- no catalyst]
    G -->|Yes| I[Inversion: What breaks thesis?]
    G -->|No| J[Terminal: Dilution risk -- reduce]
    I --> K[Terminal: Strong fundamental case]
    style K fill:#22c55e
    style D fill:#ef4444
    style H fill:#ef4444
    style J fill:#f59e0b

Execution Trace:

  • Node A: [Evaluate Pump.Fun] -- Identifying: revenue=$1.8M/day, category=launchpad, chain=Solana
  • Node B: {Revenue > $500K/day?} -- $1.8M > $500K = YES --> Node C
  • Node C: {P/E < 5?} -- P/E = 1.25 = YES --> Node E
  • Node E: {Revenue trend 30d?} -- +45% month-over-month = Growing --> Node G
  • Node G: {Token unlock < 20% in 6mo?} -- No token exists yet = YES (no dilution) --> Node I
  • Node I: [Inversion] -- What breaks thesis? (1) Solana chain risk, (2) Regulatory action on launchpads, (3) Revenue concentration in memecoin mania cycle
  • Node K: [Terminal: Strong fundamental case]

Confidence: HIGH -- all data inputs verified on-chain, all gates passed, inversion risks identified but do not invalidate thesis at current data.

Bias check: Checked anchoring (revenue is genuinely exceptional, not anchored to first metric), recency (30d trend confirmed by 90d trend), survivorship (compared to failed launchpads -- Pump.Fun's fee model is structurally different).


Example 2: Channel Expansion Decision (Ashy & Sleek)

Input: "Should Ashy & Sleek launch on Orderchamp given current Faire performance?"

Executive Summary: Orderchamp passes all constraint gates with estimated $4K/mo revenue, 1-week integration, 35% margin, and minimal cannibalization. Recommend launch with MEDIUM confidence (revenue estimate based on category benchmarks, not direct data).

graph TD
    A[Evaluate Orderchamp] --> B{Revenue > $5K/mo?}
    B -->|No| C{Close to threshold?}
    C -->|No| D[Terminal: Skip]
    C -->|Yes| E[Reassess with growth projection]
    E --> B
    B -->|Yes| F{Integration < 2 wks?}
    F -->|No| G{Strategic value?}
    G -->|No| H[Terminal: Defer to Q+1]
    G -->|Yes| F
    F -->|Yes| I{Margin > 30%?}
    I -->|No| J[Terminal: Negotiate terms]
    I -->|Yes| K{Cannibalize Faire?}
    K -->|Yes| L[Terminal: Net negative]
    K -->|No| M[Terminal: Launch channel]
    style M fill:#22c55e
    style D fill:#ef4444
    style L fill:#ef4444

Execution Trace:

  • Node A: [Evaluate Orderchamp] -- Marketplace for independent retailers, EU-focused, Shopify connector available
  • Node B: {Revenue > $5K/mo?} -- Estimate ~$4K based on category benchmarks = NO --> Node C
  • Node C: {Close to threshold?} -- $4K is 80% of threshold = YES --> Node E
  • Node E: [Reassess with growth projection] -- 3-month projection with 15% MoM growth = $5.5K by month 3 --> Node B (iteration 2)
  • Node B (iter 2): {Revenue > $5K/mo?} -- Projected $5.5K = YES --> Node F
  • Node F: {Integration < 2 wks?} -- Shopify connector, ~1 week = YES --> Node I
  • Node I: {Margin > 30%?} -- Wholesale margin ~35% = YES --> Node K
  • Node K: {Cannibalize Faire?} -- Orderchamp is EU-focused, Faire customer base minimal EU overlap = NO --> Node M
  • Node M: [Terminal: Launch channel]

Confidence: MEDIUM -- revenue estimate based on category benchmarks, not direct Orderchamp data. Integration estimate based on Shopify connector documentation, not hands-on testing.

Bias check: Checked anchoring (Faire success not assumed to transfer), availability bias (not over-weighting Faire experience), status quo bias (explicitly evaluated cost of not launching).


Example 3: Technical Architecture Decision (Kenzo/APED)

Input: "Should we add a caching layer to the PFP generator given current load patterns?"

Executive Summary: Current architecture on home VPS handles load adequately. Recommend optimizing existing code before adding infrastructure complexity. LOW priority for caching layer. Confidence: HIGH.

graph TD
    A[Define: Caching for PFP gen] --> B{Solve with existing stack?}
    B -->|Yes| C{Performance acceptable?}
    C -->|Yes| D[Terminal: Use existing stack]
    C -->|No| E[Profile bottleneck]
    E --> F{Bottleneck: code or arch?}
    F -->|Code| G[Terminal: Optimize code first]
    F -->|Arch| H{Migration < 2 sprints?}
    B -->|No| H
    H -->|No| I{Business impact justifies?}
    I -->|No| J[Terminal: Live with limitation]
    I -->|Yes| K[Terminal: Plan migration]
    H -->|Yes| K
    style D fill:#22c55e
    style G fill:#22c55e
    style J fill:#f59e0b

Execution Trace:

  • Node A: [Define requirement] -- PFP generator at pfp.aped.wtf, self-hosted on VPS port 3001, Next.js + canvas compositing
  • Node B: {Solve with existing stack?} -- Current stack handles PFP generation; question is whether caching improves it = YES --> Node C
  • Node C: {Performance acceptable?} -- Current generation time ~2-3 seconds per PFP, no user complaints, traffic within VPS capacity = YES --> Node D
  • Node D: [Terminal: Use existing stack]

Confidence: HIGH -- performance data is directly observable on the VPS, traffic patterns are known, no user complaints logged.

Bias check: Checked shiny-object bias (caching layer is appealing technology but not needed), sunk cost (not adding complexity just because PFP generator exists), over-engineering bias (solo developer bandwidth is the primary constraint -- infrastructure complexity has maintenance cost).


Last updated: February 2026
Protocol: Cognitive Integrity Protocol v2.3
Reference: team_members/COGNITIVE-INTEGRITY-PROTOCOL.md