docs: add eval evidence guide #1598

Open
YOMXXX wants to merge 1 commit into obra:dev from YOMXXX:docs/eval-evidence-kit

YOMXXX commented May 21, 2026

What problem are you trying to solve?

The PR template asks contributors to show evaluation, rigor, and adversarial evidence, and writing-skills explains RED/GREEN pressure testing for skill work. But contributors still have to infer how to package that evidence for reviewers across different change types.

That makes review harder than it needs to be: useful evidence can be buried in prose, a baseline can be missing, or a docs-only PR can overclaim behavior evidence it did not run. #1597 proposes standardizing this evidence before attempting larger workflow-state or preference features.

What does this PR change?

Adds docs/eval-evidence.md, a contributor-facing guide for packaging PR evidence. It defines a reusable evidence packet, a change-type evidence matrix, and short templates for runtime bugfixes, skill behavior changes, and docs-only guidance. It also links the guide from the README contribution steps and docs/testing.md.

Is this change appropriate for the core library?

Yes. This is contributor infrastructure for all Superpowers changes. It is not project-specific, harness-specific, or tied to a third-party service, and it does not change runtime or skill behavior.

What alternatives did you consider?

  1. Put this directly in the PR template. Rejected because the PR template is already long and should stay focused on required fields. A separate guide can include examples without making every PR body heavier.
  2. Put this inside writing-skills. Rejected because the guidance applies to runtime bugs, hook/installer fixes, harness support, and docs-only contributor guidance, not only skill authorship.
  3. Wait for a workflow-state proposal. Rejected because evaluation packaging is useful immediately and is a lower-risk first step before larger state/preference work.
  4. Do nothing. Rejected because current evidence expectations exist, but the shape of a good evidence packet is still implicit.

Does this PR contain multiple unrelated changes?

No. All changes support one concern: helping contributors present test and eval evidence in a reviewer-friendly format.

Existing PRs

  • I have reviewed all open AND closed PRs for duplicates or prior art
  • Related PRs: none found that add an eval evidence guide or docs/eval-evidence.md

Related prior art and nearby work:

Searches run included exact terms for docs/eval-evidence.md, Eval Evidence for Superpowers PRs, and evidence packet; no direct duplicate was found.

Environment tested

| Harness (e.g. Claude Code, Cursor) | Harness version | Model | Model version/ID |
| --- | --- | --- | --- |
| Local shell documentation checks | macOS zsh/bash | N/A | N/A |

New harness support (required if this PR adds a new harness)

Not applicable. This PR does not add or modify harness support.

Clean-session transcript for "Let's make a react todo list"
N/A - this PR does not add a new harness.

Evaluation

  • Initial trigger: my human partner asked me to plan and execute the next useful feature direction for Superpowers. Discussions are disabled on the repository, so I opened #1597 ("RFC: standardize eval evidence before adding workflow state and preferences") as an RFC issue and then started with the lowest-risk first phase: contributor evidence guidance.
  • Eval sessions after making the change: 0 live model eval sessions. This is docs-only contributor guidance, not behavior-shaping skill text.
  • Before: README pointed contributors to writing-skills, the eval harness, and the PR template, but there was no focused guide that explained how to package baseline, after-change, adversarial, verification, and limits evidence for reviewers.
  • After: docs/eval-evidence.md gives a standard evidence packet, change-type matrix, templates, and common mistakes; README and docs/testing.md link to it from the contribution/test flow.

Verification commands run:

git diff --cached --check
git diff --check HEAD~1 HEAD
test -f docs/eval-evidence.md
rg -n "docs/eval-evidence.md|Eval Evidence for Superpowers PRs|Reporting Evidence in PRs" README.md docs/testing.md docs/eval-evidence.md

All commands exited 0.
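
For reviewers who want to re-run the checks in one step, here is a minimal sketch of a wrapper that reports each command's exit status. `run_checks` is a hypothetical helper written for illustration; it is not part of this PR or of the repository.

```shell
# Hypothetical helper (an assumption, not part of this PR): run each check in
# its own shell, print its exit status, and return non-zero if any check failed.
run_checks() (
  failed=0
  for cmd in "$@"; do
    # Capture the exit status without aborting on the first failure.
    if sh -c "$cmd" >/dev/null 2>&1; then status=0; else status=$?; fi
    printf '%s -> exit %s\n' "$cmd" "$status"
    [ "$status" -eq 0 ] || failed=1
  done
  # The result of this test becomes the function's exit status.
  [ "$failed" -eq 0 ]
)

# Example invocation from the repository root, using two of the commands above:
#   run_checks 'git diff --check HEAD~1 HEAD' 'test -f docs/eval-evidence.md'
```

Each command is passed as a single quoted string, so a check that contains pipes or alternation patterns (such as the `rg` search) can be passed the same way without being split by the calling shell.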

Rigor

  • If this is a skills change: I used superpowers:writing-skills and completed adversarial pressure testing (paste results below)
  • This change was tested adversarially, not just on the happy path
  • I did not modify carefully-tuned content (Red Flags table, rationalizations, "human partner" language) without extensive evals showing the change is an improvement

This is a docs-only contributor guide. It does not modify skill behavior, Red Flags tables, rationalization guidance, or any prompt used by agents. The unchecked boxes are intentional because no live adversarial model eval was run or needed for a docs-only guide.

Human review

  • A human has reviewed the COMPLETE proposed diff before submission

The complete staged diff was shown before submission. The human partner previously instructed me to treat shown diffs as reviewed for these PRs unless they say otherwise.

Refs #1597.

YOMXXX (Author) commented May 21, 2026

Reviewer note: this is intentionally the smallest first step from RFC #1597.

What changed:

  • added docs/eval-evidence.md with a reusable PR evidence packet, change-type matrix, and short templates
  • linked it from README contributing steps
  • linked it from docs/testing.md

What did not change:

  • no skill behavior
  • no runtime code
  • no harness support
  • no eval harness behavior

Verification run after the commit:

git diff --check HEAD~1 HEAD
test -f docs/eval-evidence.md
rg -n "docs/eval-evidence.md|Eval Evidence for Superpowers PRs|Reporting Evidence in PRs" README.md docs/testing.md docs/eval-evidence.md

All exited 0.
