docs: add eval evidence guide #1598
Open
YOMXXX wants to merge 1 commit into
Reviewer note: this is intentionally the smallest first step from RFC #1597.
What changed:
What did not change:
Verification run after the commit:
git diff --check HEAD~1 HEAD
test -f docs/eval-evidence.md
rg -n "docs/eval-evidence.md|Eval Evidence for Superpowers PRs|Reporting Evidence in PRs" README.md docs/testing.md docs/eval-evidence.md
All exited 0.
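For reviewers who want to re-run the checks in one step, the three commands above could be wrapped roughly as follows. This is only a sketch and not part of the PR: the helper names `run_checks` and `all_passed` are hypothetical, and it assumes `git`, `test`, and `rg` are on the PATH and that it is run from the repository root.

```python
import subprocess
from typing import Sequence


def run_checks(commands: Sequence[Sequence[str]]) -> dict[str, int]:
    """Run each command and map its display string to its exit code.

    A command that cannot be found is recorded as 127, the conventional
    shell "command not found" status, instead of raising.
    """
    results: dict[str, int] = {}
    for argv in commands:
        label = " ".join(argv)
        try:
            completed = subprocess.run(argv, capture_output=True, text=True)
            results[label] = completed.returncode
        except FileNotFoundError:
            results[label] = 127
    return results


def all_passed(results: dict[str, int]) -> bool:
    """True only when every recorded exit code is 0."""
    return all(code == 0 for code in results.values())


# The three verification commands quoted in the reviewer note.
CHECKS = [
    ["git", "diff", "--check", "HEAD~1", "HEAD"],
    ["test", "-f", "docs/eval-evidence.md"],
    [
        "rg", "-n",
        "docs/eval-evidence.md|Eval Evidence for Superpowers PRs|Reporting Evidence in PRs",
        "README.md", "docs/testing.md", "docs/eval-evidence.md",
    ],
]

if __name__ == "__main__":
    outcome = run_checks(CHECKS)
    for label, code in outcome.items():
        print(f"exit {code}: {label}")
    print("All exited 0." if all_passed(outcome) else "At least one check failed.")
```

Recording exit codes per command, rather than stopping at the first failure, keeps the output easy to paste into a PR as verification evidence.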
What problem are you trying to solve?
The PR template asks contributors to show evaluation, rigor, and adversarial evidence, and writing-skills explains RED/GREEN pressure testing for skill work. But contributors still have to infer how to package that evidence for reviewers across different change types.
That makes review harder than it needs to be: useful evidence can be buried in prose, a baseline can be missing, or a docs-only PR can overclaim behavior evidence it did not run. #1597 proposes standardizing this evidence before attempting larger workflow-state or preference features.
What does this PR change?
Adds docs/eval-evidence.md, a contributor-facing guide for packaging PR evidence. It defines a reusable evidence packet, a change-type evidence matrix, and short templates for runtime bugfixes, skill behavior changes, and docs-only guidance. It also links the guide from the README contribution steps and docs/testing.md.
Is this change appropriate for the core library?
Yes. This is contributor infrastructure for all Superpowers changes. It is not project-specific, harness-specific, or tied to a third-party service, and it does not change runtime or skill behavior.
What alternatives did you consider?
writing-skills. Rejected because the guidance applies to runtime bugs, hook/installer fixes, harness support, and docs-only contributor guidance, not only skill authorship.
Does this PR contain multiple unrelated changes?
No. All changes support one concern: helping contributors present test and eval evidence in a reviewer-friendly format.
Existing PRs
docs/eval-evidence.md
Related prior art and nearby work:
Searches included the exact terms docs/eval-evidence.md, Eval Evidence for Superpowers PRs, and evidence packet; no direct duplicate was found.
Environment tested
New harness support (required if this PR adds a new harness)
Not applicable. This PR does not add or modify harness support.
Clean-session transcript for "Let's make a react todo list"
Evaluation
writing-skills, the eval harness, and the PR template, but there was no focused guide that explained how to package baseline, after-change, adversarial, verification, and limits evidence for reviewers.
docs/eval-evidence.md gives a standard evidence packet, change-type matrix, templates, and common mistakes; README and docs/testing.md link to it from the contribution/test flow.
Verification commands run:
All commands exited 0.
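To make the five evidence parts named above concrete (baseline, after-change, adversarial, verification, and limits), here is a small illustrative sketch of an evidence packet as a data structure. It is an assumption for illustration only: the `EvidencePacket` class, its field names, and its Markdown output are hypothetical and are not taken from docs/eval-evidence.md, which may define the packet differently.

```python
from dataclasses import dataclass, fields


@dataclass
class EvidencePacket:
    """Hypothetical container for the five evidence parts (not the guide's format)."""

    baseline: str = ""
    after_change: str = ""
    adversarial: str = ""
    verification: str = ""
    limits: str = ""

    def missing_parts(self) -> list[str]:
        """Names of parts left empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def to_markdown(self) -> str:
        """Render one labeled line per part; empty parts are marked explicitly."""
        lines = []
        for f in fields(self):
            title = f.name.replace("_", " ").capitalize()
            body = getattr(self, f.name).strip() or "Not provided."
            lines.append(f"**{title}:** {body}")
        return "\n".join(lines)


if __name__ == "__main__":
    # Values paraphrase this PR's own description of a docs-only change.
    packet = EvidencePacket(
        baseline="No focused guide explained how to package evidence for reviewers.",
        after_change="docs/eval-evidence.md is linked from README and docs/testing.md.",
        verification="git diff --check, test -f, and rg checks all exited 0.",
        limits="Docs-only guide; no live adversarial model eval was run.",
    )
    print(packet.to_markdown())
    print("Missing:", packet.missing_parts())
```

Marking an empty part as "Not provided." rather than omitting it reflects the concern in this PR that a missing baseline or an unrun eval should be visible to reviewers instead of silently absent.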
Rigor
superpowers:writing-skills and completed adversarial pressure testing (paste results below)
This is a docs-only contributor guide. It does not modify skill behavior, Red Flags tables, rationalization guidance, or any prompt used by agents. The unchecked boxes are intentional because no live adversarial model eval was run or needed for a docs-only guide.
Human review
The complete staged diff was shown before submission. The human partner previously instructed me to treat shown diffs as reviewed for these PRs unless they say otherwise.
Refs #1597.