This resonates a lot. The CLAUDE.md hierarchy + indexed artifacts is exactly the kind of structural thinking that most "how to prompt Claude" content skips over entirely.
The piece I keep running into with solo builders is that even when they have a good structure, the failure mode is trusting Claude's output too uniformly — treating fast generation as a proxy for correctness. The code looks clean, tests pass, ships fine... and then a month later nobody (including you) can reason about it because the decisions that shaped it never got persisted anywhere. Your decision capture artifact solves exactly that.
One thing I've been exploring from a complementary angle: rather than scaffolding the SDLC, building a clearer internal model of how Claude reasons — where it's reliable vs. where it needs human review gates. Working on a free starter pack around this (panavy.gumroad.com/l/skmaha) — would be curious if any of this maps to patterns you've seen with the scaffold.
How does your framework compare to spec-driven development e.g. https://github.com/github/spec-kit? In my experience, spec-kit produces a lot of markdown files and little source code.
Very similar; in particular, the first phase is a lot of markdown and no code too. But spec-kit is clearly more mature and wider in features and support, while my scaffold is newborn and supports just Claude Code.
I feel that my scaffold is more adherent to old-style waterfall: for example, it begins with the definition of the stakeholders, and it takes advantage of the less-adopted practice of maintaining assumptions and constraints, not just user stories and requirements.
A big difference is that I have introduced decisions, which are not just design decisions but also coding decisions: after the initial requirements elicitation phase, whenever the agent needs to decide on an approach or establish a pattern, that choice is crystallised in a decision artifact. The artifacts are indexed in such a way that future coding sessions automatically inject the relevant decisions into their context.
Another difference is that with the scaffold you can state high-level goals, and if the project is complex enough the design will propose a split into multiple components. Each component can be seen as a separate codebase, with its own stack and procedures. In this way you obtain a mono-repo, but with shared requirements and design, which helps a lot with change management: sometimes changes affect several components, and without the shared requirements and design it would be pretty hard to automate them.
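For illustration, a mono-repo produced this way might look roughly like the following (all names are hypothetical, just to show where the shared artifacts sit relative to the per-component codebases):

```
repo/
  requirements/      # shared, cross-component requirements
  design/            # shared design, including the component split
  decisions/         # indexed decision artifacts
  components/
    api/             # its own stack, procedures, and CLAUDE.md
    web/
    worker/
```

A cross-cutting change then starts from `requirements/` and `design/` and fans out into the affected `components/` directories.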
Thoughts on publishing an example output perhaps in another repo? Perhaps just the first two phases? Would be interesting to see what the output looks like practically speaking (before committing to using it for a project).
I don't have any benchmarks available right now, and honestly I find it pretty hard to make them, considering that the workflow I have set up is not fully automated: there is a lot of human intervention in the pre-coding phases.
I feel the problem of token waste a lot, and actually that was the first reason I introduced a hierarchy for instructions and the artifact indexes: to avoid waste. Then I realized that these approaches also keep the context lean, which helps the AI agent deliver better results.
Consider that in the initial phases the token consumption is very limited: it is in the implementation phase that tokens are consumed fast and that the project can proceed with minimal human intervention. You can try just the first requirements-collection phase to try out the approach; the implementation phase is pretty boring and not innovative.
I am playing around with building my own similar scaffold and am faced with the question you pose.
How can you tell if your prompt process works? I feel like the outputs of an SDLC process are at a much higher level than what evals typically measure, but I am no eval expert.
For sure the proposed approach is more token-consuming than just asking for the final outcome of the project at a high level and letting an AI agent decide everything and deliver the code. That can be acceptable for small personal projects, but if you want to deliver production-ready code you need to control all the intermediate decisions, or at least save and store them. They are needed because otherwise, for any high-level change you request later, the agent will not be able to make focused and coherent code changes: previously forgotten decisions get modified, and the code change produces lots of side effects.
We have built something similar for our SDLC, but it is based on Claude Code slash commands:
- /tasks:capture — Quick capture idea/bug/task to tasks/ideas/
- /tasks:groom — Expand with detailed requirements → tasks/backlog/
- /tasks:plan — Create implementation plan → tasks/planned/
- /tasks:implement — Execute plan, run tests → tasks/done/
- /tasks:review-plan — Format plan for team review (optionally Slack)
- /tasks:send — Send to autonomous dev pipeline via GitHub issue
- /tasks:fast-track — Capture → groom → plan → review in one pass
- /tasks:status — Kanban-style overview of all tasks
Workflow: capture → groom → plan → implement → done (with optional review-plan before implement, or send for autonomous execution).
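For anyone who hasn't used the mechanism: Claude Code slash commands are markdown prompt files under `.claude/commands/`, with subdirectories providing the namespace. A command like `/tasks:groom` could be defined roughly like this (the prompt body below is an illustrative guess, not our actual file):

```markdown
<!-- .claude/commands/tasks/groom.md -->
Take the task file named in $ARGUMENTS from tasks/ideas/.
Expand it with detailed requirements: user-facing behavior,
edge cases, acceptance criteria, and affected modules.
Write the groomed task to tasks/backlog/ and remove the original.
```

`$ARGUMENTS` is the placeholder Claude Code substitutes with whatever follows the command, so `/tasks:groom my-idea.md` grooms that specific file.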
How would you benchmark this?