Advanced slop prevention techniques

AI writes code faster than anyone can read it. That’s the whole problem.

“Slop” starts as unverified code: plausible, well-formatted, and confidently wrong in ways nobody checked. You can’t prompt your way out of it, because the same fluency that produces good code produces convincing mistakes. The fix is a pipeline of gates where the work has to survive review it can’t talk its way past.

The organizing idea is reviewer independence. A reviewer that shares the author’s blind spots will wave the slop through, so every gate I add is deliberately less correlated with the thing that wrote the code, and harder to argue with than the gate before it.

The independence ladder

Gate	Strength	Why
Same model, same context	Weak	Shares the author’s blind spots (“Review your code” in one chat)
Different model, /cleanup	Better	Uncorrelated blind spots
Different system, Greptile	Strong	Adversarial, gated on a number
Deterministic audits + CI	Total	No judgment left to fool

Read top to bottom: each gate is harder to fool than the last. A plan is all judgment. A cross-model review is a second opinion. Greptile needs a number. The audit scripts are math. CI is a boolean. Slop dies somewhere along that gradient. The further down it dies, the more certain I am that it’s actually dead.

Prevention first: plan, then slice

Most people try to review their way out of a huge diff. The better move is to not create the huge diff in the first place.

Every feature starts as a written plan at two levels. The product stance covers what we’re building and why. The technical stance covers how it fits the architecture, and it turns into a list of pull requests before a line is written. The product stance prevents feature slop, the kind no linter catches. The technical stance prevents architectural slop.

Because the PRs are planned in advance, they’re never huge. That happens to suit my preference for tidiness, but the real reason for detailed PR planning is to prevent the slop that hides in large diffs. A reviewer, human or machine, skims a 2,000-line PR and approves on vibes. A 200-line PR with a stated purpose can actually be read.
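A size budget like this can be checked mechanically before review starts. The sketch below is illustrative only, not the tooling described here. It sums added and deleted lines from the text that `git diff --numstat` prints (one `added<TAB>deleted<TAB>path` line per file, with `-` in place of counts for binary files). The function names and the default budget of 200 are assumptions borrowed from the figure above.

```python
def changed_lines(numstat: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        added, deleted = parts[0], parts[1]
        # Binary files are reported as "-\t-\tpath" and carry no line counts.
        if added.isdigit() and deleted.isdigit():
            total += int(added) + int(deleted)
    return total


def within_budget(numstat: str, budget: int = 200) -> bool:
    """True when the diff is small enough to be read rather than skimmed."""
    return changed_lines(numstat) <= budget
```

Feeding it the output of `git diff --numstat main...HEAD` (the branch name is an assumption) gives a yes/no answer before any reviewer spends attention on the PR.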

The self-review pass, cross-model

Each PR gets a /cleanup skill pass: a structured review for correctness, maintainability, modularity, and testability, plus a couple of suggested tests for any non-trivial logic. The trick is which model runs it. If I write the code with an OpenAI model, an Opus model does the cleanup. If I write with Opus, GPT does the cleanup.

This is the cheapest high-leverage move in the whole pipeline. Same-model self-review mostly catches typos, because the reviewer shares the author’s assumptions. A different model fails in different places, so it questions the assumptions themselves. The “self-review” stops being self-review at all. It’s a second opinion from a different mind, and it costs nothing but a model switch.

The adversarial gate: Greptile

Then Greptile reviews the PR on its most rigorous setting. A confidence score of 5/5 is required to merge. This is a different system, not just a different model, and it’s unskippable.

The classic AI-review failure is asking the model to grade its own homework. It rationalizes its own choices. An independent reviewer with a hard numeric gate breaks that loop. Either it hits 5/5 or the PR doesn’t merge. There’s nothing to negotiate with.
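The value of a numeric gate is that it reduces to a comparison. As a rough sketch, assume the reviewer's summary contains text such as "Confidence score: 4/5". That format, the regular expression, and the function name are all assumptions for illustration, not Greptile's documented output or API.

```python
import re

# Assumed format: the review summary contains text like "Confidence score: 4/5".
SCORE_PATTERN = re.compile(r"confidence\s+score\s*:?\s*(\d)\s*/\s*5", re.IGNORECASE)


def may_merge(review_text: str, required: int = 5) -> bool:
    """Allow a merge only when a score is present and meets the bar."""
    match = SCORE_PATTERN.search(review_text)
    if match is None:
        return False  # no score found: fail closed rather than guess
    return int(match.group(1)) >= required
```

Failing closed when no score is found matters as much as the threshold: a missing review should block the merge exactly as a low score does.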

The deterministic pass: audit scripts

Everything above is judgment. The last layer is not.

Two audit scripts measure the code with no opinions involved.

The frontend runs Fallow, a single codebase-intelligence tool written in Rust. It reports dead code, complexity hotspots, duplication, circular dependencies, boundary violations, and an overall health score. One integrated binary.

The backend runs a Python script that orchestrates five focused tools into one report. Ruff handles lint and autofix. Radon measures cyclomatic complexity and maintainability index. Vulture finds dead code. jscpd detects copy-pasted duplication. Deptry catches unused and missing dependencies. Five specialists, one composed verdict.
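The orchestration pattern is simple to sketch: run each tool as a subprocess, capture its exit code and output, and fold the results into one report. The version below is a minimal illustration, not the actual script. The command lines in `DEFAULT_COMMANDS` are basic invocations chosen as assumptions, and the names `ToolResult`, `run_tools`, and `render_report` are hypothetical.

```python
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolResult:
    name: str
    returncode: Optional[int]  # None when the tool could not be started
    output: str


# Assumed invocations; the real script's flags and paths are not given.
DEFAULT_COMMANDS = {
    "ruff": ["ruff", "check", "."],
    "radon": ["radon", "cc", "-j", "."],
    "vulture": ["vulture", "."],
    "jscpd": ["jscpd", "."],
    "deptry": ["deptry", "."],
}


def run_tools(commands: dict[str, list[str]]) -> list[ToolResult]:
    """Run each tool in turn and capture its exit code and combined output."""
    results = []
    for name, argv in commands.items():
        try:
            proc = subprocess.run(argv, capture_output=True, text=True, check=False)
            results.append(ToolResult(name, proc.returncode, proc.stdout + proc.stderr))
        except FileNotFoundError:
            results.append(ToolResult(name, None, "not installed"))
    return results


def render_report(results: list[ToolResult]) -> str:
    """One line per tool. Exit codes are only a coarse signal: tools differ in
    how they report findings, so a real script would parse each tool's output."""
    lines = []
    for r in results:
        if r.returncode is None:
            status = "not installed"
        else:
            status = f"exit {r.returncode}"
        lines.append(f"{r.name}: {status}")
    return "\n".join(lines)
```

Calling `print(render_report(run_tools(DEFAULT_COMMANDS)))` from the repository root would produce the combined summary. Turning each tool's raw output into comparable numbers is the part this sketch leaves out.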

The ratchet

Both scripts have a --check mode that runs in CI. The threshold isn’t zero complexity or zero duplication because those are standards nobody can meet and everybody disables. The threshold is that this PR must not make any metric worse than the current baseline.

That’s a ratchet, not a wall. Slop can’t get in, and the baseline only ever tightens. CI enforces it whether or not I remember to, which is the point. Discipline you have to remember isn’t a process, it’s a hope.
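The ratchet rule itself fits in a few lines. This is a minimal sketch under two assumptions: every metric is a number where lower is better, and the baseline is a plain mapping from metric name to value. The metric names and function names are hypothetical.

```python
def regressions(
    baseline: dict[str, float], current: dict[str, float]
) -> dict[str, tuple[float, float]]:
    """Return metrics that got worse, assuming lower is better for all of them."""
    worse = {}
    for name, value in current.items():
        # A metric with no baseline yet cannot regress.
        old = baseline.get(name, value)
        if value > old:
            worse[name] = (old, value)
    return worse


def tighten(baseline: dict[str, float], current: dict[str, float]) -> dict[str, float]:
    """New baseline: keep the better (lower) value of each metric."""
    merged = dict(baseline)
    for name, value in current.items():
        merged[name] = min(value, merged.get(name, value))
    return merged


baseline = {"dead_code": 12, "duplicated_lines": 340, "max_complexity": 18}
current = {"dead_code": 10, "duplicated_lines": 355, "max_complexity": 18}

print(regressions(baseline, current))  # {'duplicated_lines': (340, 355)}
```

A `--check` mode would exit nonzero whenever `regressions` returns anything. `tighten` would run only after a passing PR merges, which is what makes the baseline move in one direction.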

Run it, don’t just read it

Every gate so far reads the diff. None of them checks whether the app actually works.

Review and analysis answer one question, “Is the code sound?” Behavior answers another, “Does it actually do the thing?” Code can clear all five gates while still being wrong, so the last step is to use the app.

The QA doc comes from the planning doc, not the code. The same spec that generated the PRs generates a manual test checklist. Every behavior the plan promised becomes a step to perform and a result to expect. Then I run the app and walk the list by hand. I’ve automated some of this with Playwright in the past, but I like the confidence of knowing features work because I tested them myself.

Deriving the checklist from intent rather than from the implementation is what makes it independent. A test written by reading the code only asserts that the code does what the code does. A test written from the plan asks whether the code does what was meant. That gap is where the most expensive slop hides. The kind that compiles, type-checks, reviews clean, and ships the wrong behavior anyway.
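One way to picture a checklist derived from intent is as data that never touches the implementation: each promised behavior is an action paired with an expected result. The structure below is a hypothetical sketch. The class name, field names, and output format are assumptions, not the format of the planning docs described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromisedBehavior:
    action: str    # what the tester does
    expected: str  # what the plan says should happen


def render_checklist(feature: str, behaviors: list[PromisedBehavior]) -> str:
    """Turn the plan's promises into numbered manual test steps."""
    lines = [f"QA checklist: {feature}"]
    for number, behavior in enumerate(behaviors, start=1):
        lines.append(f"{number}. [ ] Do: {behavior.action}")
        lines.append(f"   Expect: {behavior.expected}")
    return "\n".join(lines)
```

Nothing in this structure references a function, file, or test name from the code, so the checklist stays the same however the feature ends up being implemented.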

One project at a time

I write code one PR at a time, and one project at a time. No parallel branches, no three features in flight at once.

This reads like a productivity sacrifice. It’s really more of the same prevention. Every gate above demands attention, and attention is the one resource that doesn’t parallelize. Two PRs in flight means each gets half a review, and half a review is how slop slips through a gate that technically ran. Small PRs keep each gate tractable. Single-threading keeps me tractable.

Research isn’t a second thread running beside the code. It’s the intake that feeds the pipeline. The worthwhile parts of YouTube talks (read as transcripts), X threads, and other public wisdom get collected as notes over time. When an idea is large enough to matter, it earns its own planning doc, and that doc becomes the next coding project. Research never competes with the code for attention. It decides what the code should be. Better inputs make better plans, and the plan is where the quality of everything downstream gets set.

What this doesn’t catch

Honesty keeps an anti-slop essay from becoming slop itself.

The one thing nothing here catches is a wrong plan: a flawless, fully tested implementation of something nobody needed. The QA doc inherits whatever the planning doc assumed, so if the plan was wrong, every gate below it faithfully verifies the wrong thing. That failure lives at the very top of the ladder, which is exactly why the plan gets the most human attention and everything downstream gets automated.

The gates don’t make the code correct. They make it verified, which is a smaller claim and the only honest one. Correctness still starts with deciding what to build. Everything below that just makes sure what got built is what was meant.