Fat Skills, Thin Harness, No Terminal

Tan's fat-skills/thin-harness architecture maps almost one-to-one onto an AI coworker for non-developers. The catch: when users don't have a repo, skills can't be files. Skills have to be the product.

Back in April, Garry Tan wrote about how AI coding tools should be architected. Boiled down, his take is this: pile the smarts up top (“fat skills”), keep the middle layer tiny (a “thin harness” of maybe 200 lines), and let boring, reliable infrastructure handle the bottom. Since then he’s published two follow-ups exploring the same idea from different angles: one on resolvers, and another on meta-meta-prompting that got a GBrain run to 97.6% LongMemEval recall.

Pretty much every example in those essays assumes you’re a developer. Skills are Markdown files in a repo. The harness is a CLI runner. The “deterministic floor” is whatever you’ve wired up, including things like git, your editor, the filesystem, your build pipeline. The reader is someone who lives in a terminal.

I’m building something different. Cheddy is an AI coworker for Google Workspace and Notion, built on the same thesis but aimed at people who’ve never opened a terminal and never will. Cheddy’s users are knowledge workers on phones and laptops. Daily briefings are the headline feature, memory is central, and the fact that Cheddy plugs into Gmail, Calendar, Drive, and Notion is what makes the thing feel alive.

It turns out the architecture maps onto Tan’s ideas almost one-to-one, with one big adjustment.

The mapping

If you squint, Tan’s three layers look like this in Cheddy:

  • Fat skills → Briefings, memory, Gmail/Calendar/Drive/Notion integrations, financial tools, code execution

  • Thin harness → The chat loop: formatting the turn, calling tools, streaming, retries, managing context budget

  • Deterministic app → Tool registry, Postgres, Temporal workflows, Dexie/IndexedDB on the client

Different labels, same shape. Skills are still where the model gets to improvise. The harness is still small, opinionated, and rarely touched. The bottom layer is still where guarantees live.

What actually worked

Fat skills, thin harness. Tan’s point is that the harness should fit in your head, and capability should grow in the skills, not in the runner. I’ve kept the chat loop to roughly that size. When I ship something new like briefings, financial tools, or the upcoming knowledge base, it arrives as new tools and prompt fragments, not as new branches in dispatch code.
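To make the size claim concrete, here’s a minimal sketch of what a loop that shape looks like. This is illustrative TypeScript with invented names, not Cheddy’s actual code; the point is that capability arrives as skill fragments and tools, and the loop never branches on which skill is firing.

```typescript
// Illustrative thin-harness loop. Skill, Tool, and Model are stand-ins,
// not real Cheddy types.
type Tool = { name: string; run: (args: unknown) => Promise<string> };
type Skill = { promptFragment: string; tools: Tool[] };
type Turn = { toolCalls: { name: string; args: unknown }[] };

// Stand-in for the model call; a real harness would stream from an LLM API.
type Model = (systemPrompt: string, user: string) => Promise<Turn>;

async function runTurn(model: Model, skills: Skill[], userMessage: string): Promise<string[]> {
  // The system prompt is assembled from skill fragments, never hardcoded here.
  const system = skills.map((s) => s.promptFragment).join("\n\n");
  const tools = new Map(skills.flatMap((s) => s.tools.map((t) => [t.name, t] as const)));

  const turn = await model(system, userMessage);

  // Dispatch is generic: no per-skill branches ever get added to this loop.
  const results: string[] = [];
  for (const call of turn.toolCalls) {
    const tool = tools.get(call.name);
    if (!tool) throw new Error(`unknown tool: ${call.name}`);
    results.push(await tool.run(call.args));
  }
  return results;
}
```

Shipping a new feature means registering another Skill in the array; the function above doesn’t change.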

Deterministic floor for anything that matters. Tan says don’t ask the model to be reliable when the application can be reliable instead. This is the way I’ve been building from the start. Tool inputs get validated. Tool outputs are typed. Anything durable runs on Temporal, so a flaky LLM call can’t vaporize someone’s morning briefing. Cheddy’s recent release of durable jobs and approvals is the same idea on a different surface.
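“Tool inputs get validated” is boring code, which is exactly the point. A hedged sketch, with an invented schema (Cheddy’s real one differs): reject garbage at the boundary so everything past the validator can rely on the types.

```typescript
// Illustrative boundary validation: the model's tool-call arguments arrive
// as `unknown` and leave as a typed value or an error. Field names invented.
interface BriefingRequest {
  userId: string;
  deliveryHour: number; // local hour, 0-23
}

function validateBriefingRequest(input: unknown): BriefingRequest {
  const o = input as Record<string, unknown> | null;
  if (typeof o?.userId !== "string" || o.userId.length === 0) {
    throw new Error("userId must be a non-empty string");
  }
  if (
    typeof o.deliveryHour !== "number" ||
    !Number.isInteger(o.deliveryHour) ||
    o.deliveryHour < 0 ||
    o.deliveryHour > 23
  ) {
    throw new Error("deliveryHour must be an integer 0-23");
  }
  // Past this point, the rest of the system never re-checks these fields.
  return { userId: o.userId, deliveryHour: o.deliveryHour };
}
```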

Resolvers for retrieval. Tan’s resolver pattern, a roughly 200-line file that decides which skill is relevant for a given turn, maps straight onto Cheddy’s selective memory retrieval. A small model picks which memories belong in this turn, they get injected into the last user message, and the static system prompt stays cache-warm. It’s basically Tan’s architecture with different skin.
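The injection side is the part worth seeing in code. In this sketch a keyword scorer stands in for the small model (so it runs standalone), but the shape is the same: pick a few memories, append them to the last user message, and leave the system prompt byte-identical so prompt caching stays warm. Names are invented for illustration.

```typescript
// Illustrative selective retrieval. In production a small model does the
// picking; a keyword overlap score substitutes here to keep this runnable.
interface Memory {
  id: string;
  text: string;
  keywords: string[];
}

function selectMemories(memories: Memory[], userMessage: string, limit = 3): Memory[] {
  const words = new Set(userMessage.toLowerCase().split(/\W+/));
  return memories
    .map((m) => ({ m, score: m.keywords.filter((k) => words.has(k)).length }))
    .filter((x) => x.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((x) => x.m);
}

// Memories ride along with the *last user message*, never the system prompt.
function injectIntoLastUserMessage(userMessage: string, picked: Memory[]): string {
  if (picked.length === 0) return userMessage;
  const block = picked.map((m) => `- ${m.text}`).join("\n");
  return `${userMessage}\n\n[Relevant memories]\n${block}`;
}
```

Keeping the system prompt static is the whole trick: the per-turn variance lives in the one message that was going to change anyway.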

No pgvector. This is where Tan’s latest post provided the most gratifying external validation. GBrain hit 97.6% on LongMemEval using structured retrieval, zero embeddings. I’d made the same call months earlier for different reasons (article caps, BM25 first, treat lint as correctness), so it was nice to see a published benchmark land in the same place. If you’re designing a memory or knowledge layer right now and your reflex is to reach for pgvector, read both pieces before you commit.

The part that broke

In Tan’s world, a skill is a Markdown file in a repo. The user opens it, reads it, edits it, forks it, shares it. Skills are inspectable artifacts under version control. That design works because the user is a developer, and their workflow already revolves around files.

That assumption collapses for knowledge workers. Cheddy’s users don’t have a repo and don’t want one. The skills layer can’t be files because there’s nowhere to put the files. Skills have to be the product.

Here’s what that looks like:

  • A briefing is a skill. But you configure it through a UI, not Markdown front-matter. The user picks sources, picks a delivery time, picks a tone. Server-side, the system assembles the instructions and feeds them to the model exactly like Tan’s Markdown file would.

  • Memory is a skill ingredient. Nobody edits a MEMORY.md. They tell the chat “remember this,” see an action card, and confirm. The structured row in Postgres is the durable artifact, and the chat thread is the editor.

  • Integrations are skills. Authorizing Gmail registers a bundle of tools plus a small instruction fragment. From the model’s perspective, that bundle is identical to a Markdown skill saying “here’s how to use Gmail.”
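The briefing bullet above is the easiest one to show. A hedged sketch, with invented field names: the UI writes a config row, and the server turns that row into the same kind of instruction string a Markdown skill file would have held.

```typescript
// Illustrative: a UI-authored config becomes a skill's instruction text.
// Field names and wording are invented, not Cheddy's actual schema.
interface BriefingConfig {
  sources: string[]; // e.g. ["gmail", "calendar", "notion"]
  deliveryHour: number; // local hour, 0-23, picked in the UI
  tone: "brisk" | "friendly" | "formal";
}

function assembleBriefingSkill(cfg: BriefingConfig): string {
  // From the model's perspective, this output is indistinguishable from a
  // Markdown skill file a developer would have written by hand.
  return [
    `You produce a daily briefing delivered at ${cfg.deliveryHour}:00.`,
    `Read from these sources: ${cfg.sources.join(", ")}.`,
    `Write in a ${cfg.tone} tone and keep it under a two-minute read.`,
  ].join("\n");
}
```

The UI is just a friendlier editor for the same artifact.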

Tan’s conviction that skills are fat and the harness stays thin holds up fine. It’s just the surface that changes.

Files for developers, UI for everyone else.

But this creates a follow-on problem that’s worth naming. In a developer tool, the user is also the skill author. They debug by reading their own files. In a non-developer product, the skill author is the product team, and the user only sees the output. Which means users are owed something that dev tooling doesn’t provide: a visible trace of what the agent did, which skills fired, and why.

I’m not done with that yet. It’s obviously the next thing to build, and Ashe’s notes from 54 builder office hours hit the same point: right now, memory and observability are the biggest bottlenecks for putting AI in production.

A worked example: briefings

Briefings are the cleanest case study.

The skill itself is roughly a page of instructions: how to read the user’s Google Workspace products (Gmail, Calendar, Drive, Tasks, Forms, etc.), Notion data, and memory; how to weight a same-day meeting against a ticket due tomorrow; how to phrase action items; how to keep it under a two-minute read. None of that lives in the chat loop. The chat loop literally doesn’t know briefings exist.

The harness is unchanged. The exact same loop that handles a normal conversation handles a scheduled briefing turn. The only difference is which prompt fragments and tools are attached at runtime.

The deterministic floor is Temporal. The schedule fires at 6am, workflows run sources in parallel, partial failures retry with proper backoff, and the assembled briefing lands in storage. SSE wakes the device, a dumb sync pulls it into Dexie, and the user opens their phone at 8am to a result that survived a flaky Notion API call at 6:03.
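The shape of that floor can be sketched in plain TypeScript. This is illustrative only: Temporal makes these steps durable across process crashes and restarts, which an in-process sketch cannot; but the logic is the same, sources in parallel, each retried with exponential backoff, and a partial failure degrades one section instead of sinking the briefing.

```typescript
// Illustrative retry-with-backoff plus parallel fan-out. In production this
// shape is a Temporal workflow; here it is plain async TypeScript.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3, baseMs = 100): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      // Exponential backoff: 100ms, 200ms, 400ms, ...
      await new Promise((r) => setTimeout(r, baseMs * 2 ** i));
    }
  }
  throw lastErr;
}

type SourceFetch = () => Promise<string>;

async function gatherSources(
  sources: Record<string, SourceFetch>
): Promise<Record<string, string | null>> {
  const entries = await Promise.all(
    Object.entries(sources).map(async ([name, fetch]) => {
      try {
        return [name, await withRetry(fetch)] as const;
      } catch {
        // Partial failure: this section is null, the briefing still ships.
        return [name, null] as const;
      }
    })
  );
  return Object.fromEntries(entries);
}
```

A flaky Notion call at 6:03 becomes a retried fetch, or at worst a missing section, never a missing briefing.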

That’s the whole shape, and it fits on one page. Fat skill on top, thin harness in the middle, deterministic infrastructure anchoring the bottom. When something goes wrong in production, it goes wrong below the skill line, where it can be fixed without retraining a model or rewriting a prompt.

The forward bet

Most architecture posts close on “this survives model churn.” True, but cheap. Every layered design survives model churn. The bet worth making is more specific than that.

Skills-as-product-surface is a moat in two directions at once.

On the model layer: when a new model drops, skills get sharper without code changes. When a provider deprecates something, the deterministic floor routes around it. You compound investment in the layer you own, not the layer a provider rents you.

There is also a moat on the user layer. This is the part Tan’s framing implies but doesn’t say out loud. When skills are Markdown files in a repo, your TAM is people who edit Markdown files in repos. When skills are UI surfaces (Cheddy’s briefing configurator, a memory action card, an integration toggle), your TAM is everyone with an office job. The architecture is very similar underneath; only the authoring surface changes. That translation is the actual product work, and it’s the work that decides whether the underlying thesis reaches a hundred developers or a hundred million knowledge workers.

Tan’s framing is the cleanest articulation of the bottom half of that bet I’ve seen. Cheddy is my wager on the top half: that the same architecture, with the skill surface translated into product, holds for people who will never see a terminal. If you’re building anything chat-shaped for non-developers, the framework holds. Most of your effort will go into turning file-based ergonomics into UI affordances that do the same job. The architecture underneath doesn’t need to move.