
Treat codebase as a context window

How architecture and tooling that work for humans work just as well for machines.


Language models have a context window, a fixed amount of information they can hold and reason about at once. When the conversation outgrows it, the system compresses: older messages get summarized, detail gets lost, and the model starts making decisions based on an approximation of what came before rather than the thing itself. The output degrades. Not catastrophically, but enough that you notice the reasoning getting looser, the suggestions getting more generic, the references coming out wrong.

A codebase works the same way. Every engineer on a project has a context window: the amount of the system they can hold in their head and reason about coherently. When the codebase outgrows it, the same thing happens. People start making decisions based on approximations. They copy a pattern from the nearest file without knowing if it's the right one. They introduce a second way of doing something because they didn't know the first one existed. The output degrades. Slowly at first. Then it becomes the default.

## Compression is understanding

The difference is what compression does. In a language model, compression is lossy. It's damage control, a necessary evil. In a codebase, compression is the opposite. It's the work of understanding. When you take five ad-hoc solutions to the same problem and replace them with one deliberate abstraction, you haven't lost information. You've distilled it. The abstraction is the understanding you didn't have when you wrote the first version.

But compression is not the same thing as abstraction. This is where most teams go wrong. They see duplication (two handlers that look similar, three services with overlapping logic) and they reach for an abstraction immediately. Before they've seen enough instances to know what's actually the same and what just looks the same right now. The result is a premature generalization built on a guess. It adds indirection without adding understanding.

Now instead of two concrete things you can read, you have one abstract thing you have to trace through. When the third case arrives and doesn't quite fit, you're bending the abstraction to accommodate it instead of questioning whether it was right in the first place. Real compression can only happen after you've lived with the mess long enough to see the actual pattern. You earn the right to abstract by building the ugly version first.

Domain knowledge starts with nouns: the things your system deals with. Agent, Lead, Property. The interactions between those nouns create verbs: an agent contacts a lead, a lead views a property. That's where every model begins, and it's clean enough to fit on a whiteboard. Then you build through it. You discover that "contacts" isn't one thing. It's an initial outreach, a follow-up three days later, a re-engagement after silence, a birthday text six months out. The verb outgrows its sentence. It picks up rules, timing, branching logic. At some point it stops being something an agent does and becomes something the system has. The verb becomes a noun: a Campaign. Now Campaign has its own verbs: it pauses, resumes, branches on response. You didn't plan for this entity. It emerged from the pressure of the domain pushing against your original model. That's what "learning the shape of the thing" actually looks like.
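
A sketch of that turn, in TypeScript with invented names (a real Campaign carries far more rules than this), just to show the verb becoming a noun with verbs of its own:

```ts
// Hypothetical sketch: "contacts" outgrew its sentence and became an entity.
// Step kinds and timings are illustrative, not an actual product model.
type Step =
  | { kind: 'initial_outreach' }
  | { kind: 'follow_up'; afterDays: number }
  | { kind: 're_engagement'; afterSilenceDays: number }

class Campaign {
  private status: 'running' | 'paused' = 'running'

  constructor(
    public readonly leadId: string,
    private steps: Step[],
  ) {}

  // the verbs now belong to the noun
  pause(): void { this.status = 'paused' }
  resume(): void { this.status = 'running' }

  // branching on response is the campaign's decision, not the agent's
  branchOnResponse(replied: boolean): void {
    if (!replied) this.steps.push({ kind: 're_engagement', afterSilenceDays: 14 })
  }
}
```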

## Three people in a wardrobe

We built Bonzo, a real estate communication platform, with three developers for the first three years of operation. Three people can't afford ambiguity. There's no room for tribal knowledge when three people fit in a wardrobe. Every pattern that lives in someone's head instead of in the code is a liability, because that someone is also the person handling deploys, debugging production, and building the next feature. The constraint forced a discipline: if a problem gets solved, it gets categorized. The solution becomes the solution. You don't revisit the decision every time a new instance of the problem appears. You revisit it at retros, the single most valuable ceremony in the entire engineering process, and possibly the only one worth keeping from the yes-we-do-agile-we-do-dailies school of project management.

This only works if you accept something uncomfortable: you don't understand the problem you're solving until you've built through it. The first version is always messy. That's not failure. It's how the domain reveals itself. You learn the shape of the thing by colliding with it.

And not just in code. Developers see how the system is dissected. They know what the building blocks are, where the seams fall, which nouns carry weight and which ones are duct tape. That makes them domain experts in their own right. They should act like it. A developer who understands the decomposition of the product is not a translator for someone else's vision. They're a contributor to the business itself. The nouns you start with are a guess. The nouns you end with are the understanding. You're not really building the product. You're discovering what your model of the problem actually looks like when it has to be precise enough to execute.

## What compression looks like in practice

The discipline is in what happens next. Most teams leave the scaffolding up. The prototype becomes the product. The quick solution becomes the pattern, not because anyone decided it should be, but because nobody went back to compress it. We couldn't afford that. With three people, every piece of uncompressed complexity is weight you carry on every future decision.

So we compressed aggressively. We built strong typing on the backend, generated into TypeScript types and query composables: typed API calls with Zod validation that any developer can use without thinking about serialization, validation, or the shape of the response. We eliminated an entire class of bugs not by testing for them but by making them structurally impossible.

We used Tailwind as a design language and built a component library backed by Storybook, then enforced it with lint rules. No raw class attributes in pages, everything goes through components. As components grew more refined, they became bigger partials that minimized UI drift across the product. The design system wasn't documentation. It was the code itself.
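
A minimal sketch of what one of those rules can look like, assuming a Vue SFC codebase on vue-eslint-parser; the rule name and the /pages/ path convention are illustrative, not the actual rule:

```ts
import type { Rule } from 'eslint'

// Illustrative custom rule: flag static class attributes in page templates.
const noRawClassInPages: Rule.RuleModule = {
  meta: {
    type: 'problem',
    messages: {
      noRawClass: 'No raw class attributes in pages; use a design-system component.',
    },
  },
  create(context) {
    // components are where Tailwind classes live; pages only compose them
    if (!context.filename.includes('/pages/')) return {}

    // defineTemplateBodyVisitor is provided by vue-eslint-parser
    return (context.sourceCode.parserServices as any).defineTemplateBodyVisitor({
      VAttribute(node: any) {
        if (!node.directive && node.key.name === 'class') {
          context.report({ node, messageId: 'noRawClass' })
        }
      },
    })
  },
}

export default noRawClassInPages
```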

We always started from the smallest possible unit. The design process began at the atomic level: what is a component, what is a controller, what is a request. On the backend, that meant defining the contract first: a Data class is the single source of truth for what goes in and what comes out.

```php
final class CreateExampleData extends BaseData
{
    public function __construct(
        #[Max(255)]
        public readonly string $name,
        #[Min(10)]
        public readonly ?string $description = null,
        public readonly ?EmailAddress $contactEmail = null,
    ) {}
}
```

That's it. A plain class that carries types and validation rules. From this, the system generates everything the frontend needs: TypeScript types, Zod validation schemas, and typed API clients:

```ts
export const createExampleSchema = z.object({
  name: z.string().max(255),
  description: z.string().min(10).nullable().optional(),
  contact_email: z.string().email().nullable().optional(),
})

export const exampleClient = {
  create: (data: z.infer<typeof createExampleSchema>): Promise<ExampleDto> =>
    useApi('/api/examples', { method: 'POST', body: data }),
  // ...
} as const
```

The generated Zod schema feeds directly into an <AutoForm> component, not for production UIs, but for prototyping. Define a backend route, write the input Data class and the output DTO, and you get a working form in seconds. Enough to sit down with the person who wanted the feature and ask: is this even what you had in mind? That feedback loop used to require a developer's time. Now, with the right tooling, anyone on the team can spin up a prototype and validate their own idea. This democratizes thinking about the product, and that matters, because everyone sees it differently. A product is a three-dimensional problem space and every observer sits in a different place, noticing and caring about different things. The more people who can cheaply test their perspective, the better the product gets.
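
The mechanism behind a component like that is small. A sketch of the idea, assuming Zod v3 and invented names: walk the schema's shape and derive an input per field. A real <AutoForm> would also read the validation checks to pick widgets and wire error messages.

```ts
import { z } from 'zod'

type FieldSpec = { name: string; required: boolean; input: 'text' | 'textarea' }

// Illustrative: derive prototype form fields from any object schema.
function fieldsFromSchema(schema: z.ZodObject<z.ZodRawShape>): FieldSpec[] {
  return Object.entries(schema.shape).map(([name, field]) => {
    let base: z.ZodTypeAny = field
    let required = true
    // unwrap .nullable().optional() wrappers to reach the base type
    while (base instanceof z.ZodOptional || base instanceof z.ZodNullable) {
      required = false
      base = base.unwrap()
    }
    // placeholder heuristic: long-form strings get a textarea
    const input =
      base instanceof z.ZodString && (base.minLength ?? 0) >= 10 ? 'textarea' : 'text'
    return { name, required, input }
  })
}

// fieldsFromSchema(createExampleSchema) is enough to drive a rendered prototype
```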

The controller becomes a thin routing layer because the types do the enforcement. One decision at the atomic level cascades into every layer above it. That's what compression looks like in practice: a single class replacing hundreds of lines of glue code across the stack.

When we needed a new project in the monorepo, a CLI bootstrapped it according to the project rules: same structure, same infrastructure patterns, same technology choices. No decisions to make. The template embodies the decisions that were already made. The more seamless the standardization, the better.
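
A sketch of that bootstrap step, assuming a Node runtime and invented paths; the real CLI does more, but the shape is this:

```ts
#!/usr/bin/env node
// Hypothetical scaffolding CLI: copy the template, substitute the name.
// The template embodies the decisions; the only input left is what to call it.
import { cpSync, readFileSync, writeFileSync } from 'node:fs'
import { join } from 'node:path'

const name = process.argv[2]
if (!name) {
  console.error('usage: scaffold <project-name>')
  process.exit(1)
}

const target = join('apps', name)
cpSync('templates/service', target, { recursive: true }) // same structure, every time

const pkgPath = join(target, 'package.json')
const pkg = JSON.parse(readFileSync(pkgPath, 'utf8'))
pkg.name = name
writeFileSync(pkgPath, JSON.stringify(pkg, null, 2))
```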

When we discovered that developers were dropping database columns that fed our CDC pipeline, we didn't write a wiki page about it. We built a lint rule in a shared package that prevents it from happening. The shared package became the contract: what does a controller look like? How do we call third parties? How do we handle rate limiting? These are problems every project runs into. A key-value pair of problem-to-solution, enforced by tooling, is what keeps six applications behaving like one.
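
That rule doesn't need to be clever to work. A sketch of the idea as a CI check, with an invented column list and deliberately naive matching:

```ts
// Hypothetical CI check: fail the build when a migration drops a column
// the CDC pipeline depends on. Paths and column names are assumptions.
import { readFileSync, readdirSync } from 'node:fs'
import { join } from 'node:path'

const PROTECTED = ['leads.phone_number', 'events.occurred_at'] // columns CDC consumes

const dir = 'database/migrations'
const violations: string[] = []

for (const file of readdirSync(dir)) {
  const src = readFileSync(join(dir, file), 'utf8')
  for (const entry of PROTECTED) {
    const [table, column] = entry.split('.')
    // naive textual match; a real rule would parse the migration AST
    if (src.includes(table) && src.includes(`dropColumn('${column}')`)) {
      violations.push(`${file}: drops ${entry}, which feeds the CDC pipeline`)
    }
  }
}

if (violations.length) {
  violations.forEach((v) => console.error(v))
  process.exit(1) // the build fails; the wiki page never existed
}
```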

## The same tools serve both

This is where it gets interesting. The same compression that makes a codebase navigable for a new engineer makes it navigable for a language model. Clean domain separation means the AI only needs to parse context from one folder. We have an Actions module that describes general data operations you can perform on an object in the CRM. The module itself only cares that the apply logic works, that exceptions are handled properly, that the interface is defined, that persistence is covered. It's a reusable domain concept with clear boundaries. A human reading it knows exactly what it does and where it ends. A transformer reading it knows exactly the same thing, for exactly the same reason: the compression did the work of making the intent legible.
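
A sketch of that boundary, with hypothetical names; the point is that the module's entire contract fits in a screenful:

```ts
// Illustrative shape of the Actions module: one interface, clear edges.
interface CrmObject { id: string }

interface Action<T extends CrmObject, P> {
  /** Apply the operation; implementations own their error handling. */
  apply(target: T, payload: P): Promise<void>
}

interface Lead extends CrmObject { tags: string[] }

class AddTag implements Action<Lead, { tag: string }> {
  async apply(lead: Lead, payload: { tag: string }): Promise<void> {
    if (!lead.tags.includes(payload.tag)) lead.tags.push(payload.tag)
    // persistence sits behind the module boundary, not here
  }
}
```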

The tooling story compounds this. Domain-specific CLIs that help developers create projects, manage changes, and find relevant context don't just improve human DX. They become tools that AI can use directly. A precise, well-scoped CLI is a better interface for a language model than a thousand lines of documentation or a beefy MCP server that you will never fully use. The investment in developer experience is now, simultaneously, an investment in AI capability. The same tools serve both. This wasn't the plan when we built them. It's a consequence of the architecture being right.

But "the same tools" understates it. The real unlock is reproducibility. When your test suite can spin up a browser, take a screenshot, and diff it against a baseline, that's a tool a developer uses to catch visual regressions. It's also a tool an agent uses to verify its own output. When your design system is enforced by lint rules and backed by visual regression tests, an agent generating a new component gets the same feedback loop a human does: write it, run it, see if it passes. The agent doesn't need taste. It needs a test that encodes taste.

It goes further than linting and type checks. Responsive breakpoint validation: an agent changes a layout, a headless browser checks it at 320px, 768px, 1024px, 1440px, and reports what broke. Accessibility audits that run in CI, not as a suggestion but as a gate. Screenshot comparisons that catch UI drift before it reaches review. Each of these works the same regardless of who triggered it: a developer running npm test or an agent executing the same command through a tool interface.
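
The breakpoint gate is the same trick looped; the route and the no-overflow invariant are illustrative:

```ts
import { test, expect } from '@playwright/test'

for (const width of [320, 768, 1024, 1440]) {
  test(`dashboard has no horizontal overflow at ${width}px`, async ({ page }) => {
    await page.setViewportSize({ width, height: 900 })
    await page.goto('/dashboard') // assumed route
    const overflows = await page.evaluate(
      () => document.documentElement.scrollWidth > document.documentElement.clientWidth,
    )
    expect(overflows).toBe(false)
  })
}
```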

The pattern is always the same: the team agrees on how a problem gets solved, encodes that agreement into tooling that enforces it, and both humans and agents operate within the same constraints. Not conventions written in a wiki. Tooling that fails the build when you deviate. That's the only contract that scales, and it scales to agents for free, because agents respect a failing test the same way a developer does. These patterns (the reproducible feedback loops, the enforced conventions, the shared tooling contract) aren't specific to any one codebase. They're infrastructure. The kind of thing that should be a package, not a set of decisions every team reinvents from scratch.

## The narrow path

Contrary to popular intuition, the constraints didn't slow us down. They accelerated us. When every solved problem has exactly one solution and every new project starts from the same template, there's nothing to deliberate. You spend your time on the actual problem instead of on the decisions surrounding it. The team moved faster and with more confidence precisely because the path was narrow. With this workflow we could iterate on features in hours that would have taken days in a codebase where every developer was free to reinvent the wheel.

What worked for three people didn't automatically work for twenty-five. Twenty-five developers, twenty-five tastes. The things that got through code review when we lacked automated enforcement (the inconsistencies, the second and third ways of solving the same problem) taught us that conventions explained are conventions forgotten. Conventions enforced by tooling are the only kind that survive contact with a growing team. The freedom to choose your own approach to a solved problem is a freedom you can't afford when maintaining a product long-term. Once a problem is solved, it's solved. Use the same tool for the same problem. Every time.

## The volume is not a spike

Garry Tan, YC's CEO, tweeted that he writes 37,000 lines of code a day. Do the math on that: it's 77 lines per minute, over a line per second, for eight hours straight. You cannot review that much code in a day. You cannot even read it. The cost of producing a line of code is approaching zero, but the ability to verify what got produced hasn't scaled at all. And it's going to get worse. More models, faster models, more people pointing them at codebases with no guardrails. The volume is not a spike. It's a new baseline, one that needs to be controlled. Not by slowing progress down, but by putting proper guardrails in place.

The only way to absorb that kind of output is if the codebase has enough structure to constrain it. Patterns to extend, conventions to follow, types that enforce correctness before a human ever looks at the diff. A compressed codebase makes AI assistance precise. The model generates code that fits because there's only one way it can fit. A fragmented one turns every AI-generated PR into a review burden that no team can sustain.

The line between order and chaos is where you learn the most. Too much order and nothing grows. Too much chaos and nothing survives. That's where every codebase sits now. The order is what you've compressed so far: your types, your conventions, your tooling. The chaos is the volume flooding in. You don't get to pick one side. You keep up by adopting, as a codebase and as an organization, and you stay alive by making sure the guardrails are yours. Not best practices borrowed from a conference talk. Tooling that encodes what your team learned by building through the mess. That's the only kind that holds.
