The infrastructure that lasts is rarely the most impressive. We have inherited systems built on exotic stacks that nobody could operate, and we have inherited boring ones — a couple of droplets, a CDN, a managed database, a deploy script — that had run for years with no drama. The difference is almost never the technology. It is whether the operational shape of the system is legible: whether a tired engineer at 2am can understand it, deploy it, and roll it back. That legibility is the whole game.
Predictable hosting over clever hosting
We default to DigitalOcean for a lot of work, and the reason is unglamorous: predictable pricing and a small surface area you can actually hold in your head. A droplet, App Platform, a managed Postgres, a managed Redis — that is enough to run most production systems for years, and the bill does not surprise anyone at the end of the month. The hyperscalers are powerful, but they are also a hundred services deep, and a small team running on AWS often spends as much effort understanding the platform as building on it. We ship on AWS, Cloudflare Workers, Fly.io, and bare metal when a workload genuinely calls for it, but the bias is toward infrastructure a small team can fully understand rather than infrastructure that needs a dedicated specialist just to keep the lights on. Complexity you do not need is a cost you pay every day, not just on launch day.
Cloudflare as the front line
A CDN doing real work is one of the highest-leverage pieces of a durable system, and "doing real work" is the operative phrase — a CDN that only proxies requests is leaving most of its value on the table. Cloudflare in front of the application handles TLS, caches static assets and cacheable responses at the edge, absorbs traffic spikes before they reach the origin, and gives you a layer of DDoS and bot protection for free. Tuning cache headers so the edge serves repeat traffic — rather than passing every request straight through to the origin — is often the single change that turns a system that buckles under a launch into one that absorbs it without anyone noticing. Origin compute should be spending its cycles on work the edge genuinely cannot do, not serving the same logo and the same homepage a million times. When a launch goes sideways, the first question we ask is what fraction of traffic the edge is actually answering.
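To make the cache-header point concrete, here is a minimal sketch of an origin-side policy that decides what the edge is allowed to keep. The extension list, the TTLs, and the function names are our own illustrative assumptions, not a recommended configuration. It also assumes static assets are served under fingerprinted filenames, and that the CDN has been configured to cache HTML at all, since many do not by default.

```python
from pathlib import PurePosixPath

# Hypothetical policy values: the extensions and TTLs are illustrative only.
STATIC_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".svg", ".woff2"}


def cache_control_for(path: str, personalized: bool = False) -> str:
    """Return a Cache-Control header value for a response at `path`."""
    if personalized:
        # Per-user responses must never be stored by a shared cache.
        return "private, no-store"
    if PurePosixPath(path).suffix.lower() in STATIC_EXTENSIONS:
        # Assumes fingerprinted filenames, so a long immutable TTL is safe.
        return "public, max-age=31536000, immutable"
    # Cacheable pages: browsers revalidate, the shared edge cache keeps a
    # copy for five minutes and may serve it stale briefly while refreshing.
    return "public, max-age=0, s-maxage=300, stale-while-revalidate=60"


def edge_hit_ratio(edge_hits: int, total_requests: int) -> float:
    """Fraction of requests answered by the edge instead of the origin."""
    if total_requests == 0:
        return 0.0
    return edge_hits / total_requests
```

The second function is the number behind the question at the end of the paragraph above: if it is low during a launch, the origin is doing work the edge could have done.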
CI/CD you actually trust
A deployment pipeline only earns its keep if the team trusts it enough to deploy on a Friday afternoon. That trust comes from a few non-negotiables: type-checked and linted builds on every branch so broken code never reaches main, preview environments per pull request so changes are seen and clicked through before they merge, and a zero-downtime production deploy with a rollback that has been rehearsed rather than theorized. The teams that ship safely and often are not braver than everyone else — they have a pipeline that makes the safe path the easy path, so doing the right thing takes less effort than doing the risky thing. When the safe path is also the slow, manual, error-prone one, people route around it, and that is where outages come from.
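A rehearsed rollback can be as simple as one function the pipeline calls every time, so the failure path gets exercised as often as the success path. The sketch below is a simplified illustration under our own assumptions: `activate` and `is_healthy` are hypothetical hooks standing in for whatever your platform uses to switch releases and probe the application.

```python
from typing import Callable


def deploy_with_rollback(
    new_release: str,
    previous_release: str,
    activate: Callable[[str], None],   # hypothetical hook: point traffic at a release
    is_healthy: Callable[[], bool],    # hypothetical hook: probe the live application
    checks: int = 3,
) -> str:
    """Activate `new_release`; revert to `previous_release` if any check fails.

    Returns the release that is live when the function exits.
    """
    activate(new_release)
    for _ in range(checks):
        if not is_healthy():
            # Same code path every time, so the rollback is rehearsed, not theorized.
            activate(previous_release)
            return previous_release
    return new_release
```

Running this against staging with a deliberately failing health check is a cheap way to confirm the rollback works before an incident demands it.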
Monitoring that points at the pain
Most monitoring shows you that CPU is at 40 percent, which tells you nothing useful when a customer cannot check out. The monitoring that matters is keyed to the business: is the checkout endpoint returning errors, is the job queue backing up, is the p95 latency on the path that makes money climbing toward the timeout. Resource graphs are necessary but they are not where you start an investigation — by the time CPU is the symptom, the real problem already has a name on the business side. Structured logs you can actually query, uptime checks on the flows that matter, error tracking that groups by cause, and dashboards built around outcomes rather than raw resources are what let you find the problem before the support tickets do. The goal is to be the first to know, not the last.
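As a rough illustration of monitoring keyed to outcomes, the sketch below checks the two checkout signals named above: error rate and p95 latency relative to the timeout. The thresholds, the 5000 ms timeout, and the function names are assumptions chosen for the example, and the percentile uses the simple nearest-rank method rather than whatever your metrics backend computes.

```python
from typing import List, Sequence


def p95(latencies_ms: Sequence[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def checkout_alerts(
    latencies_ms: Sequence[float],
    errors: int,
    requests: int,
    timeout_ms: float = 5000.0,     # assumed request timeout
    max_error_rate: float = 0.01,   # assumed threshold: 1% of requests
    latency_budget: float = 0.8,    # alert once p95 passes 80% of the timeout
) -> List[str]:
    """Return human-readable alerts for the checkout path."""
    alerts = []
    if requests and errors / requests > max_error_rate:
        alerts.append("checkout error rate above threshold")
    if latencies_ms and p95(latencies_ms) > latency_budget * timeout_ms:
        alerts.append("checkout p95 latency approaching timeout")
    return alerts
```

Neither check mentions CPU or memory. Both fire on something a customer would notice.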
- Stateless application tiers so you can add or replace instances without ceremony.
- Read-through caching and a CDN absorbing the repeat traffic before it hits origin.
- Queue-driven writes so a traffic spike defers work instead of dropping it.
- Indexed databases and a managed provider handling backups and failover.
- A rollback path rehearsed in staging, not discovered during the incident.
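Two items on this list, read-through caching and queue-driven writes, are small enough to sketch in-process. This is an illustration only: a production system would typically put the cache in something like Redis and the queue in a durable broker, and the class and function names here are hypothetical.

```python
import queue
import time
from typing import Any, Callable, Dict, Tuple


class ReadThroughCache:
    """On a miss, load from the backing store and remember the result."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                      # fresh hit: no origin work
        value = self._loader(key)                # miss or expired: go to the store
        self._entries[key] = (now + self._ttl, value)
        return value


def drain(q: "queue.Queue[Any]", persist: Callable[[Any], None], max_items: int) -> int:
    """Persist up to `max_items` queued writes; return how many were handled."""
    handled = 0
    while handled < max_items:
        try:
            record = q.get_nowait()
        except queue.Empty:
            break
        persist(record)
        q.task_done()
        handled += 1
    return handled
```

During a spike the request handler only calls `q.put(record)` and returns, while a worker calls `drain` at whatever pace the database can sustain. The backlog grows, but nothing is dropped.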
Scale is an architecture choice, not a server upgrade. The bigger box buys you a few months. The right shape buys you years.
When we take on scalable infrastructure work, it almost always starts with an audit and incremental hardening — CDN, caching, indexes, deployment hygiene — before anyone proposes a rewrite. Most systems do not need to be rebuilt. They need the load curve moved and the operations made legible, and then they tend to last.