Build Like You'll Get It Wrong
Field Guide
The best engineering teams don't plan for success. They plan for failure and design recovery into every system. Resilience beats perfection in production, in careers, and in life.
Resilience Spectrum
From fragile to antifragile. Most teams operate at Robust.
Robust
Survives stress unchanged
Characteristics
- Redundancy and failover
- Monitoring and alerting
- Standardized processes
- Defense in depth
Examples
- Load balancers and backups
- Automated tests and deployments
- Access controls enforced
- Regular security audits
Key takeaways
- NASA doesn't write procedures for when things go right. Every procedure assumes something has already gone wrong.
- Growth-mindset individuals process errors differently at a neurological level. Their brains light up during mistakes.
- Your first rollback drill will teach you more about your system than six months of monitoring dashboards.
Mission Control, April 13, 1970. An oxygen tank explodes. Jack Swigert says those four words: “Houston, we’ve had a problem.” The room doesn’t panic. The astronauts don’t improvise genius solutions. They execute procedures written for disasters that hadn’t happened yet.
Summary: NASA survives catastrophe because it builds systems assuming failure is inevitable, not exceptional. Your engineering culture should work the same way. Most teams optimize for the happy path. Production optimizes you.
Apollo 13 didn’t survive because of genius. It survived because someone spent weeks writing checklists for when things break. Every procedure NASA writes starts with “something has already gone wrong.” You flip to the broken-stuff chapter and follow steps. Nothing new. No improvisation. Discipline.
Here’s the thing nobody tells you about growth: Carol Dweck’s research on mindset shows that people who improve fastest actually process errors differently. Neurologically. When someone with a growth mindset makes a mistake, their brain lights up in the regions associated with attention and error detection. They see the failure and their brain immediately starts analyzing what happened. Fixed mindset brains? They recognize the error but don’t engage the same recovery circuits.
The skill isn’t making fewer mistakes. The skill is getting comfortable with mistakes and having a system for what happens next.
Most teams build for the happy path. The feature works. The API responds. The deployment succeeds. That path is boring. Production is the “three things fail simultaneously at 3 AM on Saturday while you’re offline” path. That’s the one that matters.
Netflix figured this out years ago with Chaos Monkey. They didn’t wait for servers to fail. They killed servers on purpose, during business hours, to see what happened. The teams that survived Chaos Monkey weren’t the ones with the fanciest architectures. They were the ones that had already asked “what happens when this specific thing dies?” and written the answer down.
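The spirit of that exercise can be sketched in a few lines. This is not Netflix's actual implementation; the instance names, the health flag, and the capacity threshold are all illustrative. The point is that the experiment and the question it answers are both explicit in code.

```python
import random

# Hypothetical Chaos Monkey-style experiment. Names and thresholds are
# illustrative; real tooling terminates actual cloud instances.

def kill_random_instance(instances):
    """Terminate one healthy instance at random, on purpose."""
    healthy = [i for i in instances if i["healthy"]]
    victim = random.choice(healthy)
    victim["healthy"] = False
    return victim["id"]

def can_serve_traffic(instances, min_healthy=2):
    """The question the experiment answers: does the service survive?"""
    return sum(i["healthy"] for i in instances) >= min_healthy

instances = [{"id": f"web-{n}", "healthy": True} for n in range(3)]
killed = kill_random_instance(instances)
print(f"killed {killed}; still serving: {can_serve_traffic(instances)}")
```

Running this during business hours, against real infrastructure, is what turned "what happens when this dies?" from a thought experiment into a written answer.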
That’s the mindset shift. You stop asking “will this work?” and start asking “what happens when it doesn’t?” One question builds for demos. The other builds for Tuesday at 3 AM.
There's a human parallel here. Think about how you decide things in your own life. Should you try that new restaurant? Gut feeling. Probabilistic. Might fail. Learn something. Should you cross the street with a truck coming? Hard rule. Zero tolerance. The decision mechanism matches the consequence.
Jazz musicians figured this out decades ago. Wrong note in a solo doesn’t stop the band. It becomes the start of something interesting. Miles Davis once said something like “don’t play what’s there, play what’s not there.” You can’t play what’s not there if you’re frozen by the wrong note. You move through it. The music keeps happening. That’s resilience.
Your first rollback drill teaches you more about your system than six months of staring at dashboards. Here’s why: when you actually execute a rollback under time pressure, you discover which parts of your documentation are useless. Which team member actually knows how the database failover works. Which monitoring alerts matter and which are noise. A drill is cheaper than the disaster that teaches the same lesson.
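A drill doesn't need elaborate tooling. Here is a minimal sketch of a drill harness, with hypothetical step names: walk the documented rollback steps, time the whole thing, and compare against a recovery budget. A step that throws an exception is exactly the runbook gap the drill exists to find.

```python
import time

# Hypothetical drill harness. Step names and the ten-minute budget are
# illustrative; real steps would call deploy tooling, not lambdas.

def run_drill(steps, budget_seconds=600):
    """Execute each documented rollback step and time the full drill."""
    start = time.monotonic()
    for name, action in steps:
        action()  # an exception here is a gap in the runbook
        print(f"done: {name}")
    elapsed = time.monotonic() - start
    return elapsed <= budget_seconds, elapsed

steps = [
    ("freeze deploys", lambda: None),
    ("switch traffic to previous version", lambda: None),
    ("verify health checks", lambda: None),
]
within_budget, elapsed = run_drill(steps)
print(f"within budget: {within_budget}")
```

The interesting output isn't the boolean. It's watching which step the on-call engineer hesitates on.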
Building for failure means:
- Every critical path has a documented rollback (test it quarterly)
- Your alerts wake you up for things that matter, not everything
- You version everything, so every change is reversible
- You build with “what if this specific thing fails” as your design question
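The first and third items above reduce to the same mechanism. A minimal sketch, with invented names: if every deploy records what it replaced, rollback becomes a lookup instead of an improvisation.

```python
# Minimal sketch of "version everything, so every change is reversible".
# The Deployer class and version strings are illustrative, not a real API.

class Deployer:
    def __init__(self, initial):
        self.history = [initial]  # every version ever live, in order

    @property
    def live(self):
        return self.history[-1]

    def deploy(self, version):
        self.history.append(version)

    def rollback(self):
        """Return to the previous live version. No guesswork at 3 AM."""
        if len(self.history) < 2:
            raise RuntimeError("nothing to roll back to")
        self.history.pop()
        return self.live

d = Deployer("v1.0")
d.deploy("v1.1")
d.deploy("v1.2")
print(d.rollback())  # → v1.1
```

Real systems hang this off a deploy pipeline and an artifact store, but the invariant is the same: the path backward is written down before the path forward ships.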
The teams that move fastest aren’t reckless. They’re the ones that recovered fastest from their last failure. They’ve built the muscle memory. They’ve written the procedures.
Einstein didn’t publish the theory of relativity fully formed. He published iterations. He got things wrong. He corrected them. He published again. The genius wasn’t in getting it right the first time. It was in the willingness to be wrong publicly and keep working. Newton had the same pattern. His first version of calculus had gaps. He kept going. The math got better because he treated every error as a refinement, not a verdict.
That’s the operating system. Build. Break. Learn what broke. Build again with that knowledge baked in. The teams that fear breaking things move slowly because they’re trying to avoid a thing that’s inevitable. The teams that plan for breaking things move fast because they’ve already decided what happens next.
Take Interest helps teams build this way. Structured decision-making that preserves reasoning, catches assumptions, and carries lessons forward. Learn more →
This connects to the next post: Security Is a Primitive, Not a Feature. Because security operates the same way. You don’t add it after you’ve built the house. You design for the break-in attempt from day one.
References:
- Dweck, C. S. (2006). Mindset: The New Psychology of Success. Random House. [Research on how growth mindset affects neural activation during errors]
- NASA. (1970). “Apollo 13 Mission Report.” NASA Technical Reports Server. [Mission procedures documentation that saved the mission]
- Dekker, S. (2006). The Field Guide to Understanding Human Error. Ashgate Publishing. [How teams build resilience through procedure design]