Core Idea

Calibration is the cross-manager process that ensures performance designations mean the same thing across teams. Without it, “Senior Engineer” in Team A drifts away from “Senior Engineer” in Team B, and the entire performance system loses credibility. Larson provides four rules that prevent the most common calibration failures.

Calibration System for Performance

A career ladder and review cycle alone cannot produce fair outcomes. Without calibration, each manager applies the ladder through their own lens, shaped by advocacy for their own reports, by their own persuasion skills, and by local team context. Calibration corrects this by creating a shared, structured comparison process.

What Calibration Is

  • A structured meeting where managers review performance assessments across teams
  • Purpose: ensure designations are applied consistently against the ladder, not against each other
  • Participants: managers at the same level (e.g., all engineering managers in a department)
  • Output: designation confirmations, adjustments, and documented rationale

Larson’s Four Rules

1. Shared quest, not a competition

  • All participants seek to apply the ladder consistently — not to advocate for their own reports
  • Failure mode prevented: advocacy bias — managers arguing to “win” promotions for their people rather than evaluating fairly
  • Practical signal: if managers feel defensive about their assessments, calibration has become adversarial

2. Read, don’t present

  • Written assessments are shared before the meeting; everyone reads in advance
  • The meeting is for discussion and comparison — not for managers to pitch their reports
  • Failure mode prevented: presentation bias — charismatic or senior managers get better outcomes simply through persuasion

3. Compare to the ladder, not to peers

  • The question is always: “Does this person’s impact match the level description?”
  • Never: “Is this person better or worse than that person?”
  • Failure mode prevented: relative ranking, which punishes engineers on strong teams and rewards those on weaker ones
  • Academic grounding: criterion-referenced appraisal consistently outperforms norm-referenced appraisal for fairness and development outcomes

4. Study the distribution

  • After calibration, review the population distribution of designations
  • Warning signs: 80%+ of the population at “Exceeds Expectations”; the share of engineers at senior levels far above industry benchmarks
  • Healthy distribution: roughly bell-curved around the middle designation
  • Failure mode prevented: grade inflation — all managers grade high to avoid difficult conversations, eroding the system’s meaning (a distribution-check sketch follows this list)
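
To make the distribution check concrete, here is a minimal sketch in Python. The five level names, the 80% threshold, and the `distribution_report` helper are illustrative assumptions, not values or tooling Larson prescribes:

```python
from collections import Counter

# Minimal sketch of the post-calibration distribution check.
# The level names and the 80% threshold are illustrative assumptions.
SCALE = ["Below", "Approaching", "Meets", "Exceeds", "Greatly Exceeds"]
TOP_LEVELS = {"Exceeds", "Greatly Exceeds"}

def distribution_report(designations, inflation_threshold=0.80):
    """Print the share of engineers at each level and flag the
    grade-inflation warning sign (too many at the top levels)."""
    counts = Counter(designations)
    total = len(designations)
    for level in SCALE:
        share = counts[level] / total
        print(f"{level:>16}: {counts[level]:3d} ({share:6.1%})")
    top_share = sum(counts[level] for level in TOP_LEVELS) / total
    if top_share >= inflation_threshold:
        print(f"Warning: {top_share:.0%} at top levels -- possible grade inflation")

# Example: a 50-person population skewed high
population = ["Exceeds"] * 38 + ["Meets"] * 9 + ["Greatly Exceeds"] * 3
distribution_report(population)
```

Running it on the example population prints the share at each level and fires the warning, since 82% sit at the top two levels.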

What a Calibration Meeting Looks Like

  1. Assessments distributed to all participants 2-3 days before the meeting
  2. Participants read and note questions or comparisons
  3. Meeting opens with borderline or contested cases — not straightforward ones
  4. Discussion anchors on specific ladder criteria: “What evidence do we have for this impact at this scope?”
  5. Facilitator (usually the senior manager) tracks emerging distribution in real time
  6. Decisions documented with rationale; manager communicates outcome to engineer (see the record sketch after this list)
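
One lightweight way to keep step 6 consistent across managers is a shared record shape. The sketch below is hypothetical; the `CalibrationDecision` class and its field names are assumptions for illustration, not a format from Larson or Orosz:

```python
from dataclasses import dataclass, field

# Illustrative shape for a documented calibration decision (step 6).
# Field names are assumptions for this sketch, not a prescribed format.
@dataclass
class CalibrationDecision:
    engineer: str                 # anonymized identifier, not a name
    proposed: str                 # designation entering the meeting
    final: str                    # designation after calibration
    ladder_criteria: list[str] = field(default_factory=list)  # criteria the evidence was matched against
    rationale: str = ""           # justification against the ladder, never against peers

decision = CalibrationDecision(
    engineer="eng-017",
    proposed="Exceeds",
    final="Meets",
    ladder_criteria=["scope: delivers team-level impact", "execution: works independently"],
    rationale="Evidence shows team-level scope; the ladder's 'Exceeds' requires cross-team impact.",
)
```

Anchoring the record on ladder criteria keeps the documented rationale criterion-referenced, in line with rule 3.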

Connection to Designation Momentum

Past ratings create institutional memory. An engineer calibrated as “Exceeds” builds momentum that makes future downward adjustments socially difficult — even when warranted. This is why applying the rules rigorously from the start matters more than correcting drift later.

Sources

  • Larson, Will (2019). An Elegant Puzzle: Systems of Engineering Management. Stripe Press. ISBN: 978-1-7322651-8-9.

    • Chapter 6.5 — primary source for the four rules of calibration and the calibration meeting structure
  • DeNisi, Angelo S. and Kevin R. Murphy (2017). “Performance Appraisal and Performance Management: 100 Years of Progress?” Journal of Applied Psychology, Vol. 102(3), pp. 421-433. DOI: 10.1037/apl0000085.

    • Comprehensive review of a century of appraisal research; documents leniency bias, halo effects, and distributional errors as persistent problems; supports criterion-referenced over norm-referenced approaches as more accurate and fair
  • Colquitt, Jason A. (2001). “On the Dimensionality of Organizational Justice: A Construct Validation of a Measure.” Journal of Applied Psychology, Vol. 86(3), pp. 386-400. DOI: 10.1037/0021-9010.86.3.386.

    • Foundational study on procedural fairness (N=776); demonstrates that consistent, bias-suppressed, representative procedures drive perceived fairness independent of outcome; theoretical underpinning for why calibration rules matter
  • Scullen, Steven E., Michael K. Mount, and Maynard Goff (2000). “Understanding the Latent Structure of Job Performance Ratings.” Journal of Applied Psychology, Vol. 85(6), pp. 956-970. DOI: 10.1037/0021-9010.85.6.956.

    • Analysis of variance in performance ratings; idiosyncratic rater effects account for 62% of variance — more than true performance — making cross-rater calibration essential for any fair system
  • Orosz, Gergely (2022). “Performance Reviews for Software Developers – How I Do Them In a (Hopefully) Fair Way.” The Pragmatic Engineer. Available: https://blog.pragmaticengineer.com/performance-reviews-for-software-engineers/

    • Practitioner account of calibration practice at scale in tech; covers the “read before meeting” norm and distribution monitoring; corroborates Larson’s rules from industry experience

Note

This content was drafted with assistance from AI tools for research, organization, and initial content generation. All final content has been reviewed, fact-checked, and edited by the author to ensure accuracy and alignment with the author’s intentions and perspective.