A rubric that TAs cannot apply consistently isn't a rubric — it's a wish. The unit packets define what the binary decisions are. This protocol defines how a team of TAs makes those decisions reliably across sections, sessions, and the term.
- ① **Calibration session** (start of every term). TAs grade the same 20–30 sample answers; outliers are coached before they ever grade a real student.
- ② **Anchor cards** (at every grading station). Pass / Not-yet / Edge-case examples for the most common items in the unit. The card is consulted when there's hesitation.
- ③ **Spot-check audit** (weekly). Coordinator regrades a sample of each TA's items; inconsistencies surface immediately, not at term's end when nothing can be done about them.
- ④ **Escalation queue** (for edge cases). TAs do not adjudicate ambiguous answers alone. Items go to a queue the coordinator reviews end-of-day.
Specifications grading lives or dies on this protocol. A rubric system without TA calibration produces the same variance as traditional point-allocation grading, just with a different vocabulary. The protocol is the part that makes "binary, atomic judgments" actually binary and actually atomic when applied by a team.
Held within the first two weeks of the term, before any TA grades a real practical. 90–120 minutes. Scheduled as paid TA time. Repeated for each unit if TA personnel change between units.
| Time | Activity | What happens |
|---|---|---|
| 0:00–0:15 | Walkthrough | Coordinator walks through the rubric structure, the anchor cards, and the downstream consequences of the bundle thresholds. Why the discipline matters. |
| 0:15–0:35 | Sample R1 (identification) | TAs independently grade ~10 sample R1 answers. Decisions written, not discussed. Coordinator collects. |
| 0:35–0:55 | Sample R1 review | Coordinator tallies decisions on whiteboard. Items where TAs disagreed are discussed first; items where everyone agreed are confirmed second. Outliers are not embarrassed — the goal is the standard, not the person. |
| 0:55–1:15 | Sample R2 + R3 | Same exercise for ID + Function and Histology samples. R2 and R3 carry the most subjective judgments and benefit most from calibration. |
| 1:15–1:30 | Anchor card walkthrough & Q&A | Coordinator points to the anchor card sections that resolve the disagreements that just surfaced. TAs leave with the cards in hand. |
The point of the calibration session is to surface disagreement and resolve it before it costs students. The standard can shift — coordinator may decide an edge case differently after hearing TA reasoning — but it shifts once, on the record, before grading begins. After the session, the standard is the standard.
If one TA's pattern is clearly different from the rest (consistently more lenient or stricter), the coordinator follows up one-on-one within the week. The conversation is about the rubric, not the person. Most outliers are misunderstanding one specific item or one specific rule; clarification fixes it. If a TA cannot calibrate to the cohort standard after coaching, they grade with the coordinator's check on every item until they can.
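The outlier check described above can be sketched as a majority-agreement tally: compute the cohort's majority decision per sample item, then flag any TA whose agreement rate falls below a cutoff. This is a minimal sketch, not part of the protocol; the decision labels, data shapes, and 0.8 threshold are assumptions:

```python
from collections import Counter

def flag_outliers(decisions, threshold=0.8):
    """Flag TAs whose calibration-session decisions agree with the
    cohort majority on fewer than `threshold` of their sample items.

    decisions: {ta_name: {item_id: "pass" | "not-yet"}}
    Returns a list of (ta_name, agreement_rate) for flagged TAs.
    """
    # Majority decision per item across all TAs (ties resolved arbitrarily;
    # a tied item is itself worth discussing in the review round).
    items = {item for per_ta in decisions.values() for item in per_ta}
    majority = {
        item: Counter(
            per_ta[item] for per_ta in decisions.values() if item in per_ta
        ).most_common(1)[0][0]
        for item in items
    }

    flagged = []
    for ta, per_ta in decisions.items():
        agree = sum(per_ta[i] == majority[i] for i in per_ta)
        rate = agree / len(per_ta)
        if rate < threshold:
            flagged.append((ta, rate))
    return flagged
```

For example, a TA who sides against the cohort majority on most items surfaces immediately, while ordinary scattered disagreement does not trip the flag.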
Anchor cards are the in-the-moment reference TAs consult when there is hesitation. They live at every grading station throughout the practical. They exist because memory is unreliable under load — no TA grading 30 students in 90 minutes will remember every synonym rule.
Each unit packet (cardiovascular, nervous system, musculoskeletal) includes anchor card pages designed to be torn out, laminated, and placed at grading stations. A complete anchor card set covers, for each rubric type:
Anchor cards are versioned. When the escalation queue resolves a new edge case (Pillar ④), the coordinator decides whether to add a new card entry, amend an existing example, or leave the cards unchanged with the decision recorded only in the log.
This is how the system improves. Every term should produce a small number of new anchor card entries; the absence of new entries means nobody is encountering edge cases — or, more likely, nobody is escalating them. Either way, it's a signal worth investigating.
Anchor cards work best laminated, hole-punched, and on a single ring at the grading station. TAs flip through them quickly. Loose-leaf cards get lost; spiral-bound cards don't lay flat. Whatever format your program adopts, prioritize one question: can the TA find the right card in 5 seconds? That's the design constraint.
Calibration at the start of the term is necessary but not sufficient. Standards drift. Sympathy accumulates. Fatigue erodes. The audit is the mechanism that catches drift before it becomes a term-wide problem.
Each week, the coordinator regrades approximately 5–10% of each TA's items from that week's lab sessions; selection is stratified rather than purely random. The regrade is then compared against the TA's original decisions:
| Pattern | What it means and how to respond |
|---|---|
| All decisions match coordinator's regrade | Calibration is holding. Brief acknowledgment to the TA; no change. |
| One or two disagreements, scattered | Normal variance. Note the items; revisit at the next coaching touch-point. |
| Pattern of leniency on one rubric type | TA may need a refresher on that rubric's discipline. Schedule a 10-minute one-on-one within the week. |
| Pattern of strictness on one rubric type | Same response. Strictness is no more virtuous than leniency — both are deviations from the standard. |
| Pattern of under-escalating | TA is adjudicating ambiguity instead of escalating. Reinforce: when in doubt, circle and escalate. |
| Pattern of over-escalating | TA is escalating items the anchor cards already resolve. Walk through the relevant cards. |
Audit results are returned to TAs by the start of the following week's lab session. Feedback is brief, specific, and rubric-focused (not personality-focused). Example:
"Thanks for the work this week. One pattern I want to flag: on the tricuspid valve item, you marked four students as pass when they said 'lets blood through.' That's a not-yet per the R2 rubric — the valve's function is preventing backflow, not permitting flow. Have a look at anchor card E2 and let me know if you have questions before next session."
It is not a performance review. It is not a basis for TA discipline (unless persistent and unaddressed after coaching). It is the same thing the calibration session is — a mechanism for keeping the rubric the rubric, applied by a team that gets tired.
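The weekly 5–10% sample described above can be drawn with a small stratified sampler. A minimal sketch, assuming stratification by rubric type within each TA's items (the exact strata are not specified in the protocol), with a fixed seed for a reproducible audit trail:

```python
import random

def draw_audit_sample(items, fraction=0.075, seed=None):
    """Draw a weekly spot-check sample, stratified so every
    (TA, rubric type) bucket contributes at least one item.

    items: list of dicts with at least "ta" and "rubric" keys.
    fraction: share of each bucket to regrade (5-10% per the protocol).
    """
    rng = random.Random(seed)
    # Group items into (TA, rubric type) buckets.
    strata = {}
    for it in items:
        strata.setdefault((it["ta"], it["rubric"]), []).append(it)

    sample = []
    for bucket in strata.values():
        # At least one item per bucket, even when fraction rounds to zero.
        k = max(1, round(len(bucket) * fraction))
        sample.extend(rng.sample(bucket, k))
    return sample
```

Stratifying this way guards against a random draw that happens to skip the very rubric type where a TA is drifting.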
Some answers are genuinely ambiguous and will not appear on any anchor card the first time they show up. The escalation queue ensures these are decided once, by the coordinator, with consistency across the cohort — not 12 different ways by 12 different TAs.
A simple spreadsheet (or notebook) maintained by the coordinator. Each row contains: date, unit, rubric type, the student answer (or summary), the anchor-card item it relates to (if any), the coordinator's decision, and the rationale (one sentence). The log serves three purposes: each edge case is decided once and applied cohort-wide, the rationale is on record if a student appeals, and resolved cases feed the end-of-term anchor-card update.
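A log row of this shape maps directly onto a small record type. A sketch, assuming a CSV file stands in for the "simple spreadsheet"; the field names mirror the columns listed above:

```python
import csv
from dataclasses import astuple, dataclass
from datetime import date

@dataclass
class EscalationEntry:
    """One row of the escalation log."""
    day: date
    unit: str            # e.g. "cardiovascular"
    rubric_type: str     # e.g. "R2"
    answer: str          # student answer, or a short summary
    anchor_card: str     # related anchor-card item, "" if none
    decision: str        # "pass" or "not-yet"
    rationale: str       # one sentence

def append_entry(path, entry):
    """Append one resolved edge case to the CSV log."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(astuple(entry))
```

Keeping the rationale to a single mandatory field is the design point: a row without a rationale cannot be promoted to an anchor card later.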
| Belongs in queue | Does NOT belong (handle in the moment) |
|---|---|
| Genuinely novel ambiguous answer not anticipated by any anchor card | Answer that matches an anchor card example — just apply the card |
| Spelling case where the rule is silent | Spelling case the rule covers explicitly |
| Function statement that's partially correct in a way no anchor case addresses | Function statement that obviously misses the rubric requirement |
| Disputed dissection technique observation (e.g. unusual approach that worked) | Standard technique observation covered by the 4-point or 5-point checklist |
| Anything where two TAs at adjacent stations disagree | Anything where the rubric is clear and the TA's hesitation is just speed-related |
A healthy escalation rate is roughly 2–5% of items in any practical. Below 1% suggests TAs are adjudicating things they shouldn't. Above 8% suggests the anchor cards need to resolve more cases (or the rubric itself has a gap that needs addressing). The coordinator should track this rate per TA and per unit.
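The rate bands above can be encoded as a small health check. A sketch; the "borderline" label for rates between the healthy band and the alarm thresholds is an assumption, not part of the protocol:

```python
def escalation_health(escalated, graded):
    """Classify an escalation rate against the protocol's bands:
    roughly 2-5% healthy, below 1% and above 8% are warning signs.
    Returns (rate, assessment).
    """
    rate = escalated / graded
    if rate < 0.01:
        return rate, "under-escalating: TAs may be adjudicating alone"
    if rate > 0.08:
        return rate, "over-escalating: anchor cards or rubric may have gaps"
    return rate, ("healthy" if 0.02 <= rate <= 0.05
                  else "borderline: keep watching")
```

Run per TA and per unit, as the protocol asks, this turns the 2–5% rule of thumb into a check that can sit next to the weekly audit.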
One page. The full operational rhythm of the calibration system, designed for the wall above the coordinator's desk.
| Week | Calibration activity | Notes |
|---|---|---|
| Pre-term (Week 0) | Sample answers prepared for first unit | 20–30 per rubric type, 40/40/20 mix |
| Week 1 | Calibration session ① with TAs | 90–120 minutes, paid time, before any grading |
| Weeks 2–N | Spot-check audit ③ each week | 5–10% of each TA's items, returned by next session |
| Mid-term | Mini-calibration on emerging escalation log items | 30–45 min, only if log volume warrants it |
| New unit start | Re-run calibration session ① for new unit's anchor cards | Shorter (60 min) if same TA team; full session if any new TAs |
| Post-term | Review escalation log; promote items to anchor cards for next term | 2–3 hours; produces the v0.x → v0.(x+1) update |
| Metric | Target / signal |
|---|---|
| Escalation rate per TA per unit | Target 2–5%; investigate outliers in either direction |
| Spot-check disagreement rate | Target <5% of regraded items; rising rate signals drift |
| Same-answer consistency across TAs | Sampled occasionally by giving two TAs the same item; should be near 100% |
| Student appeal rate | Should fall over time as the system stabilizes; rising appeals signal a rubric problem worth investigating |
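The four metrics and their targets can be checked mechanically at term's end. A sketch; the 0.95 cutoff standing in for "near 100%" consistency, and the use of a week-over-week appeal trend (positive meaning rising), are assumptions:

```python
def health_flags(metrics):
    """Check the term-health metrics against their targets.

    metrics keys: "escalation_rate", "spotcheck_disagreement",
    "same_answer_consistency", "appeal_trend" (all floats;
    appeal_trend > 0 means appeals are rising).
    Returns a list of warning strings; empty means healthy.
    """
    flags = []
    if not 0.02 <= metrics["escalation_rate"] <= 0.05:
        flags.append("escalation rate outside the 2-5% band")
    if metrics["spotcheck_disagreement"] >= 0.05:
        flags.append("spot-check disagreement at or above 5%: possible drift")
    if metrics["same_answer_consistency"] < 0.95:
        flags.append("same-answer consistency well short of 100%")
    if metrics["appeal_trend"] > 0:
        flags.append("student appeals rising: possible rubric problem")
    return flags
```

Any non-empty result is the cue the next paragraph describes: step back and redesign rather than push harder.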
Three early warning signs: appeals rising; TAs grading the same session very differently in spot-checks; the escalation log either empty (TAs adjudicating alone) or overflowing (rubric or anchor cards have gaps). Any one of these is a cue to step back and redesign rather than push harder on the existing system.