A 0.91 confidence score told me the plate was right. It wasn't.
Summary
A vision model for license plate OCR was systematically overconfident: it misread "QJG659" as "OJG659" with a confidence of 0.91, because "Q" and "O" are visually ambiguous. The article addresses two issues. First, a model's confidence score is often not a true probability, especially on ambiguous inputs such as "O"/"Q" pairs. Second, even a corrected score does not dictate a decision. For the first, the author calibrates scores with temperature scaling or Platt scaling for global adjustments, or with a confusability matrix for per-glyph corrections, derived either from fonts or from error logs. Calibration is an ongoing process: because of data drift, scores must be logged and refit continuously. For the second, a calibrated score feeds a separate decision policy that can accept, reject, or abstain. Because different error types have asymmetric costs, that policy uses per-class thresholds and an explicit abstention band.
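As an illustration of the global adjustment mentioned above, here is a minimal sketch of temperature scaling in plain Python. It is not the original author's code: the toy validation logits, the two-class "O"/"Q" setup, and the grid-search bounds are all assumptions made for the example. The idea is to divide the logits by a single scalar T chosen to minimize negative log-likelihood on held-out data.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits after dividing by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    loss = 0.0
    for logits, y in zip(logit_rows, labels):
        loss -= math.log(softmax(logits, temperature)[y])
    return loss / len(labels)

def fit_temperature(logit_rows, labels, lo=0.05, hi=20.0, steps=400):
    """Grid search on a log scale for the T that minimizes validation NLL."""
    best_t, best_loss = 1.0, nll(logit_rows, labels, 1.0)
    for i in range(steps + 1):
        t = lo * (hi / lo) ** (i / steps)
        loss = nll(logit_rows, labels, t)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t

# Hypothetical validation set: class 0 stands in for "O", class 1 for "Q".
# The model is equally confident on every example but wrong on 2 of 6.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0],
              [0.0, 4.0], [4.0, 0.0], [0.0, 4.0]]
val_labels = [0, 0, 1, 1, 0, 0]

T = fit_temperature(val_logits, val_labels)
raw = max(softmax([4.0, 0.0]))          # confidence before calibration
calibrated = max(softmax([4.0, 0.0], T))  # confidence after calibration
```

On this toy set the fitted temperature comes out well above 1, which pulls the reported confidence down toward the observed accuracy of 4 in 6. A single global T cannot fix errors concentrated on specific glyph pairs, which is where a per-glyph correction comes in.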
Key takeaway
For MLOps engineers deploying vision models where confident misreads are costly: treat model confidence as a raw signal, not a true probability. Calibrate the model's scores continuously, using techniques such as temperature scaling or a confusability matrix. Separate "how sure" from "what to do" by designing a decision policy with explicit accept, reject, and abstain outcomes, and set per-class thresholds based on the asymmetric costs of being wrong, so that errors do not pass silently and expensively.
Key insights
Model confidence scores are not probabilities; calibrate them and separate "how sure" from "what to do" based on error costs.
Principles
- Model confidence is a raw signal, not a true probability.
- Calibration requires continuous maintenance against drift.
- Decision policies must account for asymmetric error costs.
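One way to act on the drift principle is to track a calibration metric over logged, human-verified reads and refit when it degrades. The sketch below uses expected calibration error (ECE) with equal-width bins; the choice of ECE, the bin count, and the `tolerance` value are assumptions for illustration, not something the original article specifies.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

def needs_refit(confidences, correct, tolerance=0.05):
    """Hypothetical trigger: refit the calibrator when ECE exceeds a tolerance."""
    return expected_calibration_error(confidences, correct) > tolerance

# Hypothetical logs: ten reads at 0.9 confidence each.
well_calibrated = [True] * 9 + [False]      # 90% correct
drifted = [True] * 6 + [False] * 4          # 60% correct
ece_ok = expected_calibration_error([0.9] * 10, well_calibrated)
ece_drifted = expected_calibration_error([0.9] * 10, drifted)
```

Running this check on a rolling window of recent reads gives a simple signal for when the calibration fitted on older data no longer matches production.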
Method
Calibrate model confidence using temperature/Platt scaling or a confusability matrix (font-derived or error-log learned). Define a decision policy with accept, reject, and abstain outcomes, setting per-class thresholds based on asymmetric error costs.
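The error-log variant of the confusability matrix can be sketched as follows. This is a minimal illustration under stated assumptions: the log format of `(predicted_char, true_char)` pairs, the smoothing prior, and the rule of capping each character's confidence at the glyph's observed precision are all hypothetical choices, not the article's exact method.

```python
from collections import Counter, defaultdict

def build_confusability(error_log):
    """Count (predicted, true) character pairs from human-reviewed reads."""
    counts = defaultdict(Counter)
    for predicted, true in error_log:
        counts[predicted][true] += 1
    return counts

def glyph_precision(counts, predicted, prior=1.0):
    """Smoothed estimate of P(true == predicted | model emitted `predicted`).

    Glyphs with no logged evidence return 1.0, so they impose no cap.
    """
    row = counts.get(predicted, Counter())
    total = sum(row.values())
    if total == 0:
        return 1.0
    return (row[predicted] + prior) / (total + 2 * prior)

def plate_confidence(chars, char_confidences, counts):
    """Product of per-character confidences, each capped by glyph precision."""
    conf = 1.0
    for ch, p in zip(chars, char_confidences):
        conf *= min(p, glyph_precision(counts, ch))
    return conf

# Hypothetical review log: "O" was really "Q" in 4 of 10 reviewed cases.
log = [("O", "O")] * 6 + [("O", "Q")] * 4 + [("J", "J")] * 10
matrix = build_confusability(log)

precision_o = glyph_precision(matrix, "O")
adjusted = plate_confidence("OJG659", [0.98] * 6, matrix)
```

With this toy log, a plate containing a predicted "O" ends up with a much lower plate-level score than the raw per-character confidences suggest, which is the behavior a per-glyph correction is meant to produce for confusable pairs.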
In practice
- Implement temperature or Platt scaling for global calibration.
- Define explicit abstain bands for human review.
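An abstain band can be derived from costs rather than picked by hand. The sketch below assumes a calibrated probability `p` that the read is correct and compares three expected costs: accepting costs `(1 - p) * false_accept`, rejecting costs `p * false_reject`, and abstaining costs a fixed `review` fee. The class names and cost values are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Costs:
    false_accept: float  # cost of acting on a wrong read
    false_reject: float  # cost of discarding a correct read
    review: float        # cost of sending the read to a human

def thresholds(costs):
    """Return (reject_below, accept_at) for a calibrated probability p."""
    accept_at = 1.0 - costs.review / costs.false_accept
    reject_below = costs.review / costs.false_reject
    return reject_below, accept_at

def decide(p, costs):
    """Pick the action with the lowest expected cost."""
    reject_below, accept_at = thresholds(costs)
    if reject_below >= accept_at:
        # Review is never the cheapest option, so there is no abstain band.
        accept_cost = (1.0 - p) * costs.false_accept
        reject_cost = p * costs.false_reject
        return "accept" if accept_cost <= reject_cost else "reject"
    if p >= accept_at:
        return "accept"
    if p < reject_below:
        return "reject"
    return "abstain"

# Hypothetical per-class costs: a wrong read is far more expensive
# when the plate contains a confusable glyph such as "O" or "Q".
POLICY = {
    "confusable": Costs(false_accept=50.0, false_reject=5.0, review=1.0),
    "default": Costs(false_accept=5.0, false_reject=5.0, review=1.0),
}

decision_confusable = decide(0.91, POLICY["confusable"])
decision_default = decide(0.91, POLICY["default"])
```

With these placeholder costs, the same 0.91 score is accepted for the default class but routed to human review for the confusable class, because the higher cost of a false accept raises that class's accept threshold to 0.98.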
Topics
- Model Calibration
- Confidence Scores
- Decision Policies
- Computer Vision
- License Plate Recognition
- Error Cost Analysis
Best for: AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AI Advances - Medium.