AI Evaluation Rubric
Defining experience quality in large language models
The Vision: From Guesswork to Confidence
In early 2024, Acrobat was scaling its AI capabilities, but without a shared definition of quality, teams were flying blind. We needed more than testing—we needed a shared language to define what “good AI” meant for our users.
The Challenge
Evaluations were inconsistent. Even when scores looked high in the lab, users often hit unhelpful results and dead ends, which bred frustration and eroded trust.
The missing piece? A framework that goes beyond output accuracy to define what quality truly means for users.
My Role
I led the creation of the AI Evaluation Framework, a human-centered tool for assessing generative AI features.
Working closely with Research & Strategy, I transformed fragmented efforts into a repeatable, cross-functional system that informs release decisions and roadmap priorities.
Impact
The framework became the standard for Acrobat AI, enabling faster, more confident launches.
Unified how PMs, researchers, and designers assess AI
Accelerated feature go-to-market by surfacing quality gaps early
Influenced Acrobat’s strategic roadmap through data-backed prioritization
Became part of Acrobat’s standard design toolkit for GenAI
The Solution: Go After What Matters to People
I began with what people care about when interacting with AI: clarity, usefulness, and trust.
I interviewed stakeholders, reviewed real user feedback, and traced where expectations broke down. From there, I intentionally shaped the framework around human experience, not technical convenience. For example, while it was easy to report “accuracy,” I pushed us to evaluate relevance, coherence, and even how well the AI acknowledged its limits—because those are the moments that shape trust.
I built the framework around what mattered most to users:
Useful – Does it help them accomplish their goal?
Usable – Is it intuitive and easy to understand?
Responsible – Is it safe, transparent, and respectful?
There were limitations, of course: surfacing uncertainty in the UI, for example, wasn't technically feasible yet. But I designed the rubric to reflect user needs now and to scale with the system over time.
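As an illustration only, the three dimensions can be sketched as a simple scoring structure. The criterion wording, the 1–5 scale, and the helper names below are hypothetical, not the production rubric.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical criterion wording; the actual rubric's language is richer than this.
RUBRIC: dict[str, list[str]] = {
    "Useful": [
        "The response is relevant to the user's question.",
        "The response helps the user accomplish their goal.",
    ],
    "Usable": [
        "The response is clear and easy to understand.",
        "The response is coherent and appropriately concise.",
    ],
    "Responsible": [
        "The response is safe, transparent, and respectful.",
        "The AI acknowledges its limits instead of guessing.",
    ],
}

@dataclass
class DimensionScore:
    dimension: str
    ratings: dict[str, int]  # criterion text -> rater's judgment on a 1-5 scale

    @property
    def score(self) -> float:
        return mean(self.ratings.values())

def score_response(ratings_by_dimension: dict[str, dict[str, int]]) -> list[DimensionScore]:
    """Collapse a rater's per-criterion judgments into one score per dimension."""
    return [
        DimensionScore(dimension=dim, ratings=ratings)
        for dim, ratings in ratings_by_dimension.items()
    ]
```

In practice, a rater fills in one judgment per criterion for each sampled real-world prompt, and the dimension scores roll up into the report card described below.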
The Problem
Teams moved quickly to ship new AI features, but misalignment was growing beneath the momentum. Engineering pointed to lab-based tests, proudly citing “85% accuracy,” but those numbers came from canned questions and didn’t reflect real-world use. Product teams had no clear benchmarks. Designers and researchers worked in silos, each measuring success differently. What looked good on paper often failed in practice—users encountered unhelpful responses, grew frustrated, and lost trust. The real problem wasn’t accuracy—it was the absence of a shared, experience-driven standard for evaluating AI quality.
The Dimensions of Quality
Putting the Framework into Practice
Once established, the rubric became more than just a scorecard—it became a decision-making tool.
Teams used it to benchmark the quality of Acrobat’s AI features through real-world testing, end-to-end audits, and competitive analysis. It guided release readiness, surfaced experience issues early, and helped prioritize design bugs that truly mattered to users.
By aligning design, product, and engineering around shared quality goals, the framework didn’t just evaluate what was shipped—it shaped what got built next.
Example Report Card (partial)
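To show the shape of such a report card, per-response dimension scores can be averaged and gaps flagged as sketched below. The feature name, numbers, and readiness threshold are made up for illustration and do not reflect actual evaluation results.

```python
from statistics import mean

READINESS_THRESHOLD = 4.0  # illustrative bar on a 1-5 scale, not the real release criterion

def build_report_card(feature: str, scored_responses: list[dict[str, float]]) -> dict:
    """Average dimension scores across evaluated responses and flag dimensions below the bar."""
    dimensions = scored_responses[0].keys()
    averages = {
        dim: round(mean(resp[dim] for resp in scored_responses), 2)
        for dim in dimensions
    }
    gaps = [dim for dim, avg in averages.items() if avg < READINESS_THRESHOLD]
    return {"feature": feature, "scores": averages, "gaps": gaps, "release_ready": not gaps}

# Made-up numbers, purely to show the output shape.
card = build_report_card(
    "document_summarization",
    [
        {"Useful": 4.5, "Usable": 4.0, "Responsible": 3.5},
        {"Useful": 4.0, "Usable": 4.5, "Responsible": 3.0},
    ],
)
# card == {"feature": "document_summarization",
#          "scores": {"Useful": 4.25, "Usable": 4.25, "Responsible": 3.25},
#          "gaps": ["Responsible"], "release_ready": False}
```

A flagged dimension becomes a concrete, user-framed conversation in a go/no-go review rather than a debate over a single accuracy number.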
Impact
Used in 3 launches and 6 design reviews within 2 quarters
Accelerated go/no-go decisions with cross-functional clarity
Informed roadmap and guided redesign of PDF AI assistant