eCommerce · Conversational AI
Benchmarking an AI Shopping Assistant Across 5 Critical Capability Dimensions
A global eCommerce technology company had built an AI shopping assistant that guides customers from discovery through checkout. As usage scaled, quality gaps emerged, and the team needed a rigorous way to measure what “good” actually looked like before fine-tuning.
Engagement
The Brief
Client
A global eCommerce technology company
Nexus Platform
Evaluation pipelines, benchmark automation, model routing
aion Research
Custom evaluation-framework design, fine-tuning strategy, base-model assessment
Forward-Deployed Engineers
Embedded with client engineering and product teams throughout
The Challenge
Context
Five quality gaps at scale
The assistant handled everything from catalog search and inventory checks to cart management and order tracking, all through natural dialogue. As usage scaled, the gaps started to show.
Unnecessary clarifying questions
The assistant would sometimes ask unnecessary clarifying questions, introducing friction instead of guiding the customer forward.
Missed sales opportunities
It would miss opportunities to close a sale, failing to guide multi-turn conversations toward purchase completion.
Wrong tool calls
It would sometimes call the wrong backend tool or pass incorrect parameters, undermining the reliability of catalog search, inventory, cart, and order-tracking actions.
Scripted empathy
Empathy in problem-resolution scenarios felt scripted: responses behaved more like a decision tree than a helpful human.
Language quality as a baseline
As the company looked to expand into new markets, English-language quality needed to hold up as a baseline before any multilingual rollout. The team knew they needed to fine-tune, but first they needed a rigorous way to measure what “good” looked like.
The Approach
Approach
Five capability dimensions
aion's research team embedded alongside the client's engineering and product teams to build an end-to-end evaluation and data strategy across five capability dimensions.
Clarification discipline
Is the assistant asking the right questions at the right time, or introducing unnecessary friction?
Sales closure
How effectively does the assistant guide multi-turn conversations toward purchase completion?
Empathy and problem handling
When things go wrong, does the assistant respond like a helpful human or a decision tree?
Tool-calling reliability
Is the correct backend tool being selected with the right parameters on every invocation?
Language quality
Is the conversational English natural, correct, and consistent enough to serve as a foundation for future multilingual expansion?
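Of the five dimensions, tool-calling reliability is the most directly checkable by machine. The following is a minimal sketch of one way such a check could be scored, assuming each test case pairs an expected tool call with the call the assistant actually made. The `ToolCall` type, the tool names, and the parameter names are hypothetical and are not taken from the client's system.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    """A single backend tool invocation (hypothetical shape)."""
    name: str
    params: dict[str, Any]


def score_tool_call(expected: ToolCall, actual: ToolCall) -> dict[str, bool]:
    """Check tool selection first, then parameters only if the tool matched."""
    tool_ok = expected.name == actual.name
    params_ok = tool_ok and expected.params == actual.params
    return {"tool_ok": tool_ok, "params_ok": params_ok}


def tool_call_accuracy(pairs: list[tuple[ToolCall, ToolCall]]) -> float:
    """Fraction of invocations with the right tool AND the right parameters."""
    if not pairs:
        return 0.0
    correct = sum(1 for exp, act in pairs if score_tool_call(exp, act)["params_ok"])
    return correct / len(pairs)


# Illustrative cases only: one correct call, one wrong tool, one wrong parameter.
EXAMPLE_PAIRS = [
    (ToolCall("search_catalog", {"query": "running shoes"}),
     ToolCall("search_catalog", {"query": "running shoes"})),
    (ToolCall("check_inventory", {"sku": "A-100"}),
     ToolCall("search_catalog", {"query": "A-100"})),
    (ToolCall("add_to_cart", {"sku": "A-100", "quantity": 1}),
     ToolCall("add_to_cart", {"sku": "A-100", "quantity": 2})),
]

accuracy = tool_call_accuracy(EXAMPLE_PAIRS)
```

Exact parameter equality is a deliberately strict choice for this sketch. A real rubric might normalize values (case, whitespace, default arguments) before comparing.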
A repeatable benchmark framework
aion designed a structured benchmark spanning a task taxonomy, automated scoring rubrics supplemented by human spot-checks, baselines, and acceptance thresholds: a repeatable framework the client could use for every subsequent model iteration.
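To make the role of baselines and acceptance thresholds concrete, here is a minimal sketch of how a per-dimension gate could be applied to each new model iteration. The dimension keys, the function names, and every number below are illustrative assumptions, not figures from the engagement.

```python
# Hypothetical keys for the five capability dimensions.
DIMENSIONS = (
    "clarification_discipline",
    "sales_closure",
    "empathy_and_problem_handling",
    "tool_calling_reliability",
    "language_quality",
)


def evaluate_iteration(scores, baseline, thresholds):
    """Compare one model iteration against the baseline and acceptance thresholds."""
    report = {}
    for dim in DIMENSIONS:
        score = scores[dim]
        report[dim] = {
            "score": score,
            "delta_vs_baseline": round(score - baseline[dim], 4),
            "passes": score >= thresholds[dim],
        }
    return report


def release_ready(report):
    """An iteration is accepted only if every dimension clears its threshold."""
    return all(entry["passes"] for entry in report.values())


# Placeholder numbers, chosen only to exercise the logic.
baseline = dict(zip(DIMENSIONS, (0.70, 0.55, 0.60, 0.80, 0.85)))
thresholds = dict(zip(DIMENSIONS, (0.85, 0.70, 0.80, 0.95, 0.90)))
candidate = dict(zip(DIMENSIONS, (0.88, 0.72, 0.78, 0.96, 0.92)))

report = evaluate_iteration(candidate, baseline, thresholds)
failing = [dim for dim, entry in report.items() if not entry["passes"]]
```

Requiring every dimension to pass, rather than averaging across them, keeps a strong score on one dimension from masking a regression on another. That is one possible design choice; a weighted composite is another.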
The Outcome
Outcome
A repeatable measurement system
Within the first engagement phase, aion delivered end-to-end model evaluation, data strategy, and a fine-tuning approach, plus a clear roadmap to reach production-grade performance quickly.
Benchmark Framework
Automated and human evaluation pipelines giving the client a quantitative view of model performance across all five dimensions for the first time.
Comprehensive Data Strategy
A labeling and data-collection plan (what to capture, how to score it, and which examples to prioritize) to drive targeted fine-tuning on the dimensions that matter most.
Base Model Evaluation
Assessment of candidate models against the client's license, infrastructure, and size constraints, with recommendations on fine-tuning approach (SFT, DPO/ORPO, RL-style methods).
Prioritized Optimization Roadmap
The fastest path from current performance to production-grade quality across each capability dimension.
The takeaway
Within the first engagement phase, aion delivered end-to-end model evaluation, a comprehensive data strategy, a base-model and fine-tuning approach, and a prioritized roadmap to production-grade performance. The result is a measurement system the team can run on every model iteration that follows.
