eCommerce · Conversational AI

Benchmarking an AI Shopping Assistant Across 5 Critical Capability Dimensions

A global eCommerce technology company had built an AI shopping assistant that guides customers from discovery through checkout. As usage scaled, quality gaps emerged — and the team needed a rigorous way to measure what “good” actually looked like before fine-tuning.

Engagement

The Brief

Client

A global eCommerce technology company

Nexus Platform

Evaluation pipelines, benchmark automation, model routing

aion Research

Custom evaluation-framework design, fine-tuning strategy, base-model assessment

Forward-Deployed Engineers

Embedded with client engineering and product teams throughout

5 capability dimensions benchmarked
Auto + Human evaluation pipelines built

The Challenge

Context

Five quality gaps at scale. The assistant handled everything from catalog search and inventory checks to cart management and order tracking, all through natural dialogue. As usage scaled, the gaps started to show.

Unnecessary clarifying questions

The assistant would sometimes ask unnecessary clarifying questions, introducing friction instead of guiding the customer forward.

Missed sales opportunities

It would miss opportunities to close a sale, failing to guide multi-turn conversations toward purchase completion.

Wrong tool calls

It would sometimes call the wrong backend tool or pass incorrect parameters, undermining the reliability of catalog search, inventory, cart, and order-tracking actions.

Scripted empathy

Empathy in problem-resolution scenarios felt scripted — responses behaved more like a decision tree than a helpful human.

Language quality as a baseline

As the company looked to expand into new markets, English-first language quality needed to hold up as a baseline before any multilingual rollout. The team knew they needed to fine-tune — but first, they needed a rigorous way to measure what “good” looked like.

The Approach

Approach

Five capability dimensions. aion's research team embedded alongside the client's engineering and product teams to build an end-to-end evaluation and data strategy across five capability dimensions.

Clarification discipline

Is the assistant asking the right questions at the right time, or introducing unnecessary friction?

Sales closure

How effectively does the assistant guide multi-turn conversations toward purchase completion?

Empathy and problem handling

When things go wrong, does the assistant respond like a helpful human or a decision tree?

Tool-calling reliability

Is the correct backend tool being selected with the right parameters on every invocation?

Language quality

Is the conversational English natural, correct, and consistent enough to serve as a foundation for future multilingual expansion?
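
Of the five dimensions, tool-calling reliability is the most directly checkable by machine. As a minimal sketch, assuming each test case pairs an expected tool call with the call the assistant actually made, an exact-match scorer might look like the following. The `ToolCall` type, the function names, and the tool names in the usage note are hypothetical and not taken from the client's system.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One backend tool invocation (hypothetical shape)."""
    name: str
    params: dict = field(default_factory=dict)


def score_tool_call(expected: ToolCall, actual: ToolCall) -> dict:
    """Check tool selection and parameters separately.

    Parameters only count as correct when the tool itself is correct.
    """
    tool_ok = expected.name == actual.name
    params_ok = tool_ok and expected.params == actual.params
    return {"tool_correct": tool_ok, "params_correct": params_ok}


def tool_call_accuracy(pairs: list) -> float:
    """Fraction of (expected, actual) pairs with the right tool and parameters."""
    if not pairs:
        return 0.0
    hits = sum(1 for exp, act in pairs if score_tool_call(exp, act)["params_correct"])
    return hits / len(pairs)
```

For example, a call to a hypothetical `check_inventory` tool with the wrong SKU would score `tool_correct` as true and `params_correct` as false, which separates tool-selection errors from parameter errors when reviewing failures.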

A repeatable benchmark framework

aion designed a structured benchmark spanning task taxonomy, automated scoring rubrics supplemented by human spot-checks, baselines, and acceptance thresholds — a repeatable framework the client could use for every subsequent model iteration.
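
To illustrate how baselines and acceptance thresholds could combine into a repeatable check, here is a minimal sketch that assumes each dimension is summarized as a single score between 0 and 1. The dimension keys, function names, and any numbers passed in are illustrative assumptions, not values from the engagement.

```python
# Hypothetical keys for the five capability dimensions.
DIMENSIONS = (
    "clarification_discipline",
    "sales_closure",
    "empathy",
    "tool_calling",
    "language_quality",
)


def evaluate_iteration(scores: dict, baseline: dict, thresholds: dict) -> dict:
    """Compare one model iteration against the baseline and the acceptance bar."""
    report = {}
    for dim in DIMENSIONS:
        score = scores[dim]
        report[dim] = {
            "score": score,
            "delta_vs_baseline": round(score - baseline[dim], 4),
            "passes": score >= thresholds[dim],
        }
    return report


def iteration_passes(report: dict) -> bool:
    """An iteration is accepted only if every dimension meets its threshold."""
    return all(entry["passes"] for entry in report.values())
```

Running the same function on every model iteration gives a like-for-like comparison. Requiring all dimensions to pass is one possible acceptance policy; a team could just as reasonably weight dimensions differently.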

The Outcome

Outcome

A repeatable measurement system. Within the first engagement phase, aion delivered end-to-end model evaluation, data strategy, and a fine-tuning approach, plus a clear roadmap to reach production-grade performance quickly.

Benchmark Framework

Automated and human evaluation pipelines giving the client a quantitative view of model performance across all five dimensions for the first time.

Comprehensive Data Strategy

A labeling and data-collection plan — what to capture, how to score it, and which examples to prioritize — to drive targeted fine-tuning against the dimensions that move the needle.
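
One way to decide which examples to prioritize is to rank failing benchmark examples by how far their dimension sits below its acceptance threshold. The sketch below assumes per-dimension scores and thresholds between 0 and 1 and a simple example record with `id`, `dimension`, and `passed` fields. All of these names are hypothetical, and this is not necessarily the prioritization rule used in the engagement.

```python
def dimension_gaps(scores: dict, thresholds: dict) -> dict:
    """Shortfall of each dimension against its threshold (0.0 if already met)."""
    return {dim: max(0.0, thresholds[dim] - scores[dim]) for dim in thresholds}


def prioritize_examples(examples: list, gaps: dict) -> list:
    """Return failing examples, largest dimension gap first.

    sorted() is stable, so examples within the same dimension keep
    their original order.
    """
    failing = [ex for ex in examples if not ex["passed"]]
    return sorted(failing, key=lambda ex: gaps.get(ex["dimension"], 0.0), reverse=True)
```

Examples at the top of the resulting list come from the dimensions furthest from their target, which makes them natural first candidates for labeling and fine-tuning data.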

Base Model Evaluation

Assessment of candidate models against the client's license, infrastructure, and size constraints, with recommendations on fine-tuning approach (SFT, DPO/ORPO, RL-style methods).

Prioritized Optimization Roadmap

The fastest path from current performance to production-grade quality across each capability dimension.

The takeaway

Within the first engagement phase, aion delivered end-to-end model evaluation, a comprehensive data strategy, a base-model and fine-tuning approach, and a prioritized roadmap to production-grade performance — a measurement system the team can run on every model iteration that follows.
