Building a Multi-Vendor Marketplace That Grew AOV by 18%

How a conversational shopping assistant on a Mercur multi-vendor marketplace lifted average order value — real-time product recommendations, vendor routing, and the A/B test that proved it.

A D2C brand running a multi-vendor marketplace wanted to increase average order value without the usual playbook of "add a minimum for free shipping" or "show a cross-sell widget in the cart." They'd tried both. Diminishing returns.

The hypothesis was that a conversational shopping assistant — something that could understand what the customer was looking for and recommend products across vendors — would surface items customers wouldn't find through browsing alone. After a 6-week A/B test, the treatment group showed an 18% lift in AOV. Here's the build.

The Stack

The marketplace ran on Mercur — an open-source multi-vendor layer on top of Medusa.js v2. Mercur handles vendor onboarding, commission splits, and per-vendor inventory. The storefront was Next.js with Tailwind.

The conversational assistant was a standalone service: FastAPI backend, OpenAI GPT-4o for generation, and a pgvector index over the product catalog for semantic retrieval.

Product Catalog Indexing

Every product in the marketplace was embedded and stored in a pgvector column alongside its metadata — title, vendor, category, price range, attributes, and average rating.

The indexing pipeline ran on a schedule and on Medusa webhook events. When a vendor added or updated a product, the webhook triggered re-embedding of that product. The full catalog reindex ran nightly to catch any missed updates and refresh stale embeddings.
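A minimal sketch of how the two paths (webhook-driven and nightly) can share one upsert function. Everything here is illustrative: the in-memory `Index`, the `Embedder` callable, and the content-hash check that skips unchanged products are assumptions, not the production pipeline. A real nightly job might re-embed everything regardless, for example after an embedding model change.

```python
import hashlib
from typing import Callable

Embedder = Callable[[str], list[float]]      # hypothetical: text -> vector
Index = dict[str, tuple[str, list[float]]]   # product_id -> (text_hash, vector)


def _digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upsert_product(index: Index, product_id: str, text: str, embed: Embedder) -> bool:
    """Re-embed one product if its text changed. Returns True if it was embedded."""
    digest = _digest(text)
    current = index.get(product_id)
    if current is not None and current[0] == digest:
        return False  # unchanged, so skip the embedding call
    index[product_id] = (digest, embed(text))
    return True


def nightly_reindex(index: Index, catalog: dict[str, str], embed: Embedder) -> int:
    """Sweep the whole catalog and fix anything the webhooks missed."""
    # Drop products that no longer exist in the catalog.
    for product_id in list(index):
        if product_id not in catalog:
            del index[product_id]
    # Re-embed new or changed products; the sum counts how many were refreshed.
    return sum(upsert_product(index, pid, text, embed) for pid, text in catalog.items())
```

The webhook handler would call `upsert_product` for the single product in the event payload, while the scheduler calls `nightly_reindex` with the full catalog.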

We embedded a composite text field: {title} | {category} | {description} | {attributes}. Just the title wasn't enough — "Classic Oxford" means nothing without "Men's Dress Shoe, Leather, Brown, Size 42." Just the description was too noisy — vendor descriptions are marketing copy, not structured data.
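For illustration, the composite field could be assembled roughly like this. The `Product` dataclass is a hypothetical stand-in for the real Medusa product model, and the attribute formatting (sorted `key: value` pairs) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    # Hypothetical shape; the real product model carries more fields.
    title: str
    category: str
    description: str
    attributes: dict[str, str] = field(default_factory=dict)


def composite_text(product: Product) -> str:
    """Build the '{title} | {category} | {description} | {attributes}' string."""
    # Sort attribute keys so the same product always yields the same text.
    attrs = ", ".join(f"{k}: {v}" for k, v in sorted(product.attributes.items()))
    parts = [product.title, product.category, product.description, attrs]
    # Collapse stray whitespace and drop empty parts to avoid dangling separators.
    cleaned = [" ".join(p.split()) for p in parts]
    return " | ".join(p for p in cleaned if p)
```

Keeping the field order and attribute order stable matters more than the exact separator: it means a re-embed only changes the vector when the product itself changed.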

The Conversation Flow

The assistant wasn't a general chatbot. It had a narrow job: help the customer find and buy products. The system prompt constrained it to:

Ask clarifying questions about what the customer wants
Search the catalog using semantic retrieval
Present 2-3 options with reasoning
Handle follow-ups ("something cheaper," "in blue," "from a different vendor")

Each turn generated a structured tool call to the search API. The search combined semantic similarity with hard filters — if the customer said "under $50," that was a price filter, not a vague preference. The LLM extracted structured filters from natural language, and the retrieval layer applied them as SQL WHERE clauses before the vector similarity ranking.

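@tool
def search_products(
    query: str,
    max_price: float | None = None,
    category: str | None = None,
    vendor_id: str | None = None,
    limit: int = 5,
) -> list[ProductResult]:
    """Search the catalog: hard filters first, then semantic ranking."""
    # Filters the LLM extracted from the conversation become SQL WHERE clauses.
    filters = build_filters(max_price, category, vendor_id)
    # Rank only the rows that pass the filters by vector similarity to `query`.
    return vector_search(query, filters=filters, limit=limit)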

Why It Worked

The assistant was good at two things that browsing UIs are bad at:

Cross-vendor discovery. On a multi-vendor marketplace, each vendor's products live in their own little silo. A customer browsing Vendor A's store doesn't see Vendor B's complementary product. The assistant searched across all vendors and could say "this bag from Vendor A pairs well with these shoes from Vendor C" — something the browse UI could never do organically.

Intent refinement. Customers often start with vague intent: "I need something for a dinner party." A filter-based UI forces them to translate that into categories, price ranges, and attributes. The assistant just asked: "What kind of dinner party? Casual or formal? How many guests? What's your budget?" Three questions in, it had enough context to make a targeted recommendation.

The A/B Test

We ran the test for 6 weeks with a 50/50 split. The control group saw the standard storefront. The treatment group got a floating chat widget in the bottom-right corner.
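A common way to get a stable 50/50 split is to hash the user ID, so the same customer sees the same variant on every visit. The sketch below assumes that approach; the function name, experiment key, and use of SHA-256 are illustrative and not a description of how this particular test was wired up.

```python
import hashlib


def assign_variant(
    user_id: str,
    experiment: str = "chat-assistant",  # hypothetical experiment key
    treatment_share: float = 0.5,
) -> str:
    """Deterministically map a user to 'treatment' or 'control'."""
    # Salting with the experiment name keeps buckets independent across tests.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"
```

The storefront would render the chat widget only when `assign_variant(user_id)` returns `"treatment"`.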

Key metrics after 6 weeks:

AOV: +18% in treatment group (statistically significant, p < 0.01)
Conversion rate: +4% (significant, but smaller effect)
Items per order: +0.6 average (the main driver of AOV lift)
Chat engagement: 31% of treatment group users opened the chat at least once
Chat-to-cart: 22% of chat sessions resulted in an add-to-cart within the same session

The AOV lift came almost entirely from the assistant surfacing cross-vendor bundles. Customers who engaged with the chat bought from an average of 1.8 vendors per order, compared to 1.2 for non-chat customers.
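For readers who want to reproduce this kind of comparison on their own order data, here is a small sketch of the arithmetic: relative AOV lift plus a Welch's t statistic. Welch's test is an assumption made for illustration; the exact test behind the p-value above isn't specified here. Inputs are plain lists of order values per group.

```python
from math import sqrt
from statistics import mean, variance


def aov_lift(control: list[float], treatment: list[float]) -> float:
    """Relative lift in average order value, e.g. 0.18 for +18%."""
    return mean(treatment) / mean(control) - 1.0


def welch_t(control: list[float], treatment: list[float]) -> tuple[float, float]:
    """Welch's t statistic and its approximate degrees of freedom."""
    n_c, n_t = len(control), len(treatment)
    # Squared standard error of each group mean (sample variance / n).
    se2_c = variance(control) / n_c
    se2_t = variance(treatment) / n_t
    t_stat = (mean(treatment) - mean(control)) / sqrt(se2_c + se2_t)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_c + se2_t) ** 2 / (se2_c**2 / (n_c - 1) + se2_t**2 / (n_t - 1))
    return t_stat, df
```

Turning the statistic into a p-value needs a t-distribution CDF; `scipy.stats.ttest_ind(treatment, control, equal_var=False)` does the whole calculation in one call if SciPy is available.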

What Didn't Work

Proactive suggestions. We initially had the assistant pop up with "Looking for something?" after 30 seconds of browsing. Engagement was high but satisfaction was low — people found it annoying. We switched to a passive widget that only spoke when opened. Engagement dropped but quality of interactions went up.

Long conversations. Sessions over 8 turns had declining recommendation quality. The context window filled with earlier preferences that were no longer relevant, and the assistant started contradicting itself. We added a "start fresh" button and limited conversation history to the last 6 turns with a summary of earlier intent.
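The history cap can be sketched as below. It assumes one message per turn and takes the summariser as a plain callable (in practice probably another LLM call); `Turn`, `build_context`, and the summary wording are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    role: str      # "user", "assistant", or "system"
    content: str


def build_context(
    history: list[Turn],
    summarize: Callable[[list[Turn]], str],  # hypothetical summariser
    max_turns: int = 6,
) -> list[Turn]:
    """Keep the last `max_turns` turns verbatim and compress everything older."""
    if len(history) <= max_turns:
        return list(history)
    older, recent = history[:-max_turns], history[-max_turns:]
    # One short summary replaces all earlier turns, so stale preferences
    # don't compete with what the customer said most recently.
    summary = Turn("system", f"Summary of earlier intent: {summarize(older)}")
    return [summary, *recent]
```

A "start fresh" button then only has to clear `history`; the next call to `build_context` starts from an empty list.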

Vendor fairness. Early versions inadvertently favoured vendors with better-written product descriptions, because richer text produced embeddings that scored higher on similarity. We corrected for this by running every vendor description through a standardisation step before embedding: extracting structured attributes and generating a canonical description regardless of the vendor's copywriting quality.
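As a rough illustration of the second half of that step, the canonical description can be a fixed template over the extracted attributes. The attribute schema and function name below are assumptions, and the extraction itself (getting from free-form vendor copy to structured attributes) is not shown.

```python
# Hypothetical attribute schema; a real catalog would define this per category.
CANONICAL_FIELDS = ("material", "color", "size")


def canonical_description(category: str, attributes: dict[str, str]) -> str:
    """Render the same attributes to the same text, whoever wrote the listing."""
    parts = [category.strip()]
    for key in CANONICAL_FIELDS:
        value = attributes.get(key, "").strip()
        if value:
            # Fixed order and lowercase values remove stylistic differences.
            parts.append(f"{key}: {value.lower()}")
    # Attributes outside the schema (marketing claims, slogans) are ignored.
    return "; ".join(parts)
```

Two vendors selling equivalent products end up with identical text, so any difference in similarity score comes from the product rather than the copywriting.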

The client kept the assistant running after the test. Last I heard, they're expanding it to handle returns and order tracking — which is a different problem entirely, but the infrastructure was built to support it.
