Emcy generates MCP servers from OpenAPI specs. Those specs can have dozens—sometimes hundreds—of endpoints. Each endpoint becomes a tool. When a user sends a message, the agent needs to figure out which tools are relevant.
This is the tool selection problem, and at scale, the standard solution breaks.
The Problem
The typical approach: send every tool's name and description to an LLM, ask it to pick the relevant ones.
For 10 tools, this works. For 300, it doesn't.
Token math for 300 tools:
- Average tool definition: 50-character name + 200-character description ≈ 250 characters ≈ 63 tokens (at roughly 4 characters per token)
- 300 tools × 63 tokens = 18,900 tokens just for summaries
- Add the system prompt + conversation history = easily 25–30K tokens
That's per request. Before the model does any actual work.
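That arithmetic is easy to sanity-check. A minimal sketch, assuming the common heuristic of roughly 4 characters per token:

```python
import math

CHARS_PER_TOKEN = 4  # rough heuristic, not an exact tokenizer count

def summary_tokens(name_chars: int = 50, desc_chars: int = 200) -> int:
    """Approximate token count for one tool's name + description."""
    return math.ceil((name_chars + desc_chars) / CHARS_PER_TOKEN)

per_tool = summary_tokens()   # ~63 tokens per tool
total = 300 * per_tool        # ~18,900 tokens for 300 tool summaries
```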
The two-pass optimization—use a smaller model like Haiku for selection, then the main model for execution—reduces cost per token but not token volume. You're still sending 18,900 tokens of tool summaries every time. At Anthropic's rate limits, a burst of concurrent users will 429 you.
The Insight
Tool selection is not a language task. It's a retrieval task.
When a user says "get me the Q4 sales report," the system doesn't need to reason about which tool to use. It needs to find the tool whose description is semantically closest to the query. That's what embedding models are for.
Pre-compute an embedding for each tool at import time. At query time, embed the user's message and run a nearest-neighbor search. The top K results are your tool selection. No LLM call required.
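The selection step then reduces to a nearest-neighbor search over pre-computed vectors. A minimal sketch with toy 2-dimensional vectors (real embeddings are 1536-dimensional, and in production the search runs inside SQL):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; lower means more semantically similar."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def select_tools(query_embedding, tools, k=15):
    """tools: list of (name, embedding) pairs, embedded at import time.
    Returns the k nearest tool names. No LLM call involved."""
    ranked = sorted(tools, key=lambda t: cosine_distance(query_embedding, t[1]))
    return [name for name, _ in ranked[:k]]

tools = [("get_sales_report", [0.9, 0.1]),
         ("delete_user", [0.1, 0.9])]
select_tools([0.8, 0.2], tools, k=1)  # -> ["get_sales_report"]
```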
Why Azure SQL
We're already on Azure SQL for the rest of the platform. As of SQL Server 2025, it has native vector support:
- VECTOR(1536) data type — first-class column type, optimized binary storage
- vector_distance('cosine', a, b) — hardware-accelerated similarity search
- Standard indexing — filtered indexes on the embedding column
- Composable with existing queries — filter by agent, server, enabled status in the same query
No pgvector extension. No separate Pinecone instance. No additional infrastructure to provision, secure, or monitor. The database we already have does the work.
```sql
ALTER TABLE Tools ADD Embedding VECTOR(1536);

SELECT TOP 15 Name,
       vector_distance('cosine', @queryEmbedding, Embedding) AS Distance
FROM Tools
WHERE McpServerId = @serverId
  AND Embedding IS NOT NULL
  AND IsEnabled = 1
ORDER BY Distance ASC;
```
The Pipeline
Import time: When an MCP server syncs, we iterate through its tools and generate an embedding for each one. The input text is simple: "{tool_name}: {tool_description}". The embedding is stored directly in the Tools table as a VECTOR(1536) column.
Query time: The user's message is embedded with the same model (~20 tokens, ~50ms via Azure AI Foundry). A single SQL query using vector_distance('cosine', ...) returns the top 15 tools ordered by similarity (~10ms). Those 15 tools go to the main LLM. Total selection overhead: 20 tokens and 60ms.
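The two stages can be sketched end to end. Here `embed` stands in for the Azure AI Foundry embedding call and `run_query` for the parameterized SQL above; both names and signatures are assumptions for illustration, not the actual service code:

```python
def embedding_input(tool_name: str, tool_description: str) -> str:
    """The text embedded at import time: '{tool_name}: {tool_description}'."""
    return f"{tool_name}: {tool_description}"

def sync_tools(tools, embed):
    """Import time: embed each tool once and store the vector
    alongside the tool row."""
    return {t["name"]: embed(embedding_input(t["name"], t["description"]))
            for t in tools}

def select_tools_for_message(message, embed, run_query, k=15):
    """Query time: one embedding call (~20 tokens), one SQL
    vector_distance query. No LLM participates in selection."""
    query_vector = embed(message)
    return run_query(query_vector, top=k)
```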
Embedding Lifecycle
Embeddings aren't static. Tool descriptions change when an API evolves. We track this:
| Event | Embedding Behavior |
|---|---|
| Tool created | Generate new embedding |
| Name or description changed | Regenerate embedding |
| Tool disabled | Keep embedding — filtered in query via IsEnabled = 1 |
| Tool re-enabled | Existing embedding still valid |
| Tool deleted | Embedding deleted via cascade |
| MCP server re-synced | Only regenerate for tools with changed content |
The key decision: don't regenerate embeddings on enable/disable. The query already filters by IsEnabled, so toggling a tool on or off is instant. Regeneration only fires when the semantic content—name or description—actually changes.
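One way to implement that rule is to hash only the semantic content and compare on sync. A sketch; the helper names are assumptions, not Emcy's actual schema:

```python
import hashlib

def content_hash(name: str, description: str) -> str:
    """Hash only the fields that feed the embedding input text."""
    return hashlib.sha256(f"{name}: {description}".encode()).hexdigest()

def needs_reembedding(stored_hash: str, name: str, description: str) -> bool:
    """True only when name or description changed. Enable/disable
    never triggers this; the IsEnabled filter in SQL handles it."""
    return stored_hash != content_hash(name, description)
```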
Token Budget: Before and After
| Component | LLM Selection | Semantic Selection |
|---|---|---|
| Tool selection | ~20,000 tokens (Haiku) | ~20 tokens (embedding API) |
| Main LLM call | ~3,000 tokens (15 tools) | ~3,000 tokens (15 tools) |
| Total per request | ~23,000 tokens | ~3,020 tokens |
7.6× reduction in token usage per request.
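The ratio follows directly from the table's rows:

```python
llm_selection_total = 20_000 + 3_000   # selection pass + main LLM call
semantic_total = 20 + 3_000            # embedding call + main LLM call
reduction = llm_selection_total / semantic_total  # ~7.6x
```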
Cost: Before and After
| Component | LLM Selection | Semantic Selection |
|---|---|---|
| Selection cost per request | ~$0.005 (Haiku) | ~$0.00002 (ada-002) |
| Main LLM per request | ~$0.01 | ~$0.01 |
| Per 1,000 requests | ~$15 | ~$10.02 |
33% cost reduction. Plus: no rate limit pressure on the selection step, because embedding API rate limits are orders of magnitude more permissive than chat completion limits.
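The cost figures check out the same way:

```python
llm_cost = 1000 * (0.005 + 0.01)         # $15.00 per 1,000 requests
semantic_cost = 1000 * (0.00002 + 0.01)  # $10.02 per 1,000 requests
savings = 1 - semantic_cost / llm_cost   # ~33%
```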
Latency
LLM-based selection adds 800–1,200ms of latency before the main call even starts. Semantic selection adds 60ms. The user perceives a meaningfully faster response.
What We Considered and Rejected
Dedicated vector database (Pinecone, Qdrant, Weaviate): Adds infrastructure. Our tool count per agent (tens to low thousands) doesn't justify a separate service. Azure SQL handles this scale trivially.
pgvector on PostgreSQL: Would require migrating off Azure SQL. The rest of our data model—agents, MCP servers, OAuth configurations—is deeply integrated with EF Core and Azure SQL. Not worth the migration cost.
Client-side filtering: Let the LLM client (Claude, Cursor) handle tool selection. This only works when the client has a small enough tool list. At 300 tools, no client handles this well—context windows fill up, and you lose control over selection quality.
Hybrid search (BM25 + vector with RRF): Combining full-text keyword ranking with vector similarity using Reciprocal Rank Fusion. We designed this but haven't needed it—pure cosine similarity handles our workloads. We'll revisit if exact tool name matches start slipping past semantic search alone.
The Stack
Three components. No new infrastructure beyond what the platform already uses. The embedding model is deployed in Azure AI Foundry (same instance we use for the main LLM). The vectors live in Azure SQL (same database as everything else). The selector is a service class in the API.
This system is running in production at Emcy. If you're building MCP infrastructure and hitting the same scaling walls, we'd like to help.
