Small models + smart routing: the workflow pattern behind “mini/nano” AI (and how to use it)
Published 2026-03-18 • Tags: AI trends, operations, cost, governance, workflows
A quiet trend is becoming the loudest business lesson in AI:
most of your work is not “frontier reasoning” work.
It’s classification, rewriting, extraction, summarisation, policy checks, and glue.
Thesis: “mini/nano” models are a forcing function for a better architecture.
Build tiered routing: send 80–95% of tasks to small, fast models; escalate only when risk/ambiguity is high.
The wins are lower latency, lower cost, and better reliability, without sacrificing safety.
Fresh signals (why this matters this week)
- OpenAI is explicitly positioning smaller, faster models (e.g. “mini” and “nano”) for coding, tool use, and high‑volume workloads. That’s a hint: the “default” model in production will often be the smaller one. (source)
- Google Research continues to publish applied ML work focused on workflow integration (e.g. screening workflows, turning unstructured reports into usable data). The pattern: AI value comes from operationalising, not just modelling. (example)
The practical pattern: tiered routing
Tiered routing is simple:
use the cheapest model that can reliably meet your task’s quality and safety constraints.
If the job is uncertain, high-impact, or policy-sensitive, route it up.
Think in three tiers
- T0 — “Commodity” work: extract fields, rewrite emails, classify tickets, summarise calls, draft internal notes.
- T1 — “Judgement” work: multi-step decisions, conflicting inputs, new edge cases, tool orchestration.
- T2 — “Risk” work: external comms, financial impacts, legal/policy, changes to production systems.
Rule: T0 defaults to small/fast models.
T1 escalates based on confidence.
T2 is always gated (human approval, tight tool scopes, audit logs), regardless of model.
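As a minimal sketch, the rule above can be written down as a small policy table. The labels "small" and "large" are placeholders rather than real model IDs, and `TierPolicy`, `pick_model`, and `needs_approval` are hypothetical names chosen for illustration. Starting T2 on the small model is also an assumption; the point is that the gate applies whichever model runs.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    T0 = "commodity"
    T1 = "judgement"
    T2 = "risk"


@dataclass(frozen=True)
class TierPolicy:
    default_model: str              # placeholder label, not a real model ID
    escalate_on_low_confidence: bool
    requires_human_approval: bool


# One row per tier, mirroring the rule above.
TIER_POLICIES = {
    Tier.T0: TierPolicy("small", escalate_on_low_confidence=False, requires_human_approval=False),
    Tier.T1: TierPolicy("small", escalate_on_low_confidence=True, requires_human_approval=False),
    # Assumption: T2 also starts small; the approval gate applies regardless of model.
    Tier.T2: TierPolicy("small", escalate_on_low_confidence=True, requires_human_approval=True),
}


def pick_model(tier: Tier, confidence: str) -> str:
    """Return "small" or "large" given a low/med/high confidence label."""
    policy = TIER_POLICIES[tier]
    if policy.escalate_on_low_confidence and confidence == "low":
        return "large"
    return policy.default_model


def needs_approval(tier: Tier) -> bool:
    """T2 is gated no matter which model produced the output."""
    return TIER_POLICIES[tier].requires_human_approval
```

Keeping the policy as data rather than scattered `if` statements makes it easier to review and to change when a tier's rules shift.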
How routing decisions should actually be made (not vibes)
Don’t route based on “this feels hard”.
Route based on observables your workflow can measure.
Routing signals you can implement this week
- Confidence checks: ask the model for structured uncertainty (e.g. low/med/high) + a reason; escalate on low.
- Policy lane: if the payload contains customer data, payroll, credentials, or legal content → route to the “sensitive lane” with tighter tools.
- Disagreement: run two cheap passes (e.g. two small models, or the same model with two prompt variants) and escalate when outputs diverge meaningfully.
- Tool risk: if the task requires a “write” tool (send email, update CRM, merge code) → escalate + require approval.
- Blast radius: number of recipients / records touched / dollars impacted. More radius → higher tier.
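The signals above could be combined into a single routing function along these lines. This is a sketch under stated assumptions: the keyword list is a naive stand-in for a real sensitivity check, the tool names and the `blast_radius_limit` of 100 records are invented for illustration, and `Task`, `Route`, and `route` are hypothetical names.

```python
from dataclasses import dataclass, field

# Assumptions for illustration: naive keyword matching and made-up tool names.
SENSITIVE_KEYWORDS = ("payroll", "password", "api key", "contract")
WRITE_TOOLS = {"send_email", "update_crm", "merge_code"}


@dataclass
class Task:
    payload: str
    confidence: str = "high"     # model's self-reported low/med/high
    passes_agree: bool = True    # did two cheap passes produce matching output?
    tools: set = field(default_factory=set)
    records_touched: int = 0


@dataclass
class Route:
    tier: str
    lane: str
    needs_approval: bool
    reasons: list


def route(task: Task, blast_radius_limit: int = 100) -> Route:
    tier, lane, reasons = "T0", "default", []

    # Confidence and disagreement push the task to T1.
    if task.confidence == "low":
        tier = "T1"
        reasons.append("low confidence")
    if not task.passes_agree:
        tier = "T1"
        reasons.append("passes disagree")

    # Policy lane: sensitive content changes the lane, not the tier.
    if any(k in task.payload.lower() for k in SENSITIVE_KEYWORDS):
        lane = "sensitive"
        reasons.append("sensitive content")

    # Tool risk and blast radius push the task to T2, which is always gated.
    if task.tools & WRITE_TOOLS:
        tier = "T2"
        reasons.append("write tool requested")
    if task.records_touched > blast_radius_limit:
        tier = "T2"
        reasons.append("large blast radius")

    return Route(tier, lane, needs_approval=(tier == "T2"), reasons=reasons)
```

Returning the list of reasons alongside the tier gives you something to log, which is what makes escalation rates explainable later.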
Workflow example: “Inbound leads → next-step email” (SMB-friendly)
- T0: small model extracts fields from a form/email (company, need, urgency) + drafts a reply.
- Guardrail: run a policy check pass (also small): “does this include pricing promises, legal advice, or sensitive info?”
- T1 escalation: if the lead is ambiguous or asks for commitments, escalate to a stronger model to propose options.
- T2 gate: human approves before send. (Draft-first, always.)
The win: the expensive model is now exception handling, not the default.
Your AI spend becomes predictable, and your response times drop.
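Here is one way the lead workflow might look in code. It is only a sketch: the three model calls are passed in as plain callables, the `fake_*` functions are stand-ins so it runs without any model API, and the dictionary keys (`fields`, `reply`, `ambiguous`, `asks_for_commitment`) are assumptions about what your extraction step returns.

```python
from typing import Callable


def handle_lead(
    message: str,
    small_model: Callable[[str], dict],
    policy_check: Callable[[str], bool],
    strong_model: Callable[[str], str],
) -> dict:
    # T0: the small model extracts fields and drafts a reply.
    result = small_model(message)
    reply = result["reply"]

    # T1: escalate only when the lead is ambiguous or asks for commitments.
    escalated = result["ambiguous"] or result["asks_for_commitment"]
    if escalated:
        reply = strong_model(message)

    # Guardrail: a second small pass flags pricing promises, legal advice, sensitive info.
    policy_flag = policy_check(reply)

    # T2: nothing is sent here; the draft waits for a human.
    return {
        "fields": result["fields"],
        "draft": reply,
        "escalated": escalated,
        "policy_flag": policy_flag,
        "status": "pending_approval",
    }


# Stand-in callables so the sketch runs without any model API.
def fake_small(message: str) -> dict:
    return {
        "fields": {"company": "Acme", "need": "demo", "urgency": "low"},
        "reply": "Thanks for reaching out. Happy to set up a demo.",
        "ambiguous": "?" in message,
        "asks_for_commitment": "guarantee" in message.lower(),
    }


def fake_policy_check(reply: str) -> bool:
    return "guarantee" in reply.lower()


def fake_strong(message: str) -> str:
    return "Option A: a short call. Option B: a written proposal."


outcome = handle_lead("We'd like a demo next week.", fake_small, fake_policy_check, fake_strong)
print(outcome["status"], outcome["escalated"])
```

Note that the function never sends anything. The approval step lives outside it, which keeps "draft-first" a property of the design instead of a convention people have to remember.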
The missing piece: evaluation gates for routing
Routing is only safe if you can test it.
Create a tiny eval set (20–50 real examples) and track:
task success, policy violations, hallucinations, and time-to-complete.
Minimum viable “routing eval”
- 10 T0 cases: extraction + rewrite (should be fast and accurate).
- 10 T1 cases: ambiguous inputs (should escalate and produce options).
- 10 T2 cases: sensitive scenarios (should refuse/route to approval every time).
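A routing eval like this can start as a few lines of code. The sketch below scores only the routing decision (which tier each case landed in), not task success, hallucinations, or timing. The case format, the function names, and the 0.9 bar for T0/T1 are assumptions; the requirement that T2 cases are caught every time comes from the list above.

```python
from collections import Counter
from typing import Callable


def score_routing(cases: list, router: Callable[[str], str]) -> dict:
    """Return, per expected tier, the fraction of cases the router got right."""
    totals, correct = Counter(), Counter()
    for case in cases:
        expected = case["expected_tier"]
        totals[expected] += 1
        if router(case["input"]) == expected:
            correct[expected] += 1
    return {tier: correct[tier] / totals[tier] for tier in totals}


def gate_passes(scores: dict, min_other: float = 0.9) -> bool:
    # T2 must be routed correctly every time; a missing T2 score also fails.
    if scores.get("T2", 0.0) < 1.0:
        return False
    # min_other is an assumed bar for T0/T1; tune it to your workflow.
    return all(score >= min_other for tier, score in scores.items() if tier != "T2")
```

Each case is a dict such as `{"input": "...", "expected_tier": "T2"}`, and `router` is whatever function maps an input to a tier label. Run `gate_passes(score_routing(cases, router))` before shipping any change to prompts, models, or routing rules.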
A rollout checklist (1–2 weeks)
- Day 1: choose one workflow with real volume (support triage, lead follow-up, invoice matching).
- Days 2–3: define the three tiers + the routing signals (confidence, lane, tool risk).
- Days 4–5: implement draft-first tools + approval gate for T2.
- Week 2: add eval gates + dashboards (cost, latency, escalation rate, policy catches).
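For the Week 2 dashboards, a first version can be a summary over your task logs. This is a sketch that assumes each log record is a dict with `escalated`, `latency_ms`, `cost_usd`, and `policy_flag` keys; those field names are hypothetical and should be adapted to whatever your workflow actually records.

```python
from statistics import median


def summarise(records: list) -> dict:
    """Compute the four dashboard numbers from a list of task log records."""
    if not records:
        raise ValueError("no records to summarise")
    n = len(records)
    return {
        "tasks": n,
        "escalation_rate": sum(1 for r in records if r["escalated"]) / n,
        "median_latency_ms": median(r["latency_ms"] for r in records),
        "total_cost_usd": sum(r["cost_usd"] for r in records),
        "policy_catches": sum(1 for r in records if r["policy_flag"]),
    }
```

Median latency is used here instead of the mean so that a handful of slow escalated tasks does not hide how fast the default path is.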
Sources used for freshness via RSS:
OpenAI News RSS (e.g. “Introducing GPT-5.4 mini and nano”)
and Google Research RSS (e.g. “Introducing Groundsource…”).
Where Workflow ADL fits
Workflow ADL is built around the idea that AI is a workflow system:
queues, tools, approvals, and auditability.
Tiered routing is how you make that system economical.
If you want to adopt this pattern quickly, start with one workflow and measure escalation rate.
Your first goal is not “perfect AI” — it’s a predictable, safe default.