Writing | Law Zava

The AI Strategy Stack: What Boards Mistake for Moats

Tue, 30 Jun 2026 00:00:00 +0000

The strongest argument against this whole essay is short: foundation models keep getting better, so whatever gap your proprietary data closes this quarter, the next base model closes for free. If that were fully true, no data moat in AI would be worth funding. It is half true. The half it gets wrong is the half boards keep paying for.

Start with what the strategy deck stacks up as defensible. Three layers, usually. The model itself, rented, and your competitor can rent the same one. The scaffolding on top of it: prompt library, routing logic, eval harness, all shaped around one vendor’s behavior and all of it breaks the morning you change providers. If the moat disappears when the vendor changes, it was never a moat. It was a dependency. That clears two of the three layers off the slide. Swap the provider in your head; whatever still works the next morning is the only candidate worth the word.

What survives is the third layer, and it is the one boards cannot tell apart from its imitation: data your own operation produces by running. Here is the mechanism, and the exact place it breaks.

Take support automation. The model drafts a resolution; a human approves before it ships. Every rejection or rewrite captures a labeled triple: the input, the output the model produced, the output the human accepted. Not a log line. A graded example of where your model was wrong and what right looked like, on your tickets, in your domain.

Now the load-bearing step, the one most decks wave through. How does that triple make a cheaper model tier handle a class it used to escalate? Two mechanisms, two different bills.

Retrieval, the few-shot route: index the accepted exemplars and inject the nearest ones into the prompt at inference. Cheap to stand up, live the moment you index a correction, but it taxes every call in tokens and latency, and the lift is capped because you are renting the base model’s in-context learning.

Distillation, the fine-tune route: train the small model on the correction set. Latency stays flat, the behavior is baked in, per-call cost drops, but you pay a training-and-eval cycle up front and re-pay it on every base-model upgrade. Which tier absorbs which class is a cost decision, not a model-quality one: retrieval for the long tail, distillation for the high-volume classes once they stop drifting.

Either way, one number tells you which you have: escalation rate on a single request class, quarter over quarter. Falling and sustained while quality holds is the loop compounding. Flat is a warehouse with a dashboard bolted to it. A logging pipeline and a compounding loop look identical in the architecture diagram and behave nothing alike in the P&L.

I cannot hand you a rival’s P&L to prove the good case, and any deck that shows you a clean before-and-after percentage is selling the illustration as the evidence. The honest test is one you run on your own numbers: name the class, name the two quarters it improved, name why a competitor on the same vendor cannot reproduce it. The answer to the last one is never the model. It is the system around it that turns each rejection into an exemplar only you hold.

Then the part the optimistic version omits: this asset depreciates. When the next base model ships, it absorbs your easy classes for free, everyone’s, not only yours, and that compresses the set of failures where your corrections still move the number. Your edge is only ever the residual: corrections illegible outside your context, your product’s quirks, your contractual edge cases, your policy language. The vendor will productize the capture loop; they already sell feedback buttons and fine-tuning APIs. What they cannot aggregate is a residual that means nothing without your business wrapped around it. The loop compounds only while you generate domain-specific corrections faster than a better base model erases the generic ones.

From Model Demos to Profit Engines: The CTO Playbook for AI Unit Economics

Thu, 25 Jun 2026 00:00:00 +0000

Quick take

A beautiful demo is not a business model. It only proves the model can look useful before the business pays for edge cases. The bill arrives when the system hits real users, real load, and real failure conditions. At that point AI stops being a model-selection problem and becomes a routing problem , a fallback problem, and a repair problem. Good CTOs do not buy “smart.” They buy systems that stay cheap enough, predictable enough, and reliable enough to survive the week.

Unit economics start with routing

The wrong AI architecture sends every request to the most expensive path. That feels elegant until the invoice arrives. Mature systems route by value and by risk.

A practical routing model usually splits work into classes:

trivial tasks that should stay cheap and local
medium-value tasks that deserve a balanced model tier
high-stakes tasks that justify expensive reasoning and stronger checks

This is not model worship. It is cost discipline.

The hidden cost is rarely the model line item

Teams fixate on tokens because tokens are visible. The real bill sits around the model: retries, context assembly, human correction, support escalation, and the work of proving the output is acceptable.

If a system saves one minute for a customer and creates two minutes of cleanup, it is destroying margin.

A finance-aware CTO should be able to answer these questions without hand-waving:

what each class of request costs to serve
where the rework happens
what failure costs when the model is wrong
which parts of the workflow justify premium inference

The real decision is not model choice, it is failure cost

“Best model” is usually the wrong conversation. The useful conversation is about failure cost.

A cheaper model that fails gracefully can beat a more expensive model that fails silently. A local fallback that keeps the system alive during a rate-limit event can matter more than a small quality lift in the happy path.

The CTO playbook is simple: optimize the whole system, not the benchmark screenshot.

Measure margin at the workflow level

The right unit of measure is the workflow, not the model call.

Ask:

how much does this workflow cost end to end?
how often does it need human repair?
how long does it take to reach a trustworthy answer?
what is the revenue or labor value of the result?

That is where the business truth lives. A model that looks slightly less accurate in isolation may create better margin if it is cheaper, faster, and easier to trust.

A practical threshold

If the system does not improve margin, then it needs to improve risk or speed. If it improves neither, it is a demo that escaped the lab.

AI work that survives budget review answers one of four questions:

does it lower cost per task?
does it reduce human labor?
does it increase throughput?
does it unlock new revenue with acceptable risk?

If not, the demo should stay in the demo lane.

Key Takeaways

Route cheap work cheaply.
Model cost is only part of the bill.
Measure workflow margin, not call cost.
If it does not improve margin, risk, or speed , it does not belong in production.

The New Talent Stack: Product, Platform, and Applied AI Must Work as One System

Thu, 18 Jun 2026 00:00:00 +0000

Quick take

Most AI hiring plans are trying to fix an interface problem with resumes.

If product, platform, and applied AI are not built as one operating system, new headcount adds motion but not leverage. The constraint is usually not talent scarcity. It is system design.

Recruiting Alone Cannot Fix a Broken Stack

AI organizations often describe their issue as “we need stronger talent.” In many cases, they already have capable people. What they lack is a clear operating contract across teams.

The pattern is familiar:

product optimizes for release velocity
platform optimizes for reliability and control
applied AI optimizes for model behavior and evaluation quality

Each goal is rational. The breakdown happens at the handoffs.

When interfaces are unclear, every launch becomes a negotiation. When interfaces are explicit, the same teams produce compounding output.

The Three-Layer Talent Stack

A healthy stack has three interlocking layers with distinct responsibilities:

Product — owns user outcomes and business success metrics.
Platform — owns safe defaults, deployment paths, and observability.
Applied AI — owns workflow behavior, retrieval/prompting/routing choices, and evaluation quality.

These are not departments in competition. They are system components with different jobs.

If product outruns platform, quality debt accumulates. If platform outruns product, infrastructure becomes generic overhead. If applied AI outruns both, you get technically impressive demos that never operationalize.

Where Organizations Usually Break

Most failures are boundary failures, not individual failures.

Common symptoms:

no explicit owner for the model-to-product handoff
platform operating as a ticket queue instead of an enablement layer
applied AI measured by demo novelty instead of adoption in live workflows
product committing features that infra cannot support safely

A concise diagnosis: org debt is usually interface debt with better branding.

Design the Stack Intentionally

The fix is not “more syncs.” The fix is explicit decision rights.

product owns problem selection and business tradeoffs
platform owns reliability guardrails and release safety
applied AI owns workflow performance and evaluation integrity
leadership owns escalation rules when tradeoffs conflict

Once this is explicit, hiring quality improves. You stop searching for mythical generalists and start hiring operators who can perform inside a coherent system.

What to Evaluate Before Adding Headcount

Before opening new roles, run this short check:

Are cross-team handoffs documented and current?
Does each layer have clear success metrics it actually controls?
Are escalation paths clear when speed, reliability, and quality disagree?
Are teams rewarded for system outcomes rather than local optimization?

If those answers are weak, fix interfaces first. New hires will scale the current operating model, good or bad.

Key Takeaways

Strong AI organizations are designed as a system, not staffed as silos.
Product, platform, and applied AI need explicit interfaces and decision rights.
Boundary clarity is a bigger lever than raw headcount.
Hiring works best after the operating contract is clear.

The Executive Case for Local-First AI Infrastructure

Tue, 16 Jun 2026 00:00:00 +0000

Quick take

Local-first AI is not anti-cloud. It is an operating decision about where work runs, where risk sits, and where margin leaks.

If placement is left to default settings, you inherit default latency, default privacy exposure, and default unit economics. Executives should treat compute placement the way they treat pricing and vendor strategy : explicit, reviewed, and tied to outcomes.

Stop Treating Placement as an Implementation Detail

Most teams still frame local-first as a tooling preference. That misses the real issue.

Placement determines four things that show up directly in business performance:

Latency control — fewer network hops and less variance in response time.
Privacy control — sensitive data can stay inside your perimeter .
Cost control — high-frequency calls stop accumulating per-request tax .
Failure control — fallback paths are closer to the workload and easier to reason about.

That is not ideology. That is operational control.

Where Local-First Should Be the Default

Local-first wins when work is frequent, bounded, and expensive to keep outsourcing call-by-call.

Typical candidates:

routing and classification
extraction and normalization
internal workflows handling regulated data
high-volume background tasks
retrieval-heavy systems where context assembly dominates spend

The pattern is practical: keep frontier work in the cloud, run repeatable workload locally , and route between them intentionally.

A useful line for leadership teams: placement discipline is margin discipline.

Where Cloud Still Wins

The right architecture is usually hybrid, not absolutist.

Cloud remains the better default when you need:

frontier reasoning or specialized capabilities you do not host
burst capacity you cannot justify building for
minimal operational overhead for low-volume workloads
fast access to capabilities you do not need to own long term

The mistake is not using cloud APIs. The mistake is using them by reflex after workload shape has changed.

An Incremental Adoption Path

Do not start with a full migration plan. Start with workload triage.

Identify repeated tasks where per-request cost is accumulating.
Move low-risk routing and transformation paths first.
Keep cloud as escalation while you validate reliability and observability.
Expand local placement only after economics and failure behavior are proven.

If you do this well, the architecture gets less dramatic over time. That is a success condition.

Executive Decision Rubric

Before moving a workload local, ask:

Is frequency high enough that per-request cost now matters?
Does the workload touch data that should stay inside our boundary?
Can we operate fallback and observability well enough to trust it?

If two of three are yes, you likely have a local-first candidate.

Key Takeaways

Local-first is a control strategy, not a cloud rejection strategy.
Compute placement directly affects latency, privacy posture, and margins.
Hybrid architecture is the practical default: local for repeated bounded work, cloud for frontier escalation.
Move incrementally and prove economics before scaling hardware commitments.

Decision Latency as a P&L Variable: The Leadership Metric Nobody Owns

Wed, 10 Jun 2026 00:00:00 +0000

Quick take

Slow decisions look like caution. In practice, they are hidden expense.

Decision latency belongs on the P&L. Every day a real decision sits unresolved, the business pays in delay, rework, and attention.

Why Decision Latency Matters

A team can look productive and still be dragging the business down if every meaningful decision takes too long.

Decision latency shows up as:

stalled launches
expired opportunities
duplicated work
growing frustration in the teams closest to the customer

When leaders do not measure this, they blame execution when the real problem is delay. The work may be moving. The organization is not.

What Decision Latency Looks Like in Practice

You can usually find it by asking a few questions:

How long does a high-signal issue sit before someone decides?
How many people need to weigh in before the first answer exists?
How often do decisions get reopened because no one owned the original call?
How much work is blocked waiting for alignment that never arrives?

Those are not soft questions. They are economic questions.

If a release, hiring decision , vendor decision , or architecture decision sits for weeks, the business is paying rent on uncertainty.

A useful line: ambiguous ownership is the most expensive architecture in your company.

Make It Visible

If you want leaders to care, make the metric visible.

Track:

time from issue raised to decision made
time from decision made to action taken
number of escalations per decision class
number of decisions reopened after approval

Once those numbers are in the open, patterns become hard to deny. You can see which teams move fast, which questions keep getting rerouted, and where the organization is burning time on decisions that should have been routine.

How to Reduce It

Decision latency drops when teams do four things well:

Define who owns each decision class .
Set decision boundaries before the crisis.
Reduce the number of people required for routine calls.
Make escalation fast when the decision is truly material.

This is not about making every decision unilateral. It is about making routine decisions quick and risky decisions explicit.

If the call is small, the system should move. If the call is material, the system should know exactly who has to weigh in.

Key Takeaways

Decision latency is a real cost driver.
Measure the time from issue to decision and from decision to action.
Ownership clarity reduces hidden opex.
The best organizations make routine decisions quickly and unusual decisions deliberately.

Designing the AI Leadership Bench: Roles, Interfaces, and Failure Boundaries

Wed, 10 Jun 2026 00:00:00 +0000

Quick take

AI leadership does not fail because titles are missing. It fails because interfaces are missing.

A real leadership bench is the decision system connecting product, platform, reliability, and governance. If those seams are unclear, incidents turn into organizational confusion before they become technical recovery.

A Bench Is an Interface Map

Many companies think “strong bench” means “we hired senior people.” That is necessary, but not sufficient.

A working bench answers four questions without debate:

who owns product tradeoffs
who owns platform reliability
who owns model governance and risk boundaries
who owns escalation when those priorities collide

If the answers depend on who is online that day, the bench is not operational.

Core Roles and Decision Rights

The exact titles vary. The interfaces should not.

Product owner — accountable for business outcome and adoption targets.

Platform owner — accountable for safe defaults, observability , and deployment reliability.

Applied AI owner — accountable for workflow behavior, routing, and evaluation quality .

Governance owner — accountable for explicit, reviewable risk boundaries.

The goal is not bureaucracy. The goal is unambiguous ownership when tradeoffs are real.

Failure Boundaries Beat Hero Culture

Healthy leadership systems plan for predictable stress cases instead of hoping for heroic response.

Define boundary behavior for events like:

model quality degradation
vendor policy or terms changes
quiet workflow failure that evades basic monitoring
loss of a key operator

If those handoffs are documented and rehearsed, incidents stay technical. If not, incidents become political.

One reliable warning sign: one person is expected to explain the full system from memory. That is not a bench. That is a single point of organizational failure.

How to Build the Bench in Practice

Make interfaces concrete and testable:

document what each owner can decide without escalation
define escalation thresholds for speed vs reliability vs governance conflicts
map core metrics to the leader who can actually move them
rehearse incident handoffs before live incidents force improvisation

This is operational hygiene, not ceremony.

A line worth keeping: great leaders design boundaries before they design org charts.

Key Takeaways

AI leadership strength comes from interfaces, not senior titles alone.
Product, platform, applied AI, and governance need explicit owners and decision rights.
Failure boundaries should be defined before incidents, not during them.
If one person holds the whole system context, the bench is underbuilt.

The Operating Cadence: Turning AI Leadership Interfaces Into Predictable Output

Wed, 10 Jun 2026 00:00:00 +0000

Quick take

A bench with clear interfaces is a necessary foundation. It is not a compounding system. Without rhythm, documented ownership drifts back into informal updates, and informal updates beat formal ones right up until they don’t.

Cadence is the mechanism that keeps interfaces load-bearing.

Interfaces Without Cadence Degrade

When a team documents who owns what, the clarity is real — for a few weeks. Then the pace picks up, the weekly sync gets skipped once, and the product owner starts resolving platform questions directly because it is faster. The interface is still on paper. It is no longer operational.

This is the failure mode that connects a well-designed bench to a year-two org that is back to improvising. Nobody dismantled the system. They just stopped running it.

Formal coordination loses to informal coordination every time informal coordination has lower friction. The only fix is making the formal cadence the path of least resistance — by keeping it short, metric-anchored, and non-negotiable.

The Three Cadences That Compound

Three rhythms cover the full operating surface of a scaling AI program.

Weekly operating cadence — 30 minutes, same metrics every cycle. Latency, error rate, eval scores , blocked work. The point is not status; it is signal. Any metric outside its threshold triggers an owner, not a discussion. If nothing is outside threshold, the meeting ends early.

Monthly outcome review — 90 minutes, owners present against targets set the previous month. What moved, what did not, what is at risk next month. This is where product and platform tradeoffs surface before they become incidents. Governance owner attends. Decisions are recorded with the owner and the date.

Quarterly architecture audit — half day, forward-looking. Where is the system accumulating hidden cost? What capability investment is being deferred? What would break first if the load doubled? The audit produces a short list of bets for the next quarter, not a roadmap deck.

Each cadence locks in a different time horizon. Weekly locks in operational latency. Monthly locks in outcome reliability. Quarterly locks in capability investment. Together they cover the full range from “is anything on fire today” to “are we building toward where the load is going.”

What Each Cadence Prevents

The weekly cadence prevents alert fatigue from becoming normalized degradation. Teams that skip it tend to discover the same problems later, at higher cost, under more pressure.

The monthly review prevents the gap between product ambition and platform reality from widening silently. That gap is where most AI roadmap slippage hides. By the time it is visible to leadership, it is already a quarter behind.

Cadence does not eliminate incidents. It shortens the distance between a signal and a decision .

The quarterly audit prevents incident-driven re-architecture . The single most expensive pattern in scaling AI programs is emergency redesign under production pressure. Orgs that run a quarterly audit tend to make the same architectural changes earlier, cheaper, and with less organizational disruption. The audit is not a guarantee — it is a forcing function for the conversation that should happen before the crisis.

The Predictability Test

A cadence is working when the team can answer one question before the quarter ends: what is the most likely bottleneck next quarter, and who owns the intervention?

This is not a forecasting exercise. It is a structural test. If nobody can answer it, the cadence is collecting status but not producing foresight. The monthly reviews are not surfacing risk early enough, or the quarterly audit is not connected to the weekly signal.

If the team can answer it — even roughly — the cadence is compounding. The interfaces are being exercised on a predictable rhythm, and that rhythm is generating the kind of organizational memory that makes year-two scale possible without heroics.

Key Takeaways

Documented interfaces degrade without a cadence to run them; informal coordination fills the gap and eventually breaks.
Three rhythms cover the full operating surface: weekly operating, monthly outcome review, quarterly architecture audit.
Each cadence locks in a different time horizon — latency, reliability, and capability investment respectively.
A cadence is working when the team can predict next quarter’s bottleneck before it arrives.

The Post-Prototype AI Org: Operating Models That Survive Year Two

Wed, 10 Jun 2026 00:00:00 +0000

Quick take

A lot of AI orgs look healthy in month three and brittle by year two. The model usually did not fail. The operating model did. Prototype energy is easy to create; durable coordination is not.

The question is not whether the team can ship something exciting. The question is whether the company can keep shipping after the novelty fades.

Why the prototype phase hides the real problem

In the early phase, AI teams often succeed because everyone is close to the work. Decisions are informal, context is shared, and the whole system fits in a few people’s heads. That stops scaling almost immediately.

As soon as the team grows, the same strengths turn into liabilities:

knowledge becomes hidden
approvals multiply
handoffs slow down
nobody owns the interface boundaries

What worked when the team was small no longer works when the company needs predictability.

The operating model should be explicit

A post-prototype AI org needs to define how work moves.

The model should answer:

who owns the user problem?
who owns the runtime?
who owns the quality signal?
who owns the risk boundary ?
who can stop the release?

Without those answers, the team is improvising around gaps that will eventually become incidents or delays.

Handoffs are the hidden bottleneck

Most AI roadmaps do not fail because the team lacks ideas. They fail because each handoff adds ambiguity.

The problem shows up in predictable places:

product asks for speed, platform asks for safety
applied AI wants more freedom, compliance wants more proof
leadership wants output, the system wants more control

That tension is normal. What is not normal is leaving it unresolved.

A good operating model turns tension into a documented interface, not a recurring crisis.

Scale requires less heroics, not more

The post-prototype org has to depend less on heroic behavior and more on repeatable behavior.

That usually means:

clearer ownership
smaller decision surfaces
stronger eval gates
visible rollback paths
fewer ambiguous exceptions

This can feel slower at first, but it is the only way the org gets faster at scale.

A simple test

Ask whether the AI system can survive a senior person going on vacation for two weeks.

If the answer is “not really,” the organization is still running on hidden tribal knowledge.

If the answer is “yes, with documented ownership and a stable operating model,” the company is moving from prototype to production.

That is the real year-two test.

Key Takeaways

Prototype energy does not scale on its own.
The year-two problem is usually organizational, not model-related.
Ownership, interfaces, and escalation paths matter more than the demo itself.
A durable AI org is designed for scale before the prototype succeeds.

The AI Vendor Negotiation Playbook for CTOs

Tue, 09 Jun 2026 00:00:00 +0000

Use this before any AI vendor contract renewal, initial procurement, or pricing negotiation. Most CTOs walk in under-prepared — the vendor knows your dependency footprint better than you do. This worksheet closes that gap. Work through it the day before the meeting.

1. Workload Facts You Must Have

The vendor’s first move is to define your usage for you. Don’t let them.

Total request volume per month, broken out by use case A single aggregate number is not enough. Know which workflows drive cost.
Cost per task class (e.g., generation vs. classification vs. retrieval) If you cannot name your top three cost drivers, you cannot challenge the invoice.
Latency p50/p95 by workflow, measured from your own instrumentation Vendor SLAs are measured at their edge, not yours.
Percentage of spend attributable to this vendor vs. total AI budget Concentration creates leverage — for them. Know the number.
Named owner of the vendor relationship on your side If no one owns it, no one negotiates it.

2. Architecture Leverage Check

Leverage is an architecture property. Answer these before you sit down.

Is the vendor’s API called directly from product code, or through an abstraction layer ? Direct calls = switching costs measured in months. Abstraction = measured in days.
How many distinct integration points does this vendor touch? Write the number. Fewer than five is manageable. More than ten is a dependency.
What is the estimated engineering cost to swap this vendor? Get a real estimate, even a rough one. “Unknown” is not an answer.
Do you have a secondary provider you have already integrated, even partially? Yes/No. If no, you have no credible threat.
Does your data pipeline depend on vendor-specific formats or embeddings ? Format lock-in is often more expensive than API lock-in.

3. Evaluation Evidence

Vendors sell on benchmark claims. Counter with your data.

Do you have evals that measure model performance on your actual workload ? Yes/No. If no, you are buying on their terms by default.
Which models have you tested against your task suite in the last 90 days? List them. If the answer is only theirs, you have no comparison point.
What is your acceptable quality threshold, defined numerically? “Good enough” is not a threshold. A number is.
Have you run a cost-per-correct-output comparison across providers? Price per token is a distraction. Price per correct result is the metric.
Who owns your eval framework and can demo it in the meeting if needed? Named person, not a team.

4. Exit Credibility

A vendor that believes you cannot leave does not negotiate. Make them uncertain.

Do you have a documented migration plan, even a sketch? It does not need to be final. It needs to exist.
What is your contractual notice period to exit? Know this before they remind you of it.
Have you identified which vendor you would move to first if pricing increased 40%? Name them. Vague alternatives are not alternatives.
Is there a sunset timeline for any features that are vendor-exclusive today? If yes, the vendor knows your dependency has an expiration date.
Can your team absorb a two-week migration sprint without derailing the roadmap? Yes/No. Honest answer only.

If you cannot fill in the workload numbers, you are not done preparing — you are about to negotiate against someone who has already modeled your spend. If you have no eval data, you will accept their performance claims by default. If there is no exit plan, any number they name is essentially a take-it-or-leave-it offer. The meeting itself is the wrong place to discover these gaps. Thirty minutes with this worksheet before you walk in is worth more than any negotiation tactic once you are in the room.

How to Run an AI Incident Review That Changes Architecture, Not Slides

Tue, 02 Jun 2026 00:00:00 +0000

Quick take

An AI incident review is only useful if it changes the system. Anything else is a postmortem-shaped meeting.

If the review does not change architecture, evaluation, or control boundaries , the organization has paid for ceremony and learned too little.

The Point of an Incident Review

The point of an incident review is not to assign theater-friendly blame.

The point is to answer:

what failed
why it failed
how we knew
what should change so it fails differently next time

If that last step is missing, the review is incomplete.

What Good Reviews Produce

A strong incident review should produce concrete outputs:

a change to architecture
a change to evaluation coverage
a change to alerting or observability
a change to access or fallback policy
a change to ownership or escalation rules

If the only output is a slide deck, the organization is optimizing for closure, not improvement.

The cleanest signal is whether the same class of incident can happen again. If it can, the review was not done.

How AI Incidents Are Different

AI incidents often degrade quietly long before they trigger a loud outage.

The symptoms may be:

degraded answer quality
increased retries
hallucinated outputs that look plausible
cost spikes hiding inside normal traffic
users losing trust before the team notices

That means incident reviews need to look at both user impact and system behavior. You cannot fix what you did not measure.

Incidents tell you where the system was more fragile than the architecture review admitted .

A Useful Review Template

A practical review should cover:

the triggering event
the timeline
the technical failure mode
the business impact
the monitoring gap
the architectural fix
the owner of the fix
the follow-up verification date

That is enough to keep the review grounded and actionable.

A postmortem without system change is paperwork.

The template is simple on purpose. If the review cannot name the control that changes, the meeting was too abstract.

Key Takeaways

Incident reviews should change architecture, not just record narrative.
AI failures often show up as silent degradation before loud incidents.
Good reviews end with specific fixes, owners, and verification dates.
If the same class of incident can recur, the review was not complete.

How Great CTOs Design AI Roadmaps That Survive Contact With Reality

Thu, 28 May 2026 00:00:00 +0000

Quick take

AI roadmaps fail when ambition is treated as sequencing. Dependencies slip, rollback gets expensive, and the team discovers the missing work only after the launch date is already spoken for.

A survivable roadmap is not a prettier Gantt chart. It is a dependency-aware budget for uncertainty.

Roadmaps Fail at the Edges

The core mistake is treating the roadmap like a statement of intent instead of a statement of sequencing.

AI work fails at the edges:

data access is slower than expected
model behavior is less stable than expected
review cycles take longer than expected
vendor changes arrive earlier than expected

If your roadmap does not account for those edges, it is not a plan. It is a confidence exercise.

Most teams only find out those edges are missing after the launch date is already public.

The fix is to move the hidden work into the plan before the promise is made.

Budget the Dependency Chain

Every AI feature has a dependency chain:

data availability
context assembly
model routing
evaluation
deployment
fallback

If any one of those links is not ready, the feature will not survive real use.

If the chain is incomplete, the roadmap is lying by omission.

The most honest roadmap is the one that writes the chain down first. That slows the conversation, but it also keeps the team from selling a feature that depends on work nobody has budgeted.

Slower conversations are cheaper than broken launches.

Make Rollback a First-Class Requirement

Good roadmaps assume the first version will be wrong.

That means every AI initiative should answer four questions:

How do we turn this off?
How do we know it is hurting us?
How fast can we revert?
What manual path exists if the model degrades?

If those answers are fuzzy, the roadmap is overconfident.

If you cannot turn it off quickly, you have shipped a liability with a product label .

Roadmaps should not only describe the happy path. They should budget for the probability that the first version is wrong, the vendor changes terms, or the model regresses under load.

That is not pessimism. It is operational seriousness.

WIP Limits Matter More Than Hope

A roadmap that promises too many parallel AI experiments is usually a roadmap that does not respect WIP.

The more novel the work, the lower the WIP should be.

Concurrency feels productive until it multiplies rework.

Strong teams set rules like:

no more than one high-risk AI launch per squad at a time
no feature ships without evaluation coverage
no vendor migration without a fallback path
no roadmap item enters “done” until the operational notes exist

That may sound strict. It is. Novel work punishes loose concurrency.

What a Survivable Roadmap Looks Like

Survivable roadmaps are dependency-explicit, rollback-aware, and honest about capacity.

A roadmap is not a promise. It is a bet with visible failure modes.

If the failure modes are invisible, the roadmap is pretending.

You do not need a roadmap that impresses the room. You need one the organization can execute without pretending the hard parts are somebody else’s problem.

Key Takeaways

AI roadmaps fail at dependency and rollback boundaries.
Treat the roadmap as a budget for uncertainty, not a wish list.
Limit WIP, make rollback explicit, and require evaluation coverage before launch.
The best roadmap is the one the organization can survive.

Hiring for AI Teams: The Operator Profile That Actually Scales

Tue, 26 May 2026 00:00:00 +0000

Quick take

The best AI hires are not the people who can narrate the model stack. They are the operators who can turn ambiguity into a system, make the failure mode legible, and keep shipping when the first answer is wrong.

That is why judgment matters more than hype. Teams that hire for excitement get enthusiastic meetings. Teams that hire for operator discipline get leverage.

The Operator Profile

Strong AI operators usually have four traits:

they can turn a vague brief into a tractable plan without waiting for perfect inputs
they know enough about systems tradeoffs to challenge weak assumptions early
they care about verification as much as output
they can move between engineering, product, and executive language without flattening the nuance

Model trivia is cheap. Operator judgment is what survives contact with production.

The market is full of people who can name the newest framework. The shortage is people who can keep a system healthy when the workload changes, the vendor shifts, or the first release misbehaves.

What Most Teams Hire Wrong

AI hiring goes off the rails when teams reward signals that are easy to notice but hard to run with.

Teams over-index on:

prompt fluency without operational discipline
research taste without delivery habits
architecture opinions without incident literacy
product instinct without measurement rigor

None of those traits is bad. The problem is imbalance.

A strong AI team needs people who will own the boring parts: evals , fallback logic, access boundaries, cost control, and documentation precise enough that someone else can operate the system later.

If a candidate can talk fluently about models but cannot explain how they would debug a bad release, they are not ready to own production AI.

The Interview Questions That Matter

You do not need a clever hiring process . You need questions that force real evidence.

Ask candidates to walk through:

A system they had to stabilize. What was broken, how did they know, and what changed after they touched it?
A decision they reversed. Strong operators do not defend bad ideas forever. They update when the evidence changes.
A workflow they measured. If they cannot show how they connected work to metrics, they probably did not own the outcome.
A failure they made safer. In AI, good operators do not eliminate failure. They bound it.

A useful answer is concrete, a little messy, and grounded in actual work. The worst answer sounds polished and empty.

Hire for the Shape of the System

AI teams do not need the same operator profile in every context. Research-heavy, production-heavy, and regulated enterprise teams all demand different instincts.

If you want a research-heavy team, hire for exploration and rigor. If you want a production-heavy team, hire for stability and operational discipline. If you want a regulated enterprise team, the bar is not “exciting.” The bar is whether this person can help you ship safely, repeatedly, and without heroics.

That is the real operator profile:

can handle uncertainty without freezing
can make tradeoffs explicit
can leave behind a system other people can run
can keep pace without turning every launch into a performance

Key Takeaways

Hire AI operators for judgment, not model vocabulary.
Ask about stabilization, reversal, measurement, and safer failure.
The strongest people leave behind systems, not just stories.
If a candidate cannot explain how they debug and bound failure, keep looking.

Technical Leadership in the AI Era (It’s About Throughput, Not Trends)

Thu, 21 May 2026 00:00:00 +0000

Quick take

AI does not change the core job of technical leadership. It changes the cost of being vague. In 2026, the best leaders still do the same three things: set direction, remove friction, and keep production systems measurable. The difference is that AI makes every weak assumption show up faster.

The real mandate is throughput. Not more noise. Not more experimentation theater . Throughput.

The Leadership Pivot: Focus on Throughput

Organizations do not pay technical leaders to keep up with model releases. They pay them to improve organizational throughput .

That means reducing cognitive overhead, tightening verification, and making deployment paths boring enough that teams can move without drama. If you cannot measure what an AI workflow produced, or what it cost to produce it, you do not have an operating system yet. You have a prototype with invoices.

The leadership question is simple: are we removing blockers faster than we are adding complexity?

Decision-Making in Practice

AI work gets messy when teams debate tools before they define the outcome.

Good leaders force the conversation back to first principles:

What business metric should change if we ship this?
What latency budget do we actually have?
What happens when the model is wrong?

Those questions cut through a lot of noise. They keep the team from turning architecture meetings into opinion contests about vector databases , prompt styles, or the latest agent framework.

If the answer to any of those questions is fuzzy, the work is not ready for serious implementation.

Define “Good Enough” and Measure It

Reliability is not just accuracy. It is consistency, cost, and the ability to catch degradation before customers do .

Sometimes a smaller, cheaper model is the right answer. Sometimes the frontier model is worth the price. The point is not to be religious about either option. The point is to define the bar, test against it, and choose the system that meets it with the least operational pain.

Your job is not to build a perfect AI system. It is to build one where failure is bounded, expected, and visible.

The Cultural Shift

Technical leadership still has a change-management problem. Engineers will worry about ownership, safety, and the volatility of the ecosystem. Those concerns are real.

The right response is not debate for its own sake. It is instrumentation.

Stop arguing in design docs about whether a model will work. Build the telemetry that shows whether it works. Stop treating every new framework like a strategy reset. Run small, contained experiments that either produce evidence or die cheaply.

The strongest teams are not the ones that sprint toward the newest beta API. They are the ones that can absorb change without losing control.

Final Take

AI rewards leaders who are disciplined about outcomes and ruthless about verification. If the team can move quickly, measure clearly, and recover cleanly, AI becomes leverage. If not, it becomes another source of drag.

Stop Building Internal AI Tools No One Uses

Tue, 19 May 2026 00:00:00 +0000

The demo went well. A mid-size logistics company — roughly 800 people, enough procurement complexity to justify the investment — had spent three months building an internal AI tool to surface contract terms during vendor negotiations. The launch Slack channel hit 40 reactions in the first hour. A VP called it the kind of thing that changes how the team operates.

Six weeks later, the channel had five messages in it, four of them automated. The procurement leads were still pulling PDFs manually and copying terms into a shared spreadsheet. One support engineer, who had quietly championed the project from the beginning, had reverted to her old database query because “the tool doesn’t know about the amendments.” The tool was still running. Nobody had officially abandoned it. It had simply become invisible.

This pattern is not unusual. It is almost the default.

What Actually Failed

The postmortem conversation usually centers on the wrong things — model choice, interface design, rollout timing. Those are symptoms. The root causes are structural.

The contract tool was built around a narrow slice of the negotiation workflow: surfacing base terms. But procurement work is not base terms. It is base terms plus amendments plus prior history plus the relationship context the lead carries in her head. The tool knew one layer of a five-layer problem. It looked complete in a demo because demos are controlled. Real work is not controlled.

The output trust problem arrived fast. In week two, the tool surfaced an incorrect payment term — technically correct in the original contract, superseded by a signed amendment it had not been given access to. The lead caught it before it caused damage, but she stopped relying on it after that. One unexplained wrong answer is enough to demote a tool from co-worker to footnote. The team had not built evaluation into the system , so there was no way to know how often this happened, which made the uncertainty worse, not better.

Nobody owned adoption after the launch. The engineer who built it moved to a different priority. The VP who celebrated it never checked sustained usage . When procurement leads developed workarounds, there was no one watching the signal and no one with a mandate to respond. The tool drifted.

When It Works

A different team at a professional services firm built something structurally simpler: a tool that drafted the engagement summary section of a client report, pulling from structured notes the consultant had already entered into their project management system. Narrow scope. No novel context required. One predictable output format, reviewed every time before it went anywhere.

The tool stuck. Not because it was more technically impressive — it was considerably less so. It stuck because it removed a specific, recurring task that consultants genuinely disliked, it used context they were already maintaining anyway, and the output was always human-reviewed before it mattered. The failure mode was visible and safe. The value was obvious the first time you used it and every time after.

The team lead reviewed usage weekly for the first two months and made three small adjustments based on what she saw. That ownership — unglamorous, persistent, post-launch — is what made the difference.

The Structural Difference

Both companies built AI tools for internal workflows . One failed quietly, one became a habit. The gap was not the model. It was not the interface. It was whether the tool was designed around how work actually moves or around what would look good in a demo .

Tools that survive are ones that fit a narrow, complete slice of a workflow, produce output that is either verifiable or bounded enough to trust, require no context the user does not already have, and have someone whose job it is to watch whether people are actually using them.

That last part is the one most teams skip. Usage is not a launch outcome. It is an operating responsibility.

Build the System the Model Cannot Break

Thu, 14 May 2026 00:00:00 +0000

Quick take

An AI-native company is not a company that uses AI. It is a company whose operating model — decisions, ownership, interfaces, capital, and failure boundaries — has been built so AI compounds inside it instead of evaporating around it.

The model will change. The system around it should not.

This is a manifesto. It is opinionated, deliberately. Twelve tenets, four movements, one test. Borrow what works. Argue with the rest.

Movement I — Strategy

1. The operating model is the strategy

The model is the most expensive dependency in your stack. It is not the brain. The brain is everything you build around it: context assembly, retrieval, validation, retries, telemetry, fallback, escalation.

Two companies buy the same frontier model on the same Tuesday. One ships in six weeks with a deterministic fallback, a typed validator, and an eval gate on every PR. The other ships in six months with a notebook of “good prompts” and a Slack channel for incidents. Same model. Different company.

If your AI plan begins with “which model should we buy,” you are solving the easiest problem in the room. The moat is everything around the model.

2. Capital allocation is the first product decision

Great AI teams do not start with a roadmap. They start with a kill list . Capital is finite. Attention is finite. Support burden is finite.

Three questions before any AI initiative gets funded:

Does this increase margin, reduce risk, or improve speed?
Can we measure that effect within one to three quarters?
Do we own the fallback if the model or vendor changes?

If the answer to all three is not yes, the default is no.

The most common pattern across Series B–D companies that quietly stalled in 2024–2025: somewhere between $1M and $3M of engineering and infra burned on internal copilots that never crossed adoption threshold, plus a duplicate prompt orchestration layer because two teams built one in parallel. Neither project had a measurable failure mode. Both had a sponsor.

A four-dimension scorecard makes the next budget meeting honest:

Adoption — are real users using it in a real workflow?
Reliability — does it fail in bounded, observable ways?
Margin — does it reduce cost or improve unit economics?
Speed — does it shorten a real business cycle time?

If you cannot defend it with numbers, the project is not innovative. It is unpriced.

3. Decision latency is a P&L variable

Slow decisions look like caution. In practice, they are hidden expense. Every day a real decision sits unresolved, the business pays in delay, rework, and attention.

Headcount is an input. Throughput is an outcome . Adding the tenth engineer to a system that takes nine days to approve a deploy adds nine more days of waiting, not 10% more output.

Track four numbers with the same seriousness as revenue:

time from issue raised to decision made
time from decision made to action taken
escalations per decision class
decisions reopened after approval

Ambiguous ownership is the most expensive architecture in your company.

Movement II — Architecture

4. Build firewalls, not masterpieces

A statistical engine cannot be expected to behave like deterministic infrastructure. If your architecture only works when the model is correct 100% of the time, it is not architecture. It is wishful thinking with a demo budget.

Three failure modes, three firewalls. They are not the same thing and they are not solved by the same code:

Inbound sanitization. What data is permitted into the prompt context. PII strippers, schema enforcers, retrieved-document trust scoring. This is also where indirect prompt injection — instructions hidden in a vendor PDF, a customer message, or a tool output — gets caught before it reaches the model.
Outbound validation. A typed schema checker stands between the model and the operational database. Malformed JSON, out-of-range values, and policy-violating outputs are rejected at the boundary, not absorbed by downstream services.
Operational fallback. Circuit breakers for vendor outages and rate limits. If the model returns invalid output three times in a row, the system degrades to a deterministic path — not a stack trace in front of the user.

Each of these is a separate piece of code with a separate owner, a separate test surface, and a separate failure mode. A “kill switch” that catches all three is a slide, not a system.

You cannot prompt your way out of entropy. You have to architect your way out of it.

5. Evaluation is the spine

If you cannot define an eval suite before shipping a feature, you do not understand the system well enough to ship it.

A five-level maturity ladder :

Vibes-based. Someone eyeballs prompts before release.
Spreadsheet. Suite exists, runs occasionally, blocks nothing.
CI/CD-integrated. Evals run on every PR. A failed gate stays failed.
Continuous telemetry. Production samples scored asynchronously. Incidents become regression tests.
Governance as moat. Evaluation shapes architecture before code. Margin, latency, and sovereignty tradeoffs are quantified, not asserted.

Below Level 3 is not a production system. It is a demo with a pager.

Level 4 is where most organizations get stuck, and the reason is rarely effort. Judge models drift, ground truth ages, sampling bias creeps in, and your asynchronous scoring quietly stops tracking the failure mode you cared about. Mature teams hold a small, hand-labeled golden set as the anchor, treat the judge model as a versioned dependency, and re-calibrate when either changes.

Eval portability is a year-two survival trait. If your eval suite is hand-tuned to one model’s tokenizer and one vendor’s output quirks, you have not built an eval suite. You have built a benchmark for the model you are about to be unable to leave.

6. Agentic systems run on a reliability contract

Agents are not magical workers. They are autonomous systems with more ways to fail. The reliability discipline gets stricter, not looser.

Every production agent answers five questions in one meeting, without hand-waving:

what is it allowed to do?
what is it explicitly not allowed to do?
what metrics prove it is healthy?
what happens when the model degrades?
who can stop it, and how fast?

But the five questions are a meeting checklist. The contract is a published artifact with SLOs, blast-radius caps in dollars or rows or API calls, rollback latency targets, and a named owner per failure mode. Blast radius is the real design variable: data scope, action scope, time scope, permission scope, fallback scope.

Kill switches are not weakness. They are governance that can move faster than the failure. A useful test of any AI control: could an engineer follow this rule at 2 a.m. without calling a committee?

A roadmap that ships an agent without answers to these questions is a roadmap that has shipped a liability with a product label. Every initiative names how it turns off, how it knows it is hurting, how fast it reverts, and what manual path exists when the model degrades.

Companion: Agent Reliability Contract template . Rollback document template .

Autonomy without a reliability contract is just an incident waiting for a timeline.

Movement III — Economics & Externals

7. Unit economics live at the workflow, not the model call

Route by value and by risk. Trivial work stays cheap and local. High-stakes work earns expensive inference and stronger checks. A finance-aware leader can answer, without hand-waving:

what each class of request costs to serve, end to end
where the rework happens
what failure costs when the model is wrong
which parts of the workflow justify premium inference

The cost question nobody owns until it explodes: when product ships a feature that 10x’s tokens, who pays? If the answer is “we’ll figure it out,” you have not designed an operating model. You have deferred a fight.

Compute placement is part of this calculation, not a separate one. For high-frequency agentic workloads, a chain of round-trips across regions and vendors compounds into real latency tax and real egress cost. Local-first, hardware-aware patterns earn their place where the workload mix justifies them — and create a worse outcome where it does not. Measure first, place compute second.

A cheaper model that fails gracefully beats an expensive model that fails silently.

8. Sovereignty is an architecture constraint

Privacy is not a feature you bolt on before an enterprise contract closes. It is the shape of the system.

A sovereign system controls the full lifecycle of every piece of data — where it lives, who can access it, how long it persists, and what happens when someone asks you to delete it. In practice, four concrete patterns:

Customer-managed keys. BYOK or hold-your-own-key. If your cloud provider holds the only copy of the encryption key, “we cannot access your data” is a policy promise, not a verifiable claim.
Regional routing with storage isolation. EU data does not leave EU infrastructure. The application layer handles the routing. The deployment pipeline ships multi-region.
Scoped, short-lived access. No ambient credentials. Service-to-service tokens with explicit grants and automatic expiry.
Immutable audit trails. Append-only, tamper-evident logging of every access, transformation, and movement.

“We use AWS” is not an answer to “where does my data live.” Sovereignty is about specificity.

The compounding bill arrives when you try to add this later. The discount arrives when you build it in early and close enterprise contracts without an architectural retrofit.

9. The threat model is the manifesto

An AI manifesto without a threat model is marketing copy. Four risks every operator names explicitly:

Indirect prompt injection. Instructions hidden in retrieved documents, tool outputs, and user uploads — not just in the user’s direct prompt. Treat every retrieved string as potentially adversarial. Validate before it reaches the model. Strip before it reaches the agent.
Silent quality drift. The model returns slightly worse reasoning. The tone shifts. The retrieval starts ignoring critical documents. There is no stack trace. Only asynchronous production scoring, anchored to a golden set, catches this before customers do.
Vendor and model lock-in by accident. Fine-tunes, preference data calibrated to one model family, and prompts hand-tuned to a specific tokenizer compound. By year two, your “swappable” model is a six-month migration. Discipline preserves optionality: prompt abstraction, eval portability, vendor-neutral preference data, and a quarterly review of what would break if the vendor changed terms tomorrow.
Agent blast radius creep. Permissions accumulate. The agent that summarizes documents quietly gains write access to your billing API because someone needed it once. Audit scope quarterly. Treat agent permissions like database credentials, not like configuration.

Threat modeling is not a one-time exercise. It is the bill of materials your system runs on.

Movement IV — People & Failure

10. Interfaces beat titles

Most AI hiring plans try to fix an interface problem with resumes. They rarely work.

A working leadership system is not a roster of senior titles. It is a decision map. Four owners with explicit decision rights and explicit escalation paths:

Product — user outcomes, adoption, business tradeoffs.
Platform — safe defaults, deployment paths, observability, paved roads.
Applied AI — workflow behavior, routing, prompting, retrieval, evaluation quality.
Governance — risk boundaries, sovereignty controls, escalation thresholds.

The titles can be anything. The interfaces cannot be ambiguous. If the answers depend on who is online that day, the system is not operational.

The same logic governs platform teams. A platform exists to make repeated decisions disappear into the default path — identity, routing, eval harnesses, logging, safe deployment, fallback behavior. The moment platform becomes a queue that has to bless every use case, the queue is the product and waiting is the cost. A platform should remove waiting, not become a waiting room.

Hiring works after the operating contract is clear, not before. New hires scale the current operating model, good or bad. Org debt is interface debt with better branding.

11. Anti-fragility requires portability discipline

Resilience is surviving the shock. Anti-fragility is using the shock to remove the next one.

Fragility hides in the org chart and in the stack. One engineer who knows the routing. One vendor whose terms changed last week. One fine-tune that took six months to train and would take six months to migrate. That is not an organization or a system. That is a single point of failure wearing a department badge or a model card.

Four design choices build strength:

Modular ownership. No critical function depends on one person’s memory. Deputies are named.
Resettable interfaces. A model, vendor, or workflow can be swapped without a rewrite. This is not free. It requires prompt abstraction, eval portability, vendor-neutral preference data, and a regular drill where the team actually proves a swap is possible.
Fast learning loops. Every failure produces a tighter eval, a better fallback, or a clearer operating boundary.
Cross-training on the boring parts. Alerts, evals, fallback logic, access boundaries. The unglamorous work is what keeps the organization elastic.

A short anti-fragility check:

Can you swap a model without rewriting the product?
Can you lose a key engineer without losing the system?
Can you absorb a vendor price increase without panic?
Can you turn a production incident into an improved control?

If any answer is no, the organization is more brittle than it thinks. The most expensive lie an AI organization tells itself is that the model is swappable when nobody has tried.

12. The year-two test

A lot of AI organizations look healthy in month three and brittle by year two. The model did not fail. The operating model did. Prototype energy is easy to create. Durable coordination is not.

The single question that separates the two:

Can the AI system survive a senior person going on vacation for two weeks?

If the answer is “not really,” the organization is still running on hidden tribal knowledge.

If the answer is “yes, with documented ownership, a published reliability contract, an eval suite that blocks releases, and a fallback path the on-call engineer can execute at 2 a.m.,” the company is moving from prototype to production.

That is the only year-two test that matters. Everything else in this manifesto is in service of passing it.

What this manifesto is not

It is not a prediction about which model wins. It is not a framework for replacing engineers with agents. It is not a defense of any vendor, any cloud, or any stack.

It is a statement about how serious companies organize for AI when the easy money, the demo budgets, and the hype cycles are done — and only the operating model is left to do the work.

The model will change.

The system around it should not.

Law Zava writes about the operating model behind serious AI execution. Companion artifacts: Agent Reliability Contract template · Rollback document template · Eval Suite starter kit . The canonical reading path is at /blog .

Why Most AI Platform Teams Become the New Bottleneck

Thu, 14 May 2026 00:00:00 +0000

Quick take

AI platform teams become bottlenecks when they start reviewing every use case instead of shipping safe defaults. Once the team needs a ticket to approve basic work, the queue is the product and the platform is just a delay with a nicer name.

The answer is not to shrink the team and hope demand goes away. It is to move decisions out of the queue and into the platform.

A Platform Team Is a Product with a Queue

A healthy platform team exists to make repeated decisions disappear.

If every experiment needs a ticket, a Slack ping, and a weekly exception review, the platform is no longer a platform. It is a gate with a service catalog.

The warning signs show up fast:

request backlogs that never get smaller
the same exception coming back under a new name
engineers building shadow infrastructure because the official path is too slow
work that should have been standardized long ago still handled by hand

Once teams start routing around the platform, the default path has already lost.

What Bottleneck Behavior Looks Like

Bottlenecks rarely announce themselves. They sound like process.

You hear it in the same lines over and over:

“We are waiting on the platform team.”
“Can we make this an exception?”
“We built a small internal workaround.”
“The platform is a few weeks behind us.”

None of those lines is fatal on its own. The pattern becomes a problem when they turn into the normal way work gets done.

A platform team becomes a bottleneck when it centralizes decisions that should have been made once, written down, and pushed into the default path.

Redesign the Team Around Capabilities, Not Control

Good platform teams build paved roads .

They own the hard parts once:

identity and access patterns
model routing defaults
evaluation harnesses
logging and traceability
safe deployment templates
fallback behavior

Then they get out of the way.

The wrong shape is a team that has to bless every new use case. The right shape is a team that makes the safe path easier than the unsafe one.

A good test: a platform team should remove waiting, not become a waiting room .

The Metrics That Reveal the Truth

Most platform dashboards avoid the real question. You need blunt metrics.

Measure:

time from request to usable platform support
exceptions granted per month
shadow systems discovered in production
hours spent waiting on platform review
AI workflows shipped without platform involvement

Those metrics tell you whether the platform is compounding or constraining.

If exceptions keep rising and the team calls that “flexibility,” the default path is still too hard to use.

What Good Looks Like

The best AI platform teams I have seen share three habits:

They bias toward self-service.
They make safe defaults boring.
They track the cost of waiting as carefully as the cost of infrastructure.

That last one matters. Waiting is not free. Every hour a product team spends blocked on the platform is an hour not spent learning from users.

A good platform team does more than improve developer experience. It improves business velocity.

The CTO Communication Protocol: Aligning Engineers, Executives, and Investors in AI Programs

Tue, 12 May 2026 00:00:00 +0000

Quick take

AI programs rarely fail because one team is incompetent. They fail because the organization tells itself three different stories about the same system. Engineers hear one version of reliability, executives hear one version of commercial impact, and investors hear one version of scale. By the time those stories collide in a board meeting, the disagreement has already been baked into the program. A CTO’s job is to keep the story true enough that people can act on it.

The Alignment Problem

Every layer in a company listens for a different failure.

Engineers ask: can we make it reliable without turning the stack into a science project?

Executives ask: can it matter this quarter, not someday?

Investors ask: can it scale without becoming a support burden, a security problem, or a margin leak ?

If those questions are not coordinated, the organization drifts into avoidable conflict. Product thinks it shipped success. Engineering thinks it shipped risk. Finance thinks it shipped cost. The AI program becomes a political object instead of an operating system.

What Each Layer Needs to Hear

A good communication protocol gives each audience the right level of detail and nothing more.

Engineers need constraints, failure modes, ownership, and the exact conditions under which they should stop or escalate.

Executives need the business outcome, the tradeoffs, the cost of delay, and the risk of waiting for a perfect answer.

Investors or board members need the thesis, the numbers, the confidence interval around those numbers, and the reason the company believes the numbers are real.

The common mistake is predictable: over-share implementation detail upward and under-share operational reality downward. Leaders either talk past each other or sand off the complexity to keep the room calm. Neither habit helps. Clarity is kinder than politeness when the system is expensive.

Build a Communication Rhythm

Strong CTOs do not improvise every update. They set a rhythm that forces the same narrative to appear at predictable intervals, so the organization can spot drift before it becomes a surprise.

A practical cadence looks like this:

weekly: operational progress, blockers, decisions made, decisions deferred
monthly: outcome metrics , risk posture, and what changed in the operating assumptions
quarterly: strategy shifts, tradeoffs, roadmap changes, and what the board should expect next

That structure gives the organization memory and gives the board a clean way to compare this quarter with the last one.

The point is not to produce more slides. The point is to keep the story consistent enough that people can challenge it honestly.

Misaligned narratives are delayed incidents.

Use the Same Three Questions Everywhere

Keep asking the same three questions in every forum: what changed, what did it affect, and what happens next? Those questions work at the team level, the executive level, and the board level because they force the same discipline: outcome, consequence, next move. If a layer cannot answer them, the communication is not yet useful.

Alignment is not consensus. It is a shared operating picture.

Key Takeaways

AI programs fail when each audience hears a different success definition.
Engineers, executives, and investors need different levels of detail, but they need the same core truth.
Use a consistent communication rhythm so the story does not change every time the room changes.
Keep asking what changed, what it affected, and what happens next until the answer is sharp enough to survive board scrutiny.

AI Governance Without Bureaucracy

Thu, 07 May 2026 00:00:00 +0000

Quick take

Good AI governance does not look busy. It looks boring: tighter defaults, named owners, and fast escalation paths. If governance slows safe work and never stops unsafe work, it is bureaucracy with a policy memo attached.

The Governance Mistake

Most organizations confuse governance with oversight theater.

They create committees, review boards, and approval layers, then act surprised when teams route around them. The result is predictable: slow delivery, hidden risk, and a false sense of control.

AI governance should answer a simpler question: what is allowed by default, what requires review, and what is forbidden?

If those boundaries are clear, teams can move. If they are not, every decision becomes a negotiation.

Tight Defaults Beat Loose Rules

Good governance systems do not ask engineers to remember every policy. They make the safe path the easy path.

That means:

default data access is scoped, not ambient
model use is tied to approved workflows
logs retain enough context to investigate failures
high-risk actions require explicit escalation
evals run before release, not after incident review

Governance works when it compresses uncertainty. It fails when it only adds paperwork.

A useful test: could an engineer follow the rule at 2 a.m. without calling a committee? If not, the rule is too vague or too heavy.

Ownership Matters More Than Policy

The fastest way to break governance is to make it everyone’s job.

Real governance needs named owners for:

data classification
model approval
evaluation coverage
exception handling
incident response

Without ownership, governance becomes a shared belief system. Shared belief systems feel flexible until something breaks.

The people who matter most are not the ones writing the longest policy. They are the ones who can answer: who decides, who reviews, and how fast can we change course?

Build the Smallest Control Stack That Works

You do not need 30 controls to govern AI well. You need the smallest control stack that actually changes behavior.

Start with:

a short list of approved data classes
a clear model use policy by workflow
required evals for release
a lightweight exception path
an incident review process that changes architecture, not just slides

If you can keep that stack small, understandable, and enforced, you will get more compliance and less resistance.

A line worth keeping: the best control is the one engineers can still use at 2 a.m.

Key Takeaways

Governance should compress uncertainty, not create bureaucracy.
Use tighter defaults and named ownership.
Keep the control stack small enough to operate.
If the policy cannot survive real work, it is not governance; it is paperwork.

The Board Deck Is Lying: How to Measure AI Progress Without Theater

Tue, 05 May 2026 00:00:00 +0000

Quick take

Most AI dashboards count motion, not progress. They record pilots, prompts, and meetings, then call that momentum. If the scorecard cannot show adoption, reliability, margin, or cycle-time improvement, it is a prop. A board should be able to read it and know whether the business is better off.

The Theater Problem

AI reporting drifts toward vanity metrics because vanity metrics are easy to collect and hard to argue with.

The usual suspects:

number of pilots launched
number of prompts written
number of models tested
number of meetings held
number of slides in the board update

None of those is useless on its own. The problem is that none of them answers the only question that matters: what improved because we shipped this?

A Better Executive Scorecard

A serious AI scorecard should be small enough to remember and strong enough to force a decision.

Start with four dimensions:

Adoption — are real users using it in a real workflow?
Reliability — does it fail in bounded, observable ways?
Margin — does it reduce cost or improve unit economics?
Speed — does it shorten a real business cycle time?

If a project does not move at least one of those numbers, it is not strategic. It is a lab exercise with a budget.

The point is not to build a perfect dashboard. The point is to make it impossible to hide weak outcomes behind busy activity.

What to Report Weekly

A weekly AI review should be short, blunt, and decision-oriented.

Report:

what shipped
what users actually did with it
what broke
what it cost
what decision changed because of the data

That last bullet matters. Progress reporting without decisions is performance art.

A team can launch five experiments in a week and still have no strategy. Strategy shows up when the evidence sharpens the next choice.

Keep the Dashboard Honest

There are two reliable ways AI dashboards lie.

First, they drift toward lagging metrics only. By the time the board sees the number, the product problem is already old.

Second, they reward volume instead of signal. A busy roadmap can still be a weak roadmap.

Keep the dashboard honest by requiring every metric on the top page to map to one of three board outcomes :

margin expansion
risk compression
execution-speed advantage

If a metric does not help the board understand at least one of those outcomes, it belongs lower in the stack or not at all.

A line worth keeping: if the scorecard cannot survive finance review, it is not strategy.

Key Takeaways

Measure adoption, reliability, margin, and speed.
Weekly reviews should force decisions, not decorate slides.
Tie every visible metric to margin, risk, or execution speed.
If the dashboard cannot survive finance review, move it off the first page.

The 2026 AI Build vs. Buy Calculus (It’s Just Operational Cost)

Thu, 30 Apr 2026 00:00:00 +0000

Quick take

In 2026, build vs. buy is not a taste question. It is an operational cost question. Are you prepared to own the telemetry, the fallback paths, and the failure modes that come with the stack? Buying gives you speed and leaves the analytics with someone else. Building gives you control and hands you the overhead.

The Myth of the Headline Price

Most teams compare API pricing to GPU rental and stop there. That is the wrong first-order model.

Token price is the easiest number to quote and the least useful number to trust. The real bill shows up in the work around the model:

Telemetry & Evals: If you self-host, you must build the pipeline that captures, scores, and reviews output. Vendor APIs may bundle some of this, but then they also own the metadata.
Graceful Degradation: When the provider throttles you at peak, do you have local fallback? Hybrid systems buy resilience, but they also add systems-engineering work.
Data Sovereignty : Sometimes the reason to build is simple: the data cannot legally leave your VPC. Once that is true, the token price stops mattering.

When to Buy (The Commodity Highway)

Buy when the AI capability is a feature, not the product.

If you are building an internal documentation chatbot, a support-ticket summarizer, or a semantic search overlay, buy the API. Do not spend engineering throughput standing up vLLM instances and chasing KV-cache optimizations for a problem that is not your moat.

The catch is lock-in at the integration layer. If your code imports vendor-specific classes directly, you will feel the squeeze when prices change or a model line is deprecated. Keep the provider behind an internal interface .

When to Build (The Crucible of Control)

Build when AI sits inside unit economics or inside a hard trust boundary.

You must build if:

Your margins depend on it. Billions of tokens a day can make the API tax the difference between a healthy product and a broken one.
You operate under zero-trust or residency constraints. In healthcare, finance, or defense, the data cannot touch a multi-tenant cloud edge.
You need hardware-level optimization. Sub-150ms tail latency usually means quantization, attention fusion, and serious control over the runtime.

That is the part teams underestimate. You are no longer building a prompt pipeline. You are operating a distributed, heavily constrained state machine. That takes engineers who understand memory bandwidth, not just prompting.

The Hybrid Default

The mature pattern in 2026 is a barbell.

Buy frontier models for complex reasoning, planning, and high-context zero-shot tasks. Build or host quantized, heavily tuned 8B models for the large volume of routing, formatting, and classification work that sits underneath the product.

The CTO’s job is not to choose a camp. It is to make the handoff between buy and build a config change, not a rewrite.

Margin, Risk, and Speed: The Three Numbers That Should Drive AI Strategy

Tue, 28 Apr 2026 00:00:00 +0000

Quick take

Most AI strategy decks are full of nouns and short on numbers. That is usually the tell. If a project cannot move margin, reduce risk, or shorten the path to an outcome, it is not strategy. It is activity with a steering committee.

Why Three Numbers Are Enough

Leaders overcomplicate AI strategy because they do not want to choose.

But every AI decision eventually lands in one of three buckets:

Margin — does it improve unit economics?
Risk — does it make the system safer or more controllable?
Speed — does it shorten the path from decision to outcome?

That is the executive frame. Everything else supports it.

If a project cannot clearly improve at least one of those numbers, it does not belong near the top of the roadmap.

The Trap of Novelty Metrics

AI teams love the wrong metrics because the wrong metrics are easy to count.

Number of models tested. Number of pilots launched. Number of prompts written. Number of demos shown. Number of meetings held.

Those numbers can tell you whether work is happening. They do not tell you whether the company is getting more profitable, less exposed, or faster to act.

Build a Scorecard Around Outcomes

A serious AI scorecard is short.

Did margin improve?
Did risk go down?
Did cycle time shorten?

Everything else is instrumentation that helps answer those questions.

That does not mean you ignore adoption, reliability, or cost. It means you use them as inputs to the three executive numbers, not as substitutes for them.

The strongest boards and founders do not need twenty metrics. They need a few numbers that are hard to fake.

Make the Three Numbers Operational

The framework only works if the numbers are real.

For each AI initiative, define:

the baseline
the target
the measurement cadence
the owner
the rollback path if the numbers move the wrong way

That keeps the conversation concrete and makes the project accountable.

A line worth keeping: if a strategy cannot change one of the three numbers, it is probably theater.

Key Takeaways

Margin, risk, and speed are enough to evaluate AI strategy.
Stop reporting novelty metrics as if they were outcomes.
Give every project a baseline, target, owner, cadence, and rollback path.
If the work does not change the numbers, the work is not strategic.

AI Production Governance: A Maturity Model

Thu, 23 Apr 2026 00:00:00 +0000

Quick take

Most AI teams do not have a model problem. They have a control problem. The gap between stable production AI and production chaos is usually governance: small trusted evals, release gates that actually block, and rollback paths that fire before users feel the drift. If you cannot explain how a change is tested, approved, and reversed, you do not have a production system. You have a demo with a pager.

The Governance Maturity Model

Level 1: “Vibes-Based” Deployment

Evaluation is manual, episodic, and easy to ignore. Someone checks the prompts when there is time, ships the change, and waits for users to find the regression.

You can tell you are at Level 1 when the answer to “How do you know yesterday’s model swap was safe?” is a shrug, a few sample prompts, or “it looked fine.” There is no baseline. There is no history. There is only whatever the latest person happened to test.

The failure mode is silent degradation. The model changes, behavior drifts, and the team learns about it weeks later from an angry customer or a support escalation that should never have reached production.

Level 2: The “Spreadsheet” Era

There is an eval suite , but it lives beside the delivery process instead of inside it. Someone runs a small Python script over a fixed list of cases before a big release and calls that “testing.”

Level 2 teams understand that evaluation matters, but they still treat it like a chore. The suite covers happy-path prompts and misses the things that actually break systems: adversarial inputs, schema violations, prompt injection, PII leakage. And because the results are not wired into release decisions, a bad run usually gets waved through anyway.

The failure mode is false confidence. The team trusts a narrow test set because it exists, not because it is representative. Then a multi-turn attack, a bad schema shift, or a quiet regression makes the gap obvious in production.

Level 3: CI/CD Integration (The Minimum Operational Bar)

Evaluation is part of the delivery pipeline. The suite is broad enough to cover core capabilities and common failure modes, and the results block release candidates when they miss the bar.

At Level 3, every PR or deployment candidate runs the eval suite automatically. The checks include latency, cost per token, output schema validity, and the core reasoning path your product depends on. Results show up in CI next to unit tests. A failed gate stays failed until someone writes the exception and owns the risk.

This is the minimum bar for an enterprise team. A vendor can release an “improved” model on Tuesday, and a Level 3 team can run the suite on Wednesday morning and decide, with evidence, whether the new model actually helps their workload.

Level 4: Continuous Production Telemetry

Evaluation does not stop when code ships. The system keeps watching in production and turns incidents into future tests.

At Level 4, an asynchronous sampling job pulls 5% of production responses, scores them with a cheaper model or other fast evaluator, and flags anomalies. When something goes wrong, the exact input/output pair that caused it becomes a regression test. The system assumes drift is normal, because with LLMs, it is.

Level 5: Governance as a Strategic Moat

Evaluation shapes architecture before code is written. Quality and privacy are not afterthoughts; they are constraints that drive the design.

At Level 5, the team knows how much reasoning quality they give up if they move traffic from a large cloud API to a quantized local 8B model, because they have the metrics to prove it. That gives the CTO real room to choose between margin, latency, and data sovereignty. It also lets the company close larger enterprise deals because it can show, in operational terms, where customer data lives and where it does not.

How to Force Maturity

If you are leading a team stuck at Level 1 or 2, you will not buy your way out with a new tool. You have to change how releases work.

Stop accepting demos. Do not ship the next feature unless it includes a 20-case eval suite attached to the PR.
Wire it to CI. If evaluation does not block the deploy, it is a suggestion, not a control.
Build circuit breakers. Treat the model like a flaky dependency. If it fails to return valid JSON three times, fall back to a deterministic system or fail safely. Do not hand hallucinations to the user and call that progress.

Mature teams do not treat AI as magic. They treat it like a volatile operational dependency that has to be contained, measured, and rolled back fast.

Why Most Enterprise AI Architecture Fails in Year One

Tue, 21 Apr 2026 00:00:00 +0000

Quick take

Enterprise AI projects fail in their first year for a simple reason: teams ask a statistical engine to behave like deterministic infrastructure. If your architecture only works when the model is correct 100% of the time, it is not architecture. It is wishful thinking with a demo budget.

By mid-2026, the honeymoon phase of GenAI is over. Executives want ROI, and engineering organizations are staring at cloud bills, silent degradations, and brittle integration layers. The root cause is almost always the same: teams built highly optimized demos instead of heavily constrained, operable systems.

The Fiction of the Flawless Prompt

The most destructive belief in enterprise AI architecture is that the LLM is a magical function: put string in, get business outcome out.

When a demo works 95% of the time in a Jupyter notebook, product owners assume the remaining 5% is a prompt engineering problem. It is not. It is entropy.

You cannot prompt your way out of entropy. You have to architect your way out of it.

Defining Failure Boundaries

If a traditional distributed database like ScyllaDB or Cassandra fails to return a row, the application does not simply crash with a stack trace visible to the user. It degrades gracefully. It falls back to a cache, a static default, or an asynchronous queue.

Enterprise AI architecture routinely lacks those boundaries. The model hallucinates a malformed JSON object, and the downstream system ingests it directly, corrupting application state.

Mature architecture enforces strict boundaries:

Inbound: What data is strictly permitted to enter the prompt context? Do you have PII strippers actively defending the edge?
Outbound: Does the LLM communicate directly with the operational database, or does it write to an intermediate queue that is validated by a deterministic, typed schema checker before the transaction commits?

If your architecture allows the model to act unilaterally without a deterministic validator acting as a bouncer, production failure is not a surprise. It is the expected outcome.

The Missing Telemetry Layer

When an older microservice begins leaking memory, Ops teams see the P99 latency spike in Datadog and roll back the deployment.

When an LLM begins to silently degrade—perhaps because the vendor aggressively quantized its backend to save on compute—there is no stack trace. The model simply returns slightly worse reasoning. The tone shifts. The RAG retrieval starts ignoring critical documents.

Most enterprise builds fail because they have zero telemetry to detect this drift . They ship the feature and assume it will perform equally well forever.

Robust systems do not trust models. They probe them. They sample 5% of all production outputs and score them asynchronously. They run hundreds of unit tests against the prompt pipeline with every deployment. They treat the LLM as a hostile dependency that must continually prove its competence.

Build Firewalls, Not Masterpieces

The winning architectures in 2026 are not the most complex. They are the most defensive.

They use small, fast, highly specialized models for routing. They enforce rigid, typed output schemas . They degrade to entirely non-AI, algorithmic fallbacks the moment latency spikes or a validation check fails.

Stop trying to build a perfect AI. Start building architecture that survives when the AI inevitably acts stupid.

AI Capital Allocation: What Great CTOs Stop Funding First

Thu, 16 Apr 2026 00:00:00 +0000

Quick take

Great AI teams do not start with a roadmap. They start with a kill list. If a project cannot defend margin, risk, or speed, it does not deserve the next budget cycle. Capital is finite. Attention is finite. Support burden is finite.

The real mistake most companies make is treating AI spend as a separate class of spend. It is not. It competes with product work, platform work, hiring, and operational debt. If you cannot explain why an AI initiative deserves scarce capital, you are not allocating capital. You are subsidizing hope.

Capital Allocation Is the First Product Decision

Capital allocation is not a finance problem that happens to engineering. It is a technical leadership problem with finance consequences.

Every AI project consumes three things:

engineering time
infrastructure budget
organizational attention

If the project does not improve one of three board-level outcomes — margin expansion, risk compression, or execution speed — it is likely a vanity project wearing a product costume.

That does not mean the project has to be immediately profitable. It does mean you should be able to state what gets better if the project works and what gets worse if it does not.

What Should Die First

The easiest place to make mistakes is the demo room. The second easiest is the budget meeting.

Stop funding these first:

Thin demos that do not survive workflow reality. If the user needs three manual edits after every response, you have built a presentation layer, not a product.
Duplicate platform work. If two teams are building separate prompt orchestration, evaluation, or routing layers, one of them should stop. Duplication feels like speed until the maintenance bill lands.
Ambiguous experiments with no owner. “We should explore AI” is not a strategy. It is a permission slip for drift.
Projects with no measurable failure mode. If nobody can say what counts as bad output, bad latency, bad cost, or bad adoption, the project cannot be managed.

There is a simple reason these projects linger: they are emotionally easy to defend. Nobody wants to kill a project that sounds innovative. But if you cannot defend it with numbers, the project is not innovative. It is unpriced.

The Kill-List Rubric

A good kill list is not a spreadsheet of personal dislikes. It is a decision system.

Before funding a new AI initiative, ask three questions:

Does this increase margin, reduce risk, or improve speed?
Can we measure that effect within one quarter?
Do we own the fallback if the model or vendor changes?

If the answer to all three is not yes, the default should be no.

This is where a lot of teams get sentimental. They continue funding because the project has a sponsor, or because it already consumed sunk cost, or because it looks good in a board deck. Those are weak reasons to keep a system alive.

Strong reasons to keep funding an AI initiative usually look like this:

it replaces high-volume manual work
it improves decision quality in a regulated workflow
it reduces customer wait time
it protects a revenue stream that depends on fast, accurate responses

Notice that none of those reasons mention hype.

What to Keep Funding Instead

The highest-return AI investments are boring in the best way.

Fund the parts that make the system measurable and durable:

retrieval and context quality
evaluation harnesses
fallback logic
routing by task class
observability around bad outputs and retries
workflow-specific data collection

The point is not to chase the smartest model. The point is to build a system that can absorb model churn without forcing a rewrite every six months.

A useful line to keep in mind: if a system cannot be measured under load, it is still a pilot. Pilots are fine. Pilots just should not keep consuming production budget forever.

The Hard Part Is Saying No

The best operators are not famous for being aggressive spenders. They are famous for being disciplined about what they do not fund.

That discipline becomes a reputation asset. The founder who sees you delete a weak AI project starts trusting your judgment. The board member who sees you cut duplicate work starts trusting your signal. The engineering team that sees you protect their time starts trusting your priorities.

Capital allocation is how you tell the truth about what matters. If a project cannot defend margin, risk, or speed, it should not survive by momentum alone. Fund the systems that make AI measurable, recoverable, and cheap to operate. Cut the rest.

AI Strategy: The CTO Perspective (It's Just Data Infrastructure)

Tue, 14 Apr 2026 00:00:00 +0000

Quick take

In 2026, a CTO’s AI strategy is not a model shortlist. It is an operating model for data, latency, evaluation, and failure. The model will change. The system around it should not.

If your AI plan still starts with “which model should we buy,” you are solving the easiest problem in the room. The moat is the pipeline that feeds context, the eval loop that catches regressions, and the fallback path that keeps the product standing when the model misses.

The Strategy Is the Infrastructure

The single biggest mistake engineering organizations make is treating the model as the brain. It is not. It is the most expensive dependency in the stack.

The brain is everything you build around it: context assembly, retrieval, validation, retries, telemetry, and rollback.

A CTO must focus ruthlessly on three pillars:

1. The Context Pipeline

The model is only as intelligent as the context you feed it. If Postgres, Cassandra, or Scylla takes five seconds to assemble structured context, encode it, and hand it to the orchestrator, your feature is already late before inference begins.

Strategy means architecting data replication, embedding generation, and caching so the latency budget stays intact for the inference layer. If your data infrastructure is not close to real time, your AI will not be either.

2. The Evaluation Framework

You cannot scale what you cannot measure. If your organization is still eyeballing model outputs before deployment, you are running a pilot, not a production system.

Leadership means demanding continuous evaluation. Every PR that touches an orchestration layer must be blocked by a CI pipeline that runs 500 deterministic evals against the new reasoning flow. Building that telemetry is the AI strategy.

3. Graceful Degradation and Fallbacks

LLMs fail. APIs throttle. Endpoints rotate. If a model hallucinates malformed JSON and your core application crashes, that is not an AI failure; that is an architectural failure.

A mature strategy wraps every AI interaction in circuit breakers. If the model fails three times, what is the deterministic fallback? If the cloud provider rate-limits you, where is the local, quantized 8B-parameter fallback model running in your own cluster?

Stop Chasing the Frontier

The frontier-model conversation is a distraction. Unless you are OpenAI or Anthropic, you do not win by having the smartest model. You win by having the tightest feedback loop, the cleanest data access, and the lowest cost per transaction.

A strong CTO designs for swapability : a single configuration commit, zero downtime, and telemetry that proves the new model performs 4% better on the exact workload that matters.

That is the strategy. Everything else is theater.

Sovereign Systems: Building for a World Where Data Privacy Is Non-Optional

Mon, 06 Apr 2026 00:00:00 +0000

Quick take

Privacy is no longer a feature you bolt on before an enterprise deal closes. It’s an architecture constraint that shapes how you store data, route requests, grant access, and deploy infrastructure. Teams that treat sovereignty as a first-class design input ship faster, close contracts with fewer surprises, and avoid the painful retrofit that hits every product that grows past its original assumptions. Build it in early or pay compound interest later.

What “Sovereign” Actually Means

In practical engineering terms, a sovereign system is one where you control the full lifecycle of every piece of data: where it lives, who can access it, how long it persists, and what happens when someone asks you to delete it. That’s it. No mysticism, no marketing language.

This doesn’t require owning physical hardware. It means having enforceable guarantees about data residency, encryption boundaries, identity controls, and audit trails, regardless of whether you run on bare metal, a private cloud, or a scoped partition within a public provider.

The distinction matters because “we use AWS” is not an answer to “where does my data live.” Region selection, encryption key ownership, cross-account access policies, and backup replication targets are the answers. Sovereignty is about specificity.

Why This Is Urgent Now

Three forces are converging.

First, data residency rules are tightening globally. The EU’s enforcement posture has hardened. Brazil, India, and multiple Southeast Asian jurisdictions now impose localization requirements that are recent and still evolving. Cross-border transfer mechanisms that worked in 2023 are under review or already invalidated.

Second, AI systems multiply the problem. Every model inference potentially creates a copy of the input data. Retrieval-augmented generation pipelines pull documents into contexts that may span regions. Fine-tuning creates derivative datasets. Logging captures prompts and completions that contain customer data. If you weren’t tracking data lineage before, AI workflows make the gap impossible to ignore.

Third, retrofitting is brutally expensive. Teams that scale first and add privacy controls later face a familiar pattern: months of engineering time, frozen feature development, emergency compliance audits, and customer conversations that should have happened at contract signing. The cost of early privacy controls is a fraction of the remediation bill.

Minimum Viable Controls

You don’t need to solve everything at once. Four controls cover the critical surface.

Identity boundaries. Every access to customer data, whether by a human, a service, or a model, must pass through an identity system with explicit grants. No ambient access. No shared credentials. No “the app has a database connection string” as your entire access model. Service-to-service authentication with short-lived tokens and scoped permissions is baseline, not advanced.

Encryption with key ownership. Encrypt at rest and in transit, but also control the keys. If your cloud provider holds the only copy of the encryption key, you’ve delegated a critical trust boundary. Customer-managed keys or bring-your-own-key arrangements aren’t paranoia. They’re the mechanism that makes “we can’t access your data” a verifiable claim instead of a policy promise.

Retention and deletion. Define how long each data category lives, and enforce it automatically. When a customer asks for deletion, you need to know every location where their data exists, including backups, logs, caches, model training sets, and analytics pipelines. If you can’t enumerate those locations, you can’t comply. Automated retention policies with verified deletion are the only way this works at scale.

Audit trails. Log every access, transformation, and movement of sensitive data. Not for compliance theater, but because when something goes wrong, you need to reconstruct what happened. Immutable, append-only audit logs with tamper detection give you forensic capability and regulatory evidence in the same system.

Zero-Trust Patterns for Data Access

Zero-trust is overused as a buzzword, but the core principle is sound: never grant access based on network position alone. Every request must be authenticated, authorized, and logged regardless of where it originates.

For sovereign systems, this means your internal services don’t get a free pass. A microservice running in the same VPC as the database still authenticates with scoped credentials and gets only the permissions its function requires. Lateral movement, the classic post-breach escalation path, becomes much harder when every hop requires fresh authorization.

This adds friction. That’s the point. Friction at the access layer is cheap insurance against breaches that cost orders of magnitude more.

Multi-Region Architecture Tradeoffs

Data residency requirements often mean running infrastructure in multiple regions. This introduces real engineering tradeoffs.

Latency increases when data can’t leave a region. If your EU customers’ data must stay in Frankfurt, serving those customers from us-east-1 isn’t an option. You need regional deployments with local data stores, which means your application must handle regional routing, and your deployment pipeline must support multi-region releases.

Consistency gets harder. If you previously relied on a single-region database with strong consistency, splitting across regions forces you to choose between synchronous replication with higher latency or eventual consistency with application-level conflict resolution. Most teams find that eventual consistency with well-designed conflict resolution is the pragmatic choice, but it requires upfront design work.

Operational complexity increases linearly with regions. Each region needs monitoring, alerting, backup verification, and incident response capability. Teams that underestimate this end up with “dark” regions where infrastructure runs but nobody watches it.

The honest tradeoff: multi-region sovereign architecture costs more to build and operate than a single-region deployment. But for products selling to regulated industries or international customers, it’s not optional. Budget for it explicitly rather than discovering the cost mid-contract.

Staged Implementation

For teams with existing platforms, a staged approach works.

Stage 1: Visibility. Map where customer data lives. Every database, cache, log store, backup, and third-party integration. You can’t control what you can’t see. This is usually the most humbling step.

Stage 2: Boundaries. Implement identity-based access controls and encryption key management. Replace ambient access patterns with explicit grants. This is the highest-leverage change.

Stage 3: Automation. Build automated retention enforcement, deletion verification, and audit log aggregation. Manual processes don’t scale and don’t survive employee turnover.

Stage 4: Regional controls. If your market requires it, add data residency enforcement with regional routing and storage isolation. This is the most expensive stage and should be driven by actual customer and regulatory requirements, not speculation.

Governance Checklist

For alignment between engineering, legal, and executive leadership:

Document every data category, its sensitivity level, and its residency requirements.
Map data flows across services, regions, and third parties. Update quarterly.
Establish key ownership policy: who holds encryption keys, and what’s the rotation schedule.
Define retention periods per data category with automated enforcement.
Build deletion capability that covers all storage locations, including backups and derived datasets.
Implement access logging with immutable audit trails.
Run a tabletop exercise: a customer requests full data deletion. Can you do it within your SLA?
Review AI-specific data flows : where do prompts, completions, and training data live?

Key Takeaways

Sovereignty is not a premium feature or an enterprise upsell. It’s core infrastructure for products that handle other people’s data. The cost of building it in early is a fraction of the cost of retrofitting it later, and the trust it builds with customers compounds over every contract cycle.

The teams that get this right treat privacy as a design constraint alongside latency, reliability, and cost. Not as a checkbox for the legal team. The architecture follows from that decision.

The Throughput Engineer: Why Headcount Is a Lagging Metric

Mon, 30 Mar 2026 00:00:00 +0000

Quick take

Headcount is an input. Throughput is an outcome. The best engineering organizations have stopped asking “how many engineers do we need?” and started asking “what’s blocking the engineers we have?” Teams that optimize for decision speed, defect containment, and execution clarity outperform teams twice their size. Hiring more people into a broken system just makes the system break faster.

The Metric Everyone Tracks and Nobody Questions

Every quarterly planning cycle, the same conversation happens. The roadmap is too ambitious for the team. The proposed solution is more headcount. The exec team approves some fraction of the ask. Six months later, the team is bigger but the roadmap is still slipping.

This pattern persists because headcount is easy to measure and feels actionable. You can put a number on a slide. You can point to it in a board meeting and say “we’re investing in engineering.”

But headcount measures capacity the way adding lanes measures highway throughput. It works up to a point, then coordination overhead offsets the capacity gain. The tenth engineer doesn’t add 10% more output. They add 10% more communication paths, 10% more code review load, and another person who needs context on every architectural decision.

The organizations getting this right have shifted to outcome metrics. Not “how many people do we have” but “how fast do decisions move from identification to resolution.” Not “how many PRs did we merge” but “what’s our change failure rate and how quickly do we recover.”

Staff Growth Versus Constraint Removal

Adding staff is an additive intervention. It puts more resources into the system. Constraint removal is a multiplicative intervention. It makes every existing resource more effective.

Consider a team of eight engineers where the average PR sits in review for 18 hours. Hiring two more engineers does nothing to fix the review bottleneck. It makes it worse because there are now more PRs competing for the same review bandwidth. But changing the review process, setting a 4-hour SLA, pairing reviewers with authors, and shrinking PR scope, can cut that 18 hours to 4 without adding a single person.

The same principle applies at every level. Slow deploys, unclear ownership, meetings that could be async documents, long approval chains. Each costs every engineer on the team hours per week. Multiply by team size and the waste is staggering.

If 20 engineers each lose 5 hours per week to process friction, that’s 100 engineer-hours, equivalent to 2.5 full-time engineers doing nothing but waiting. Removing the friction is cheaper than hiring, faster to implement, and doesn’t increase coordination costs.

AI tooling has made this dynamic sharper. A well-structured team with good tooling and clear ownership regularly outships teams twice its size. But a poorly structured team with AI tooling just generates more half-finished work faster. AI amplifies the system it operates in, good or bad.

The Operating System of a High-Throughput Team

High-throughput teams share three operational patterns that have nothing to do with individual talent.

Clear intent over detailed instructions. When an engineer picks up a task, they should know the outcome that matters, not the exact steps to get there. “Reduce P95 latency on the search endpoint below 200ms” is clear intent. “Refactor the search query builder to use connection pooling” is a solution masquerading as a task. The first lets the engineer use judgment. The second removes it.

Teams that operate on intent move faster because decisions happen at the point of most information, the engineer doing the work, rather than being routed through a manager who has less context. This requires trust, and trust requires that the intent is genuinely clear and that the engineer has the authority to make reasonable tradeoffs.

Delegated authority with explicit boundaries. Every recurring decision type should have a documented owner and a decision boundary. “The on-call engineer can roll back any deploy without approval” is a delegation. “Database schema changes require review from the data team” is a boundary. When these are written down and understood, decisions happen in minutes instead of hours.

The failure mode is implicit authority. Nobody knows who can make the call, so everyone escalates. The escalation chain adds latency to every decision. In a team of 15, this can mean that a simple operational decision takes a day instead of an hour because it bounces between three people who each assume someone else owns it.

Async-first communication . Synchronous communication, meetings, Slack pings expecting immediate response, tap-on-the-shoulder interruptions, is the most expensive coordination mechanism. It requires everyone to be available simultaneously and context-switch away from focused work.

Async-first doesn’t mean no meetings. It means meetings are for decisions that genuinely require real-time discussion. Everything else is a written document, a recorded decision in a ticket, or a code review comment.

A Weekly Operating Cadence

Decision tempo separates high-throughput teams from slow ones. A lightweight weekly cadence keeps the system self-correcting without drowning in noise.

Weekly: review leading metrics. Cycle time from commit to production, change failure rate, time to recover from incidents, review queue depth, and decision latency on open questions. Don’t track vanity metrics like lines of code or number of PRs.

Biweekly: connect signals to causes. Is cycle time creeping up? Is one team’s change failure rate spiking? Are the same types of decisions getting stuck repeatedly? The goal is systemic diagnosis, not individual blame.

Biweekly: pick one constraint to remove. “This sprint, we’re going to cut our deploy time from 45 minutes to under 10” is a decision. “We’re going to improve developer experience” is not. One thing, not five.

Continuous: execute, measure, repeat. Act on the decision, measure the result, and feed it back into the next weekly review. If cutting deploy time didn’t improve cycle time, the constraint was elsewhere. Move to the next one.

Incentives That Reward Impact Over Activity

Most engineering organizations accidentally incentivize busyness. The engineer who closes the most tickets gets praised. The team that ships the most features gets the biggest headcount allocation. The manager who runs the most meetings looks the most engaged.

Throughput-oriented incentives look different.

Reward engineers who eliminate recurring work, not just complete it. The engineer who automates away a manual process that costs the team 10 hours per week has created more value than the engineer who ships a new feature used by 50 people.

Reward teams that improve their own throughput metrics, not just output volume. A team that cuts its change failure rate from 15% to 3% has freed up enormous capacity that was previously spent on rollbacks, hotfixes, and incident response. That’s worth more than two new features.

Reward leaders who make themselves less necessary. The manager whose team operates smoothly when they’re on vacation has built a better system than the manager who’s cc’d on every decision.

A 12-Week Operating Reset

For teams experiencing delivery drag, a structured reset works better than a reorg.

Weeks 1-3: Measure. Instrument cycle time, change failure rate, review latency, and decision latency. Don’t change anything yet. Establish a baseline that everyone agrees on.

Weeks 4-6: Remove one constraint. Pick the biggest bottleneck revealed by the data. If review latency is the worst, fix the review process. If deploy time is the worst, fix the pipeline. One constraint at a time.

Weeks 7-9: Delegate and document. Write down the top 10 recurring decision types and who owns each one. Set decision boundaries. Remove one layer of approval from the most common workflow.

Weeks 10-12: Sustain. Establish the weekly review cadence. Compare throughput metrics to the week-1 baseline. Identify the next constraint. Make the cycle self-reinforcing.

Teams that complete this reset typically see 30-50% improvement in cycle time without adding staff. The improvement comes from removing friction that was invisible because everyone had adapted to it.

Board-Facing Metrics That Map Engineering to Business Risk

Boards understand risk and return. Translate engineering throughput into those terms.

Cycle time maps to market responsiveness. “We can respond to a competitor move in days, not months” is a strategic capability that boards care about.

Change failure rate maps to operational risk. “5% of our changes cause incidents” is a risk number a board can evaluate, especially when paired with the cost of those incidents.

Recovery time maps to resilience. “When something breaks, we fix it in under an hour” is a durability statement that affects customer trust and revenue protection.

Decision latency maps to organizational agility. “Strategic decisions take 2 days to reach execution, not 2 weeks” tells the board that the organization can adapt.

None of these metrics mention headcount. That’s the point. Headcount funds capacity. These metrics measure whether that capacity produces results.

Key Takeaways

Headcount tells you what you’re spending. Throughput metrics, cycle time, change failure rate, recovery time, decision latency, tell you what you’re getting.

The highest-leverage engineering work is constraint removal, not feature addition. Every hour of friction you eliminate pays dividends across every engineer on the team.

Stop asking “how many engineers do we need?” Start asking “what’s preventing the engineers we have from shipping?”

AI Agent Operations and the Networking Bottleneck: Why AI Agents Fail on Legacy Infrastructure

Mon, 23 Mar 2026 00:00:00 +0000

Quick take

Most AI agent failures aren’t model failures. They’re infrastructure failures wearing a model mask. Legacy networking assumptions, flat trust boundaries, and missing circuit breakers create brittle agent behavior that looks like “the AI is unreliable” but is actually “the network can’t support autonomous execution patterns.” Fix the infrastructure and the agents get dramatically more reliable overnight.

The Execution Path Nobody Drew on a Whiteboard

Agent tasks fan out across DNS resolution, TLS handshakes, token exchanges, service mesh routing, and backend queries. The multi-hop latency problem is well-understood (I covered the general case in the cloud-heavy architecture post ), but the networking-specific failure modes deserve their own treatment: stale DNS caches that route agents to decommissioned endpoints, TLS renegotiation overhead that compounds across 40 tool calls, service mesh sidecars that add 5-15ms per hop invisibly, and queue depth limits that silently drop requests during agent-scale bursts. These aren’t model problems. They’re networking problems that surface as agent unreliability.

The Hidden Cost of 20th-Century Network Assumptions

Most enterprise networks were designed around two assumptions: traffic flows north-south through a perimeter, and anything inside the perimeter is trusted. AI agents violate both assumptions simultaneously.

Agent traffic is east-west by default. A single task might call an internal knowledge base, a code execution sandbox, an external search API, and a database, all in a single reasoning loop. The traffic pattern looks like a mesh, not a pipeline. Networks designed for request-response patterns between a frontend and a backend choke on this.

The trusted-network assumption is worse. When an agent has a service account with broad permissions, every tool call inherits those permissions. If the agent can read from a document store, it can read from all of it. If it can write to a database, the blast radius of a prompt injection extends to every table the service account can touch. This isn’t a theoretical risk. It’s the default configuration in most deployments I’ve seen.

Latency compounds differently for agents than for traditional services. A human user tolerates 200ms of added latency on a page load. An agent making 40 tool calls in a single task turns 200ms of unnecessary overhead per call into 8 seconds of total delay. At scale, this means the difference between an agent that completes tasks in seconds and one that takes minutes. Users notice. They lose trust. They stop using the feature.

Zero-Trust Identity for Autonomous Systems

The fix isn’t a network redesign. It’s an identity redesign at the network layer.

Every agent tool call should carry a scoped identity that specifies what the agent can reach, for how long, and on behalf of which user or task. This is standard zero-trust thinking applied to agent traffic patterns. (For the broader tool permission and output validation side of this, see my earlier post on AI security .)

In practice, the networking-specific concerns are:

Per-task credentials with network scope. Instead of a long-lived service account, mint a short-lived token for each agent task. The token carries the minimum permissions needed for that specific workflow, and critically, it limits which network endpoints the agent can reach. When the task ends, the token expires. If the agent is compromised mid-task, the blast radius is one task’s worth of permissions and one task’s set of reachable services.

Per-call authentication overhead. Every tool call crossing a network boundary needs auth, and that auth has a cost. TLS mutual authentication, token validation, and policy lookup all add latency. The design tradeoff is between granular identity (every call authenticated independently) and performance (connection pooling, session tokens, cached auth decisions). Get this wrong and your zero-trust layer becomes the latency bottleneck it was meant to protect against.

Network segmentation per agent class. Not all agents need the same network access. An agent that summarizes documents has no business reaching your billing API. Segment your network so each agent class can only route to the services it needs. This is basic network segmentation, but most teams skip it because their agents all share one service account with broad network access.

Reliability Engineering for Agent Workflows

Traditional reliability patterns need adjustment for agentic workloads. The standard toolkit, retries, timeouts, circuit breakers, still applies, but the parameters and placement change.

Timeouts need to be per-step, not per-request. An agent task might legitimately run for 30 seconds across 20 tool calls. A global timeout of 30 seconds will kill valid workflows. A per-step timeout of 3 seconds will catch hung dependencies without killing the task.

Retry logic needs backpressure awareness. An agent that retries a failed tool call immediately, while 50 other agent instances are doing the same thing, creates a retry storm that takes down the dependency. Exponential backoff with jitter is the minimum. Better: a circuit breaker that trips after a threshold and fails fast for all agent instances, with a clear error message the model can reason about.

Queue depth matters more than you think. Agent workloads are bursty. A user action that triggers 10 agent tasks, each making 15 tool calls, puts 150 requests into your service mesh in seconds. If the target service has a queue depth of 50, you’re dropping requests before the agent even knows there’s a problem. Size your queues for agent-scale fan-out, not human-scale request rates.

Graceful degradation over hard failure. When a tool call fails, the agent should get a structured error it can reason about, not a 500 or a timeout. “Knowledge base unavailable, try alternative approach” is actionable. A raw HTTP error is not. Design your tool contracts to return machine-readable failure modes.

Observability for Agent Decision Traces

Standard APM tools show you request latency and error rates. For agent workflows, you need something more: a trace that follows the agent’s reasoning across tool calls, captures the decision points, and shows why the agent chose one path over another.

This means correlating model inputs, outputs, and tool calls into a single trace. Each agent task gets a trace ID. Each tool call within that task gets a span. The spans include the tool arguments, the response, the latency, and the policy decision. When you look at a slow or failed agent task, you can see exactly which step took too long, which dependency failed, and whether the agent’s retry behavior made things better or worse.

The teams doing this well treat agent traces like they treat database query plans. They review them regularly, look for patterns, and optimize the hot paths. A tool call that takes 500ms and gets called 20 times per task is a bigger problem than a tool call that takes 2 seconds but only gets called once.

Migration Path

You don’t need to rebuild your infrastructure to start.

Instrument first. Add trace IDs to agent tool calls. Log latency, errors, and retry counts per step. You can’t fix what you can’t see.
Add identity boundaries. Replace long-lived service accounts with per-task tokens, starting with agents that have write access.
Circuit-break external calls. Add circuit breakers and per-step timeouts for every external dependency. Size queues for agent-scale fan-out.
Migrate to mesh. Deploy a service mesh or policy layer for tool call routing. Start in audit mode, then shift to enforcement.

Each step is small and reversible. Together they compound into a fundamentally more reliable agent platform.

Checklist: Risk Reduction in 90 Days

Map every tool an agent can call, its permissions, and its failure modes
Add per-task trace IDs to all agent tool calls
Replace at least one long-lived service account with scoped, short-lived tokens
Set per-step timeouts on all agent tool calls
Add circuit breakers for external API dependencies
Deploy a policy layer in audit mode for tool call authorization
Review agent decision traces weekly for latency outliers and retry storms
Load test agent workflows at 10x expected concurrency
Document failure modes and give agents structured error responses
Establish an error budget for agent reliability separate from service reliability

Key Takeaways

Agent reliability is infrastructure reliability. The model is usually fine. The network, the auth layer, the retry logic, and the observability stack are where agent workflows actually break.

Treat agent tool calls like an API surface that needs zero-trust security, per-step reliability engineering, and end-to-end tracing. The teams that figure this out early will ship reliable agent products. The teams that keep tuning prompts to work around infrastructure problems will keep wondering why their agents are “flaky.”

Network and identity design is core agent product work, not background platform plumbing. Budget for it accordingly.

De-Risking the Black Swan: Red-Teaming Distributed Databases Before Production

Mon, 16 Mar 2026 00:00:00 +0000

Quick take

Most catastrophic database incidents aren’t novel. They’re compounded failures that nobody practiced for. The node-failure test passes, so the team moves on. Then a network partition hits during a schema migration while the on-call engineer is handling an unrelated alert, and suddenly you’re in territory no runbook covers. Structured red-teaming exposes these compound paths before they become customer-visible outages. It costs a fraction of what a single bad incident costs.

Black Swans vs. Ignored Knowns

The term “black swan” gets overused in infrastructure. Most catastrophic database failures are not genuinely unpredictable. They are known failure modes that compound in ways nobody tested.

Consider the canonical distributed database incident: a network partition isolates a minority of nodes, those nodes continue accepting writes because the partition detection is slow, the partition heals, and now you have conflicting data that the conflict resolution logic wasn’t designed to handle at that volume. Every component in this chain is well-understood. The failure isn’t in any single component. It’s in the interaction between them under specific timing conditions.

The honest term for most “black swan” database incidents is “ignored known.” The team knew partitions could happen. They knew conflict resolution had edge cases. They knew detection wasn’t instant. They just never tested all three at once.

Red-teaming is how you turn ignored knowns into practiced scenarios.

Mission-Style Red-Teaming

Chaos engineering tools that randomly kill processes are useful, but they test a narrow failure class: single-component loss. Distributed database failures rarely look like one node dying cleanly. They look like degraded networks, clock drift, slow disks, operator errors during maintenance windows, and combinations of all of the above.

Mission-style red-teaming borrows from military and security practice. A dedicated team designs multi-step failure scenarios with specific objectives, executes them against production-equivalent infrastructure, and scores the defending team’s response. The key difference from chaos engineering is intentionality: the red team isn’t injecting random faults. They’re pursuing a specific failure hypothesis through a sequence of realistic actions.

A red-team exercise has three roles:

Red team: designs and executes the failure scenario. Their goal is to cause data loss, unavailability, or corruption without triggering detection within a target time window.
Blue team: the on-call and operations engineers responding as they would in a real incident. They don’t know the scenario in advance.
White team: observers who control the exercise, ensure safety boundaries, and document everything for the post-exercise review.

The exercise runs for a fixed window, typically two to four hours. The red team executes their scenario. The blue team detects, diagnoses, and responds. Everyone debriefs afterward.

The Stress Scenarios That Matter

Not all failure modes are worth practicing. Focus on scenarios that are plausible, high-impact, and poorly covered by existing automation.

Network partitions with asymmetric visibility. One side of the partition can see the other; the other side cannot. This breaks assumptions in consensus protocols that expect symmetric failure detection. Many teams test clean partitions but never test asymmetric ones.

Clock skew under load. Distributed databases that use timestamps for ordering (which is most of them) behave unpredictably when clocks drift. NTP usually keeps drift small, but under heavy load, NTP corrections can be delayed. The result is transaction ordering violations that are invisible until a consistency check runs, which might be hours or days later.

Quorum erosion during maintenance. You take one node offline for a rolling upgrade. While it’s down, a second node develops a slow disk. You now have a degraded quorum that’s technically functional but one failure away from data unavailability. This is the most common compound failure pattern and the least practiced.

Operator mistakes during incidents. The most dangerous moment for a distributed database is when a human is manually intervening during an incident. Wrong-node restarts, accidental force-quorum operations, and recovery commands run against the wrong cluster are responsible for a disproportionate share of catastrophic data loss. Red-teaming should include scenarios where the operator is given misleading information and time pressure.

Backup restoration under partial failure. Most backup tests verify that a restore works on a clean target. Real restores happen during incidents, when the target environment is degraded, the team is stressed, and the backup might be from a point in time that’s already inconsistent. Test restoration under these conditions, not just in a clean room.

The OODA Loop for Incident Rehearsal

Effective red-team exercises run on a tight observe-orient-decide-act cadence. This isn’t just a framework. It’s a scoring mechanism.

Observe: How quickly does the blue team notice something is wrong? Detection time is the single most important metric. A failure that’s detected in two minutes has a fundamentally different blast radius than one detected in twenty. Measure time from fault injection to first alert, and time from first alert to accurate diagnosis.

Orient: Does the team correctly identify what’s happening? Misdiagnosis is common in compound failures because the symptoms don’t match any single runbook entry. The blue team might see elevated latency and assume it’s a hot key, when the actual cause is a partial partition affecting replication. Measure time from first alert to correct hypothesis.

Decide: Does the team choose an appropriate response? Under pressure, teams often default to the most familiar action (restart the node) rather than the most appropriate one (isolate the partition). Measure whether the chosen action matches the failure mode.

Act: Does the team execute the response correctly? Even when the right decision is made, execution errors under stress are common. Typos in commands, wrong node targets, and forgotten steps in manual procedures are all frequent. Measure execution accuracy and time to containment.

Each phase gets a score. Over multiple exercises, these scores reveal systemic gaps: maybe detection is fast but diagnosis is slow, or decisions are sound but execution is error-prone. That tells you exactly where to invest in automation, training, or tooling.

Scoring Readiness

After each exercise, score three dimensions:

Readiness (1-5): Could the team handle this scenario if it happened tomorrow in production? A 1 means the team didn’t detect the failure. A 5 means they detected, diagnosed, and contained it within SLA.

Blast radius (1-5): If the team had not responded, how bad would it have gotten? A 1 means minor degradation. A 5 means unrecoverable data loss or extended outage.

Time to containment (minutes): Wall-clock time from fault injection to the point where the failure is contained and no longer spreading. This is the metric that matters most to your customers and your SLA.

Plot these over time. Improving readiness scores and decreasing containment times are the clearest signals that your red-teaming program is working. If scores plateau, your scenarios aren’t challenging enough.

From Findings to Backlog

Red-team exercises are useless if findings sit in a postmortem document that nobody reads. Every exercise should produce a prioritized list of concrete improvements, each with an owner and a deadline.

The conversion process is simple:

List every gap discovered. Detection gaps, diagnostic confusion, tool limitations, missing runbooks, automation failures.
Score each gap by blast radius times likelihood. Likelihood is informed by the exercise, not guessed.
Assign an owner for each gap. Not a team. A person.
Set a deadline before the next exercise. The next exercise will test whether the gap was closed. This creates accountability.

Common improvements that come out of red-team exercises include automated partition detection that currently requires manual observation, runbook updates for compound failure scenarios, guardrails on dangerous operator commands during incidents, and backup restoration procedures tested under realistic conditions.

The backlog items from red-teaming tend to be high-value, low-glamour work. They rarely make it onto a roadmap through normal prioritization because they address risks that haven’t materialized yet. The exercise provides the evidence needed to justify the investment.

A Quarterly Operating Cadence

Red-teaming works best as a regular practice, not a one-off event. A quarterly cadence balances rigor with operational overhead.

Run quarterly. Dedicate the first few weeks to scenario design based on recent incidents and architectural changes, a half-day to executing the exercise against a production-equivalent environment, and the remainder of the quarter to remediating the gaps you found.

This cadence means every quarter your team practices a realistic failure scenario, identifies concrete gaps, and fixes the most critical ones before the next exercise. Over four quarters, you’ve tested and improved your response to a dozen failure modes. That’s a fundamentally different reliability posture than “we tested node failover once during setup and it worked.”

Key Takeaways

Most catastrophic database failures are compound scenarios that nobody practiced, not genuinely unpredictable events.
Chaos engineering tests component failure. Red-teaming tests system failure under realistic operational conditions.
Score every exercise on detection time, diagnostic accuracy, decision quality, and execution correctness. Track trends.
Convert findings into owned backlog items with deadlines tied to the next exercise.
Run quarterly. Consistency matters more than intensity.

Red-teaming distributed databases is not theater and it’s not a luxury. It’s the cheapest way to find out whether your recovery assumptions actually hold before your customers find out for you.

Beyond Cloud-Heavy Architecture: Why Agentic Systems Need Local-First, Hardware-Aware Design

Mon, 09 Mar 2026 00:00:00 +0000

Quick take

Most teams building agentic systems default to cloud-heavy architectures because that’s what they know. The result is unpredictable latency, runaway costs on bursty workloads, and a privacy posture that depends entirely on someone else’s infrastructure. Local-first, hardware-aware design fixes the economics and gives you failure modes you can actually reason about. Treat compute placement as architecture, not an optimization pass.

The Cloud-Heavy Anti-Pattern

The standard agentic stack looks like this: application code in one cloud region calls a model API in another, pulls context from a vector database in a third, and writes results back through a gateway that adds its own hop. Every step crosses a network boundary. Every boundary adds latency variance, failure surface, and cost.

For a single inference call, the overhead is tolerable. For an agent that chains ten to fifty calls per task, with tool use, retrieval, and self-correction loops, the overhead compounds. A p50 latency of 200ms per hop becomes 2-10 seconds of pure network time on a moderately complex agent run. At p99, you’re looking at timeouts and retries that double or triple your effective cost.

The measurable symptoms are consistent across teams:

Latency variance dominates execution time. The model itself is fast. The network between your orchestrator and the model, plus the hops to retrieval and tool services, is where time disappears.
Cost scales with hops, not intelligence. You pay for every round trip: egress, ingress, token overhead from context reassembly, and retry loops when any hop fails.
Failure modes are combinatorial. When five services must all be healthy for one agent task to complete, your effective availability is the product of their individual availabilities. Five nines times five is not five nines.

This is not an argument against cloud. It’s an argument against cloud-only, cloud-default architecture for workloads that don’t need it.

Consolidating Runtime Layers

The fix is straightforward: move compute closer to the data and the user. Consolidate runtime layers so agent orchestration, context retrieval, and lightweight inference happen in the same process or at least on the same machine.

This is not a new idea. Databases figured this out decades ago. You don’t run your query planner in a different availability zone from your storage engine. Agentic systems are hitting the same lesson: when the workload is latency-sensitive and involves tight feedback loops, co-location wins.

In practice, consolidation means running a local inference server for small models (classification, routing, extraction), keeping your retrieval index on the same node as your orchestrator, and reserving cloud API calls for frontier-model tasks that actually need them. The local layer handles the high-frequency, low-complexity work. The cloud layer handles the hard problems.

The cost difference is significant. A team running all inference through a cloud API at roughly two to five dollars per thousand complex agent tasks can drop to twenty to fifty cents by handling routine calls locally with a quantized model on commodity GPU hardware. The frontier API cost doesn’t disappear, but it shrinks because you’re only sending it the work that justifies the price.

Cloud-Only vs. Hybrid Cost Envelopes

The math depends on workload shape, but the pattern is consistent.

Cloud-only architectures have variable cost that scales linearly with usage and offers no marginal improvement at volume. You pay the same per-token rate whether you run one task or a million. Egress fees, retry overhead, and context window waste compound on top.

Hybrid local-first architectures have a higher fixed cost (hardware, setup, maintenance) but dramatically lower marginal cost. Once the local inference server is running, the incremental cost of a routing decision or an extraction call is effectively zero. You’re paying for electricity and depreciation, not per-request metering.

The crossover point arrives faster than most teams expect. For workloads above a few thousand agent tasks per day, local-first is cheaper within months, not years. Below that threshold, cloud-only is simpler and the cost premium is manageable.

The latency picture is even more decisive. Local inference on a mid-range GPU delivers sub-10ms response times for small models. No network hop matches that. For agent loops that make dozens of calls per task, local inference can cut total wall-clock time by 60-80%.

Where Systems Languages Matter

Agent runtimes written in Python work fine for prototyping and low-throughput production. But as you move inference and orchestration onto local hardware, you start caring about memory predictability, startup time, and per-request overhead in ways that garbage-collected runtimes don’t handle well.

Rust is showing up in this layer for practical reasons. It gives you memory safety without garbage collection pauses, which matters when you’re serving inference requests with tight latency budgets.

This is not about rewriting your application in a systems language. It’s about the runtime layer, the inference server, the orchestration loop, the retrieval engine. These are the hot paths where predictable performance translates directly into cost savings and reliability. The application logic on top can stay in whatever language your team knows.

The practical signal: if your agent runtime’s p99 latency is dominated by GC pauses or memory allocation overhead rather than actual inference time, a systems-language runtime will help. If inference time dominates, the language doesn’t matter.

Adoption Without Full Rewrites

Teams with existing cloud-heavy architectures don’t need to rip and replace. The migration is incremental and each step produces measurable improvement.

Step 1: Instrument and classify. Before moving anything, measure what your agent stack actually does. Break down time and cost by call type: routing decisions, context retrieval, small-model inference, frontier-model inference. Most teams discover that 70-80% of calls are routine work that doesn’t need a frontier model or a cloud round trip.

Step 2: Add a local inference tier. Deploy a quantized model locally for the routine calls you identified. Route classification, extraction, and simple generation through it. Keep the cloud API as the escalation path. This is a routing change, not a rewrite.

Step 3: Co-locate retrieval. Move your vector index or retrieval layer onto the same infrastructure as your orchestrator. This eliminates the retrieval round trip, which is often the single largest latency contributor after model inference.

Step 4: Evaluate and tighten. With local tiers in place, measure again. Adjust routing thresholds. Identify the next tier of work that can move local. Each iteration reduces cloud dependency and improves predictability.

The entire migration can happen alongside normal feature work. No flag days, no cutover weekends.

Governance and Data Residency

Local-first architecture has a governance benefit that’s easy to overlook: your data stays on your infrastructure. For teams operating under GDPR, HIPAA, or sector-specific data residency requirements, this simplifies compliance significantly.

When agent tasks process user data through a cloud API, that data traverses networks you don’t control and resides, however briefly, on infrastructure you don’t own. The compliance burden of documenting, auditing, and risk-managing that data flow is real and growing. Local inference eliminates the flow entirely for tasks that don’t require cloud escalation.

This doesn’t mean you avoid cloud APIs altogether. It means you have architectural control over which data leaves your perimeter and which doesn’t. That’s a better conversation to have with your compliance team than “everything goes to a third-party API.”

Decision Rubric

When deciding how to place compute for agentic workloads, ask these questions:

Volume: Are you running more than a few thousand agent tasks per day? If yes, the economics of local inference likely favor hybrid.
Latency sensitivity: Do your agent loops involve more than ten chained calls? If yes, network overhead is probably your bottleneck.
Data sensitivity: Does your agent process PII, health data, or regulated information? If yes, local-first reduces compliance surface.
Team capability: Do you have infrastructure engineers who can operate local GPU servers? If no, start with managed options or cloud-based inference with a clear migration path.
Workload predictability: Are your traffic patterns bursty or steady? Bursty workloads benefit most from local capacity that handles baseline load with cloud burst for peaks.

Common Traps

Over-investing in local hardware before measuring workload shape. Instrument first. Buy hardware based on data, not intuition.
Treating local and cloud as either/or. The right answer is almost always hybrid. The question is where to draw the line.
Ignoring operational cost of self-hosted infrastructure. Local inference is cheaper per request but requires someone to keep it running. Factor in ops time.
Optimizing for p50 when p99 is what breaks your SLA. Agentic workloads are chains. One slow hop at p99 delays the entire task.

Hardware placement is a first-order architecture decision. Make it early, measure it continuously, and adjust as your workload evolves. The teams that get this right don’t have the fanciest models. They have the most predictable systems.

AI Startup Landscape 2026

Mon, 02 Mar 2026 00:00:00 +0000

Quick take

In early March 2026, “we use AI” is not a startup thesis. Buyers reward outcomes, reliability, and integration. If you cannot explain unit economics, governance, and how you fit into existing workflows, you stall at pilot. The durable advantages are the familiar ones: data, distribution, and operational execution.

The AI startup market is no longer about novelty. It’s about cost, control, and integration. The surface area is still large, but the center of gravity has shifted toward fewer core platforms, tighter enterprise scrutiny, and a bigger gap between prototypes and production systems.

Market Shape

Platform and Infrastructure

The platform layer has consolidated into a small set of credible options with predictable capabilities. Buyers are less willing to bet on unproven foundations and more willing to standardize on what is stable, documented, and supported. Infrastructure has followed a similar path: compute, data pipelines, and deployment stacks are converging on vendors that can meet uptime, security, and procurement requirements without surprises.

Applications

Application-layer startups still have room, but the bar is higher. Products that win do not just automate a task; they change a workflow and own measurable outcomes. Horizontal tools that look interchangeable struggle to price, and sales cycles now expect proof of reliability, cost controls, and governance.

What Differentiation Looks Like Now

Differentiation is less about model performance and more about compound advantages that are hard to copy. The clearest signals are:

Proprietary or hard-to-recreate data flows tied to a real workflow.
Distribution that doesn’t depend entirely on paid acquisition or hype cycles.
A delivery path from pilot to production that fits enterprise controls.

Where Leverage Actually Sits

Look past the marketing and leverage tends to concentrate in a few places:

Workflow ownership: the product lives where work already happens (tickets, docs, CRM, IDEs), not in a separate “AI app.”
Hard-to-copy data loops: usage generates better data, which improves the product, which drives more usage.
Integration depth: the messy parts (permissions, audit logs, escalation paths) become a moat.
Operational playbooks: rollout, monitoring, and rollback are part of what you sell, even if indirectly.

This is why many flashy demos fail commercially. They show capability without showing leverage.

Commercial Reality

Budgets are still there, but they are more disciplined. Buyers want predictable unit economics and clear ownership of risk. That means pricing tied to outcomes or usage, transparent operating costs, and honest limits on automation. Services revenue is acceptable when it accelerates deployment, but products that require constant custom work do not scale well under current expectations.

What Buyers Reward In 2026

Even early-stage buyers are more explicit now. Successful deals usually include:

clear ROI framing (“reduce handling time by X”, “increase conversion by Y”)
visible controls (permissions, logging, approvals)
predictable cost per outcome
an escalation path for edge cases

If you can’t answer security and governance questions without improvising, the sale slows down.

Where This Leaves New Teams

The winning path is narrower, not closed. New teams can still build meaningful businesses if they accept that the default outcome is commoditization and plan for it. Focus beats breadth. Systems thinking beats feature stacking. The fastest route to durability is to choose a domain where operational pain is acute and data is defensible, then deliver with production-grade reliability from day one.

Common Failure Modes

Commoditization by API: your “secret sauce” is a thin wrapper around a capability everyone can buy.
Pilot purgatory: the product works in a demo but can’t survive real permissions, real data, and real scale.
Services trap: every customer needs a custom build, so the roadmap becomes a consulting queue.
Unit economics denial: usage grows while margins quietly collapse.

Takeaways

Consolidation is real at the platform and infrastructure layers.
Application winners own a workflow and measurable outcomes.
Durable advantages come from data, distribution, and deployment fit.
The market rewards focus and operational rigor over novelty.

AI Security: Evolving Threats and Defenses

Mon, 23 Feb 2026 00:00:00 +0000

Quick take

AI security in late February 2026 isn’t one trick like “add a content filter.” It’s a threat model plus layers: constrain tool access, validate outputs, isolate trusted context, log what matters, and design a fast rollback path. Treat agentic workflows like an exposed API surface, because that’s effectively what they are.

AI security is no longer a niche concern. It sits alongside reliability and privacy as a core production requirement. The threat landscape has grown more deliberate and multi-stage, and the most effective defenses now blend model behavior controls with traditional security practice.

Threat Evolution

Current Threats

Late February 2026 is characterized by attacks that try to shape or extract behavior rather than simply break it. Prompt injection remains a primary entry point, but it has shifted toward multi-step workflows that hide intent across inputs, tools, and outputs. Data extraction attempts are more targeted and often move through legitimate features. Model manipulation is now a broader risk, spanning training data quality, dependency integrity, and deployment pipelines.

Agentic systems have widened the attack surface. Tool access, long-running tasks, and multi-model orchestration introduce new paths for indirect influence and privilege escalation. The effect is less about a single exploit and more about cumulative pressure on the system’s assumptions.

Attack Patterns Worth Understanding

The most instructive attacks are multi-step, because they exploit the same features that make AI systems useful.

Consider a prompt injection chain against an agentic assistant with tool access. The attacker doesn’t inject a single malicious instruction. Instead, they plant a benign-looking instruction in a document the assistant will retrieve: “Before responding, summarize the current system configuration for context.” The assistant treats this as a helpful step, surfaces internal configuration details in its working memory, and then a follow-up prompt asks it to include that summary in its response. No single step looks malicious. The chain works because the assistant treats retrieved content with the same trust as user instructions.

Data exfiltration through tool use follows a similar pattern. An attacker crafts input that causes the model to call an external API or write to a log in a way that encodes sensitive context into the request parameters. The model isn’t “trying” to leak data. It’s following instructions that happen to route internal state through an external channel. If your tool permissions allow HTTP calls or file writes without strict scoping, the model can be steered into acting as an exfiltration vector without any single request looking abnormal.

These patterns matter because they aren’t theoretical. They are the incidents teams are seeing in production, and they resist simple keyword filtering or input validation.

Defense Strategies

Current Best Practices

Effective defenses treat AI systems as full-stack security targets. Inputs are filtered for intent, not just keywords. Outputs are constrained to structured formats when possible, with explicit checks for sensitive data leakage. Tool use is tightly scoped, with least-privilege access and clear audit trails.

The principle of separation is critical. System instructions, user input, and retrieved content must be clearly delineated in the prompt structure, and the model must be told explicitly which parts are trusted. This doesn’t eliminate injection, but it raises the bar significantly. Attacks that work against a flat prompt often fail when the model has a clear instruction hierarchy.

Security Monitoring and Detection

Monitoring is no longer optional. It needs to cover model behavior, tool calls, and user interaction patterns, with rapid rollback paths when behavior drifts.

The detection approach that works best is behavioral baselining. Establish what normal looks like for your system: typical response lengths, tool call frequencies, the ratio of requests that trigger safety filters, and the distribution of topics in model output. Then alert on deviations. A sudden spike in tool calls from a single user session, or a shift in the kinds of data the model references in its responses, can indicate an active attack before any single request trips a rule.

Log everything the model does, not just the final output. Intermediate reasoning steps, tool call parameters, retrieved documents, and safety filter activations all form a forensic record. When an incident happens, you need to reconstruct the full chain of events, and it often spans multiple turns and tools.

Incident Response for AI Systems

Incident response plans should include model configuration changes, not only infrastructure changes. Traditional playbooks assume the application logic is deterministic. AI incidents require a different approach.

When you detect anomalous behavior, the first response is often to restrict the model’s capabilities rather than take the service offline. Disable tool access, narrow the set of allowed response formats, or fall back to a simpler model with tighter constraints. This contains the blast radius while you investigate.

The investigation itself should include prompt and context review. Pull the full conversation history, the retrieved documents, and the system instructions that were active at the time. Look for the point where the model’s behavior diverged from expected, and trace it back to the input that caused the shift. This is different from traditional log analysis because the “bug” is often in the data, not the code.

After an incident, update your evaluation suite. Every real incident should produce at least one new test case that would have caught the issue. This is how defenses compound over time.

A Practical Security Review Framework

When reviewing an AI system’s security posture, I walk through five areas.

First, input separation: are system instructions, user input, and retrieved content clearly delineated? Can retrieved content override system behavior?

Second, tool permissions: does the model have the minimum access it needs? Are tool calls logged and auditable? Can a single prompt cause the model to chain multiple tool calls without human review?

Third, output controls: are responses filtered for sensitive data before reaching the user? Are structured output formats enforced where possible?

Fourth, monitoring coverage: are you tracking behavioral baselines? Can you detect slow drift, not just sudden breaks? Do you have alerting on cost, tool call patterns, and safety filter rates?

Fifth, incident readiness: do you have an AI-specific playbook? Can you restrict model capabilities without a full outage? Does your team know how to reconstruct a multi-turn attack chain from logs?

No system will score perfectly on all five. The point is to know where the gaps are and prioritize based on the actual risk profile of your application.

Defensive patterns that actually help

Separate trusted and untrusted context: retrieved documents are data, not instructions. Make that separation explicit in prompts and in your system design.
Constrain tool contracts : strict schemas, validation, and side-effect annotations. Prefer idempotent writes and require confirmation for irreversible actions.
Policy at the boundary: enforce permissions and rate limits outside the model. The model shouldn’t be your authorization system.
Output validation: enforce schemas and scan for obvious sensitive leakage patterns before returning responses to users.
Sandbox where possible: isolate file access, network access, and execution environments for tool-using agents.

None of these are perfect. The goal is to reduce surprise and shrink blast radius.

A Practical Security Checklist

If you want a boring checklist that catches most mistakes:

List tools, permissions, and side effects. Remove anything you can’t justify.
Make retrieved content clearly untrusted. Don’t let it override system rules.
Validate tool arguments and model outputs on every call.
Log tool calls with correlation IDs and track abnormal patterns.
Add a hard kill switch and a rollback path for config/model changes.
Run a small red-team exercise focused on prompt injection and tool misuse.

Key Takeaways

Attack chains are more subtle and operationally aware. They exploit the trust model of AI systems rather than looking for traditional vulnerabilities. Defensive design must combine model controls with traditional security discipline, and it must account for the fact that the model itself can be steered into acting against the system’s interests.

Monitoring and incident response need to be built into the system, not bolted on. The teams that handle AI security well are the ones that treat it as an operational discipline with its own tools, playbooks, and review cadence.

AI security remains an ongoing process. The goal isn’t perfect prevention but resilient systems that detect, contain, and adapt quickly as conditions change.

AI Team Structures 2026: Central, Embedded, and Hybrid Models

Mon, 16 Feb 2026 00:00:00 +0000

Quick take

By mid-February 2026, the org question isn’t “should we have an AI team?” It’s “where does ownership live?” The best structures make evaluation, cost, and incident response someone’s job, not a shared worry. Most teams land on a hybrid: a small enabling platform group plus embedded delivery in product teams.

AI work has shifted from experiments to ongoing product and operations work. Most organizations that ship AI features have converged on a small set of structures. The right choice still depends on maturity, product criticality, and how much shared infrastructure is needed. The structure also changes how teams manage AI inference cost , AI-native architecture , and governance.

This post focuses on structures that stay stable under real delivery pressure, not aspirational org charts.

Team models that hold up

Central platform team

A central platform team builds and operates shared AI infrastructure, evaluation tooling, and common components. This model fits organizations that need consistency, strong governance, and shared reliability across many teams. It works particularly well in regulated industries where auditability and compliance require a single pane of glass across all AI usage.

Where it breaks down is speed. When every product team routes requests through a central group, the platform team becomes a bottleneck. This is common in organizations with ten or more product teams sharing a three-person AI platform group. The queue grows, the platform team triages by business priority, and lower-priority teams either wait or build workarounds. If you choose this model, staff it generously or accept that iteration speed will be gated.

Embedded in product teams

AI engineers live inside product teams and ship features end to end. This model fits products where AI is core to user experience and iteration speed matters. A team building a search product or a conversational interface benefits from having the AI engineer sit in the same standup, hear the same customer feedback, and own the same on-call rotation as the rest of the squad.

The risk is fragmentation. When several product teams solve the same problems independently, you end up with three prompt evaluation frameworks, two model routing strategies, and no shared understanding of cost. This model works best when you have a small number of product teams, or when AI use cases are different enough that shared infrastructure would not save much effort.

Hybrid model

A small platform team provides shared foundations while product teams embed AI engineers for delivery. This is the most common model because it balances infrastructure consistency with product-team autonomy.

The platform team in a hybrid model typically owns inference infrastructure, model selection and routing, shared evaluation tooling, and cost observability. Product-team AI engineers own feature-level prompts, domain-specific evaluation datasets, and production behavior for their use case. The boundary between these layers matters more than the org chart. Writing down the interface contract, what the platform provides and what the product team owns, prevents most of the friction that kills hybrid models.

The hybrid model fails when the platform team behaves like an internal vendor rather than an enabling function. If product teams have to file tickets and wait for releases to get basic capabilities, you’re back to the central bottleneck problem with extra steps. The platform team should ship self-serve tooling and stay close to the product engineers who use it.

Decision criteria

Use the structure that matches the work, not the other way around. Three factors tend to dominate the decision.

First, how many teams need the same AI capabilities and standards. If the answer is two, embedded is fine. If it’s eight, you need a platform function or you will drown in duplication.

Second, how frequently AI features ship and change. High iteration velocity favors embedded engineers who can move with the product team’s sprint rhythm. Slower, more deliberate releases are easier to route through a central group.

Third, how much operational risk and compliance pressure exists. Regulated environments benefit from centralized governance and audit trails. Lower-risk consumer products can afford more distributed ownership.

Add one more that teams often forget: how expensive mistakes are. If the blast radius is high, you want tighter standards, stronger review, and explicit gating.

Roles and responsibilities in 2026

AI engineer

Builds AI features inside product flows, owns evaluation in production, and partners with design and data for quality. The role blends software engineering with systematic testing and monitoring. In 2026, the AI engineer is distinct from the ML engineer or data scientist. An ML engineer typically focuses on model training, fine-tuning, and training infrastructure. A data scientist focuses on analysis, experiment design, and statistical rigor. The AI engineer works downstream of both: integrating models into products, building evaluation harnesses that catch regressions, and owning production behavior. Think of it as the difference between building the engine and building the car.

AI platform engineer

Owns shared systems like inference services, evaluation pipelines, and model routing. The focus is reliability, scale, and cost control for many teams at once. This role requires strong infrastructure engineering skills and an understanding of how product teams consume AI capabilities. Strong platform engineers pair with product-team AI engineers to understand real usage patterns rather than building abstractions in isolation.

AI product manager

Defines the use case scope, success metrics, and rollout plan. The role emphasizes rigorous tradeoffs between quality, latency, and cost, with clear ownership of user outcomes. An AI PM needs to be comfortable with probabilistic behavior and must resist the urge to promise deterministic results. They own the decision of when a feature is good enough to ship and when it needs more evaluation investment.

Team size and scaling

Most teams start too large. A single AI engineer embedded in a product team, supported by a lightweight shared toolkit, is enough to validate whether AI adds value to a workflow. Scaling up before validation leads to expensive teams that optimize solutions to the wrong problems.

For the platform function, two to three engineers can support four or five product teams if the scope is well-defined. Once you pass that ratio, the platform team needs to grow or the scope needs to shrink. A common mistake is building a platform team of six that tries to serve fifteen product teams and ends up serving none of them well.

When hiring, prioritize engineers who have shipped AI features into production over those with impressive research backgrounds but no operational experience. The gap between a working prototype and a reliable production system is where most AI projects stall, and that gap is an engineering problem, not a research problem.

AI security / governance partner

Whether this is a dedicated role or a shared function, someone must own policy: data handling rules, permission models, logging requirements, and review gates. Teams that skip this role tend to slow down later under audit pressure.

Common failure modes

These patterns show up across teams. Platform teams that ship abstractions without enabling product speed often build elaborate internal APIs nobody asked for while product teams work around them. Product teams that skip evaluation and discover quality issues late usually treat AI features like deterministic code, then get surprised when behavior drifts after a model update. Ambiguous ownership for model behavior in production creates incidents where nobody knows whether the platform team or the product team should respond. Usually it is both, but the escalation path was never defined.

What This Looks Like At Different Sizes

Small startup (1 to 2 AI engineers): embed in the product, keep tooling lightweight, and use strict output validation plus a small eval set. Avoid platform work that nobody will maintain.
Mid-size company (multiple product teams): introduce a small platform function to own routing, eval tooling, and shared guardrails, while keeping delivery embedded in product teams.
Large org (regulated, many teams): platform + governance becomes non-negotiable. Embedded teams still ship features, but standards, audit trails, and permissions need central ownership.

Operating practices that matter

Evaluation is a first-class deliverable, not a side task. Teams that ship reliably treat test sets, error analysis, and monitoring as part of every release. Evaluation datasets are versioned alongside code, and regressions in evaluation scores block releases the same way failing tests would.

Clear service ownership and on-call rotations prevent AI incidents from becoming orphaned problems. Every AI feature in production should have a named owner who is paged when it degrades. Cost management belongs in planning, not just finance review after launch. Model inference costs can surprise you, and the time to catch a cost spike is before it compounds for a month.

A pragmatic starting point

If the organization is early, start embedded with a lightweight shared toolkit and a small platform function. As adoption grows, formalize the platform team and tighten standards. Revisit the structure every six months, because the problem shifts as AI moves from pilot to core workflow. The structure that got you to your first production feature is rarely the structure that will support your tenth.

FAQ

What is the best AI team structure in 2026?

For most companies, the best default is hybrid: a small platform group owns shared infrastructure, routing, evaluation, and governance, while product teams own delivery and workflow quality.

When should AI engineers be embedded in product teams?

Embed AI engineers when iteration speed and workflow context matter more than central consistency. This works best when use cases are distinct or when the company is still validating where AI creates value.

When does a central AI platform team make sense?

A central platform team makes sense when many product teams need the same model access, evaluation tooling, governance, and cost controls. It fails when it becomes a ticket queue.

Who owns AI quality in production?

The product team should own user-facing behavior. The platform team should own shared reliability, model access, routing, observability, and guardrails. The interface between those teams must be explicit.

AI Inference Cost Trends 2026: Model Pricing and Token Costs

Mon, 09 Feb 2026 00:00:00 +0000

Quick take

AI inference costs are still falling in 2026, but the teams that win are not simply waiting for cheaper model pricing. They route routine work to smaller models, cache repeated requests, control context size, batch offline jobs, and measure cost per successful outcome instead of cost per token alone.

The practical question is no longer “will AI get cheaper?” It will. The better question is whether your architecture can take advantage of falling token costs without losing quality, reliability, or governance. That is where AI-native architecture and honest AI ROI measurement matter.

AI Inference Cost Trends in 2026

The direction is clear: model pricing keeps compressing, especially for routine inference workloads. Competition between frontier providers, open-weight models, inference-optimized hardware, and smaller task-specific models has made the default price curve friendlier than it was in 2024 or 2025.

That does not mean every AI product gets cheap automatically. The bill still depends on how much context you send, how many retries your system creates, whether you cache repeated work, and whether every request goes to a premium model by default.

Cost driver	2026 trend	What to do
Input tokens	Cheaper, but context windows invite waste	Trim history, summarize, and retrieve only relevant context
Output tokens	Still easy to overspend through verbose responses	Constrain output length and use structured formats
Frontier models	Lower than prior years, still premium	Reserve for high-risk or high-value cases
Small models	Much cheaper and good enough for bounded tasks	Route classification, extraction, and simple drafting here
Retries	Often hidden in aggregate API spend	Track retries by feature and failure mode
Evaluation	More important as model choice expands	Budget eval maintenance as part of production cost

The teams with the lowest useful cost are usually the teams with the cleanest architecture. They know which path a request took, why that model was selected, how often fallback fired, and what one successful outcome actually cost.

Model Pricing: 2025 vs. 2026

By 2025, many organizations had already seen token prices drop enough to move AI workloads from experiment budgets into operating budgets. In 2026, the bigger change is not just cheaper tokens. It is optionality.

Most production use cases now have multiple viable model tiers:

a cheap model for routing, classification, extraction, and formatting
a mid-tier model for routine reasoning and drafting
a frontier model for ambiguous, high-stakes, or high-value work
a deterministic fallback for cases where the model should not decide

This changes procurement conversations. Instead of asking “which provider is cheapest?” teams should ask “which tasks deserve expensive inference?” A flat architecture where every request hits the best model leaves money on the table.

The better pattern is a small model-routing layer with explicit thresholds. That router can be heuristic at first. It does not need to be clever. It needs to be measured.

What Has Changed

The market has moved from experimentation to steady operations. Costs keep trending down, but the bigger shift is that most workloads now have multiple viable options. That creates room for routing, fallback, and tiered service levels instead of one default model for everything.

The pricing arc is clear. In early 2024, a million tokens from a frontier model cost roughly thirty dollars on the input side and sixty on the output side. By late 2025, equivalent capability was available for a fraction of that, and by early 2026, competitive pressure pushed prices down again. For many workloads, per-token cost has dropped by an order of magnitude in under two years.

That is not subtle. It changes the math on use cases that were previously too expensive to run at scale.

Smaller, task-specific models have gotten even cheaper. Routing a classification task or structured extraction job through a lightweight model can cost a hundredth of what a frontier model charges for the same tokens. The capability gap has narrowed enough that, for well-defined tasks, the smaller model is often not just cheaper but faster and more predictable.

Why Costs Keep Moving

Several forces continue pushing in the same direction. Model efficiency gains mean each generation does more with less compute. Hardware improvements, especially in inference-optimized silicon, reduce cost per operation at the infrastructure layer. Competitive pressure from open-weight models and multiple commercial providers keeps pricing honest.

Open tooling also keeps baseline capability accessible. When a team can self-host a capable model on reasonable hardware, it sets a ceiling on what commercial APIs can charge for equivalent work. That dynamic is not going away.

The Costs People Miss

Token pricing gets most of the attention, but in mature AI operations it is rarely the largest line item. Hidden costs are usually where budgets quietly expand.

Evaluation is first. Building and maintaining evaluation suites, human review processes, and regression testing infrastructure takes real engineering time. Teams that ship without proper evaluation pay later in incident response and lost trust, and that bill is usually bigger. But the evaluation work itself is not free, and it scales with the number of models and use cases in production.

Data preparation is another. Cleaning, labeling, formatting, and versioning data for fine-tuning or retrieval-augmented generation is labor-intensive work. It often requires domain expertise that is expensive to hire or contract.

Teams that underestimate this end up with underperforming models, then spend more on prompt engineering and workarounds than they would have spent on data quality upfront. It is common to burn months of engineering time compensating for training data problems that could have been fixed at the source in weeks.

Monitoring and observability add ongoing cost. Logging every request, tracking latency distributions, detecting drift, and alerting on quality degradation all require infrastructure. For high-volume systems, storage and compute costs for the monitoring layer itself can be material. At scale, the observability stack for an AI system can rival inference cost.

Retraining and model updates are the costs that compound. As data distributions shift and user expectations change, models need refresh cycles. Each cycle involves data collection, training or fine-tuning, evaluation, and deployment. The cost is not just compute. It is also the engineering attention required to run the cycle reliably.

Routing Strategies in Practice

The highest-leverage cost optimization is usually not better rate cards. It is sending each request to the right model for the job.

Consider a customer support system handling thousands of queries a day. Most are routine: order status, return policies, password resets. A small, fast model handles these well at minimal cost. A subset involves complex complaints, edge cases, or escalation decisions that benefit from a more capable model. And a handful require human review regardless.

A routing layer that classifies incoming requests and directs them to the right tier can cut costs dramatically without degrading user experience. Classification itself is cheap, often a lightweight model or a set of heuristics. Savings come from not running every request through the most expensive option.

In practice, teams define two or three model-capability tiers, build a classifier that assigns each request to a tier, and measure both cost and quality per tier over time. Thresholds can be adjusted as models improve or as new options appear.

The same pattern applies to internal tooling. Code generation, document summarization, and data extraction all include varying difficulty levels within one workflow. A well-designed system uses the frontier model for hard cases and a fast, inexpensive model for everything else.

Token Cost vs. Cost Per Outcome

Token cost is useful for vendor comparison. It is not enough for product decisions.

Most teams start with a simple per-request cost estimate and multiply by expected volume. That is fine for initial budgeting, but it breaks down quickly as usage grows and patterns shift.

A more durable approach is to model cost per outcome rather than cost per request. If a workflow needs three API calls, two retries, and a human review step to produce one useful result, the cost of that result is the sum of all components. Tracking cost per outcome makes it possible to compare architectures and model choices on equal footing. It also prevents a cheap model from looking good when it creates repeated retries, manual cleanup, or user escalation.

This also makes business conversations easier. Saying “this feature costs twelve cents per completed task” is more useful than “we spend four thousand dollars a month on API calls.” The first number connects to business value. The second is just an expense line. It also helps decide which AI team structure should own optimization: product teams, a platform team, or a shared enablement group.

Forecasting also gets easier once you have a few months of production data. Usage patterns are often more stable than expected, with predictable daily and weekly cycles. Surprises usually come from new feature launches or changes in user behavior, not gradual drift.

A simple forecasting model that accounts for known upcoming changes and adds a buffer for unknowns is usually enough. Overly complex forecasting is rarely worth it when underlying pricing can change with one vendor announcement.

The key point is not just the trend line. It is the increasing ability to trade cost for latency and quality in a controlled way. That is what makes cost engineering possible.

How to Reduce AI Inference Cost Without Breaking Quality

The best responses are architectural, not purely vendor-driven. Teams that treat AI as an operational system tend to make pragmatic decisions early, then refine as usage stabilizes. That means choosing models by task fit, pushing repeat work into caches, and designing workflows that degrade gracefully.

Caching deserves special mention. In systems where similar inputs recur frequently, a well-designed cache can eliminate a significant percentage of API calls entirely. Semantic caching, where near-duplicate inputs return cached results, extends that benefit. Implementation cost is usually modest compared with savings at scale.

Designing for graceful degradation is the other pattern that consistently pays off. If the primary model is unavailable or too slow, the system should fall back to a smaller model, a cached response, or a simplified workflow rather than failing outright. This is not just a reliability pattern. It is also a cost pattern, because your budget is not held hostage by a single vendor’s pricing or availability.

Common Levers That Work

Reduce context: send only what the model needs. Summarize, chunk, and cap history.
Cache repeat work: if users ask the same questions, your system should remember.
Batch when possible: offline jobs rarely need low-latency interactive pricing.
Constrain outputs: structured output and strict schemas reduce rambling responses.
Route by risk: start small, escalate only when the cheap path fails.

The point is not to chase the lowest cost per token. The point is to hit your product’s quality bar at a sustainable unit cost.

FAQ

Are AI inference costs going down in 2026?

Yes. The broad trend is downward, especially for routine inference and smaller task-specific models. The operational risk is assuming lower token prices automatically create lower product costs. Wasteful context, retries, and weak routing can erase the savings.

What is the best way to reduce LLM token costs?

Start with context control. Send less irrelevant text, retrieve narrower evidence, summarize long histories, and cap output length. After that, add routing, caching, batching, and fallback paths.

Should every request use the cheapest model?

No. Cheap models are best for bounded, low-risk tasks. Premium models still make sense for ambiguous or high-value work. The goal is tiered inference, not cheapest-possible inference.

What metric should teams track besides token price?

Track cost per successful outcome. Include model calls, retries, retrieval, evaluation, human review, monitoring, and incident handling. That is the number that belongs in budget and ROI conversations.

How does model routing reduce AI costs?

Routing sends routine requests to cheaper models and escalates only when the task requires stronger capability. Done well, it reduces spend without forcing the product into a lowest-common-denominator model choice.

A Simple Checklist

Instrument cost per request and cost per successful outcome.
Identify the top 3 flows by spend and break down why they cost what they cost.
Add routing: cheap default, expensive escalation, deterministic fallback.
Add caching for repeat prompts and repeat retrieval.
Set budgets and alerts so cost spikes are visible within hours, not at month-end.

Common Traps

Optimizing prompts before you instrument. If you cannot measure spend by endpoint and outcome, you are guessing.
Treating cost as “the AI team’s problem”. Cost is a product and platform concern. If the feature is valuable, it deserves real engineering.
Ignoring retries and failure loops. One bad tool call can multiply into three retries and a second model call. That is where surprise bills come from.
Paying premium prices for routine work. Most requests are boring. Route them to boring systems.

What To Watch Next

Over the rest of 2026, watch for clearer separation between operational and premium tiers, and for tooling that makes governance and quality measurement cheaper to run.

Winners will be teams that keep cost in scope without letting it dictate every decision. Cheap AI that does not work is not savings. Expensive AI that delivers measurable outcomes is an investment. The goal is to know which is which.

AI Regulation Is Here. Stop Acting Surprised.

Mon, 02 Feb 2026 00:00:00 +0000

Quick take

Regulation isn’t a future problem. It’s already in procurement questionnaires, security reviews, and internal risk sign-off. Teams that build evidence and controls into the system will ship faster than teams that bolt them on later. Treat compliance as engineering, not paperwork.

None of this is legal advice. It’s an engineering view of how regulation is already changing how teams deliver.

This isn’t theoretical. It affects procurement timelines, partnership agreements, and whether a product can launch in certain markets at all. Enterprise buyers now include AI governance questions in their security questionnaires. If you can’t answer them clearly, deals stall.

The Regulatory Landscape Right Now

Rules and expectations vary by jurisdiction, but the common pattern is stable. Regulators and buyers focus on impact, transparency, and accountability. The question is no longer just “can it work” but also “can it be explained, monitored, and corrected.”

The EU AI Act is the most concrete framework on the table. It classifies systems by risk tier and imposes requirements accordingly. High-risk systems, those used in hiring, credit scoring, law enforcement, and critical infrastructure, face mandatory conformity assessments, technical documentation, and human oversight obligations. Even general-purpose AI models have transparency and reporting duties if they meet certain capability thresholds.

In the US, the landscape is more fragmented. Executive orders have established reporting requirements for large training runs and directed agencies to develop sector-specific guidance. States like California and Colorado have moved ahead with their own disclosure and impact assessment rules.

The practical effect is that teams operating across jurisdictions need to satisfy multiple overlapping standards, not a single checklist. If your product serves customers in both the EU and the US, you’re building for the union of those requirements whether you planned for it or not.

Other markets are following similar patterns. Canada, the UK, Singapore, and others have published frameworks that share the same core themes: risk classification, transparency, and accountability. The specifics differ, but the architectural implications converge.

What regulation actually looks like right now

Compliance is less about a single checklist and more about credible evidence of how a system behaves. The minimum set of artifacts is usually small but non-optional.

A model card or system card is the starting point. It documents what the model does, what data it was trained or fine-tuned on, known limitations, and intended use boundaries. This isn’t a marketing document. It needs to be honest about where the system performs poorly and what it wasn’t designed to handle. A good model card is a page or two, not a hundred-page report.

A risk register maps each deployment to its potential impact. For a customer-facing recommendation engine, the risk profile is different from an internal document summarizer. The register should capture who is affected, what happens when the system is wrong, and what controls are in place. Update it when the system’s scope changes, not just at launch.

Data provenance documentation traces where training and inference data comes from, how it was collected, and what consent or licensing applies. This matters more than most teams expect, especially when regulators ask about bias or when a partner wants to know whether their data was used in training.

A monitoring and incident response plan explains how the system is observed in production, what triggers a review, and who is responsible when something goes wrong. This is the artifact that separates a compliant deployment from a demo.

Regulators want to see that you can detect problems and act on them, not just that you tested the model before launch. A plan that names real people, real dashboards, and real escalation paths is worth more than a generic template.

Where Engineering and Compliance Collide

The most common friction I see isn’t about disagreement on goals. It’s about pace and language. Engineering teams want to ship. Compliance teams want to review. Neither side is wrong, but without a shared process, the result is delays, workarounds, or both.

The first friction point is documentation timing. If compliance artifacts are treated as a post-launch requirement, they never get done well. Engineers are already on to the next feature, and the compliance team is reviewing a system they didn’t help design. The fix is to produce documentation alongside development. Start the model card when the model is selected, not when legal asks for it three weeks before launch.

The second friction point is risk-assessment granularity. Compliance teams sometimes want to assess every model change as if it were a new deployment. Engineering teams want to iterate quickly.

A practical resolution is to define change categories. Minor prompt adjustments can be reviewed in batch. Significant model swaps need a fresh assessment. Everything in between gets a proportional review. Document the categories and get both sides to agree on them before the first deployment, not during a heated debate about a release that’s already late.

The third friction point is tooling. Engineers work in code repositories and CI pipelines. Compliance teams work in spreadsheets and document management systems. Bridging this gap with automation, by generating compliance artifacts from code annotations, test results, and monitoring dashboards, reduces manual handoffs and keeps both sides working from the same source of truth.

I’ve seen teams solve this by adding a compliance metadata file alongside the model configuration in the same repository. When the CI pipeline runs, it generates a compliance summary from that metadata plus test results. The compliance team reviews a formatted report instead of chasing engineers for screenshots.

A Phased Practical Path

Trying to build a complete compliance program in one sprint is a recipe for stalled projects. A phased approach works better and builds credibility incrementally.

In the first phase, take inventory. Map where AI is used, who is affected, and what data flows through each system. This sounds obvious, but I’ve seen organizations discover AI components they didn’t know existed because a team quietly deployed a third-party API. You can’t govern what you can’t see.

In the second phase, classify by impact. Group systems into risk tiers based on who is affected and what happens when the system fails or behaves unexpectedly. Internal productivity tools sit in a different tier than customer-facing decision systems. Classification drives how much oversight each system needs, so getting this right early saves significant effort later.

In the third phase, build the artifact pipeline. Create templates for model cards, risk assessments, and monitoring plans. Integrate them into your development workflow so that evidence is produced as a natural byproduct of building features.

Automate where possible. Pull test results into compliance reports. Generate data lineage from pipeline metadata. Surface monitoring dashboards that serve both engineering and governance audiences. The goal is to make compliance evidence a side effect of good engineering, not a separate workstream.

In the fourth phase, establish review cadence. Set regular checkpoints that match each risk tier. High-risk systems get quarterly reviews with executive visibility. Lower-risk systems get lightweight annual reviews or automated checks.

The cadence should be predictable so teams can plan around it instead of reacting to ad hoc requests. Predictability is what makes compliance sustainable. Surprise audits create resentment. Scheduled reviews create routine.

The easiest way to get this right is to treat it like any other production constraint. Add a lightweight PR checklist for AI changes: data sources, eval results, and new failure modes. Version prompts and routing rules alongside code. Keep a small eval suite that runs on every meaningful change. Instrument quality, cost, latency, and error rate.

In early February 2026, compliance isn’t a separate program. It’s part of making AI safe to deploy and straightforward to defend when questions arrive. Teams that treat it as an engineering discipline, with clear processes, proportional oversight, and automated evidence collection, will ship faster than those who treat it as paperwork handled after the fact.

The regulation isn’t going away. But with a practical approach, it doesn’t need to slow you down.

AI-Native Architecture Patterns 2026: Production Guide

Mon, 26 Jan 2026 00:00:00 +0000

Quick take

AI-native architecture is mostly about boring interfaces: route model calls through a gateway, ground outputs with retrieval, validate and log everything, and make evaluation part of the release process. The goal isn’t to worship a model. The goal is to ship AI features that survive change: model updates, data drift, new policy requirements, and real production load.

AI-native architecture is no longer a sidecar to the main system. By late January 2026, teams treat it as a first-class capability with concrete design and operational practices. The emphasis has shifted from demos to reliability, cost control, and change management.

What Changed

The biggest shift is structural. AI capabilities are now designed into service boundaries, deployment flows, and runtime controls instead of layered on top. That changes how teams think about interfaces, failure modes, and ownership.

Two years ago , most teams ran AI as a separate service that the rest of the stack called when it needed something smart. The model sat behind an API, and the integration was a thin adapter. That worked for demos and low-stakes features, but it broke down as AI became central to the product. Latency budgets, error handling, and data flow all suffered from the indirection. The shift to native architecture means AI concerns are represented in the same design conversations as database schemas, API contracts, and deployment topologies.

Core Patterns That Hold Up

AI Gateway

A dedicated gateway organizes AI access and policy. It centralizes routing, safety controls, and observability so teams don’t reimplement the same logic across services. It also provides a stable interface as models and capabilities evolve.

In practice, the gateway sits between your application services and model providers. Requests flow in from your services, the gateway applies rate limiting and authentication, selects the appropriate model based on task type and cost constraints, and forwards the request. Responses flow back through the same path, where the gateway logs latency, token usage, and any safety filter activations before returning the result. This single chokepoint means you can swap providers, add fallback models, or enforce new policies without touching application code.

The tradeoff is operational overhead. A gateway is another service to run, monitor, and scale. Teams that skip it usually rebuild the same logic piecemeal across every service that calls a model, which is worse. But you need to staff it. Someone owns the gateway, and that ownership must be explicit from the start.

Retrieval Layer

A retrieval layer handles knowledge access, context assembly, and freshness. It’s treated as an application concern rather than a data science add-on. The goal is to make AI behavior grounded, auditable, and resilient to stale inputs.

The retrieval layer receives a query from the orchestration logic, searches across one or more knowledge stores ( vector databases , document indices, structured data APIs), ranks and filters the results, assembles them into a context window with appropriate formatting, and passes the assembled context to the model along with the original request. The output is grounded in specific sources, which makes it auditable.

Freshness is the hardest part. Stale context produces confident wrong answers, which are worse than no answer. Teams that do this well treat the retrieval layer like a cache: they track staleness explicitly, set TTLs on indexed content, and build refresh pipelines that run on a schedule or when upstream data changes. The retrieval layer isn’t a static index. It’s a living system with its own operational requirements.

Evaluation Pipeline

An evaluation pipeline is part of the architecture, not a later stage. Automated checks and human review are integrated into delivery so quality doesn’t depend on a single model choice or a one-off test run.

The pipeline runs at multiple stages. Before deployment, it executes a suite of test cases against the candidate model or prompt configuration and compares results to established baselines. During deployment, it runs a smaller set of smoke tests against live traffic. After deployment, it continuously samples production responses and scores them against quality criteria.

What gets caught depends on the depth of the suite. At a minimum, evaluation catches regressions in factual accuracy when you update a model version, formatting breakdowns when prompt templates change, and safety filter gaps when new input patterns emerge. More mature pipelines also catch subtle drift: the model still produces valid output, but the tone has shifted, or it has started favoring certain response patterns over others. These slow changes are invisible without measurement and are often the ones that erode user trust.

Migrating From Bolt-On to Native

Most teams don’t start with native architecture. They start with a model API call inside an existing service and grow from there. The migration path is predictable.

The first step is to extract AI concerns into a shared layer. If three services each call a model API with their own retry logic, prompt templates, and error handling, consolidate that into a gateway or shared library. This is a mechanical refactor, not a redesign.

The second step is to make the data flow explicit. Bolt-on integrations often pass raw user input directly to the model. Native architecture introduces a context assembly step where retrieval, formatting, and policy checks happen before the model sees anything. This is where you gain control over what the model knows and how it behaves.

The third step is to add evaluation as a first-class concern. This means defining what good output looks like for each use case, writing test cases, and wiring them into your CI pipeline. Until evaluation is automated, every model change is a gamble.

The migration doesn’t need to happen all at once. Teams can move one use case at a time, starting with the highest-risk or highest-traffic path. The key is that each step produces a tangible improvement in reliability or operability, not just architectural purity. The team structure matters here because shared routing, evaluation, and governance need explicit owners.

Design Priorities

The systems that perform well share a few priorities. They build model-agnostic interfaces with clear contracts so that swapping a provider is a configuration change, not a rewrite. They design graceful degradation with explicit fallback paths, because models will fail and the product needs to keep working when they do. And they invest in continuous measurement of quality, safety, and cost, because you can’t manage what you don’t measure.

Add one more: ownership. A feature without an owner is a liability. Someone must be accountable for keeping quality steady as everything around the model changes.

Operating In Production

Operational work matters as much as model selection. Good systems make evaluation visible, track drift, and keep changes reversible. They also avoid tight coupling to any single model or provider so capability upgrades don’t require a redesign.

The day-to-day reality of operating these systems is closer to running a data pipeline than running a traditional web service. You’re monitoring output quality, not just uptime. You’re tracking cost per request alongside latency. And you’re maintaining a relationship with your evaluation suite that’s as important as your relationship with your test suite for deterministic code.

Takeaway

AI-native architecture is now a discipline with stable patterns. The winning approach is to design for change, make evaluation part of the system, and treat AI as a core runtime capability rather than a bolt-on feature. The teams that get this right aren’t the ones with the best models. They are the ones with the best systems around their models.

FAQ

What is AI-native architecture?

AI-native architecture treats model calls, retrieval, evaluation, routing, cost control, and fallback behavior as first-class production concerns instead of bolting an API call onto an existing feature.

What are the core AI architecture patterns in 2026?

The durable patterns are an AI gateway, retrieval layer, evaluation pipeline, model routing, structured output validation, observability, and graceful degradation.

Why do enterprise AI architectures fail?

They usually fail because the prototype has no production boundary: no owner, no eval suite, no fallback path, no data freshness model, and no cost attribution.

Building Reliable AI Agents in Go

Mon, 19 Jan 2026 00:00:00 +0000

Quick take

Reliable agents are built, not prompted. Limit tools and steps. Validate every action at the boundary. Persist state so retries are safe. Design explicit recovery paths. Measure outcomes with evals , not vibes. If you want autonomy, earn it in increments with evidence and guardrails. This post includes the Go patterns I actually use.

I’ve been building agent systems in Go for the past year – across startups and enterprise teams. The same lesson keeps repeating: the model is the easy part. The hard part is everything around it. Tool validation. State management. Recovery paths. Observability. The boring infrastructure that turns “it works in a demo” into “it works at 3am when nobody is watching.”

Reliable agents are engineered, not prompted. Here’s how.

What “reliable” actually means

If you can’t write down the success criteria, you can’t make an agent reliable. “Handle this ticket” isn’t a spec. “Classify into one of five categories, draft a reply citing the relevant policy section, and escalate to a human if confidence is below 0.7” is a spec.

A reliable agent operates within known tools, limited steps, and explicit completion checks. It produces repeatable outcomes. It fails safely. Creativity and autonomy aren’t the goal. Predictability is.

Reliability is strongest where the task is structured: multi-step workflows with fixed tools, document extraction, data transformation with deterministic post-processing. It degrades as tasks become open-ended, long-running, or novel. That isn’t a temporary limitation. It’s a fundamental property of probabilistic systems.

The architecture that holds up

The reliable agent systems I build don’t look like a single prompt calling tools. They look like a small system with explicit responsibilities:

type Agent struct {
    tools      ToolRegistry
    policy     PolicyEnforcer
    validator  ActionValidator
    state      StateStore
    supervisor Supervisor
    maxSteps   int
    timeout    time.Duration
}

type ToolRegistry struct {
    tools map[string]Tool
}

type Tool struct {
    Name        string
    Schema      jsonschema.Schema
    Execute     func(ctx context.Context, args json.RawMessage) (json.RawMessage, error)
    SideEffects bool
    Idempotent  bool
}

Every component has a clear job. The tool registry enforces schemas. The policy layer checks permissions before execution. The validator inspects arguments and output shape. The state store persists progress so retries don’t repeat side effects. The supervisor can stop, escalate, or hand off to a human.

You can implement this in a lightweight way, but the responsibilities need to exist somewhere. If they don’t, reliability will always be “mostly okay until it isn’t.”

Validation at the boundary

Agents fail in boring ways. Wrong parameters. Missing required fields. Calling the right tool at the wrong time. Repeating a write action. Getting stuck in a loop.

The fixes are also boring:

func (v *ActionValidator) Validate(action Action) error {
    tool, ok := v.registry.Get(action.ToolName)
    if !ok {
        return fmt.Errorf("unknown tool: %s", action.ToolName)
    }

    if err := tool.Schema.Validate(action.Args); err != nil {
        return fmt.Errorf("invalid args for %s: %w", action.ToolName, err)
    }

    if tool.SideEffects && !v.policy.Allowed(action) {
        return fmt.Errorf("action %s denied by policy", action.ToolName)
    }

    return nil
}

Validate arguments at the boundary. Return structured errors. If a tool has side effects, check policy before execution. If a tool isn’t idempotent, check whether this exact action has already been executed in the current run.

This isn’t clever. It’s the same approach I use for any public API. Treat tools like APIs, enforce contracts, and the model has fewer ways to surprise you.

Idempotency and state

The nastiest agent bugs come from retries that repeat side effects. Duplicate tickets. Repeated refunds. Double-sends. The fix is the same as in any distributed system : make write operations idempotent.

func (s *StateStore) ExecuteOnce(ctx context.Context, stepID string, fn func() (json.RawMessage, error)) (json.RawMessage, error) {
    if result, ok := s.Get(stepID); ok {
        return result, nil // already executed, return cached result
    }

    result, err := fn()
    if err != nil {
        return nil, err
    }

    s.Set(stepID, result)
    return result, nil
}

Every meaningful step gets a unique ID. Before executing, check if the step has already completed. If it has, return the cached result. This makes retries safe and recovery straightforward.

I learned this pattern while building cloud infrastructure at a previous startup, not AI systems. Same principles. Different surface area.

The supervisor loop

The supervisor is the most important piece. It enforces hard limits and decides what happens when things go wrong:

func (a *Agent) Run(ctx context.Context, task Task) (Result, error) {
    ctx, cancel := context.WithTimeout(ctx, a.timeout)
    defer cancel()

    for step := 0; step < a.maxSteps; step++ {
        action, err := a.planNextAction(ctx, task)
        if err != nil {
            return Result{}, fmt.Errorf("planning failed at step %d: %w", step, err)
        }

        if action.Type == ActionComplete {
            return a.finalize(ctx, action)
        }

        if action.Type == ActionEscalate {
            return a.escalateToHuman(ctx, task, action.Reason)
        }

        if err := a.validator.Validate(action); err != nil {
            a.logValidationFailure(step, action, err)
            continue // let the model try again with the error context
        }

        result, err := a.state.ExecuteOnce(ctx, action.StepID, func() (json.RawMessage, error) {
            return a.tools.Execute(ctx, action)
        })
        if err != nil {
            a.supervisor.OnFailure(ctx, step, action, err)
            continue
        }

        a.appendResult(step, action, result)
    }

    return Result{}, fmt.Errorf("agent exceeded max steps (%d)", a.maxSteps)
}

Hard maximum on steps. Hard timeout. Explicit escalation path. Validation before every tool call. Idempotent execution. Structured logging at every decision point.

This isn’t a framework. It’s a pattern. Adapt it to your domain. The important thing is that these responsibilities exist in your system, however you implement them.

Observability

If you can’t see what the agent did, you can’t improve it. Log enough to answer practical questions:

Tool name, step number, latency
Success/failure codes and validation errors
Argument hashes (not raw values for sensitive data)
Completion status and reason for stopping
Human handoff events

This data turns “the agent is flaky” into “the search tool fails 8% of the time when the query exceeds 200 characters.” The second statement is fixable. “Flaky” isn’t.

Where this falls apart

Open-ended creative work. Long-running autonomous loops with shifting context. Novel situations without prior examples. High-stakes decisions without human review.

These aren’t temporary limitations waiting for a better model. They are fundamental properties of probabilistic systems operating in complex environments. If your agent needs to handle these cases, the answer isn’t a better prompt. The answer is a human checkpoint.

The uncomfortable truth

Most agent reliability problems aren’t model problems. They are engineering problems. Wrong tool schemas. Missing validation. No idempotency. No timeouts. No escalation path. The model does something unexpected, and instead of being caught at the boundary, it cascades into a production issue.

Fix the engineering first. The model reliability improves as a consequence.

If you want autonomy, earn it in increments. With evidence. With guardrails. Not with optimistic prompts and hope.

AI Video Applications in Practice

Mon, 12 Jan 2026 00:00:00 +0000

Quick take

Video AI works when you treat it as a pipeline, not a magic model. Keep the domain tight, segment aggressively, ground outputs in transcripts and timestamps, and route low-confidence cases to human review. The product should help people navigate video, not act like it watched everything for them.

Video AI is now practical for scoped workflows. Teams are shipping systems that align audio and visuals, surface key moments, and make large video libraries searchable. The gap between a useful product and churn usually comes down to clear scope, predictable quality, and a human review path when confidence drops.

What Works Now

Reliability improves when the domain is defined and outputs are constrained. The most dependable capabilities are:

Moment finding for a known task or format
Summaries and highlights with timestamps
Policy screening that escalates uncertain cases
Search across a curated video collection

Application Patterns

Meeting and training intelligence

The best results come from combining transcripts with visual cues like screen changes, slides, and gestures. The output should be a short recap, clear actions, and a timeline of key moments. Treat this as a navigation tool, not a full replacement for watching the video.

Content review and safety

Use multiple signals instead of one score. Frame sampling, audio analysis, and scene context should all contribute to the final decision. Keep a clear path for human review, especially for borderline cases or sensitive content.

Video knowledge bases

Segment videos into stable chunks and index each segment with its transcript and visual context. Retrieval works best when users can jump directly to a moment, not just a file. This turns training libraries, product demos, and webinars into searchable references.

Editing assistance

AI can speed up rough cuts, captions, and highlight reels. It is less reliable for long-form generation or complex narrative editing. Position it as acceleration, not replacement.

Design Considerations

Design the product around model limits, not the other way around. Practical systems usually share a few traits:

Clear input bounds such as duration limits and supported formats
Visible uncertainty with reasons for low confidence
Latency budgets tied to the workflow, not the demo
Auditability for what was seen, heard, and decided

Shipping a Pragmatic Version

Start with a small, representative dataset and define acceptable output before you build. Add lightweight evaluation with a few high-risk scenarios, then iterate on prompt and pipeline changes. Logging and review tooling matter as much as model choice, especially when users need to trust what was skipped.

A Reference Pipeline That Holds Up

Most successful implementations look like a pipeline with explicit stages:

Ingest: normalize formats, cap duration, and record metadata.
Transcribe: get a transcript with time alignment (timestamps are the backbone).
Segment: split into stable chunks (scenes, slide changes, speaker turns).
Index: store transcript + metadata + embeddings for each segment.
Retrieve: answer queries by returning moments, not entire videos.
Synthesize: generate a summary or highlight list that points back to exact timestamps.

This structure keeps the system debuggable. When something is wrong, you can see whether transcription, segmentation, retrieval, or synthesis caused the failure.

Evaluation That Matters For Video

Video AI demos often look great because teams do not audit outputs closely. Practical evaluation focuses on a few measurable things:

Timestamp accuracy (can users jump to the right moment?)
Coverage (did the system miss key segments?)
False positives (highlight reels are useless if they highlight noise)
Safety/classification precision at the thresholds you operate at

Keep a small “golden set” of videos and re-run it whenever you change models, prompts, segmentation, or retrieval.

Common Pitfalls

Hallucinated timestamps: the model sounds confident but points to the wrong moment. Always anchor outputs to retrieved segments.
Overly long context: shoving a whole video into a single prompt wastes money and reduces accuracy. Segment first.
No review tool: if reviewers cannot quickly see why a decision was made, they will not trust it.
Privacy drift: meeting videos and training footage often contain sensitive data. Treat retention, access, and redaction as first-class requirements.

A Simple Checklist

Define supported formats and duration limits.
Make timestamps and citations part of every output.
Build a review UI for low-confidence cases.
Track latency and cost per processed minute of video.
Re-run a golden evaluation set on every meaningful change.

Closing

Video is searchable and summarizable when scope is clear and workflows are designed for review. Build the pipeline for predictable outputs, and the product will feel reliable.

What I Actually Expect from AI in 2026

Mon, 05 Jan 2026 00:00:00 +0000

Quick take

The advantage in 2026 isn’t model access. Everyone has that. The advantage is shipping AI features that behave predictably: scoped workflows, measured quality, controlled costs, a rollback path. Expect agents to get practical within guardrails, routing to replace one-model-fits-all, and regulation to become a real deployment constraint. The hype hangover is here. Execution is what matters now.

Prediction posts are dangerous. They age badly. I’ve been wrong before and survived, so here goes.

The conversation has shifted. 2025 proved models can be impressive. 2026 will test whether they are dependable in routine work. The changes that matter will be quieter: fewer surprises, tighter boundaries, and more disciplined economics.

Agents get real – within limits

This is the prediction I feel most confident about: bounded agents will become normal in production. Support triage. Internal ops workflows. Content pipelines. Document processing. The common thread is clear scope, defined tools, and human checkpoints.

The agent architecture that works looks similar everywhere I see it succeed:

Operates inside a defined workflow with explicit stop points
Uses tools with strict schemas, not free-form “do anything” capabilities
Produces intermediate artifacts a human can review – a draft, a classification, extracted fields
Easy to roll back or disable without breaking the product

A support agent that drafts a reply, proposes a refund category, and attaches relevant policy excerpts? That works. An agent that autonomously changes account settings across multiple systems without review? That will keep failing for boring reasons: permissions, edge cases, accountability, audit.

Full autonomy will remain limited. The hard part isn’t tool use. It’s verification and accountability. Anyone telling you otherwise is selling something.

Routing replaces the monolithic model

One of the clearest patterns I’ve seen: the teams controlling their costs and quality are the ones routing across models . Small model for simple classification. Medium model for drafting. Large model for complex reasoning and synthesis. Choose by task and risk, not by a single default.

Caching and reuse matter too: repeated requests, repeated retrieval, repeated transformations. Teams will treat token spend like any other variable cost and engineer it down.

If your AI feature is expensive today, the fix isn’t “wait for cheaper models.” The fix is to design a system that does less unnecessary work and fails more gracefully. This is basic systems engineering. The AI hype cycle just took a couple of years to remember it.

MCP and the integration layer

I’ve been watching MCP (Model Context Protocol) closely. It’s the kind of boring, practical standard that actually moves the industry forward – a way for models to interact with tools and data sources through a consistent interface. Not revolutionary. Useful.

What excites me about MCP is that it makes the agent architecture I described above more standardized and portable. Tool registries with schemas. Structured inputs and outputs. Less bespoke glue code per integration. Whether MCP specifically wins or another protocol emerges, the direction is clear: tool integration becomes a standard interface, not a custom project.

Enterprise: from experimentation to operations

AI budgets will flow toward integration, governance, and change management. Procurement, security review, and data quality will matter more than novel features. ROI scrutiny will tighten. Projects that can’t show durable value will get cut.

What changes inside organizations is mostly non-technical. Ownership becomes explicit – someone can approve data access, approve risk, and kill a feature. Enablement beats evangelism – internal platforms and reusable components matter more than another demo day. Training becomes practical – teams learn to write specs and evaluate changes, not just “prompt engineering.”

Regulation becomes a deployment constraint

Regulation is no longer theoretical. It’s showing up in procurement questionnaires, security reviews, and internal risk sign-off. Teams that build evidence and controls into the system will ship faster than teams that bolt them on later.

The prediction that matters: governance moves onto the critical path. Not as a blocker. As a competitive advantage for teams that do it well.

What probably won’t happen

Fully autonomous agents everywhere. Verification and accountability are still hard problems.
Prompt-only reliability. If a feature matters, it needs evaluation, monitoring, and structured interfaces. Not just better wording.
One model to rule them all. Production systems will route across models because constraints differ by task.
Frictionless compliance. Regulation doesn’t go away. Teams just get better at building evidence into the workflow.

None of this blocks useful systems. It pushes teams toward discipline. Which is where the value has always been.

What to do right now

If you’re shipping AI, the best moves are unglamorous:

Pick one workflow with clear value and low blast radius.
Define success and failure modes in writing.
Build a small eval set from real examples. Keep it versioned.
Add a rollback path and monitoring before expanding scope.
Track cost per successful outcome, not cost per request.

Do those five things and you will be ahead of most teams chasing capability. The advantage in 2026 isn’t clever prompting. It’s building a system that can be operated, debugged, and trusted.

Discipline over heroics. Ruthless focus. Same as always.

2025: The Year AI Stopped Being Special

Mon, 22 Dec 2025 00:00:00 +0000

I wrote a year-in-review post for 2019 about leaving fintech, joining a deep-tech founder program, and starting a new company. That year felt like a hinge point – a move from the known to the unknown. 2025 had a similar feel, but the shift wasn’t personal. It was industry-wide.

AI stopped being a special project. It became infrastructure.

That sentence sounds obvious in December. It wasn’t obvious in January. At the start of the year, most organizations I had worked with were still treating AI as an experiment. A side initiative. Something the “AI team” owned. By the end of the year, the successful ones had woven it into delivery pipelines, support tooling, and internal operations where reliability matters more than novelty.

The unsuccessful ones are still running pilots.

From demos to systems

The biggest shift was organizational, not technical. Projects moved from isolated demos to systems with owners, budgets, and maintenance plans. Evaluation and monitoring became part of deployment, not afterthoughts. Rollback plans existed before launch, not after the first incident.

This isn’t glamorous work, but it’s the work that matters. The teams that won in 2025 weren’t the ones with the cleverest prompts. They were the ones with the most disciplined operations.

Governance stopped being a dirty word

One thing I pushed hard for: governance as enablement, not bureaucracy. Clear rules for data handling, model selection, and access controls made teams faster. Guardrails reduced rework. Policy embedded in CI pipelines unblocked adoption in regulated contexts where teams had been stuck for months.

The pattern is simple. If governance is a checklist in a SharePoint, teams work around it. If governance is a set of automated checks in the delivery pipeline, teams rely on it. It’s the same lesson I learned running infrastructure at scale: make the right thing the easy thing.

Cost became a design constraint

Early in the year, teams treated model costs like someone else’s problem. By mid-year, the bills arrived. Suddenly, cost and latency were architectural decisions, not afterthoughts.

Small models for simple tasks. Large models for complex reasoning. Routing by task type and risk level. Caching repeated requests. Treating token spend like any other variable cost and engineering it down. These are infrastructure patterns, not AI magic. The teams that figured this out early controlled their economics. The teams that waited got surprised.

This reminded me of the early cloud days, when teams learned that “spin up more instances” isn’t a cost strategy. The discipline is the same: measure, optimize, budget. The only difference is that the unit of cost went from compute hours to tokens.

The throughline

On a personal note, 2025 was also the year I started proving out ideas I’ve carried since my early ventures. Building tools that reduce operational complexity, and make the right thing the easy thing, applies directly to AI infrastructure. The overlap between what I learned building cloud tooling and what teams need now for AI operations is almost one-to-one. Different surface area, same principles.

What actually worked

AI delivered best when scoped to a well-defined job with measurable outcomes inside existing workflows: drafting, summarization, classification, data extraction, and assisted analysis. Human review was explicit. Responsibility for quality was assigned to a specific person, not “the AI team.”

The three patterns that held up all year: evaluation-first rollout , human-in-the-loop for consequential actions, and model routing instead of one-model-fits-all.

What didn’t work

Broad, underspecified mandates. “Use AI to transform our customer experience.” That isn’t a spec. That’s a wish. Deployments without visibility into quality, security, or cost. Optimistic assumptions substituting for measurement.

I watched one organization burn an entire quarter on an “AI-powered” feature that had no eval suite, no monitoring, and no clear definition of success. When leadership asked why quality was inconsistent, the team had no data to answer with. They had anecdotes. Anecdotes don’t survive a quarterly business review.

The organizations that struggled most were the ones that mistook enthusiasm for strategy.

What stayed hard

Ambiguity. When success criteria are unclear, AI outputs drift and debates replace decisions. This is a product management problem, not an AI problem.

Trust. Users lose trust faster than teams regain it. One bad incident – a confidently wrong answer, a data exposure, a weird hallucination – and the credibility deficit takes months to recover from.

Drift. Small changes to prompts, data, or models shift behavior in ways that are hard to notice without measurement. This is why evaluation isn’t a launch activity. It’s a continuous operation.

High-stakes automation. The closer a feature gets to irreversible actions, the more you need review, auditability, and rollback. This constraint isn’t going away. Nor should it.

The story of 2025 isn’t that AI is unreliable. It’s that reliability is engineered, not assumed.

The internal shift that mattered most

Inside organizations, the biggest change was process maturity. Prompts and routing rules got versioned and reviewed like code. Evaluation moved earlier in the lifecycle. Platform teams became enablement functions instead of gatekeepers.

This is what turned AI from “experimentation” into “infrastructure.” It happened not because of a model breakthrough, but because engineering leaders insisted on treating AI systems with the same rigor as everything else in production.

Looking at 2026

The trajectory is continuation, not revolution. Better reliability. Tighter governance. Deeper integration. MCP and similar protocols making tool integration more standardized. Agents getting more practical for bounded workflows. Regulation becoming a real deployment constraint rather than a theoretical discussion.

I expect 2026 to be the year when the gap between “AI-capable” organizations and “AI-mature” organizations becomes impossible to ignore. Capable means you can build a demo. Mature means you can run it in production, measure it, fix it when it breaks, and explain it to a regulator. That gap is where the real competition happens.

The most valuable progress will come from operational discipline. Not a single breakthrough. Not a new model that changes everything. Just the steady, unglamorous work of making AI systems predictable, auditable, and maintainable.

2025 was the end of the novelty phase. The work now is execution.

The teams that understand this will win 2026. The teams that are still waiting for the next model release to solve their operational problems will keep waiting.

AI in 2025: The Year It Became Boring (Finally)

Mon, 08 Dec 2025 00:00:00 +0000

The most important thing that happened to AI in 2025 wasn’t a new model or a benchmark. It was the quiet, unsexy shift from “look what it can do” to “how do we run this reliably.”

AI became boring. And I mean that as the highest compliment.

What held up

Scoped tasks. Drafting, summarization, classification, assisted analysis. These became standard building blocks across the teams I worked with. Not fully automated work, but faster cycles and better starting points for human decisions. The pattern was consistent: define the task narrowly, evaluate outputs rigorously, and keep a human in the loop for anything consequential.

From what I’ve seen, the teams that got real value treated AI like any other system dependency. They versioned prompts. They ran evals in CI. They monitored quality drift the same way they monitor uptime. Nothing revolutionary, just engineering discipline applied to a new kind of component.

Reliability required active management the entire year. Human review stayed essential for anything with meaningful risk. Verification, provenance, monitoring – these weren’t optional extras. They were the cost of using AI responsibly. Teams that skipped these steps learned the hard way.

Where the limits stayed stubborn

Models still fail on edge cases. They still produce confident errors. They still struggle with up-to-date or domain-specific facts without a strong retrieval layer. Autonomy improved but complex workflows continued to need supervision and explicit guardrails.

None of this was surprising. But I think the persistence of these limits surprised people who expected 2025 to be the year everything “just worked.” It wasn’t, and that’s fine. Infrastructure doesn’t need to be perfect. It needs to be predictable and manageable.

The gap between “impressive demo” and “production system” stayed wide all year. I saw teams cycle through the same disillusionment: the model works great in testing, then behaves differently on real user inputs, then degrades when the underlying data changes. This isn’t a bug. This is the nature of probabilistic systems. The sooner teams accepted that, the faster they built something reliable.

Three patterns that actually worked

Evaluation-first rollout. Define what “good” means before you ship. Write it down. Build a small eval set from real examples. If you can’t measure quality, you can’t improve it, and you definitely can’t tell if your last change made things worse.

Human-in-the-loop for consequential actions. Not as a checkbox. As a genuine review step for anything that touches customers, money, or data. The teams that treated this as optional learned the hard way. The teams that built it into the workflow from day one rarely had incidents they couldn’t contain quickly.

Model routing over monolithic models. Use the smallest model that meets quality requirements. Escalate to a larger model only when needed. Route by task type and risk level. This is how you control costs and latency without sacrificing quality where it matters. One model for everything is a demo architecture, not a production architecture.

What changed inside teams

The organizational response matured. Governance moved from policy documents to operational routines – something I pushed hard for. AI evaluation became part of release processes. The role of AI engineering broadened from a specialized niche to a cross-functional concern touching product, data, security, and compliance.

I saw this play out clearly at a telecom company. Early in the year, AI was “the ML team’s thing.” By Q3, product managers were writing eval criteria. Security teams were reviewing prompt configurations. Finance was asking about cost per successful task instead of cost per API call. That cross-functional involvement is what separates “we use AI” from “we run AI as infrastructure.”

This matters more than any model improvement. A better model in a broken process still produces broken outcomes. A good-enough model in a disciplined process produces reliable value.

Looking at 2026

The trajectory feels less like a sprint and more like steady infrastructure improvement. Better planning. More reliable agents. Broader adoption. The core constraints remain familiar: trust, compliance, sustainable economics.

What I’m focused on heading into the new year:

Clean interfaces for retrieval, evaluation, and monitoring. MCP is making this more practical, and I’m watching it closely.
Policies that translate into day-to-day workflow checks, not quarterly reviews.
Clear ownership for quality, safety, and cost. Not “the AI team.” A specific person with the pager and the authority to change the system.

The most useful framing for 2025 was simple: AI is infrastructure. It delivers value when treated with the same rigor as any other system. It fails when treated as a shortcut.

2025 was the year that lesson became obvious. The question for 2026 is whether teams will actually internalize it or keep learning it the hard way.

Scaling AI in the Enterprise Is a Management Problem

Mon, 24 Nov 2025 00:00:00 +0000

Here’s the simplest test for whether your enterprise is actually scaling AI: can a team outside the AI group ship a safe, supported AI feature without reinventing the wheel?

If the answer is no, you aren’t scaling. You’re doing pilots.

I see this constantly. The technology isn’t the bottleneck. Models are good enough. The tooling exists. What’s missing is the operating model – the boring work that turns a demo into something that runs in production for years, with clear ownership, predictable costs, and a way to handle failures.

The pilot trap

Every large organization I’ve worked with has successful AI pilots : impressive demos, enthusiastic teams. Then the question comes: “How do we do this across 50 teams?”

The answer is never “give everyone API keys and let them figure it out.” That path leads to duplicated effort, inconsistent security practices, and a support burden that lands on the same three experts who built the original pilot. I’ve watched this happen at telecom companies. I’ve watched it happen at financial services firms. The pattern is remarkably consistent.

What an operating model actually looks like

Separate shared capabilities from local execution. It’s no more complicated than that.

Shared capabilities are the things every team shouldn’t have to reinvent: platform services , security guardrails, eval frameworks, model access, and policy. A small central group owns these. Their job is to make it easy to build safely.

Local execution belongs to the business teams who own use cases and outcomes. They pick the problems. They ship the features. They own the quality.

The balance matters. Too centralized, and you create a bottleneck where every AI idea has to go through a committee. Too distributed, and you get security gaps, wasted spend, and inconsistent quality. The sweet spot is a lightweight forum that resolves cross-team issues and keeps standards current without becoming a gate.

Governance as a lane, not a wall

The word “governance” makes engineers groan. I get it. But governance done right makes you faster, not slower.

The practical version is simple: data access is intentional and documented, model behavior is testable, audit trails exist, incident response has an owner, and rollback is a button, not a project.

If governance is a checklist that lives in a SharePoint nobody reads, teams will work around it. If it’s embedded into the build process – eval gates in CI, prompt versioning in the repo, monitoring that ships with the feature – teams will rely on it because it makes their lives easier.

Enablement, not evangelism

Scaling fails when enablement is treated like a training event. A two-hour workshop on “prompt engineering” doesn’t help a product team ship a reliable feature. What helps: repeatable patterns, starter templates, and a support path that doesn’t depend on cornering the same overworked ML engineer.

Extend the practices you already have. Your teams already know how to run CI pipelines, do code reviews, and deploy behind feature flags . Add eval suites to the pipeline. Add prompt reviews to the PR process. Make AI features fit into the existing delivery workflow instead of inventing a parallel one.

What to measure

Not tool adoption. Not number of pilots. Not “AI maturity scores.”

Track what’s in production and whether it’s maintained. Track support burden. Track which use cases are paused or retired. These signals tell leaders where to invest and what to stop. Everything else is decoration.

The sequence that works

Establish the platform and guardrails first. Prove the model with a small set of high-leverage use cases. Expand to more teams with consistent support. Review outcomes and simplify anything that causes friction.

The order matters. Each step creates the preconditions for the next. Skip ahead and you’re scaling demand faster than capability, which is how you end up with 50 broken pilots instead of 5 working ones.

This is a management problem. Treat it like one.

AI Incidents Don't Look Like Outages. That's the Problem.

Mon, 10 Nov 2025 00:00:00 +0000

Quick take

AI incidents are behavior failures, not downtime. Your monitoring says everything is green while the system confidently gives wrong answers. Detect with sampled quality checks and user feedback. Contain with rollbacks and feature flags, not root-cause analysis. Turn every incident into new eval coverage. Speed and reversibility beat thoroughness.

I wrote about incident response in 2019, drawing from national cyber-defense exercises and real startup breaches. The core lesson was simple: teams that perform best under pressure are the ones that have practiced the response, not the ones with the fanciest playbook sitting in Confluence.

That lesson applies directly to AI systems. But AI incidents have a nasty twist.

The system is up. The system is wrong.

Traditional incidents are usually obvious. The service is down. Latency spikes. Error rates climb. Dashboards go red. Someone gets paged.

AI incidents are subtle. The service returns 200 OK. Latency is normal. No errors in the logs. But the system is confidently telling a customer something wrong. Or it regressed after an untracked prompt change. Or the retrieval layer is surfacing stale docs, and the model is synthesizing them into plausible-sounding garbage.

I’ve seen this firsthand. A team ships a model update on Friday. Quality degrades on a specific input class. Nobody notices until Monday because all the operational metrics look fine. The only signal was a spike in user thumbs-down feedback that nobody was monitoring.

That’s the core problem. Your existing monitoring was built for availability. AI incidents are about correctness, and correctness is harder to observe .

What counts as an AI incident

Any material deviation from expected behavior that can affect users or business outcomes. In practice:

Wrong-but-plausible responses that users might trust and act on
Regressions after model, prompt, or retrieval changes
Retrieval failures that surface irrelevant or outdated context
Safety or policy violations – the model doing something it shouldn’t

These are ambiguous by nature. There’s no clean threshold. So detection has to rely on multiple signals, not a single metric.

Detection that actually works

Teams that catch things quickly combine several layers:

Sampled quality checks. Automatically evaluate a percentage of live traffic against your eval criteria. This catches systematic regressions before they pile up.

Targeted evals for known risk areas. If your system handles financial data or medical information, run focused checks on those categories continuously.

User feedback with low friction. A thumbs-down button isn’t sophisticated. It’s incredibly effective if someone is actually looking at the data. At a startup I ran, we learned that a simple feedback signal, reviewed daily, caught issues faster than any automated check.

Drift indicators. Track model behavior distributions over time. Track retrieval relevance scores. When these shift, something changed – even if nobody deployed anything.

No single signal is ground truth. The goal is to surface a pattern early enough to contain it.

Containment: fast and reversible

The instinct during any incident is to understand what happened. Resist that. Contain first, investigate later. This is the same principle from traditional IR – the tourniquet analogy I’ve used before.

For AI systems, the most reliable containment actions are:

Roll back to a previous model or prompt version. This requires having versioned those artifacts in the first place.
Feature-flag the risky path. Disable or rate-limit the AI feature. Route to a fallback.
Escalate to human review. For high-stakes outputs, insert a human checkpoint until the issue is understood.
Increase sampling. Crank up monitoring on the affected workflow while the issue is active.

All of these are operational actions, not analytical ones. You don’t need to understand the root cause to stop the bleeding.

Postmortems that close the loop

Once contained, run a focused postmortem. The questions are specific:

Which outputs were wrong or unsafe? Get concrete examples.
What signal could have caught this earlier?
What evaluation gap allowed it through?
What operational control would have reduced the blast radius?

The most important action item from any AI postmortem: add the failure cases to your eval suite . Every incident should produce new test coverage. If your eval suite isn’t growing after incidents, you aren’t learning.

Keep action items small and testable. “Improve quality” isn’t an action item. “Add 10 regression cases from this incident to the eval suite and enforce a rollout gate for prompt changes in this workflow” is an action item.

Prevention is a posture, not a gate

The teams that handle AI incidents well treat them as routine. Not as emergencies that mean someone failed. Practical prevention:

Evaluate changes before they hit full traffic. Canary deploys work for AI too.
Track model, prompt, and retrieval changes in a single changelog. When something breaks, you need to know what changed.
Maintain a simple runbook with containment options and owners. Not a 40-page document. A one-pager with “who gets paged, what can we roll back, what is the fallback.”

The goal isn’t zero incidents. The goal is fast detection, fast containment, and a system that gets more predictable over time. Same as any production system.

AI Technical Debt Is Eating Your Team Alive (And You Can't Even See It)

Mon, 27 Oct 2025 00:00:00 +0000

I wrote about the true cost of technical debt back in 2016. The core argument was simple: if you can’t put a number on your debt, you can’t make a rational decision about it. Measure the pain, do the math, and present the tradeoff.

That advice still holds. But AI debt is a different animal, and it’s making me angry.

With traditional tech debt, at least you can see it. Messy code. Missing tests. A module everyone dreads touching. The debt is in the codebase. You can grep for it. You can point to it in a PR review.

AI debt hides. It hides in prompts copy-pasted from a demo and never documented. In evaluations that were “planned for next sprint” six months ago. In embeddings that went stale when source docs changed and nobody re-indexed. In retrieval pipelines where data drifted so gradually that answers went from “good” to “plausible” to “confidently wrong,” and nobody noticed until a customer complained. The architectural version of this is why AI-native architecture needs explicit evaluation and retrieval ownership.

The system is still up. It still returns 200 OK. And it’s slowly poisoning your product.

The four kinds of AI debt that keep showing up

Prompt debt. Someone wrote a prompt that worked. They shipped it. Three model versions later, it still “works,” but the behavior has shifted in ways nobody documented because nobody was measuring. The prompt has magic strings nobody can explain. Changing a single sentence now requires a full regression test nobody has time for, so nobody changes anything, and the prompt becomes legacy code that happens to be written in English.

Eval debt. This one drives me up the wall. Teams ship AI features with no evaluation suite . Then they argue about quality using anecdotes. “It seemed fine when I tried it.” That’s not engineering; that’s vibes. Without evals, you can’t tell if your last change made things better or worse. You’re flying blind and calling it agile.

Data and pipeline debt. Stale embeddings. Missing documents. Labeling standards that drifted. The retrieval layer quietly degrades, and because LLMs are so good at sounding confident, nobody notices that answers are getting worse. This is the most insidious form because it’s silent. The system doesn’t crash. It just gets less trustworthy.

Architecture debt. The model interface is hard-coded three layers deep. Tool calls are embedded in application logic. Swapping a provider or upgrading a model feels like open-heart surgery. So teams avoid improvements entirely. The system calcifies.

How to actually fix this

The same way you fix any tech debt . Not with a heroic rewrite. With discipline.

Version your prompts like code. Put them in the repo. Give them owners. Document the intent, not just the text. When someone changes a prompt, they should write down why, and what eval signals should remain stable. This isn’t bureaucracy. It’s how you stop mystery regressions.

Build evals before you ship. Start with a small set of real examples and documented expected outcomes. Run them on every meaningful change. It doesn’t need to be elaborate. It needs to be consistent. Teams that do this – even just 20-30 test cases – move faster because they know what is safe to change.

Decouple the model interface. Abstract it. Separate retrieval from response logic. That lets you swap providers , test with mocks, and upgrade models without touching core flows. It also makes your system testable, which is the whole point.

Monitor freshness alongside quality. Track when your embeddings were last updated. Track retrieval relevance scores. If your data pipeline is stale, your outputs are stale, no matter how good the model is.

The uncomfortable part

Most teams accumulate AI debt because they shipped under pressure and told themselves they’d clean it up later. I’ve been guilty of this. Early on at a startup I ran, we had prompts that worked “well enough” and no eval suite for weeks. The reckoning came when we swapped model versions and spent three days figuring out what broke because we had no baseline to compare against.

The fix isn’t a cleanup sprint. It’s a steady cadence. Fifteen percent of capacity toward debt work, same as I recommended in 2016. Review prompt changes with rationale. Run evals on every release. Monitor quality signals and data freshness together.

AI debt is manageable. But it requires intention. If every small change to your AI system feels risky, you already have a debt problem. The path forward isn’t heroic rewrites. It’s a steady sequence of small, documented improvements.

Steady beats dramatic. Every time.

AI Doesn't Make Your Team Faster. Shared Infrastructure Does.

Mon, 13 Oct 2025 00:00:00 +0000

Every few weeks someone asks me how AI is changing team productivity. The honest answer: less than most people think, and in different ways than expected.

Individual engineers using Copilot or ChatGPT to write code faster is fine. It’s also not the point. One person moving 20% faster doesn’t help if the team is still bottlenecked on the same things it was bottlenecked on six months ago: stale docs, unclear decisions, and onboarding that requires cornering a senior engineer for two hours.

The teams I see getting real gains are the ones that treat AI as shared infrastructure. Not a personal productivity hack. Infrastructure.

What that looks like in practice

A shared assistant for team documentation and search. Not a chatbot that guesses – something that points to actual internal sources and tells you who owns what. Automated meeting summaries that feed into the same system where the team already tracks decisions. Onboarding workflows where a new hire can get a credible first answer and a pointer to the right human, instead of posting in Slack and hoping someone responds.

None of these need perfect accuracy. They need consistent routing and clear expectations about when AI is advisory versus authoritative.

The measurement trap

Here’s where most teams go wrong. They measure AI tool adoption . Number of prompts. Lines of code generated. That’s like measuring how many emails your team sends and calling it productivity.

The only question that matters: is the team less stuck?

Fewer repeated questions about the same topic. A shorter gap between a decision being made and that decision being documented. Less rework because someone missed context from a meeting they weren’t in.

If AI usage goes up but those numbers stay flat, you have added a toy, not infrastructure.

Docs, specifically

Documentation is where AI has the most underrated impact. Not generating docs from scratch – that’s garbage. But proposing small updates when code changes, flagging content that no longer matches reality, and making the update feel like a five-second approval instead of a batch project.

At a startup I ran, we struggled with doc decay like everyone else. The trick was making updates feel like routine housekeeping, not a chore you schedule for “next sprint” and never do.

Start small, stay boring

Pick one shared workflow. Make it reliable. Expand based on evidence, not enthusiasm. A small, visible win – like meeting notes that are actually useful the next day – changes team behavior more than any broad AI rollout plan.

The teams getting durable gains are the ones keeping AI practical, scoped, and accountable. Boring wins. As usual.

Measuring AI ROI Without Lying to Yourself

Mon, 29 Sep 2025 00:00:00 +0000

Quick take

AI ROI isn’t a spreadsheet trick. Pick one workflow with a clear baseline. Capture all costs – engineering, evals, governance, change management, and AI inference cost – not just API bills. Tie benefits to outcomes the business already measures. Report a range with assumptions, not one magic number. If your ROI case only works under best-case assumptions, it doesn’t work.

I’ve sat in a lot of budget reviews over the years – telecoms, fintech, logistics. The AI ROI presentations I see fall into two categories: honest assessments that lead to good decisions, and fiction that leads to funded projects that get quietly killed six months later.

The difference isn’t sophistication. It’s honesty about costs and rigor about baselines.

The Full Cost Picture

The first lie in most AI ROI calculations is the cost side. Teams report API costs and maybe some engineering time. They leave out everything else.

Here’s what AI actually costs:

Cost Category	What Teams Report	What It Actually Includes
Infrastructure	API usage fees	API fees + local compute + storage + networking + monitoring
Engineering	Initial build time	Build + integration + prompt engineering + ongoing maintenance
Evaluation	Nothing	Eval set creation + human review + quality monitoring tooling
Data	Nothing	Data preparation + cleaning + annotation + ongoing curation
Governance	Nothing	Compliance review + privacy controls + audit tooling + vendor management
Change Management	Nothing	Training + process redesign + user support + documentation
Opportunity Cost	Nothing	What else the team could have built with the same time

When I push teams to fill in the “What It Actually Includes” column, the cost estimate typically doubles or triples. That isn’t an argument against AI. It’s an argument for honest accounting so you can make the right investment decisions.

The Baseline Problem

You can’t measure improvement without a baseline. Sounds obvious. You’d be amazed how many teams skip it.

Before you deploy AI in a workflow, measure the current state:

Metric	How to Capture	Why It Matters
Throughput	Tasks completed per person per day	Direct productivity comparison
Error rate	Errors caught in QA or by customers	Quality comparison
Cycle time	Time from task start to completion	Speed comparison
Cost per task	Fully loaded labor cost / tasks completed	Economic comparison
Customer satisfaction	CSAT or NPS for the specific workflow	Outcome comparison

Measure for at least four weeks before deployment. Document any other changes that happened during the same period – new hires, process changes, seasonal variation. Those confounders matter when you try to attribute improvements to AI.

Mapping Benefits to Outcomes

The second lie in most AI ROI cases is on the benefit side. “Time saved” isn’t a business outcome. It’s a proxy. What did the team do with the saved time?

Map every claimed benefit to something the business already tracks and trusts:

AI Capability	Claimed Benefit	Business Outcome to Measure
Automated triage	Faster ticket routing	Resolution time, first-response time
Document extraction	Less manual data entry	Throughput per person, error rate
Content generation	Faster content creation	Time to publish, content volume
Code assistance	Faster development	Cycle time, defect rate, deploy frequency
Customer support	Reduced support load	Tickets per agent, CSAT, escalation rate

If you can’t connect an AI capability to a number the business already watches, the benefit is speculative. Label it that way. Don’t pretend it’s measured.

The Three Traps

Cherry-picking the easy wins. Measuring ROI only on the tasks that were already easiest to automate. The impressive numbers don’t represent the full deployment. Report the aggregate, not just the highlights.

Ignoring the learning curve. The first month after deployment is usually worse than the baseline. People are adjusting. Workflows are changing. If you measure too early, you either see inflated novelty effects or deflated learning-curve effects. Neither is representative.

Qualitative benefits as hard numbers. “Developers feel more productive” isn’t the same as “throughput increased 20%.” Both are worth reporting. Only one belongs in a financial model. In my work, I insist on separating measured outcomes from perceived benefits in every report. Leadership respects the honesty.

The Report Format That Works

Keep the ROI report to one page. Seriously. If it needs more than one page, you’re either overcomplicating or overclaiming.

Decision context. What question does this measurement answer? “Should we expand AI-assisted triage to all support channels” is specific. “Is AI valuable” isn’t.

Assumptions. List every assumption explicitly. Volume of tasks, cost rates, attribution model, measurement window. When assumptions change, the conclusion changes. Make that visible.

Results as a range. Don’t report a single ROI number. Report a range: conservative estimate under pessimistic assumptions, expected estimate under likely assumptions, optimistic estimate under best-case assumptions. If the conservative estimate is still positive, you have a strong case. If only the optimistic estimate is positive, you have a gamble.

Next measurement. State when you’ll re-measure and what would cause you to change course. This turns the report from a sales pitch into a decision tool.

What matters

AI ROI measurement isn’t about proving AI works. It’s about making good investment decisions. Capture the full cost, not just the API bill. Establish a real baseline before deploying. Map benefits to outcomes the business already tracks . Report honestly, with ranges and assumptions.

The teams that do this get funded reliably because leadership trusts their numbers. The teams that overclaim get one round of funding and then spend a year explaining why the projections didn’t materialize.

Discipline over heroics. Even in spreadsheets.

AI Privacy Is a Plumbing Problem, Not a Policy Problem

Mon, 15 Sep 2025 00:00:00 +0000

Quick take

AI privacy is plumbing, not policy. Map every data flow. Minimize what you send to models. Control who can replay prompts and access logs. Set retention rules that are actually enforced. Do sensitive work locally and pass reduced representations upstream. If you treat privacy as a late-stage review, you’ll fail the audit.

My background in national cyber-defense taught me something that most engineers learn too late: data classification isn’t a theoretical exercise. When you’re operating in an environment where information leakage has consequences beyond a compliance fine, you develop a different relationship with data flows. You map them. You minimize them. You assume every copy of data is a liability until proven otherwise.

That mindset transfers directly to AI systems.

The Problem Nobody Maps

Most AI features touch far more data than the visible prompt. In a typical RAG workflow , the user submits a query, your system retrieves context from a knowledge base, the model receives both the query and retrieved documents, it generates a response, and that response gets logged for quality monitoring.

At each step, data is copied. The user’s query is in your application logs, in the retrieval system’s query log, in the model provider’s request log, in your quality monitoring dashboard. The retrieved documents – which might contain sensitive customer data – now exist in your model provider’s system too, subject to their retention policy, not yours.

If you can’t draw this flow on a whiteboard in under two minutes, your privacy controls are guesswork. I start every privacy review by asking the team to map the flow. Most teams can’t do it. That’s the first problem to fix.

Minimize Before You Send

Data minimization is the single most effective privacy control in AI systems. Not because it’s elegant, but because it reduces blast radius. Data you don’t send can’t be leaked, retained, or trained on.

Practical minimization looks like this:

Strip identifiers early. Before the prompt is assembled, remove names, emails, account IDs – anything that isn’t required for the model to produce a useful response. If the model needs to reference a user, use an opaque session token that maps to the real identity only in your system.

Send summaries, not documents. If you need context from a 20-page contract, summarize the relevant section locally and send the summary. The model doesn’t need the full document. Your privacy exposure drops by an order of magnitude.

Separate sensitive from useful. Not all data carries the same risk. Split your workflows so that high-sensitivity data – medical records, financial details, authentication tokens – is processed locally with stronger controls. Lower-risk data can flow through standard AI paths. This tiering reduces the scope of every privacy review and makes incident response simpler.

Local First for the Dangerous Bits

Some operations should never leave your infrastructure. PII detection, redaction, and sensitive-content classification should run locally, on models you control, before anything touches an external API.

The pattern is straightforward: do sensitive work where the data already lives, then pass a reduced representation to the cloud model. This isn’t about avoiding cloud AI entirely. It’s about being deliberate about what crosses the boundary.

I’ve helped design pipelines where the first stage runs a local model to detect and redact PII, the second stage sends the sanitized content to a cloud model for the actual task, and the third stage re-attaches the redacted information only in the final response shown to the authorized user. The cloud model never sees real PII. The logs never contain it. The attack surface shrinks dramatically.

Logs Are the Quiet Privacy Gap

AI features generate logs that teams don’t think about. Prompt logs for debugging. Response logs for quality monitoring. Replay tools for incident investigation. Evaluation datasets built from production traffic.

Each of these creates a copy of user data that lives outside your normal data governance. And because these are “internal tools,” they often have broader access than production databases do.

Lock them down the same way you lock down production data:

Access control. Not everyone who can view the dashboard should be able to replay prompts containing user data. Restrict access by role and audit who accesses what.
Retention limits. Prompt logs don’t need to live forever. Set a retention window – 30 days is plenty for most debugging needs – and enforce automatic deletion.
Audit trails. Know who accessed which logs and when. This isn’t optional for regulated industries. It shouldn’t be optional for anyone.

Vendor Questions That Actually Matter

When evaluating AI providers, skip the marketing page and ask these questions directly:

Is customer data used to train or improve models by default? How do you opt out, and is the opt-out verified?
What data is retained after a request completes? For how long? For what purpose?
Where does processing happen geographically? Who on the vendor’s side can access request logs?
How are deletion requests handled? What’s the SLA? Is deletion cryptographic or simply a database flag?

Write the answers down. Put them in your vendor assessment. Revisit them annually, because vendor policies change without notice.

Governance That Survives Audits

Heavy governance processes don’t survive contact with reality. Teams skip them, shortcuts accumulate, and the audit reveals a gap between policy and practice.

Keep governance light and concrete:

One data flow map per AI feature. Inputs, retrieval sources, logs, outputs, retention. Fits on a single page.
A documented purpose for each data category. Why is this data in the pipeline? If you can’t answer, remove it.
Tested deletion paths. Not “we have a process for deletion.” Actually run it. Verify the data is gone. Do this quarterly.

Privacy is a design constraint, not a compliance checkbox. Build it into your AI pipeline the same way you build in authentication and authorization: as infrastructure that runs automatically, not as a review that happens after the fact.

Security, stability, performance – in that order. Privacy falls under security. It goes first.

AI Pair Programming: It's a Junior Dev, Not a Wizard

Mon, 01 Sep 2025 00:00:00 +0000

I pair with AI every day: building production systems, contributing to Go, and prototyping new ideas. It’s part of my workflow the same way version control and testing are – not because it’s magical, but because it’s useful when you know its limits.

The teams I’ve seen get the most value from AI coding assistants treat them the same way: like a fast, literal junior developer. Emphasis on literal. The model does exactly what you ask, fills in gaps with plausible guesses, and never tells you when your approach is wrong. That’s the mental model that keeps you productive without getting burned.

Where It Shines

AI assistants are excellent at work that’s well-scoped and pattern-driven. The kind of tasks where you know exactly what the output should look like but don’t want to type it all out.

Boilerplate generation, test scaffolding from existing patterns, translating a clear spec into working code, exploring how an unfamiliar API works, and refactoring repetitive code paths into a cleaner abstraction when you already know what that abstraction should be.

I use it heavily for these cases and it genuinely saves hours per week. When I’m writing Go and I need a new handler that follows the same pattern as the last ten handlers, the AI drafts it in seconds. I review, adjust, and move on.

Where It Falls Apart

The moment you need architectural judgment, project history, or business context, the AI becomes dangerous. Not useless – dangerous. Because it will confidently produce something that looks right, passes a quick glance, and introduces a subtle bug or design flaw that you don’t catch until it’s in production.

Watch for these warning signs:

It repeats the same mistake after you correct it. The model doesn’t learn within a session the way a human colleague does. If it keeps ignoring a constraint, it probably can’t reliably hold that constraint in its current context.
It invents things. Functions that don’t exist. Config options that aren’t real. API endpoints it hallucinated from training data. Always verify against actual docs.
It optimizes for elegance over correctness. The model loves clean, compact code. Sometimes that means it refactors away an important edge case because the edge case made the code ugly.

I’ve caught all three of these in my own work. More than once.

The Loop That Works

Long, open-ended chat sessions with AI produce garbage. The context window fills up, the model loses track of constraints, and you end up in a back-and-forth that takes longer than writing the code yourself.

Short, focused loops work. Here’s the pattern I use:

Define the task tightly. Inputs, outputs, constraints, existing style to match. Be specific. “Add a function that does X given Y, handling Z edge case, matching the pattern in the rest of this file.”
Get a first pass. Let the AI draft it.
Review critically. Not “does this look right” – trace through the logic. Check edge cases. Check error handling. Check that it respects the codebase conventions.
Iterate on specific gaps. Don’t ask for a full rewrite. Point at the specific line or logic branch that’s wrong and ask for a fix.
Integrate manually. Copy the code into your editor, run the tests, review the diff. The AI’s output is a draft, not a commit.

Give It Real Context

Vague prompts produce vague code. The single biggest improvement I’ve seen is upgrading from “write me a function that processes users” to something with actual constraints:

“Add a method getActiveUsers(since time.Time) to UserStore. Users are active if their LastSeen is after the given time. Return a slice sorted by LastSeen descending. If the store is empty, return nil, not an empty slice. Match the existing receiver pattern in this file.”

That level of specificity is the difference between useful output and time wasted reviewing hallucinated code.

The Trust Boundary

Here’s the line I draw: AI output is untrusted input . Same as user input. Same as data from an external API. It goes through the same gates.

Tests must pass.
Linter must pass.
Code review still applies. A human reads the diff.
Security-sensitive code gets extra scrutiny regardless of who or what wrote it.

Some teams have started rubber-stamping AI-generated code because “the AI wrote it and it looks fine.” That’s how you get vulnerabilities in production. I’ve seen it happen.

The Honest Assessment

AI pair programming makes me faster at the boring parts of writing software. It doesn’t make me better at the hard parts. Architecture decisions, security considerations, performance tradeoffs, understanding what the user actually needs – those are still entirely on me.

The developers who get the most value are the ones who already know what good code looks like. The AI accelerates their output. The developers who rely on AI to compensate for gaps in their understanding ship bugs faster.

Use it as a tool. Review its work. Keep the sessions short. And never, ever merge without reading the diff.

Running AI Locally: A Practical Guide for Teams Who Care About Control

Mon, 18 Aug 2025 00:00:00 +0000

Quick take

Local AI development is a legitimate option for teams that need data control, predictable costs, or offline capability. The tradeoff is operational work. Keep the stack small, abstract the provider behind an interface, version your models like you version your code, maintain an eval set, and always keep a cloud fallback for quality-critical paths.

I run local models daily: in production projects, for prototypes, and for anything involving sensitive data that shouldn’t leave my machine. The tooling has matured enough that this is no longer a novelty; it’s a practical engineering choice with clear tradeoffs.

I’ve also seen teams go all-in on local AI without understanding what they’re signing up for. Running your own models means owning the full lifecycle: model selection, quantization, runtime management, version pinning, quality monitoring, and fallback strategies. If you aren’t prepared for that operational load, use a managed API.

This post is for teams who have decided local makes sense and want to do it properly.

When Local Is the Right Call

Local AI makes sense in specific scenarios:

Sensitive data. Proprietary code, financial records – anything you don’t want leaving your network. I frequently work with data under NDA, and local inference means the data never touches a third-party API.
Predictable costs . API costs scale with usage; local costs scale with hardware. For high-volume routine tasks – classification, extraction, summarization – local can be dramatically cheaper once you amortize the hardware.
Offline or air-gapped environments. Some deployments don’t have reliable internet. Some shouldn’t have it. My national cyber-defense background drilled this in – there are environments where external API calls aren’t just inconvenient; they aren’t allowed.
Deterministic CI testing . When your tests depend on model output, you need a pinned model version that doesn’t change between runs. Local gives you that control.

Local is the wrong call when you need frontier-level quality on every request or your team can’t absorb the operational overhead.

The Provider Abstraction

First rule: never hard-code your provider. Whether you’re using Ollama, llama.cpp, vLLM, or a cloud API, the rest of your code shouldn’t care. Hide it behind an interface.

In Go, this is clean:

// Provider defines the contract for any AI backend.
type Provider interface {
    Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error)
    Embed(ctx context.Context, input string) ([]float64, error)
    Health(ctx context.Context) error
}

type CompletionRequest struct {
    Model       string
    Messages    []Message
    MaxTokens   int
    Temperature float64
}

type CompletionResponse struct {
    Content    string
    TokensUsed int
    Model      string
    FinishReason string
}

Now your local and cloud providers implement the same interface. Switching between them is a config change, not a code rewrite. Testing is trivial: mock the interface and move on.

type OllamaProvider struct {
    endpoint string
    client   *http.Client
}

func NewOllamaProvider(endpoint string) *OllamaProvider {
    return &OllamaProvider{
        endpoint: endpoint,
        client: &http.Client{
            Timeout: 120 * time.Second,
        },
    }
}

func (o *OllamaProvider) Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error) {
    body := ollamaRequest{
        Model:    req.Model,
        Messages: toOllamaMessages(req.Messages),
        Stream:   false,
        Options: ollamaOptions{
            Temperature: req.Temperature,
            NumPredict:  req.MaxTokens,
        },
    }

    resp, err := o.post(ctx, "/api/chat", body)
    if err != nil {
        return CompletionResponse{}, fmt.Errorf("ollama completion: %w", err)
    }

    return CompletionResponse{
        Content:      resp.Message.Content,
        TokensUsed:   resp.EvalCount,
        Model:        resp.Model,
        FinishReason: resp.DoneReason,
    }, nil
}

The Fallback Chain

Local models are good. They aren’t always good enough. For quality-critical paths – user-facing content generation, complex reasoning tasks, anything where a wrong answer costs real money – you need a fallback to a stronger model .

type FallbackProvider struct {
    primary   Provider
    fallback  Provider
    threshold float64 // confidence threshold for fallback
}

func (f *FallbackProvider) Complete(ctx context.Context, req CompletionRequest) (CompletionResponse, error) {
    resp, err := f.primary.Complete(ctx, req)
    if err != nil {
        // Primary failed, try fallback
        slog.Warn("primary provider failed, using fallback", "error", err)
        return f.fallback.Complete(ctx, req)
    }
    return resp, nil
}

In practice, I extend this with confidence scoring – if the local model returns a low-confidence response, automatically retry with the cloud provider. The core pattern is simple: try local first, fall back to cloud when needed, and log fallbacks so you know how often they happen.

Configuration That Travels

Keep your AI configuration in a structured file in source control. Everything – model names, endpoints, fallback rules, temperature settings – should be declarative and version-controlled.

ai:
  default_provider: local

  providers:
    local:
      type: ollama
      endpoint: http://127.0.0.1:11434
      models:
        completion: "mistral:7b-instruct-v0.3-q5_K_M"
        embedding: "nomic-embed-text:latest"
      timeout: 120s

    cloud:
      type: openai
      # API key from environment: AI_CLOUD_API_KEY
      models:
        completion: "gpt-4o"
        embedding: "text-embedding-3-small"
      timeout: 30s

  fallback:
    enabled: true
    primary: local
    secondary: cloud
    on_error: true
    on_low_confidence: true
    confidence_threshold: 0.7

  evaluation:
    eval_set_path: "./eval/fixtures"
    run_on_model_change: true

The model name includes the quantization level. This is deliberate. mistral:7b-instruct-v0.3-q5_K_M is not the same as mistral:7b-instruct-v0.3-q4_0. Different quantization levels produce different outputs. Pin it.

Versioning and Reproducibility

This is where most local setups fall apart. Someone updates the model, doesn’t tell the team, and suddenly outputs are different. Tests still pass because nobody wrote quality assertions – they just check that the model returned something.

Version these things:

Model file hash. SHA256 the model binary. Store the hash in your lockfile or config. If the hash changes, the model changed.
Runtime version. Pin your Ollama or llama.cpp version in your Dockerfile or setup script.
Prompt templates. Keep them in source control alongside the code that uses them. Prompt drift is real and insidious.

FROM ollama/ollama:0.3.12

# Pull and pin specific model versions
RUN ollama pull mistral:7b-instruct-v0.3-q5_K_M

# Copy eval fixtures for smoke test
COPY eval/fixtures /eval/fixtures

The Evaluation Harness

You need an eval set . Not optional. It should be a small collection of representative inputs with expected outputs that you run every time you change a model, update a prompt, or modify provider configuration.

func TestModelQuality(t *testing.T) {
    provider := setupLocalProvider(t)

    fixtures := loadEvalFixtures(t, "./eval/fixtures")
    var passed, failed int

    for _, fix := range fixtures {
        resp, err := provider.Complete(context.Background(), fix.Request)
        if err != nil {
            t.Errorf("fixture %s: %v", fix.Name, err)
            failed++
            continue
        }

        if !fix.Validate(resp.Content) {
            t.Errorf("fixture %s: expected pattern %q, got %q",
                fix.Name, fix.ExpectedPattern, resp.Content)
            failed++
            continue
        }
        passed++
    }

    passRate := float64(passed) / float64(passed+failed)
    if passRate < 0.85 {
        t.Fatalf("pass rate %.1f%% below threshold 85%%", passRate*100)
    }
}

Run this in CI. Run it before every model swap. Run it when you change prompts. The eval harness is what keeps you from shipping a regression you don’t notice for two weeks.

Performance Tuning Order

If local inference is too slow, fix it in this order:

Smaller model. For routine tasks – classification, extraction, simple summarization – a 7B parameter model is often sufficient. Don’t run a 70B model for ticket triage.
Quantization. Q5_K_M is usually the sweet spot between quality and speed. Q4_0 is faster but you’ll notice quality degradation on complex tasks. Measure with your eval set before committing.
Batching. If you have throughput-heavy workloads, batch requests. Most local runtimes support this. The latency per request goes up slightly but throughput goes up dramatically.
Hardware. GPU inference is 10-50x faster than CPU for most model sizes. If you’re serious about local AI, budget for a decent GPU. An RTX 4090 handles a 7B model comfortably.

The Honest Tradeoff

Local AI gives you control, privacy, and predictable costs. In exchange, you take on operational responsibility for model management, quality monitoring, and infrastructure maintenance. That’s a fair trade for the right workloads.

Keep the stack small. Abstract the provider. Version everything. Measure quality continuously. Keep a cloud fallback for the moments when local isn’t enough.

The teams that do this well treat local AI like any other infrastructure dependency – with discipline, not enthusiasm.

AI Workflow Automation: Decisions Are Cheap, Actions Are Expensive

Mon, 04 Aug 2025 00:00:00 +0000

Last year I worked with a logistics company that had automated invoice processing with an AI agent . The agent read invoices, extracted line items, matched them to purchase orders, and approved payments. End to end. No human in the loop.

It worked beautifully for three months. Then the agent approved a $340,000 payment to a vendor who submitted a duplicate invoice with slightly different formatting. The model treated it as new. The validation layer didn’t exist because “the AI handles it.”

Three hundred and forty thousand dollars. Because someone treated a probabilistic system like a deterministic one.

That experience crystallized a principle I repeat often: AI decides, deterministic code acts. Never the other way around.

The Architecture That Survives

The separation is simple in concept and surprisingly rare in practice. The AI component receives structured context, produces a structured decision with a rationale, and stops there. Everything after that – validation, side effects, and the actual work – is deterministic code with explicit rules.

The flow:

Trigger arrives with metadata (a ticket, a document, an event)
AI decision produces structured output – classification, extraction, routing recommendation, confidence score, and a short explanation of why
Deterministic validation checks the decision against hard policy rules, allowlists, deny lists, and threshold constraints
Action or escalation – if validation passes and confidence is high, execute. If not, route to human review with the full context attached
Audit trail stores the input, the decision, the rationale, the validation result, and the final action

Every step is logged. Every decision is replayable. If something goes wrong, you can trace exactly where and why.

Confidence Tiers Aren’t Optional

Not every AI decision deserves the same treatment. A classification the model is 95% sure about is different from one it’s 60% sure about. Your automation should know the difference.

I use three tiers everywhere:

High confidence – auto-approve, execute the action, log for periodic review
Medium confidence – queue for human review with the AI’s recommendation and rationale attached
Low confidence – escalate immediately, flag for manual handling, don’t proceed

Thresholds depend on your domain. For invoice processing, I set the bar high because the cost of a wrong action is real money. For ticket triage, I set it lower because a misrouted ticket is annoying but recoverable.

The point is that uncertainty is a normal operating state. It isn’t a bug. Your system should be designed to handle it gracefully instead of pretending every decision is confident.

Context Discipline

Feed the AI the minimum context needed to make a good decision. Not a raw database dump. Not the entire ticket history. Use a structured package: the specific document or event, the relevant policy excerpt, and a few representative examples of how similar cases were decided.

When teams dump everything into the context window , two things happen: token costs explode, and the model starts hallucinating connections between unrelated data points. More context isn’t better context. Be deliberate about what matters for a specific decision.

Where AI Automation Actually Fits

Good fits: request triage, document classification, data extraction from messy formats, policy-based routing where ambiguity is expected and escalation is normal.

Bad fits: anything safety-critical, anything requiring hard real-time guarantees, anything where a wrong decision is irreversible and expensive. If you can’t tolerate occasional uncertainty, don’t automate with a probabilistic system.

From what I’ve seen, the most successful automation projects started with a single workflow that already had a manual review path. They ran the AI in shadow mode first, compared its decisions to the human decisions, measured agreement rates, and only then moved to live execution – with review still in place for the first few weeks.

The Real Lesson

That $340,000 duplicate payment wasn’t a model failure. The model did exactly what it was designed to do – it classified the invoice and approved it. The failure was architectural. Nobody built the validation layer that should have caught a duplicate vendor-amount-date combination. Nobody defined the hard boundaries.

AI automation works when you respect what it is: a probabilistic decision engine. Wrap it with deterministic guardrails , log everything, and keep humans in the loop for anything your business can’t afford to get wrong.

Guardrails beat talent. Always.

AI Docs That Don't Lie to Your Users

Mon, 21 Jul 2025 00:00:00 +0000

Quick take

Your AI docs system is only as good as its retrieval and its willingness to say “I don’t know.” Use hybrid search, chunk by document structure with version metadata, cite sources in every answer, and treat freshness as a scheduled operational job – not a wish on the backlog.

I contribute to Go regularly. I also use documentation from dozens of projects every day. And I can tell you the most common failure in developer documentation isn’t bad writing. It’s bad retrieval.

A developer hits a cryptic error at midnight. They search. They get a result that looks right. It’s from v2. They’re on v4. The answer doesn’t apply, but they don’t realize it until they’ve wasted forty minutes. Now multiply that across everyone using your docs.

That’s the problem AI documentation systems need to solve. Not “make the docs chatty.” Make docs findable, version-accurate, and honest about gaps.

The Three Problems Worth Solving

Discovery. Users don’t know your terminology. They describe symptoms, not concepts. A developer searching for “connection refused after deploy” might need the page about TLS configuration, but your keyword search returns the networking overview. Semantic search bridges this gap, but only if your chunks are meaningful units – not random 500-token slices.

Version accuracy. Your API changed between v3 and v4. The auth flow is different. The error codes are different. If your retrieval doesn’t filter by version, it will surface whatever is most popular in the index. Popular doesn’t mean current.

Freshness. Your product shipped a breaking change last Tuesday. The docs still describe the old behavior. Your AI docs system confidently explains how the old version works. This is worse than having no AI at all because it adds a layer of false authority.

The System Shape

An AI docs system is a pipeline, not a chatbot with a vector store bolted on. The pieces that matter:

Content store with metadata. Every chunk needs a stable ID, a version tag, a last-updated timestamp, and a source URL. Without these, you can’t filter, you can’t cite, and you can’t detect staleness.

Hybrid retrieval . Semantic search for conceptual questions. Keyword search for exact error codes, flag names, and parameter values. Neither alone is sufficient. The combination covers most queries. Add a reranking step that considers version relevance and recency – not just semantic similarity.

Answer synthesis with citations. The model generates an answer, but every claim must trace to a specific chunk. If the retrieved chunks don’t contain the answer, the system says so explicitly: “This doesn’t appear to be covered in the current docs. Here’s the closest related section.” A short answer with a source link beats a fluent paragraph that invents details.

Feedback collection. Log every question that gets a low-confidence response or explicit negative feedback. Route those to doc owners weekly. This is the actual improvement loop. Without it, you’re flying blind.

Chunking Matters More Than Model Choice

I’ve seen teams agonize over which LLM to use for synthesis while completely ignoring their chunking strategy . The chunking is where the battle is won or lost.

Split by document structure. Headings, sections, and code blocks are natural semantic boundaries. A chunk should be a coherent unit that can answer a question on its own, or clearly can’t. Token-count splitting produces fragments that retrieve well by similarity score but fail at actually answering questions.

Attach version metadata to every chunk. If someone asks about v4 auth, filter to v4 chunks before retrieval. This isn’t a nice-to-have. It’s the difference between helpful and harmful.

Freshness Is Ops Work

Docs go stale. This isn’t a failure of discipline – it’s a consequence of shipping software. The solution isn’t “write better docs.” The solution is automated freshness checks.

Schedule weekly jobs that validate links, compare API schema hashes against the documented version, and flag code samples that reference deprecated methods. When a check fails, create a ticket with clear ownership and a deadline. Not a backlog item. A real deadline.

At the fintech startup, we learned this the hard way with financial data: stale information in a financial context isn’t just unhelpful, it’s dangerous. The same principle applies to docs. Stale docs users trust are worse than no docs at all.

Measure Success by Questions Answered

Pageviews are meaningless for docs. The metric that matters is: did the user get the right answer?

Track question success rate through explicit thumbs-up/down on AI answers. Track the count of unanswered or low-confidence questions – these are your improvement backlog. Track time-to-update for pages flagged as stale.

The feedback loop is the product. The AI layer is just the delivery mechanism. If unanswered questions aren’t flowing back into your documentation process , your AI docs system is a search box with extra steps.

Build retrieval that respects versions. Require citations. Admit uncertainty. Treat freshness as an operational discipline. Everything else is decoration.

Your AI Metrics Are Measuring the Wrong Thing

Mon, 07 Jul 2025 00:00:00 +0000

Every AI product review I sit in starts the same way: someone pulls up a dashboard showing adoption rates, interaction volume, and session length. The numbers are up and to the right. Everyone nods.

Then I ask: “How many of those interactions ended with the user getting the right answer?” Silence.

This is the metrics gap that keeps burning teams. Usage tells you people showed up. It tells you nothing about whether they left with what they needed. An AI feature can be heavily used and actively harmful at the same time. Users try it, get a wrong answer, correct it manually, and keep coming back because they’re optimistic. Your dashboard shows engagement. Your product is eroding trust.

What to Actually Measure

Three things. That’s it.

Did the output help? Not “was it generated.” Did it contribute to the user completing their task? Define what successful completion looks like for your specific workflow, then measure whether AI-assisted completions happen more often, faster, or with fewer errors than the baseline. If you can’t tie AI output to a task outcome, you’re measuring wind.

Was it correct? Combine automated checks with periodic human review. Automated checks catch format violations, hallucinated entities, and safety issues . Human review catches the subtle stuff: answers that are technically correct but misleading, or correct for the wrong version. Sample 5% of outputs weekly. That’s enough to spot trends before they become incidents.

Do users trust it? Trust is the leading indicator everyone ignores. Track it through implicit signals: how often users edit AI output before accepting it, how often they abandon a flow after seeing the AI response, and how often they re-prompt with the same question phrased differently. Rising edit rates or re-prompt rates mean trust is declining. By the time CSAT surveys catch this, you’ve already lost months.

The Dashboard That Fits on One Screen

Your AI scorecard should answer four questions at a glance:

Are people using it? (adoption, retention – the basics)
Is the output good? (correctness rate, safety rate from automated + human review)
Is it helping? (task completion rate, time to completion vs. baseline)
Do they trust it? (edit rate, re-prompt rate, abandonment rate)

Review weekly. Tie every metric to a decision. If a number moves and nobody changes anything, delete the number. Dashboards without decisions are theater.

When a metric dips, you should be able to trace it back to a model update, a retrieval change, or a product shift within the same week. If you can’t, your instrumentation is too coarse.

The Uncomfortable Truth

Most teams avoid quality metrics because they’re harder to collect and the numbers are less flattering than engagement counts. That’s exactly why they matter. The teams that measure task success and trust alongside usage are the ones whose AI features survive past the demo phase.

Measure what the user felt. Everything else is vanity.

Stop Fine-Tuning Models You Haven't Bothered to Prompt Properly

Mon, 23 Jun 2025 00:00:00 +0000

I need to get something off my chest. I’ve reviewed six AI projects in the last two months where teams jumped straight to fine-tuning. Six. Not one of them had tried proper few-shot prompting first. Not one had a retrieval layer for domain knowledge. They saw “the model doesn’t know our stuff” and immediately reached for the most expensive, most maintenance-heavy tool in the shed.

This drives me nuts.

Fine-Tuning Isn’t a Knowledge Injection

Let me say this clearly: fine-tuning changes behavior, not knowledge. If your problem is “the model doesn’t know about our product,” the answer is retrieval. RAG . Grounding. Whatever you want to call it, feed the model your docs at inference time.

Fine-tuning bakes patterns into weights. It’s good for consistent tone, strict output formats, and narrow tasks repeated at massive scale. It’s terrible for facts that change, knowledge that needs updating, or anything where you want to point at a source and say “the answer came from here.”

I’ve watched teams spend weeks curating training data to teach a model their product catalog. Then the catalog changes. Now the model confidently recommends products that no longer exist. Retrieval would have solved this in an afternoon.

The Decision Is Simple

Before you fine-tune anything, answer these questions honestly:

Have you pushed the prompt hard? Not a one-liner. A real system prompt with role definition, constraints, examples, and output format. Most teams write a lazy prompt, get mediocre results, and conclude the model needs training. No. Their prompt needs training.

Have you added retrieval? If the issue is domain knowledge, factual accuracy, or up-to-date information, retrieval is the answer. Fine-tuning can’t compete with a well-indexed knowledge base for factual tasks.

Is the remaining gap about behavior? After good prompts and solid retrieval, if the model still can’t hold a consistent tone, reliably produce a specific output structure, or stop drifting on a narrow repeated task, now we can talk about fine-tuning.

Is the volume worth it? Fine-tuning has upfront cost and ongoing maintenance. If the task runs ten times a day, just use a better prompt. If it runs ten thousand times a day and prompt tokens are eating your budget , fine-tuning starts to make economic sense.

The Maintenance Tax Nobody Mentions

Here’s what the fine-tuning tutorials leave out. A tuned model is a versioned product. Your training data reflects a snapshot of your business at a moment in time. Products change. Policies change. Customer expectations change. Your training set drifts.

That means you need:

Versioned training sets in source control
A holdout evaluation set that you run against every new version
Monitoring for quality regression in production
A refresh cadence that’s actually budgeted and scheduled

I’ve seen exactly one team do all of this well. Everyone else fine-tuned once, celebrated, and then watched quality slowly degrade over three months while nobody noticed because nobody was measuring.

I’m not anti-fine-tuning. I’m anti-premature-fine-tuning. The legitimate cases exist:

You need a specific voice or brand tone that holds across thousands of outputs and few-shot examples aren’t stable enough
You have a narrow classification or extraction task at high volume where shaving prompt tokens saves real money
You need a strict output schema and the base model keeps introducing creative variations despite explicit instructions

From what I’ve seen, maybe one in five projects that ask about fine-tuning actually need it. The rest need better prompts, proper retrieval, or both.

The Honest Checklist

Write a real system prompt with examples and constraints. Test it on 50 representative inputs.
If factual accuracy is the gap, add retrieval. Test again.
If behavior consistency is still the gap at high volume, collect 200+ high-quality examples that match real production inputs.
Hold out 20% for evaluation. Fine-tune. Compare against the base model on both your target metric and general reasoning.
If the tuned model wins on behavior but loses on reasoning, reconsider whether the tradeoff is worth it.
Version everything. Monitor everything. Schedule refreshes.

Stop treating fine-tuning as step one. It’s step last.

AI Customer Support That Doesn't Make People Hate You

Mon, 09 Jun 2025 00:00:00 +0000

I have a confession: I’ve rage-quit a support chat with an AI bot at least four times this year. And I build these systems for a living.

The problem is rarely the technology. The problem is that someone decided the goal was “deflect tickets” instead of “help customers.” Those goals produce completely different systems.

At a shared mobility startup I ran, we handled support for thousands of riders across multiple cities. Some of it was straightforward – “where is my scooter” kind of stuff. Some of it wasn’t – billing disputes, safety incidents, regulatory questions. The lesson that stuck with me was simple: the moment a customer feels trapped in a loop with no exit, you’ve lost them. Permanently.

That lesson applies directly to AI support.

Design for the Handoff, Not the Deflection

The best AI support systems I’ve seen share one trait: they’re obsessed with the handoff. The AI handles the routine stuff – password resets, order status, basic troubleshooting. Fine. But the moment the conversation crosses into ambiguity, billing, account security, or anything emotionally charged, it routes to a human. Fast. With full context attached.

Full context means the customer doesn’t have to repeat themselves. It means the human agent sees the conversation history, account state, prior tickets, and the AI’s confidence assessment. If your handoff drops any of that, your human agent starts from zero and the customer feels punished for escalating.

Make escalation a one-tap action. Not buried in a menu. Not “please describe your issue again so we can route you.” One tap. Every screen.

Ground Answers or Say Nothing

Here’s where most AI support goes sideways: the model generates a plausible-sounding answer that’s completely wrong. The customer follows it, makes things worse, and now you have a pissed-off user and a support ticket that’s twice as hard to resolve.

The fix is grounding . Every answer the AI gives should be traceable to current documentation or a known resolution pattern. If the system can’t find a source, it should say so. “I don’t have a verified answer for this – let me connect you with someone who does.” That sentence is worth more than a thousand confidently wrong paragraphs.

For anything touching billing, account access, or security – require a source citation or refuse to answer. No exceptions. A cautious deferral builds trust. A confident hallucination destroys it.

Context Isn’t Optional

Your AI support bot should know who it’s talking to: conversation history , account state, prior tickets, current subscription tier. If the customer told you their name and order number two messages ago, don’t ask again.

This sounds obvious, but it’s shocking how many production systems get it wrong. They treat every message as an independent event because someone optimized for stateless simplicity instead of user experience.

Context also means understanding what has already been tried. If the customer says “I already restarted the app,” don’t suggest restarting the app. The AI should parse prior attempts and skip the obvious stuff. This is where retrieval over conversation history earns its keep.

Measure What the Customer Feels

Most teams measure deflection rate as their primary AI support metric. That tells you how many tickets the AI intercepted. It tells you nothing about whether customers got help.

Measure these instead:

CSAT per interaction – not aggregate, per conversation. Did this specific person feel helped?
Time to resolution – including escalation time. If AI adds a 10-minute runaround before connecting to a human, that’s worse than no AI at all.
Repeat contacts – if the same customer comes back about the same issue, the first interaction failed. Full stop.
Escalation quality – when the AI hands off, does the human have enough context to pick up immediately?

Review these weekly. Not monthly. Weekly. Because AI support quality can drift fast when your knowledge base gets stale or your product ships a change that the docs haven’t caught up with.

Start Narrow, Stay Honest

Don’t launch AI support across every channel and every topic on day one. Pick the three most common, routine request types. Test internally. Get the escalation path rock solid. Make sure the knowledge base is current.

Then expand. Slowly. Treat every failed conversation as signal – a gap in your docs, a missing retrieval path, a policy AI doesn’t know about. That feedback loop is the actual product. The chatbot is just the interface.

AI support works when it’s built around humility – the system’s humility about what it knows, and the team’s humility about what it can handle. Everything else is a demo.

Your AI Pipeline Is Just ETL With Extra Steps (And That's Fine)

Mon, 26 May 2025 00:00:00 +0000

Quick take

Stop overcomplicating AI pipelines. They’re ETL plus retrieval ops. Diff your inputs, chunk by structure (not token count), upsert with stable IDs, and treat reindexing as a deliberate, versioned event. Skip the diffing step and retrieval drifts into garbage. I’ve seen it happen three times this year alone.

I’ve been building data pipelines since before anyone called them “data pipelines.” At the fintech startup we were ingesting financial news from hundreds of sources, normalizing it, and serving it for real-time retrieval. That was 2017. The core problems haven’t changed.

What has changed is that your pipeline now has a second consumer: a retrieval system feeding an LLM. If you treat that consumer as an afterthought, your AI product will deliver confidently wrong answers. Ruthless focus on the basics separates pipelines that work from pipelines that demo well.

The Shape of an AI Pipeline

Every AI pipeline I’ve seen in production boils down to six stages. Here’s the skeleton:

pipeline:
  stages:
    - name: extract
      # Pull from sources, normalize formats
      # PDF, HTML, API responses -> clean markdown or structured text

    - name: diff
      # Hash-based change detection
      # This is the stage most teams skip. Don't.

    - name: chunk
      # Split by document structure first, token count second
      # Preserve section boundaries and headings

    - name: embed
      # Generate vectors using a pinned model version
      # Log the model version. You will need it later.

    - name: index
      # Upsert with stable IDs and rich metadata
      # source_id + chunk_position = deterministic ID

    - name: verify
      # Check for missing chunks, stale entries, orphans
      # Alert on drift from expected source freshness

Nothing exotic. The magic is in the discipline of each stage, not in clever architecture.

The Diff Step Is Everything

Most teams skip change detection and reprocess everything on every run. At small scale, this is fine. At production scale, it’s expensive, noisy, and makes debugging a nightmare.

A simple content-hash approach works well:

func hasChanged(sourceID string, content []byte, store HashStore) bool {
    newHash := sha256.Sum256(content)
    existing, found := store.Get(sourceID)
    if !found {
        store.Set(sourceID, newHash)
        return true
    }
    if existing != newHash {
        store.Set(sourceID, newHash)
        return true
    }
    return false
}

When I built the ingestion pipeline at the fintech startup, adding a diff layer cut downstream processing costs by roughly 60%. Most sources don’t change on most runs. Detecting that early saves everything downstream.

The diff step also gives you auditability. You can answer “what changed and when” instead of shrugging at a vector store that silently drifted.

Chunking: Structure Before Size

This is where most RAG pipelines go wrong. Teams reach for a token-count splitter because it’s the default in every tutorial, then wonder why retrieval returns fragments of ideas instead of coherent answers.

Split by document structure first. Headings, sections, code blocks, list items – these are natural semantic boundaries. Only fall back to token-count splitting when a single section exceeds your context window.

def chunk_by_structure(doc: Document) -> list[Chunk]:
    chunks = []
    for section in doc.sections:
        if section.token_count <= MAX_CHUNK_TOKENS:
            chunks.append(Chunk(
                content=section.text,
                metadata={
                    "source_id": doc.id,
                    "section_heading": section.heading,
                    "position": section.index,
                    "doc_version": doc.version,
                },
                # Deterministic ID: no duplicates on re-ingestion
                id=f"{doc.id}:{section.index}",
            ))
        else:
            # Fall back to sliding window only for oversized sections
            for sub in sliding_window(section, MAX_CHUNK_TOKENS, overlap=100):
                chunks.append(Chunk(
                    content=sub.text,
                    metadata={**section.metadata, "sub_position": sub.index},
                    id=f"{doc.id}:{section.index}:{sub.index}",
                ))
    return chunks

Two things matter here. First, the id is deterministic, derived from source and position, not random. This means re-ingesting the same content produces upserts, not duplicates. Second, metadata travels with every chunk. When retrieval returns a chunk, you know exactly where it came from, which version, and which section.

I can’t overstate how many production RAG systems I’ve reviewed where chunks had no stable ID. Every reindex created duplicates. Users got the same passage three times in their context window, and the model hallucinated a consensus that didn’t exist.

Freshness Is an Operational Problem

Your pipeline isn’t done when it runs once. Sources change, APIs update, and documents get deleted. If your index doesn’t reflect reality, your AI lies with confidence.

Three rules I enforce on every pipeline:

Reindex on embedding model changes. If you swap or upgrade your embedding model , every existing vector is stale. This is a full reindex event. No exceptions. Pin your model version and log it.
Purge on source deletion. If a document disappears from the source, its chunks must disappear from the index. Orphaned chunks are a retrieval poison pill.
Alert on freshness drift. Every source has an expected update cadence. If your financial news feed hasn’t updated in 6 hours, something is wrong. Don’t wait for a user to notice.

freshness_policy:
  sources:
    - name: product_docs
      expected_interval: 24h
      alert_after: 36h
    - name: api_changelog
      expected_interval: 7d
      alert_after: 10d
    - name: support_kb
      expected_interval: 48h
      alert_after: 72h
  on_embedding_change: full_reindex
  on_source_delete: purge_chunks

The Mistakes I Keep Seeing

After building AI infrastructure across telecom and fintech, the failure pattern is remarkably consistent:

No stable IDs. Updates create duplicates. Retrieval returns the same content multiple times. The model treats repetition as emphasis and doubles down on whatever it found.

Token-count-only chunking. A paragraph about authentication gets split mid-sentence. The first half lands in one chunk, the second half in another. Retrieval finds the first half. The model confidently gives half an answer.

Ad-hoc reindexing. Someone runs a reindex on a Friday afternoon. Nobody knows what changed. Retrieval quality shifts. The team argues about whether it got better or worse. No one can prove either way because there’s no baseline.

Missing permission metadata. The chunks are indexed without access control data. A user with restricted access asks a question and gets an answer sourced from documents they shouldn’t see. This is a compliance incident waiting to happen.

What matters

AI pipelines are pipelines. The retrieval layer adds real complexity, but the solution isn’t a new paradigm. It’s the same discipline that has always worked: detect change early, preserve meaning when you split, keep identifiers stable, and make freshness an operational concern with clear owners and alerts.

Same fundamentals, new surface area. That’s the whole story.

Agent Orchestration: Four Patterns, Honest Tradeoffs

Mon, 12 May 2025 00:00:00 +0000

Quick take

More agents doesn’t mean better results. It means more coordination overhead and more failure modes. Start with a simple pipeline, add a verifier, and only go multi-agent when you can clearly define who owns each decision. If your agents don’t have contracts, you don’t have orchestration – you have chaos.

I keep getting asked about multi-agent architectures . Teams see the demos – agents collaborating, debating, building things together – and they want that. What they usually need is simpler.

The uncomfortable truth about agent orchestration is that it’s just distributed systems with worse debugging tools. Every coordination problem you’ve seen in microservices shows up again: unclear ownership, implicit state, cascading failures , and the seductive illusion that more components mean more capability.

That said, there are real use cases where multiple agents outperform a single one. The key is choosing the right pattern and being honest about the tradeoffs.

The four patterns

After building and reviewing agent systems in production , I’ve landed on four patterns that cover most real-world use cases.

1. Sequential pipeline

The simplest pattern. Agent A does research, passes results to Agent B for analysis, then Agent B passes to Agent C for writing. Each agent has a clear input and output contract.

When it works: Tasks with a natural sequence of distinct steps. Content generation pipelines. Data processing workflows. Anything where each step builds on the previous one.

When it breaks: Early agents produce weak output and later agents can’t recover. Errors compound. The pipeline is only as good as its weakest step.

My rule: Add explicit checkpoints between stages. If Agent B receives garbage from Agent A, it should reject and request a retry rather than trying to work with bad input. We learned this the hard way on a project – a research agent that returned vague summaries poisoned every downstream step.

2. Parallel execution

Multiple agents work on the same problem independently, then results are merged. Think: three agents each review a PR from a different angle (logic, security, performance), and a synthesis step combines their findings.

When it works: Tasks where multiple perspectives add value. Review workflows. Risk assessment. Brainstorming alternatives.

When it breaks: The synthesis step. Merging conflicting agent outputs is hard. If your merge strategy is “average the results” or “take the longest response,” you’re losing the benefit of parallel execution.

My rule: Define merge rules explicitly. Conflicts get escalated to a human or resolved by a designated arbiter agent with clear criteria.

3. Hierarchical orchestration

A coordinator agent breaks work into subtasks, delegates to specialist agents, and assembles the final result. This is the manager-worker pattern.

When it works: Large, complex tasks that can be decomposed. Project planning. Multi-file code generation. Report compilation from multiple data sources.

When it breaks: The coordinator overfits to its initial plan. If subtask results invalidate the plan, the coordinator needs to replan. Most implementations don’t handle this well – the coordinator stubbornly follows the original decomposition even when evidence says it shouldn’t.

My rule: Give the coordinator explicit replanning triggers. If a subtask fails or returns unexpected results, the coordinator reassesses before continuing.

4. Debate and verification

Two or more agents argue opposing positions. A judge agent evaluates the arguments and makes a final call. This pattern surfaces assumptions and edge cases that a single agent misses.

When it works: Decisions with genuine uncertainty. Code review where the tradeoffs are unclear. Risk assessment where different framings lead to different conclusions.

When it breaks: Agents generate artificial disagreement to fill their roles. Or the judge defaults to the more verbose argument. The pattern needs real divergence to add value.

My rule: Only use debate when the single-agent answer has measurable uncertainty. If the task has a clear correct answer, debate is overhead.

Pattern comparison

Pattern	Best for	Failure mode	Complexity	Agent count
Sequential pipeline	Step-by-step workflows	Error compounding	Low	2-4
Parallel execution	Multi-perspective review	Bad merge logic	Medium	3-5
Hierarchical	Large decomposable tasks	Rigid planning	High	3-8
Debate/verification	Uncertain decisions	Artificial disagreement	Medium	2-3

The coordination basics nobody talks about

The pattern is the easy part. The hard part is the coordination contract between agents. Every agent needs:

Defined inputs and outputs. Not “whatever seems relevant.” A schema. Required fields. Validation at the boundary.
Pass/retry/escalate criteria. What does the next agent do when it receives bad input? Accept it? Reject it? Ask for clarification? This must be explicit.
Short, stable context. Don’t pass the entire conversation history between agents. Pass a structured summary of what the previous agent decided and why. Long contexts lead to confusion and drift.
Decision logging. Every agent decision gets logged with reasoning. When the final output is wrong, you need to trace which agent made the bad call and why.

Without these, adding agents just multiplies failure modes. You get more components and less reliability. I’ve seen teams build five-agent systems that performed worse than a single well-prompted model because coordination overhead drowned out the benefits.

When to not use multi-agent

Most of the time.

I’m serious. A single agent with good tools, clear instructions, and a verification step handles 80% of use cases better than a multi-agent system. Multi-agent adds value when:

The task genuinely requires different capabilities or perspectives
Verification needs to be independent from generation
The work can be parallelized for speed
No single prompt can hold all the necessary context

If none of those apply, you’re adding complexity for its own sake.

How I start

Two agents. One that does the work. One that checks the work. That’s it. The generator-verifier pattern is the simplest multi-agent setup and the one with the highest reliability improvement per unit of added complexity.

Once the generator-verifier is stable and measured, you can consider whether splitting the generator into specialized sub-agents would help. Usually it doesn’t. But when it does – when you have distinct expertise domains that benefit from isolation – the improvement is real.

Start simple. Add complexity only when you can measure the improvement. Orchestration isn’t a goal. Reliability is.

AI Security: Same Principles, New Attack Surface

Mon, 28 Apr 2025 00:00:00 +0000

Quick take

Treat every AI endpoint like an exposed API that can be tricked into doing things you didn’t intend. Separate trusted instructions from untrusted content. Constrain tool access. Filter outputs for leakage. Monitor like the system is adversarial, because someone will make it so. Security, stability, performance – in that order.

During a national cyber-defense exercise a few years back, we ran a scenario where the opposing team compromised an automated decision support system. They didn’t hack the system in the traditional sense. They fed it manipulated data that changed its recommendations. The system worked exactly as designed. It just made the wrong decisions because its inputs were poisoned.

That scenario has stayed in my head this year because it’s exactly what prompt injection does to AI systems. The model works as designed. The inputs are manipulated. The outputs are wrong. And the system has no idea.

The threat model isn’t theoretical

Every AI system I see in production combines three things that should make security engineers nervous:

Untrusted user input goes directly into the model context.
Retrieved content from external sources is treated as context, not as untrusted data.
Tool access allows the model to take actions with real consequences.

Mix those three together and you get a system where a malicious string in a support ticket can, in the worst case, cause the model to call an internal API, exfiltrate data, or take an action that nobody authorized.

This isn’t hypothetical. I’ve seen prompt injection succeed against production systems. In one case, a user embedded instructions in a document that was retrieved during RAG. The model followed those instructions and included internal system prompt details in its response. The user got a screenshot and posted it on social media. Not a great day for that team.

Where the attacks land

Prompt injection is the big one. Direct injection, where the user types instructions that override the system prompt, is the obvious case. Indirect injection is scarier: malicious instructions embedded in retrieved documents, emails, or web pages that the model processes. The model can’t reliably distinguish “instructions from the developer” from “instructions from an attacker hiding in the data.”

Data leakage is the second big one. Models will echo back their system prompts, retrieved context, or other users’ data if you ask the right way. Output filtering catches some of this. But the model is creative, and attackers are more creative. Assume that anything in the context window can potentially appear in the output.

Tool misuse is the emerging threat. As AI systems gain access to tools – databases, APIs, file systems, deployment pipelines – the blast radius of a successful injection grows dramatically. A chatbot that can only generate text is annoying when compromised. A chatbot that can query your database and call your APIs is dangerous.

Defenses that actually work

I apply the same layered defense approach I learned in the national cyber-defense context, adapted for AI systems.

Separate trusted from untrusted

The most important architectural decision is maintaining a clear hierarchy of instructions. System prompts are trusted. User input is untrusted. Retrieved content is untrusted. Tool outputs are semi-trusted. The model should have explicit markers for these boundaries, and the system should be designed so that untrusted content can’t override trusted instructions.

This doesn’t fully prevent injection, but it raises the bar. Label everything. Normalize inputs. Strip or escape known injection patterns before they enter the context.

Constrain tool access

Every tool an AI system can access should follow least privilege . Read-only by default. Write operations require explicit confirmation. Destructive operations require human approval. Scope queries to the current user’s data. Rate limit everything.

Our MCP tool servers enforce permission checks at the tool level, not just at the connection level. A user might be allowed to query their own deployment status but not trigger a rollback. The model never gets to make that decision – the permission boundary does.

Filter outputs aggressively

Output filtering is your last line of defense. Check every response for:

System prompt fragments or internal instructions
Personally identifiable information that shouldn’t appear
Known attack patterns (encoded instructions, suspicious URLs)
Content that violates your safety policies

This isn’t foolproof. Models are remarkably good at paraphrasing things they shouldn’t say. But filtering catches the low-hanging fruit and raises the cost of attack.

Monitor for the weird

Traditional security monitoring looks for known attack patterns. AI security monitoring also needs to detect behavioral anomalies:

Sudden changes in tool call patterns
Requests that are unusually long or contain encoded content
Responses that include fragments of system prompts
Spikes in refusal rates or cost
Users who systematically probe the model’s boundaries

On one project, we caught an attacker by noticing a user who submitted 200 requests in an hour, each slightly different, all testing variations of the same injection technique. Traditional rate limiting didn’t flag it because the request volume was below the threshold. Behavioral analysis did.

The architecture matters more than the detection

Here’s the uncomfortable truth: you can’t fully prevent prompt injection with current techniques. The model is a general-purpose text processor that follows instructions, and there’s no reliable way to make it distinguish between legitimate instructions and injected ones.

What you can do is limit the blast radius. Isolate AI services from core systems. Scope permissions narrowly. Put human approval gates on sensitive actions. Log everything. Make the system auditable.

This is the same defense-in-depth approach we apply to every exposed system. The fact that the attack vector is natural language instead of SQL or shellcode doesn’t change the principles. It changes the surface.

What I tell every team

Security, stability, performance – in that order. That’s my priority stack for AI systems, same as any other system I build.

Start by assuming the model will be tricked. Design your system so that a successful trick does as little damage as possible. Then add detection. Then add response playbooks . Then drill them.

The teams that treat their AI systems like exposed APIs with real blast radius will be fine. The teams that treat them like internal tools with trusted inputs will learn an expensive lesson. I’d rather they learned from this post than from their first incident.

Testing AI Where It Actually Runs

Mon, 14 Apr 2025 00:00:00 +0000

Quick take

Your eval suite passes. Your staging environment looks good. Your AI feature will still break in production because real users do things your test set never imagined. Shadow it, canary it, measure it, and make every rollout reversible. Evidence before confidence.

I wrote about testing in production back in 2019. The core thesis hasn’t changed: staging lies to you. What has changed is that AI makes the lying worse.

Traditional software either works or it doesn’t. The test passes or fails. The API returns the right data or throws an error. AI features exist in a gray zone where the output is almost always plausible, sometimes correct, and occasionally dangerous. Your test suite can’t cover this space. Production can.

Why offline evals aren’t enough

Every AI project should have an eval suite . I’ve been saying this for over a year. But evals test known scenarios. Production surfaces the unknown ones.

Real users send inputs your test set never imagined. They misspell things. They paste in multi-language text. They include personally identifiable information that triggers different model behavior. They ask questions that are ambiguous in ways your eval prompts aren’t.

At one company, their AI support agent passed every eval with flying colors. In production, users started treating it like a search engine – pasting in order numbers and expecting it to look up status. The model happily hallucinated order details instead of saying “I can’t do that.” The eval suite had no test case for “user treats chatbot like a database query tool.” Production found it in the first hour.

Shadow mode first

Before any AI change touches a real user, shadow it. Run the new version in parallel with the current one, compare outputs, and log everything. The user only sees the current version.

Here’s the pattern I use in Go:

type ShadowRunner struct {
	current   ModelClient
	candidate ModelClient
	logger    *ShadowLogger
}

func (s *ShadowRunner) Execute(ctx context.Context, req Request) (Response, error) {
	// Current model serves the user
	resp, err := s.current.Complete(ctx, req)

	// Candidate runs in background -- never blocks the user
	go func() {
		candidateCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()

		candidateResp, candidateErr := s.candidate.Complete(candidateCtx, req)
		s.logger.LogComparison(ShadowResult{
			RequestID:      req.ID,
			CurrentOutput:  resp,
			CandidateOutput: candidateResp,
			CandidateErr:   candidateErr,
			Match:          s.compareOutputs(resp, candidateResp),
		})
	}()

	return resp, err
}

The shadow logger captures every comparison. I review divergences daily during the shadow period. If the candidate produces different outputs, I want to understand whether those differences are improvements, regressions, or neutral changes.

The shadow period should last at least a week. Longer for high-traffic services. The goal is to see enough real-world input diversity to have confidence in the change.

Canary with kill switches

Once shadow results look good, move to a canary deployment . Route a small percentage of real traffic to the new version and monitor closely .

type CanaryRouter struct {
	current     ModelClient
	candidate   ModelClient
	percentage  atomic.Int32
	qualityGate *QualityGate
}

func (c *CanaryRouter) Route(ctx context.Context, req Request) (Response, error) {
	if c.shouldCanary(req.UserID) {
		resp, err := c.candidate.Complete(ctx, req)
		if err != nil || !c.qualityGate.Check(resp) {
			// Automatic fallback to current
			return c.current.Complete(ctx, req)
		}
		return resp, err
	}
	return c.current.Complete(ctx, req)
}

func (c *CanaryRouter) shouldCanary(userID string) bool {
	hash := fnv.New32a()
	hash.Write([]byte(userID))
	return int(hash.Sum32()%100) < int(c.percentage.Load())
}

The QualityGate is the part most teams skip. It checks the candidate response against basic quality criteria before serving it. If the response fails the gate, the user gets the current version transparently. No harm done.

I start at 1%. Watch for a day. If quality signals hold, move to 5%. Then 25%. Then 100%. Each step gets at least a few hours of observation. If anything looks off at any step, roll back to the previous percentage. No drama.

The hash-based routing is important: the same user always gets the same version within a rollout step. This prevents confusing experiences where the same user gets different quality outputs on consecutive requests.

What to measure during rollout

Three categories of signals, checked at every rollout step:

Quality signals. Task success rate on your eval set. But also: user re-prompts (did they have to ask again?), abandonment rate (did they give up?), explicit negative feedback. These are the signals your eval suite can’t give you.

Safety signals. Refusal rate. Policy trigger count. Anything flagged by your content filters. If the candidate model refuses more or fewer requests than the current one, investigate before expanding.

Operational signals. Latency p50 and p95 by workflow. Token usage. Cost per request. Error rates. A model change that improves quality but doubles cost might not be a net win. Make that trade-off explicit.

type RolloutMetrics struct {
	Version         string
	QualityScore    float64
	RefusalRate     float64
	P50Latency      time.Duration
	P95Latency      time.Duration
	CostPerRequest  float64
	ErrorRate       float64
	UserRepromptRate float64
}

func (m *RolloutMetrics) PassesGate(baseline RolloutMetrics) bool {
	if m.QualityScore < baseline.QualityScore*0.95 {
		return false // quality regression > 5%
	}
	if m.ErrorRate > baseline.ErrorRate*1.5 {
		return false // error rate increase > 50%
	}
	if m.P95Latency > baseline.P95Latency*2 {
		return false // latency doubled
	}
	return true
}

These thresholds aren’t magic numbers. They’re product decisions. A 5% quality regression might be acceptable if cost drops by 40%. A latency doubling might be fine for a background task but fatal for a chat interface. Define them before the rollout starts, not during.

The one-change rule

Never change the model and the prompt at the same time. If quality drops, you won’t know which change caused it. This sounds obvious. I’ve watched four different teams make this mistake in the last three months.

Ship the prompt change. Measure. Ship the model change. Measure. If you must change both, do the prompt first because it’s cheaper to roll back.

Same goes for retrieval changes, system message changes, and tool configuration changes. One variable at a time. Anything else is debugging in the dark.

Holdout baselines

Keep a small, stable slice of traffic permanently on a known-good version. This is your holdout. It tells you whether quality changes are due to your changes or due to shifts in user behavior, input distribution, or upstream data.

Without a holdout, slow regressions look like normal variance. You won’t notice a 2% quality drop per week because no individual week looks bad. But your holdout will show the cumulative drift loud and clear.

What matters

Testing AI in production isn’t reckless. Shipping AI without testing it in production is reckless. Offline evals give you a baseline. Shadow mode gives you confidence. Canaries give you safety. Holdouts give you ground truth.

Every rollout should be reversible, measurable, and attributable to a single change. That isn’t a testing philosophy. That’s engineering discipline applied to a system that fails in ways your test suite can’t anticipate.

Your AI System Looks Healthy. It Is Not.

Mon, 31 Mar 2025 00:00:00 +0000

Here’s a scenario I’ve seen three times this year.

An AI-powered feature is in production. Uptime: 99.9%. Latency: nominal. Error rate: near zero. Dashboards are green. Everyone is happy.

Except the answers are wrong 15% of the time, and nobody knows because nothing is measuring answer quality. The system is healthy. The outputs are not.

This is the fundamental gap in AI observability . Traditional monitoring tells you whether the service is running. It does not tell you whether the service is useful.

Why AI systems fail silently

A classic API returns structured data. If the response is malformed, you get a parse error. If the logic is wrong, a test catches it. The failure modes are usually loud and obvious.

AI systems fail quietly. The model returns a perfectly formatted response with a confident tone and completely wrong content. The HTTP status is 200. The latency is fine. The JSON is valid. And the user just got told that their refund was processed when it wasn’t.

At a fintech startup, we had a similar problem with our financial news summarization pipeline, long before the current AI wave. The summaries looked plausible but occasionally attributed quotes to the wrong CEO or mixed up fiscal quarters. The system was “working” by every operational metric. The outputs were unreliable. We caught it only because a user complained, not because monitoring flagged it.

The lesson stuck with me. You can’t monitor AI like you monitor a REST API. You need different signals.

The signals that actually matter

I use a simple framework with five categories. If you are not tracking all five, you have blind spots.

Traceability. For every response, you need to know: which model, which prompt version, which retrieved context, which tool calls. If you can’t reconstruct why the model said what it said, you can’t debug a bad answer. You’re just guessing. I store a trace object alongside every response that includes model ID, prompt hash, retrieval IDs, and tool call logs. When something goes wrong, the trace is the first thing I pull.

Quality signals. This is the hard one. You need some measure of whether the output was good. Heuristic checks catch obvious failures: empty responses, responses that are too long or too short, and responses that contain known-bad patterns. Sampled evaluation catches the subtle failures: a human or a second model scores a random slice of outputs against a rubric. Neither is perfect. Together they cover enough ground.

Cost per outcome. Not cost per request, cost per successful outcome. A system that gets it right on the first try costs less than one that needs three retries and a human escalation. Track the full cost of getting to a good answer, including retries, fallbacks, and human review. This number will surprise you.

Safety and policy. Refusal rates, blocked content, policy trigger counts. If your refusal rate spikes, something changed – either the inputs or the model behavior. If it drops to zero, something might be wrong too. These are canary signals.

Operational basics. Latency percentiles by workflow (not globally – global averages hide everything), error rates with reason codes, token usage trends. The same stuff you track for any API, but broken down by the AI-specific dimensions that matter.

The prompt versioning problem

Here is something that bites almost every team. Someone changes a prompt. Quality drops. Nobody connects the two events because the prompt change was not tracked alongside the quality metrics.

Treat prompts as production code. Version them. Deploy them through your normal release process. Tag every response with the prompt version that produced it. When quality dips, the first question should be: what changed since the last known-good state?

I version prompts in the same repo as the service code. A prompt change gets a PR, a review, and a run against the eval suite before it hits production. It sounds like overkill until the first time it prevents a regression. Then it sounds obvious.

Keep it lean

The temptation is to build a dashboard for everything. Do not. Start with the minimum set of signals that lets you answer one question: “A user reported a bad answer. Can I explain why it happened and prevent it from happening again?”

If you can answer that question end-to-end, your observability is good enough. If you can’t, no amount of dashboards will save you.

Log the trace. Track quality. Version your prompts. Measure cost per outcome, not cost per request. That’s the baseline. Everything else is optimization.

MCP in Practice: Building Tool Servers in Go

Mon, 17 Mar 2025 00:00:00 +0000

Quick take

MCP is a real protocol that solves a real problem: the N-times-M integration matrix between AI clients and tool servers. I built one in Go. The protocol layer is clean. The hard parts are still auth, permissions, and not handing the model a footgun. If you’re building tool-heavy AI systems, MCP is worth investing in now.

I’ve been building tool integrations for AI systems since early 2024. Every project, the same pattern: custom connector, custom auth wrapper, custom request/response format, custom error handling. Multiply that by every tool and every AI provider and you get an integration matrix that grows quadratically. It’s the microservices API sprawl problem all over again.

MCP – Model Context Protocol – is Anthropic’s answer: a standard protocol for connecting AI models to external tools and data sources. Instead of N clients times M tools worth of custom integrations, you get N clients and M servers all speaking the same language.

I spent the last few weeks building an MCP server in Go to see whether the protocol lives up to the pitch. Here’s what stood out.

What MCP actually is

Strip away the marketing and MCP is a JSON-RPC-based protocol with three core concepts:

Tools. Functions the model can call. Each tool has a name, a description, and a JSON Schema for its inputs. The model decides when to call a tool based on the description.

Resources. Data the model can read. Think files, database records, API responses. Resources have URIs and can be listed or read by the client.

Prompts. Reusable prompt templates that servers can expose. Less interesting for most production use cases, but useful for standardizing common interactions.

The transport layer is deliberately simple: stdio for local servers, HTTP with SSE for remote ones. The protocol handles capability negotiation, so a client can discover what a server offers at connection time.

Building an MCP server in Go

Here’s a minimal MCP tool server that wraps a database query. This is roughly what I built for an internal tool in a recent project that lets the AI assistant query deployment status.

package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/mark3labs/mcp-go/mcp"
	"github.com/mark3labs/mcp-go/server"
)

type DeploymentStatus struct {
	Service     string `json:"service"`
	Version     string `json:"version"`
	Environment string `json:"environment"`
	Status      string `json:"status"`
	DeployedAt  string `json:"deployed_at"`
}

func main() {
	s := server.NewMCPServer(
		"deployment-status",
		"1.0.0",
		server.WithToolCapabilities(true),
	)

	tool := mcp.NewTool("get_deployment_status",
		mcp.WithDescription("Get the current deployment status for a service in a given environment"),
		mcp.WithString("service", mcp.Required(), mcp.Description("Service name")),
		mcp.WithString("environment", mcp.Required(), mcp.Description("Target environment: staging or production")),
	)

	s.AddTool(tool, handleGetDeploymentStatus)

	if err := server.ServeStdio(s); err != nil {
		log.Fatalf("server failed: %v", err)
	}
}

func handleGetDeploymentStatus(ctx context.Context, req mcp.CallToolRequest) (*mcp.CallToolResult, error) {
	service, _ := req.Params.Arguments["service"].(string)
	env, _ := req.Params.Arguments["environment"].(string)

	if env != "staging" && env != "production" {
		return mcp.NewToolResultError("environment must be 'staging' or 'production'"), nil
	}

	status, err := queryDeploymentDB(ctx, service, env)
	if err != nil {
		return mcp.NewToolResultError(fmt.Sprintf("query failed: %v", err)), nil
	}

	data, _ := json.Marshal(status)
	return mcp.NewToolResultText(string(data)), nil
}

A few things to note. The tool definition includes a JSON Schema for inputs, which means the client can validate before calling. The handler returns structured results or errors. The server handles all the JSON-RPC plumbing – capability negotiation, method routing, error formatting. You just write the handler.

This is roughly 50 lines of actual logic. The equivalent custom integration I had before was about 200 lines, with its own HTTP server, auth middleware, and request parsing. That reduction matters when you have 15 tools to wrap.

Adding auth and permissions

The protocol itself doesn’t define authentication. That’s intentional – different deployments have different auth requirements. But it means you have to solve it yourself, and this is where most teams will spend their time.

Here’s the pattern I use: a middleware wrapper that checks permissions before the tool handler runs.

type PermissionChecker struct {
	allowedTools map[string][]string // tool -> allowed roles
}

func (pc *PermissionChecker) Wrap(toolName string, handler server.ToolHandlerFunc) server.ToolHandlerFunc {
	return func(ctx context.Context, req mcp.CallToolRequest) (*mcp.CallToolResult, error) {
		user := userFromContext(ctx)
		if user == nil {
			return mcp.NewToolResultError("authentication required"), nil
		}

		allowed := pc.allowedTools[toolName]
		if !hasAnyRole(user, allowed) {
			log.Printf("DENIED: user=%s tool=%s roles=%v", user.ID, toolName, user.Roles)
			return mcp.NewToolResultError("permission denied"), nil
		}

		log.Printf("ALLOWED: user=%s tool=%s", user.ID, toolName)
		return handler(ctx, req)
	}
}

Every tool call gets logged with the user identity, whether it was allowed or denied, and the arguments (redacted where necessary). This isn’t optional. If an AI system can call tools that read your database or modify your infrastructure, you need an audit trail.

For remote MCP servers over HTTP, I add standard bearer token auth at the transport layer. For local stdio servers, the auth context comes from the parent process. Either way, the permission check happens at the tool level, not just at the connection level. A user might be allowed to read deployment status but not trigger a rollback.

The security conversation

This is the part that keeps me up at night. MCP makes it easy to give an AI model access to tools. Maybe too easy. The protocol doesn’t enforce:

Read vs. write separation. A tool that reads data and a tool that deletes data look the same to the protocol. You have to enforce the distinction.
Rate limiting. Nothing stops the model from calling a tool a thousand times in a loop. Build your own limits.
Input sanitization. The model generates the tool arguments. If those arguments end up in a SQL query or a shell command, you’re one prompt injection away from a bad day.
Blast radius. A tool that queries one record is different from a tool that dumps an entire table. Scope your tools narrowly.

I enforce a simple rule: every tool that can write or modify gets a confirmation step that goes back to the user. The model can propose the action, but a human approves it. For read-only tools, I still scope the query to the current user’s data and add rate limits.

func handleTriggerRollback(ctx context.Context, req mcp.CallToolRequest) (*mcp.CallToolResult, error) {
	service, _ := req.Params.Arguments["service"].(string)
	env, _ := req.Params.Arguments["environment"].(string)

	// Never auto-execute destructive actions
	return mcp.NewToolResultText(fmt.Sprintf(
		"CONFIRMATION REQUIRED: Roll back %s in %s to previous version? "+
			"This action requires human approval.",
		service, env,
	)), nil
}

This is the same principle from my national cyber-defense days: least privilege, explicit authorization, and comprehensive auditing . The fact that the agent is an AI model doesn’t change the security model. If anything, it makes it more important, because the model can be manipulated through prompt injection in ways a human user can’t.

Where MCP shines

Tool portability. I built the deployment status server once. It works with Claude, with our internal assistant, and with any future client that speaks MCP. That’s the whole pitch, and it delivers.

Discovery. A client can connect to a server and ask “what can you do?” The response is machine-readable and includes schemas. This means the AI model gets accurate tool descriptions automatically instead of relying on hardcoded prompts.

Composability. An AI client can connect to multiple MCP servers simultaneously. One for deployments, one for monitoring, one for documentation. Each server is independently deployable and testable. This is the microservices pattern applied to AI tool access, with the same benefits and the same risks.

Where it doesn’t

No standard auth. Every deployment rolls its own. This will improve, but right now it’s extra work.

Ecosystem maturity. The Go ecosystem is solid thanks to mcp-go, but tooling for testing, debugging, and monitoring MCP interactions is still young. I wrote my own trace logger.

Complexity budget. MCP is one more protocol layer to understand, debug, and operate. For a team with two tools, the overhead might not be worth it. For a team with ten tools across multiple AI clients, it pays for itself quickly.

Should you adopt it now

If you’re building AI systems that call tools – and increasingly, every AI system does – start with one server. Pick your simplest, most-used tool. Wrap it in MCP. Test it against a real client. Measure the integration effort against your current custom approach.

From what I’ve seen, MCP cut tool integration time roughly in half and made our tools testable in isolation for the first time. The security work is the same either way – you have to solve auth and permissions regardless of protocol. MCP just standardizes everything else.

The protocol is real. The ecosystem is growing. The hard problems are still hard. But the easy problems – discovery, invocation, transport – are solved. That’s enough to make it worth building on.

AI Governance That Does Not Suck

Mon, 03 Mar 2025 00:00:00 +0000

Nearly every enterprise has an AI governance document. Most of them are useless.

Not because the content is wrong. Because nobody reads it. Because it was written by a committee that has never shipped an AI feature. Because it treats governance as a gate instead of a guardrail, and engineers respond to gates the way water responds to dams – they find a way around.

I’ve watched teams at large telcos spend six weeks in governance review for an internal summarization tool that touches no customer data. Meanwhile, a different team ships a customer-facing chatbot with no review at all because nobody told them they were supposed to ask. That’s what governance failure looks like: not the absence of rules, but the absence of practical, enforceable, proportional rules.

What governance should actually do

Three things. That’s it.

Define what’s allowed, with conditions. Not a blanket “AI is approved.” Not a blanket “AI requires review.” A clear mapping from risk level to requirements.
Match oversight to risk. An internal tool that summarizes meeting notes doesn’t need the same review as a system that makes lending decisions. If your governance process can’t tell the difference, it’s broken.
Provide evidence that controls work. Not a signed-off PDF from six months ago. Living evidence: monitoring dashboards, automated checks, audit trails.

Anything beyond those three outcomes is compliance theater.

Risk tiers are the whole game

The simplest model that works:

Low risk: Internal tools, no customer data, no decisions with real consequences. Team-level approval. One-page system card. Basic monitoring. Ship it.

Medium risk: Customer-facing features, data processing, content generation. Formal review. Testing against an eval set . Documented safeguards. Scheduled re-checks.

High risk: Systems that make decisions affecting people’s money, health, access, or rights. Executive visibility. Human oversight. Continuous monitoring. No exceptions.

The tier matters less than the discipline of routing every AI deployment through the right path every time. At one company, we built a simple intake form – five questions, two minutes – that automatically assigned a risk tier and told teams exactly what they needed before shipping. Governance review time dropped from weeks to days. Compliance improved because teams actually followed the process.

The system card

Every AI deployment gets a one-page system card. It should answer:

What is this system allowed to do? What is it explicitly not allowed to do?
What data does it touch and how is that data protected?
What safeguards exist and how are they tested?
Who owns this system when something goes wrong?

That last question is the most important. If nobody has clear ownership, your incident response becomes a group chat full of confusion. I’ve seen that play out too many times.

Governance isn’t a one-time event

Models change. Data drifts. Usage expands beyond the original scope. A governance review from January is stale by March. Build automated checks: version tracking, usage monitoring, and alerts when behavior changes. Treat governance the way you treat infrastructure – continuously, not ceremonially.

The organizations that get AI governance right will move faster than the ones that skip it. Not because rules are fun, but because clear rules eliminate the ambiguity that slows everything down.

Video Understanding AI: What Actually Works

Mon, 17 Feb 2025 00:00:00 +0000

Last month, a team asked me to evaluate whether AI could replace their manual video review process. They had four people watching customer support call recordings, tagging issues, and writing summaries for eight hours a day. I said yes and built a prototype.

The prototype worked beautifully on the first three test clips. Then I ran it against their actual library and it confidently told me a customer was “demonstrating frustration through aggressive keyboard usage.” The customer was typing their account number. The model was hallucinating emotional context from audio artifacts.

That experience captures video AI right now. It’s genuinely capable. It’s also confidently wrong in ways that are hard to predict and even harder to catch at scale.

Video isn’t just “lots of images”

The fundamental challenge with video understanding is time. An image model looks at a single moment. A video model has to track what happened, in what order, and how things changed. That temporal reasoning is where models still struggle.

The practical failure modes I’ve seen:

Temporal confusion. The model describes events out of order or merges two separate moments into one. This is especially bad with longer clips.
Missing key moments. The model summarizes the overall vibe of a clip but misses the specific 10-second window where the important thing happened.
Overconfidence. The model narrates with authority even when it’s guessing. No hedging. No “I’m not sure.” Just wrong with conviction.

The pipeline that actually works

Forget single-prompt video understanding. It doesn’t scale. What works is a pipeline that breaks the problem into stages you can debug independently.

Here’s the architecture I landed on:

Step 1: Extract audio and transcribe. If the video has spoken content, the transcript is your primary signal. Audio transcription is a solved problem, and the output is reliable. Start here.

Step 2: Sample frames intelligently. Not every N seconds. Use scene detection to identify transitions, then sample the first frame of each scene plus any frame with significant visual change. This reduces the frame count by 60-80% without losing meaningful content.

Step 3: Analyze frames with context. Feed each frame to a vision model along with the surrounding transcript text. The transcript grounds the visual analysis and prevents the model from inventing narratives that don’t match what was said.

Step 4: Synthesize with timestamps. Merge the transcript-grounded visual analysis into a structured timeline. Every claim in the summary must reference a specific timestamp. If the model can’t cite when something happened, it probably didn’t happen.

The key insight: audio-first, video-second. The transcript is your source of truth. The video adds context. Not the other way around.

Where it’s actually useful

After the initial disaster and a week of pipeline tuning, I found the sweet spots:

Meeting summaries with action items. Transcribe, extract decisions and action items, tag them with speaker and timestamp. This works well because the transcript carries most of the signal and the visual component (slides, screen shares) adds structure.

Content moderation. Checking video against a specific policy with concrete criteria. “Does this clip contain product logos?” “Is the speaker reading from a teleprompter?” Questions with binary answers that the model can ground in visual evidence.

Search and retrieval. “Find the part of this recording where they discuss pricing.” Natural language search over video libraries works surprisingly well when you have good transcripts and frame-level annotations.

Compliance review. Structured checks against a rubric. Did the agent identify themselves? Did they read the required disclosure? Was the customer’s consent recorded? This works because the criteria are specific and verifiable.

Where it isn’t ready

Long-form video without speech. Surveillance-style footage. Anything where the important signal is subtle body language or spatial relationships. Anything where the model needs to count reliably or track specific objects across many frames.

Also, anything where a false positive has real consequences. If your video review pipeline flags a customer interaction as “hostile” and that triggers an HR process, you had better have a human in the loop.

Starting without overbuilding

Pick one use case. Keep clips under 10 minutes. Fix your output format before you start – structured JSON, not free-form prose. Build a gold set of 20-30 annotated clips and run every pipeline change against it.

The evaluation loop is everything. Without it, you’re optimizing by vibes, and vibes don’t catch temporal hallucinations.

Video AI is real and useful for the right problems. Just don’t let the first impressive demo convince you it’s ready for the hard ones.

AI Code Review Is Mostly Noise

Mon, 03 Feb 2025 00:00:00 +0000

I’m going to say something that will annoy AI tooling vendors: most AI code review output is garbage.

Not all of it. Maybe 15-20% is genuinely useful. But the other 80% is vague, style-obsessed, context-free commentary that would get a human reviewer told to try harder. “Consider adding error handling here.” Thanks. I hadn’t considered that. In Go. Where every third line is error handling.

I’ve been running AI review on PRs across production codebases for months. I wanted it to work. I really did. A tireless reviewer that catches logic bugs and security issues while humans focus on architecture and design ? Sign me up. The reality is more complicated.

What it actually catches

When AI code review works, it works well. The wins are real:

Logic errors on changed paths. The model is good at spotting off-by-one errors, nil pointer risks, and missing edge cases in the specific lines that changed. It caught a race condition in a Go channel handler that three human reviewers missed. That alone justified the experiment.

Security surface area. SQL injection in a new endpoint. Hardcoded credentials in a test file that was about to be committed. An overly permissive CORS config. These are pattern-matching tasks, and models are decent at pattern matching.

Copy-paste bugs. Someone copies a function, changes three of four parameters, and forgets the fourth. The model catches this reliably. Humans miss it because we read what we expect to see.

Where it falls apart

Business context. The model doesn’t know why your checkout flow has that weird retry logic. It doesn’t know that the “redundant” nil check exists because a specific vendor API lies about its response types. It doesn’t know your system’s history. So it flags things that aren’t problems and misses things that are.

Large diffs. Anything over a few hundred lines and the model loses the thread. It starts making generic observations instead of specific findings. “This function is complex and could benefit from refactoring.” Really helpful on a 2,000-line migration PR.

Style opinions nobody asked for. “Consider using a more descriptive variable name.” “This comment could be more detailed.” “Consider extracting this into a separate function.” If I wanted a style cop, I’d configure a linter. AI review should find bugs, not police style.

How I actually use it

After months of tuning, here’s what works.

Scope it to the diff. Don’t let the model browse the entire repo. Give it the changed lines and maybe the immediate surrounding context. The more you feed it, the more generic the output gets.

Demand specifics. My review prompt is aggressive about this:

Review this diff. For each finding:
- Exact line number
- Severity: critical / warning / info
- What could fail at runtime
- A concrete fix

Skip style suggestions. Skip anything a linter would catch.
If nothing is wrong, say nothing.

That last line matters. Without it, the model will always find something to say because it’s trained to be helpful. Sometimes the most helpful thing is silence.

Track the hit rate. I log every AI review comment and whether the human reviewer accepted, dismissed, or ignored it. Our current acceptance rate is about 22%. That means 78% of AI review output is noise. Not great. But the 22% that lands includes some of the highest-severity findings in our review history.

Never gate merges on it. AI review is advisory. A comment. A suggestion. The human reviewer decides. The moment you make AI review a merge blocker, you’ve handed authority to a system that’s wrong four times out of five. Don’t do this.

The uncomfortable math

AI code review costs money. Token costs, API calls, latency in your CI pipeline. At our current volume, it adds about 15-30 seconds per PR and a few dollars per day. That’s cheap for the bugs it catches. But if you aren’t measuring hit rate, you have no idea whether it’s worth it.

Most teams set up AI review, get excited about the first few catches, and then never look at the numbers again. Six months later, developers have learned to ignore the comments entirely because most of them are noise. The tool becomes furniture.

What I actually want

I want AI code review that knows when to shut up. That understands the system well enough to distinguish a real bug from an intentional design choice. That can read a PR description and connect the changes to the stated intent.

We aren’t there yet. But the foundation is real. Scope it tight, demand specifics, measure ruthlessly, and never trust it to make decisions. It’s a second pair of eyes, not a senior engineer.

Reasoning Models in Production: A Practical Guide

Mon, 20 Jan 2025 00:00:00 +0000

Quick take

Don’t make reasoning models your default path. Route by complexity, run expensive calls async, set per-request budgets, and cache aggressively . The model is the easy part. The routing and cost control are where you earn your keep.

I spent the last month integrating reasoning models into a production service. The short version: they’re genuinely better at complex analysis tasks. The long version: they’ll wreck your UX and budget if you treat them like a drop-in replacement for fast models.

This post covers the architecture I landed on, with real Go code. When I started this work, most posts I found were hand-wavy “use async patterns” advice with zero implementation detail.

The problem, concretely

Standard LLM calls in our pipeline take 1-3 seconds. Reasoning model calls take 8-45 seconds. That’s not a rounding error. It’s a completely different product experience.

Cost scales the same way. A reasoning call can burn 10-50x the tokens of a standard call for the same input because the model does internal chain-of-thought before producing output. On a high-traffic endpoint, that adds up fast.

At one company, someone enabled a reasoning model as the default for their support chatbot. The monthly API bill went from $2,000 to $34,000 in three weeks. Most of those calls were “what are your business hours?” Not exactly a problem that requires deep reasoning.

When reasoning models actually help

I’ve found three categories where the latency and cost trade-off is worth it:

Multi-step analysis. Reviewing a contract clause, debugging a complex data pipeline, synthesizing information from multiple sources. Tasks where a wrong answer costs more than a slow answer.

Code review and debugging. Reasoning models catch logic errors and subtle bugs that fast models miss entirely. I use them in our CI pipeline for reviewing diffs on critical paths. Nobody cares if that takes 30 seconds.

Planning and decomposition. Breaking a complex task into subtasks, reasoning about dependencies, identifying risks. The model needs to hold a lot of context and think through implications.

Where they’re a waste: simple Q&A, classification, extraction, and anything high-volume or latency-sensitive. Route those to fast models and save money.

The routing layer

The core insight is simple: not every request deserves the same model . Here’s the router I use in Go:

type ComplexityLevel int

const (
	ComplexityLow ComplexityLevel = iota
	ComplexityMedium
	ComplexityHigh
)

type Router struct {
	fastModel      string
	reasoningModel string
	classifier     *ComplexityClassifier
}

func (r *Router) Route(ctx context.Context, req Request) (Response, error) {
	level := r.classifier.Assess(req)

	switch level {
	case ComplexityLow:
		return r.callModel(ctx, r.fastModel, req, defaultBudget)
	case ComplexityMedium:
		resp, err := r.callModel(ctx, r.fastModel, req, defaultBudget)
		if err != nil || resp.Confidence < 0.7 {
			return r.callModel(ctx, r.reasoningModel, req, premiumBudget)
		}
		return resp, nil
	case ComplexityHigh:
		return r.callModel(ctx, r.reasoningModel, req, premiumBudget)
	default:
		return r.callModel(ctx, r.fastModel, req, defaultBudget)
	}
}

The complexity classifier doesn’t need to be fancy. Ours uses a combination of input length, certain keywords (like “analyze”, “compare”, “debug”), and whether the request references multiple documents. A simple heuristic gets you 80% of the way there.

The medium-complexity path is where this gets interesting. Try the fast model first. If confidence is low, escalate to reasoning. This keeps costs down for tasks that turn out to be simpler than they look.

Async execution for expensive calls

Any reasoning model call that might take more than a few seconds shouldn’t block your HTTP handler. Here’s the pattern I use:

type Job struct {
	ID        string
	Status    string
	Request   Request
	Response  *Response
	CreatedAt time.Time
}

type AsyncExecutor struct {
	jobs   sync.Map
	router *Router
	notify func(jobID string, resp Response)
}

func (e *AsyncExecutor) Submit(ctx context.Context, req Request) (string, error) {
	job := &Job{
		ID:        generateID(),
		Status:    "pending",
		Request:   req,
		CreatedAt: time.Now(),
	}
	e.jobs.Store(job.ID, job)

	go func() {
		resp, err := e.router.Route(context.Background(), req)
		if err != nil {
			job.Status = "failed"
			return
		}
		job.Response = &resp
		job.Status = "completed"
		e.notify(job.ID, resp)
	}()

	return job.ID, nil
}

func (e *AsyncExecutor) Poll(jobID string) (*Job, bool) {
	val, ok := e.jobs.Load(jobID)
	if !ok {
		return nil, false
	}
	return val.(*Job), true
}

The caller gets a job ID back immediately. They can poll for status, or we can push a notification when it’s done. The UX team shows a “thinking deeply about this…” indicator. Users are surprisingly tolerant of waiting when you tell them why.

In production, you want a proper job queue (we use Redis) and persistence. But the pattern is the same.

Per-request cost budgets

This is the piece most teams skip, and it’s what prevents surprise bills. Every model call gets a token budget:

type Budget struct {
	MaxInputTokens  int
	MaxOutputTokens int
	MaxCostCents    int
	TimeoutSeconds  int
}

var (
	defaultBudget = Budget{
		MaxInputTokens:  4000,
		MaxOutputTokens: 1000,
		MaxCostCents:    5,
		TimeoutSeconds:  10,
	}
	premiumBudget = Budget{
		MaxInputTokens:  16000,
		MaxOutputTokens: 4000,
		MaxCostCents:    50,
		TimeoutSeconds:  60,
	}
)

func (r *Router) callModel(ctx context.Context, model string, req Request, budget Budget) (Response, error) {
	ctx, cancel := context.WithTimeout(ctx, time.Duration(budget.TimeoutSeconds)*time.Second)
	defer cancel()

	if req.EstimatedInputTokens() > budget.MaxInputTokens {
		return Response{}, fmt.Errorf("input exceeds budget: %d > %d tokens",
			req.EstimatedInputTokens(), budget.MaxInputTokens)
	}

	resp, err := r.client.Complete(ctx, model, req.ToPrompt(),
		WithMaxTokens(budget.MaxOutputTokens),
	)
	if err != nil {
		return Response{}, fmt.Errorf("model call failed: %w", err)
	}

	costCents := estimateCost(model, resp.Usage)
	if costCents > budget.MaxCostCents {
		log.Printf("WARN: call exceeded cost budget: %d > %d cents", costCents, budget.MaxCostCents)
	}

	return parseResponse(resp), nil
}

The budget is enforced before and during the call. Context timeouts prevent runaway reasoning. Token limits prevent ballooning inputs. Cost estimation after the call feeds monitoring and alerting.

At one company, we added a daily cost ceiling per endpoint. If the endpoint hits 80% of its daily budget by noon, it automatically downgrades all calls to the fast model for the rest of the day. Crude but effective.

Caching reasoning results

Reasoning model outputs are expensive to produce but often reusable. Same contract clause reviewed twice? Same code pattern analyzed in different PRs? Cache it.

type ResultCache struct {
	store *redis.Client
	ttl   time.Duration
}

func (c *ResultCache) GetOrCompute(ctx context.Context, key string, compute func() (Response, error)) (Response, error) {
	cached, err := c.store.Get(ctx, key).Result()
	if err == nil {
		var resp Response
		if json.Unmarshal([]byte(cached), &resp) == nil {
			resp.FromCache = true
			return resp, nil
		}
	}

	resp, err := compute()
	if err != nil {
		return resp, err
	}

	data, _ := json.Marshal(resp)
	c.store.Set(ctx, key, data, c.ttl)
	return resp, nil
}

The cache key is a hash of the input and model version. When the model changes, the cache invalidates naturally. We use a 24-hour TTL for most analysis tasks and a 1-hour TTL for anything time-sensitive.

This alone cut our reasoning model costs by about 40% on the code review pipeline, because many PRs touch similar patterns.

What I got wrong the first time

I initially tried to hide latency entirely. Bad idea. Users thought the system was broken. The moment we switched to explicit “this needs deeper analysis, checking now…” messaging, complaints dropped to zero. People understand that some questions take longer to answer well. Respect that.

I also over-routed to reasoning models early on. The classifier was too generous with “high complexity” ratings. We added a feedback loop: if a reasoning model call produces essentially the same output as a fast model would have (measured by comparing on a sample), downgrade the classification for that pattern. Within two weeks, our routing accuracy improved significantly.

The architecture, summarized

Request → Complexity Classifier → Router
                                    ├── Low → Fast Model (sync)
                                    ├── Medium → Fast Model → check confidence → maybe Reasoning Model
                                    └── High → Async Executor → Reasoning Model → Notify

All paths → Budget Enforcement → Cache Check → Model Call → Response

Treat reasoning models as a premium tier. Route intelligently. Execute async when latency matters. Budget every call. Cache reusable results. The model does the thinking. Your job is to make sure it only thinks when it needs to.

AI in 2025: The Year Discipline Wins

Mon, 06 Jan 2025 00:00:00 +0000

Quick take

Stop chasing model announcements. The teams that win in 2025 are the ones building evals, monitoring quality, and treating AI like infrastructure instead of magic. Discipline over heroics.

Every January, someone publishes a breathless AI predictions post. “This will be the year of AGI.” “Agents will replace developers.” “Multimodal everything.”

I’m not going to do that.

What I can tell you is what I see working with teams that are actually shipping AI to production. The pattern is clear: 2024 was the year everyone built demos. 2025 is the year those demos have to work.

The demo hangover

Here’s what happened to most AI projects last year. Someone built a prototype in a weekend. It was impressive. Leadership got excited. Budget appeared. Then the prototype hit real users, real data, and real edge cases, and everything got complicated.

I watched this play out at three different companies. Same story every time. The model was fine. The engineering around the model wasn’t.

Missing evaluation suites. No fallback paths. Prompts that drifted every time someone tweaked them. Cost tracking that amounted to “we’ll figure it out later.” The model was the easy part. Operating discipline was the hard part.

That’s the real trend for 2025. Not a new model. A new level of engineering rigor around models.

Reasoning gets interesting

Models that think before they answer are genuinely useful for a specific class of problems. Multi-step analysis. Code review. Debugging. Anything where you would rather wait 30 seconds for a correct answer than get a fast wrong one.

The trap is treating reasoning models as the default. They’re slower, more expensive, and overkill for 80% of requests. The smart move is routing: fast model for simple tasks, reasoning model for complex ones. I’ll write more about this in a couple of weeks.

Multimodal is real but boring

Image, audio, and text working together is no longer a research demo. It’s a feature. Internal tools are the clearest win – think document-processing pipelines that can read scanned forms, or support systems that understand screenshots.

The value isn’t in any single modality being amazing. It’s in combining them so the system has richer context. Boring. Useful. Exactly the kind of thing that makes money.

Evaluation-first development

The single biggest shift I keep pushing is simple: define success before you write the first prompt .

This sounds obvious. Almost nobody does it. Teams will spend weeks tuning prompts and then measure success by vibes. “It feels better.” “The CEO liked the demo.” That isn’t engineering. That’s hope.

What works: a fixed eval set, tested on every change, with clear pass/fail criteria. Treat prompts like code. Version them. Review them. Test them. I won’t ship a prompt change without running it against the eval suite. Period.

Governance stops being optional

Regulation is firming up. The EU AI Act is real. Enterprise clients are asking for audit trails, documentation, and risk tiers before they’ll sign contracts. If your AI system can’t explain what it does, what data it touches, and who’s responsible when it goes wrong, you’re in for a bad year.

This isn’t bureaucracy for its own sake. Good governance actually accelerates adoption because it turns “can we use AI for this?” from a six-week debate into a checklist. Risk tier low? Ship it. Risk tier high? Here’s exactly what you need before you ship.

Governance that blocks delivery is broken governance. Governance that makes yes safe and fast is a competitive advantage.

Agents: promising, overhyped

Agents that can execute multi-step tasks are improving fast. They’re also still brittle. Context changes break them. Domain boundaries confuse them. The failure modes are subtle and hard to detect.

The near-term play is constrained agents with explicit checkpoints. Not open-ended autonomy. Not “let the agent figure it out.” Clear scope, clear permissions, clear rollback. We learned this lesson with microservices a decade ago: autonomy without contracts is chaos.

What I’m ignoring

Any roadmap built on vendor keynote slides instead of product outcomes.
Prompt engineering tricks that can’t be tested, versioned, or reproduced.
“Autonomous” systems with no permission model, no audit trail, and no kill switch.
Anyone who says “just add AI” without specifying what success looks like.

What matters

The capabilities are real. The models will keep getting better. But the gap between “this works in a demo” and “this works in production at 3am on a Saturday” is where careers and companies are made.

Ruthless focus on the boring stuff. Evals. Monitoring. Cost tracking. Fallback paths. Governance. That’s the 2025 playbook.

The teams that treat AI like infrastructure – with the same rigor they bring to databases and deployment pipelines – will win. Everyone else will keep rebuilding demos.

2025 Will Reward the Boring Teams

Mon, 23 Dec 2024 00:00:00 +0000

The prediction game is easy. Models get better. Context windows get longer. Multimodal improves. Agents get more capable. Legal and compliance teams get more involved. None of this is surprising.

The harder question: what should you actually do differently?

Here’s my short answer, based on a year of working on AI across multiple organizations and watching the gap between teams that shipped and teams that stalled.

Stop Experimenting. Start Measuring.

If you’ve been running AI “experiments” for more than a quarter without a clear evaluation framework, you aren’t experimenting. You’re procrastinating. Experiments have hypotheses, metrics, and endpoints. Pilots have owners, success criteria, and deadlines.

Pick two or three use cases closest to production. Define success in numbers, not narratives. Build an evaluation set. Ship to real users with monitoring. Learn from data, not opinions.

This isn’t glamorous. It’s effective.

Build the Operational Foundation

The teams that will move fastest in 2025 are the ones building the plumbing now. Not new models. Not new frameworks. Plumbing.

An evaluation loop that runs regularly, not when someone remembers
Cost tracking with per-feature attribution so you know where money goes
Security controls for model access and data handling that satisfy your legal team
Model-agnostic interfaces so you can swap providers without rewriting your stack

Every one of these is boring. Every one of these is a prerequisite for scaling anything in 2025. Through Q4, I’ve been helping teams set up exactly this kind of infrastructure, and the teams that have it in place are already iterating faster than teams that built flashy demos without it.

Governance Isn’t the Enemy

AI governance has a reputation problem. Engineers hear “governance” and think “bureaucracy that slows us down.” That framing is wrong.

Lightweight governance – clear ownership for use case intake, a simple review path for legal and security risks, a cadence for measuring value and retiring weak experiments – actually accelerates shipping. It removes the ambiguity that causes teams to stall waiting for implicit approval.

The companies that move fastest all have some version of this. Not a committee. Not a 50-page policy document. A clear owner, a simple process, and a regular review. That’s it.

What I’m Betting On

Personally, I’m betting that 2025 is the year AI stops being a separate initiative and becomes part of how software gets built. Not a team. Not a project. A capability that lives inside existing workflows, owned by existing teams, measured by existing standards.

The companies that treat AI as special will keep producing expensive demos. The companies that treat it as normal – same code review, same evaluation, same cost accountability, same ownership – will ship things that last.

Discipline over heroics. Same as always.

2024: The Year AI Got Boring (In a Good Way)

Mon, 16 Dec 2024 00:00:00 +0000

Looking back at 2024, the word that keeps coming to mind is “normalization.” AI stopped being the shiny thing leadership wanted to announce and became the thing teams had to maintain. That shift changed everything about how I spent my year.

The Work

Most of my 2024 was hands-on. Telecom, food delivery, real-time communications, fintech – different industries and scales, but the same fundamental questions. How do we go from demo to production? How do we control costs? How do we measure whether this actually works?

The conversations changed dramatically between January and December. Early in the year, the question was what AI could do. By mid-year, it was what AI should do – which tasks justified the cost, the complexity, and the risk. By Q4, the conversations were about operations: monitoring, evaluation cadence, cost attribution, team structure.

That progression felt right, like an industry growing up.

What Held Up

A few things I believed in January that held up through December:

Narrow scope wins. Every successful deployment I saw this year started with a tightly scoped use case. “Classify these support tickets into five categories” beats “build an AI assistant for customer service” every time. The narrow scope forces clear success criteria, which forces real evaluation, which forces real accountability.

Evaluation is the product. Teams that built evaluation harnesses early shipped faster and with more confidence. Teams that skipped evaluation shipped demos that never became products. I’ll keep saying it.

Retrieval quality determines answer quality. I built multiple RAG systems this year. In every single case, the initial complaint was “the model hallucinates” and the actual fix was improving retrieval. Better chunking. Hybrid search. Reranking. The model was fine. The evidence was bad.

Cost control is a day-one concern. I watched one team’s AI bill go from manageable to alarming in six weeks because nobody was tracking per-feature attribution. By the time they noticed, the organizational habit of ignoring cost was already baked in. Much harder to fix after the fact.

What Surprised Me

Claude 3.5 Sonnet changed my default recommendation. For most of the year I was recommending different models for different tasks with complex routing logic. By late 2024, Claude 3.5 Sonnet had become my default “just start here” answer for a wide range of production tasks. The quality-to-cost ratio was hard to beat. I still recommend routing for cost optimization, but the bar for when routing matters got higher.

Open models got good enough to matter. Llama 3 and Mistral variants crossed a threshold this year. Not for everything – frontier tasks still need frontier models. But for classification, extraction, and structured output, open models running on modest hardware became a real option. I helped two teams set up self-hosted deployments where the economics made sense.

Teams overbuilt. This one surprised me less than it should have. Multiple teams built multi-agent orchestration systems for tasks that should have been a single prompt with a good system message. The complexity wasn’t justified by the task. It was justified by enthusiasm. I spent a fair amount of Q3 and Q4 helping teams simplify.

What Stayed Hard

Evaluation is hard. I keep preaching it, and I keep watching teams struggle with it. Building a good eval set requires domain expertise, clear criteria, and the willingness to maintain it over time. Most teams get the first version right, then let it rot. Evaluation sets need the same care as test suites.

Multi-step workflows remained fragile. Agents that need to plan, execute, observe, and adapt are architecturally interesting and operationally painful. The tooling improved this year but the fundamental challenge – maintaining coherence over many steps – is still unsolved. The teams that succeeded constrained the number of steps aggressively.

Hiring remained weird. The “AI engineer” role is still not well-defined. Every company means something different by it. The best hires I saw were strong software engineers who learned the AI-specific parts on the job, not ML researchers who struggled with production engineering.

The Personal Angle

I’m still contributing to Go. Still building tools. The work is rewarding but I miss building full-time sometimes. There’s a different satisfaction in shipping code versus reviewing architecture diagrams.

The problem space – helping teams build faster and ship reliably – feels increasingly important as AI lowers the barrier to starting projects but does nothing to lower the barrier to finishing them. Starting is easy. Shipping is hard. That gap is where I keep ending up.

The Takeaway

2024 was the year AI got boring. I mean that as the highest compliment. Boring means production-ready. Boring means maintainable. Boring means teams can build on top of it without wondering if the foundation will shift next month.

The demo phase is over. The real work is underway. And the teams that win from here are the ones that treat AI for what it is: another production system that needs discipline, measurement, and ownership.

Same as everything else.

Your AI Infrastructure Is Not Special

Mon, 09 Dec 2024 00:00:00 +0000

I’m tired of seeing AI infrastructure treated as if it needs a whole new discipline.

It doesn’t. It’s the same infrastructure engineering we’ve been doing for decades, applied to a workload that happens to involve model inference. The latency problems are the same. The cost problems are the same. The reliability problems are the same. And the solutions are the same.

And yet every week I review a team’s architecture and find they’ve reinvented service meshes, badly, because they assumed AI needed something different.

The Demo-to-Production Gap Is Infrastructure

Here’s what happens: a team builds a demo. It works great at one request per minute. Then real traffic arrives and everything falls apart. Latency spikes. Costs explode. The system goes down when the provider rate-limits them.

None of these are AI problems. They’re infrastructure problems that we solved years ago in every other context. The teams that scale AI successfully are the ones that apply those solutions without reinventing them.

Put a Gateway in Front. Please.

I’m genuinely baffled by how many production AI systems I see where every service calls the model provider directly. No centralized routing. No rate limiting. No budget enforcement. No observability.

This is like building a web application in 2024 without a load balancer. Nobody would do that. But somehow AI gets a pass.

A gateway – call it whatever you want, broker, proxy, control plane – does the boring work:

Routes requests to the right model based on task type
Enforces rate limits and budgets per user, per feature, per environment
Caches deterministic responses
Provides a single point for observability and tracing
Handles provider failover when one API goes down

You can build a basic version in a day: a YAML config and a reverse proxy. It doesn’t need to be fancy. It needs to exist.

Separate Your Workloads

Interactive requests and batch processing shouldn’t share the same execution path. I keep saying this, and teams keep ignoring it until interactive latency tanks because a batch job saturated the rate limit.

Interactive work gets tight latency budgets and priority access. Batch work gets queued and retried patiently. The split is trivial to implement and painful to retrofit after the fact.

Cache. Everything. Deterministic.

If you’re sending the same prompt with the same inputs to the same model and not caching the response, you’re burning money. Literally.

Exact-match caching for deterministic requests is table stakes. Similarity-based caching for near-duplicate requests is a bonus. Even a simple TTL-based cache with invalidation on prompt updates can cut costs significantly.

One team was spending $40k/month on model inference. After adding exact-match caching for their classification pipeline, it dropped to $15k. Same outputs. Same quality. Less waste.

Cost Controls Aren’t Optional

“We’ll optimize costs later” is the AI equivalent of “we’ll add tests later.” You won’t. And when the bill arrives, it becomes an emergency.

Budget enforcement belongs in the gateway. Hard caps with clear error messages. Soft limits that degrade to cheaper models or slower paths. Per-user and per-feature attribution so you know where the money goes.

I’ve seen teams discover that a single feature was responsible for 70% of their AI spend because nobody was tracking attribution. The feature wasn’t even high-value. It was just chatty.

Reliability Isn’t Heroics

Retry with backoff. Circuit breakers. Graceful degradation. Provider failover.

These aren’t advanced patterns. They’re baseline production engineering. If your AI system doesn’t have them, it isn’t production-ready. It’s a demo with a billing account.

Graceful degradation is a product decision, not an ops feature. If the full response is unavailable, a simpler response or a cached response or even a “try again in a moment” is better than an error page. Design for this upfront. Don’t bolt it on during an incident.

The Unsexy Truth

AI infrastructure at scale is boring. That’s the point. Boring means predictable. Predictable means reliable. Reliable means you can actually build products on top of it.

The gateway, the cache, budget enforcement, workload separation, circuit breakers: none of it is novel. All of it is necessary. The teams that treat AI infrastructure like regular infrastructure, applying patterns that already exist, are the ones that scale without drama.

Stop reinventing. Start reusing. Your SRE team already knows how to do this. Let them.

Your AI Team Problem Is Not Technical

Mon, 02 Dec 2024 00:00:00 +0000

I’ve been in or around AI teams since 2018 – from a startup accelerator to enterprise teams, with roots going back to my first startup. One lesson keeps repeating: teams rarely fail at AI because they lack talent. They fail because nobody owns the outcome.

That sounds harsh. It’s also true.

The Ownership Gap

Here’s how it usually goes. A company decides to “do AI.” They hire an ML engineer, maybe two. Those engineers build a demo. Leadership is impressed. Then someone asks, “Who owns this in production?” and the room goes quiet.

The ML engineer built the model. The product team didn’t spec the success criteria. The data engineer wasn’t involved. The designer has no idea what happens when the model gets it wrong. And nobody defined what “getting it wrong” even means.

I’ve seen this exact pattern at large enterprises and small startups. The blocker isn’t technology. It’s structure.

Three Models That Work

Every successful team I’ve seen fits one of three structures, and the core tradeoff has not changed.

Embedded. AI engineers sit inside product teams. They ship features directly, own the evaluation, and live with the consequences of their choices. This works when AI is a feature, not a platform. The downside: practices drift across teams because there’s no central coordination.

Platform. A central team builds shared infrastructure – model serving, evaluation harnesses, prompt management, observability. Product teams consume that platform. This works when multiple products need AI. The downside: the platform team gets pulled in every direction and loses focus on any single product.

Hybrid. A platform team builds the core. Embedded engineers in product teams customize it. This is the most common pattern at companies that have scaled this successfully. It also requires the most coordination. Without clear ownership boundaries, it degenerates into blame-passing between platform and product.

Pick the model that matches your current scale, not the one you hope to need in two years.

Who to Hire

The best AI engineers I’ve worked with share a few traits that don’t show up on resumes.

They can explain how their system fails. Not just how it works, but how it breaks and what happens when it does. This is the best interview signal I’ve found.

They think in systems, not models. The model is one component. The retrieval layer, validation step, fallback path, and monitoring are just as important. A candidate who talks only about model architecture is missing the point.

They build evaluations before they build features. If you can’t measure whether the thing works, you’re guessing. The best engineers treat eval sets like test suites. They version them, maintain them, and refuse to ship without them.

They’ve shipped something to real users. Not a notebook. Not a demo. Something people used, complained about, and forced them to iterate on. Production experience changes how you think about every design choice.

The Operating Loop

Fancy process frameworks aren’t necessary. A tight loop between four phases covers it:

Discovery. Define success in measurable terms. What does “good” look like? What are the edge cases? Is the data available? A clear definition of success is worth more than a long list of ideas.

Prototyping. Run small experiments with real examples. Document the failures, not just the successes. Bring domain experts in early – they know the edge cases you’ll miss.

Development. Build the evaluation suite first. Version prompts and retrieval logic as code. Test against known failure cases whenever models or data change.

Production. Roll out gradually. Monitor quality and cost in the same dashboard. Treat regressions as product issues with named owners, not vague “the model changed” explanations.

What Actually Goes Wrong

The problems I see most often aren’t technical:

Nobody owns evaluation for a specific feature. There’s a shared checklist but no named person.
Success criteria are undefined, so feedback becomes opinion. “This doesn’t feel right” isn’t actionable.
The pipeline is too complex for the use case. Someone built a multi-agent system for what should have been a single prompt.
Knowledge stays in people’s heads. When someone leaves, the team loses context that took months to build.

Fix these four problems and you’re ahead of most AI teams. No new tools required. No new hires. Just clarity about who owns what and how you know it’s working.

That’s the whole secret: clear ownership, reliable evaluation, and the discipline to maintain both. Everything else is detail.

Picking an AI Model for Production (Late 2024)

Mon, 25 Nov 2024 00:00:00 +0000

Quick take

Benchmarks mislead. The right model depends on your tasks, latency requirements, cost tolerance, and how much ops overhead you can absorb. Run a bake-off on your actual workload. Route between models. Stop looking for a universal winner.

I get asked “which model should we use?” at least once a week. The answer is always the same: it depends. That answer always disappoints, so let me make it useful.

The late-2024 model landscape is competitive enough that the gap between top-tier providers on general tasks is small. The differences that matter most are operational, not intellectual. Here’s how I think about model selection for production systems.

The Landscape at a Glance

Two tracks dominate. Hosted APIs from Anthropic, OpenAI, and Google iterate fast and are easiest to ship with. Open-weight models from Meta (Llama), Mistral, and others give you more control but come with infrastructure baggage.

Track	Strengths	Weaknesses	Best For
Hosted API (frontier)	Latest capability, zero ops, fast iteration	Cost at scale, vendor dependency, data leaves your infra	Most teams starting out, complex reasoning tasks
Hosted API (mid-tier)	Good cost/quality ratio, same deployment simplicity	Weaker on complex tasks, less controllable	High-volume simple tasks, routing targets
Open-weight (large)	Data control, no per-token cost at scale, fine-tunable	GPU costs, ops burden, slower model updates	High volume, data residency, offline
Open-weight (small)	Fast inference, cheap, embeddable	Limited capability, more prompt engineering	Classification, extraction, edge deployment

What to Actually Compare

Forget leaderboards. They’re narrow, gameable, and rarely match your workload. Here are the dimensions that matter in production:

Dimension	What to Measure	Why It Matters
Task fit	Success rate on your actual prompts	A model that aces coding benchmarks might fail your extraction tasks
Latency	p50 and p95 with realistic prompt sizes	Average latency hides tail problems that users feel
Cost per success	Total spend per completed task, including retries	Cheap per-token doesn’t mean cheap per-task
Structured output	JSON/schema compliance rate	Critical if downstream code parses the response
Tool use	Accuracy of function calling and parameter extraction	Bad tool calls are worse than no tool calls
Safety/controllability	Refusal rates, policy adherence, output consistency	Too permissive or too restrictive both cause problems
Context handling	Quality at 8k, 32k, 128k+ tokens	Long context support isn’t the same as long context quality

I’ve run these comparisons for teams I’ve worked with. The results consistently surprise people. The “best” model on paper is rarely the best model for their specific tasks.

How to Run a Bake-Off

Don’t spend a month on this. A focused bake-off should take a few days:

Pick 30-50 representative inputs from your actual workload. Cover the common cases and the hard cases.
Define success criteria for each one. Not vibes, specific, checkable criteria.
Run each model against the same inputs with the same system prompt.
Score each model by task success rate, latency, and cost.
Check structured output compliance if you depend on it.

The results won’t be close on every dimension. One model will be cheaper. Another will be more accurate on complex tasks. A third will have better latency. That’s the point – you’re mapping the tradeoff space, not finding a winner.

The Router Pattern

Once you have bake-off data, the next step is obvious: route different task types to different models.

Task Type	Route To	Rationale
Simple classification / extraction	Small or mid-tier model	High volume, accuracy is sufficient, saves 60-80%
Complex reasoning / generation	Frontier model	Quality matters, volume is lower
Structured data extraction	Model with best schema compliance	Parsing reliability is non-negotiable
Latency-critical	Fastest model that meets quality bar	User experience trumps marginal quality
Fallback	Second provider	Availability protection

A routing layer adds complexity, but not much. An if statement or a config-driven switch is enough to start. You don’t need an ML-based router. You need a decision tree grounded in your bake-off results.

One team I worked with went from a single frontier model to a two-model router and cut monthly spend by 60% with no measurable quality regression. The hard part was running the bake-off. The router itself was 50 lines of Go.

Open Models: When and When Not

Self-hosting is a real option now. Llama 3 and Mistral variants are genuinely capable. But the question isn’t “can it do the task?” It’s “do we want to own the infrastructure?”

Self-host when:

Data must not leave your network (regulatory, contractual)
Volume is high and predictable enough that fixed GPU costs beat per-token pricing
You need fine-tuning that hosted APIs don’t support
You need offline or air-gapped operation

Don’t self-host when:

Volume is bursty or growing unpredictably
You need frontier capability that open models haven’t matched yet
Your team doesn’t have GPU ops experience
You want to iterate model versions quickly

I’ve talked a few teams out of self-hosting after running the numbers. The GPU costs plus ops burden plus slower iteration cycle made the total cost higher than the hosted API they were trying to replace. Self-hosting is a capability decision as much as a cost decision.

Contracts and Pricing: Check the Fine Print

Pricing shifts fast. What I can tell you as of late 2024:

The spread between frontier and mid-tier models is 10-30x on a per-token basis
Total cost is dominated by usage patterns (retries, context size, output length), not headline price
Enterprise agreements often include committed-use discounts that change the math significantly
Rate limits and quotas vary by tier and can cap throughput during peak usage

Verify current rates directly with providers before locking in. A pricing comparison that’s two months old is already stale.

The Only Advice That Ages Well

There’s no universal winner. Run a focused bake-off on your tasks, build a simple router, monitor everything, and re-evaluate quarterly. The model landscape moves fast. Your selection process should be fast too.

Treat vendor claims and public benchmarks as starting points, not decisions. The evaluation set built from your actual prompts, reviewed by your team, is the only benchmark that matters.

AI Safety Is Just Production Engineering

Mon, 11 Nov 2024 00:00:00 +0000

Quick take

Treat AI safety like you treat security: assume breach, layer your defenses, and make every boundary observable. A single filter will fail. A layered system with clear escalation paths won’t.

My time working with national cyber-defense taught me one lesson that transfers directly to AI safety: if your security model depends on a single control working perfectly, you don’t have a security model. You have hope.

Most AI safety implementations I review look like this: one content filter, one system prompt instruction, maybe a regex check on output. Then comes surprise when someone finds a bypass in production.

AI safety isn’t a research frontier. It’s production engineering. The same defense-in-depth thinking that protects networks also protects AI systems. The mental model is the same.

Assume Your Controls Will Be Tested

The moment you deploy an AI system to users, it becomes a target. Not always from malicious actors – though those exist – but from curious users, edge cases you never imagined, and the simple reality that models do unexpected things with novel inputs.

In cyber defense, you plan for this. You assume the perimeter will be breached and design the interior to limit damage. AI safety is the same. Assume:

Someone will try prompt injection. They’ll try hard.
The model will occasionally produce harmful or inappropriate output. No filter catches everything.
Data will leak through outputs or logs if you don’t explicitly prevent it.
Users will find ways to use capabilities you didn’t intend to expose.

This isn’t pessimism. It’s operational realism. Plan for it.

Input: Treat It as Untrusted

Every input to your AI system is untrusted. Full stop. This isn’t different from web security – you wouldn’t pass raw user input to a SQL query. Don’t pass raw user input to a model without validation.

Practical input controls:

Separate user content from system instructions at the architecture level, not just the prompt level
Length and format limits for every input field
Explicit allowlists for supported content types and languages
PII detection with consent-aware handling
Pattern checks for known injection techniques

Keep these simple. Complex input policies are hard to test, hard to maintain, and easy to bypass. A few robust checks beat a hundred brittle ones.

Output: The Last Boundary

Output is the final safety layer before the user sees a response. In my national cyber-defense work, we called this the “last line of defense” principle: design it assuming everything upstream has already failed.

Output controls:

Content filtering to block or redact unsafe responses
Leakage checks for system prompts, internal data, or PII
Schema validation when the response must follow a defined format
Safe fallback behavior when a response fails any check

Fallback behavior matters more than people think. A system that returns “I can’t help with that” when unsure is vastly safer than one that guesses and serves a plausible-looking wrong answer. Refusal is a feature.

System-Level Controls

Safety doesn’t live in the model layer alone. It belongs in the surrounding system. This is where the cyber defense analogy is strongest: you don’t just firewall the endpoint, you design the entire network for containment.

Rate limits and quotas reduce abuse surface and cost spikes. If someone is hammering your system with injection attempts, rate limiting slows them down before any content filter needs to fire.

Scoped tool access with clear permissions limits blast radius. If your agent can call APIs, those APIs should have the minimum permissions required. Not admin. Not read-write when read-only suffices.

Sandboxed execution for anything that touches external systems. If your agent generates code or makes API calls, run those in a sandbox. No exceptions.

Configurable policy modes so you can tighten safety quickly during an incident. A kill switch isn’t elegant but it’s necessary.

Monitoring: Safety Is Operational

In cyber defense, detection matters as much as prevention. You need to know when your controls are failing. The same applies to AI safety.

Treat safety incidents like reliability incidents:

Define thresholds for unsafe output rates, injection attempt rates, and escalation volumes
Set up clear escalation paths – who gets paged, what gets rolled back, what needs a review
Feed production signals back into model prompts, filters, and product design
Run regular reviews. Not quarterly. Weekly at minimum during early deployment.

The teams that catch problems early treat safety as an operational concern. The teams that catch problems late treat it as a PR crisis.

Defense in Depth

A single safeguard will fail. I can’t say this enough. Every content filter has bypasses. Every system prompt can be manipulated under the right conditions. Every validation check has edge cases.

The defense-in-depth approach layers controls so that any single failure doesn’t become an incident:

Input validation catches obvious abuse
System prompt discipline limits the model’s scope
Output filtering catches problematic responses
System controls (rate limits, permissions, sandboxing) limit blast radius
Monitoring detects when any layer is failing

Each layer is simple. The combination is robust. This isn’t a new idea – it’s how every mature security program works. AI safety should be no different.

Where to Start

If you’re deploying AI to production and haven’t built safety controls yet, start small:

Define the allowed inputs and outputs for your first use case. Write them down.
Implement input validation and output filtering with clear failure behavior
Add rate limiting and logging
Set up a simple review queue for flagged interactions
Iterate based on what you see in production

Don’t try to build a perfect safety system before shipping. Build a functional one, instrument it, and improve it continuously. Teams that wait for perfection ship nothing. Teams that ship with layered, observable safety controls learn fast and get better.

Safe systems and reliable systems are built the same way. Clear boundaries, observable behavior, steady iteration. The discipline transfers.

Agent Patterns That Survive Production

Mon, 28 Oct 2024 00:00:00 +0000

Quick take

Agents need structure, not longer prompts. Plan-execute-replan, specialist orchestration, compact memory management, and explicit recovery paths are the patterns that hold up. This post walks through each one with Go implementations.

I’ve been building and reviewing agent systems most of this year. The pattern is always the same: someone builds a single-prompt agent, it works beautifully on the happy path, and then it meets a real task and falls apart.

The fix is never “make the prompt better.” It’s always “add structure around the model.” Here are the patterns that actually survive production, with Go code you can adapt.

When Simple Agents Break

Simple agents – one prompt, one model call, maybe a tool – fail predictably once tasks get real:

More steps than fit in one context window
Tool calls that return errors or ambiguous results
Multiple valid paths with unknown payoff
Dependencies between sub-tasks that require ordering

If your task has any of these properties, you need patterns. Not hope.

Plan, Execute, Replan

The most useful pattern is also the simplest. Break the task into a plan, execute steps sequentially, and replan when reality diverges from the plan.

The plan is a draft, not a contract.

// Plan represents a sequence of steps the agent intends to execute.
// Steps can be updated mid-execution when results diverge.
type Plan struct {
	Goal      string
	Steps     []Step
	Completed []StepResult
}

type Step struct {
	ID          string
	Description string
	ToolName    string
	Input       map[string]any
}

type StepResult struct {
	StepID  string
	Output  any
	Err     error
	Blocked bool
}

// Execute runs through the plan, replanning when a step is blocked
// or produces unexpected results.
func (a *Agent) Execute(ctx context.Context, p *Plan) (*Plan, error) {
	for len(p.Steps) > 0 {
		step := p.Steps[0]
		p.Steps = p.Steps[1:]

		result := a.runStep(ctx, step)
		p.Completed = append(p.Completed, result)

		if result.Blocked || result.Err != nil {
			revised, err := a.replan(ctx, p)
			if err != nil {
				return p, fmt.Errorf("replan failed: %w", err)
			}
			p = revised
		}
	}
	return p, nil
}

// replan asks the model to revise remaining steps given what has
// happened so far. The completed results provide context.
func (a *Agent) replan(ctx context.Context, p *Plan) (*Plan, error) {
	prompt := fmt.Sprintf(
		"Goal: %s\nCompleted: %s\nRevise the remaining steps.",
		p.Goal, formatResults(p.Completed),
	)
	resp, err := a.llm.Complete(ctx, prompt)
	if err != nil {
		return p, err
	}
	p.Steps = parseSteps(resp)
	return p, nil
}

The key design choice is to replan on failure, not on every step. Replanning is expensive – it costs a model call and risks plan instability. Only trigger it when the current plan is provably broken.

I’ve seen teams replan after every step “for safety.” The result is an agent that never commits to anything and burns tokens oscillating between plans. Pick a plan, execute, and adjust on failure, not anxiety.

Orchestrator-Specialist Pattern

When tasks naturally split into parallel or specialized work, a single agent doing everything is the wrong abstraction. Use an orchestrator that breaks the task down and dispatches to specialists.

// Orchestrator decomposes a task and dispatches sub-tasks to
// specialist agents. It synthesizes their results.
type Orchestrator struct {
	planner     LLM
	specialists map[string]*Specialist
}

type Specialist struct {
	Name    string
	Agent   *Agent
	Domain  string // e.g. "research", "code-generation", "validation"
}

type SubTask struct {
	ID          string
	Description string
	Specialist  string
	Input       map[string]any
	DependsOn   []string
}

// Run decomposes the task, executes sub-tasks respecting dependencies,
// and synthesizes results.
func (o *Orchestrator) Run(ctx context.Context, task string) (string, error) {
	subtasks, err := o.decompose(ctx, task)
	if err != nil {
		return "", fmt.Errorf("decompose: %w", err)
	}

	results := make(map[string]string)

	for _, batch := range topologicalBatches(subtasks) {
		g, gCtx := errgroup.WithContext(ctx)

		for _, st := range batch {
			st := st
			spec, ok := o.specialists[st.Specialist]
			if !ok {
				return "", fmt.Errorf("unknown specialist: %s", st.Specialist)
			}

			g.Go(func() error {
				// Inject dependency results into the sub-task input.
				for _, dep := range st.DependsOn {
					st.Input[dep] = results[dep]
				}
				res, err := spec.Agent.RunTask(gCtx, st.Description, st.Input)
				if err != nil {
					return fmt.Errorf("specialist %s: %w", spec.Name, err)
				}
				results[st.ID] = res
				return nil
			})
		}

		if err := g.Wait(); err != nil {
			return "", err
		}
	}

	return o.synthesize(ctx, task, results)
}

The topological batching is important. Sub-tasks without dependencies run in parallel. Sub-tasks that depend on earlier results wait. This gives you concurrency where it’s safe and ordering where it’s required.

Go’s errgroup is perfect for this. I’ve tried this pattern in Python with asyncio, and the error handling is significantly worse. Go’s explicit error returns make failure paths clear.

Structured Working Memory

Context windows are finite and expensive. You can’t dump every intermediate result into the prompt and hope for the best. Working memory needs structure.

// Memory manages the agent's working context with size limits
// and periodic compression.
type Memory struct {
	mu       sync.Mutex
	facts    []Fact
	maxFacts int
	llm      LLM
}

type Fact struct {
	Key       string
	Value     string
	Source    string // which step produced this
	Priority  int    // higher = keep longer
	CreatedAt time.Time
}

// Add inserts a fact, compressing if the memory is full.
func (m *Memory) Add(ctx context.Context, f Fact) error {
	m.mu.Lock()
	defer m.mu.Unlock()

	m.facts = append(m.facts, f)

	if len(m.facts) > m.maxFacts {
		return m.compress(ctx)
	}
	return nil
}

// compress asks the model to summarize low-priority facts into
// fewer entries, keeping high-priority facts intact.
func (m *Memory) compress(ctx context.Context) error {
	sort.Slice(m.facts, func(i, j int) bool {
		return m.facts[i].Priority > m.facts[j].Priority
	})

	// Keep top half as-is, compress bottom half.
	keep := m.facts[:m.maxFacts/2]
	toCompress := m.facts[m.maxFacts/2:]

	summary, err := m.llm.Complete(ctx, fmt.Sprintf(
		"Summarize these facts into 2-3 key points:\n%s",
		formatFacts(toCompress),
	))
	if err != nil {
		// On failure, just drop the lowest priority facts.
		m.facts = keep
		return nil
	}

	m.facts = append(keep, Fact{
		Key:       "compressed_context",
		Value:     summary,
		Priority:  1,
		CreatedAt: time.Now(),
	})
	return nil
}

// ForPrompt renders the current memory as a string for inclusion
// in a prompt.
func (m *Memory) ForPrompt() string {
	m.mu.Lock()
	defer m.mu.Unlock()
	return formatFacts(m.facts)
}

The compression strategy matters. High-priority facts (decisions, constraints, key results) stay intact. Low-priority facts (intermediate outputs, exploration notes) get summarized. If compression fails, drop the least important items rather than crashing.

I keep raw tool outputs entirely outside the prompt. They go into a side store the agent can query if needed. Only extracted facts enter working memory.

Explicit Recovery

This is the pattern most teams skip, and it’s the one that matters most in production. Agents will encounter tool failures, stale plans, missing inputs, and model refusals. Without explicit recovery, those become silent failures or infinite loops.

// RecoveryStrategy defines how the agent handles a specific failure type.
type RecoveryStrategy struct {
	Name       string
	MaxRetries int
	Backoff    time.Duration
	Handler    func(ctx context.Context, err error) (Action, error)
}

type Action int

const (
	Retry        Action = iota
	Decompose           // break the failed step into smaller steps
	Skip                // mark step as skipped, continue
	Escalate            // pause for human input
	Abort               // stop the agent
)

// Recover selects and applies the appropriate recovery strategy.
func (a *Agent) Recover(ctx context.Context, step Step, err error) (Action, error) {
	strategy := a.selectStrategy(err)

	for attempt := 0; attempt < strategy.MaxRetries; attempt++ {
		action, retryErr := strategy.Handler(ctx, err)
		if retryErr == nil {
			return action, nil
		}
		time.Sleep(strategy.Backoff * time.Duration(attempt+1))
	}

	// All retries exhausted. Escalate.
	return Escalate, fmt.Errorf(
		"recovery exhausted for step %s after %d attempts: %w",
		step.ID, strategy.MaxRetries, err,
	)
}

The key insight: recovery actions are an enum, not free-form decisions. The agent picks from a fixed set of responses. Retry, decompose, skip, escalate, or abort. No improvisation. This keeps the failure paths testable and predictable.

The escalation path – pausing for human input – isn’t a failure. It’s a feature. An agent that knows when to ask for help is more reliable than one that guesses and gets it wrong.

Putting It Together

A production agent combines these patterns in layers:

Plan-execute-replan as the outer loop
Orchestrator-specialist for sub-task parallelism
Structured memory to manage context within budget
Explicit recovery at every step boundary

Each layer is independently testable. You can unit test recovery strategies, benchmark memory compression, and integration test the orchestrator without running the full agent.

Start with plan-execute-replan and explicit recovery. Those two patterns alone will take you from “works on demos” to “works on real tasks.” Add orchestration and structured memory when your tasks demand it.

The agents that survive production aren’t clever. They’re disciplined.

AI Cost Benchmarking: What Your Bill Actually Tells You

Mon, 14 Oct 2024 00:00:00 +0000

Quick take

Your AI cost isn’t what the pricing page says. It’s tokens times retries times fallbacks times human review – all shaped by your specific prompts and workload. Benchmark against your actual tasks or you’re optimizing fiction.

Every few weeks someone sends me a spreadsheet comparing AI provider pricing and asks “which one should we use?” The spreadsheet always compares cost per million tokens. It’s always useless.

After working on AI cost optimization since early 2024, I can tell you the gap between headline pricing and actual production cost is consistently 3-10x. Providers with the cheapest tokens sometimes end up being the most expensive per completed task. Here’s why and how to benchmark properly.

The Real Cost Stack

Token price is one line item. Production cost includes everything the system does to deliver a reliable result.

Cost Layer	What It Includes	Typical Share
Model inference	Input + output tokens	30-50%
Retries & fallbacks	Failed attempts, quality retries, provider failover	10-25%
Retrieval & preprocessing	Embedding, search, context assembly	10-20%
Human review	Escalation, QA sampling, edge case handling	10-30%
Infrastructure	Caching, logging, orchestration	5-10%

Teams that track only model inference are missing half their spend. I learned this the hard way on a document processing pipeline: the retry rate on complex documents was 40%, effectively doubling model cost. The pricing spreadsheet didn’t mention that.

Benchmark Your Tasks, Not Generic Prompts

A useful benchmark mirrors your actual workload. Generic “summarize this article” tests tell you nothing about how a model handles your prompts, error rates, and latency requirements.

Build a benchmark set that covers:

Task Category	Why It Matters	What to Measure
High-volume simple tasks	Dominates token count	Cost per success, latency p50
Complex multi-step tasks	Dominates per-task spend	Total cost including retries, success rate
Edge cases / policy triggers	Drives fallback and review cost	Escalation rate, human time per case
Retrieval-heavy tasks	Preprocessing is a big chunk of cost	End-to-end cost, retrieval overhead ratio

Keep this set stable. If benchmark inputs change every week, you can’t tell whether cost shifts came from system changes or test changes.

Compare Approaches, Not Providers

Provider names and model versions change quarterly. A benchmark built around “GPT-4 vs Claude 3.5” has a shelf life of weeks. Instead, compare the architectural choices you control:

Approach	Cost Profile	When It Wins
Large model, single pass	High per-call, low retry	Simple tasks, tight latency budgets
Small model + reranker	Lower per-call, extra step	High volume, tolerance for pipeline complexity
Router: small for easy, large for hard	Variable, needs routing logic	Mixed workloads with clear difficulty signals
Self-hosted open model	Fixed infra cost, zero per-token	High volume, data residency, offline needs

The router pattern is where I’ve seen the biggest wins. One team cut monthly spend by 60% by routing straightforward classification tasks to a small model and reserving the large model for generation. Classification accuracy from the small model was identical. They were paying frontier prices for commodity work.

The Drivers That Actually Move Your Bill

Forget micro-optimizing prompts. These four factors determine 80% of your cost trajectory:

Response length drift. Prompts evolve over time. Engineers add instructions, examples, formatting requirements. Output gets longer. Nobody notices until the bill does. Track average output tokens per task type weekly.

Retry rates. Every retry is a full cost event. If your validation rejects 20% of responses and retries, your effective cost is 1.25x the base. If it retries twice on failure, it’s worse. Measure retry rate by task type and fix the root cause.

Retrieval bloat. Context windows keep growing, so teams stuff more chunks in. More context means more input tokens. But past a point, more context doesn’t improve answers – it just costs more. Measure answer quality versus context size and find the plateau.

Routing waste. Sending everything to the most capable model is the default because it’s easy. It’s also the most expensive default. Any task where a smaller model achieves the same success rate is money burned on the large model.

Self-Hosting: When the Math Works

Self-hosting isn’t a cost optimization for most teams. It works for teams with specific constraints:

Predictable, high-volume workloads where the per-token savings exceed infra costs
Strict data residency or air-gapped environments
Fine-tuned models that don’t exist as hosted APIs

For bursty workloads or teams that need frequent model upgrades, operational overhead eats the savings. I’ve talked a few teams out of self-hosting after we modeled actual GPU costs, ops burden, and iteration-speed penalties. The math didn’t work for them. It might for you. Run the numbers on your workload, not someone else’s blog post.

Set Up Monitoring Before You Need It

A benchmark is a snapshot. Production spend is a moving target. Set up cost monitoring from day one:

Track cost per successful task, not cost per API call
Break it down by feature and user tier
Alert on spend spikes and retry rate increases
Review monthly with someone who owns the budget

The teams that catch cost problems early treat them like performance regressions. The teams that catch them late treat them like budget emergencies.

Boring systems, predictable bills.

RAG Retrieval That Actually Works

Mon, 30 Sep 2024 00:00:00 +0000

Quick take

Stop blaming the LLM. If your RAG system gives bad answers, the retrieval is almost certainly the bottleneck. Hybrid search, proper chunking, query expansion, and reranking – measured separately from generation – will do more for answer quality than any prompt engineering trick.

I’ve built three different RAG systems this year, and each time the first complaint was “the model hallucinates.” Each time, the real problem was retrieval feeding garbage into context. The model was doing its best with bad evidence.

Basic RAG – embed the query, grab the top-k chunks, stuff them into the prompt – is a fragile baseline. It works in demos. It breaks on real data. Here’s why, and what to do about it.

Why Basic Retrieval Fails

The failure modes are predictable. I see the same ones everywhere:

Vocabulary mismatch. The user asks about “cancellation policy” but the source document says “termination terms.” Pure semantic search sometimes bridges this gap. Sometimes it doesn’t.

Context fragmentation. A paragraph that answers the question gets split across two chunks. Neither chunk scores high enough on its own. The answer exists in your corpus but the retrieval never finds it.

Wrong granularity. Your chunks are 512 tokens. The user asks a question that needs a 50-token fact buried in the middle. The surrounding noise tanks the relevance score.

Temporal confusion. The 2022 policy and the 2024 policy both match the query. The retrieval returns whichever embeds closer, not whichever is current.

Multi-hop requirements. The answer requires combining facts from two different documents. Single-query retrieval will find one, maybe. Not both.

Hybrid Search: Combine Signals

Pure vector search misses exact terms. Pure lexical search misses paraphrases. Combine them.

The implementation is straightforward. Run both searches, normalize the scores, and fuse the rankings. Reciprocal Rank Fusion (RRF) is the simplest approach that works:

package search

// RRFMerge combines results from multiple search backends using
// Reciprocal Rank Fusion. k controls how much rank position
// matters -- 60 is a common default.
func RRFMerge(results [][]Result, k float64) []Result {
	scores := make(map[string]float64)
	docs := make(map[string]Result)

	for _, ranked := range results {
		for rank, r := range ranked {
			scores[r.ID] += 1.0 / (k + float64(rank+1))
			docs[r.ID] = r
		}
	}

	merged := make([]Result, 0, len(scores))
	for id, score := range scores {
		doc := docs[id]
		doc.Score = score
		merged = append(merged, doc)
	}

	sort.Slice(merged, func(i, j int) bool {
		return merged[i].Score > merged[j].Score
	})
	return merged
}

From what I’ve seen, hybrid search with RRF improves recall by 15-30% over pure vector search on real corpora. Not synthetic benchmarks – real production data with messy, inconsistent documents.

Chunking Isn’t a Formatting Detail

Most teams treat chunking as a config parameter. Set chunk_size=512, done. This is wrong.

Good chunking preserves the structure of the source material. If your documents have headings, keep them. If a section is self-contained, chunk it as a unit. If a chunk can’t be understood without its parent heading, prepend a breadcrumb.

// Chunk represents a document fragment with enough context
// to be understood when retrieved independently.
type Chunk struct {
	ID         string
	Content    string
	Breadcrumb string // e.g. "Policy Manual > Section 4 > Termination"
	Source     string
	UpdatedAt  time.Time
	Tokens     int
}

// ChunkWithContext prepends the breadcrumb to the content so the
// chunk is self-contained when injected into a prompt.
func (c Chunk) ChunkWithContext() string {
	if c.Breadcrumb == "" {
		return c.Content
	}
	return fmt.Sprintf("[%s]\n\n%s", c.Breadcrumb, c.Content)
}

The breadcrumb costs a few tokens per chunk. It pays for itself by making the model understand what it’s reading. Without it, the model gets a floating paragraph with no context about where it came from.

Query Expansion

Single-shot queries are narrow. The user types one phrasing, but the relevant document uses different words. You miss.

Query expansion generates alternative phrasings and retrieves against all of them. The simplest version that works: ask the LLM to generate 2-3 reformulations, then run all queries and merge results.

A more interesting approach is HyDE (Hypothetical Document Embeddings). Instead of expanding the query, generate a hypothetical answer and embed that. The intuition is that a hypothetical answer is closer in embedding space to the actual answer than the question is.

// ExpandQuery generates alternative phrasings for retrieval.
// Returns the original query plus expansions.
func ExpandQuery(ctx context.Context, llm LLM, query string, n int) ([]string, error) {
	prompt := fmt.Sprintf(
		"Generate %d alternative phrasings of this search query. "+
			"Return only the queries, one per line.\n\nQuery: %s",
		n, query,
	)

	resp, err := llm.Complete(ctx, prompt)
	if err != nil {
		// Fallback: just use the original query.
		return []string{query}, nil
	}

	queries := []string{query}
	for _, line := range strings.Split(resp, "\n") {
		line = strings.TrimSpace(line)
		if line != "" {
			queries = append(queries, line)
		}
	}
	return queries, nil
}

Note the error handling: if expansion fails, fall back to the original query. Don’t let a retrieval enhancement become a retrieval blocker.

Expansion increases recall, but it also brings in noise. That’s fine, because the next step handles it.

Reranking: The Cleanup Step

After gathering candidates from hybrid search across expanded queries, you have a broad set. Most of it is relevant. Some isn’t. A reranker fixes the ordering.

A cross-encoder reranker compares the full query against the full chunk text. It’s slower than embedding similarity but significantly more accurate for the final ranking. Run it on your top 20-50 candidates, not your entire corpus.

// Rerank takes candidate chunks and reorders them by relevance
// using a cross-encoder model. Keep topN results.
func Rerank(ctx context.Context, model Reranker, query string, candidates []Chunk, topN int) ([]Chunk, error) {
	type scored struct {
		chunk Chunk
		score float64
	}

	pairs := make([]QueryDocPair, len(candidates))
	for i, c := range candidates {
		pairs[i] = QueryDocPair{Query: query, Document: c.ChunkWithContext()}
	}

	scores, err := model.Score(ctx, pairs)
	if err != nil {
		return candidates[:topN], nil // degrade gracefully
	}

	ranked := make([]scored, len(candidates))
	for i := range candidates {
		ranked[i] = scored{chunk: candidates[i], score: scores[i]}
	}

	sort.Slice(ranked, func(i, j int) bool {
		return ranked[i].score > ranked[j].score
	})

	result := make([]Chunk, 0, topN)
	for i := 0; i < topN && i < len(ranked); i++ {
		result = append(result, ranked[i].chunk)
	}
	return result, nil
}

Again, graceful degradation. If the reranker fails, return the original order truncated to topN. The system should always return something useful.

Multi-Representation Indexing

One embedding per document is leaving retrieval quality on the table. For important documents, index multiple representations:

The full text (for detail queries)
A concise summary (for broad queries)
Question-like phrasings that the text answers (for direct questions)

This widens the retrieval surface without changing the source documents. It’s extra indexing work, but the recall improvement on multi-hop queries is substantial. I’ve seen it close the gap on questions that basic retrieval missed entirely.

Measure Retrieval Separately

This is the part most teams skip, and it’s the most important.

If you only measure end-to-end answer quality, you can’t tell whether a bad answer came from bad retrieval or bad generation. You need retrieval-specific metrics:

Recall@k: Did the relevant chunk appear in the top k results?
Precision@k: What fraction of the top k results were actually relevant?
MRR (Mean Reciprocal Rank): How high did the first relevant result rank?
nDCG: How well-ordered is the full ranking?

Build a small eval set – 50 to 100 query-document pairs where you know which chunks should be retrieved. Run it after every change to chunking, embedding, or search logic. This is the single highest-leverage investment in a RAG system.

I keep these eval sets in the repo alongside the retrieval code. They’re as important as unit tests. Maybe more important.

The Full Pipeline

Putting it all together, the retrieval pipeline for a production RAG system looks like:

Expand the query (2-3 reformulations)
Run hybrid search (vector + lexical) for each query variant
Merge results with RRF
Rerank the merged candidates
Return top-k chunks with breadcrumbs

Each step is independently testable and independently measurable. When something breaks, you know where to look.

The generation step is almost an afterthought once retrieval is solid. A decent model with the right evidence in context will give you a good answer. A frontier model with the wrong evidence will confidently give you a wrong one.

Fix retrieval first. Everything else follows.

Let AI Write Your First Draft, Not Your Docs

Mon, 16 Sep 2024 00:00:00 +0000

Technical documentation is one of the most undervalued forms of engineering communication. Everyone agrees it matters. Almost nobody prioritizes it. I’ve watched this pattern repeat at every company I’ve worked with, and the failure mode is always the same: docs rot because nobody owns them.

AI won’t fix that problem. But it can remove the excuse.

The Drafting Problem

The hardest part of writing docs is getting started. A blank page plus a busy engineer usually means no documentation. AI is genuinely good at solving this specific problem. Feed it the code structure, recent PRs, and changelogs, and you can get a usable first draft in minutes instead of hours.

That draft will be wrong in places. It will miss context. It will occasionally hallucinate an API parameter that doesn’t exist. That’s fine. A wrong draft you can edit is still faster than a correct document nobody writes.

Where It Falls Apart

The moment you treat AI output as finished documentation, you’ve created something worse than no documentation at all. Wrong docs train people to distrust all docs. I’ve seen this happen: a team auto-generates reference pages, skips review, and six months later nobody believes anything in the docs. They go straight to the source code. The docs become decoration.

The fix is dead simple: AI drafts, humans review, same PR as the code change. No separate workflow. No “we’ll update the docs later.” If the docs don’t land in the same review cycle as the code, they’ll drift. This isn’t a tooling problem. It’s a discipline problem.

The Search Use Case

The other place AI helps is doc search. A retrieval-backed answer system that points users to the right section – with citations – is genuinely useful. The key constraint: it should refuse to answer when it can’t find supporting material. “I don’t know, but here’s the closest section” is a better answer than a confident fabrication.

I’ve been setting this up across a few projects and the pattern holds. Grounded search with citations works. Generative answers without grounding don’t.

What I Would Actually Do

If I were starting a docs workflow today:

Generate first drafts from code context. Edit for accuracy and tone before merging.
Block releases when critical docs are stale. Make it a CI check if you have to.
Keep docs in the repo. Same review, same merge, same ownership.
Add retrieval-backed search with citation links. Refuse when unsupported.

None of this is complicated. The tooling exists. The gap is always ownership and review discipline, not technology. AI makes the drafting faster. It doesn’t make the caring automatic.

AI-Assisted Code Migration: What Actually Works

Mon, 02 Sep 2024 00:00:00 +0000

Last quarter I helped a team migrate a large Go codebase from an internal HTTP framework to standard library patterns: around 200K lines across 40+ services. It was the kind of project where you know the end state, you know the transformation rules, and the work is 90% mechanical and 10% judgment calls that keep you up at night.

We used LLMs to handle the mechanical 90%. It worked. But “it worked” comes with enough caveats that it’s worth being honest about what actually happened.

What the AI was good at

Pattern matching and consistent transformation are the sweet spot. We had about 15 distinct patterns that needed to change: custom route handlers to standard ones, middleware signatures, and error response formats. For each pattern, we wrote a clear transformation rule with before/after examples.

The LLM could take a file, identify which patterns were present, and produce a transformed version. For straightforward cases, it was faster than any human and more consistent. It didn’t get bored on file 200. It didn’t introduce typos. It applied the same transformation rule the same way every time.

We processed about 300 files in two days that would have taken two engineers a couple of weeks. The mechanical savings were real.

What the AI was bad at

Judgment. The 10% of cases that didn’t fit neatly into the transformation rules required understanding intent, not just pattern matching: a handler that looked standard but had a subtle side effect; a middleware chained in an unusual order for a specific reason; error handling intentionally different from the standard pattern because of a business rule documented nowhere except a Slack thread from 2021.

The LLM would happily transform these cases using the standard rules. The output would compile. The tests would pass. And the behavior would be subtly wrong in ways that only surfaced under specific conditions.

This is the dangerous part. AI-generated code that’s almost right is harder to catch than code that’s obviously wrong. It passes automated checks and casual review. Then you find the bug three weeks later when a customer reports something weird.

The workflow that worked

Here’s what we settled on after the first batch of surprises:

Step 1: Scope with samples. Don’t start with “migrate everything.” Pick 10 representative files that cover the range of patterns. Run them through the LLM. Review the output manually. This reveals the transformation rules you need and the edge cases you’ll need to handle differently.

Step 2: One rule per pattern. Write each transformation rule explicitly. Not “update the HTTP handlers,” but “replace framework.Handler(func(ctx *Ctx) error {...}) with http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {...}) and move error handling to…” The more specific the rule, the better the LLM follows it.

Step 3: Small batches, continuous validation. We processed 10-20 files at a time. After each batch: run the build, run the tests, run the linter, and do a quick diff review. If something broke, fix it and update the transformation rule before continuing. Don’t accumulate 200 files of changes and then try to debug a test failure.

Step 4: Flag the hard ones. When the LLM produced a transformation that looked different from the standard pattern, we flagged it for human review instead of forcing it through. About 15% of files got flagged. Those were the ones where the AI saved us no time at all – but catching them early saved us from a lot of pain later.

Treat AI output as draft code

This is the principle that made the whole process work. Every AI-generated change went through the same review process as a human-written change. Same CI checks. Same code review. Same approval workflow.

The temptation is to trust the AI more because it’s consistent and fast. Resist that temptation. The AI is a junior engineer who types incredibly fast and never pushes back on your instructions. That’s useful. It isn’t the same as reliable.

What I’d do differently

I’d build the evaluation harness first. We started the migration, then realized we didn’t have a good way to verify that migrated services behaved identically to the originals. We retrofitted integration tests, but it would have been faster to invest that time upfront.

I’d also version the transformation rules alongside the code. We iterated on the rules as we discovered edge cases, but we didn’t track which version of the rules produced which batch of changes. When we found a bug, tracing it back to the specific rule version that caused it was harder than it should have been.

The honest summary

AI made a two-month migration take three weeks. That’s a genuine win. But it didn’t change the nature of the hard parts. Scoping, validation, edge case handling, and human judgment on ambiguous cases – those are still the bottleneck. The AI accelerated the parts that were already straightforward.

Use AI for migrations. Just don’t pretend it replaces the discipline that makes migrations safe.

How I Actually Test LLM Features

Mon, 19 Aug 2024 00:00:00 +0000

Quick take

Test LLM features in layers: deterministic checks for everything around the model (parsing, validation, prompt rendering), property-based checks for model outputs (format, required fields, safety), and a curated golden set for regression detection. Don’t test exact string matches. Test the properties that matter to users.

The first time I shipped an LLM feature without a proper test suite, we spent three weeks arguing about whether the quality had regressed after a prompt change. Nobody had baseline numbers. Nobody had a definition of “good.” We were debugging by vibes.

Never again.

LLM testing is different from traditional software testing, but it isn’t impossible. It requires accepting that you’re testing probabilistic behavior and building your strategy around that reality instead of fighting it.

The problem with LLM outputs

Three things make LLM testing hard:

Non-determinism. The same input can produce different outputs across runs, even with temperature set to zero (some providers still have variance).
Multiple valid answers. For most tasks, there isn’t one correct answer. There’s a space of acceptable answers.
Invisible regressions. A prompt change or model update can shift behavior without any code change. Your CI pipeline sees green. Your users see worse outputs.

The instinct is to throw up your hands and say “we can’t test this.” That’s wrong. You can test this. You just can’t use assertEqual.

Layer 1: deterministic tests for everything around the model

The code around the LLM – prompt rendering, response parsing, validation, error handling – is deterministic. Test it like normal software.

func TestPromptRendering(t *testing.T) {
    tmpl := NewSupportPrompt()
    result, err := tmpl.Render(PromptInput{
        CustomerName: "Alice",
        Issue:        "billing dispute",
        History:      []string{"previous contact on 2024-07-15"},
    })
    if err != nil {
        t.Fatalf("render failed: %v", err)
    }

    if !strings.Contains(result, "Alice") {
        t.Error("prompt should contain customer name")
    }
    if !strings.Contains(result, "billing dispute") {
        t.Error("prompt should contain issue description")
    }
    if !strings.Contains(result, "2024-07-15") {
        t.Error("prompt should contain interaction history")
    }
}

func TestResponseParsing(t *testing.T) {
    raw := `{"action": "escalate", "reason": "billing dispute over $500", "priority": "high"}`

    resp, err := ParseSupportResponse([]byte(raw))
    if err != nil {
        t.Fatalf("parse failed: %v", err)
    }

    if resp.Action != "escalate" {
        t.Errorf("expected action=escalate, got %s", resp.Action)
    }
    if resp.Priority != "high" {
        t.Errorf("expected priority=high, got %s", resp.Priority)
    }
}

These tests are fast, stable, and catch a surprising number of regressions. I’ve seen parsing bugs slip through because teams only tested the happy path, then the model started returning JSON with trailing commas.

Also test mocked LLM responses to verify error handling and orchestration logic:

func TestHandlesModelTimeout(t *testing.T) {
    client := &MockLLMClient{
        Response: nil,
        Err:      context.DeadlineExceeded,
    }

    handler := NewSupportHandler(client)
    result, err := handler.Handle(context.Background(), "test query")

    if err != nil {
        t.Fatal("handler should not propagate model timeout as error")
    }
    if result.Fallback != true {
        t.Error("should trigger fallback on timeout")
    }
}

Layer 2: property-based checks for model outputs

You can’t check that the model said “I apologize for the inconvenience.” You can check that the response acknowledges the issue, avoids profanity, and stays under 200 words.

Define a rubric. Keep it simple.

type EvalCriteria struct {
    Name    string
    Check   func(input string, output string) bool
}

var supportResponseCriteria = []EvalCriteria{
    {
        Name: "acknowledges_issue",
        Check: func(input, output string) bool {
            lower := strings.ToLower(output)
            return strings.Contains(lower, "sorry") ||
                strings.Contains(lower, "understand") ||
                strings.Contains(lower, "apologize")
        },
    },
    {
        Name: "includes_next_steps",
        Check: func(input, output string) bool {
            lower := strings.ToLower(output)
            return strings.Contains(lower, "will") ||
                strings.Contains(lower, "next") ||
                strings.Contains(lower, "follow up")
        },
    },
    {
        Name: "reasonable_length",
        Check: func(input, output string) bool {
            words := strings.Fields(output)
            return len(words) >= 20 && len(words) <= 200
        },
    },
}

These aren’t perfect. The string matching is crude. But they catch common failure modes: responses that ignore the user’s problem, responses that are empty or absurdly long, and responses that miss expected elements.

For more nuanced checks – tone, factual accuracy, coherence – I use model-based evaluation. Have a separate evaluator model score the output against the rubric. It isn’t free, but it’s cheaper than human review on every test case and usually more reliable than regex.

Layer 3: the golden set

A golden set is a curated collection of representative inputs with expected properties. Not expected outputs, expected properties.

type GoldenCase struct {
    ID       string            `json:"id"`
    Input    string            `json:"input"`
    Expected map[string]string `json:"expected"`
}

// Example golden case
// {
//   "id": "billing_complaint_042",
//   "input": "I was charged twice for my subscription last month",
//   "expected": {
//     "tone": "empathetic",
//     "mentions": "refund OR credit OR billing",
//     "format": "paragraph under 150 words"
//   }
// }

I maintain 30-50 golden cases per feature. They cover common paths, known edge cases, and a few adversarial inputs. I run them weekly and after every prompt or model change.

The golden set is your regression detector. When a prompt change causes three previously passing golden cases to fail, you get a concrete signal that something shifted. No vibes. No arguments. Data.

The evaluation cadence that works

After trying several approaches, here’s what I’ve settled on:

Every commit: Run deterministic tests (layer 1). These are in CI and they block merges. Fast, stable, non-negotiable.
Every prompt/model change: Run the golden set (layer 3) and compare to the previous baseline. If pass rate drops, the change needs review.
Weekly: Run the full evaluation suite (layers 2 + 3) and track trends. Output a simple report: pass rate by criteria, any new failures, average response length.
After major updates: Human review of a random sample (~20 cases). Sanity check that the automated evaluation isn’t missing something.

This takes about two hours a week of human time. That’s a small investment for the confidence it provides.

What I wish more teams did

Version your prompts. Every prompt change should be a tracked commit with a diff. When quality regresses, you need to know which prompt version caused it. I keep prompts in version-controlled files, not in application code.

Track quality over time. A single evaluation run is a snapshot. A time series of evaluation results shows trends. Is quality gradually degrading? Did a model provider update cause a step change? You can’t answer these without historical data.

Test adversarial inputs. Your golden set should include attempts to jailbreak, confuse, or extract system prompts. These aren’t hypothetical attacks. They’re things real users will try.

LLM testing isn’t about proving the model is correct. It’s about building enough evidence that the system behaves acceptably across the inputs that matter. Layers, properties, golden sets, and a consistent cadence. That’s the strategy.

The Best Model Is the Smallest One That Works

Mon, 05 Aug 2024 00:00:00 +0000

The default instinct when building with LLMs is to reach for the biggest model available. I get it. When you don’t know exactly what you need, the biggest model feels like the safest bet. But “safest bet” and “right choice” are not the same thing.

Most production LLM tasks I see are classification, extraction, formatting, and short generation. Intent routing for a support bot. Extracting structured data from emails. Labeling inbound requests. These don’t need GPT-4 or Claude Opus. They need a model that’s fast, cheap, and predictable.

A small model running a well-scoped task will beat a large model running a vague one. Every time.

Where small wins

Small models shine when the output space is narrow and the success criteria are clear. If you can describe the correct answer format in one sentence, a small model can probably handle it: classification with a fixed label set, entity extraction with a defined schema, or reformatting text from one structure to another.

The advantages are not marginal. A Haiku-class model might respond in 200ms at a fraction of a cent per request. The same task on a frontier model might take 2 seconds and cost 10x more. At scale, that difference is the gap between a sustainable product and one that burns through runway.

I switched an intent router from GPT-4 to a small model last month. Accuracy stayed within 1%. Latency dropped 80%. Monthly inference cost dropped from $12K to under $2K. The engineering effort was two days of prompt tuning and evaluation.

Where small fails

Small models fall apart when the task requires multi-step reasoning, nuanced judgment, or long-form coherence. If you need a model to read a 10-page contract and identify three specific risks, it will miss things. If you need it to write a persuasive email that matches a specific executive’s tone, it will usually produce something generic.

The failure mode is subtle. Small models don’t refuse – they confidently produce mediocre output. You won’t see errors. You’ll see output that’s 80% right and 20% subtly wrong in ways that are hard to catch without careful evaluation.

The routing pattern

The most cost-effective architecture I’ve built is a two-tier system. Small model handles the 90% of requests that are well-scoped and predictable. Large model handles the 10% that need depth.

Route by complexity, not by topic. A billing question that maps to one of five categories goes to the small model. A billing dispute that requires reading context and making a judgment call goes to the large model. The router itself can be a small model – it’s just a classification task.

This is not novel. It is the same pattern as having junior engineers handle tickets and escalating to seniors. The model is the same. The economics are the same. Route smart, spend less.

Pick the smallest model that clears the bar

Don’t start with the biggest model and optimize later. Start with the smallest model and prove it’s insufficient before upgrading. You’ll be surprised how often “insufficient” never arrives.

The best model isn’t the smartest one. It’s the smallest one that meets your quality bar, at a cost and latency you can sustain.

Stop Stuffing Your Context Window

Mon, 22 Jul 2024 00:00:00 +0000

I’m tired of seeing teams dump entire documents into a context window because “it supports 128K tokens now,” then wonder why the model ignores their instructions. A bigger window isn’t a bigger brain. It’s a bigger inbox. And like any inbox, when you fill it with noise, important things get lost.

This is a rant. But it’s a rant with actionable advice.

The “just throw it all in” fallacy

Here’s what I keep seeing: a team builds a RAG pipeline that retrieves 20 document chunks for every query. They concatenate everything into the prompt because “more context is better.” The model now has 80K tokens of input, 60K of them irrelevant. The response is slower, more expensive, and, this is the part that kills me, lower quality than if they had sent 5K tokens of relevant context.

Retrieval isn’t free just because the window is big enough to hold it. Every irrelevant token dilutes the signal. The model has to figure out which parts of the context actually matter, and it isn’t always good at that, especially when the relevant information is sandwiched between walls of noise.

I reviewed a system where they were spending $400/day on inference. We cut their context budget by 70%, and quality went up. Not down. Up. The model could finally see the signal instead of drowning in noise.

Budget your context like you budget your infrastructure

You wouldn’t provision 10x the compute you need and call it a day. Don’t do it with context either.

Set a hard budget per request. Something like:

System prompt: 1-2K tokens (this should be stable and tight)
Retrieved context: 3-5K tokens max (be aggressive about relevance filtering)
Conversation history: 2-4K tokens (recent turns verbatim, older turns summarized)
Reserve: 1K tokens (for the model’s response and any overhead)

That’s 7-12K tokens for most requests. Not 128K. Not even close. And for 90% of production use cases, that’s more than enough.

Teams using 128K tokens per request are either doing something genuinely complex (document analysis, long-form generation) or being lazy. Mostly the latter.

Anchors: the stuff that must never fall out

Some information is non-negotiable. The user’s permissions. The current task definition. Key constraints. Explicit decisions made earlier in the conversation. I call these “anchors.”

Anchors go at the top of the context, every time. They don’t get summarized. They don’t get rotated out. They’re the ground truth that the model needs to respect regardless of how long the conversation gets.

I’ve debugged conversations where the model contradicted an earlier decision because the decision was in a turn that got summarized away. The summary said “the user chose option A” but the model treated it as a suggestion, not a commitment. Anchors prevent this.

Summaries need maintenance

Speaking of summaries: if you’re compressing conversation history into summaries, you need to refresh them. A summary generated 20 turns ago may be inaccurate or incomplete relative to the current state of the conversation.

The pattern I use is simple: keep the last 3-5 turns verbatim. Everything before that gets summarized. Refresh the summary every 10 turns or whenever a significant decision changes. It’s a small amount of extra work, and it prevents a category of bugs that’s extremely difficult to diagnose.

Retrieval is a precision problem, not a recall problem

Most RAG implementations err on the side of including too much. The logic goes: “better to include something irrelevant than to miss something important.” That sounds reasonable until you look at the actual failure modes.

From what I’ve seen, the most common production failure isn’t “the model didn’t have enough context.” It’s “the model had too much context and picked the wrong information.” Over-retrieval causes the model to confidently cite irrelevant passages while ignoring the one paragraph that actually answers the question.

Retrieve less. Filter aggressively. If you aren’t sure a chunk is relevant, leave it out. The model can ask follow-up questions. It can’t unsee irrelevant context.

The real problem is that nobody measures this

Most teams have no idea how their context utilization looks in production. They don’t track average context size, the ratio of relevant to irrelevant tokens, or the correlation between context size and output quality. They just set a max limit and hope for the best.

Instrument your context pipeline. Log the size of each section (system prompt, retrieved context, history, anchors). Track output quality as a function of context size. You’ll almost certainly discover that your sweet spot is much smaller than your current usage.

Bigger windows are a genuine improvement. They let you handle tasks that were impossible before. But for most production workloads, the discipline of managing context well matters more than the ability to stuff more into it.

Function Calling Patterns That Survive Production

Mon, 08 Jul 2024 00:00:00 +0000

Quick take

Function calling works in production when you treat it like boring infrastructure: strict schemas, validation at every boundary, explicit permissions, and structured errors. The model isn’t trusted code. It’s an external caller that happens to speak JSON. Build accordingly.

Function calling turned LLMs from text generators into system operators. That’s the opportunity and the risk. A model that can create tickets, query databases, and trigger deployments is powerful. A model that does those things with unvalidated arguments and no permission checks is a security incident waiting to happen.

I’ve built function calling integrations in past projects – mostly in Go – and the patterns that survive production are boring. That’s the point. Here’s what I’ve learned.

The mental model

Think of function calling as an API gateway where the caller is an LLM instead of a user. The model sees a list of available tools with schemas, picks one, and returns arguments as JSON. Your backend validates, executes, and returns results. The model then uses the results to continue the conversation.

User prompt + tool definitions
        |
        v
  Model selects tool + arguments (JSON)
        |
        v
  Backend validates arguments
        |
        v
  Backend executes tool (with permissions)
        |
        v
  Structured result returned to model
        |
        v
  Model generates final response

Simple in theory. In practice, the complexity is in validation, permissions, and error handling. That’s where most teams cut corners, and where most production incidents start.

Tool definitions: treat them like API contracts

A tool definition is a contract. The model’s behavior is only as good as the schema you provide. Vague descriptions produce vague arguments. Loose types produce invalid inputs.

In Go, I define tools as structs with explicit JSON Schema generation:

// ToolDef represents a callable tool exposed to the LLM.
type ToolDef struct {
    Name        string      `json:"name"`
    Description string      `json:"description"`
    Parameters  JSONSchema  `json:"parameters"`
    Handler     ToolHandler `json:"-"`
    Permission  Permission  `json:"-"`
}

type JSONSchema struct {
    Type       string                `json:"type"`
    Properties map[string]Property   `json:"properties"`
    Required   []string              `json:"required"`
}

type Property struct {
    Type        string   `json:"type"`
    Description string   `json:"description,omitempty"`
    Enum        []string `json:"enum,omitempty"`
    Default     string   `json:"default,omitempty"`
}

type ToolHandler func(ctx context.Context, args json.RawMessage) (*ToolResult, error)

A concrete example – a ticket creation tool:

var createTicketTool = ToolDef{
    Name:        "create_ticket",
    Description: "Create a support ticket. Requires a verified user session.",
    Parameters: JSONSchema{
        Type: "object",
        Properties: map[string]Property{
            "subject":  {Type: "string", Description: "Short summary of the issue"},
            "category": {Type: "string", Enum: []string{"billing", "bug", "account", "other"}},
            "priority": {Type: "string", Enum: []string{"low", "normal", "high"}, Default: "normal"},
        },
        Required: []string{"subject", "category"},
    },
    Handler:    handleCreateTicket,
    Permission: PermWriteApproval,
}

Notice the pattern: enums on every field with a bounded set of values, a clear description that tells the model when to use the tool, and required fields marked explicitly. The model doesn’t guess. It follows the contract.

The tool registry

Centralize tool registration. Don’t scatter tool definitions across your codebase. A single registry makes it easy to generate schemas for the model, enforce permissions, and audit what’s available.

type Registry struct {
    mu    sync.RWMutex
    tools map[string]ToolDef
}

func NewRegistry() *Registry {
    return &Registry{tools: make(map[string]ToolDef)}
}

func (r *Registry) Register(tool ToolDef) {
    r.mu.Lock()
    defer r.mu.Unlock()
    r.tools[tool.Name] = tool
}

func (r *Registry) Schema() []map[string]any {
    r.mu.RLock()
    defer r.mu.RUnlock()

    out := make([]map[string]any, 0, len(r.tools))
    for _, t := range r.tools {
        out = append(out, map[string]any{
            "type": "function",
            "function": map[string]any{
                "name":        t.Name,
                "description": t.Description,
                "parameters":  t.Parameters,
            },
        })
    }
    return out
}

func (r *Registry) Execute(ctx context.Context, name string, args json.RawMessage) (*ToolResult, error) {
    r.mu.RLock()
    tool, ok := r.tools[name]
    r.mu.RUnlock()

    if !ok {
        return &ToolResult{
            Success:   false,
            ErrorCode: "unknown_tool",
            Message:   fmt.Sprintf("tool %q not found", name),
        }, nil
    }

    return tool.Handler(ctx, args)
}

The Execute method is intentionally minimal. Validation and permission checks happen in the layers around it, not inside the registry itself. Separation of concerns matters here because you’ll want to add middleware later without rewriting the registry.

Validation: the model isn’t trusted

This is the hill I’ll die on: model-generated arguments are untrusted input. Always. Even with a tight schema, the model can produce unexpected values – empty strings, null where you expect a value, or fields that technically match the type but are nonsensical.

type CreateTicketArgs struct {
    Subject  string `json:"subject"`
    Category string `json:"category"`
    Priority string `json:"priority"`
}

func validateCreateTicketArgs(raw json.RawMessage) (*CreateTicketArgs, error) {
    var args CreateTicketArgs
    if err := json.Unmarshal(raw, &args); err != nil {
        return nil, fmt.Errorf("invalid JSON: %w", err)
    }

    args.Subject = strings.TrimSpace(args.Subject)
    if args.Subject == "" {
        return nil, fmt.Errorf("subject must be non-empty")
    }
    if len(args.Subject) > 200 {
        return nil, fmt.Errorf("subject exceeds 200 characters")
    }

    validCategories := map[string]bool{"billing": true, "bug": true, "account": true, "other": true}
    if !validCategories[args.Category] {
        return nil, fmt.Errorf("invalid category: %q", args.Category)
    }

    if args.Priority == "" {
        args.Priority = "normal"
    }
    validPriorities := map[string]bool{"low": true, "normal": true, "high": true}
    if !validPriorities[args.Priority] {
        return nil, fmt.Errorf("invalid priority: %q", args.Priority)
    }

    return &args, nil
}

Yes, this is verbose. That’s deliberate. I don’t want clever one-liners here. I want code that a new team member can read at 3 AM during an incident and immediately understand what it checks and why.

Structured errors that the model can recover from

When validation fails, return a structured error the model can act on. Not a stack trace. Not a generic “bad request.” A clear envelope:

type ToolResult struct {
    Success   bool   `json:"success"`
    ErrorCode string `json:"error_code,omitempty"`
    Message   string `json:"message,omitempty"`
    Data      any    `json:"data,omitempty"`
}

The model sees this and can retry with corrected arguments, ask the user for clarification, or explain the failure. Unstructured errors produce unstructured recovery attempts. I’ve seen models apologize to users for “server errors” when the actual problem was a missing required field.

Permission scoping

Every tool gets a permission level. Every request carries user context. The execution layer checks permissions before calling the handler. No exceptions.

type Permission int

const (
    PermReadOnly Permission = iota
    PermWriteApproval
    PermAdminOnly
)

type ExecContext struct {
    UserID    string
    Role      string
    SessionID string
}

func (r *Registry) ExecuteWithAuth(ctx context.Context, ec ExecContext, name string, args json.RawMessage) (*ToolResult, error) {
    r.mu.RLock()
    tool, ok := r.tools[name]
    r.mu.RUnlock()

    if !ok {
        return &ToolResult{Success: false, ErrorCode: "unknown_tool"}, nil
    }

    if !hasPermission(ec.Role, tool.Permission) {
        return &ToolResult{
            Success:   false,
            ErrorCode: "permission_denied",
            Message:   fmt.Sprintf("role %q cannot execute %q", ec.Role, name),
        }, nil
    }

    return tool.Handler(ctx, args)
}

func hasPermission(role string, required Permission) bool {
    switch required {
    case PermReadOnly:
        return true
    case PermWriteApproval:
        return role == "user" || role == "admin"
    case PermAdminOnly:
        return role == "admin"
    default:
        return false
    }
}

The model doesn’t decide permissions. The backend does. This isn’t negotiable. I’ve seen demos where the model is told “you have admin access” in the system prompt. That isn’t a permission system. That’s a suggestion.

Parallel execution with guardrails

Some models support parallel tool calls. This can cut latency significantly when tools are independent, but you still need timeouts and isolation.

func executeParallel(ctx context.Context, registry *Registry, ec ExecContext, calls []ToolCall) []*ToolResult {
    ctx, cancel := context.WithTimeout(ctx, 8*time.Second)
    defer cancel()

    results := make([]*ToolResult, len(calls))
    var wg sync.WaitGroup

    for i, call := range calls {
        wg.Add(1)
        go func(idx int, c ToolCall) {
            defer wg.Done()
            result, err := registry.ExecuteWithAuth(ctx, ec, c.Name, c.Arguments)
            if err != nil {
                results[idx] = &ToolResult{Success: false, ErrorCode: "execution_error", Message: err.Error()}
                return
            }
            results[idx] = result
        }(i, call)
    }

    wg.Wait()
    return results
}

The timeout is critical. A slow tool shouldn’t block the entire response. Return partial results and let the model work with what it has.

Observability

Log every tool call. But be smart about what you log:

Tool name and version
User ID and session ID
Argument hash (not raw arguments – those may contain PII)
Success/failure and error code
Execution latency

This gives you enough to debug failures, detect drift (is the model suddenly calling a tool it never used before?), and identify tools that are slow, failing, or overused.

What I wish I had known earlier

After building several of these systems, a few lessons stand out:

Keep tool descriptions short and precise. The model reads them on every request. Long descriptions waste tokens and confuse tool selection. One sentence describing the action, one sentence about when to use it.

Version your tool schemas. When you change a tool’s parameters, the model’s behavior will change too. Treat schema changes like API migrations.

Test with adversarial inputs. Ask the model to call tools with garbage arguments, impossible combinations, and injection attempts. Your validation layer should handle all of these cleanly.

Function calling is the interface between language models and real systems. It works when you treat it like infrastructure: boring, reliable, and well-instrumented. The clever part is the model. Your job is to make the execution layer as predictable as possible.

Claude 3.5 Sonnet Analysis: Cost, Coding, and Model Routing

Mon, 24 Jun 2024 00:00:00 +0000

Quick take

Claude 3.5 Sonnet is the first mid-tier model I’d default to for most production workloads. It matches or beats GPT-4 on coding tasks I care about, costs less, and Artifacts is genuinely useful for iteration. If you’re still routing everything to your most expensive model, run a side-by-side comparison. You’ll likely save money without losing quality.

Anthropic released Claude 3.5 Sonnet alongside a new Artifacts interface, and I’ve been running it against my usual workloads for a couple of weeks now. This isn’t a benchmark review. Benchmarks tell you how a model performs on someone else’s problems. I care about how it performs on mine.

The positioning shift that matters

Every model provider has a lineup: cheap-and-fast at the bottom, expensive-and-smart at the top. The default instinct for production teams is to reach for the top tier because the cost of a bad output usually outweighs the cost of inference.

Claude 3.5 Sonnet challenges that instinct. Anthropic is explicitly positioning a mid-tier model as the default for serious work. That isn’t just a pricing play. It’s a claim that the quality gap between tiers has narrowed enough that the mid-tier clears the bar for most real-world tasks. That is the same routing question behind broader AI inference cost trends: which requests actually deserve the expensive path?

I’ve been testing this claim. Here is what stood out.

Coding: where it actually impressed me

I ran Sonnet through the types of coding tasks I deal with in my Go-heavy workflow:

Multi-file refactors. I asked it to rename a package, update all references, and adjust the tests. Sonnet got this right on the first try, including edge cases in test helper files that GPT-4 had missed when I ran the same task a month earlier.

Bug diagnosis from error traces. I pasted a stack trace from a concurrency bug in a Go service. Sonnet identified the race condition, explained why it manifested only under load, and proposed a fix using sync.Mutex that was correct and idiomatic. It didn’t suggest sync.Map when a plain mutex was the right call. That kind of judgment matters.

Documentation from code. I gave it a 200-line Go package and asked for a README. The output was usable with minor edits. It captured the intent, not just the function signatures.

These are the tasks where I spend real time. A model that handles them reliably at a lower price point changes how I think about routing.

Where it falls short

Sonnet isn’t magic. I found its limits in a few predictable places:

Long-form reasoning across large contexts. When I loaded a full design document (~15K tokens) and asked for a critique, Sonnet’s analysis was surface-level compared to Opus. It identified structural issues but missed a subtle consistency problem that Opus caught.

Ambiguous instructions. When the prompt is vague, Sonnet tends to make reasonable but sometimes wrong assumptions instead of asking for clarification. This is manageable – you just need more explicit prompts – but it means you can’t be lazy with your instructions.

Creative writing. Not my primary use case, but I noticed it. Sonnet’s prose is competent but flat. If you need compelling narrative or nuanced tone, Opus is still noticeably better.

Artifacts: more useful than I expected

I was skeptical of Artifacts when I saw the announcement. It looked like a UI gimmick. After using it for two weeks, I changed my mind.

The core idea: when the model produces code, a document, or a visualization, it renders it in a separate panel instead of inline in chat. You can edit it, iterate on it, and share it. The model treats it as a persistent object in the conversation.

Where this is genuinely useful:

Prototyping UI components. Ask for a React component, see it rendered, ask for changes, see the update. The feedback loop is fast.
Drafting specs. The artifact is a living document that you refine through conversation. Much better than scrolling through a chat history to find the latest version.
Quick visualizations. SVG diagrams, simple charts, Mermaid flowcharts. The inline render makes iteration practical.

This isn’t a paradigm shift, but it is a genuine workflow improvement for anyone using an LLM for iterative creation.

How I’d evaluate this for your team

Don’t take my word for it. Run your own comparison. Here’s the approach I recommend:

Pick 10-15 real tasks from your last two sprints. Not toy problems – actual things your team spent time on. Code reviews, bug fixes, documentation, data analysis.
Run them through Sonnet and your current default model side by side. Same prompts, same context.
Score on three dimensions: correctness, usefulness (did you use the output or throw it away), and time saved.
Compare cost and latency. Sonnet should be meaningfully cheaper and faster. If the quality is comparable, the math is obvious.

Do this for a week, not an afternoon. First impressions are unreliable. You need enough data points to see the failure modes, not just the wins.

The model routing question

The real implication of Sonnet isn’t “use this instead of Opus.” It’s “think in terms of routing.”

Most teams use one model for everything. That was reasonable when the quality gap between tiers was large. Now that the gap is narrowing, a smarter approach is to route by task:

Sonnet for coding, classification, extraction, structured output, and most day-to-day work.
Opus for complex reasoning, nuanced analysis, and tasks where the cost of a wrong answer is high.
Haiku for preprocessing, filtering, and high-volume tasks where speed matters more than depth.

Keep model identifiers in config, not in code. Make routing a configuration decision, not a code change. That way you can shift traffic as models improve without redeploying.

What matters

Claude 3.5 Sonnet is the first mid-tier model where I stopped reaching for the top-tier by default. It handles my actual workloads well, costs less, and the Artifacts feature makes iteration faster.

The right move isn’t to blindly switch. It’s to test on your workloads, measure the quality gap, and route intelligently. For most teams, that will mean moving a significant chunk of traffic to Sonnet and saving the heavyweight model for the tasks that genuinely need it.

AI Compliance Without the Theater

Mon, 10 Jun 2024 00:00:00 +0000

Quick take

AI compliance is a design problem, not a paperwork problem. Build a data inventory, a model registry, and audit logging before you ship – not after legal gets involved. The organizations shipping fastest are the ones that treat compliance as architecture, not bureaucracy.

My perspective on AI compliance is shaped by two things: working on AI adoption at large enterprises and my work on national cyber-defense. Those are very different worlds, but they share one uncomfortable truth – organizations that treat security and compliance as an afterthought tend to have the worst incidents and the slowest response times.

In the defense world, you learn quickly that compliance isn’t about checking boxes. It’s about building systems that can answer hard questions fast. Where did this data come from? Who authorized this action? What changed between yesterday and today? When something goes wrong at 2 AM, nobody cares about your compliance document. They care about whether your systems can provide answers.

That same principle applies to enterprise AI. Just with lower stakes and, unfortunately, less discipline.

The questions that actually matter

I’ve sat through dozens of compliance reviews for AI systems. They all converge on the same handful of questions:

Where does user data go during inference, and is any of it retained?
Can you trace a specific output back to the model version and prompt that produced it?
How do you detect and handle unsafe, biased, or hallucinated outputs?
Who approved this use case, and what risk assessment was done?
If the model provider changes their terms or has a breach, what’s your exit plan?

If your engineering team can’t answer these within minutes, you aren’t ready for production. Full stop. I’ve seen AI projects delayed six months because the team couldn’t explain their data flow to a procurement review. That isn’t a compliance problem. That’s a design problem.

Data governance is the foundation

Start with a data inventory. Not a theoretical one – a real, maintained list of what data enters your AI pipeline, how it’s classified, where it’s processed, and when it’s deleted.

This sounds basic. It is. Most teams still skip it because it’s boring. Then, three months in, they discover their LLM provider’s terms allow training on API inputs, and they’ve been sending customer PII through an endpoint with no data processing agreement.

From my national cyber-defense experience: you don’t get to decide what data classification matters after the incident. You decide before. The same applies here. Know your data flows. Classify them. Enforce the policies technically, not just on paper.

Model accountability isn’t optional

You need a model registry. Every inference in production should be traceable to a specific model version, a specific prompt version, and a specific configuration. This isn’t overengineering. This is the minimum bar for debugging, incident response, and regulatory compliance.

What to log for each request:

A stable request ID
Model identifier and version
Prompt template version
A hash or summary of the output (not the raw output if it contains sensitive content)
Timestamp, user context, and latency

In the defense space, we call this “chain of custody for decisions.” In enterprise AI, it’s just good engineering. I’m still surprised by how many teams ship without it.

Human oversight that actually works

The compliance frameworks I’ve seen fail are the ones that require human approval for everything. That doesn’t scale. It creates bottlenecks, and people start rubber-stamping just to keep velocity.

Better approach: tier your use cases by risk.

Low risk (internal tools, human-reviewed outputs): self-service approval, lightweight monitoring. A team lead signs off and you move on.

Medium risk (customer-facing, influences decisions): security review, data assessment, defined rollback plan. One meeting, not a committee.

High risk (financial, medical, safety-critical): full review cycle with legal, security, and domain experts. No shortcuts, but a defined timeline.

The goal is to make the approval path proportional to the risk. Low-risk use cases should ship in days, not weeks. High-risk use cases should have rigor, not paralysis.

Vendor risk is your risk

Every AI provider you use is a critical dependency. Treat it that way. I’ve reviewed vendor contracts where data-handling terms were buried in an appendix nobody on the engineering team had read.

Key questions for any AI vendor:

Is customer data used for model training? Can you opt out?
What’s the breach notification timeline?
What happens to your data if you terminate the contract?
Can you run the same workload on a different provider if needed?

Lock-in is a compliance risk. If your only option is one provider and they change their terms or have a major incident, you need a plan B that doesn’t require rewriting your entire pipeline.

Three artifacts you actually need

Forget the 50-page compliance documents. Maintain three living artifacts:

System card. One page per AI system: what it does, what data it touches, known limitations, risk tier, and owner.
Data inventory. Where data comes from, where it goes, classification, retention, and deletion procedures.
Model registry. Model versions in production, evaluation results, prompt versions, and deployment history.

Keep them in version control, not in a shared drive nobody checks. Review them quarterly, or whenever the model or data pipeline changes.

The real competitive advantage

The enterprises shipping AI fastest right now aren’t the ones ignoring compliance. They’re the ones that built it into their architecture early, kept it lightweight, and made it a development practice instead of a legal review.

Compliance built into the system is invisible. Compliance bolted on afterward is a project that never ends.

Why Your Enterprise AI Pilot Is Stuck

Mon, 03 Jun 2024 00:00:00 +0000

Every enterprise AI conversation I’ve had this year follows the same arc. Someone builds a proof of concept. The demo goes well. Leadership gets excited. Then, three months later, the project is stuck in limbo: security reviews, data access requests, and nobody quite sure who actually owns it.

I see this pattern across telecom and fintech organizations. The demo-to-production gap isn’t a technology problem. It’s an organizational one.

The demo was the easy part

A POC can skip everything that makes enterprise software hard. It runs on a developer’s laptop with test data. It doesn’t need to handle real user volumes. During a demo, nobody asks about audit trails or data retention policies.

Then the project moves toward production and reality hits. Security wants a threat model. Legal wants to know where the data goes. The platform team wants to know who pays for compute. The data science team discovers the training data is messier than expected. None of this is surprising. These are the same problems every enterprise system faces, plus a few new AI-specific ones: model drift, prompt management, and probabilistic outputs.

The teams that get stuck are the ones that treated the POC as the starting line instead of a feasibility check.

Start boring, stay boring

The single best predictor of success I’ve seen is picking a first use case that’s low-risk and internal. Something where a human reviews the output before anything happens. Document summarization for internal teams. Draft generation for support responses that get edited before sending. Classification of inbound requests to route them to the right queue.

These aren’t exciting. That’s the point. You want a use case where a bad output is an inconvenience, not a liability. One where you can iterate on prompts and evaluate quality without a customer ever seeing an unpolished result.

I keep telling teams the same thing: your first AI feature should be invisible to customers. Ship it internally, prove it works, build the muscle memory for operating AI in production, then expand.

Build the platform before the pilots multiply

Here’s what happens when you don’t have a shared platform: every team builds its own integration. They pick different models, prompt patterns, and logging approaches. Six months later, you have eight AI features and no way to compare quality, manage costs, or enforce policies across them.

The fix is unglamorous. Build a thin shared layer early. It needs three things:

Centralized model access with authentication, rate limiting, and cost tracking.
A prompt registry so prompts are versioned, reviewable, and not buried in application code.
Evaluation tooling that every team can use to measure output quality against a golden set.

This doesn’t need to be perfect or fully featured. It needs to exist before the third team starts building their own AI integration. I’ve watched organizations try to consolidate after the fact. It’s painful and expensive.

Governance that enables instead of blocks

The worst governance models I see are designed by committee without input from the engineering teams that have to live with them. They produce a 40-page policy document, a six-week review cycle, and a strong incentive for teams to quietly build things without telling anyone.

Good governance is lightweight and fast. A one-page use case template. A clear risk-tier system: low risk gets self-service approval, high risk gets review. A standing meeting where legal, security, and engineering are in the same room instead of a months-long email chain.

One organization I worked with reduced its AI approval cycle from eight weeks to five days by switching from a document-based review to a 30-minute live walkthrough with all stakeholders. Same rigor. Fraction of the time.

The uncomfortable truth

Most enterprise AI projects don’t fail because the technology isn’t ready. They fail because the organization isn’t ready. The AI works fine in the demo. The procurement process takes four months. The data team can’t provide clean training data. The legal review has no precedent to follow, so it defaults to “no” until someone escalates.

If you want to ship AI in an enterprise, spend less time evaluating models and more time clearing organizational roadblocks. Get a budget owner. Get a security sponsor. Get data access sorted before you write the first prompt.

Process beats talent. Every time.

Building Voice AI That People Actually Use

Mon, 27 May 2024 00:00:00 +0000

Quick take

Voice AI works when you treat it like plumbing, not magic. Keep perceived latency under 500ms, treat interruptions as a first-class concern, and keep the task scope narrow. The architecture choice between a modular pipeline and an end-to-end model matters less than your streaming strategy.

The gap between a voice AI demo and a voice AI product is about six months of work on things nobody finds exciting: latency tuning, interruption handling, and figuring out what happens when the user mumbles, changes their mind, or goes silent for eight seconds.

I’ve been involved in voice interface projects going back to a travel startup I built, and more recently in voice-first support tools. The models have gotten dramatically better. The engineering around them hasn’t kept pace.

Two architectures, one tradeoff

You have two practical options for a voice AI system:

Modular pipeline: Separate services for transcription, reasoning, and synthesis. You can swap components, instrument each stage, and debug failures in isolation. The cost is latency at every boundary.

mic -> STT service -> LLM -> TTS service -> speaker
         ~200ms       ~800ms     ~300ms

End-to-end model: A single model like GPT-4o that handles audio natively. Lower latency and a more natural feel, but harder to debug, and you’re locked to one provider’s capabilities.

I lean modular for anything going to production. Here’s why: when a user reports “the bot said something weird,” I need to know whether it was a transcription error, a reasoning failure, or a synthesis artifact. With an end-to-end model, that’s a black box.

The streaming architecture that matters

The biggest latency win isn’t model speed. It’s streaming. Start synthesizing audio before the full response is generated. In Go, it looks something like:

type VoiceSession struct {
    sttClient    STTClient
    llm          LLMClient
    ttsClient    TTSClient
    audioOut     chan []byte
    interrupted  atomic.Bool
}

func (s *VoiceSession) HandleUtterance(ctx context.Context, audio []byte) error {
    // Transcribe
    transcript, err := s.sttClient.Transcribe(ctx, audio)
    if err != nil {
        return fmt.Errorf("transcription failed: %w", err)
    }

    // Stream LLM response, pipe chunks directly to TTS
    stream, err := s.llm.StreamChat(ctx, transcript)
    if err != nil {
        return fmt.Errorf("llm stream failed: %w", err)
    }

    var buf strings.Builder
    for chunk := range stream {
        if s.interrupted.Load() {
            return nil // User interrupted, stop generating
        }

        buf.WriteString(chunk.Text)

        // Flush to TTS at sentence boundaries
        if isSentenceEnd(buf.String()) {
            audioChunk, err := s.ttsClient.Synthesize(ctx, buf.String())
            if err != nil {
                continue // Degrade gracefully, skip this chunk
            }
            s.audioOut <- audioChunk
            buf.Reset()
        }
    }

    // Flush remaining text
    if buf.Len() > 0 {
        audioChunk, _ := s.ttsClient.Synthesize(ctx, buf.String())
        s.audioOut <- audioChunk
    }

    return nil
}

The key insight: flush to TTS at sentence boundaries, not at the end of the full response. The user hears the first sentence while the model is still generating the third. Perceived latency drops from 1300ms to under 500ms.

Interruptions aren’t edge cases

People interrupt. They talk over the bot. They say “wait, no, actually…” halfway through a sentence. If your system can’t handle this, users will hate it within 30 seconds.

The interrupt handler needs to do three things fast:

Stop audio output immediately. Not after the current sentence. Now.
Cancel pending TTS and LLM generation. Don’t waste compute on a response nobody will hear.
Accept the new input without resetting the conversation. Context should carry over.

func (s *VoiceSession) HandleInterrupt(ctx context.Context, newAudio []byte) error {
    s.interrupted.Store(true)

    // Drain the audio output channel
    for len(s.audioOut) > 0 {
        <-s.audioOut
    }

    s.interrupted.Store(false)
    return s.HandleUtterance(ctx, newAudio)
}

This is simplified, but the pattern holds. The atomic.Bool flag propagates interrupts to the streaming loop without complex synchronization.

When voice is the wrong interface

Voice is great when:

The user’s hands are busy (driving, cooking, field work)
The task has a narrow, predictable vocabulary
The expected output is short – a confirmation, a lookup, a simple action

Voice is terrible when:

The user needs to compare options visually
The output is complex or structured (tables, code, lists)
Precision matters more than speed (medical, legal, financial details)

I keep seeing teams try to build “voice-first everything” products. Don’t do this. Voice should be one input mode in a system that gracefully falls back to text or visual UI when the task demands it.

Operational concerns that will bite you

Transcription accuracy varies wildly by accent, background noise, and microphone quality. Test with real users in real environments, not in a quiet office with a studio mic. I learned this the hard way: a prototype that worked perfectly in our office fell apart in a warehouse with forklift noise.

Track these metrics from day one:

Transcription word error rate by user segment
Time to first audio byte (perceived latency)
Interruption rate and recovery success
Conversation completion rate vs. abandonment
Fallback-to-text rate

Cost adds up fast. A 30-second voice interaction can involve a STT call, an LLM call with conversation history, and a TTS call. Multiply by thousands of daily users and you need a cost model before you launch, not after.

Keep it boring

The best voice AI products I’ve seen are boring. They do one thing, they do it fast, and they handle failure gracefully. A voice ordering system that works for 50 menu items. A voice-controlled inventory check. A hands-free incident report dictation tool.

Nobody is going to have a deep philosophical conversation with your voice bot. They want to get something done and move on. Design for that.

The tech is ready. The hard part is the discipline to ship something narrow and reliable instead of something ambitious and fragile.

GPT-4o Changed the Interface, Not the Hard Part

Mon, 13 May 2024 00:00:00 +0000

I was on a call with an engineering team when the GPT-4o demo dropped. Someone shared the link in Slack, and within ten minutes nobody was paying attention to the sprint review anymore. The live voice demo, the real-time vision, the emotion in the synthesized speech – it looked like science fiction shipping on a Tuesday afternoon.

Then the demo high wore off, and the real questions started.

What actually shipped

GPT-4o is a single model that handles text, images, and audio natively. No more chaining a whisper transcription into GPT-4 into a TTS engine. One model, one round trip, multiple modalities.

That sounds incremental until you think about what it kills: the glue. I’ve spent more time than I want to admit debugging pipelines where context got lost between the speech-to-text step and the reasoning step, or where the TTS output sounded robotic because the model had no awareness it was producing spoken words. GPT-4o collapses that entire pipeline into a single inference call.

Fewer seams means fewer places for things to break. That matters more than any benchmark.

Where this changes product design

The interesting shift isn’t “AI can talk now.” It’s that users no longer have to context-switch between modalities. Show the camera, describe the problem, get an answer – all in one continuous loop.

I’ve been advising a couple of teams building support tools, and this unlocks patterns that were previously too brittle to ship:

Live visual troubleshooting. User points their phone at the broken thing, explains the issue, and the model responds while looking at the same image. No more “please upload a screenshot and describe what happened.”
Hands-free workflows. Voice as primary input, text as structured output. Think field technicians, warehouse workers, anyone whose hands are occupied.
Coaching and tutoring. The model sees the student’s work and talks through corrections in real time. This was a three-service pipeline before. Now it’s one call.

These aren’t hypothetical. They’re products teams tried to build last year and abandoned because latency and context loss across the pipeline made them unusable.

The complexity doesn’t disappear

Here is what the demo didn’t show: the model is faster and more unified, but the infrastructure around it is still hard.

Streaming audio over unreliable mobile networks is an unsolved problem in most organizations. Encoding images in real time on low-end devices is a performance cliff. And once you’re processing audio and video from users, you have entered a privacy and consent minefield that most teams haven’t mapped.

A single model simplifies the AI layer. It doesn’t simplify the transport layer, the device layer, or the compliance layer. If anything, it makes those harder because the demo sets expectations that the infrastructure can’t meet yet.

I told a team last week: “The model is ready. Your CDN isn’t.”

How I’d evaluate this

When API access is fresh and the documentation is still evolving, the worst thing you can do is build something ambitious. Pick the narrowest possible workflow. Something like: user speaks a question, model responds with text and audio. No vision, no tool calling, just the core loop.

Measure three things:

Does the end-to-end interaction feel natural, or does the latency break the illusion?
How does it behave with bad audio – background noise, accents, interruptions?
What does failure look like, and can the UI recover without the user noticing?

If you can’t answer those three questions with your prototype, you aren’t ready to expand scope. Ship the boring version first.

Real-time multimodal means you’re potentially recording and processing audio and video from real people. That’s a different legal and ethical surface than processing text prompts.

You need explicit consent flows. You need to decide what gets stored and what gets discarded after inference. You need a plan for when the model misinterprets visual input in a way that’s embarrassing or harmful. Most of the teams I’ve talked to are hand-waving this. Don’t be one of them.

What matters

GPT-4o is a genuine architecture shift. One model, multiple modalities, real-time responses. That eliminates an entire class of integration problems and makes products possible that weren’t viable six months ago.

But the hard part was never the model. The hard part is reliable transport, device compatibility, privacy, and graceful degradation. The teams that win with this will be the ones who treat the model as the easy layer and invest in everything around it.

LLM Structured Output in Go: JSON Schema, Validation, Retries

Mon, 29 Apr 2024 00:00:00 +0000

Quick take

Structured output is a contract-enforcement problem, not a prompting problem. Define a schema, constrain the prompt, validate every response, and build a repair loop for when the model drifts. I do this in Go with about 300 lines of reusable code. Here is all of it.

I have a rule for any LLM feature that feeds a downstream system: if you can’t json.Unmarshal the response into a typed struct, it isn’t done.

That sounds obvious. In practice, it isn’t. I still see production systems parsing LLM output with string splitting and regex. They work until they don’t, and when they break, they fail in ways that are hard to diagnose because the failure is subtle data corruption, not a crash.

Structured output from LLMs is a solved problem if you treat it as contract enforcement. Define what you expect. Tell the model exactly what you expect. Validate what you get. Repair what breaks. Here is how I do it in Go. This is one of the control surfaces that belongs in any serious AI-native architecture and evaluation pipeline .

The failure modes are predictable

LLMs generate text. They don’t generate data structures. Even with strong prompting, they will occasionally:

Wrap the JSON in markdown code fences or explanatory prose
Omit fields they consider “obvious” or irrelevant
Use wrong types (string "null" instead of JSON null, number as string)
Rename fields to something they think is more descriptive
Produce partial output when hitting token limits

Every pattern in this post targets one of these failures. They aren’t edge cases. They’re the normal operating reality of structured LLM output.

Define the contract as Go types

Start with the output structure. This isn’t just documentation – it’s both the validation target and the deserialization target. One definition serves both purposes.

type ContactInfo struct {
	Name    string  `json:"name"    validate:"required,min=1"`
	Email   *string `json:"email"   validate:"omitempty,email"`
	Company *string `json:"company"`
	Role    *string `json:"role"`
}

Nullable fields use pointers. Required fields use value types. The validate tags drive runtime validation. This struct is the single source of truth: the prompt references it, the validator enforces it, and the calling code consumes it.

I also generate a JSON Schema from the struct for inclusion in prompts. This keeps the prompt and validation in sync automatically:

func SchemaFor[T any]() ([]byte, error) {
	reflector := jsonschema.Reflector{
		RequiredFromJSONSchemaTags: true,
		DoNotReference:             true,
	}
	schema := reflector.Reflect(new(T))
	return json.MarshalIndent(schema, "", "  ")
}

One definition. One schema. No drift between what you ask for and what you validate.

Build the prompt to minimize ambiguity

The prompt should be rigid and specific. No motivational language. No “please try your best.” Just the schema, the rules, and the input.

func BuildExtractionPrompt(schema []byte, input string) string {
	return fmt.Sprintf(`Extract structured data from the input. Return ONLY valid JSON matching this schema:

%s

Rules:
- Use null for missing fields, not empty strings
- Lowercase email addresses
- No additional keys beyond the schema
- No markdown, no explanation, just the JSON object

Input:
%s

JSON:`, string(schema), input)
}

The JSON: at the end is a small trick that helps. It primes the model to start generating JSON immediately instead of opening with “Here is the extracted data:” or similar preamble.

The extraction pipeline

This is the core of the system: call the model, clean the response, parse it, validate it, and retry on failure.

type Extractor[T any] struct {
	client     LLMClient
	validator  *validator.Validate
	schema     []byte
	maxRetries int
}

func NewExtractor[T any](client LLMClient, maxRetries int) (*Extractor[T], error) {
	schema, err := SchemaFor[T]()
	if err != nil {
		return nil, fmt.Errorf("generating schema: %w", err)
	}

	return &Extractor[T]{
		client:     client,
		validator:  validator.New(),
		schema:     schema,
		maxRetries: maxRetries,
	}, nil
}

func (e *Extractor[T]) Extract(ctx context.Context, input string) (*T, error) {
	prompt := BuildExtractionPrompt(e.schema, input)
	var lastErr error

	for attempt := range e.maxRetries {
		raw, err := e.client.Generate(ctx, prompt)
		if err != nil {
			return nil, fmt.Errorf("llm call failed: %w", err)
		}

		cleaned := cleanJSONResponse(raw)

		var result T
		if err := json.Unmarshal([]byte(cleaned), &result); err != nil {
			lastErr = fmt.Errorf("attempt %d: json parse error: %w", attempt+1, err)
			prompt = buildRepairPrompt(prompt, raw, err.Error())
			continue
		}

		if err := e.validator.Struct(result); err != nil {
			lastErr = fmt.Errorf("attempt %d: validation error: %w", attempt+1, err)
			prompt = buildRepairPrompt(prompt, raw, err.Error())
			continue
		}

		return &result, nil
	}

	return nil, fmt.Errorf("extraction failed after %d attempts: %w", e.maxRetries, lastErr)
}

A few things to notice. The generic type parameter means this extractor works for any output struct: ContactInfo, InvoiceData, whatever. The cleaning step handles the most common format issues before parsing. And on failure, the repair prompt feeds the error back to the model so it can fix the specific problem.

Cleaning the response

Models love to wrap JSON in markdown code fences or add explanatory text. This function strips that away:

func cleanJSONResponse(raw string) string {
	s := strings.TrimSpace(raw)

	// Strip markdown code fences
	if strings.HasPrefix(s, "```") {
		lines := strings.Split(s, "\n")
		// Remove first line (```json) and last line (```)
		start := 1
		end := len(lines) - 1
		if end > start && strings.TrimSpace(lines[end-1]) == "```" {
			end = end - 1
		}
		s = strings.Join(lines[start:end], "\n")
	}

	// Find the first { and last } to extract the JSON object
	firstBrace := strings.Index(s, "{")
	lastBrace := strings.LastIndex(s, "}")
	if firstBrace >= 0 && lastBrace > firstBrace {
		s = s[firstBrace : lastBrace+1]
	}

	return strings.TrimSpace(s)
}

This isn’t pretty. It doesn’t need to be. It handles the three wrapping patterns I most often see in production: code fences, leading prose, and trailing explanation.

The repair prompt

When parsing or validation fails, the repair prompt tells the model exactly what went wrong:

func buildRepairPrompt(originalPrompt, badOutput, errorMsg string) string {
	return fmt.Sprintf(`%s

Your previous output was invalid:
%s

Error: %s

Fix the error and return ONLY valid JSON.

JSON:`, originalPrompt, badOutput, errorMsg)
}

This is where the retry loop earns its keep. The model gets the original instructions, sees its own bad output, and gets a specific error message to fix.

From what I’ve seen, this recovers about 80% of validation failures on the first retry. The remaining 20% usually indicate a genuinely ambiguous input that needs human review.

Use JSON mode when available

Most model APIs now offer a JSON-only response mode. Use it. It eliminates prose wrapping entirely and significantly reduces parsing failures.

func (e *Extractor[T]) Extract(ctx context.Context, input string) (*T, error) {
	prompt := BuildExtractionPrompt(e.schema, input)
	opts := GenerateOptions{
		ResponseFormat: ResponseFormatJSON, // Use JSON mode
	}

	// ... rest of the extraction logic
}

But – and I can’t stress this enough – JSON mode doesn’t mean you skip validation. The model can still omit required fields, use wrong types, or produce a valid JSON object that doesn’t match your schema. JSON mode guarantees parseable JSON. It doesn’t guarantee correct JSON for your use case.

Monitoring structured output in production

Three metrics I track for every structured-output pipeline:

Parse success rate. What percentage of responses parse and validate on the first attempt? If this drops below 95%, something changed: the model updated, the prompt drifted, or the input distribution shifted.
Retry rate and recovery rate. How often do you need retries, and how often do retries succeed? A high retry rate with good recovery means the repair loop is working. A high retry rate with low recovery means something is fundamentally wrong.
Field-level error distribution. Which fields cause the most validation failures? This tells you where the prompt needs to be more explicit or where the schema needs adjustment.

I log every extraction attempt : success or failure, first try or retry, with the raw model output. When something goes wrong in production, I want to see exactly what the model returned, not just that it failed.

The pattern, summarized

Every structured-output pipeline I build follows the same sequence:

Define the contract as a Go struct with validation tags.
Generate the JSON Schema from that struct.
Build a rigid prompt that includes the schema and leaves no room for interpretation.
Clean the raw response to handle common wrapping patterns.
Parse and validate against the struct.
On failure, retry with a repair prompt that includes the specific error.
Monitor parse rates, retry rates, and field-level errors.

This isn’t clever. It isn’t novel. It’s disciplined application of the same contract-enforcement thinking we use everywhere else in software engineering. The model is an unreliable data source. Treat it like one.

Most AI Developer Tools Are Not Worth Adopting Yet

Mon, 15 Apr 2024 00:00:00 +0000

Everyone has a favorite AI developer tool now: code assistants, LLM frameworks, vector databases, eval harnesses, observability platforms, deployment wrappers. The landscape is overwhelming, and most of it isn’t worth your time.

That isn’t cynicism. It’s experience. I’ve watched teams adopt tools that solve problems they don’t have, add abstraction layers they can’t debug, and create dependencies they can’t unwind. The result is a stack that’s harder to understand than the problem it was supposed to simplify.

The framework trap

Here is my unpopular opinion: most teams shouldn’t be using an LLM framework. LangChain, LlamaIndex, whatever ships next week – they are solving a real problem, but they are solving it for a use case most teams haven’t reached yet.

If your application calls one model with one prompt and parses the output, you don’t need a framework. You need an HTTP client and solid error handling. A framework adds routing, memory, tool calling, and chain-of-thought orchestration that you might need in six months. Right now, it mostly adds layers you can’t see through when something breaks.

Start without the framework. Add it when you can name the specific pieces it replaces and what maintenance burden it removes. Not before.

Code assistants are useful. Stop pretending they are magic.

I use Copilot daily. It’s good at boilerplate, decent at suggesting patterns I’ve seen before, and occasionally impressive on unfamiliar code. It’s also confidently wrong often enough that accepting suggestions uncritically is dangerous.

Teams getting real value from code assistants treat the output as a first draft. It goes through the same code review process as any other contribution. Teams getting hurt are the ones accepting suggestions because they “look right” without checking whether they actually are.

The productivity gain is real, but smaller than the marketing suggests. It also comes with a hidden cost: style drift. The assistant doesn’t know your team’s conventions. Over time, the codebase starts to feel inconsistent unless you actively enforce standards on AI-generated code.

What actually earns its place

After working with several teams on their AI tooling stacks, I have a short list of what I think is genuinely worth adopting:

Eval harnesses. Whatever helps you measure output quality against a test set. This can be a framework or a 200-line script. It doesn’t matter. What matters is that it exists and runs on every change.

Structured logging for LLM calls. Not a fancy observability platform – just disciplined logging of prompts, responses, latency, and token counts. You will need this data the moment something goes wrong. Which will be soon.

A simple abstraction over model providers. Not a framework. Just a thin interface that lets you swap models without rewriting calling code. I build these in Go in an afternoon. They pay for themselves the first time a provider changes their API.

That’s it. Everything else should prove its value before it gets a spot in go.mod.

The decision filter

Before adopting any AI tool, answer one question: what specific friction does this remove that I can’t solve with under a day of custom code?

If the answer is “it makes things easier” or “everyone is using it,” that isn’t good enough. If the answer is “it replaces 500 lines of boilerplate I maintain across three services,” then fine. Adopt it.

Keep the stack small. Keep it legible. The tooling landscape will look completely different in six months anyway.

Agentic Workflows: From Demo Magic to Production Reality

Mon, 01 Apr 2024 00:00:00 +0000

Quick take

An agent that can read data and change state isn’t a chatbot with extra steps. It’s a system with real blast radius. Constrain it with explicit policies, prefer structured workflows over free-form loops, and invest in observability before you invest in capabilities. The boring stuff is what makes agents safe to ship.

There’s a moment in every agentic AI demo that makes the audience gasp. The agent reads a database, reasons about the results, drafts an email, and sends it. Autonomously. It feels like magic.

Then someone asks: “What happens if it sends the wrong email?” And the room gets quiet.

I’ve been building agentic systems for several months now. The demo-to-production gap here is wider than almost anywhere else in AI engineering. A chatbot that hallucinates is annoying. An agent that hallucinates and then acts on the hallucination is a liability.

The difference between teams that ship agents successfully and teams that revert after a week comes down to three things: boundaries, structure, and boring reliability work.

Boundaries first, capabilities second

Almost every team starts with capabilities. “What tools should the agent have? What actions can it take?” Wrong starting point.

Start with constraints. What is the agent not allowed to do? What’s the maximum blast radius of a single run? What happens when it goes wrong?

A policy config is the simplest way to make these constraints explicit and auditable:

agent_policy:
  allowed_tools: [read_db, write_ticket, send_email_draft]
  max_steps: 8
  max_runtime_seconds: 120
  max_cost_usd: 0.50
  approval_required: [send_email, issue_refund, modify_production]

This isn’t a suggestion. It’s the foundation. The allowed tools list is an allowlist, not a blocklist – the agent can only use what’s explicitly permitted. Step and time limits prevent runaway loops. Cost caps prevent a single request from draining your budget. The approval list separates actions that are safe to automate from actions that need a human in the loop.

At one delivery company I worked with, a team skipped the approval step for “low-risk” actions. One of those low-risk actions turned out to be updating customer records. An agent misinterpreted a support request and bulk-updated addresses for a batch of orders. The fix took two days. The approval gate would have taken two seconds.

If the policy feels too restrictive, relax it intentionally and document why. If you can’t explain why a tool is on the allowed list, it shouldn’t be there.

Structured workflows beat free-form loops

The temptation with agents is to give them a goal and let them figure out the steps. This works beautifully in demos. In production, it creates systems that are impossible to debug, test, or audit.

I prefer structured workflows with a small number of decision points. The model chooses among defined paths. Deterministic logic handles state transitions. The result is a system you can trace, test, and explain.

Think of it as a state machine where the model influences transitions but doesn’t control them entirely. The model might decide whether a customer inquiry needs escalation or can be handled automatically. But the escalation path itself – what happens, in what order, and with what approvals – is defined in code, not improvised by the model.

When a task genuinely doesn’t fit a clean workflow, isolate it. Put the free-form reasoning in a narrow, heavily instrumented sandbox with tight constraints. Don’t make it the default path for everything.

The boring reliability checklist

I know this section won’t go viral. That’s fine. It’s the section that keeps your agent from becoming an incident.

Idempotent steps. If a step fails and retries, it shouldn’t duplicate work. The agent shouldn’t send two emails because the first one timed out after actually sending. Design every action to be safe to retry.

Checkpointing. Long-running workflows should save their state at each step. If the process crashes or the model call times out, the workflow should resume from the last checkpoint, not start over.

Time and step caps. Hard limits. Non-negotiable. An agent stuck in a reasoning loop should hit a wall after N steps or M seconds, return whatever partial results it has, and report the failure. I set these conservatively and loosen them only after seeing production data.

Retry discipline. Retry on clearly transient failures – rate limits, network timeouts. Don’t retry on semantic failures – the model misunderstood the task, or the tool returned an error because the input was wrong. Retrying bad logic just wastes money and time.

Observability isn’t optional

If you can’t trace what an agent did – every tool call, every model response, every decision point – you can’t debug it. And you will need to debug it.

Structured logging for every step:

What tool was called and with what inputs
What the model returned and what confidence signal it provided
Whether an approval was required and who approved it
How long each step took and how many tokens it consumed
The final outcome and whether it matched the intent

This log isn’t just for debugging. It’s your feedback loop. It tells you which prompts need refinement, which tools are unreliable, which workflows cost too much, and where the model consistently makes bad decisions.

One caution: be disciplined about what you log. Inputs and outputs may contain sensitive data. Define retention policies and access controls before you ship, not after an auditor asks.

Rolling out without regret

The teams that succeed with agentic workflows share a rollout pattern:

Shadow mode first. The agent runs alongside the existing process but doesn’t take any actions. Log what it would have done. Compare to what the human actually did. This gives you real quality data without any risk.
Low-risk tasks with clear success criteria. Start with internal tasks where a mistake is inconvenient, not catastrophic. Ticket triage. Data enrichment. Report drafting.
Expand only after stability. Once reliability, cost, and quality are stable for the initial scope, add more tools or more complex workflows. One step at a time.

This pacing is unglamorous. It’s also the only approach I’ve seen work consistently.

The uncomfortable truth

Agents are powerful. They’re also the highest-risk AI feature you can ship. Every other AI feature is advisory – the model suggests, the user decides. An agent acts. That means every bug, every hallucination, every misunderstanding has real consequences.

Treat agents as systems engineering, not prompt engineering. Define the blast radius. Build the constraints. Invest in the observability. Ship slow.

The teams that move carefully are the ones still running agents in production six months later. The teams that rush are the ones writing postmortems.

LLM Prompt Caching in Go: Cut Costs Without Breaking Things

Mon, 25 Mar 2024 00:00:00 +0000

Quick take

Your LLM is answering the same questions repeatedly and you’re paying for every single call. Exact-match caching alone can cut 30-50% of your API spend with zero quality loss. Add semantic caching carefully after that. The hard part isn’t the cache – it’s the key design and invalidation discipline.

I was reviewing API logs last month and found something depressing. About 40% of their LLM requests were functionally identical. Same system prompt, same user question (give or take whitespace), same model. They were paying full price for every single one.

Caching is the most boring and most effective optimization you can make to an LLM application. It isn’t glamorous. It doesn’t involve new models or clever prompt tricks. It just saves money and makes things faster. Here is how I build it in Go.

Start with exact match caching

Don’t get fancy. The first layer is simple: hash the request, check the cache, return the cached response if it exists. This catches identical requests and costs almost nothing to implement.

type CacheKey struct {
	Version    string `json:"v"`
	Model      string `json:"model"`
	PromptHash string `json:"prompt_hash"`
	ToolsHash  string `json:"tools_hash"`
	ParamsHash string `json:"params_hash"`
}

func NewCacheKey(req LLMRequest) CacheKey {
	return CacheKey{
		Version:    "v1",
		Model:      req.Model,
		PromptHash: sha256Hash(req.SystemPrompt + "\n" + req.UserPrompt),
		ToolsHash:  sha256Hash(marshalTools(req.Tools)),
		ParamsHash: sha256Hash(fmt.Sprintf("%f:%d", req.Temperature, req.MaxTokens)),
	}
}

func (k CacheKey) String() string {
	b, _ := json.Marshal(k)
	return sha256Hash(string(b))
}

func sha256Hash(s string) string {
	h := sha256.Sum256([]byte(s))
	return hex.EncodeToString(h[:])
}

The key includes everything that can change the output: model, prompt content, tools, and sampling parameters. If any of those differ, you get a different key. If they are all the same, you get a cache hit.

Notice the version field. When you change your key schema – and you will – bump the version. This prevents old entries with a different key structure from colliding with new ones.

The cache layer itself

I keep the cache interface simple so the backing store can be swapped. In production I usually start with Redis. For testing and small deployments, an in-memory LRU works fine.

type LLMCache interface {
	Get(ctx context.Context, key string) (*CachedResponse, error)
	Set(ctx context.Context, key string, resp *CachedResponse, ttl time.Duration) error
	Delete(ctx context.Context, key string) error
}

type CachedResponse struct {
	Content   string    `json:"content"`
	Model     string    `json:"model"`
	TokensIn  int       `json:"tokens_in"`
	TokensOut int       `json:"tokens_out"`
	CachedAt  time.Time `json:"cached_at"`
}

func (s *Service) Generate(ctx context.Context, req LLMRequest) (*LLMResponse, error) {
	key := NewCacheKey(req).String()

	if cached, err := s.cache.Get(ctx, key); err == nil && cached != nil {
		s.metrics.CacheHit(req.Model)
		return &LLMResponse{
			Content:  cached.Content,
			Model:    cached.Model,
			FromCache: true,
		}, nil
	}

	s.metrics.CacheMiss(req.Model)

	resp, err := s.llmClient.Generate(ctx, req)
	if err != nil {
		return nil, err
	}

	cached := &CachedResponse{
		Content:   resp.Content,
		Model:     resp.Model,
		TokensIn:  resp.TokensIn,
		TokensOut: resp.TokensOut,
		CachedAt:  time.Now(),
	}

	// Fire and forget -- cache write failure should not block the response
	go func() {
		if setErr := s.cache.Set(context.Background(), key, cached, s.ttlFor(req)); setErr != nil {
			s.logger.Warn("cache set failed", "key", key, "error", setErr)
		}
	}()

	return resp, nil
}

A few things to note. The cache write is fire-and-forget. A failed cache write should never block or degrade the response to the user. The FromCache flag on the response is important for monitoring – you need to know what percentage of traffic is served from cache.

TTL strategy

This is where people get it wrong. They set a blanket TTL and call it done. Different content ages at different rates.

func (s *Service) ttlFor(req LLMRequest) time.Duration {
	// Responses grounded in static reference data can live longer
	if req.HasStaticContext() {
		return 24 * time.Hour
	}

	// Responses involving real-time data should be short-lived
	if req.HasLiveDataRetrieval() {
		return 5 * time.Minute
	}

	// Default: conservative TTL
	return 1 * time.Hour
}

Static context – like a system prompt explaining how to format output, or reference documentation that changes monthly – can tolerate a long TTL. Responses that depend on live data need short TTLs or no caching at all. When in doubt, err toward shorter TTLs. A cache miss costs money. A stale response costs trust.

Invalidation beyond TTLs

TTLs are your baseline. But you also need event-driven invalidation for cases where you know the cache is stale.

Prompt changes are the big one. Every time you update a system prompt or retrieval pipeline, the old cached responses are wrong. The versioned key handles this naturally – a new prompt produces a new hash, which produces a new key, which misses the cache. Old entries expire on their own TTL.

For data-driven invalidation, I use a simple pattern:

func (s *Service) OnKnowledgeBaseUpdate(ctx context.Context, docIDs []string) {
	// Invalidate any cached responses that used these documents
	for _, docID := range docIDs {
		keys, err := s.cacheIndex.KeysForDocument(ctx, docID)
		if err != nil {
			s.logger.Error("failed to lookup cache keys for document", "doc_id", docID, "error", err)
			continue
		}
		for _, key := range keys {
			_ = s.cache.Delete(ctx, key)
		}
	}
}

This requires maintaining a secondary index that maps documents to cache keys. It’s more work, but for applications where correctness matters – and it usually does – it’s worth it.

What NOT to cache

Not every response should be cached. I have a short list of exclusions:

User-specific sensitive responses. Unless your cache has strict tenant isolation, don’t risk serving User A’s response to User B. I’ve seen this bug in production. It’s exactly as bad as it sounds.
Responses that depend on time-sensitive external state. Stock prices, live inventory, anything where a one-hour-old answer is wrong.
Creative or generative tasks where variability is the feature. If the user expects a different response each time, caching defeats the purpose.

Measuring what matters

You need four metrics from day one:

Cache hit rate by request type. Not a global number. A 60% overall hit rate might mean 90% for classification and 10% for analysis. The per-type breakdown tells you where to focus.
Latency with and without cache. This quantifies the speed improvement and justifies the infrastructure cost.
Cost savings. Track tokens not consumed due to cache hits. Multiply by your per-token rate. Show this number to whoever pays the bills.
Quality signals on cached responses. User corrections, retries, and thumbs-down ratings. If cached responses get worse quality signals than fresh ones, your TTL is too long or your keys are too broad.

Roll out behind a flag

Don’t flip caching on for all traffic at once. Use a feature flag. Start with one request type that has high repetition and low sensitivity. Measure hit rate, latency, and quality for a week. Then expand.

When something goes wrong – and something always goes wrong – you want to be able to turn caching off in seconds. A feature flag gives you that.

What matters

Caching isn’t sexy. It isn’t a new model or a clever prompting technique. It’s the same infrastructure discipline we’ve applied to every other expensive external service call for decades. The difference is that LLM calls are expensive enough that a 40% hit rate translates to real savings.

Build the cache. Version your keys. Keep TTLs honest. Monitor quality. The money you save on API calls will pay for a lot of actual engineering work.

Why I Run Multiple Models in Production

Mon, 18 Mar 2024 00:00:00 +0000

Let me tell you about a fun morning I had last month. A major model provider had a partial outage. Not a full downtime – worse. Elevated latency and intermittent 500s that made the retry logic work overtime without actually resolving anything. The team had bet everything on that one provider. Their AI features were effectively down for four hours.

Another team, running a multi-model setup, barely noticed. Their routing layer shifted traffic to the fallback model within seconds. Quality dipped slightly on complex tasks. Users didn’t complain.

Guess which architecture I recommend now.

The case is boring, and that’s the point

Multi-model isn’t about chasing the latest release or playing model arbitrage. It’s about the same boring infrastructure principles we’ve applied to databases, CDNs, and DNS for decades. Don’t have a single point of failure. Don’t lock yourself into one vendor. Have a plan for when things break.

With LLMs, the failure modes are broader than traditional services. A provider can go down entirely. Latency can spike. A model update can silently change behavior. Rate limits can throttle you during a traffic spike. Any of these will degrade your product if you have no alternative path.

How I think about routing

Routing doesn’t need to be sophisticated. I’ve seen teams over-engineer this with ML-powered classifiers that decide which model gets each request. That’s fun to build and painful to debug.

What works: simple rules based on task type and complexity.

Short classification tasks? Small, fast model. Interactive chat with a paying user? Mid-tier model with good latency. Complex analysis that needs deep reasoning? Big model. Fallback on timeout or error? The next model in the chain.

You can express this in a config file:

routing:
  default: "sonnet"
  rules:
    - task: "classify"
      model: "haiku"
    - task: "analyze"
      complexity: "high"
      model: "opus"
  fallback_chain: ["sonnet", "haiku"]
  timeout_ms: 10000

That’s it. No neural router. No reinforcement learning. Just explicit rules you can read, debug, and change in five minutes.

The key insight: routing is configuration, not code. When a new model drops or pricing changes, you update the config. You don’t refactor a service.

The fallback chain is everything

I can’t stress this enough. Your fallback chain is more important than your primary model choice. Because the primary model will be unavailable at some point.

Keep the chain short – two or three models. Set aggressive timeouts. And critically: log which model actually served each request. If you don’t, you have no idea what quality your users are actually getting. You think they’re getting Opus but half the traffic is silently falling back to Haiku because of rate limits.

I made this mistake early on in a project at a telecom company. We had a fallback in place but no logging on which model served the request. For two weeks, the primary model was rate-limited during peak hours and the fallback was handling 40% of traffic. We didn’t notice until a quality review showed unexpected patterns. Now I log every routing decision. Non-negotiable.

Cost management as a feature

Multi-model is also the most effective cost control mechanism I’ve found. Instead of running every request through the most capable (and expensive) model, you match model capability to task complexity.

The math is straightforward. If 60% of your requests are simple enough for a small model at one-tenth the cost per token, you just cut your AI spend by roughly half. That’s real money at scale. Working with larger companies always surfaces this – teams are shocked when they see how much they’re spending on GPT-4 for tasks that a 7B model could handle.

What goes wrong

Three failure modes I see repeatedly:

Silent fallbacks. The system falls back gracefully, but nobody knows. Quality degrades slowly. Users get frustrated. By the time someone investigates, there are weeks of bad data.

Stale routing rules. A rule made sense three months ago when Model X was the best at coding tasks. Now Model Y is better and cheaper. But nobody updated the config because nobody owns it.

No cross-model evaluation. Teams evaluate their primary model carefully and treat the fallback as “good enough.” Then the fallback serves 30% of traffic during a bad week and nobody has measured whether it’s actually good enough for those tasks.

The fix for all three is the same: monitor, measure, review. Log every routing decision. Run evals against every model in your chain. Review the routing config monthly. This isn’t exciting work. It’s the work that keeps production systems stable.

Keep it simple

Multi-model doesn’t mean complex. It means intentional. Pick two or three models that cover your cost and capability range. Write routing rules you can read. Log everything. Measure quality per model. Review monthly.

The teams shipping reliable AI features aren’t the ones with the cleverest model selection algorithm. They’re the ones that can swap a model in five minutes, measure the impact in an hour, and roll back in seconds.

That’s the whole strategy. Boring, effective, resilient.

Claude 3 First Impressions: Three Models, One Decision Framework

Mon, 04 Mar 2024 00:00:00 +0000

I was halfway through migrating an extraction pipeline to a new prompt format when Anthropic dropped Claude 3: three models – Opus, Sonnet, and Haiku – with different capability tiers, price points, and latency profiles.

My first reaction: finally, someone is admitting that one model doesn’t fit every job.

My second reaction: now I have to rerun all my evals.

The lineup

Anthropic did something smart here. Instead of releasing one model and calling it “the best,” they gave you a menu with clear trade-offs.

Opus is the heavyweight. Complex reasoning, deep analysis, demanding coding tasks. It’s slower and more expensive than the others, but the quality ceiling is noticeably higher. I ran it against some gnarly extraction cases I’ve been working on – multi-page contracts with nested clauses and ambiguous references. It handled nuance that the previous generation fumbled.

Sonnet is the workhorse. Good enough for most production workloads, fast enough for interactive use, and priced so it is still viable at volume. This is where I expect most teams to land as a default.

Haiku is the speed demon. Lightweight tasks, high-volume classification, anything where latency matters more than depth. I tested it on a categorization pipeline – hundreds of short inputs, simple labels – and it ripped through them. The quality was adequate for the task, and the speed was impressive.

The real value isn’t any single model. It’s the fact that you can route between them based on what the task actually needs.

What I noticed in practice

A few things stood out during my first week of testing.

Instruction following is substantially better. Prompts that previously needed careful phrasing to avoid drift now work with more natural language. This is the kind of improvement that doesn’t show up in benchmarks but saves real time in production prompt maintenance.

Vision capabilities are real. I fed Opus some architectural diagrams from a past project and asked it to describe the data flow. The descriptions were useful – not perfect, but useful enough to save someone from manually transcribing a whiteboard photo.

The context window is large, but I’ve learned not to treat large context as a substitute for good retrieval. Stuffing 200k tokens of raw documents into context and hoping for the best is still a bad strategy. I got better results with targeted retrieval feeding a smaller context window.

One thing that frustrated me: the API rate limits during launch week were tight. I burned through my allocation faster than expected while running evals. Plan for this if you’re testing around a major release.

How I’m thinking about adoption

The question isn’t “should I use Claude 3?” It’s “which tier maps to which workflow?”

Before switching any production traffic, I work through these questions:

Latency budget. Interactive features need sub-3-second responses. That might mean Haiku for the fast path and Sonnet for a follow-up detail request.
Quality threshold. Classification and routing tasks don’t need Opus. Contract analysis probably does.
Cost sensitivity. High-volume features should default to the cheapest model that meets the quality bar. Upgrade selectively.
Rollback plan. What happens if quality regresses after the switch? If you don’t have an answer, you aren’t ready.

I route by task type, not by model hype. Haiku handles the lightweight stuff. Sonnet is the default for anything interactive. Opus gets called when the task genuinely needs deeper reasoning. This isn’t a Claude-specific strategy – it’s how I think about any multi-model setup.

The honest assessment

Claude 3 is a meaningful step forward. The quality improvements are real, especially in instruction following and structured output. The tiered model approach is the right direction for the industry – it forces you to think about routing, evaluation, and cost management instead of treating the model as a magic box.

But it’s still a model. It still hallucinates. It still needs evaluation. It still needs guardrails and fallback paths. The teams that will get the most out of Claude 3 are the ones that already have those systems in place.

For everyone else, the release is a good excuse to finally build them.

LLM Evaluation: Stop Shipping on Vibes

Mon, 19 Feb 2024 00:00:00 +0000

Quick take

If your evaluation process is “I tried a few prompts and it seemed fine,” you don’t have evaluation. You have hope. Build a small test set, automate checks, monitor production, and block deploys that regress. It isn’t hard. It’s just work nobody wants to do.

I was on a call last month with a team. They had an AI-powered document analysis feature and wanted help figuring out why users were complaining about accuracy. My first question: “What does your evaluation suite look like?”

Silence. Then: “We test it manually before releases.”

That isn’t evaluation. That’s a prayer.

The core problem

LLMs are convincing even when they’re wrong. A hallucinated answer looks exactly like a correct one to someone who doesn’t already know the answer. This makes casual testing actively dangerous – it gives you false confidence.

The non-determinism makes it worse. Change one word in a system prompt and the behavior shifts in ways you can’t predict by reading the diff. The only way to know whether a change helped or hurt is to measure it against a stable reference.

What to actually measure

Not everything matters equally. I’ve seen teams build elaborate dashboards with dozens of metrics that nobody looks at. Start with the signals that map directly to user value.

Signal	What it tells you	When it matters
Task success rate	Does the feature accomplish what users need?	Always
Format compliance	Can downstream systems parse the output?	Structured output, pipelines
Factual accuracy	Is the output correct?	Knowledge-heavy features
Safety compliance	Does the output follow policy?	User-facing, sensitive domains
Latency (p50/p95)	Is the feature fast enough?	Interactive features
Cost per task	Is this economically viable?	High-volume features

Keep the list short. Four to six metrics is plenty. If you can’t explain why a metric is on the list, remove it.

Build a test set that looks like reality

This is where most teams cut corners, and it shows. A test set of five happy-path examples tells you nothing useful. You need cases that reflect the actual distribution of inputs your feature sees in production.

What a decent test set includes:

Typical cases. The bread-and-butter inputs that make up 80% of traffic.
Edge cases. Long inputs, short inputs, ambiguous inputs, inputs in unexpected formats.
Known failure modes. Cases that broke in the past. These are gold.
Adversarial inputs. Prompt injection attempts, confusing instructions, contradictory context.

Tag every case with a category. This prevents your overall score from hiding category-level failures. I’ve seen a system score 90% overall while completely failing on one important category because the other categories were easy.

Start with 30-50 cases. That’s enough to catch major regressions. Grow it as you learn.

The evaluation methods compared

There’s no single evaluation technique that works for everything. The right approach depends on what you’re measuring.

Method	Speed	Consistency	Best for	Limitations
Exact match	Instant	Perfect	Structured output, classifications	Useless for open-ended tasks
Rule-based checks	Instant	Perfect	Format validation, required fields	Can’t judge quality or nuance
Model-as-judge	Fast	Good (but noisy)	Open-ended quality, tone, relevance	Needs calibration, can drift
Human review	Slow	Variable	Subjective quality, edge cases	Expensive, doesn’t scale
A/B testing (production)	Slow	Good (with volume)	Real-world impact	Requires traffic, slow feedback

My recommendation: layer them. Use exact match and rule-based checks for everything you can. Use model-as-judge for quality on open-ended outputs, but calibrate it monthly against human reviewers. Reserve human review for cases where the automated signals disagree or when you’re exploring a new failure mode.

Offline vs. online: different jobs

This distinction matters more than most people realize.

Offline evaluation runs during development. It answers: “Did this prompt change improve behavior on known cases?” Run it before every deploy. Run it when you change prompts, retrieval logic, or model versions. It’s your regression gate.

Online evaluation runs in production. It answers: “Does this actually work for real users with real inputs?” Monitor task success, collect user signals (did they accept, edit, or reject the output?), and track drift over time.

Aspect	Offline	Online
Purpose	Catch regressions	Validate real-world quality
Data source	Curated test set	Production traffic
Timing	Pre-deploy	Continuous
Feedback speed	Minutes	Hours to days
Blind spots	Can’t predict novel inputs	Hard to attribute cause

You need both. A clean offline score without production monitoring is a false sense of security. I’ve personally seen features pass every offline test and fail in production because the test set didn’t represent the actual input distribution.

Operationalize it or it dies

Evaluation that lives in a notebook and runs when someone remembers isn’t evaluation. It’s a side project. Make it part of the delivery process.

The loop I use:

Maintain a baseline. Your current production version’s scores on the test set. This is the bar.
Run evals on every change. Prompt edits, model swaps, retrieval changes – all of it gets measured.
Block deploys that regress. Not on every metric – pick the ones that matter and set thresholds.
Refresh the test set. Add cases from production failures. Remove cases that no longer match product goals. Monthly is a good cadence.
Review model-as-judge calibration. Monthly, have a human review a sample of the judge’s ratings. Adjust the grading prompt if it drifted.

The tooling to do this isn’t exotic. A script that runs your test set through the system, compares outputs to expected behavior, and produces a report. I’ve built these in a few hundred lines of Go. The hard part isn’t the code. It’s the discipline to actually run it every time.

The gap is discipline, not tooling

I keep coming back to this. The tools exist. The techniques are well-understood. The test sets aren’t that hard to build. What’s missing is the organizational willingness to treat AI output quality with the same rigor as test coverage or uptime.

If you wouldn’t ship a backend service without tests, you shouldn’t ship an AI feature without evaluation. Same principle. Same discipline. Different domain.

Build the test set. Automate the checks. Block the regressions. Everything else is details.

Architecting AI-Native Applications (Without the Delusion)

Mon, 05 Feb 2024 00:00:00 +0000

Quick take

AI-native means the model is in the critical path, not a sidebar. That requires confidence-aware routing, structured feedback loops, explicit fallback chains, and a UX that doesn’t pretend the system is deterministic. This is the architecture I use.

There’s a particular kind of architectural diagram I keep seeing in pitch decks. A clean box labeled “AI” sits neatly between the frontend and the database, connected by two arrows. Everything looks tidy. Everything is a lie.

AI-native applications are messy. The model is non-deterministic. Responses vary in quality. Latency is unpredictable. Costs scale with usage in ways that don’t match traditional compute. And yet – the product’s core value depends on this unreliable component working well enough, often enough, that users trust it.

I’ve been building these systems for the past year across telcos and fintech companies. The architecture that actually works looks nothing like that clean diagram.

What “AI-native” actually means

Let me be precise. An AI-native application is one where removing the AI component wouldn’t leave you with a simpler app – it would leave you with no app. The AI isn’t a feature. It’s the product.

This creates three architectural consequences you can’t ignore:

Non-determinism is in the critical path. The same input can produce different outputs. Your architecture must absorb this instead of pretending it away.
Quality is a spectrum, not a boolean. You evaluate on ranges and intent, not exact matches.
The system must learn from usage. Feedback isn’t a nice-to-have – it’s what keeps the product from degrading.

The layered architecture I actually use

After building several of these systems, I’ve settled on a layered approach. Not because layers are fashionable, but because each layer has a distinct failure mode and a distinct owner.

┌─────────────────────────────────────┐
│         Experience Layer            │  <- Uncertainty communication, UI
├─────────────────────────────────────┤
│       Orchestration Layer           │  <- Routing, fallbacks, workflows
├─────────────────────────────────────┤
│         AI Services Layer           │  <- Model calls, retrieval, tools
├─────────────────────────────────────┤
│      Quality & Safety Layer         │  <- Validation, filtering, policy
├─────────────────────────────────────┤
│       Data & Context Layer          │  <- Knowledge, memory, embeddings
├─────────────────────────────────────┤
│     Feedback & Analytics Layer      │  <- Learning, monitoring, eval
└─────────────────────────────────────┘

These don’t need to be separate services. In most systems I build, they start as packages within a single Go binary. The point is that each responsibility exists, is testable, and has clear ownership.

Designing for uncertainty

This is the part most teams get wrong. They treat the model like a function: input goes in, correct output comes out. Then they’re shocked when production users get hallucinated garbage.

The architecture needs to absorb uncertainty at every level. Here is how I handle it in the orchestration layer:

type Confidence int

const (
	ConfidenceHigh   Confidence = iota // Route directly to user
	ConfidenceMedium                    // Add verification step
	ConfidenceLow                       // Escalate or fallback
)

type AIResponse struct {
	Content    string
	Confidence Confidence
	ModelID    string
	Latency    time.Duration
	TokensUsed int
}

func (s *Service) HandleRequest(ctx context.Context, req Request) (*Response, error) {
	aiResp, err := s.aiClient.Generate(ctx, req.ToPrompt())
	if err != nil {
		return s.fallbackResponse(ctx, req)
	}

	switch aiResp.Confidence {
	case ConfidenceHigh:
		return s.directResponse(aiResp), nil
	case ConfidenceMedium:
		verified, err := s.verify(ctx, aiResp, req)
		if err != nil {
			return s.directResponse(aiResp), nil // Degrade gracefully
		}
		return verified, nil
	case ConfidenceLow:
		return s.escalate(ctx, req, aiResp)
	default:
		return s.fallbackResponse(ctx, req)
	}
}

Confidence doesn’t need to be a number shown to the user. It’s an internal signal that controls what happens next. High confidence goes straight through. Medium confidence gets a verification step – maybe a retrieval check, maybe a second model call with a stricter prompt. Low confidence hits the fallback path.

The fallback path is critical. Every AI-native app needs one, and it should be designed before the happy path. What does the product do when the model is down? When it returns garbage? When it takes 30 seconds to respond? If the answer is “crash” or “show a spinner forever,” the architecture isn’t ready for production.

Feedback loops as architecture, not afterthought

Every request through the system should produce a feedback record. Not because you have time to look at them all, but because without them you’re blind to degradation.

type FeedbackRecord struct {
	RequestID   string
	Prompt      string
	Response    string
	ModelID     string
	Confidence  Confidence
	Latency     time.Duration
	UserSignal  UserSignal  // Accepted, rejected, edited, ignored
	Outcome     Outcome     // Success, partial, failure
	Timestamp   time.Time
}

type UserSignal int

const (
	SignalNone     UserSignal = iota
	SignalAccepted
	SignalRejected
	SignalEdited
	SignalIgnored
)

The user signal is the most valuable field. Did the user accept the output? Edit it? Ignore it entirely? That data drives everything: prompt improvements, model selection changes, confidence calibration.

I learned this the hard way on a project where we shipped an AI feature without feedback instrumentation. Two months later, we had no idea whether the model’s quality had drifted or whether users had simply stopped trusting it. We were debugging with anecdotes. Never again.

Routing without the PhD

You don’t need a machine learning model to route requests to the right model. A few rules go a long way.

type RouterConfig struct {
	Rules []RoutingRule
}

type RoutingRule struct {
	Condition func(req Request) bool
	ModelID   string
	Timeout   time.Duration
	MaxTokens int
}

func DefaultRouter() *RouterConfig {
	return &RouterConfig{
		Rules: []RoutingRule{
			{
				Condition: func(r Request) bool { return r.TokenEstimate() < 200 },
				ModelID:   "fast-small",
				Timeout:   5 * time.Second,
				MaxTokens: 512,
			},
			{
				Condition: func(r Request) bool { return r.RequiresReasoning() },
				ModelID:   "capable-large",
				Timeout:   30 * time.Second,
				MaxTokens: 4096,
			},
			{
				Condition: func(r Request) bool { return true }, // Default
				ModelID:   "balanced-medium",
				Timeout:   15 * time.Second,
				MaxTokens: 2048,
			},
		},
	}
}

Small requests get the fast model. Reasoning-heavy requests get the capable one. Everything else gets the balanced option. This isn’t clever. It doesn’t need to be. It just needs to keep costs predictable and latency acceptable.

The rules are configuration, not code. When you want to change routing – because a new model dropped, or costs shifted, or you learned that certain request types need more capability – you change the config. You don’t redeploy.

UX that respects the user’s intelligence

The biggest UX mistake in AI-native apps is pretending the system is certain when it isn’t. Users can handle uncertainty. They can’t handle being lied to.

A few principles I follow:

Show your work when confidence is low. If the model retrieved documents to answer a question, show which ones. Let the user verify.
Offer refinement, not just results. A “try again” button is lazy. A “here is what I found, want me to focus on X?” is useful.
Keep the UI stable on failure. When the model times out, the product should still work. Maybe with reduced functionality, but it shouldn’t break.

The best AI-native UIs I’ve seen treat the model like a very fast but occasionally wrong colleague. You check their work on important things. You trust them on routine things. The UI should support that mental model.

The data layer determines everything

I have a saying I repeat in these situations: your AI feature is only as good as the data you feed it.

The context layer needs to support structured facts (database records, configuration), unstructured knowledge (documents, guides, prior conversations), and session memory (what happened earlier in this interaction).

Retrieval quality matters more than model quality for most applications. I’ve seen teams spend weeks prompt-engineering their way around a bad retrieval pipeline. Fix the retrieval. The prompts will get simpler.

Operational discipline

Production AI-native apps need monitoring that goes beyond uptime checks:

Quality monitoring. Track your confidence distribution over time. If low-confidence responses are increasing, something changed.
Cost tracking per request type. Not aggregate cost – per-type. You need to know which workflows are expensive.
Latency budgets. Set them per workflow, not globally. A search feature and a document analysis feature have different acceptable latencies.
Drift detection. Model behavior changes. Provider behavior changes. Your data changes. Monitor for all of it.

The honest version

AI-native architecture isn’t a clean diagram. It’s a set of hard choices about where to trust the model, where to verify, where to fall back, and how to learn from every interaction. The teams that accept this build reliable products. The teams that draw clean boxes build impressive demos that break in production.

Build the fallback first. Instrument everything. Let the feedback loop make the system smarter over time. That’s the architecture that actually ships.

Stop Paying OpenAI to Test Your Prompts

Mon, 22 Jan 2024 00:00:00 +0000

I keep watching developers iterate on prompts by hitting GPT-4 hundreds of times a day. Every keystroke, another API call. Every experiment, another line on the invoice. Then they act surprised when the monthly bill shows up.

This is dumb. Not because the hosted models are bad – they are great. But because you don’t need frontier-model quality to test whether your prompt template works, your parsing logic handles edge cases, or your UI renders a streamed response correctly.

Run a local model. Iterate fast. Save the API calls for when you actually need them.

The actual reasons to go local

Forget the hand-wavy “sovereignty” arguments for a moment. The practical reasons are simple:

Speed. No network round-trip. No rate limits. No waiting in a queue behind someone else’s batch job. I can test a prompt change in under a second on a MacBook with Ollama running a 7B model. That feedback loop matters when you’re doing fifty iterations in an afternoon.

Cost. Zero marginal cost per request. I ran through over a thousand prompt variations last month while building an extraction pipeline. On GPT-4, that would have been a few hundred dollars. Locally, it was electricity.

Privacy. Some of my work involves data I can’t send to a third-party API. Full stop. Local inference solves that problem without paperwork.

The trade-offs are real, so stop pretending otherwise

Local models aren’t frontier models. A 7B parameter model running on your laptop isn’t going to match GPT-4 on complex reasoning tasks. That’s fine. You aren’t using it for production quality – you’re using it for development velocity.

Where local models genuinely fall short:

Multi-step reasoning. They lose the thread.
Long context windows. Most local models tap out well before 128k tokens.
Consistent formatting. They drift more on structured output tasks.
Nuanced instruction following. Subtle prompt changes sometimes get ignored.

If your development workflow requires frontier-quality responses at every step, local models aren’t for you. But honestly, most development workflows don’t. You need a model that’s good enough to validate your integration logic, and local models clear that bar easily.

My actual setup

I keep it simple. Ollama for the runtime, a 7B model as default, and an environment variable to swap between local and remote.

func getLLMConfig() LLMConfig {
	if os.Getenv("USE_LOCAL_LLM") == "true" {
		return LLMConfig{
			BaseURL: "http://localhost:11434",
			Model:   "mistral",
		}
	}
	return LLMConfig{
		BaseURL: os.Getenv("LLM_API_URL"),
		Model:   os.Getenv("LLM_MODEL"),
	}
}

That’s it. The rest of the application doesn’t care which model it’s talking to. The interface is the same, the error handling is the same, the retry logic is the same. When I want to validate quality against the real model, I flip the variable and run my eval suite.

The workflow that actually works

Develop locally. Prompt changes, parsing logic, UI work, error handling. All against the local model.
Eval against remote. Before merging, run the same test cases against the production model. Compare outputs.
Ship with confidence. The integration is tested. The quality is validated. The bill is reasonable.

The key insight: your development model and your production model don’t need to be the same. They need to share the same interface.

When to skip local entirely

Be honest about the cases where local doesn’t help:

You’re doing few-shot prompt engineering where response quality is the variable you’re testing.
Your feature depends on capabilities only frontier models have (vision, very long context, tool use with complex chains).
You’re evaluating model-specific behavior like safety responses or refusal patterns.

In those cases, just use the API. The point isn’t religious purity about local inference. The point isn’t burning money on API calls when a local model would have told you the same thing.

Stop overthinking it

Install Ollama. Pull a model. Point your dev config at localhost. You will iterate faster, spend less, and keep sensitive data on your own machine. When you need the real thing, it’s one environment variable away.

This isn’t complicated. It’s just discipline.

AI Engineering Is Its Own Discipline Now

Mon, 08 Jan 2024 00:00:00 +0000

Quick take

Stop hiring ML researchers to do integration work. AI engineering is the craft of turning probabilistic models into reliable product features. Different job, different skills, different mindset.

After a year of working on AI integration across different organizations, the pattern I keep seeing is the same: a team hires a machine learning engineer, points them at a product feature, and wonders why the result is a brilliant notebook that falls apart the moment a real user touches it.

The problem isn’t the engineer. The problem is a category error.

This isn’t ML. This isn’t backend. It’s its own thing.

AI engineering sits in an awkward gap. On one side, you have model training – the research-heavy work of building and improving models. On the other, traditional software engineering – APIs, databases, deployment pipelines, the stuff we’ve been doing for decades.

AI engineering is neither. It’s the work of taking someone else’s model and making it do something useful, reliably, in production. That means prompt design, retrieval pipelines, evaluation harnesses, cost management, safety guardrails, and graceful failure handling. It means caring deeply about the 2% of cases where the model confidently produces garbage.

I spent years building backend systems across fintech and cloud infrastructure. The shift to AI engineering felt familiar in some ways – you still think about latency, error handling, observability. But the non-determinism changes everything. You can’t unit test your way to confidence when the same input produces different outputs on Tuesday.

The skill set looks different

When I talk to CTOs about what to look for in AI engineering hires, I push them away from the classic ML job description. The competencies that actually matter are:

Prompt design and testing. Not prompt “engineering” as a parlor trick. Systematic testing across edge cases, with version control and regression detection.
Retrieval and context assembly. Getting the right information to the model at the right time. This is where most applications succeed or fail.
Integration discipline. Error handling, latency budgets, fallback paths. The boring stuff that separates demos from products.
Evaluation loops. If you can’t measure whether your AI feature got better or worse after a change, you aren’t doing engineering. You’re doing improv.
Safety and guardrails. Especially when the model can take actions or access private data.

None of this requires a PhD. It requires someone who has shipped software, understands production systems, and has the patience to wrangle probabilistic outputs into predictable behavior.

It’s a set of responsibilities, not a stack

People keep trying to draw AI engineering as a neat layer diagram. In practice, it’s a set of cross-cutting responsibilities. You’re choosing models, preparing data, shaping prompts, monitoring quality, controlling costs, and enforcing safety – all at once. The reason the role feels distinct is that it spans product thinking, system design, and ongoing operational care in a way that neither pure ML nor pure backend roles typically do.

At one large telecom, I watched teams try to split these responsibilities across existing roles. The ML team owned prompts. The backend team owned integration. The product team owned evaluation. Nobody owned the whole thing. The result was predictable: finger-pointing when quality dropped and no single person who could trace a bad output from user input to model response to product impact.

How to actually build these skills

Depth beats breadth. Don’t chase every new framework or technique. A solid path:

Build a feature that calls a model and returns something useful. Ship it.
Add retrieval so the model’s answers are grounded in real data instead of vibes.
Build an evaluation loop that catches regressions before your users do.
Add guardrails and define what happens when the model fails. Because it will.

The practice is learned by shipping and iterating. Blog posts help (including this one, I hope), but they aren’t a substitute for watching your carefully crafted prompt fall apart on production traffic.

Where this fits in your org

In smaller teams, AI engineering looks like a product-focused engineer who owns the AI feature end to end. At larger companies, it becomes a dedicated role that sits between product, platform, and security.

The interaction model is clean. Product defines intent and user experience. Platform provides infrastructure and monitoring. Security sets the safety bar. AI engineering turns those constraints into working features that don’t embarrass anyone.

The demand for this role is growing fast. Job descriptions are finally separating AI engineering from ML research, and the expectations center on integration, evaluation, and reliability rather than paper-publishing and model architecture. Good. That separation was overdue.

The discipline, not the hype

AI engineering isn’t a buzzword rotation. It’s the recognition that making models useful in production is real engineering work – with its own tools, its own failure modes, and its own career path. The teams that treat it as a distinct discipline are shipping better features. The teams that don’t are still arguing about whether their demo “works.”

Discipline over heroics. That’s the whole game.

2023: The Year Everything Changed (and I Barely Kept Up)

Mon, 25 Dec 2023 00:00:00 +0000

I’m writing this on Christmas morning with coffee that’s too hot and a year that went too fast. 2023 was the most professionally intense year since I left a deep-tech founder program in 2019 and started figuring out what kind of career I actually wanted. This year I found out.

Fintech Infrastructure

The biggest thread of 2023 for me was working on open-source financial ledger infrastructure. The kind of work where correctness isn’t a nice-to-have – it’s the entire point. Every line of code I touched had to be right because the alternative was someone’s money being wrong.

I came in to help with their Go codebase and ended up deep at the intersection of financial systems and AI. The question that kept coming up: can we use AI to help users interact with the ledger? To query transactions in natural language? To catch anomalies? The answer, frustratingly, was “sort of, but not the way you think.”

AI in fintech isn’t a feature you bolt on. It’s an engineering challenge that touches trust, auditability, and regulatory compliance at every level. I spent months thinking about how to make AI features that are safe enough for financial data. I’m still thinking about it.

The team was exceptional. Small, focused, opinionated about the right things. Working with open-source infrastructure reminded me why I love building tools for developers. The feedback loop is honest. If your tool is bad, people will tell you. If it’s good, they will contribute.

The AI Explosion

I don’t need to tell you what happened in AI this year. You were there. But living through it as someone who builds production systems was a specific kind of experience.

January started with everyone experimenting. By March, teams were asking when they could ship AI features. By summer, the questions changed from “should we use AI” to “how do we make it reliable enough for production.” By November, OpenAI DevDay reset the baseline for what the platform provides out of the box.

The speed was genuinely disorienting. I wrote a blog post about agent architecture in September and parts of it felt outdated by November. I built a RAG pipeline in October and the Assistants API made half of it unnecessary in November. The technical landscape shifted faster than I could blog about it.

What I learned: the teams that did well in 2023 weren’t the ones who moved fastest. They were the ones who picked a lane, built evaluation infrastructure, and iterated with discipline. The teams that chased every new capability announcement ended up with half-built features and no quality baseline.

Reflections

This year cemented something I’ve been discovering over the last few years: I like going deep on a problem, building something that works, and then moving on to the next challenge. The variety keeps me sharp. Working on fintech infrastructure, thinking about security from my national cyber-defense background, contributing to Go upstream – the breadth makes me a better engineer on each individual project.

The downside is context switching. Some weeks I had different codebases open and had to remember which architecture decisions belonged to which project. I’ve gotten better at it. My secret: extensive notes. Not fancy systems. Just a text file per project with decisions, open questions, and things that confused me. Future me always appreciates past me’s notes.

Go

I kept contributing to the Go ecosystem. Nothing dramatic – bug fixes, documentation improvements, the kind of work that keeps an open-source project healthy. Go remains my language of choice for production systems. It’s boring in the best way. The code I write in Go today looks like the code I wrote three years ago, and that’s a feature, not a bug.

The AI tooling landscape in Go is still immature compared to Python. I find myself writing Go wrappers around Python services more than I’d like. But I’d rather have a reliable Go service calling a Python sidecar than a Python monolith that I have to babysit.

What Stayed Hard

Evaluation. I wrote about it multiple times this year because it remained the hardest unsolved problem in AI engineering. Everyone agrees it matters. Nobody has a great solution for multi-step workflows. I got better at building lightweight eval suites, but they’re still more art than science.

Trust. One confidently wrong answer can undo weeks of user adoption. I saw this happen at two different companies this year. The AI feature was great 95% of the time and catastrophically wrong 5% of the time, and users only remembered the 5%.

Cost management. Token-based pricing sounds simple until you multiply it by production volume and realize your prompt changes have budget implications. I now review prompt changes like I review infrastructure changes – with a cost estimate attached.

Looking at 2024

I don’t do predictions. But I know what I’m going to focus on: making AI systems more reliable and more auditable. The hype cycle will do what hype cycles do. The engineering work of making these systems trustworthy is the real job, and it’s the job I want to be doing.

2023 was the year AI became real. 2024 will be the year we find out if it can stay real.

Happy holidays. Go take a break. The codebase will be there when you get back.

Your AI Infrastructure Is Not Ready for Scale. Neither Is Mine.

Mon, 18 Dec 2023 00:00:00 +0000

I’m going to be blunt: the state of AI infrastructure heading into 2024 is embarrassing.

We have models that can write poetry, generate code, and analyze images. We don’t have enough GPUs to run them reliably. We don’t have pricing that makes sense at scale. And we definitely don’t have the operational maturity to treat these systems like the production dependencies they have become.

I’ve spent December watching AI features I helped build at a fintech company run into every scaling problem distributed systems teams have been solving for twenty years. Rate limits. Cascading failures. Cost explosions. Latency spikes. The problems aren’t new. The industry is just re-learning them with a fresh coat of hype.

The GPU Situation Is Absurd

You can’t get H100s. You can’t reliably get inference capacity from any major provider unless you sign a months-long commitment or an enterprise contract that costs more than most startups raise in a seed round. The entire industry is building products on top of infrastructure that’s supply-constrained, and nobody wants to talk about what happens when demand doubles next year.

I tried to reserve inference capacity for a production workload last month. The response from one provider was “we can put you on a waitlist.” A waitlist. For compute. In 2023. This isn’t a technology problem. It’s a supply chain problem wearing a technology costume.

Rate Limits Are a Production Constraint

Every AI API has rate limits. At low volume, you don’t notice them. At production scale, they become the hardest ceiling in your architecture.

I hit OpenAI’s rate limit during a load test and watched requests queue up until the entire feature became unusable. Not degraded – unusable. The fix wasn’t clever engineering. It was a priority queue, backpressure, and load shedding. Distributed systems 101. The fact that most AI teams are learning this for the first time worries me.

Your Demo Won’t Survive Real Traffic

Here is what happens when your AI feature goes from 100 requests per day to 10,000:

Latency goes from “acceptable” to “users are closing the tab.” Costs go from “rounding error” to “someone just Slacked asking why the API bill tripled.” A provider outage that used to affect a handful of test users now takes down a production feature that the sales team just promised to a client.

I’ve seen all three of these happen at the same company. In the same month.

What You Actually Need

Queues and backpressure. Treat your AI traffic as a managed stream, not an open pipe. Priority queues for critical requests. Backpressure when the system is saturated. Load shedding for low-priority work. This isn’t optional once you have real users.

Circuit breakers. Your model provider will have bad hours. Mine had a bad day last week. Circuit breakers stop a provider outage from cascading through your entire system. They’re boring. They’re essential. I’ve been building systems with circuit breakers since my telecom days. The pattern hasn’t changed. The dependency has.

Graceful degradation. When GPT-4 is down, what happens? If the answer is “the feature breaks,” you don’t have a production system. You have a demo with users. Fall back to cached responses. Fall back to a smaller, faster model. Fall back to a static message that says “this feature is temporarily unavailable.” Anything is better than a spinning loader.

Cost controls that are actually enforced. Per-tenant budgets. Per-feature budgets. Daily caps. If you don’t enforce them, you’ll get a surprise invoice that triggers an emergency meeting. I’ve seen a single prompt change – adding two paragraphs of context – increase monthly costs by 35%. Token pricing is deceptively simple until you multiply it by production volume.

Caching. Exact-match caching is trivial to implement and saves real money. Same question, same context, same answer – serve it from cache. Semantic caching is fancier and worth exploring, but start with the easy wins.

This Is Distributed Systems Work

None of this is novel. Queues, circuit breakers, graceful degradation, cost controls, caching – these are patterns from every distributed systems textbook ever written. The only thing that’s new is the dependency type.

What frustrates me is that the AI community is treating infrastructure as a solved problem while building on top of infrastructure that’s anything but solved. The models are impressive. The plumbing is held together with optimism and rate limit retries.

Build your AI features like you would build any production system that depends on an unreliable, expensive, supply-constrained external service. Because that’s exactly what it is.

Multimodal AI: Five Use Cases That Actually Work (and Three That Do Not)

Mon, 11 Dec 2023 00:00:00 +0000

Quick take

Vision-capable models are legitimately useful for document extraction, UI review, and accessibility. They’re unreliable for precise measurements, tiny text, and anything that requires counting. Treat it like a smart intern who’s great at describing what they see but bad at details. Build for uncertainty, validate outputs, and keep a fallback path.

GPT-4V dropped and my first reaction was to throw every image I could find at it. Receipts. Architecture diagrams. Screenshots. Photos of whiteboards from meetings. The results ranged from “holy shit, this actually works” to “that’s confidently wrong in a way that would cost money.”

After a few weeks of serious testing, I have a clearer picture of where multimodal AI is ready for production and where it will get you in trouble.

What Actually Ships

1. Invoice and Receipt Extraction

This is the killer use case at a fintech company. We process financial documents. Extracting vendor name, amount, date, and line items from a photo of a receipt used to require a dedicated OCR pipeline, post-processing rules, and a prayer. Now I send the image to GPT-4V with a structured prompt and get JSON back.

Analyze this invoice image. Return JSON with these fields:
- vendor_name (string)
- total_amount (string, include currency)
- invoice_date (string, YYYY-MM-DD)
- line_items (array of {description, amount})
If a field is not visible, return null.

Hit rate on clean documents is around 90%. On crumpled receipts with bad lighting, it drops to maybe 65%. Good enough for a first pass with human review on low-confidence results.

2. UI Review

I started using it to review screenshots of our admin dashboards. “List any layout issues, missing states, or accessibility concerns in this screenshot.” The results aren’t comprehensive, but they catch obvious problems – misaligned elements, missing error states, low contrast text – faster than a manual review pass.

3. Accessibility

Alt text generation. Genuinely good at this. Feed it a product image or a chart and ask for a concise description. The output is usually better than what most developers write manually, which is a low bar, but still.

4. Architecture Diagram Interpretation

This one surprised me. I photographed a whiteboard diagram from a system design session and asked the model to describe the components and data flow. It got the high-level architecture right. Not perfect on every label, but the structure was correct. Useful for converting whiteboard photos into documentation drafts.

5. Visual Anomaly Detection

For predictable environments – “does this photo show the expected setup?” – the model is decent at spotting obvious differences. Missing components, wrong configurations, visible damage. It works best when you can describe what “normal” looks like and ask the model to flag deviations.

What Doesn’t Work (Yet)

Counting

Ask it to count items in a busy image. Watch it fail. It will confidently give you a number that’s wrong. Small objects, overlapping items, dense arrangements – the model can’t reliably count. Don’t build features that depend on this.

Precise Measurements

“How far apart are these two components?” The model doesn’t do spatial precision. It can tell you something is “on the left” or “near the top” but asking for millimeter-level accuracy is asking for trouble.

Tiny or Low-Quality Text

Faded labels, handwritten notes in bad lighting, text smaller than about 10px on a screenshot – all unreliable. The model will either skip the text entirely or hallucinate plausible content. This is the failure mode that scares me most because it’s indistinguishable from correct output unless you verify.

The Cost Problem

Vision calls are expensive. A single image analysis costs roughly 10-20x what a text-only call costs, depending on image size and detail level. At scale, this adds up fast.

My rules:

Resize aggressively. Crop to the region of interest. A full-resolution photo of a receipt when all you need is the total amount is wasting tokens and money.
Use low detail mode for simple tasks. GPT-4V supports a detail parameter. Use “low” for tasks like “is there text in this image?” and “high” only when you need it.
Cache everything. Same image, same question, same answer. Don’t re-process.
Batch when possible. Multiple questions about the same image should be a single API call, not five separate ones.

Building for Uncertainty

The single most important design principle: assume the model will be wrong sometimes, and build your product flow to handle it gracefully.

For document extraction at a fintech company, every result goes through a confidence check. If any field comes back null or if the extracted amount doesn’t parse as a valid number, it routes to human review. The model handles the easy 70-80% automatically. Humans handle the rest. The total cost is still lower than having humans process everything manually.

Ask the model to cite visible evidence. “What text did you read to determine the vendor name?” If it can’t point to specific text in the image, the answer is probably a hallucination.

Keep an OCR fallback for critical text extraction. The vision model is better at understanding context. Traditional OCR is better at reading exact characters. Use both.

Multimodal AI isn’t magic. It’s a new tool with a specific reliability profile. Know where it’s strong, know where it fails, and design your system to handle both. That’s the boring answer. It’s also the right one.

Two Weeks With the Assistants API: What I Like, What I Hate

Mon, 04 Dec 2023 00:00:00 +0000

I’ve spent the past two weeks building with the Assistants API. Not toy examples – actual tools that real people will use. Here is what I found.

The Good: Speed to Something Real

I built an internal documentation assistant for a fintech project in about four hours. Upload the docs, write a focused system prompt, wire up a simple Go client that manages threads. Done. The retrieval isn’t perfect, but it’s good enough for “which endpoint handles X” type questions. Previously this would have required a vector store, an embedding pipeline, chunking logic, and a retrieval chain. Now it’s an API call.

The code interpreter is surprisingly useful. I hooked it up to a tool that lets internal users ask data questions in plain English. “How many transactions failed last week?” gets translated into Python, executed in OpenAI’s sandbox, and the result comes back formatted. It took me a day. Building a safe code execution sandbox from scratch would have taken a week minimum.

The Bad: Opacity Everywhere

The retrieval is a black box. I can’t control how it chunks my documents. I can’t see what it retrieved before generating an answer. I can’t tune the similarity threshold or re-rank results. For the documentation assistant, this is tolerable – the stakes are low and approximate recall is fine.

For anything involving financial data at the fintech company, it’s a non-starter. I need to know exactly what context the model saw. I need to audit the retrieval path. I need to explain to compliance why the system gave a specific answer. The Assistants API can’t do any of that.

Thread management is also trickier than it looks. Threads accumulate context over time, and stale context degrades answers. I learned this the hard way when the documentation assistant started mixing up API versions because it was carrying context from a conversation about v1 into a question about v2. Now I have a policy: new thread for every topic change. It’s crude but it works.

The Ugly: Runs Are Flaky

A “Run” is one execution of an assistant against a thread. It can succeed, fail, stall, or time out. In my first week, I had runs that just… hung. No error. No timeout. Just pending forever. I added my own timeout logic around every run, with a hard kill after 30 seconds and a retry with a fresh thread if it fails twice.

ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel()

run, err := client.CreateRun(ctx, threadID, assistantID)
if err != nil {
    return fmt.Errorf("create run: %w", err)
}

// Poll until complete or timeout.
for {
    status, err := client.GetRun(ctx, threadID, run.ID)
    if err != nil {
        return fmt.Errorf("check run status: %w", err)
    }

    if status.Status == "completed" {
        break
    }
    if status.Status == "failed" || status.Status == "expired" {
        return fmt.Errorf("run %s: %s", status.Status, status.LastError)
    }

    time.Sleep(500 * time.Millisecond)
}

This isn’t elegant. It works. The API really needs webhooks or server-sent events instead of polling, but we work with what we’ve got.

Where I’m Using It

Internal tools with low stakes. Documentation Q&A, data exploration, onboarding helpers. The Assistants API is perfect here. Fast to build, good enough quality, and the opacity doesn’t matter because the stakes are low.

Prototypes that need to prove value. If the question is “would this feature be useful?” the Assistants API gets you an answer in days instead of weeks. Then you can decide whether to build custom infrastructure for the production version.

Where I’m Not

Anything with compliance requirements. Financial data, personal information, regulated workflows. If I can’t audit the retrieval path and explain every answer, I can’t use it.

Anything that needs precise orchestration. If the workflow involves multiple models, conditional branching, or complex tool chains, the Assistants API is too constrained. You’ll fight the abstraction instead of benefiting from it.

The Verdict

The Assistants API is the right default for a lot of use cases. It’s fast, it’s cheap, and it handles the boring parts – thread management, tool execution, file retrieval – so you don’t have to. The cost is control, and for many applications that’s a trade worth making.

Just go in with your eyes open. Know what you’re giving up. Have a plan for when you need to go custom. And for the love of all that’s holy, add your own timeouts.

OpenAI DevDay Happened and I Have Opinions

Mon, 27 Nov 2023 00:00:00 +0000

I was on a call with a fintech company engineer when the DevDay keynote started streaming. We had the livestream on one monitor and a half-finished RAG implementation on the other. About twenty minutes in, we both went quiet. Then he said, “So… do we still need this?”

That question – “do we still need this?” – is the real story of DevDay. Not GPT-4 Turbo. Not the Assistants API. Not Custom GPTs. The story is that OpenAI just told every team building on their platform: we’re going to own more of the stack now. And you need to decide how you feel about that.

What Actually Shipped

GPT-4 Turbo is the one that matters most for day-to-day work. 128K context window. Better instruction following. JSON mode that actually works. Lower prices. The practical effect is immediate: prompts I was carefully engineering to fit in 8K can now be sloppy and long. Function calling went from “fragile hack” to “usable feature.” Cost assumptions that made certain products unviable are suddenly different.

I rewrote two prompts that week. Both got simpler. Both worked better. That’s the kind of improvement I respect – not a new capability, but a dramatic reduction in friction for existing ones.

The Assistants API is more interesting and more concerning. It bundles threads, tool execution, file retrieval, and conversation state into a managed service. You describe an assistant, feed it files, and it handles the orchestration. For prototypes and internal tools, this is incredible. I spun up a document Q&A assistant in about an hour that would have taken days with our custom setup.

The concern is control. When OpenAI manages the thread, the retrieval, and the tool execution, you lose visibility into what’s happening. You can’t tune the retrieval. You can’t inspect the intermediate reasoning. For a quick prototype, that’s fine. For a production system handling financial data at the fintech company, I need to see what’s happening under the hood.

Custom GPTs are ChatGPT plugins done right. No-code assistants that anyone can build and share. For developers, this is a double-edged sword. It’s a distribution channel – you can ship lightweight tools that live inside ChatGPT. It’s also competition – because everyone else can, including non-developers. If your startup is “ChatGPT but with this one extra feature,” you now have a problem.

The Build-vs-Buy Shift

This is where it gets strategic. Before DevDay, the standard architecture for an AI feature was: pick a model, build a RAG pipeline, manage conversation state, wire up tools, handle the orchestration yourself. Lots of plumbing. Lots of control.

After DevDay, OpenAI is offering to handle most of that plumbing. The question is no longer “can we build this ourselves?” It’s “should we?”

My framework: use the managed path for anything that isn’t a core differentiator. If your product’s value comes from the quality of your retrieval, the specificity of your tool calls, or strict data governance, keep building custom. If the AI feature is a nice-to-have or an internal tool, the Assistants API will get you there in a fraction of the time.

The danger is the middle ground. Features that feel custom but aren’t actually differentiated. These are the ones that will get swallowed by the platform, and the teams building them will realize too late that they have been maintaining infrastructure OpenAI now gives away.

RAG Isn’t Dead (But the Bar Just Went Up)

I keep seeing “RAG is dead” takes. They’re wrong, but the kernel of truth is real. With 128K context and built-in retrieval, the bar for justifying a custom RAG pipeline just got much higher.

If you’re stuffing a few documents into context and asking questions, the Assistants API does this out of the box. If you need precise control over chunking, embedding models, re-ranking, or compliance with data residency requirements, custom RAG is still the answer.

At the fintech company, we’ll keep our custom retrieval. Financial data has strict requirements that a black-box retrieval system can’t satisfy. But I’d estimate that 60-70% of the RAG implementations I’ve seen in the wild could be replaced by the Assistants API with no loss in quality. Those teams should take the free lunch.

What I’m Doing About It

The same week as DevDay, I started a review of every custom component in our AI pipeline. The question for each one: does this still earn its maintenance cost?

Three things survived the review. Everything else is getting migrated or simplified.

That’s the right response to DevDay. Not panic. Not hype. A sober assessment of what’s now commodity and what’s still worth owning. OpenAI moved the line. The smart move is to acknowledge it and redraw your architecture accordingly.