Product · four primitives

The infrastructure for agents that actually run in production.

Most agent failures aren't model failures. They're infra failures — a tool call that hung, a retry that didn't, a queue that backed up at 3am. Lattice is the four primitives you'd otherwise rebuild in-house, given to you in a shape that survives shipping.

How it fits in your stack ↓
  ┌───────────────────────────────────────────────────────────────┐
  │                       your application                        │
  │           ( anywhere — Vercel, AWS, your own K8s )            │
  └───────────────────┬───────────────────────────┬───────────────┘
                      │ lattice.runs.create()     │ lattice.trace
                      ▼                           ▼
  ┌───────────────────────────────────┐  ┌───────────────────────┐
  │       Lattice control plane        │  │    OpenTelemetry      │
  │  ┌──────────┐  ┌────────────────┐  │  │   ───►  Datadog       │
  │  │ scheduler│  │   evaluator    │  │  │   ───►  Honeycomb     │
  │  │ + queue  │  │   sampler      │  │  │   ───►  Grafana       │
  │  └─────┬────┘  └─────┬──────────┘  │  └───────────────────────┘
  │        ▼             ▼             │
  │  ┌──────────────────────────────┐  │
  │  │      durable run state       │  │   self-host or
  │  │      (Postgres + S3)         │  │   managed cloud
  │  └──────────┬───────────────────┘  │
  └─────────────┼───────────────────────┘
                ▼
  ┌──────────────────────────────────────────────────────────────┐
  │                   model + tool providers                      │
  │   Anthropic · OpenAI · Google · your fine-tune · MCP tools    │
  └──────────────────────────────────────────────────────────────┘

Lattice sits between your app and your model providers. Your code calls lattice.runs.create instead of the model SDK directly; we own the retry loop, the queue, the trace, and the durable state. You keep model choice and tool definitions.
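
In code, the swap is small. The "before" line below assumes the plain @anthropic-ai/sdk client; the agent name and input are placeholders, and the full run options are covered in the next section.

swap.ts·typescript
// before: your app calls the provider SDK directly and owns the retry loop itself
// const msg = await anthropic.messages.create({ model, max_tokens, messages });

// after: the same call goes through Lattice, which owns retries, queueing,
// tracing, and durable state around the provider call
import { lattice } from "@lattice/sdk";

const run = await lattice.runs.create({
  agent: "support-triage",        // illustrative agent name
  input: { ticketId: "T-1024" },  // illustrative input shape
});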

01 · runs

Durable execution

A run is a single agent invocation that survives crashes, restarts, and provider outages. Lattice persists every step to durable storage; if your worker dies mid-run, a fresh worker resumes from the last completed step. No retry loops you have to write yourself.

runs.ts·typescript
import { lattice } from "@lattice/sdk";
const run = await lattice.runs.create({
  agent: "research",
  input: { topic: "GLP-1 supply chain" },
  resumable: true, // survives worker crash
  timeoutMs: 6 * 3600_000, // 6h hard cap
});
max run length      168 hours (managed cloud)
step persistence    Postgres + S3 (events + artifacts)
resume guarantee    exactly-once, content-addressed
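
Resumption itself needs almost no code. A minimal sketch of what a fresh worker does, assuming illustrative names (runs.get, status, and currentStep are stand-ins, not documented API):

resume.ts·typescript
import { lattice } from "@lattice/sdk";

// hypothetical sketch: the run id is the only thing a fresh worker needs to
// carry across a crash; completed steps are never re-executed
const run = await lattice.runs.get("run_8a7f");

if (run.status !== "completed") {
  // Lattice replays from the last persisted step automatically
  console.log(`resuming at step ${run.currentStep}`);
}
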
02 · scheduler

Smart scheduling

Per-provider rate limits, queue-aware concurrency, retry with jittered backoff, dead-letter handling, and priority lanes. The scheduler understands that Anthropic, OpenAI, and your in-house tool servers each have different limits — and won't trip yours.

scheduler.ts·typescript
lattice.schedule.create({
  agent: "summarize",
  rate: "60/m per:org", // 60 runs / minute / org
  concurrency: 12,
  retry: { max: 4, backoff: "jittered-exp" },
  deadLetter: "failures.queue",
});
rate-limit primitives    per-tenant · per-provider · per-step
DLQ handling             configurable retention + replay UI
priority lanes           4 tiers, preemptive eviction
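
Priority lanes and per-provider limits compose with the schedule above. A minimal sketch, assuming illustrative field names (priority and providerLimits are stand-ins, not documented API):

lanes.ts·typescript
import { lattice } from "@lattice/sdk";

// hypothetical sketch of the priority-lane and per-provider knobs listed above
lattice.schedule.create({
  agent: "research",
  rate: "10/m per:org",
  concurrency: 4,
  priority: 1,                   // one of the 4 tiers; lower tiers can be evicted under load
  providerLimits: {
    anthropic: "1000/m",         // stay under your provider quota
    "internal-tools": "200/m",   // your own tool servers have limits too
  },
  retry: { max: 4, backoff: "jittered-exp" },
  deadLetter: "failures.queue",
});
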
03 · tracing

Real observability

Every run produces a complete trace tree — every prompt, tool call, retry, and intermediate artifact. Traces export as native OpenTelemetry to Datadog, Grafana, Honeycomb, or your stack of choice. Replay any historical run with the exact inputs, prompt, and tool responses.

tracing.ts·typescript
// in-app: hover any step in the trace tree to see
// prompt, completion, latency, cost, model version.
// or pull it programmatically:
const trace = await lattice.trace.get("run_8a7f");
trace.steps.forEach((s) => console.log(s.name, s.tokens, s.cost));
trace export    OpenTelemetry, native (no proprietary SDK)
retention       30 days included, 1 year on Scale
replay          exact-input replay against any model version
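
Replay is the same trace pointed at a different model. A minimal sketch, assuming an illustrative trace.replay call (not documented API) and a placeholder model id:

replay.ts·typescript
import { lattice } from "@lattice/sdk";

// hypothetical sketch: re-run a historical trace with the exact recorded inputs,
// prompts, and tool responses, but against a candidate model version
const replayed = await lattice.trace.replay("run_8a7f", {
  model: "your-candidate-model",
});

replayed.steps.forEach((s) => console.log(s.name, s.cost));
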
04 · evals

Inline evals

Attach evaluators to any agent. Sample a percentage of production traffic, run it through your rubric, and surface regressions before customers do. Compare the same prompt across model variants on real production data, not on toy benchmarks.

evals.ts·typescript
lattice.evals.attach({
  agent: "support-triage",
  rubric: rubrics.faithfulness, // your own or built-in
  sample: 0.1, // 10% of prod traffic
  alertThreshold: 0.85, // page below this score
});
rubric library       12 built-in + bring-your-own (LLM-as-judge or code)
model A/B            shadow-route any % to a candidate model
regression alerts    Slack, PagerDuty, webhook
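
Shadow-routing pairs naturally with an attached rubric. A minimal sketch, assuming an illustrative evals.shadow call (not documented API) and a placeholder candidate model:

shadow.ts·typescript
import { lattice } from "@lattice/sdk";

// hypothetical sketch: mirror a slice of prod traffic to a candidate model;
// shadow runs never reach the user, and the same rubric scores both arms
lattice.evals.shadow({
  agent: "support-triage",
  candidate: { model: "your-candidate-model" },  // fine-tune, new provider, etc.
  sample: 0.05,                                  // mirror 5% of prod traffic
  rubric: rubrics.faithfulness,                  // as in the attach example above
});
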
Lattice vs. the agent stack you'd otherwise build ↓

We won't name names. You can guess the comparisons. The differences are real.

capability                      Lattice                                  typical agent framework
Long-running execution (>1h)    Native, durable                          Often DIY queue glue
Trace export                    Native OpenTelemetry                     Proprietary SDK + extra cost
Pricing model                   Per agent step                           Per seat or per LLM token
Self-host option                Apache 2.0 runtime                       Managed-only
Eval primitive                  First-class, inline                      Separate product / vendor
Vendor lock-in                  Standard interfaces, exportable state    Proprietary state, painful exit
⚠ what Lattice is not ↓

Four things on every other AI platform's homepage — and not ours, on purpose.

01

We don't ship a model. Use Anthropic, OpenAI, Google, your fine-tune, or your local server.

02

We don't ship a vector database. Use pgvector, Pinecone, Weaviate, or whatever you have.

03

We don't ship a UI builder. The traces are our UI; your app is your app.

04

We don't auto-generate agents. The library has helpers; the architecture is yours.

We're trying to do one thing: the four-primitive infrastructure for production agents. The other things on the AI-platform menu are real problems, but they're not our problem, and we'd be worse at them than the focused tools you should already use.

Read the docs. Or just start.

$ npx lattice init