openlog
Somebody should remember what the machines said.
openlog keeps the record of what your models did.
Hosted observability for LLM apps and agents. Every trace, prompt, token, and millisecond in one ledger, with evals that catch a regression before your users do.
A run becomes a record you can read.
One support agent answers one question. openlog keeps every span it took to get there: which model, how many tokens, how many milliseconds, what it cost, and the evals that watched the answer.
| span | detail | tokens | time | cost |
|---|---|---|---|---|
| agent.plan | claude-sonnet-4-6 | 412 | 181 ms | $0.002 |
| tool.search_orders | postgres | | 24 ms | |
| tool.refund_policy | retrieval · 12 chunks | | 38 ms | |
| agent.reply | claude-sonnet-4-6 | 1,204 | 603 ms | $0.007 |
| eval.groundedness | scored 0.94 · passed | | 92 ms | |
| eval.tone | scored 0.91 · passed | | 87 ms | |
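One way to picture the record above is as a list of typed spans that can be totaled. The sketch below is illustrative only: the `Span` class, its field names, and the `totals` helper are assumptions made for this example, not openlog's schema or API. The values are copied from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical span record; field names are assumptions for this sketch."""
    name: str
    detail: str
    time_ms: int
    tokens: Optional[int] = None      # only the model calls in the table report tokens
    cost_usd: Optional[float] = None  # only the model calls in the table report cost


# The six spans from the table above.
RUN = [
    Span("agent.plan", "claude-sonnet-4-6", 181, tokens=412, cost_usd=0.002),
    Span("tool.search_orders", "postgres", 24),
    Span("tool.refund_policy", "retrieval · 12 chunks", 38),
    Span("agent.reply", "claude-sonnet-4-6", 603, tokens=1204, cost_usd=0.007),
    Span("eval.groundedness", "scored 0.94 · passed", 92),
    Span("eval.tone", "scored 0.91 · passed", 87),
]


def totals(spans):
    """Sum tokens, time, and cost across spans, treating missing values as zero.

    Summing time assumes the spans ran one after another, which the table
    does not state.
    """
    return {
        "tokens": sum(s.tokens or 0 for s in spans),
        "time_ms": sum(s.time_ms for s in spans),
        "cost_usd": round(sum(s.cost_usd or 0.0 for s in spans), 3),
    }


print(totals(RUN))
```

Under those assumptions the run adds up to 1,616 tokens, 1,025 ms of span time, and $0.009.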
The prompt and the reply are kept word for word.
Open the span that called the model. Nothing is summarized away: the system prompt, what the tools returned, the question, and the answer that went back to the user.
- system: You are a support agent for Northwind Apparel. Answer only from the retrieved policy. If the policy does not cover the question, say so and offer to open a ticket.
- user: Can I return shoes I've worn once? I ordered them three weeks ago.
- tool · refund_policy: retrieved 12 chunks · matched §4 "Worn & final-sale items"
- assistant: Returns are accepted within 30 days for unworn items in original packaging (§4). Since these were worn, they don't qualify for a refund. I can open a ticket for a store-credit exception, want me to do that?
openlog catches the regression before your users do.
Score every run against your own checks. When a prompt change moves a number, the ledger shows which check fell, by how much, and the answer that broke it.
groundedness: 0.94 → 0.71 (−0.23 · failed)
caused by prompt v12 → v13, deployed 2026-06-09T18:22:00Z
question: "Can I return shoes I've worn once?"
v12: Cited §4, declined the refund, and offered a store-credit ticket.
v13: Invented a "30-day worn-item exception." No such clause exists in the policy.
311 of 312 checks still pass. openlog opened this one.
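The comparison behind a card like this can be sketched as a diff between two sets of eval scores. This is a minimal illustration, not openlog's method: the `regressions` function, the 0.8 pass threshold, and the unchanged tone score for v13 are all assumptions made for the example. Only the groundedness scores come from the card above.

```python
def regressions(before, after, threshold=0.8):
    """Return (check, delta) for checks that passed before and fail after.

    `threshold` is an assumed pass mark; the page does not state one.
    """
    failed = []
    for check, old in before.items():
        new = after.get(check)
        if new is None:
            continue  # check was not run on the newer version
        if old >= threshold and new < threshold:
            failed.append((check, round(new - old, 2)))
    return failed


# Groundedness scores are from the card; the v13 tone score is assumed unchanged.
v12 = {"groundedness": 0.94, "tone": 0.91}
v13 = {"groundedness": 0.71, "tone": 0.91}

print(regressions(v12, v13))
```

With these inputs the only flagged check is groundedness, with a delta of −0.23.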
The integration is two lines.
Wrap your OpenAI or Anthropic client once. Every call that goes through it becomes a span, with the prompt, the completion, the tokens, and the latency attached.
```python
import anthropic
import openlog

client = openlog.trace(anthropic.Anthropic())

def answer(question):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=POLICY,
        messages=[{"role": "user", "content": question}],
    )
```

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { openlog } from "openlog";

const client = openlog.trace(new Anthropic());

export async function answer(question: string) {
  return client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: POLICY,
    messages: [{ role: "user", content: question }],
  });
}
```

A wrong answer you can't replay is a bug you can't fix.
openlog keeps the whole run the moment it happens: the prompt, the tools it called, the tokens it spent, and the eval that should have caught it. When the pager goes off, you open the trace and read it, instead of reconstructing the call from logs across six services.
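For readers wondering how wrapping a client can turn each call into a span, the general pattern is a thin proxy that times the call and stores the request and response. The sketch below shows that pattern only. `TracedClient`, `TracedMessages`, and the `FakeClient` stand-in are hypothetical names for this example and say nothing about how openlog is implemented.

```python
import time


class TracedMessages:
    """Proxy for a client's `messages` object that records one span per call."""

    def __init__(self, inner, spans):
        self._inner = inner
        self._spans = spans

    def create(self, **kwargs):
        start = time.perf_counter()
        response = self._inner.create(**kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000
        # Keep the request and response as they were, plus the measured latency.
        self._spans.append({
            "model": kwargs.get("model"),
            "request": kwargs,
            "response": response,
            "time_ms": elapsed_ms,
        })
        return response


class TracedClient:
    """Wraps any client that exposes `messages.create(**kwargs)`."""

    def __init__(self, inner):
        self.spans = []
        self.messages = TracedMessages(inner.messages, self.spans)


# Stand-in client so the sketch runs without network access or an API key.
class FakeMessages:
    def create(self, **kwargs):
        return {"text": "ok"}


class FakeClient:
    def __init__(self):
        self.messages = FakeMessages()


client = TracedClient(FakeClient())
client.messages.create(model="example-model", messages=[])
print(len(client.spans), client.spans[0]["model"])
```

A production tracer would also need to handle streaming responses, errors, and async clients, all of which this sketch leaves out.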