openlog The keeper: a pixel page of records with a small flower at its margin

Somebody should remember what
the machines said.

openlog keeps the record of what your models did.

Hosted observability for LLM apps and agents. Every trace, prompt, token, and millisecond in one ledger, with evals that catch a regression before your users do.

Read the docs Start tracing

A run becomes a record you can read.

One support agent answers one question. openlog keeps every span it took to get there: which model, how many tokens, how many milliseconds, what it cost, and the evals that watched the answer.

trace tr_9f3a2c · support-agent · production 2026-06-11T03:14:07Z
span detail tokens time cost
agent.plan claude-sonnet-4-6 412 181 ms $0.002
tool.search_orders postgres 24 ms
tool.refund_policy retrieval · 12 chunks 38 ms
agent.reply claude-sonnet-4-6 1,204 603 ms $0.007
eval.groundedness scored 0.94 · passed 92 ms
eval.tone scored 0.91 · passed 87 ms
wall 846 ms · 1,616 tokens · $0.009 · evals 2/2 passed
One agent run, as openlog records it. The marked span called the model; its prompt and reply are kept in full, just below.

The prompt and the reply are kept word for word.

Open the span that called the model. Nothing is summarized away: the system prompt, what the tools returned, the question, and the answer that went back to the user.

span ag_reply · claude-sonnet-4-6 1,204 tokens · 603 ms
system
You are a support agent for Northwind Apparel. Answer only from the retrieved policy. If the policy does not cover the question, say so and offer to open a ticket.
user
Can I return shoes I've worn once? I ordered them three weeks ago.
tool · refund_policy
retrieved 12 chunks · matched §4 "Worn & final-sale items"
assistant
Returns are accepted within 30 days for unworn items in original packaging (§4). Since these were worn, they don't qualify for a refund. I can open a ticket for a store-credit exception, want me to do that?
The recorded call. Every message in, every message out, attached to the span that produced them.

openlog catches the regression before your users do.

Score every run against your own checks. When a prompt change moves a number, the ledger shows which check fell, by how much, and the answer that broke it.

eval suite · support-agent · nightly 2026-06-10T02:00:00Z

groundedness

0.94 0.71

−0.23 · failed

caused by prompt v12 → v13, deployed 2026-06-09T18:22:00Z

"Can I return shoes I've worn once?"

v12 · grounded

Cited §4, declined the refund, and offered a store-credit ticket.

v13 · not grounded

Invented a "30-day worn-item exception." No such clause exists in the policy.

311 of 312 checks still pass. openlog opened this one.

A nightly suite, the morning after a deploy. The score that dropped, the version that dropped it, and the case you can read for yourself.

The integration is two lines.

Wrap your OpenAI or Anthropic client once. Every call that goes through it becomes a span, with the prompt, the completion, the tokens, and the latency attached.

import anthropicimport openlog client = openlog.trace(anthropic.Anthropic()) def answer(question):    return client.messages.create(        model="claude-sonnet-4-6",        max_tokens=1024,        system=POLICY,        messages=[{"role": "user", "content": question}],    )

A wrong answer you can't replay is a bug you can't fix.

openlog keeps the whole run the moment it happens: the prompt, the tools it called, the tokens it spent, and the eval that should have caught it. When the page goes off, you open the trace and read it, instead of reconstructing the call from logs across six services.