openlog

Somebody should remember what the machines said.

openlog keeps the record of what your models did.

Hosted observability for LLM apps and agents. Every trace, prompt, token, and millisecond in one ledger, with evals that catch a regression before your users do.


A run becomes a record you can read.

One support agent answers one question. openlog keeps every span it took to get there: which model, how many tokens, how many milliseconds, what it cost, and the evals that watched the answer.

trace tr_9f3a2c · support-agent · production · 2026-06-11T03:14:07Z

span	detail	tokens	time	cost
agent.plan	claude-sonnet-4-6	412	181 ms	$0.002
tool.search_orders	postgres		24 ms
tool.refund_policy	retrieval · 12 chunks		38 ms
agent.reply	claude-sonnet-4-6	1,204	603 ms	$0.007
eval.groundedness	scored 0.94 · passed		92 ms
eval.tone	scored 0.91 · passed		87 ms

wall 846 ms · 1,616 tokens · $0.009 · evals 2/2 passed

One agent run, as openlog records it. The agent.reply span called the model; its prompt and reply are kept in full, just below.
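The summary line can be recomputed from the spans. Here is a minimal sketch, assuming a hypothetical `Span` record (not openlog's actual schema) and treating blank token and cost cells as zero. It also assumes eval spans are excluded from wall time, which is what the figures imply: 181 + 24 + 38 + 603 = 846 ms.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Span:
    # Hypothetical record shape, for illustration only.
    name: str
    tokens: int
    ms: int
    cost: Decimal


# Values copied from the trace above; blank cells are assumed to be zero.
SPANS = [
    Span("agent.plan", 412, 181, Decimal("0.002")),
    Span("tool.search_orders", 0, 24, Decimal("0")),
    Span("tool.refund_policy", 0, 38, Decimal("0")),
    Span("agent.reply", 1204, 603, Decimal("0.007")),
    Span("eval.groundedness", 0, 92, Decimal("0")),
    Span("eval.tone", 0, 87, Decimal("0")),
]


def summarize(spans):
    # Assumption: evals run after the reply, so they do not count toward wall time.
    run = [s for s in spans if not s.name.startswith("eval.")]
    return {
        "wall_ms": sum(s.ms for s in run),
        "tokens": sum(s.tokens for s in spans),
        # Decimal avoids float rounding noise when adding small dollar amounts.
        "cost": sum((s.cost for s in spans), Decimal("0")),
    }


summary = summarize(SPANS)
```

Summing the spans this way reproduces the totals in the trace footer: 846 ms, 1,616 tokens, and $0.009.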

The prompt and the reply are kept word for word.

Open the span that called the model. Nothing is summarized away: the system prompt, what the tools returned, the question, and the answer that went back to the user.

span ag_reply · claude-sonnet-4-6 · 1,204 tokens · 603 ms

system: You are a support agent for Northwind Apparel. Answer only from the retrieved policy. If the policy does not cover the question, say so and offer to open a ticket.
user: Can I return shoes I've worn once? I ordered them three weeks ago.
tool · refund_policy: retrieved 12 chunks · matched §4 "Worn & final-sale items"
assistant: Returns are accepted within 30 days for unworn items in original packaging (§4). Since these were worn, they don't qualify for a refund. I can open a ticket for a store-credit exception, want me to do that?

The recorded call. Every message in, every message out, attached to the span that produced them.

openlog catches the regression before your users do.

Score every run against your own checks. When a prompt change moves a number, the ledger shows which check fell, by how much, and the answer that broke it.

eval suite · support-agent · nightly · 2026-06-10T02:00:00Z

groundedness · 0.94 → 0.71 · −0.23 · failed

caused by prompt v12 → v13, deployed 2026-06-09T18:22:00Z

"Can I return shoes I've worn once?"

v12 · grounded

Cited §4, declined the refund, and offered a store-credit ticket.

v13 · not grounded

Invented a "30-day worn-item exception." No such clause exists in the policy.

311 of 312 checks still pass. openlog opened this one.

A nightly suite, the morning after a deploy. The score that dropped, the version that dropped it, and the case you can read for yourself.
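The comparison behind a report like this can be sketched in a few lines. This is an illustration only, not openlog's implementation: `find_regressions` and the 0.05 drop threshold are hypothetical, and the `tone` scores are placeholder values added to show a check that did not move.

```python
def find_regressions(baseline, current, max_drop=0.05):
    """Return checks whose score fell by more than max_drop, largest drop first."""
    regressions = []
    for check, before in baseline.items():
        after = current.get(check)
        if after is None:
            continue  # the check did not run in the current suite
        # Round so float noise (0.71 - 0.94) does not leak into the report.
        delta = round(after - before, 4)
        if delta < -max_drop:
            regressions.append((check, before, after, delta))
    return sorted(regressions, key=lambda r: r[3])


# groundedness scores come from the report above; tone is illustrative.
v12 = {"groundedness": 0.94, "tone": 0.91}
v13 = {"groundedness": 0.71, "tone": 0.91}

regressions = find_regressions(v12, v13)
for check, before, after, delta in regressions:
    print(f"{check}: {before} -> {after} ({delta:+.2f}) failed")
```

A fixed threshold is the simplest possible rule. A real suite would probably also want per-check thresholds and a link from each failing check back to the case that broke it.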

The integration is two lines.

Wrap your OpenAI or Anthropic client once. Every call that goes through it becomes a span, with the prompt, the completion, the tokens, and the latency attached.

import anthropic
import openlog

client = openlog.trace(anthropic.Anthropic())

def answer(question):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=POLICY,
        messages=[{"role": "user", "content": question}],
    )

import Anthropic from "@anthropic-ai/sdk";
import { openlog } from "openlog";

const client = openlog.trace(new Anthropic());

export async function answer(question: string) {
  return client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: POLICY,
    messages: [{ role: "user", content: question }],
  });
}
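For readers curious how wrapping a client can turn each call into a span, here is a rough sketch of the general pattern. It is not openlog's implementation: `TracedClient`, `TracedMessages`, and the span dictionary are hypothetical, and a stand-in client replaces the real SDK so the example runs without an API key. The only thing assumed about the real response is that it exposes `usage.input_tokens` and `usage.output_tokens`, as Anthropic's Messages API does.

```python
import time
from types import SimpleNamespace


class TracedMessages:
    """Wraps a `messages` resource and records one span per create() call."""

    def __init__(self, inner, spans):
        self._inner = inner
        self._spans = spans

    def create(self, **kwargs):
        start = time.perf_counter()
        response = self._inner.create(**kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000
        usage = getattr(response, "usage", None)
        self._spans.append({
            "model": kwargs.get("model"),
            "request": kwargs,      # the prompt, kept in full
            "response": response,   # the completion, kept in full
            "input_tokens": getattr(usage, "input_tokens", None),
            "output_tokens": getattr(usage, "output_tokens", None),
            "ms": elapsed_ms,
        })
        return response


class TracedClient:
    def __init__(self, inner):
        self._inner = inner
        self.spans = []
        self.messages = TracedMessages(inner.messages, self.spans)

    def __getattr__(self, name):
        # Anything not wrapped passes straight through to the real client.
        return getattr(self._inner, name)


def trace(client):
    return TracedClient(client)


# Stand-in client so the sketch runs offline.
class FakeMessages:
    def create(self, **kwargs):
        return SimpleNamespace(
            content="stub reply",
            usage=SimpleNamespace(input_tokens=10, output_tokens=5),
        )


fake = SimpleNamespace(messages=FakeMessages())
client = trace(fake)
reply = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "hi"}],
)
```

Because the wrapper returns the original response unchanged, calling code does not need to know it is being traced. A hosted tracer would presumably ship the span to a backend instead of appending it to a list.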

A wrong answer you can't replay is a bug you can't fix.

openlog keeps the whole run the moment it happens: the prompt, the tools it called, the tokens it spent, and the eval that should have caught it. When the pager goes off, you open the trace and read it, instead of reconstructing the call from logs across six services.