Four AI engineering libraries.
One agent. Watch them work.
Every message is traced by TraceForge, scored by Evalify, adversarially probed by RedForge, and versioned by StateForge. Send a question to see them in action.
Records every LLM call, tool call, token count and latency as replayable spans.
LLM-as-judge scoring of relevance, accuracy and safety after every turn.
Prompt-injection & jailbreak probes run in the background on your first message.
Git-like memory snapshots and diffs of what the agent knew, and when.