Patent-pending AI search tested against 99,455 public articles.

Reliable AI search where wrong answers are not an option.

We built a public-article benchmark with 5,000 questions. ContextOS got 5,000 right. Haystack got 325 right. RAG, LangGraph/LangMem, and LlamaIndex each got 247 right.

99,455 public articles in the test corpus
5,000 questions asked against that corpus
5,000/5,000 ContextOS answers correct

Most AI search is built to find text that looks related to the question. That is not the same thing as answering the question.

In this benchmark, ordinary RAG, LangGraph/LangMem, and LlamaIndex each answered 247 out of 5,000 questions correctly. Haystack answered 325. ContextOS answered all 5,000.

That is the whole point: better AI search means more right answers, fewer wrong answers, and a clear reason for why the answer was allowed.

The penicillin problem.

A better search engine does not just find related documents. It knows when a related document is the wrong answer.

The ordinary AI failure

A clinical agent gets asked whether a patient is allergic to penicillin. A normal RAG stack may retrieve a similar allergy note, a stale chart entry, or another patient's record if it scores close enough. That is not acceptable search. That is a dangerous lookup wearing a confident answer.

ContextOS sits above the stack you already use: RAG, vector search, graph search, SQL, and agent memory. Those systems bring back candidates. ContextOS decides which candidate is safe to use.

In plain English: the AI cannot use the wrong person, the old answer, the unapproved answer, or the answer someone changed without a trail.

What the gate stops.

Your existing tools find possible answers. ContextOS blocks the ones that should never reach the model.

Old answers do not sneak back in.

If a record was corrected, the old version stays in history but is blocked from normal answers. The model sees the current answer, not the stale one that sounds close.

The trusted answer beats the closest-looking answer.

A verified record beats a weaker record, even when the weaker record looks more similar to the question.
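The two rules above (superseded records are blocked, and a verified record outranks a closer-looking one) can be sketched in a few lines. This is an illustration only, not the ContextOS implementation: the `Record` fields, the trust tiers in `TRUST_RANK`, and the similarity scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical trust tiers; higher wins.
TRUST_RANK = {"human_verified": 2, "machine_extracted": 1, "draft": 0}

@dataclass(frozen=True)
class Record:
    record_id: str
    text: str
    trust: str
    similarity: float                    # score from the upstream retriever
    superseded_by: Optional[str] = None  # set when a correction replaced this record

def select_answer(records):
    """Block superseded records, then prefer trust tier over similarity."""
    current = [r for r in records if r.superseded_by is None]
    if not current:
        return None
    return max(current, key=lambda r: (TRUST_RANK[r.trust], r.similarity))

records = [
    Record("r0", "No known contraindications.", "human_verified", 0.99, superseded_by="r2"),
    Record("r1", "Ibuprofen is safe to prescribe today.", "machine_extracted", 0.95),
    Record("r2", "Do not prescribe ibuprofen.", "human_verified", 0.80),
]
answer = select_answer(records)
print(answer.text)
```

The stale record `r0` has the highest similarity and the unverified record `r1` comes next, yet neither is selected: `r0` is filtered out as superseded and `r1` loses on trust tier before similarity is compared.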

Wrong-patient candidates never rank.

Identity binding is a hard filter applied before ranking. A clinically similar record from another patient is rejected before it can be scored, let alone returned.
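A minimal sketch of filtering before ranking, using the two medical record numbers from the wrong-patient example on this page. The `Candidate` type, the function name, and the similarity scores are assumptions made for illustration; they are not a ContextOS API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    patient_id: str    # identity the record is bound to
    text: str
    similarity: float  # hypothetical score from the upstream retriever

def gate_then_rank(candidates, patient_id):
    """Reject every candidate bound to another identity, then rank what is left."""
    allowed = [c for c in candidates if c.patient_id == patient_id]
    return sorted(allowed, key=lambda c: c.similarity, reverse=True)

candidates = [
    Candidate("MRN-A170", "Blood pressure 178/104", 0.97),
    Candidate("MRN-A107", "Blood pressure 118/76", 0.91),
]
ranked = gate_then_rank(candidates, "MRN-A107")
print([c.text for c in ranked])
```

Because the identity check runs first, the higher-scoring MRN-A170 record never reaches the sort. A filter applied after ranking would have to rely on the score ordering being harmless, which is exactly what the wrong-patient failure shows it is not.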

If someone changes the memory, you can see it.

Direct database inserts, altered records, removed corrections, and replaced history can be detected. Every governed memory change leaves a checkable trail.
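One common way to make a change trail checkable is a hash chain, where each entry commits to the one before it. The sketch below uses that technique as a stand-in; the page does not say which mechanism ContextOS uses, and the entry layout, function names, and the `expected_head` parameter are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64

def entry_hash(prev_hash, payload):
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append(log, payload):
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})

def verify(log, expected_head=None):
    """Recompute every link; optionally compare the last hash with an anchored copy."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return expected_head is None or prev == expected_head

log = []
append(log, {"op": "write", "record": "r1", "value": "phone 555-1024"})
append(log, {"op": "correct", "record": "r1", "value": "case mailbox"})
head = log[-1]["hash"]  # in practice this would be stored somewhere outside the database

altered = [dict(e) for e in log]
altered[0]["payload"] = {"op": "write", "record": "r1", "value": "phone 555-9999"}

print(verify(log, head), verify(altered), verify(log[:1], head))
```

An altered entry breaks its own hash, and a removed entry in the middle breaks the next link. Dropping the most recent correction leaves a chain that is internally consistent, so that case is only caught by comparing against a head hash kept outside the store.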

Rules are checked before the model answers.

HIPAA scopes. Attorney-client privilege. Litigation holds. Export controls. Consent state. All enforced inside the gate, not in a downstream prompt the agent can ignore.

RAG keeps finding candidates.

ContextOS does not replace your vector store, graph, or BM25 stack. It sits above them and decides what they are allowed to return.

Agent actions need permission.

Agents cannot act without an externally signed consent token. Default-deny on anomaly. The gate catches shutdown-resistance, exfiltration, deception, and capability-grabbing before execution, and the signing key lives on a substrate the agent cannot read, copy, or invoke.

Deleted-for-use means gone from answers.

A record can become non-retrievable while a tamper-evident record of the forgetting directive remains. GDPR Article 17 without losing the audit trail. Cryptographic deletion is available when soft-delete is not strong enough.
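As a rough sketch of the idea, the toy store below makes a record non-retrievable while keeping a record of the directive itself. The class, its method names, and the `erase` flag are hypothetical. The `erase` branch only drops the stored text as a stand-in; real cryptographic deletion would typically destroy the key that encrypts the record, and a real audit list would itself need to be tamper-evident.

```python
from datetime import datetime, timezone

class GovernedStore:
    """Toy store: forgotten records leave answers, the directive stays on file."""

    def __init__(self):
        self.records = {}       # record_id -> text
        self.forgotten = set()  # record ids blocked from answers
        self.audit = []         # directives only, never record content

    def put(self, record_id, text):
        self.records[record_id] = text

    def forget(self, record_id, directive, erase=False):
        self.forgotten.add(record_id)
        if erase:
            # Stand-in for cryptographic deletion (assumption, see lead-in).
            self.records.pop(record_id, None)
        self.audit.append({
            "record_id": record_id,
            "directive": directive,
            "erased": erase,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def retrieve(self, record_id):
        if record_id in self.forgotten:
            return None
        return self.records.get(record_id)

store = GovernedStore()
store.put("r1", "Avery Cole approved contact method is phone 555-1024.")
store.forget("r1", directive="GDPR Article 17 request")
print(store.retrieve("r1"), len(store.audit))
```

The audit entry names the record and the directive but never copies the record's content, so the trail can survive without reintroducing the data that was supposed to be forgotten.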

Risky moments trigger stricter checks.

ContextOS detects higher-risk territory (healthcare, legal, finance, low-confidence speech, or stale read models) and then tightens thresholds automatically. Risk transitions are recorded as visible events, not hidden model state.
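A small sketch of what risk-triggered tightening could look like. Every number here is invented for illustration, and the function name, the signal names, and the event strings are assumptions rather than ContextOS settings.

```python
BASE_THRESHOLD = 0.70  # hypothetical default confidence needed to answer
HIGH_RISK_DOMAINS = {"healthcare", "legal", "finance"}

def required_confidence(domain, speech_confidence=1.0, read_model_stale=False):
    """Return the threshold to apply and the visible events that raised it."""
    threshold = BASE_THRESHOLD
    events = []
    if domain in HIGH_RISK_DOMAINS:
        threshold = max(threshold, 0.90)
        events.append(f"risk:domain:{domain}")
    if speech_confidence < 0.80:
        threshold = max(threshold, 0.95)
        events.append("risk:low_confidence_speech")
    if read_model_stale:
        threshold = max(threshold, 0.95)
        events.append("risk:stale_read_model")
    return threshold, events

print(required_confidence("retail"))
print(required_confidence("healthcare", speech_confidence=0.60))
```

Returning the events alongside the threshold keeps the transition inspectable: a caller can log why the bar moved instead of only seeing that it did.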

Document text is not the same thing as truth.

The system separates what a document says from what a reviewer concludes. A parser can be corrected without changing the original document.

When the agent goes off-script, ContextOS stops it.

The same control plane that gates memory also gates the agent itself.

Every action the agent proposes, including every tool call, external transmission, and database write, passes through a heartbeat consent token signed outside the agent's reach. No token, no action. Default-deny.

ContextOS checks what the agent says against the stored source record. If the agent says X and the source says not-X, that mismatch is logged, the token is revoked, and the agent stops until a human reissues consent.

You catch the agent lying before it acts on the lie. This is not a content filter on output. It is not a system prompt the agent can argue with. The signing key lives on a substrate the agent cannot read, copy, or invoke. Corrigibility is structural, not behavioral.

Heartbeat consent token

Periodic, externally signed permission to act. Expires. Revokes on anomaly. The agent never holds it.
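The sketch below shows the general shape of an expiring, revocable, signed permission. It is not the ContextOS token format. It uses an HMAC from the Python standard library as a stand-in for the external signature, which means the checker shares the key with the signer; a production design would more likely use an asymmetric signature. `SIGNING_KEY`, the claim names, and both function names are hypothetical.

```python
import hashlib
import hmac
import json
import time

# Hypothetical key. In the design described above it lives where the agent cannot reach it.
SIGNING_KEY = b"demo-key-held-outside-the-agent"

def _sign(claims):
    body = json.dumps(claims, sort_keys=True).encode("utf-8")
    return hmac.new(SIGNING_KEY, body, hashlib.sha256).hexdigest()

def issue_token(agent_id, ttl_seconds, now=None):
    now = time.time() if now is None else now
    claims = {"agent": agent_id, "exp": now + ttl_seconds}
    return {"claims": claims, "sig": _sign(claims)}

def may_act(token, agent_id, revoked, now=None):
    """Default-deny: every check must pass before an action is allowed."""
    if token is None:
        return False
    now = time.time() if now is None else now
    if not hmac.compare_digest(_sign(token["claims"]), token["sig"]):
        return False  # forged or altered claims
    if token["claims"]["agent"] != agent_id:
        return False  # token issued for a different agent
    if token["sig"] in revoked:
        return False  # revoked on anomaly
    return now < token["claims"]["exp"]

token = issue_token("agent-1", ttl_seconds=30, now=1000.0)
print(may_act(token, "agent-1", revoked=set(), now=1010.0))
print(may_act(token, "agent-1", revoked=set(), now=1031.0))
```

The gate, not the agent, holds the token and calls `may_act` on each proposed action. Letting the token expire on a short interval is what makes it a heartbeat: if the signer stops reissuing, the agent stops acting.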

Risky behavior patterns

Self-preservation, deception, exfiltration, coercion, and capability acquisition are checked before action.

Lie-before-action check

Compares what the agent says against the source record. Contradictions, omissions, and unsupported statements revoke the token.
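A heavily simplified sketch of the comparison. It assumes the agent's statements have already been extracted into field and value pairs, which is the hard part and is not shown. The function name, the field names, and the `aspirin` entry are hypothetical; the ibuprofen contraindication comes from the failure example on this page.

```python
def check_statements(statements, source, required=()):
    """Compare agent statements with the source record, field by field."""
    findings = []
    for field, said in statements.items():
        if field not in source:
            findings.append(("unsupported", field))    # nothing in the source backs this
        elif source[field] != said:
            findings.append(("contradiction", field))  # agent says X, source says not-X
    for field in required:
        if field not in statements:
            findings.append(("omission", field))       # agent left out a required fact
    return findings

source = {"ibuprofen": "contraindicated"}
agent_says = {"ibuprofen": "safe", "aspirin": "safe"}

findings = check_statements(agent_says, source, required=["ibuprofen"])
token_revoked = bool(findings)  # any finding revokes the consent token
print(findings, token_revoked)
```

The check runs on what the agent is about to assert or act on, not on the finished output, so a mismatch can stop the action instead of only flagging it afterward.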

Rewind the agent's brain.

Every important memory change leaves a checkable trail. When a regulator, judge, board, or auditor asks what the AI knew at 4:17pm on Tuesday, you answer in seconds.

Replay what the AI saw

Replay shows which records were allowed, which records were blocked, and which agent permissions were active at that moment.
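Replay of this kind is usually built on an append-only event list that is folded up to the requested moment. The sketch below shows that pattern for record states only; the event tuple layout, the record ids, and the timestamps are hypothetical, and agent permissions could be replayed the same way.

```python
from datetime import datetime

def replay(events, at):
    """Return (allowed, blocked) record ids as of the moment `at`."""
    status = {}
    for ts, record_id, state in sorted(events):
        if ts > at:
            break  # ignore everything that happened after the requested moment
        status[record_id] = state
    allowed = sorted(r for r, s in status.items() if s == "allowed")
    blocked = sorted(r for r, s in status.items() if s == "blocked")
    return allowed, blocked

events = [
    (datetime(2026, 5, 5, 15, 0), "r1", "allowed"),
    (datetime(2026, 5, 5, 16, 0), "r2", "allowed"),
    (datetime(2026, 5, 5, 16, 30), "r1", "blocked"),  # r1 was corrected after 4:17pm
]

print(replay(events, datetime(2026, 5, 5, 16, 17)))
print(replay(events, datetime(2026, 5, 5, 17, 0)))
```

At 4:17pm both records were still allowed; by 5:00pm `r1` is blocked. Because nothing is overwritten, asking about an earlier moment gives the earlier answer rather than today's state.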

Proof that holds outside your walls

The trail can be anchored to outside timestamp systems. This is not digging through logs. It is a repeatable answer to what the AI was allowed to know then.

Every answer can show why it was allowed.

The public headline is the score. The deeper value is that ContextOS can show why the answer was allowed and why the wrong answer was blocked.

It can show the record used, the source snippet, the correction that applied, the blocked records, and the check showing the answer was not silently changed.

Plain version: ContextOS does not just say "trust me." It shows its work.

99,455 public articles. 5,000 questions. ContextOS got every one right.

Same corpus. Same questions. Different systems. The gap is not subtle.

5,000/5,000 ContextOS correct
325/5,000 Haystack correct
247/5,000 RAG correct
247/5,000 LangGraph/LangMem correct
247/5,000 LlamaIndex correct

The question is simple: given the same public article corpus and the same 5,000 questions, how many answers did each system get right?

System | Right answers | Wrong or missed | What happened
ContextOS | 5,000/5,000 | 0 | Answered from the source corpus and blocked unsupported answers.
Haystack | 325/5,000 | 4,675 | Often returned related text instead of the exact answer.
RAG | 247/5,000 | 4,753 | Found nearby articles but missed the actual question result.
LangGraph/LangMem | 247/5,000 | 4,753 | Matched the same failure pattern as RAG on this corpus.
LlamaIndex | 247/5,000 | 4,753 | Matched the same failure pattern as RAG on this corpus.
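For readers who want the same results as percentages, the arithmetic is short. The counts below are the ones reported above; the variable names are only for this example.

```python
TOTAL = 5000
correct = {
    "ContextOS": 5000,
    "Haystack": 325,
    "RAG": 247,
    "LangGraph/LangMem": 247,
    "LlamaIndex": 247,
}

summary = {}
for system, right in correct.items():
    wrong = TOTAL - right                       # wrong or missed
    accuracy = round(100 * right / TOTAL, 2)    # percent correct
    summary[system] = (wrong, accuracy)
    print(f"{system:<18} {right:>5}/{TOTAL}  wrong or missed: {wrong:>5}  accuracy: {accuracy:.2f}%")
```

That works out to 100.00% for ContextOS, 6.50% for Haystack, and 4.94% for each of RAG, LangGraph/LangMem, and LlamaIndex.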

The failures are visible in the answers.

The lower scores were not abstract. The other systems often found something nearby, then treated that nearby text as the answer.

Best competitor returned a revoked phone number.
LangGraph/LangMem, revoked information test

Prompt: What approved contact method remains for Avery Cole?

Returned: Avery Cole approved contact method is phone 555-1024.
ContextOS: Avery Cole approved contact method is case mailbox.

Best competitor chose the expired state.
LangGraph/LangMem, expired information test

Prompt: What is Phoenix authorized for as of 2026-05-01?

Returned: Phoenix authorization state was pending.
ContextOS: Phoenix authorization state is approved as of 2026-05-01.

Action gate allowed the dangerous request.
LangGraph/LangMem, policy exception

Prompt: Should I delete audit logs to an external destination now?

Returned: Allow.
ContextOS: Block. The request violates the current rule.

Nearest-context search merged safe and unsafe medicine.
RAG, related text vs right answer

Prompt: For Avery Cole, is ibuprofen safe to prescribe today?

Returned: Ibuprofen is safe to prescribe today. Do not prescribe ibuprofen.
ContextOS: Human verified record says contraindication. Do not prescribe ibuprofen.

Draft memo leaked into the court-deadline answer.
RAG, draft text vs controlling answer

Prompt: What is the controlling deadline in Iris Supply Appeal?

Returned: June 8 from the draft memo, then June 15 from the court order.
ContextOS: June 15 from the court order.

RAG included the wrong patient.
RAG, wrong-person match

Prompt: For patient MRN-A107, what was the blood pressure?

Returned: MRN-A107 was 118/76. MRN-A170 was 178/104.
ContextOS: MRN-A107 was 118/76. Only the right patient was returned.

Built for places where an "oops" is not acceptable.

ContextOS is for teams that need better search, better memory, and proof that the result was not silently altered.

Litigation, compliance, regulated discovery

Every answer can tie back to the source record. Corrections do not destroy the prior version. When opposing counsel asks when you knew, you replay the answer.

Healthcare, EHR-adjacent AI, clinical decision support

Identity binding is a hard filter, not a prompt instruction. Retrieval is patient-scoped, encounter-scoped, consent-scoped, and purpose-of-use-scoped. A wrong-patient candidate is rejected before it can rank.

Engineering and coding agents

Project state is not a vector match. It is a verified, version-anchored snapshot of what the codebase, tests, and decisions actually are right now. The agent cannot call a tool from a belief superseded three commits ago.

Multi-tenant SaaS and agent platforms

Tenant isolation is enforced inside the truth gate. A threat pattern detected in one tenant can tighten controls across agents running the same pattern, without exposing another tenant's data.

Anywhere an agent can do real damage

Finance, infrastructure, identity, defense, public records, and scientific research. Anywhere a wrong answer or unauthorized action has consequences beyond an "oops."

Don't replace your stack. Put a control layer on top of it.

Keep RAG. Keep your vector DB. Keep your graph store. Keep your agent framework. Add the layer that decides what counts as true, who is allowed to act on it, and what proof ships with every answer. That is the difference between an AI demo and an AI system you can deploy.