Documents that machines understand.
An AI-native document format with typed nodes, queryable knowledge graphs, and cryptographic signatures — 14x cheaper than VLM extraction, zero hallucination.
A VLM guesses what a number means. AXON declares it.
VLM extraction (~22,000 tokens)
"RIA reaches 100% IID accuracy"
"MHA collapses to 35.8%"
"p = 0.003 at the 10th minute"
"95% CI [0.41, 0.78]"
"605,568 parameters"
Flat strings. The agent must guess which number is an accuracy, which is a parameter count, what 35.8% refers to. At scale, the guesses drift — and citations become hallucinations.
Native AXON (~1,950 tokens)
@data [data-type="metric" method="ria"
metric="iid-accuracy" value="100.0"]:
100.0 ± 0.0
@data [data-type="claim"
evidence="result-iid,result-ood"]:
RIA reaches 100% IID and OOD accuracy
@data [data-type="relationship"
from="entity-ria" to="entity-mha"
relation="outperforms"]:
RIA outperforms MHA on compositional binding
Typed declarations. Every metric, claim, and relationship is machine-queryable. The agent cites rather than guesses — at 14x lower cost.
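To make "machine-queryable" concrete, here is a minimal Python sketch that reads declarations shaped like the three above and filters them by type. It is not the tdoc reference parser: the `parse_declarations` helper and both regular expressions are assumptions inferred only from the examples on this page, not from the AXON spec, and they ignore nesting, escaping, and every other node kind.

```python
import re

# The three declarations shown above, copied verbatim.
SAMPLE = '''@data [data-type="metric" method="ria"
metric="iid-accuracy" value="100.0"]:
100.0 ± 0.0
@data [data-type="claim"
evidence="result-iid,result-ood"]:
RIA reaches 100% IID and OOD accuracy
@data [data-type="relationship"
from="entity-ria" to="entity-mha"
relation="outperforms"]:
RIA outperforms MHA on compositional binding
'''

# Assumed shape: '@data [key="value" ...]:' followed by body text that runs
# until the next line starting with '@data' or the end of the input.
DECL = re.compile(
    r'@data\s*\[(?P<attrs>[^\]]*)\]:\s*(?P<body>.*?)(?=^@data|\Z)',
    re.S | re.M,
)
ATTR = re.compile(r'([\w-]+)="([^"]*)"')


def parse_declarations(text):
    """Return one dict per @data declaration: its attributes plus a 'body' key."""
    nodes = []
    for match in DECL.finditer(text):
        node = dict(ATTR.findall(match.group("attrs")))
        node["body"] = match.group("body").strip()
        nodes.append(node)
    return nodes


nodes = parse_declarations(SAMPLE)

# Select by declared type instead of guessing what a number means.
for node in nodes:
    if node["data-type"] == "metric":
        print(node["method"], node["metric"], float(node["value"]))
```

The point of the sketch is the last loop: once the type is an attribute, picking out every metric is a dictionary lookup rather than an inference.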
Why AI deserves better documents.
We have spent a decade teaching language models to read documents designed for human eyes. The entire pipeline — OCR, VLM, extraction, validation — exists because our files never declared what they contained. It is time to build documents that speak machine natively.
Consider what happens when an AI agent reads a research paper today.
A vision-language model scans the PDF, pixel by pixel and page by page, and produces ~22,000 tokens of flat text. The number 0.003 appears somewhere in the output. Is it a p-value? A coefficient? A page footnote? The model guesses. Usually it guesses right. But in regulated domains (clinical trials, financial disclosure, legal evidence) a confident wrong guess is not a failure mode; it is a lawsuit.
tdoc proposes the inverse. Instead of extracting structure from pixels, you author it: every entity is typed, every metric carries its units, every claim points to its evidence, every relationship between concepts is declared. The document arrives at the AI as a three-layer knowledge representation — structured markup, token-minimized compact form, and a queryable knowledge graph — with a self-describing preamble that tells any model what the file is, what it contains, and what queries to run.
The cost difference is not marginal. A single research paper through VLM extraction: ~22,000 tokens. The same paper as native AXON: ~1,950 tokens. That is 14x cheaper, with zero hallucination, because the types were never inferred — they were declared by the author. At 100,000 documents, the difference is hundreds of thousands of dollars.
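The arithmetic behind that comparison can be sketched in a few lines of Python. Only the two token counts come from this page. The function names and the `usd_per_million_tokens` parameter are hypothetical, and no price is assumed: you supply your own provider's rate. Note that the quoted token counts on their own give a ratio of about 11.3, so the 14x figure and the dollar total both depend on pricing details that this sketch does not model.

```python
VLM_TOKENS_PER_DOC = 22_000   # approximate figure quoted above
AXON_TOKENS_PER_DOC = 1_950   # approximate figure quoted above


def token_ratio(vlm=VLM_TOKENS_PER_DOC, axon=AXON_TOKENS_PER_DOC):
    """How many times more tokens the VLM path consumes per document."""
    return vlm / axon


def tokens_saved(num_docs, vlm=VLM_TOKENS_PER_DOC, axon=AXON_TOKENS_PER_DOC):
    """Total tokens avoided across a corpus of num_docs documents."""
    return num_docs * (vlm - axon)


def dollars_saved(num_docs, usd_per_million_tokens):
    """Savings under a flat per-token price (a simplifying assumption:
    real pricing differs by provider, model, and input modality)."""
    return tokens_saved(num_docs) / 1_000_000 * usd_per_million_tokens


print(round(token_ratio(), 1))   # 11.3
print(tokens_saved(100_000))     # 2005000000
```

Calling `dollars_saved(100_000, rate)` with your own rate turns the roughly two billion saved tokens into a budget figure for your workload.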
The spec is open. The reference implementation is a single file of Python. The format is self-describing, deterministic to the byte, cryptographically signable, and queryable via AQL — a structured query language built into the spec. We are not building a better PDF parser. We are building the document format that machines deserve.
— Rishi Arun Shivhare, author of AXON 1.0.
Three layers of knowledge.
AXON is not markup. It is a knowledge representation. Layer 1 is structured .axc with typed nodes. Layer 2 is a token-minimized compact form. Layer 3 is a queryable knowledge graph with entities, relationships, claims, and metrics, each with provenance pointers. One document, three views, all deterministic and signable.
@section [id="knowledge" type="knowledge-graph"]:
@data [id="entity-ria" data-type="entity" class="method"]:
Role-Indexed Attention (RIA)
@data [id="result-iid" data-type="metric"
method="ria" metric="iid-accuracy"
value="100.0" unit="%"]:
100.0 ± 0.0
@data [id="claim-1" data-type="claim"
evidence="result-iid,result-ood"
status="supported"]:
RIA reaches 100% IID and OOD accuracy
@data [id="rel-1" data-type="relationship"
from="entity-ria" to="entity-mha"
relation="outperforms"]:
RIA outperforms MHA on compositional binding
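As an illustration of what "provenance pointers" buy you, the sketch below loads the four nodes from the listing above into a plain Python dictionary and follows a claim's `evidence` pointers. This is ordinary Python, not AQL (the query syntax is not shown on this page), and the `evidence_for` and `relations_from` helpers are hypothetical names. The listing references `result-ood` and `entity-mha` without defining them, so the sketch reports such ids as unresolved rather than inventing them.

```python
# Nodes copied from the listing above, keyed by their id attribute.
NODES = {
    "entity-ria": {"data-type": "entity", "class": "method",
                   "body": "Role-Indexed Attention (RIA)"},
    "result-iid": {"data-type": "metric", "method": "ria",
                   "metric": "iid-accuracy", "value": "100.0", "unit": "%",
                   "body": "100.0 ± 0.0"},
    "claim-1": {"data-type": "claim", "evidence": "result-iid,result-ood",
                "status": "supported",
                "body": "RIA reaches 100% IID and OOD accuracy"},
    "rel-1": {"data-type": "relationship", "from": "entity-ria",
              "to": "entity-mha", "relation": "outperforms",
              "body": "RIA outperforms MHA on compositional binding"},
}


def evidence_for(claim_id, nodes=NODES):
    """Split a claim's evidence pointers into resolved nodes and dangling ids."""
    ids = [part.strip() for part in nodes[claim_id]["evidence"].split(",")]
    resolved = {i: nodes[i] for i in ids if i in nodes}
    missing = [i for i in ids if i not in nodes]
    return resolved, missing


def relations_from(entity_id, nodes=NODES):
    """List (relation, target id) pairs declared with entity_id as the source."""
    return [(n["relation"], n["to"]) for n in nodes.values()
            if n["data-type"] == "relationship" and n["from"] == entity_id]


resolved, missing = evidence_for("claim-1")
print(sorted(resolved))               # ['result-iid']
print(missing)                        # ['result-ood']
print(relations_from("entity-ria"))   # [('outperforms', 'entity-mha')]
```

An agent that cites `claim-1` can therefore point at the exact metric node backing it, and can also detect when a pointer has no target in the document it was given.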
Rates.
Public beta · every tier free
tdoc is in public beta. Every tier below is free during this period — we are looking for honest feedback, not paying customers yet. To request an API key (or a higher quota than the demo allows), write to [email protected]. The self-hosted Apache-2.0 reference at github.com/LuciferMors/tdoc is fully free and always will be. Pricing below is the structure paid plans will eventually take — nothing is billing today.
| Tier | Includes | Price |
|---|---|---|
| Free | 100 documents a month. All parsing features, all query operators, community support, forever. No credit card at signup. | $0/mo |
| Pro | 2,000 documents a month. Email support. Soft upgrade path: no hard overage block; you simply get a note when you cross the quota. | $29/mo planned (free during beta) |
| Team | 10,000 documents a month. Ed25519 signing and verification. Twenty-four-hour email response. Overage billed at $0.02 per document, with a monthly ceiling. | $99/mo planned (free during beta) |
| Scale | 100,000 documents a month. Self-hosted option. Shared Slack channel with the author. Overage at $0.01 per document. Annual contracts available. | $499/mo planned (free during beta) |
Self-hosting is free and always will be: the reference implementation is licensed under Apache 2.0. The format is open. The hosted API is the convenience layer, not the only way to get value.
“Stop teaching AI to read human documents. Start writing documents for AI.”