tdoc.


Documents that machines understand.

An AI-native document format with typed nodes, queryable knowledge graphs, and cryptographic signatures — 14x cheaper than VLM extraction, zero hallucination.

Fig. 1 The same research result: VLM extraction (left) vs native AXON (right).

A VLM guesses what a number means. AXON declares it.

Left — VLM extraction (~22,000 tokens)

"RIA reaches 100% IID accuracy"
"MHA collapses to 35.8%"
"p = 0.003 at the 10th minute"
"95% CI [0.41, 0.78]"
"605,568 parameters"

Flat strings. The agent must guess which number is an accuracy, which is a parameter count, what 35.8% refers to. At scale, the guesses drift — and citations become hallucinations.

Right — native AXON (~1,950 tokens)

@data [data-type="metric" method="ria"
       metric="iid-accuracy" value="100.0"]:
  100.0 ± 0.0

@data [data-type="claim"
       evidence="result-iid,result-ood"]:
  RIA reaches 100% IID and OOD accuracy

@data [data-type="relationship"
       from="entity-ria" to="entity-mha"
       relation="outperforms"]:
  RIA outperforms MHA on compositional binding

Typed declarations. Every metric, claim, and relationship is machine-queryable. The agent cites rather than guesses — at 14x lower cost.
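
To make "machine-queryable" concrete, here is a minimal Python sketch that reads the three declarations above into dictionaries and filters them by `data-type`. The helpers `parse_nodes` and `by_type` are hypothetical illustrations, not the reference implementation and not AQL. The grammar is assumed only as far as the excerpt shows it.

```python
import re

# The three declarations from the right-hand panel, copied verbatim.
SAMPLE = '''\
@data [data-type="metric" method="ria"
       metric="iid-accuracy" value="100.0"]:
  100.0 ± 0.0

@data [data-type="claim"
       evidence="result-iid,result-ood"]:
  RIA reaches 100% IID and OOD accuracy

@data [data-type="relationship"
       from="entity-ria" to="entity-mha"
       relation="outperforms"]:
  RIA outperforms MHA on compositional binding
'''

# Assumed shape: "@data [attrs]:" followed by an indented body that runs
# until the next "@data" or the end of the text.
NODE_RE = re.compile(
    r'@data\s*\[(?P<attrs>[^\]]*)\]:\s*\n(?P<body>.*?)(?=\n\s*@data\b|\Z)',
    re.DOTALL,
)
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_nodes(text):
    """Return one dict per @data node: its attributes and its body text."""
    nodes = []
    for match in NODE_RE.finditer(text):
        attrs = dict(ATTR_RE.findall(match.group("attrs")))
        nodes.append({"attrs": attrs, "text": match.group("body").strip()})
    return nodes


def by_type(nodes, data_type):
    """Select nodes whose declared data-type matches exactly."""
    return [n for n in nodes if n["attrs"].get("data-type") == data_type]


nodes = parse_nodes(SAMPLE)
metrics = by_type(nodes, "metric")
claims = by_type(nodes, "claim")
```

Because the type is an attribute rather than something inferred from prose, `float(metrics[0]["attrs"]["value"])` yields the accuracy directly, and a claim's `evidence` attribute splits into the ids it cites.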

A letter  ·  To the reader, from the author  ·  Vol. 1

Why AI deserves better documents.

We have spent a decade teaching language models to read documents designed for human eyes. The entire pipeline — OCR, VLM, extraction, validation — exists because our files never declared what they contained. It is time to build documents that speak machine natively.

Consider what happens when an AI agent reads a research paper today. A vision-language model scans the PDF, pixel by pixel and page by page, and produces ~22,000 tokens of flat text. The number 0.003 appears somewhere in the output. Is it a p-value? A coefficient? A page footnote? The model guesses. Usually it guesses right. But in regulated domains (clinical trials, financial disclosure, legal evidence), a confident wrong guess is not just a failure mode; it is a lawsuit.

tdoc proposes the inverse. Instead of extracting structure from pixels, you author it: every entity is typed, every metric carries its units, every claim points to its evidence, every relationship between concepts is declared. The document arrives at the AI as a three-layer knowledge representation — structured markup, token-minimized compact form, and a queryable knowledge graph — with a self-describing preamble that tells any model what the file is, what it contains, and what queries to run.

The cost difference is not marginal. A single research paper through VLM extraction: ~22,000 tokens. The same paper as native AXON: ~1,950 tokens. That is 14x cheaper, with zero hallucination, because the types were never inferred — they were declared by the author. At 100,000 documents, the difference is hundreds of thousands of dollars.
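
The arithmetic behind these figures can be sketched in a few lines of Python. The two token counts are the approximate ones quoted above. The per-token price is a hypothetical parameter, since the text does not state one. On token counts alone the ratio works out to about 11.3x, so the 14x cost figure presumably also reflects pricing factors not spelled out here.

```python
VLM_TOKENS_PER_DOC = 22_000   # approximate figure quoted in the text
AXON_TOKENS_PER_DOC = 1_950   # approximate figure quoted in the text


def token_ratio(vlm_tokens=VLM_TOKENS_PER_DOC, axon_tokens=AXON_TOKENS_PER_DOC):
    """How many times more tokens the VLM path consumes per document."""
    return vlm_tokens / axon_tokens


def tokens_saved(n_docs, vlm_tokens=VLM_TOKENS_PER_DOC,
                 axon_tokens=AXON_TOKENS_PER_DOC):
    """Total input tokens avoided across a corpus of n_docs documents."""
    return n_docs * (vlm_tokens - axon_tokens)


def savings_usd(n_docs, usd_per_million_tokens):
    """Dollar savings at an assumed (hypothetical) price per million tokens."""
    return tokens_saved(n_docs) / 1_000_000 * usd_per_million_tokens


ratio = token_ratio()
saved = tokens_saved(100_000)
```

The dollar total scales linearly with `usd_per_million_tokens`, so the savings at 100,000 documents should be recomputed with the rate of the model actually in use.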

The spec is open. The reference implementation is a single file of Python. The format is self-describing, deterministic to the byte, cryptographically signable, and queryable via AQL — a structured query language built into the spec. We are not building a better PDF parser. We are building the document format that machines deserve.

— Rishi Arun Shivhare, author of AXON 1.0.
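
The letter's "deterministic to the byte" and "signable" properties are linked: a signature covers bytes, so it only verifies if every tool serializes the same content to the same bytes. The sketch below illustrates that idea in Python. Its canonicalization rules (sorted attributes, collapsed whitespace, UTF-8) are hypothetical and are not taken from the AXON spec. A SHA-256 digest stands in for the payload that a real signature scheme would cover.

```python
import hashlib


def canonical_bytes(attrs, text):
    """Serialize one node under hypothetical canonical rules.

    Attributes are sorted by key and body whitespace is collapsed, so
    authoring differences that carry no meaning do not change the bytes.
    """
    attr_str = " ".join(f'{key}="{value}"' for key, value in sorted(attrs.items()))
    body = " ".join(text.split())
    return f"@data [{attr_str}]: {body}\n".encode("utf-8")


def digest(attrs, text):
    """Hex SHA-256 of the canonical bytes: the value a signer would sign."""
    return hashlib.sha256(canonical_bytes(attrs, text)).hexdigest()


# Same content, different attribute order and spacing: identical digest.
a = digest({"data-type": "metric", "value": "100.0", "method": "ria"}, "100.0 ± 0.0")
b = digest({"method": "ria", "value": "100.0", "data-type": "metric"}, "  100.0 ±   0.0 ")

# One changed value: different digest, so a signature over `a` would not verify.
c = digest({"data-type": "metric", "value": "35.8", "method": "ria"}, "100.0 ± 0.0")
```

Without a byte-exact canonical form, `a` and `b` could differ even though they describe the same metric, and a valid signature would fail to verify after a harmless re-serialization.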

Three layers of knowledge.

AXON is not markup. It is a knowledge representation. Layer 1 is structured .axc with typed nodes. Layer 2 is a token-minimized compact form. Layer 3 is a queryable knowledge graph with entities, relationships, claims, and metrics — each with provenance pointers. One document, three views, all deterministic and signable.

@section [id="knowledge" type="knowledge-graph"]:

  @data [id="entity-ria" data-type="entity" class="method"]:
    Role-Indexed Attention (RIA)

  @data [id="result-iid" data-type="metric"
         method="ria" metric="iid-accuracy"
         value="100.0" unit="%"]:
    100.0 ± 0.0

  @data [id="claim-1" data-type="claim"
         evidence="result-iid,result-ood"
         status="supported"]:
    RIA reaches 100% IID and OOD accuracy

  @data [id="rel-1" data-type="relationship"
         from="entity-ria" to="entity-mha"
         relation="outperforms"]:
    RIA outperforms MHA on compositional binding
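
A minimal sketch of what "provenance pointers" allow, assuming the four nodes above have already been loaded into a Python dictionary keyed by `id`. The loader is omitted, and `evidence_for` and `edges` are hypothetical helper names, not AQL operators. Note that this excerpt does not define `result-ood` or `entity-mha`, so the sketch reports them as unresolved. A complete document would presumably declare them.

```python
# The four nodes from the excerpt, transcribed by hand.
NODES = {
    "entity-ria": {"data-type": "entity", "class": "method",
                   "text": "Role-Indexed Attention (RIA)"},
    "result-iid": {"data-type": "metric", "method": "ria",
                   "metric": "iid-accuracy", "value": "100.0", "unit": "%",
                   "text": "100.0 ± 0.0"},
    "claim-1": {"data-type": "claim", "evidence": "result-iid,result-ood",
                "status": "supported",
                "text": "RIA reaches 100% IID and OOD accuracy"},
    "rel-1": {"data-type": "relationship", "from": "entity-ria",
              "to": "entity-mha", "relation": "outperforms",
              "text": "RIA outperforms MHA on compositional binding"},
}


def evidence_for(claim_id, nodes):
    """Follow a claim's evidence pointers.

    Returns (resolved, missing): the evidence nodes found in the graph,
    and the ids that point at nothing.
    """
    raw = nodes[claim_id].get("evidence", "")
    ids = [part.strip() for part in raw.split(",") if part.strip()]
    resolved = {i: nodes[i] for i in ids if i in nodes}
    missing = [i for i in ids if i not in nodes]
    return resolved, missing


def edges(nodes):
    """List relationship nodes as (from, relation, to) triples."""
    return [(n["from"], n["relation"], n["to"])
            for n in nodes.values()
            if n.get("data-type") == "relationship"]


resolved, missing = evidence_for("claim-1", NODES)
triples = edges(NODES)
```

An agent can cite `result-iid` with its value and unit instead of quoting a sentence, and a validator can flag any claim whose `missing` list is not empty.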

Rates.

Public beta · every tier free

tdoc is in public beta. Every tier below is free during this period — we are looking for honest feedback, not paying customers yet. To request an API key (or a higher quota than the demo allows), write to [email protected]. The self-hosted Apache-2.0 reference at github.com/LuciferMors/tdoc is fully free and always will be. Pricing below is the structure paid plans will eventually take — nothing is billing today.

tdoc tiers (free during beta)
Free · $0/mo
100 documents a month. All parsing features, all query operators, community support, forever. No credit card at signup.

Pro · $29 (free during beta)
2,000 documents a month. Email support. Soft upgrade path: no hard overage block; you simply get a note when you cross the quota.

Team · $99 (free during beta)
10,000 documents a month. Ed25519 signing and verification. Twenty-four-hour email response. Overage billed at $0.02 per document, with a monthly ceiling.

Scale · $499 (free during beta)
100,000 documents a month. Self-hosted option. Shared Slack channel with the author. Overage at $0.01 per document. Annual contracts available.
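
Once billing starts, the overage rules above reduce to simple arithmetic. The Python sketch below assumes the listed prices are monthly base fees, which the page states only for the Free tier. The ceiling amount is not given, so it is modeled as a hypothetical `overage_ceiling_cents` parameter. Amounts are kept in integer cents to avoid floating-point rounding.

```python
# Tier data from the list above; base prices assumed to be per month.
TIERS = {
    "team": {"base_cents": 9_900, "included_docs": 10_000, "overage_cents": 2},
    "scale": {"base_cents": 49_900, "included_docs": 100_000, "overage_cents": 1},
}


def monthly_bill_cents(tier, docs, overage_ceiling_cents=None):
    """Base fee plus per-document overage, optionally capped.

    overage_ceiling_cents is a hypothetical cap on the overage portion;
    None means no cap is applied.
    """
    plan = TIERS[tier]
    extra_docs = max(0, docs - plan["included_docs"])
    overage = extra_docs * plan["overage_cents"]
    if overage_ceiling_cents is not None:
        overage = min(overage, overage_ceiling_cents)
    return plan["base_cents"] + overage


# 12,500 documents on Team: 2,500 over quota at $0.02 each.
team_bill = monthly_bill_cents("team", 12_500)
team_bill_capped = monthly_bill_cents("team", 12_500, overage_ceiling_cents=3_000)
```

During the beta none of this is charged. The sketch only shows the structure the paid plans are expected to take.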

Self-hosting is free and always will be: the reference implementation is Apache 2.0. The format is open. The hosted API is the convenience layer, not the only way to get value.

“Stop teaching AI to read human documents. Start writing documents for AI.”