# Provenance: why is this fact true? When a Datalog program derives a fact, it's often useful to ask *why*: which rule fired, what values did its variables take, and which stored facts ultimately supported the derivation? Mangle ships with a provenance tool, `mgwhy`, and a matching Go package that answers exactly this question. Given a program and the fact store it produced, `mgwhy` prints a *proof tree*: the rule that fired, its substitution, and recursively the proofs of each body premise, bottoming out in stored (EDB) facts. ## A first example Consider a small graph and its transitive closure: ``` edge(1, 2). edge(2, 3). edge(3, 4). path(X, Y) :- edge(X, Y). path(X, Z) :- edge(X, Y), path(Y, Z). ``` Save this as `path.mg` and ask why `path(1, 3)` holds: ``` $ mgwhy -program path.mg 'path(1, 3)' proof 1 of 1: path(1,3) (/proof/763e3b51b4680148b139172de260f3ff) by rule /rule/b23b52ec…: path(X,Z) :- edge(X,Y), path(Y,Z). with X=1, Y=2, Z=3 premises: edge(1,2) (/proof/24541127…) [EDB] path(2,3) (/proof/86b33a99…) by rule /rule/e844ff77…: path(X,Y) :- edge(X,Y). with X=2, Y=3 premises: edge(2,3) (/proof/e5f6834a…) [EDB] ``` Each node has an identifier of the form `/proof/` — a 128-bit content hash of the rule, the derived fact, and the IDs of its sub-proofs. Identical sub-proofs share an identifier, so when a fact is reused across several parent derivations, it appears once in the output. Shared sub-proofs turn the proof tree into a DAG. ## Multiple derivations A single fact can be derivable in several ways. By default `mgwhy` prints one proof; ask for more with `-max-proofs`: ``` $ mgwhy -program path.mg -max-proofs 3 'path(1, 4)' ``` ## Provenance as queryable data Proof trees are themselves relational. With `-format facts`, `mgwhy` emits the proof DAG as Mangle facts that you can load back into `mg` and query in Datalog: ``` $ mgwhy -program path.mg -format facts 'path(1, 3)' > proof.mg ``` The schema is: | Predicate | Arity | Meaning | |---|---|---| | `proves(ProofID, Fact)` | 2 | `ProofID` concludes `Fact`. | | `uses_rule(ProofID, RuleID)` | 2 | Which rule fired (omitted for leaves). | | `premise(ProofID, Index, SubID)` | 3 | The `Index`-th body atom was proved by `SubID`. | | `edb_leaf(ProofID, Fact)` | 2 | Leaf: `Fact` comes from the store. | | `absence_leaf(ProofID, Fact)` | 2 | Leaf: `Fact` is **not** in the store (see *Negation* below). | | `binding(ProofID, VarName, Value)` | 3 | One entry of the rule's substitution σ. | | `rule_source(RuleID, ClauseText)` | 2 | Human-readable form of a rule. | Facts are encoded uniformly as **lists**: `edge(1, 2)` becomes the structured value `[/edge, 1, 2]`. The predicate symbol is prefixed with `/` so it can live in a name constant. Once loaded into `mg`, recursive queries on the proof DAG are just Datalog: ``` # All EDB facts that ultimately support a given derived fact. supports(P, F) :- proves(P, _), edb_leaf(P, F). supports(P, F) :- proves(P, _), premise(P, _, Sub), supports(Sub, F). # Every rule that fired, directly or transitively. rule_used(P, R) :- proves(P, _), uses_rule(P, R). rule_used(P, R) :- proves(P, _), premise(P, _, Sub), rule_used(Sub, R). ``` Using a logic language to query its own proofs is pleasingly self-referential, and is often the quickest way to ask things like "what sources contributed to this conclusion?" ## Negation Mangle supports stratified negation: a rule may use a negated premise `!p(X)` whose predicate `p` is fully computed before the rule runs. The explainer handles this by recording *evidence of absence*: when `!p(a)` is checked, the grounded atom is looked up in the final store, and if absent, the negated premise is satisfied and the proof records an `absence_leaf` pointing at `p(a)`. ``` node(1). node(2). node(3). reachable(1). reachable(2). unreachable(X) :- node(X), !reachable(X). ``` ``` $ mgwhy -program graph.mg 'unreachable(3)' proof 1 of 1: unreachable(3) (/proof/d960840a…) by rule …: unreachable(X) :- node(X), !reachable(X). with X=3 premises: node(3) (/proof/f824ff43…) [EDB] reachable(3) (/proof/e1f375a6…) [absent: !reachable(3)] ``` This works because stratified evaluation closes the predicate `reachable` before the rule for `unreachable` runs — so absence in the final store is sound evidence that the fact is not derivable. Note: `absence_leaf` nodes do **not** emit a `proves(...)` fact, since semantically they prove the *negation* of the fact, not the fact itself. Walk them via `premise(Parent, Index, AbsenceID)`. ## Simple and full modes `mgwhy` has two modes, switched via `-mode`: - **`-mode=simple`** (default) is a post-hoc backward-chainer. It works from the finished fact store, without instrumenting the engine, and only pays a cost when you ask for provenance. It covers plain Datalog, equality/inequality, stratified negation, and recursion. - **`-mode=full`** installs a recorder on the engine that captures every derivation during evaluation. It handles everything simple mode does, plus **let-transforms** and **do-transforms (aggregation)**. The recorder has a nil-check cost on the hot path — when it isn't installed, there is no overhead. ### Aggregation (full mode only) ``` sale("apple", 3). sale("apple", 5). sale("pear", 2). totals(P, S) :- sale(P, N) |> do fn:group_by(P), let S = fn:sum(N). ``` ``` $ mgwhy -program sales.mg -mode full 'totals("apple", 8)' proof 1 of 1: totals("apple",8) (/proof/b53cb477…) by rule …: totals(P,S) :- sale(P,N) |> do fn:group_by(P), let S = fn:sum(N). transform: do fn:group_by(P), let S = fn:sum(N) group key: ("apple") input facts: sale("apple",3) [EDB] sale("apple",5) [EDB] ``` The proof node for an aggregate lists the `group key` and every input fact that contributed to the group. As facts, aggregates emit `uses_transform(ProofID, TransformText)` and `group_key(ProofID, Index, Value)` in addition to the usual `proves`, `premise`, and `edb_leaf` tuples. ## What is still not explained - **Temporal annotations** on rule heads or premises are not yet covered. Affected proofs are flagged `partial`. ## Caps and performance Backward chaining re-queries the fact store for each premise. For big stores or high-fan-out rules, enumerating every possible proof can blow up, so the explainer has two caps, both exposed on the command line: - `-max-proofs N` — at most `N` alternative proofs per goal (default 1). - `-max-depth N` — cut proof depth (default 64). Sub-proofs beyond this depth are returned marked `partial`. Cycles are detected automatically: when the explainer is asked to prove a fact while already trying to prove it further up the call stack, that branch yields no proofs. Non-cyclic derivations of the same fact still succeed. ## Using the Go API Simple (post-hoc) mode: ```go import "codeberg.org/TauCeti/mangle-go/provenance" proofs, err := provenance.Explain(programInfo, factstore, goal, provenance.Options{ MaxProofs: 3, MaxDepth: 32, }) ``` Full (recorder) mode — attach a recorder during eval, then build proofs from the recording: ```go import ( "codeberg.org/TauCeti/mangle-go/engine" "codeberg.org/TauCeti/mangle-go/provenance" ) rec := provenance.NewMemoryRecorder() engine.EvalStratifiedProgramWithStats(pi, strata, predToStratum, store, engine.WithDerivationRecorder(rec)) proofs, err := provenance.BuildFromRecording(rec, store, goal, provenance.Options{}) // Pretty-print a tree: provenance.Print(os.Stdout, proofs) // Or write the proof DAG as Mangle facts into another store: out := factstore.NewSimpleInMemoryStore() provenance.EmitFacts(proofs, &out) ```