Provenance: why is this fact true?¶
When a Datalog program derives a fact, it’s often useful to ask why: which
rule fired, what values did its variables take, and which stored facts
ultimately supported the derivation? Mangle ships with a provenance tool,
mgwhy, and a matching Go package that answers exactly this question.
Given a program and the fact store it produced, mgwhy prints a proof
tree: the rule that fired, its substitution, and recursively the proofs
of each body premise, bottoming out in stored (EDB) facts.
A first example¶
Consider a small graph and its transitive closure:
edge(1, 2).
edge(2, 3).
edge(3, 4).
path(X, Y) :- edge(X, Y).
path(X, Z) :- edge(X, Y), path(Y, Z).
Save this as path.mg and ask why path(1, 3) holds:
$ mgwhy -program path.mg 'path(1, 3)'
proof 1 of 1:
path(1,3) (/proof/763e3b51b4680148b139172de260f3ff)
by rule /rule/b23b52ec…: path(X,Z) :- edge(X,Y), path(Y,Z).
with X=1, Y=2, Z=3
premises:
edge(1,2) (/proof/24541127…)
[EDB]
path(2,3) (/proof/86b33a99…)
by rule /rule/e844ff77…: path(X,Y) :- edge(X,Y).
with X=2, Y=3
premises:
edge(2,3) (/proof/e5f6834a…)
[EDB]
Each node has an identifier of the form /proof/<hex> — a 128-bit content
hash of the rule, the derived fact, and the IDs of its sub-proofs.
Identical sub-proofs share an identifier, so when a fact is reused across
several parent derivations, it appears once in the output. Shared
sub-proofs turn the proof tree into a DAG.
Multiple derivations¶
A single fact can be derivable in several ways. By default mgwhy prints
one proof; ask for more with -max-proofs:
$ mgwhy -program path.mg -max-proofs 3 'path(1, 4)'
Provenance as queryable data¶
Proof trees are themselves relational. With -format facts, mgwhy
emits the proof DAG as Mangle facts that you can load back into mg and
query in Datalog:
$ mgwhy -program path.mg -format facts 'path(1, 3)' > proof.mg
The schema is:
Predicate |
Arity |
Meaning |
|---|---|---|
|
2 |
|
|
2 |
Which rule fired (omitted for leaves). |
|
3 |
The |
|
2 |
Leaf: |
|
2 |
Leaf: |
|
3 |
One entry of the rule’s substitution σ. |
|
2 |
Human-readable form of a rule. |
Facts are encoded uniformly as lists: edge(1, 2) becomes the
structured value [/edge, 1, 2]. The predicate symbol is prefixed with
/ so it can live in a name constant.
Once loaded into mg, recursive queries on the proof DAG are just
Datalog:
# All EDB facts that ultimately support a given derived fact.
supports(P, F) :- proves(P, _), edb_leaf(P, F).
supports(P, F) :- proves(P, _), premise(P, _, Sub), supports(Sub, F).
# Every rule that fired, directly or transitively.
rule_used(P, R) :- proves(P, _), uses_rule(P, R).
rule_used(P, R) :- proves(P, _), premise(P, _, Sub), rule_used(Sub, R).
Using a logic language to query its own proofs is pleasingly self-referential, and is often the quickest way to ask things like “what sources contributed to this conclusion?”
Negation¶
Mangle supports stratified negation: a rule may use a negated premise
!p(X) whose predicate p is fully computed before the rule runs. The
explainer handles this by recording evidence of absence: when !p(a) is
checked, the grounded atom is looked up in the final store, and if absent,
the negated premise is satisfied and the proof records an absence_leaf
pointing at p(a).
node(1).
node(2).
node(3).
reachable(1).
reachable(2).
unreachable(X) :- node(X), !reachable(X).
$ mgwhy -program graph.mg 'unreachable(3)'
proof 1 of 1:
unreachable(3) (/proof/d960840a…)
by rule …: unreachable(X) :- node(X), !reachable(X).
with X=3
premises:
node(3) (/proof/f824ff43…)
[EDB]
reachable(3) (/proof/e1f375a6…)
[absent: !reachable(3)]
This works because stratified evaluation closes the predicate reachable
before the rule for unreachable runs — so absence in the final store is
sound evidence that the fact is not derivable.
Note: absence_leaf nodes do not emit a proves(...) fact, since
semantically they prove the negation of the fact, not the fact itself.
Walk them via premise(Parent, Index, AbsenceID).
Simple and full modes¶
mgwhy has two modes, switched via -mode:
-mode=simple(default) is a post-hoc backward-chainer. It works from the finished fact store, without instrumenting the engine, and only pays a cost when you ask for provenance. It covers plain Datalog, equality/inequality, stratified negation, and recursion.-mode=fullinstalls a recorder on the engine that captures every derivation during evaluation. It handles everything simple mode does, plus let-transforms and do-transforms (aggregation). The recorder has a nil-check cost on the hot path — when it isn’t installed, there is no overhead.
Aggregation (full mode only)¶
sale("apple", 3).
sale("apple", 5).
sale("pear", 2).
totals(P, S) :- sale(P, N) |> do fn:group_by(P), let S = fn:sum(N).
$ mgwhy -program sales.mg -mode full 'totals("apple", 8)'
proof 1 of 1:
totals("apple",8) (/proof/b53cb477…)
by rule …: totals(P,S) :- sale(P,N) |> do fn:group_by(P), let S = fn:sum(N).
transform: do fn:group_by(P), let S = fn:sum(N)
group key: ("apple")
input facts:
sale("apple",3) [EDB]
sale("apple",5) [EDB]
The proof node for an aggregate lists the group key and every input
fact that contributed to the group. As facts, aggregates emit
uses_transform(ProofID, TransformText) and group_key(ProofID, Index, Value) in addition to the usual proves, premise, and edb_leaf
tuples.
What is still not explained¶
Temporal annotations on rule heads or premises are not yet covered. Affected proofs are flagged
partial.
Caps and performance¶
Backward chaining re-queries the fact store for each premise. For big stores or high-fan-out rules, enumerating every possible proof can blow up, so the explainer has two caps, both exposed on the command line:
-max-proofs N— at mostNalternative proofs per goal (default 1).-max-depth N— cut proof depth (default 64). Sub-proofs beyond this depth are returned markedpartial.
Cycles are detected automatically: when the explainer is asked to prove a fact while already trying to prove it further up the call stack, that branch yields no proofs. Non-cyclic derivations of the same fact still succeed.
Using the Go API¶
Simple (post-hoc) mode:
import "codeberg.org/TauCeti/mangle-go/provenance"
proofs, err := provenance.Explain(programInfo, factstore, goal, provenance.Options{
MaxProofs: 3,
MaxDepth: 32,
})
Full (recorder) mode — attach a recorder during eval, then build proofs from the recording:
import (
"codeberg.org/TauCeti/mangle-go/engine"
"codeberg.org/TauCeti/mangle-go/provenance"
)
rec := provenance.NewMemoryRecorder()
engine.EvalStratifiedProgramWithStats(pi, strata, predToStratum, store,
engine.WithDerivationRecorder(rec))
proofs, err := provenance.BuildFromRecording(rec, store, goal, provenance.Options{})
// Pretty-print a tree:
provenance.Print(os.Stdout, proofs)
// Or write the proof DAG as Mangle facts into another store:
out := factstore.NewSimpleInMemoryStore()
provenance.EmitFacts(proofs, &out)