Aggregation¶
Datalog as a basic, yet powerful way to define and query data. However, real-life use cases frequently need more than querying. For example, instead of getting a large table of results, we may want to group values, count the number of results, compute sums, sort. In database community, this functionality is called aggregation.
Grouping¶
We reuse our volunteer database example but with an eye on software development projects. Every project has a person that act as an organizer, a number of software developers that are assigned to it with a number of hours per week.
The volunteer table volunteer(ID, Name, Skill) has records for
people in our volunteer database. Only look at people with skills
/organizer and /software_development are assigned to projects.
ID |
Name |
Skill |
|---|---|---|
/v/3 |
Alyssa P. Hacker |
/software_development |
/v/3 |
Alyssa P. Hacker |
/organizer |
/v/4 |
Ivan Hassenovich |
/software_development |
/v/4 |
Ivan Hassenovich |
/organizer |
/v/5 |
Claudio Ferrari |
/organizer |
… |
… |
… |
In this example, we use Mangle names like /v/1 for IDs, which
helps us distinguish between different kinds of IDs.
The project table project(ProjectID, Name) is where projects are defined.
ProjectID |
Name |
|---|---|
… |
… |
/p/10 |
Ultimate Kubernetes Control Plane UI |
/p/11 |
YAML Engineer Online Courseware |
/p/12 |
Personal Dopamine Fasting Tracker |
/p/13 |
LLM Fact Checker |
… |
… |
The table project_assignment(ProjectID, VolunteerID, Role, Hours)
has information how many hours a person is contributing to which project.
It is volunteering, so people assign themselves. Yet, that also means,
hours may be low and sometimes a project is missing an organizer or
developer or simply dead.
ProjectID |
VolunteerID |
Role |
Hours |
|---|---|---|---|
… |
… |
… |
… |
/p/10 |
/v/3 |
/organizer |
2 |
/p/10 |
/v/3 |
/software_development |
2 |
/p/10 |
/v/4 |
/software_development |
2 |
/p/11 |
/v/4 |
/software_development |
20 |
/p/12 |
/v/5 |
/organizer |
20 |
Suppose we want to list how many software developers are assigned to a project, and the total number of hours. In Mangle, we separate this into two parts: first, we identify all records, then, we group them rows and specify what aggregation we want.
project_dev_energy(ProjectID, NumDevelopers, TotalHours) ⟸
project_assignment(ProjectID, VolunteerID, /software_development, Hours)
|> do fn:group_by(ProjectID),
let NumDevelopers = fn:count(),
let TotalHours = fn:sum(TotalHours).
The query project_assignment(ProjectID, VolunteerID, /software_development)
gives us a result relation like this:
ProjectID |
VolunteerID |
Hours |
|---|---|---|
/p/10 |
/v/3 |
2 |
/p/10 |
/v/4 |
2 |
/p/11 |
/v/4 |
20 |
The part after the |> is called a do-transformation: we take the whole
result relation as a whole and do something to it. In this case, we
group by ProjectID, which we can imagine as creating a separate relation
for each ProjectID value.
ProjectID /p/10
VolunteerID |
Hours |
|---|---|
/v/3 |
2 |
/v/4 |
2 |
ProjectID /p/11
VolunteerID |
Hours |
|---|---|
/v/4 |
20 |
Finally, the other parts of the do-transform say what we should do with each
of the per-ProjectID tables. With fn:count() we count the number of rows,
and with fn:sum(Hours) we sum the values in the Hours column.
This then yields the final project_dev_energy table that contains the
aggregated values.
ProjectID |
NumDevelopers |
TotalHours |
|---|---|---|
/p/10 |
2 |
4 |
/p/11 |
1 |
20 |
Click to see SQL translation
Here is how the project_work_energy rule look like in SQL:
CREATE TABLE project_work_energy AS
SELECT ProjectID, COUNT(VolunteerID) as NumDevelopers, SUM(Hours) as TotalHours
FROM project_assignment
WHERE Role = '/software_developer'
GROUP BY ProjectID
Absence of values and Negation¶
If you followed the example carefully, you may notice that there is no way that we can end up with 0 developers. A project that has no developers assigned will simply not show up in the result relation. Datalog is a mechanism that deals with positive information.
We can deal with absence, by making use of negation. First we define a helper table that identifies projects without developers:
project_with_developers(ProjectID) ⟸
project_assignment(ProjectID, _, /software_development, _).
project_without_developers(ProjectID) ⟸
project_name(ProjectID, _).
!project_with_developers(ProjectID)
Datalog negation is explained elswhere, for now, do note
that the project_without_developers query mentions
the ProjectID appears in a positive subquery as well as a negated subquery.
Every variable mentioned in the head of a rule has to be mentioned in at
least one subquery that is not negated.
With this, we can now add another rule to project_dev_energy that handles
the case of zero developers.
project_dev_energy(ProjectID, NumDevelopers, TotalHours) ⟸
project_without_developers(ProjectID),
NumDevelopers = 0,
TotalHours = 0.
The final project_dev_energy table then looks like this:
ProjectID |
NumDevelopers |
TotalHours |
|---|---|---|
/p/10 |
2 |
4 |
/p/11 |
1 |
20 |
/p/12 |
0 |
0 |
/p/13 |
0 |
0 |
Click to see SQL translation for the zero case
Here is how to get a result that includes projects with zero developers in SQL.
We need to LEFT JOIN the project table with the assignments.
CREATE TABLE project_dev_energy AS
SELECT
p.ProjectID,
COUNT(pa.VolunteerID) AS NumDevelopers,
COALESCE(SUM(pa.Hours), 0) AS TotalHours
FROM
project AS p
LEFT JOIN
project_assignment AS pa
ON
p.ProjectID = pa.ProjectID AND pa.Role = '/software_development'
GROUP BY
p.ProjectID