# Aggregation

Datalog as a basic, yet powerful way to define and query data.
However, real-life use cases frequently need more than querying.
For example, instead of getting a large table of results, we
may want to group values, count the number of results, compute
sums, sort. In database community, this functionality is called
*aggregation*.

## Grouping

We reuse our volunteer database example but with an eye on
software development projects. Every project
has a person that act as an organizer, a number of software developers
that are assigned to it with a number of hours per week.

The volunteer table `volunteer(ID, Name, Skill)` has records for
people in our volunteer database. Only look at people with skills
`/organizer` and `/software_development` are assigned to projects.

| ID | Name | Skill |
|----|------|-------|
| /v/3 | Alyssa P. Hacker | /software_development |
| /v/3 | Alyssa P. Hacker | /organizer |
| /v/4 | Ivan Hassenovich | /software_development |
| /v/4 | Ivan Hassenovich | /organizer |
| /v/5 | Claudio Ferrari | /organizer |
| ... | ... | ... |

In this example, we use Mangle names like `/v/1` for IDs, which
helps us distinguish between different kinds of IDs.

The project table `project(ProjectID, Name)` is where projects are defined.

| ProjectID | Name |
|-----------|-------------|
| ... | ... |
| /p/10 | Ultimate Kubernetes Control Plane UI |
| /p/11 | YAML Engineer Online Courseware |
| /p/12 | Personal Dopamine Fasting Tracker |
| /p/13 | LLM Fact Checker |
| ... | ... |

The table `project_assignment(ProjectID, VolunteerID, Role, Hours)`
has information how many hours a person is contributing to which project.
It is volunteering, so people assign themselves. Yet, that also means,
hours may be low and sometimes a project is missing an organizer or
developer or simply dead.

| ProjectID | VolunteerID | Role | Hours |
|-----------|-------------|------|-------|
| ... | ... | ... | ... |
| /p/10 | /v/3 | /organizer | 2 |
| /p/10 | /v/3 | /software_development | 2 |
| /p/10 | /v/4 | /software_development | 2 |
| /p/11 | /v/4 | /software_development | 20 |
| /p/12 | /v/5 | /organizer | 20

Suppose we want to list *how many* software developers are assigned
to a project, and the total number of hours. In Mangle, we separate this
into two parts: first, we identify all records, then, we *group* them rows
and specify what aggregation we want.

```
project_dev_energy(ProjectID, NumDevelopers, TotalHours) ⟸
  project_assignment(ProjectID, VolunteerID, /software_development, Hours)
  |> do fn:group_by(ProjectID),
     let NumDevelopers = fn:count(),
     let TotalHours = fn:sum(TotalHours).
```

The query `project_assignment(ProjectID, VolunteerID, /software_development)`
gives us a result relation like this:

| ProjectID | VolunteerID | Hours |
|-----------|-------------|------|
| /p/10 | /v/3 | 2 |
| /p/10 | /v/4 | 2 |
| /p/11 | /v/4 | 20 |

The part after the `|>` is called a do-transformation: we take the whole
result relation as a whole and do something to it. In this case, we
group by `ProjectID`, which we can imagine as creating a separate relation
for each `ProjectID` value.

ProjectID /p/10

| VolunteerID | Hours |
|-------------|------|
| /v/3 | 2 |
| /v/4 | 2 |

ProjectID /p/11

| VolunteerID | Hours |
|-------------|------|
| /v/4 | 20 |

Finally, the other parts of the do-transform say what we should do with each
of the per-`ProjectID` tables. With `fn:count()` we count the number of rows,
and with `fn:sum(Hours)` we sum the values in the `Hours` column.

This then yields the final `project_dev_energy` table that contains the
aggregated values.

| ProjectID | NumDevelopers | TotalHours |
|-----------|-------------|------|
| /p/10 | 2 | 4 |
| /p/11 | 1 | 20 |

:::{admonition} Click to see SQL translation
:class: dropdown
Here is how the `project_work_energy` rule look like in SQL:
```sql
CREATE TABLE project_work_energy AS
SELECT ProjectID, COUNT(VolunteerID) as NumDevelopers, SUM(Hours) as TotalHours
FROM project_assignment
WHERE Role = '/software_developer'
GROUP BY ProjectID
```
:::

## Absence of values and Negation

If you followed the example carefully, you may notice that there is
no way that we can end up with 0 developers. A project that has no
developers assigned will simply not show up in the result relation.
Datalog is a mechanism that deals with positive information.

We can deal with absence, by making use of negation. First we
define a helper table that identifies projects without developers:

```
project_with_developers(ProjectID) ⟸
  project_assignment(ProjectID, _, /software_development, _).

project_without_developers(ProjectID) ⟸
  project_name(ProjectID, _).
  !project_with_developers(ProjectID)
```

Datalog negation is explained elswhere, for now, do note
that the `project_without_developers` query mentions
the `ProjectID` appears in a positive subquery as well as a negated subquery.
Every variable mentioned in the head of a rule has to be mentioned in at
least one subquery that is not negated.

With this, we can now add another rule to `project_dev_energy` that handles
the case of zero developers.

```
project_dev_energy(ProjectID, NumDevelopers, TotalHours) ⟸
  project_without_developers(ProjectID),
  NumDevelopers = 0,
  TotalHours = 0.
```

The final `project_dev_energy` table then looks like this:

| ProjectID | NumDevelopers | TotalHours |
|-----------|-------------|------|
| /p/10 | 2 | 4 |
| /p/11 | 1 | 20 |
| /p/12 | 0 | 0 |
| /p/13 | 0 | 0 |

:::{admonition} Click to see SQL translation for the zero case
:class: dropdown
Here is how to get a result that includes projects with zero developers in SQL.
We need to `LEFT JOIN` the project table with the assignments.

```sql
CREATE TABLE project_dev_energy AS
SELECT
  p.ProjectID,
  COUNT(pa.VolunteerID) AS NumDevelopers,
  COALESCE(SUM(pa.Hours), 0) AS TotalHours
FROM
  project AS p
LEFT JOIN
  project_assignment AS pa
ON
  p.ProjectID = pa.ProjectID AND pa.Role = '/software_development'
GROUP BY
  p.ProjectID
```
:::