Now let’s look at aggregation!
You will recall that an aggregation has two parts:
a set of partitioning attributes
one or more aggregate function(s)
Now we’re going to see how to implement these in SQL.
Aggregation
I must confess: this chapter is a little long—but I really want you to grasp these concepts!
The chapter can essentially be summed up in one line of code:
SELECT status, count(*) FROM entity GROUP BY status;
We have placed the partitioning attribute (here, status
) after the keyword GROUP BY
, and the aggregate function COUNT()
in the SELECT
clause. Alternatively, we could add the partitioning attributes in the SELECT
clause.
And now you know everything there is to know (well, almost)!
Thinking It Through...
First let’s test the aggregate functions and leave the partitioning attributes for later.
Aggregate functions
A little reminder: an aggregate function takes multiple rows and returns a single row.
For example, if we write…
SELECT max(incorporation_date) AS maxi FROM entity ;
…the max
function takes the attribute incorporation_date
, which has multiple rows (all of the rows of the entity table), and returns an attribute (we’ll call itmaxi
) on a single row.
You really should execute this query in order to fully understand it. ;)
Inputs and outputs
Most of the basic aggregate functions of SQL (that are predefined by the language) take a single attribute and also return a single attribute. This is the case for the max
function we just saw.
However, there is a function that doesn’t adhere to this rule: the count
function. The count
function returns a whole number: the number of rows it received as input. To count the number of rows, there’s no need to know the value of any argument of these rows! So we simply assign it the character *
:
SELECT count(*) FROM entity ;
A little puzzle
If entity contains 1,000 rows, how many rows will each of these two queries return?
SELECT incorporation_date FROM entity ;SELECT min(id) FROM entity ;SELECT incorporation_date, min(id) FROM entity ;
The first query will return 1,000 rows. The second will return only one row. However, for the third query, your RDBMS should return an error, because it won’t know whether to return 1 row or 1,000 rows.
On the other hand, this query will work, and return 1 row:
SELECT max(incorporation_date), min(id) FROM entity ;
This is because the two columns we are querying both result from an aggregate function.
Can we conclude, then, that in the SELECT
clause, we can use an aggregate function only if all of the other columns also result from an aggregate function?
Yes! Unless your query contains a GROUP BY
. If that’s the case, there’s more to consider.
And that’s what we’ll be looking at in the next section.
Partitioning attributes
In an aggregation, the partitioning attributes are specified in the GROUP BY
clause: :
SELECT count(*) FROM entity GROUP BY status ;
Here, we are counting the number of companies for each of the statuses found in entity.
Yes, but this query only returns a column with numbers! We don’t know what the numbers correspond to.
That’s right! To solve this problem, the status value can be displayed as follows:
SELECT status, count(*) FROM entity GROUP BY status ;
... which brings us back to the rule we mentioned earlier, which is completed as follows to form a Golden Rule:
La règle d'or de l'agrégation
Let’s Continue Our Investigation
Ready to return to the main query of the previous chapter? You can do it!
SELECT
i.id as intermediary_id,
i.name as intermediary_name,
e.id as entity_id,
e.name as entity_name,
e.status as entity_status
FROM
intermediary i,
assoc_inter_entity a,
entity e
WHERE
a.entity = e.id
AND a.inter = i.id
AND e.name = 'Big Data Crunchers Ltd.' ;
You will recall that this links intermediaries to entities. It enabled us to find two intermediaries with identifiers 5000 and 5001.
Here, we want to know the number of entities (companies) these two intermediaries created in each of the jurisdictions (for our purposes, jurisdiction = country ;) ). So we’re going to have to use GROUP BY!
As you may have guessed, our partitioning attributes will be:
jurisdiction
intermediary (taking, for example, its identifier and its name)
Let’s begin by amending our previous query. First of all, this earlier query is limited to the entity Big Data Crunchers Ltd. Here, we want to analyze all of the entities created by the two intermediaries. So let’s replace row 14 with AND (i.id = 5000 OR i.id = 5001)
.
Let’s add our three partitioning attributes in the GROUP BY
clause, then delete all of the columns that are not partitioning attributes from the SELECT
clause (that is, rows 4, 5, and 6).
All that’s left is our aggregate function, COUNT()
, because we want to count the number of entities. Here’s the result:
SELECT
i.id as intermediary_id,
i.name as intermediary_name,
e.jurisdiction,
count(*)
FROM
intermediary i,
assoc_inter_entity a,
entity e
WHERE
a.entity = e.id
AND a.inter = i.id
AND (i.id = 5000 OR i.id = 5001)
GROUP BY
i.id, i.name, e.jurisdiction;
Go Further: Aggregate Functions and Scalar Functions
In the chapter covering the SELECT clause, we saw that it is possible to apply functions in the SELECT
clause. Are these aggregate functions?
SELECT abs(id) AS valeur_absolue FROM entity ;
No. :( In this query (which returns the absolute value of the company identifier), abs
returns a result for as many rows as it received as input. If there are 1,000 companies in the entity table, then there will be 1,000 absolute values to calculate. However, an aggregate function returns only a single row.
Therefore, the abs
function is referred to as a scalar function.
Summary
In aggregations, the aggregate functions are placed in the
SELECT
clause ...... and the partitioning attributes are placed in the
GROUP BY
clause.In the
SELECT
clause, an aggregate function can be used only if all of the other columns also result from an aggregate function OR are present in theGROUP BY
clause.