Retrieve Data Using SQL

20 heures
Moyenne

Licence

Ce cours est visible gratuitement en ligne.

course.header.alt.is_video

course.header.alt.is_certifying

J'ai tout compris !

Mis à jour le 29/08/2024

Aggregate your data using GROUP BY

Now let’s look at aggregation!

You will recall that an aggregation has two parts:

a set of partitioning attributes
one or more aggregate function(s)

Now we’re going to see how to implement these in SQL.

Aggregation

I must confess: this chapter is a little long—but I really want you to grasp these concepts!

The chapter can essentially be summed up in one line of code:

SELECT status, count(*) FROM entity GROUP BY status;

We have placed the partitioning attribute (here, status ) after the keyword GROUP BY , and the aggregate function COUNT() in the SELECT clause. Alternatively, we could add the partitioning attributes in the SELECT clause.

And now you know everything there is to know (well, almost)!

Thinking It Through...

First let’s test the aggregate functions and leave the partitioning attributes for later.

Aggregate functions

A little reminder: an aggregate function takes multiple rows and returns a single row.

For example, if we write…

SELECT max(incorporation_date) AS maxi FROM entity ;

…the max function takes the attribute incorporation_date , which has multiple rows (all of the rows of the entity table), and returns an attribute (we’ll call itmaxi ) on a single row.

You really should execute this query in order to fully understand it. ;)

Inputs and outputs

Most of the basic aggregate functions of SQL (that are predefined by the language) take a single attribute and also return a single attribute. This is the case for the max function we just saw.

However, there is a function that doesn’t adhere to this rule: the count function. The count function returns a whole number: the number of rows it received as input. To count the number of rows, there’s no need to know the value of any argument of these rows! So we simply assign it the character * :

SELECT count(*) FROM entity ;

A little puzzle

If entity contains 1,000 rows, how many rows will each of these two queries return?

SELECT incorporation_date FROM entity ;SELECT min(id) FROM entity ;SELECT incorporation_date, min(id) FROM entity ;

The first query will return 1,000 rows. The second will return only one row. However, for the third query, your RDBMS should return an error, because it won’t know whether to return 1 row or 1,000 rows.

On the other hand, this query will work, and return 1 row:

SELECT max(incorporation_date), min(id) FROM entity ;

This is because the two columns we are querying both result from an aggregate function.

Can we conclude, then, that in the SELECT clause, we can use an aggregate function only if all of the other columns also result from an aggregate function?

Yes! Unless your query contains a GROUP BY . If that’s the case, there’s more to consider.
And that’s what we’ll be looking at in the next section.

Partitioning attributes

In an aggregation, the partitioning attributes are specified in the GROUP BY clause: :

SELECT count(*) FROM entity GROUP BY status ;

Here, we are counting the number of companies for each of the statuses found in entity.

Yes, but this query only returns a column with numbers! We don’t know what the numbers correspond to.

That’s right! To solve this problem, the status value can be displayed as follows:

SELECT status, count(*) FROM entity GROUP BY status ;

... which brings us back to the rule we mentioned earlier, which is completed as follows to form a Golden Rule:

La règle d'or de l'agrégation

Let’s Continue Our Investigation

Ready to return to the main query of the previous chapter? You can do it!

SELECT
    i.id as intermediary_id,
    i.name as intermediary_name,
    e.id as entity_id,
    e.name as entity_name,
    e.status as entity_status
FROM 
    intermediary i,
    assoc_inter_entity a,
    entity e
WHERE
    a.entity = e.id
    AND a.inter = i.id
    AND e.name = 'Big Data Crunchers Ltd.' ;

You will recall that this links intermediaries to entities. It enabled us to find two intermediaries with identifiers 5000 and 5001.

Here, we want to know the number of entities (companies) these two intermediaries created in each of the jurisdictions (for our purposes, jurisdiction = country ;) ). So we’re going to have to use GROUP BY!

As you may have guessed, our partitioning attributes will be:

jurisdiction
intermediary (taking, for example, its identifier and its name)

Let’s begin by amending our previous query. First of all, this earlier query is limited to the entity Big Data Crunchers Ltd. Here, we want to analyze all of the entities created by the two intermediaries. So let’s replace row 14 with AND (i.id = 5000 OR i.id = 5001) .

Let’s add our three partitioning attributes in the GROUP BY clause, then delete all of the columns that are not partitioning attributes from the SELECT clause (that is, rows 4, 5, and 6).

All that’s left is our aggregate function, COUNT() , because we want to count the number of entities. Here’s the result:

SELECT
    i.id as intermediary_id, 
    i.name as intermediary_name,
    e.jurisdiction, 
    count(*) 
FROM 
    intermediary i,
    assoc_inter_entity a, 
    entity e 
WHERE 
    a.entity = e.id 
    AND a.inter = i.id 
    AND (i.id = 5000 OR i.id = 5001) 
GROUP BY 
    i.id, i.name, e.jurisdiction;

Go Further: Aggregate Functions and Scalar Functions

In the chapter covering the SELECT clause, we saw that it is possible to apply functions in the SELECT clause. Are these aggregate functions?

SELECT abs(id) AS valeur_absolue FROM entity ;

No. :( In this query (which returns the absolute value of the company identifier), abs returns a result for as many rows as it received as input. If there are 1,000 companies in the entity table, then there will be 1,000 absolute values to calculate. However, an aggregate function returns only a single row.

Therefore, the abs function is referred to as a scalar function.

Summary

In aggregations, the aggregate functions are placed in the SELECT clause ...
... and the partitioning attributes are placed in the GROUP BY clause.
In the SELECT clause, an aggregate function can be used only if all of the other columns also result from an aggregate function OR are present in the GROUP BY clause.