Aggregating¶
Sheets provides aggregating/grouping functionality via pivot tables, which you
have seen in this section. Aggregating data means to
collect multiple units into one. Again, SQL provides the same
functionality as Sheets through keywords. In this case, the GROUP BY
statement allows you to group and aggregate data. For example, you can use
GROUP BY
to count how many bike trips were taken for each subscriber type,
or to calculate the mean duration for trips starting from each station.
The following query counts the number of trips (COUNT(*)
) for each member
type (GROUP BY member_type
).
The result of this query has one rows for each distinct member_type
, and the
column named n_trips
counts the number of rows for each type.
Whenever you use GROUP BY
, SQL expects you to use an aggregating function.
This is equivalent to which function is used to summarize data in a Sheets pivot
table. The most common examples are listed below.
COUNT
counts the rowsSUM
adds the values of a numeric columnMIN
calculates the minimum of a numeric columnMAX
calculates the maximum of a numeric columnAVG
calculates the mean of a numeric columnMEDIAN
calculates the median of a numeric column (only available in some SQL servers)
For example, the query below calculates the minimum and maximum trip duration for each start station.
You must include the GROUP BY
column(s) in your SELECT
clause. All
columns that are not in the GROUP BY
clause must contain an aggreating
function.
If you want to calculate a summary statistic for the entire population, you can
use an aggregating function without the GROUP BY
clause. For example, the
query below calculates the mean duration of all trips.
You can of course use filtering and ordering with aggregating functions. The query below calculates the mean trip duration just for trips with a casual member type.
You can then combine this with the GROUP BY
clause as you like. Filtering
with the GROUP BY
clause can serve multiple purposes. Either way, you must
put the WHERE
clause before the GROUP BY
clause.
If you filter by a column that is not grouped by, you can filter the aggregated results to a subset of the data. For example, you can calculate the sum of all trip durations, by start station, for a specific end station.
If you filter by a column that is being grouped by, you can filter the aggregated results to just show the row you are interested in. For example, you can display the total trip duration for each pair of start station and end station, just for trips that end at station 31111.
It can also be useful to combine aggregating functions with the ORDER BY
clause. This can allow you to sort by an aggregated column to find the minimum
or maximum.
Note that you can also use the column alias in the ORDER BY
clause. The
above query would have been identical (in almost all SQL servers) if it had been
ORDER BY mean_duration DESC
.
Extension: Counting Unique Values¶
It is often helpful to not only count the number of rows, but count the number
of unique values of a column. You can do this using the DISTINCT
keyword.
To count the distinct values of a column, you can simply use COUNT
along
with DISTINCT
. For example, the query below counts the number of bike
numbers used.
This can also be used alongside to count the unique values for each group. For example, the following query counts the unique bike numbers used for each start station.
It is sometimes also useful to select the distinct rows of a table. You can do
that by using DISTINCT
without an aggregating function.
Note that selecting distinct values is equivalent to selecting grouped values.
Extension: The HAVING Clause¶
You can never include an aggregating function within the WHERE
clause. For
example, suppose you are trying to show all start stations and the number of
trips, but only for start stations with over 100 trips. The following query
produces an error, since the WHERE
condition includes the COUNT
function.
The HAVING
clause can be used instead here. The query below executes
successfully.
As with the ORDER BY
clause, HAVING
is an operation on the results.
While WHERE
filters the data being queried, HAVING
filters the results
based on the value of an aggregating function. The HAVING
clause can only be
used immediately following the GROUP BY
clause.
Extension: Numbered Column Aliases¶
This section covers no new concepts, but introduces a convenient shorthand
notation. When using GROUP BY
and ORDER BY
, you can often (this is
supported in almost all SQL servers, but not all) reference columns using the
number in which they are selected. For example, the following two queries are
identical, listing member types by descending number of trips.
The first query uses the column names in the GROUP BY
and ORDER BY
.
SELECT
member_type,
COUNT(*) AS n_trips
FROM
trip_data
GROUP BY
member_type
ORDER BY
n_trips DESC
The second query uses the column number in the GROUP BY
and ORDER BY
.
SELECT
member_type,
COUNT(*) AS n_trips
FROM
trip_data
GROUP BY
1
ORDER BY
2 DESC
Note that 1
refers to the first column being selected, member_type
, and
2
refers to the second column being selected, n_trips
.
You can always use the column names in these clauses, but the numbered aliasing
can be extremely useful, especially if you have several columns. Note that you
can use this notation only with the GROUP BY
and ORDER BY
clauses. Using
such notation anywhere else will produce an error.