.. Copyright (C) Google, Runestone Interactive LLC
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0
International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/4.0/.
Aggregating
===========
Sheets provides aggregating/grouping functionality via pivot tables, which you
have :ref:`seen in this section`. Aggregating data means to
collect multiple units into one. Again, SQL provides the same
functionality as Sheets through keywords. In this case, the ``GROUP BY``
statement allows you to group and aggregate data. For example, you can use
``GROUP BY`` to count how many bike trips were taken for each subscriber type,
or to calculate the mean duration for trips starting from each station.
The following query counts the number of trips (``COUNT(*)``) for each member
type (``GROUP BY member_type``).
.. activecode:: bikeshare_count_trips_per_member_type
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
SELECT
member_type,
COUNT(*) AS n_trips
FROM
trip_data
GROUP BY
member_type
The result of this query has one rows for each distinct ``member_type``, and the
column named ``n_trips`` counts the number of rows for each type.
Whenever you use ``GROUP BY``, SQL expects you to use an aggregating function.
This is equivalent to which function is used to summarize data in a Sheets pivot
table. The most common examples are listed below.
- ``COUNT`` counts the rows
- ``SUM`` adds the values of a numeric column
- ``MIN`` calculates the minimum of a numeric column
- ``MAX`` calculates the maximum of a numeric column
- ``AVG`` calculates the mean of a numeric column
- ``MEDIAN`` calculates the median of a numeric column (only available in some
SQL servers)
For example, the query below calculates the minimum and maximum trip duration
for each start station.
.. activecode:: bikeshare_min_and_max_trip_duration_by_start_station
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
SELECT
start_station,
MIN(duration) AS minimum_duration,
MAX(duration) AS maximum_duration
FROM
trip_data
GROUP BY
start_station
.. activecode:: bikeshare_station_with_earliest_start_date
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
Write a query to find start station with the earliest start date.
~~~~
====
assert 0,0 == 31620
You must include the ``GROUP BY`` column(s) in your ``SELECT`` clause. All
columns that are not in the ``GROUP BY`` clause must contain an aggreating
function.
If you want to calculate a summary statistic for the entire population, you can
use an aggregating function without the ``GROUP BY`` clause. For example, the
query below calculates the mean duration of all trips.
.. activecode:: bikeshare_average_duration
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
SELECT
AVG(duration) AS mean_duration
FROM
trip_data
You can of course use filtering and ordering with aggregating functions. The
query below calculates the mean trip duration just for trips with a casual
member type.
.. activecode:: bikeshare_mean_duration_for_casual_member_type
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
SELECT
AVG(duration)
FROM
trip_data
WHERE
member_type = 'Casual'
.. activecode:: bikeshare_first_trip_from_station_31111
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
What is the total duration of all trips taken on bike W01274?
~~~~
====
assert 0,0 == 5009
You can then combine this with the ``GROUP BY`` clause as you like. Filtering
with the ``GROUP BY`` clause can serve multiple purposes. Either way, you must
put the ``WHERE`` clause before the ``GROUP BY`` clause.
If you filter by a column that is not grouped by, you can filter the aggregated
results to a subset of the data. For example, you can calculate the sum of all
trip durations, by start station, for a specific end station.
.. activecode:: bikeshare_filter_and_group_by_different_columns
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
SELECT
start_station,
SUM(duration) AS total_duration
FROM
trip_data
WHERE
end_station = 31111
GROUP BY
start_station
If you filter by a column that is being grouped by, you can filter the
aggregated results to just show the row you are interested in. For example, you
can display the total trip duration for each pair of start station and end
station, just for trips that end at station 31111.
.. activecode:: bikeshare_filter_and_group_by_same_column
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
SELECT
start_station,
end_station,
SUM(duration) AS total_duration
FROM
trip_data
WHERE
end_station = 31111
GROUP BY
start_station,
end_station
.. activecode:: bikeshare_debug_group_by_query
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
Find and fix the error(s) in the following code, which is trying to calculate
the mean trip duration for trips by member type Member.
~~~
SELECT
AVG(duration)
FROM
trip_data
GROUP BY
member_type
====
assert 0,0 == 772.0053481492348
It can also be useful to combine aggregating functions with the ``ORDER BY``
clause. This can allow you to sort by an aggregated column to find the minimum
or maximum.
.. shortanswer:: bikeshare_explain_group_by_query
Explain what the following query will return.
.. code-block:: sql
SELECT
start_station,
AVG(duration) AS mean_duration
FROM
trip_data
WHERE
duration >= 3600
GROUP BY
start_station
ORDER BY
AVG(duration) DESC
Note that you can also use the column alias in the ``ORDER BY`` clause. The
above query would have been identical (in almost all SQL servers) if it had been
``ORDER BY mean_duration DESC``.
.. activecode:: bikeshare_bike_with_most_trips
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
what is the bike number and trip count of the bike number with the most trips?
~~~~
====
assert 0,0 == W00893
assert 0,1 == 548
.. activecode:: bikeshare_mean_duration_trips_with_same_start_and_end
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
For the station with the most trips that started and ended at the same
station, find the mean duration (in seconds) of all trips, rounded to the
nearest whole number.
~~~~
====
assert 0,0 == 31217
assert 0,3 == 5164
Extension: Counting Unique Values
---------------------------------
It is often helpful to not only count the number of rows, but count the number
of unique values of a column. You can do this using the ``DISTINCT`` keyword.
To count the distinct values of a column, you can simply use ``COUNT`` along
with ``DISTINCT``. For example, the query below counts the number of bike
numbers used.
.. activecode:: bikeshare_count_distinct_bikes
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
SELECT
COUNT(DISTINCT bike_number) AS n_distinct_bikes
FROM
trip_data
This can also be used alongside to count the unique values for each group. For
example, the following query counts the unique bike numbers used for each start
station.
.. activecode:: bikeshare_count_distinct_bikes_by_start_station
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
SELECT
start_station,
COUNT(DISTINCT bike_number) AS n_distinct_bikes
FROM
trip_data
GROUP BY
start_station
.. activecode:: bikeshare_number_of_start_stations
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
How many unique start stations are in the table?
~~~~
====
assert 0,0 == 140
.. activecode:: bikeshare_unique_end_stations_per_start_station
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
Write a query to display the number of unique end stations per each start
station.
~~~~
====
assert 0,0 == 31000
assert 0,1 == 36
It is sometimes also useful to select the distinct rows of a table. You can do
that by using ``DISTINCT`` without an aggregating function.
.. activecode:: bikeshare_select_distinct_bikes
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
SELECT
DISTINCT bike_number
FROM
trip_data
Note that selecting distinct values is equivalent to selecting grouped values.
.. activecode:: bikeshare_distinct_bike_using_group_by
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
SELECT
bike_number
FROM
trip_data
GROUP BY
bike_number
Extension: The HAVING Clause
----------------------------
You can never include an aggregating function within the ``WHERE`` clause. For
example, suppose you are trying to show all start stations and the number of
trips, but only for start stations with over 100 trips. The following query
produces an error, since the ``WHERE`` condition includes the ``COUNT``
function.
.. activecode:: bikeshare_where_with_aggregation
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
SELECT
start_station,
COUNT(*) AS n_trips
FROM
trip_data
WHERE
COUNT(*) > 100
GROUP BY
start_station
The ``HAVING`` clause can be used instead here. The query below executes
successfully.
.. activecode:: bikeshare_having_over_100_trips
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
SELECT
start_station,
COUNT(*) AS n_trips
FROM
trip_data
GROUP BY
start_station
HAVING
COUNT(*) > 100
As with the ``ORDER BY`` clause, ``HAVING`` is an operation on the results.
While ``WHERE`` filters the data being queried, ``HAVING`` filters the results
based on the value of an aggregating function. The ``HAVING`` clause can only be
used immediately following the ``GROUP BY`` clause.
.. activecode:: bikeshare_bikes_having_over_1_hour
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
Write a query to show the bikes that have been used for over 1 hour (3600
seconds), and the total duration they have been ridden. Order the result
from shortest total duration to longest
~~~~
====
assert 0,0 == W01204
assert 0,1 == 3765
Extension: Numbered Column Aliases
----------------------------------
This section covers no new concepts, but introduces a convenient shorthand
notation. When using ``GROUP BY`` and ``ORDER BY``, you can often (this is
supported in almost all SQL servers, but not all) reference columns using the
number in which they are selected. For example, the following two queries are
identical, listing member types by descending number of trips.
The first query uses the column names in the ``GROUP BY`` and ``ORDER BY``.
.. code-block:: sql
SELECT
member_type,
COUNT(*) AS n_trips
FROM
trip_data
GROUP BY
member_type
ORDER BY
n_trips DESC
The second query uses the column number in the ``GROUP BY`` and ``ORDER BY``.
.. code-block:: sql
SELECT
member_type,
COUNT(*) AS n_trips
FROM
trip_data
GROUP BY
1
ORDER BY
2 DESC
Note that ``1`` refers to the first column being selected, ``member_type``, and
``2`` refers to the second column being selected, ``n_trips``.
You can always use the column names in these clauses, but the numbered aliasing
can be extremely useful, especially if you have several columns. Note that you
can use this notation only with the ``GROUP BY`` and ``ORDER BY`` clauses. Using
such notation anywhere else will produce an error.
.. shortanswer:: bikeshare_explain_numbered_aliases
Explain what question the following query is answering.
.. code-block:: sql
SELECT
start_station,
end_station,
AVG(duration) AS mean_duration
FROM
trip_data
GROUP BY
1, 2
ORDER BY
3 DESC