.. Copyright (C) Google, Runestone Interactive LLC
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0
International License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/4.0/.
.. _joining_sql:
Joining
=======
It is frequently the case that the data we need is spread across multiple tables
in a database. For example, along with the ``trip_data`` table you have already
seen with data on each trip, there might be a table storing additional
information the locations of the start and end stations, called
``bikeshare_stations``.
.. TODO(https://github.com/RunestoneInteractive/RunestoneComponents/issues/917):
Fix the table heading alignment.
========== =================================================
Field Name Description
========== =================================================
station_id Unique identifier of the station
name Public name of the station
status Status of the station (either 'open' or 'closed')
latitude latitude of the station
longitude longitude of the station
========== =================================================
.. activecode:: bikeshare_rows_from_stations_table
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
SELECT
*
FROM
bikeshare_stations
LIMIT
10
.. TODO(raskutti): Link to Module B section on joining.
In Sheets, when you need to join data across multiple spreadsheets, you use
``VLOOKUP``. In SQL, you use the ``JOIN`` keyword. ``JOIN`` is used to join
multiple tables on a common column. You specify the common column on which you
wish to join using the ``ON`` keyword.
Suppose you want to find the number of bikeshare trips that originated at the
*Van Ness Metro / UDC* station. You can join the ``start_station`` field from
the ``trip_data`` table to the ``station_id`` field in the
``bikeshare_stations`` table.
.. activecode:: bikeshare_trips_from_van_ness_metro_udc
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
SELECT
COUNT(*) AS n_trips
FROM
trip_data AS trips
INNER JOIN
bikeshare_stations AS stations
ON
trips.start_station = stations.station_id
WHERE
stations.name = 'Van Ness Metro / UDC'
There is a lot of new material to unpack in this query.
- Both tables are aliased using the ``AS`` keyword. (The ``trip_data`` table
is named ``trips`` and the ``bikeshare_stations`` table is named
``stations``.) Whenever you use ``JOIN``, it is good practice to alias all
tables used in the join. That way, columns can always be according to which
table they appear in.
- An ``INNER JOIN`` returns only rows for which the join key appears in both
tables. For example, if a row in the ``trips`` table has station 12345 as
the ``start_station`` but it does not appear as a ``station_id`` in the
``stations`` table, this row will not be returned. The converse is also
true. `You can read more about other types of joins here.`_
- The ``ON`` clause uses the table names to reference the columns. SQL looks
for rows in the ``stations`` table where the ``station_id`` matches each of
the ``start_station`` values in the ``trips`` table.
- The ``WHERE`` clause limits the query to count rows where the start station
is *Van Ness Metro / UDC*. Note that the table alias ``trips`` is used here
again to reference the column.
.. shortanswer:: bikeshare_explain_join
What question is the following query answering?
.. code-block:: sql
SELECT
trips.start_station,
AVG(duration) AS mean_duration
FROM
trips_data AS trips
INNER JOIN
bikeshare_stations AS stations
ON
trips.start_station = stations.station_id
WHERE
stations.status = 'closed'
.. activecode:: bikeshare_number_of_trips_from_adams_mill_columbia
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
How many trips were started at *Adams Mill & Columbia Rd NW*?
~~~~
====
0,0 == 9883
.. activecode:: bikeshare_mean_duration_by_station_name
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
Write a query to display the mean duration of trip for each start station
name. For example, one row could read as ``White House Station | 12345``.
sort the result by the average duration in ascending order.
~~~~
====
assert 0,0 == 15th St & Massachusetts Ave SE
assert 0,1 == 446.11764705882354
.. activecode:: bikeshare_start_station_name_with_most_trips_by_casual
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
What is the name of the station has the most trips by riders with member type as casual?
~~~~
====
assert 0,0 == USDA / 12th & Independence Ave SW
assert 0,1 == 3609
.. activecode:: bikeshare_open_vs_closed_stations
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
Write a query to compare, in terms of number of trips and mean duration,
stations that are listed as open and closed.
~~~~
====
assert 0,0 == closed
assert 1,0 == open
assert 0,1 == 13440
assert 1,1 == 395482
Extension: Subqueries
---------------------
It is sometimes the case that you can’t accomplish everything you want to within
one query. In these cases, in the same way that you can wrap one Sheets function
within another, you can wrap one query within another. The inner query is often
referred to as a subquery.
One use case for a subquery is within the ``WHERE`` clause. This can be used
instead of joining, and can often be more efficient. For example, the following
query calculates the mean duration of trips that start at an open station.
.. activecode:: bikeshare_subquery_mean_duration_open_stations
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
SELECT
COUNT(*) AS n_trips
FROM
trip_data
WHERE
start_station IN (
SELECT
station_id
FROM
bikeshare_stations
WHERE
status = 'open'
)
This could equivalently be done by joining the ``trips`` table to the
``stations`` table on ``trips.start_station = stations.station_id``, then
filtering on ``WHERE stations.status = 'open'``. However, this involves joining
every row of both tables. While using the subquery means having two distinct
``SELECT`` steps, each step involves less data, since each ``WHERE`` clause
filters each table down to much fewer rows.
Another common use case for subqueries involves aggregating functions. For
example, suppose you want to calculate the percentage of trips that start at
each station. While you can calculate the number of trips that start at each
start station in a single query, and the the total number of trips in a single
query, it is complex to calculate both in a single ``SELECT`` statement.
However, a subquery can be used to simply and logically accomplish this.
.. activecode:: bikeshare_subquery_proportion_trips_per_start_station
:language: sql
:dburl: /runestone/books/published/ac1/_static/bikeshare.db
SELECT
start_station,
100.0 * n_trips / (SELECT COUNT(*) FROM trip_data) AS percentage_trips
FROM (
SELECT
start_station,
COUNT(*) AS n_trips
FROM
trip_data
GROUP BY
start_station
)
The above query contains three ``SELECT`` statements. The “inner query” counts
the number of trips for each start station. There is also a one-line subquery
that sums the total number of trips. The “outer query” combines the two
subqueries to calculate the percentage of total trips that start at each start
station.
.. _You can read more about other types of joins here.: http://www.sql-join.com/sql-join-types