9.4. Reduce¶
Reduce is a unary operator that, like Project, is column oriented. Unlike Project, Reduce produces a result relation with a different base and identifier than its input relation’s base and identifier. With Reduce you specify one or two lists of column names as inputs:
The columns (original or computed) forming the result relation’s identifier.
The other, non-identifying (original or computed) columns, if any, you want to carry along. (I call them “carry” columns.)
A Reduce’s result relation has these notable name and structure characteristics:
Name characteristics:
Its identifier is formed from the specified columns. Its identifier must differ from the input relation’s identifier. (If they were the same then you would be Projecting. Also, renaming identifying columns is “not same” too.)
Its base must differ from the input relation’s base.
Its row modifier is formed from the input relation’s base.
Its column modifier is formed from the specified columns.
Structure characteristics:
It is as wide as the number of columns named together in the lists.
It is generally shorter than the input relation, but in some cases may be just as tall. It cannot be a zero-row relation unless the input relation has zero rows.
Important
The first bullet is important. Unlike for Project where the bases and identifiers must be the same, for Reduce they must differ. Some novices never distinguish between Project and Reduce, and so fail to progress towards mastery.
9.4.1. Example¶
We will continue using our old friend, the creature relation:
creatureId |
creatureName |
creatureType |
reside_townId |
idol_creatureId |
---|---|---|---|---|
1 |
Bannon |
person |
p |
10 |
2 |
Myers |
person |
a |
9 |
3 |
Neff |
person |
be |
NULL |
4 |
Neff |
person |
b |
3 |
5 |
Mieska |
person |
d |
10 |
6 |
Carlis |
person |
p |
9 |
7 |
Kermit |
frog |
g |
8 |
8 |
Godzilla |
monster |
t |
6 |
9 |
Thor |
superhero |
as |
NULL |
10 |
Elastigirl |
superhero |
mv |
13 |
11 |
David Beckham |
person |
le |
9 |
12 |
Harry Kane |
person |
le |
11 |
13 |
Megan Rapinoe |
person |
sw |
10 |
English Query:
Find the creatureType of each Creature.
The corresponding chart for this looks like this:
Note what is shown in the operator: the new identifying column (can be columns) is creatureType, no additional columns are carried along in the result (because nothing is written after the word carry), and no additional computed columns are needed.
Also note how we changed the base noun of the result relation. Deciding what this should be can be tricky. You need to think carefully about what is now identifying the resulting relation. Using the phrase ‘of creature’ is important so that you know where the column value came from. Such precision seems overkill in simple examples, but is invaluable in real situations with dozens or hundreds of raw and result relations.
An alternate name for the result relation is “creatureType of one or more Creatures.” The latter name is longer, but it makes explicit what the former only implies.
Corresponding SQL:
Explanation of the SQL queries:
The query in the first tab displays a correct result relation. Why 4 rows? The reason is that in Creature there are 13 creatureType values, but there are only 4 distinct ones, and Reduce should remove duplicate rows automatically. It must do so because, if it did not, as in the table (not relation) shown in the second tab, then the result could not be a relation, because it violates the rules for identifier – more than one row contains “person.” You might imagine that Reduce using the distinct keyword in SQL first sorts the rows into like-valued bunches, then works on each bunch of perhaps- many duplicate rows and put one copy in the result relation.
Thus, in the first tab, the keyword DISTINCT is absolutely necessary. The columns we wish to keep in the result are in a comma-separated list after the keywords SELECT DISTINCT. The input relation, in this case creature, is shown after the keyword FROM.
9.4.2. Let’s think about this operator¶
Many times the result is shorter than the original input after removing duplicates, but not always. Let’s look at the data for the Contribution relation:
creatureId |
achId |
skillCode |
roleName |
---|---|---|---|
1 |
19 |
TR4 |
anchor leg |
2 |
18 |
TR4 |
third leg |
4 |
16 |
TR4 |
first leg |
5 |
17 |
TR4 |
second leg |
7 |
14 |
B2 |
pilot |
7 |
22 |
D3 |
first speaker |
9 |
12 |
THR |
right leg |
9 |
15 |
B2 |
brakeman |
9 |
20 |
D3 |
second speaker |
10 |
13 |
THR |
left leg |
13 |
21 |
D3 |
team captain |
The nature of this data at this particular time would given us a result with the same height if we asked this:
Find each roleName of Contribution.
Make sure you see why this is- this can happen.
9.4.3. Comparing Reduce to Project¶
Look at these two precedence charts as a means to solidify the difference between the Project and Reduce operators.
Note how the base of the result relation changes with reduce, along with what identifies it. We use a different symbol for the reduce operator so that we see this more readily. Note how each of them results in a relation with skills in them, represented by their skillCode. But the one on the left is all the skills, and the name of the one on the right needed to be changed to make it clear that it only contains achieved skills.
9.4.4. Naming the result relation¶
Here is another reduce query:
Find each creatureName of one or more creatures
Draw this one for yourself.You might have been tempted to keep Creature as its base but cannot because creatureName is not Creature’s identifier.
Important
Seeing the difference between bases is necessary for mastery. Beware, some students skim this crucial notion to get to the snazzier operators and never grasp the fundamental importance of choosing precise names. You need to become sensitive to language and, in particular, to internalize that Reduce yields a different base than the input relation, while Project yields the same base.
9.4.5. Bad Case: Reduce is dangerous!¶
Consider what happens if you Reduce a relation and specify an identifying column with a NULL value in it. Let’s use the town relation from our small database as an example. Here is what is currently in it.
townId |
townName |
State |
Country |
townNickname |
townMotto |
---|---|---|---|---|---|
a |
Anoka |
MN |
United States |
Halloween Capital of the world |
NULL |
as |
Asgard |
NULL |
NULL |
Home of Odin’s vault |
Where magic and science are one in the same |
b |
Bemidji |
MN |
United States |
B-town |
The first city on the Mississippi |
be |
Blue Earth |
MN |
United States |
Beyond the Valley of the Jolly Green Giant |
Earth so rich the city grows! |
d |
Duluth |
MN |
United States |
Zenith City |
NULL |
g |
Greenville |
MS |
United States |
The Heart & Soul of the Delta |
The Best Food, Shopping, & Entertainment In The South |
le |
London |
England |
United Kingdom |
The Smoke |
Domine dirige nos |
mv |
Metroville |
NULL |
NULL |
Home of the Incredibles |
Still Standing |
p |
Philadelphia |
PA |
United States |
Philly |
Let brotherly love endure |
sw |
Seattle |
Washington |
United States |
The Emerald City |
The City of Goodwill |
t |
Tokyo |
Kanto |
Japan |
NULL |
NULL |
The following query and its chart cannot result in a relation (so it is drawn with a rounded rectangle to represent a table). When you try the SQL code, you get an SQL table– note the row with null values.
English query:
Find each state and country of town.
9.4.6. Bad Case: Computed columns in Reduce are dangerous.¶
A Reduce, like a Project, can compute columns, although this is a convenience, not a necessity, because a Project preceding the Reduce can serve to compute columns. Such computed columns are dangerous, since they could have NULL, or non-singleton values. It is best to avoid this practice unless you are absolutely certain neither of these is the case. For most data, you never can be sure of that some value somewhere is NULL or that when you reduced you eliminated some duplicates, rendering a computation in error.
9.4.7. Exercises¶
Work on precedence charts for these. Decide which of these is bad. Note that for each one of these, there will be one input relation and one result relation from the Reduce operator.
Beware: though you may be tempted to just write SQL, show some restraint and simply draw the charts. To help with this, try making a copy of this drawio operator template in your browser. To do this, you will need to access this Google Drive file and make a copy in your own Google Drive space. You can use it in diagrams.net.
Find each country of town.
Find each creatureName of creature.
Find each creatureId of creature who has achieved.
Find each achieved skillCode.
Find each Achieved skill of creature with its proficiency (id: skillCode, creatureId).
If you haven’t yet, you can download This Small Creature Database Schema
and keep a copy for your own reference as you work on the precedence charts.