Section 10.5 Adding a Second Variable to the Model
We discussed at the beginning of this chapter the origin of Australian Rules Football in Victoria, where the MCG is located. While most of the teams in the AFL are also Victorian teams, and therefore have a supporter base that can easily reach the MCG, a number of the teams are from other states, and their supporters would need to make a significant interstate journey to see their team play at the MCG. For example, the journey from Sydney to Melbourne is around eight hours by car or two by plane, whereas the journey from Perth, where most of West Coast’s supporter base is located, is close to five hours by air, and two time zones away. Australia is a really huge country.
The dataset doesn’t have a variable for interstate teams, but fortunately only four teams are interstate: Brisbane, Sydney, Adelaide, and West Coast, abbreviated respectively as "Bris", "Syd", "Adel", and "WC". We can make a binary coded variable to indicate these interstate teams with a simple command:
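A minimal sketch of such a command is shown here, assuming the data sit in a data frame called attend (a hypothetical name) with a character column named Away. The toy rows and the local team codes "Coll", "Carl", and "Rich" are invented for illustration; only the four interstate abbreviations and the column name come from the text.

```r
# Hypothetical toy data: only the 'Away' column name and the four
# interstate codes ("Bris", "Syd", "Adel", "WC") come from the text.
attend <- data.frame(
  Away = c("Bris", "Coll", "Syd", "Carl", "WC", "Adel", "Rich"),
  stringsAsFactors = FALSE
)

# Store 1 when the away team exactly matches an interstate code, else 0.
# '==' tests exact equality and '|' is the element-wise logical OR.
attend$away.inter <- ifelse(
  attend$Away == "Bris" | attend$Away == "Syd" |
    attend$Away == "Adel" | attend$Away == "WC",
  1, 0
)

print(attend)
```

An equivalent and more compact test would be attend$Away %in% c("Bris", "Syd", "Adel", "WC"), which avoids repeating the column name for each comparison.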
The command checks the values in the column labeled 'Away', and if it finds an exact match with one of the names of an interstate team, it stores a value of 1. Otherwise it stores a value of 0. Note that we use a double equals sign for the exact comparison in R, and the vertical bar is used to represent the logical 'OR' operator. These symbols are similar to, although not precisely the same as, the symbols used to represent logical operators in programming languages such as C and Java. Having created the new 'Away team is interstate' variable, we can use it to create a new linear regression model that includes two independent variables.
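As a rough sketch, a two-predictor model can be fitted by listing both variables on the right-hand side of the lm() formula, separated by a plus sign. The data frame name games, the column name attendance, and every number below are hypothetical and invented for illustration, so the output will not reproduce the figures quoted in this section. Only the predictor names Members and away.inter come from the text.

```r
# Hypothetical toy data, invented for illustration only.
games <- data.frame(
  attendance = c(62000, 48000, 35000, 71000, 29000, 54000, 41000, 38000),
  Members    = c(110000, 85000, 80000, 125000, 70000, 95000, 90000, 75000),
  away.inter = c(0, 0, 1, 0, 1, 0, 1, 0)
)

# Dependent variable on the left of '~', both predictors on the right.
model <- lm(attendance ~ Members + away.inter, data = games)

# summary() prints the coefficient table and the r-squared value.
print(summary(model))
```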
Note that the r-squared value is now 0.9246, which is quite a bit higher than the 0.8847 that we observed in the previous model. In this new model, the two independent variables working together account for 92.46% of the variance in the dependent variable. So together, the total fan base and the interstate status of the away team are doing a really great job of predicting attendance. This result is also intuitive: we would expect that football fans, regardless of how devoted they are to their team, are more likely to come to games if they’re a moderate car ride away, compared to a plane journey.
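The comparison between the one-variable and two-variable models can be sketched by pulling r-squared out of each model summary. The data frame games, the column attendance, and all values below are hypothetical, so the printed numbers will differ from the 0.8847 and 0.9246 reported in the text.

```r
# Hypothetical toy data, invented for illustration only.
games <- data.frame(
  attendance = c(62000, 48000, 35000, 71000, 29000, 54000, 41000, 38000),
  Members    = c(110000, 85000, 80000, 125000, 70000, 95000, 90000, 75000),
  away.inter = c(0, 0, 1, 0, 1, 0, 1, 0)
)

# Fit the earlier one-predictor model and the new two-predictor model.
one_var <- lm(attendance ~ Members, data = games)
two_var <- lm(attendance ~ Members + away.inter, data = games)

# Extract r-squared from each summary object.
r2_one <- summary(one_var)$r.squared
r2_two <- summary(two_var)$r.squared

cat("R-squared, Members only:         ", round(r2_one, 4), "\n")
cat("R-squared, Members + away.inter: ", round(r2_two, 4), "\n")
```

Because r-squared can never go down when a predictor is added, the size of the increase matters more than the fact that there is one. The adjusted value, available as summary(two_var)$adj.r.squared, penalizes extra predictors.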
Because we have two independent variables now, we have to look beyond the r-squared value to understand the situation better. In particular, about one third of the way into the output for the lm() command there is a heading that says "Estimate." Right below that are slope values for Members and for away.inter. Notice that the slope (sometimes called a "B-weight") on Members is positive. This makes sense because the more fans the team has, the higher the attendance. The slope on away.inter is negative because when this variable is 1 (in the case of interstate teams) the attendance is lower, whereas when this variable is 0 (for local teams) the attendance is higher.
How can you tell if these slopes or B-weights are actually important contributors to the prediction? You can divide the unstandardized B-weight by its standard error to create a "t value". The lm() command has done this for you and it is reported in the output above. This "t" is the test statistic from Student’s t-test, described in a previous chapter. As a rule of thumb, if this t value has an absolute value (i.e., ignoring the minus sign if there is one) that is larger than about 2, you can be reasonably confident that the independent/predictor variable we are talking about is contributing to the prediction of the dependent variable. In this example we can see that Members has a humongous t value of 21.257, showing that it is very important in the prediction. The away.inter variable has a somewhat more modest, but still important, value of -4.545 (again, don’t worry about the minus sign when judging the magnitude of the t value).
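The division described above can be checked by hand against the coefficient table that summary() produces. This is a sketch on hypothetical data: the data frame games, the column attendance, and all values are invented, so the t values printed here will not match the 21.257 and -4.545 quoted in the text.

```r
# Hypothetical toy data, invented for illustration only.
games <- data.frame(
  attendance = c(62000, 48000, 35000, 71000, 29000, 54000, 41000, 38000),
  Members    = c(110000, 85000, 80000, 125000, 70000, 95000, 90000, 75000),
  away.inter = c(0, 0, 1, 0, 1, 0, 1, 0)
)
model <- lm(attendance ~ Members + away.inter, data = games)

# Coefficient table: one row per term, with columns
# "Estimate", "Std. Error", "t value", and "Pr(>|t|)".
coefs <- coef(summary(model))

# t value by hand: the B-weight divided by its standard error.
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Show the hand-computed values next to the ones lm() reports.
print(cbind(coefs[, c("Estimate", "Std. Error", "t value")], t_manual))

# Rule of thumb from the text: absolute t value larger than about 2.
print(abs(t_manual) > 2)
```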