What Is A Regression (aka Forecasting)?
In everyday use, a regression is a return towards a previous state. In statistics, a linear regression describes the trend in a set of data points as a straight line.
Regression analysis uses linear modeling to help us understand the relationship between at least two variables - a dependent variable and an independent variable.
Think cause and effect...
The cause or independent variable (x) affects the dependent variable (y).
Relationships (aka correlation or association) between two variables can be positive or negative. As one goes up, the other may go up (positive/direct). Or as one goes up, the other may go down (negative/inverse). Alternatively, there may be no relationship at all (equivalent to zero correlation).
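To make the three cases concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The numbers are hypothetical, invented only for illustration, and the function name `pearson_r` is our own.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical values, invented purely for illustration.
x = [1, 2, 3, 4, 5]
rises_with_x = [2, 4, 5, 4, 6]   # tends to go up as x goes up
falls_with_x = [6, 4, 5, 4, 2]   # tends to go down as x goes up
unrelated = [3, 1, 5, 1, 3]      # no linear trend in either direction

r_positive = pearson_r(x, rises_with_x)
r_negative = pearson_r(x, falls_with_x)
r_zero = pearson_r(x, unrelated)

print(round(r_positive, 3), round(r_negative, 3), round(r_zero, 3))
```

The sign of the result gives the direction of the relationship, and a value at or near zero means no linear relationship.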
For example, we might think that as a basketball player’s age increases, his or her points per game will decrease over time:
Here, age is the independent variable.
Points per game is the dependent variable because we believe it depends on a player’s age.
We might think that the regression line for this relationship would look like this (a downward sloping line), indicating an inverse or negative relationship:
Our thought would be that as age gets higher ⬆️, the points scored per game go down ⬇️.
However, when we run our regression analysis on a sample of 17 players - the regression line for our data points actually looks like this:
The relationship is actually positive! Our line goes up (not down)!
As the players' ages increase ⬆️, their points per game tend to increase ⬆️, indicating a direct or positive relationship.
How? Well as we can see, the data points start to diverge as age increases. The highest points per game were scored by some of the oldest players in our sample.
The high scores of the older players actually outweigh the lower scores of the younger players, causing our line to have a positive/upward slope, and revealing a positive relationship.
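The pull that a few high-scoring older players have on the line can be sketched with ordinary least squares, the standard method for fitting a regression line. The data below is hypothetical and is NOT the 17-player sample from this article. The helper `fit_line` is our own name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical ages and points per game (not the article's data set).
# Scores spread out as age increases, with two older high scorers.
ages = [20, 21, 22, 24, 26, 28, 30, 31, 33, 35]
points = [4, 3, 5, 2, 4, 1, 3, 9, 1, 10]

slope_all, intercept_all = fit_line(ages, points)

# Refit after dropping the two older high scorers (ages 31 and 35).
kept = [(a, p) for a, p in zip(ages, points) if (a, p) not in [(31, 9), (35, 10)]]
slope_without_top, _ = fit_line([a for a, _ in kept], [p for _, p in kept])

print("slope with all players:", round(slope_all, 3))
print("slope without the two older high scorers:", round(slope_without_top, 3))
```

In this made-up sample the slope is positive with everyone included and turns negative once the two older high scorers are removed, which is the same effect described above.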
How To Run A Regression In Excel
You can download the Excel worksheet with the above data set below:
You’ll need to be sure you’ve enabled Excel's Analysis ToolPak add-in, which provides the Data Analysis tools.
1. Click on the Data tab
2. Click on the Data Analysis tool
3. Click on the Regression tool
4. Select your regression input data and output options:
Select the data in the “Points Per Game” column as your dependent variable, Y.
Select the data in the “Age” column as your independent variable, X.
Select “Labels” so Excel doesn’t get confused by the text in the first row.
Check "Line Fit Plots" if you want to see a line plot of the points.
5. Click OK. And Voila! Your regression output should populate in a new worksheet.
How to Interpret Your Excel Output
The most important takeaways from the regression output include the following:
The strength of a regression model is expressed by its explanatory power - the percentage of the variation in the dependent variable (points per game) that is explained by the independent variable (age). R-squared gives us that percentage - in this case, only .0225 or 2.25%.
The explanatory power of our model is very weak. This suggests that other independent variables could be added to better explain the changes in the dependent variable (points per game), such as skill level or physical health.
Note that a weak explanatory power does not mean that a relationship does not exist between the variables. It simply means that other variables that are not included in our model may better explain (measure) the changes in the dependent variable.
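For readers curious about where the R-squared figure comes from, here is a minimal sketch of the standard calculation: one minus the ratio of the unexplained variation (squared residuals) to the total variation. The data is hypothetical, not the article's sample, and the function name `r_squared` is our own.

```python
def r_squared(xs, ys):
    """R-squared of a simple least squares line fitted to xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # Variation left over after the line has made its predictions.
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its own mean.
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_residual / ss_total

# Hypothetical ages and points per game (not the article's data set).
ages = [20, 21, 22, 24, 26, 28, 30, 31, 33, 35]
points = [4, 3, 5, 2, 4, 1, 3, 9, 1, 10]

r2 = r_squared(ages, points)
print(f"R-squared: {r2:.4f} ({r2:.2%} of the variation explained)")
```

A value near 0 means the line explains very little of the variation, and a value near 1 means it explains almost all of it.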
The adjusted R-squared accounts for the number of independent variables in the model relative to the sample size. It is never higher than the R-squared, and it is usually close to it, although the gap widens when the sample is small or the model has many independent variables.
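As a rough check, the standard adjusted R-squared formula can be applied to the figures quoted in this article (R-squared of .0225, 17 players, one independent variable). This is a sketch based on the rounded R-squared, so Excel's own output is the figure to trust. The function name is our own.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Standard adjustment: penalizes R-squared for each predictor used."""
    return 1 - (1 - r_squared) * (n_observations - 1) / (n_observations - n_predictors - 1)

# Figures quoted in the article: R-squared .0225, 17 players, 1 predictor (age).
adj = adjusted_r_squared(0.0225, 17, 1)
print(round(adj, 4))
```

With a small sample and a weak model like this one, the adjusted value can even dip below zero, which simply signals that the model explains essentially none of the variation.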
In our model, we have only one independent variable (aka a simple regression model). You can have multiple independent variables (aka multiple regression or multivariate analysis) - however, only one dependent variable can be examined at a time in any regression analysis.
Direction (➕ or ➖)
The direction and rate of change of the relationship between each independent variable and the dependent variable are given by that variable's regression coefficient. The direction can be positive or negative.
In our case, our only independent variable is age.
The coefficient for age is 0.07. Because our coefficient is positive, we know that the direction of the relationship between age and points per game is positive.
Rate of Change 📈
The coefficient for age also tells us the rate of change between age and points per game.
Specifically, our coefficient can be interpreted as:
For every one unit (year) increase in our independent variable (age), our dependent variable (points per game) increases by .07 points.
The intercept is the very beginning value of the line. It is the value of the dependent variable (y), when the independent variable (x) is equal to zero.
In our example, the coefficient for the intercept is 1.3. This is the predicted points per game, when age is equal to zero.
This would mean that a newborn (age zero) is predicted to score 1.3 points per game - which is obviously unrealistic.
Because an age of zero falls outside of the range of the dataset (none of the players in our sample has an age of zero), the regression formula "extrapolates" or predicts a value that makes mathematical sense, but not necessarily logical sense. (More on this when we get to forecasting.)
For this reason, the intercept is not always meaningful.
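Putting the intercept and the age coefficient together gives the regression line itself. Here is a minimal sketch using the two values reported above (1.3 and .07). The function name and the example ages are our own choices for illustration.

```python
INTERCEPT = 1.3          # predicted points per game at age zero (from the output above)
AGE_COEFFICIENT = 0.07   # change in points per game for each extra year of age

def predicted_points_per_game(age):
    """Point on the fitted line: intercept + coefficient * age."""
    return INTERCEPT + AGE_COEFFICIENT * age

# Example ages chosen for illustration only.
for age in (0, 25, 26):
    print(age, round(predicted_points_per_game(age), 2))

# The gap between any two consecutive ages is the coefficient itself.
one_year_change = predicted_points_per_game(26) - predicted_points_per_game(25)
```

An age of 0 returns the intercept, which is the extrapolated value discussed above, and each additional year adds .07 to the prediction.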
Lastly, and most importantly, we have to check to see if our results are statistically significant - meaning that they are unlikely to be the product of chance alone.
To understand that, we focus on the p-value: the probability of seeing a relationship at least as strong as the one in our sample if there were actually no relationship at all. We want that probability to be as low as possible (typically less than .05 or 5%). This helps guard against mistaking random noise in a sample for a real finding.
We can check the significance of the overall model, as well as the significance for each of our independent variables. (In our simple regression example the only independent variable is age.)
Significance of the Model
We are able to check the significance of the overall model by looking at the Significance F value (aka p-value, or probability value of F). In our example, our Significance F is .565 which means the model is not statistically significant. To be statistically significant, the p-value would need to be less than 5% or .05.
(Regression analysis uses the F statistic, which is also used for ANOVA testing.)
Significance of Independent Variables
The p-value (probability value) for each independent variable tells us whether or not its relationship to the dependent variable is statistically significant (meaning unlikely to be due to chance alone).
In our example, the p-value for age is .565, indicating that if age and points per game were truly unrelated, a sample would still show a relationship at least this strong about 56.5% of the time. This means that the relationship is not statistically significant. To be statistically significant, the p-value would need to be less than 5% or .05.
When you think about it, it makes sense that the coefficient for age (.07) is not statistically significant. We would not expect each player to score .07 points higher every year that they age. Age alone would not necessarily cause a player's score to increase. Other factors, such as skill level and health, should be considered.
Finally, notice the significance value for our independent variable age and the overall model itself are both .565. This is because we ran a simple regression, with just one independent variable (age). If additional independent variables were added (making it a multiple regression model), the significance levels would be different.
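The reason the two values match can be sketched numerically. In a simple regression, the F statistic equals the square of the t statistic for the single coefficient, so both tests give the same p-value. The sketch below rebuilds F from the rounded R-squared (.0225) and sample size (17) quoted in this article, then approximates the p-value by numerically integrating the t distribution. The helper names are our own, and Excel computes all of this for you.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value using Simpson's rule on the density from 0 to |t|."""
    t_stat = abs(t_stat)
    h = t_stat / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t_stat, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return 1 - 2 * area_zero_to_t

# Figures quoted in the article (R-squared is rounded, so results are approximate).
r_squared = 0.0225
n_players = 17
df_residual = n_players - 2  # one slope and one intercept are estimated

f_stat = r_squared / (1 - r_squared) * df_residual
t_stat = sqrt(f_stat)        # magnitude of the t statistic for age
p_value = two_sided_p(t_stat, df_residual)

print("F:", round(f_stat, 3), "|t|:", round(t_stat, 3), "p:", round(p_value, 3))
```

The result should land close to the .565 reported for both the model and the age variable. With more than one independent variable, F would no longer be the square of a single t statistic, so the values would differ.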
Regression analysis examines the relationship between independent and dependent variables (how one changes with the other).
It allows us to answer the following questions:
Do the variables change in opposite directions or the same direction? Alternatively, are changes in the variables not related at all?
If they are related, by how much does an increase in one variable either increase or decrease the other variable?
Are the relationships between the variables significant?
How reliable is our model? Is it significant? How well does it explain changes to the dependent variable? Can we use our model to make accurate predictions?
In our example, there are many factors that could have led to both the low explanatory power of the model, and statistical insignificance found for the relationship between age and points per game.
Adding variables to the model (creating a multiple regression analysis) would account for some of the other factors contributing to the changes in points per game. This would result in a larger percentage of the changes in points per game being explained by the variables in the model, aka a higher R-squared.
Increasing the sample size would also allow for more accurate and possibly more significant results. A common rule of thumb is a minimum sample size of 30. Ours was only 17.
Next, let's learn how to use a regression model to make predictions/forecasts about data that is not included in our data set!