Part 1: Foundations of Regression
This first module introduces the core building blocks of linear regression. We will explore the fundamental concepts of variables, relationships, and how we can represent them visually. Understanding these basics is the crucial first step to mastering regression analysis.
1.1 Variables: The Language of Data
In statistics, a variable is any characteristic, number, or quantity that can be measured or counted. We typically work with two types in linear regression:
- Independent Variable (Predictor, $X$): The variable that you believe influences another variable. You use it to make predictions.
- Dependent Variable (Response, $Y$): The variable you are trying to predict or explain. Its value is thought to depend on the independent variable.
Example:
If we want to predict a student's exam score based on the number of hours they studied:
• X (Independent): Hours Studied
• Y (Dependent): Exam Score
1.2 Visualizing Relationships: The Scatter Plot
Before building a model, we must first see if a relationship exists between our variables. The best tool for this is a scatter plot. It plots pairs of data points ($X, Y$) on a graph. By looking at the pattern, we can get an idea of the relationship's strength and direction (positive or negative).
The chart below shows the relationship between hours studied and exam scores for several students. As you can see, there appears to be a positive trend: as study hours increase, exam scores tend to increase as well.
Part 2: Simple Linear Regression
Now that we can visualize relationships, let's quantify them. Simple Linear Regression finds the single straight line that best describes the relationship between one independent variable ($X$) and one dependent variable ($Y$). This section will explore the equation of this line and how we find the "best" one.
2.1 The Regression Equation
The line is described by a simple mathematical equation. For any given value of $X$, this equation gives us a predicted value for $Y$, which we call $\hat{Y}$ ("Y-hat").
$\hat{Y} = \beta_0 + \beta_1 X$
- $\hat{Y}$: The predicted value of the dependent variable.
- $\beta_0$ (Beta-naught): The intercept. It's the predicted value of $Y$ when $X$ is 0. This is where the line crosses the vertical Y-axis.
- $\beta_1$ (Beta-one): The slope or coefficient. It represents the change in $Y$ for a one-unit increase in $X$.
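The equation translates directly into code. A minimal sketch with hypothetical coefficients (beta_0 = 50 and beta_1 = 5 are made up for illustration, not fitted from data):

```python
# Hypothetical coefficients for the exam-score example.
beta_0 = 50.0   # intercept: predicted score with 0 hours of study
beta_1 = 5.0    # slope: points gained per extra hour studied

def predict(x: float) -> float:
    """Predicted value Y-hat = beta_0 + beta_1 * X."""
    return beta_0 + beta_1 * x

print(predict(4))   # 50 + 5*4 = 70.0
```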
2.2 Finding the Best Fit Line: The Cost Function
How do we find the best values for $\beta_0$ and $\beta_1$? We find the line that minimizes the total error. The error for a single point is the vertical distance between the actual data point ($Y$) and the predicted point on our line ($\hat{Y}$). This is called the residual.
To get the total error, we square each residual (to make them all positive) and sum them up. This is called the Sum of Squared Errors (SSE) or Residual Sum of Squares (RSS). The goal is to find the line that makes this SSE as small as possible.
Cost Function: $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
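Computing the SSE for a candidate line is a direct translation of the formula. A sketch using hypothetical data points and a hand-picked intercept and slope:

```python
# Residuals and SSE for one candidate line, on hypothetical (X, Y) points.
points = [(1, 52), (2, 58), (3, 61), (4, 70)]
beta_0, beta_1 = 48.0, 5.0   # candidate intercept and slope

# Square each residual (Y - Y-hat) and sum: the quantity we want to minimize.
sse = sum((y - (beta_0 + beta_1 * x)) ** 2 for x, y in points)
print(sse)   # residuals -1, 0, -2, 2  ->  SSE = 9.0
```

A different choice of beta_0 and beta_1 gives a different SSE; the best-fit line is the one with the smallest value.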
Interactive Demo
Use the sliders below to manually adjust the intercept ($\beta_0$) and slope ($\beta_1$). Try to find the line that best fits the data. The line that produces the lowest SSE is the best fit line, which an algorithm like Gradient Descent finds automatically.
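For simple linear regression there is no need for trial and error: the classic least-squares formulas give the SSE-minimizing intercept and slope directly. A sketch on hypothetical points:

```python
# Closed-form least squares: beta_1 = Sxy / Sxx, beta_0 = mean(Y) - beta_1 * mean(X).
points = [(1, 52), (2, 58), (3, 61), (4, 70)]   # hypothetical data

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
s_xx = sum((x - mean_x) ** 2 for x, _ in points)

beta_1 = s_xy / s_xx              # slope that minimizes SSE
beta_0 = mean_y - beta_1 * mean_x # intercept that minimizes SSE
print(beta_0, beta_1)
```

No slider setting can produce a lower SSE on these points than this pair of values; iterative methods like gradient descent converge to the same answer.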
Part 3: Multiple Linear Regression
The real world is complex. Often, an outcome is influenced by more than one factor. Multiple Linear Regression extends the simple model by allowing us to use several independent variables ($X_1, X_2, ..., X_p$) to predict a single dependent variable ($Y$). This provides a more realistic and often more accurate model.
3.1 The Expanded Equation
The equation is a natural extension of the simple one. We just add more terms, one for each new independent variable.
$\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p$
- Each $\beta_j$ is the coefficient for its corresponding variable $X_j$.
- Interpretation: $\beta_j$ represents the average change in $Y$ for a one-unit increase in $X_j$, while holding all other variables constant. This last part is very important.
Example:
Predicting a house price ($Y$) based on its size ($X_1$) and age ($X_2$).
• $\hat{Price} = \beta_0 + \beta_1 Size + \beta_2 Age$
• $\beta_1$ would be the increase in price for each extra square foot, assuming the age is the same.
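The "holding all other variables constant" interpretation is easy to verify in code: if exactly one input changes by one unit, the prediction changes by exactly that input's coefficient. A sketch with made-up coefficients (not fitted from real housing data):

```python
# Hypothetical house-price model: price = beta_0 + beta_1*size + beta_2*age.
beta_0, beta_1, beta_2 = 50_000.0, 120.0, -800.0   # illustrative values

def predict_price(size_sqft: float, age_years: float) -> float:
    return beta_0 + beta_1 * size_sqft + beta_2 * age_years

# Holding age constant, one extra square foot adds exactly beta_1 dollars.
p1 = predict_price(1500, 10)
p2 = predict_price(1501, 10)
print(p2 - p1)   # 120.0
```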
3.2 A Key Challenge: Multicollinearity
A new problem arises in multiple regression: multicollinearity. This happens when two or more independent variables are highly correlated with each other. For example, if we included both 'house size in sq. ft.' and 'house size in sq. meters' as predictors, they would be perfectly correlated.
Why is this bad?
- It becomes difficult for the model to determine the individual effect of each correlated variable. The coefficient estimates ($\beta$) become unstable and hard to interpret.
- It doesn't necessarily reduce the predictive accuracy of the model as a whole, but it undermines our ability to understand the importance of individual predictors.
Example:
Suppose we add 'number of rooms' ($X_3$) to our house price model. It is likely that 'size' ($X_1$) and 'number of rooms' ($X_3$) are highly correlated. The model might struggle to decide whether to attribute price increases to size or to the number of rooms, making both coefficients unreliable.
Part 4: Model Evaluation & Diagnostics
Building a model is only half the battle. We must rigorously evaluate how good it is. This section covers two critical aspects of evaluation: measuring the model's performance and diagnosing potential problems by examining its errors (residuals).
4.1 Measuring Performance: R-squared ($R^2$)
$R^2$, also called the coefficient of determination, is a popular metric that tells us the proportion of the variance in the dependent variable ($Y$) that is predictable from the independent variable(s) ($X$).
- It ranges from 0 to 1 (or 0% to 100%).
- A higher $R^2$ indicates that the model explains a larger portion of the variability in the outcome.
- An $R^2$ of 0.75 means that 75% of the variation in $Y$ can be explained by our model's inputs $X$.
Example:
In our 'hours studied vs. exam score' model, if we get an $R^2$ of 0.82, it means that 82% of the differences in exam scores among students can be explained by the differences in how many hours they studied. The remaining 18% is due to other factors (luck, intelligence, etc.).
Caution: A high $R^2$ doesn't automatically mean the model is good. Adding more variables will almost always increase $R^2$, even if those variables are useless. This is why we also use Adjusted $R^2$, which penalizes the score for adding non-useful predictors.
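Both metrics can be computed directly from the residuals: $R^2 = 1 - SSE/SST$, where $SST$ is the total variation of $Y$ around its mean. A sketch using hypothetical actual and predicted values:

```python
# Hypothetical observations and model outputs (illustrative only).
y_actual    = [52, 58, 61, 70]
y_predicted = [53, 58, 63, 68]

mean_y = sum(y_actual) / len(y_actual)
sse = sum((y - yh) ** 2 for y, yh in zip(y_actual, y_predicted))
sst = sum((y - mean_y) ** 2 for y in y_actual)

r2 = 1 - sse / sst
print(round(r2, 3))

# Adjusted R^2 penalizes extra predictors (n observations, p predictors).
n, p = len(y_actual), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))   # always <= R^2; drops if a predictor adds little
```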
4.2 Diagnosing Problems: Residual Plots
A key assumption of linear regression is that the errors (residuals) are random and have no pattern. A residual plot, which graphs the predicted values ($\hat{Y}$) against the residuals ($Y - \hat{Y}$), is the best way to check this.
What we want to see: A random cloud of points centered around 0 with no discernible shape. This indicates the model assumptions hold.
What we don't want to see: Any clear pattern. A U-shape suggests the relationship is not actually linear, while a funnel shape suggests the error variance is not constant (heteroscedasticity). Either pattern means a plain linear model is not the right choice for this data.
This is a healthy residual plot. The points are randomly scattered around the zero line, indicating that our linear model is a good fit.
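Before plotting, the residuals themselves are easy to inspect in code. A sketch over four hypothetical points, using least-squares coefficients precomputed for this example; a useful sanity check is that an OLS line with an intercept has residuals summing to essentially zero:

```python
# Residuals = actual minus predicted, here for the least-squares line
# y-hat = 46.0 + 5.7 * x (coefficients precomputed for this sketch).
points = [(1, 52), (2, 58), (3, 61), (4, 70)]   # hypothetical data
beta_0, beta_1 = 46.0, 5.7

residuals = [y - (beta_0 + beta_1 * x) for x, y in points]
print([round(r, 2) for r in residuals])   # should scatter around 0
print(round(sum(residuals), 9))           # ~0.0 by construction for OLS
```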
Part 5: Advanced Topics & Regularization
Finally, we'll touch on advanced techniques used to improve model performance, particularly when dealing with many variables or multicollinearity. The main concept here is regularization, which involves adding a penalty to the cost function to discourage overly complex models.
5.1 Overfitting: The Problem of Complexity
Overfitting occurs when a model learns the training data *too well*. It captures not only the underlying relationship but also the random noise. This leads to a model that performs great on the data it was trained on, but fails to generalize and make accurate predictions on new, unseen data.
This often happens when you have too many independent variables compared to the number of data points.
5.2 Regularization: A Solution to Overfitting
Regularization techniques add a penalty term to the cost function (the SSE). The penalty grows with the size of the coefficient ($\beta$) values, which forces the model to keep its coefficients small. The result is a simpler model that is less likely to overfit. The two most common types are:
Ridge Regression (L2)
Adds a penalty equal to the sum of the squared coefficient values.
Penalty = $\lambda \sum \beta_j^2$
Ridge shrinks the coefficients towards zero, but never to exactly zero. It's great for handling multicollinearity.
Lasso Regression (L1)
Adds a penalty equal to the sum of the absolute values of the coefficients.
Penalty = $\lambda \sum |\beta_j|$
Lasso can shrink some coefficients all the way to zero, effectively performing variable selection by removing unimportant predictors. This makes the final model easier to interpret.
The term $\lambda$ (lambda) is a hyperparameter that you control. It determines the strength of the penalty. A larger $\lambda$ results in smaller coefficients and a simpler model.
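In the one-predictor case with centered data and an unpenalized intercept, the Ridge slope has a simple closed form, $\beta_1 = S_{xy} / (S_{xx} + \lambda)$, which makes the shrinkage effect of $\lambda$ easy to see. A sketch on hypothetical points (Lasso has no closed form and is typically fit by coordinate descent, so it is omitted here):

```python
# Ridge shrinkage, one-predictor case: slope = Sxy / (Sxx + lambda).
points = [(1, 52), (2, 58), (3, 61), (4, 70)]   # hypothetical data

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
s_xx = sum((x - mean_x) ** 2 for x, _ in points)

slopes = []
for lam in (0.0, 1.0, 10.0):
    slope = s_xy / (s_xx + lam)   # lam = 0 recovers ordinary least squares
    slopes.append(slope)
    print(lam, round(slope, 3))
```

As $\lambda$ grows, the slope shrinks steadily toward (but never reaches) zero, exactly the Ridge behavior described above.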