Understanding Underfitting in Machine Learning
Underfitting occurs when a machine learning model is too simple to capture the underlying patterns in the training data, leading to poor performance on both the training data and unseen data. This tutorial explores the concept of underfitting, its causes and consequences, and methods to mitigate it.
Defining Underfitting
Underfitting happens when a model fails to learn the underlying relationship between input features and the target variable. It typically results from using an overly simple model (e.g., linear regression on a non-linear dataset) or from not providing the model with enough informative features. An underfit model's accuracy is low on both the training set and the test set.
Causes of Underfitting
Several factors can contribute to underfitting:
- Model too simple: the model lacks the capacity to represent the true relationship (e.g., a linear model on non-linear data).
- Insufficient or uninformative features: the inputs do not carry enough signal to predict the target.
- Excessive regularization: overly strong penalties shrink the model's parameters toward zero, preventing it from fitting the data (see the sketch after this list).
- Insufficient training: for iterative learners, training is stopped before the model has converged.
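To make the regularization cause concrete, here is a minimal sketch (not part of the original examples; the data and alpha values are invented for illustration). It fits scikit-learn's Ridge regression with a mild penalty and with an intentionally excessive one; the oversized penalty shrinks the coefficient toward zero and drives up even the training error:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
# Simple linear data that an unregularized linear model fits easily
rng = np.random.default_rng(0)
X = np.linspace(-5, 5, 100).reshape(-1, 1)
y = 3 * X.ravel() + rng.normal(scale=1.0, size=100)
# Compare a mild penalty with an intentionally excessive one
for alpha in (1.0, 1e6):
    model = Ridge(alpha=alpha).fit(X, y)
    mse = mean_squared_error(y, model.predict(X))
    print(f'alpha={alpha:g}: coefficient={model.coef_[0]:.4f}, training MSE={mse:.2f}')
# With alpha=1e6 the coefficient is shrunk toward zero, so even the
# training error is high -- the hallmark of underfitting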
Consequences of Underfitting
The main consequence of underfitting is poor predictive performance. An underfit model will have:
- High error (low accuracy) on the training set, because it cannot represent the patterns in the data it was trained on.
- High error on the test set and on new, unseen data, for the same reason.
- Systematically biased predictions: it consistently misses structure that is actually present in the data.
Identifying Underfitting
You can identify underfitting by observing the performance of your model. Specifically, look for the following signs (a diagnostic sketch follows this list):
- Low accuracy (or high error) on the training set, not just the test set.
- Training and validation errors that are both high and close to each other.
- Learning curves that plateau at a high loss value, even as more training data is added.
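As a diagnostic sketch (my own illustration, reusing the synthetic quadratic data from the examples below), scikit-learn's learning_curve helper makes the plateau visible: for an underfit model, training and validation error stay high no matter how much data is added.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
# Synthetic non-linear data (same form as the examples below)
rng = np.random.default_rng(42)
X = np.linspace(-5, 5, 200).reshape(-1, 1)
y = (2 * X**2 + 3 * X + 1).ravel() + rng.normal(scale=10, size=200)
# Compute learning curves for a linear model on quadratic data
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    scoring='neg_mean_squared_error')
# Both curves plateau at a high error: the signature of underfitting
for n, tr, va in zip(train_sizes, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
    print(f'{n:4d} samples: train MSE={tr:8.1f}, validation MSE={va:8.1f}')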
Code Example: Demonstrating Underfitting
This code generates non-linear data and then attempts to fit a linear regression model to it. The resulting plot and MSE scores will clearly show that the linear model is unable to capture the underlying pattern, demonstrating underfitting. The MSE on both training and test sets will be high.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Generate some non-linear (quadratic) data
np.random.seed(42)  # fix the seed so the noise is reproducible
X = np.linspace(-5, 5, 100).reshape(-1, 1)
y = 2 * X**2 + 3 * X + 1 + np.random.randn(100, 1) * 10  # add some noise
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit a linear regression model (underfitting)
model = LinearRegression()
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Evaluate the model
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
print(f'Training Mean Squared Error: {train_mse}')
print(f'Testing Mean Squared Error: {test_mse}')
# Plot the data and the model's predictions
plt.figure(figsize=(8, 6))
plt.scatter(X, y, label='Data')
plt.plot(X, model.predict(X), color='red', label='Linear Regression Model')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Underfitting Example: Linear Regression on Non-linear Data')
plt.legend()
plt.show()
Concepts Behind the Snippet
The code snippet demonstrates the principle of model selection. Choosing the right model complexity is critical. Linear regression assumes a linear relationship between the input and output. When the data exhibits a non-linear pattern, a linear model will inevitably underfit. The Mean Squared Error (MSE) is used to quantify the error between the predicted and actual values. A high MSE indicates a poor fit.
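For reference, MSE is simply the mean of the squared residuals. A minimal hand computation, equivalent to what scikit-learn's mean_squared_error returns, looks like this (the values are made up for illustration):
import numpy as np
# MSE = average of squared differences between actual and predicted values
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 8.0])
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # (0.25 + 0.25 + 1.0) / 3 = 0.5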
Real-Life Use Case
Consider predicting housing prices based solely on square footage using a linear model when other factors like location, number of bedrooms, and age of the house significantly influence the price. Using just square footage (a single feature, simple model) would lead to underfitting because it fails to capture the complexities of the housing market.
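As a hypothetical sketch of that scenario (the features, coefficients, and data here are invented for illustration, not taken from a real housing dataset), fitting on square footage alone explains noticeably less of the price variance than fitting on all three features:
import numpy as np
from sklearn.linear_model import LinearRegression
# Hypothetical synthetic housing data: price depends on more than size
rng = np.random.default_rng(1)
n = 200
sqft = rng.uniform(500, 3000, n)
bedrooms = rng.integers(1, 6, n)
age = rng.uniform(0, 50, n)
price = 150 * sqft + 20000 * bedrooms - 1000 * age + rng.normal(0, 20000, n)
# Model 1: square footage only (underfits -- it ignores relevant features)
X_single = sqft.reshape(-1, 1)
r2_single = LinearRegression().fit(X_single, price).score(X_single, price)
# Model 2: all three features
X_full = np.column_stack([sqft, bedrooms, age])
r2_full = LinearRegression().fit(X_full, price).score(X_full, price)
print(f'R^2 with sqft only: {r2_single:.3f}; with all features: {r2_full:.3f}')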
Solutions to Underfitting
There are several ways to address underfitting:
- Increase model complexity: use a more expressive model (e.g., polynomial regression, decision trees, neural networks).
- Add or engineer features: provide inputs that better capture the underlying relationship, such as polynomial or interaction terms.
- Reduce regularization: if the regularization strength is too high, lower it so the model can fit the data.
- Train longer: for iterative learners, make sure training has converged before stopping.
Code Example: Fixing Underfitting with Polynomial Regression
This code builds on the previous example by using PolynomialFeatures to transform the input data into polynomial features (in this case, degree 2). A linear regression model is then fit to these transformed features. This allows the model to capture the non-linear relationship in the data, significantly reducing underfitting and improving accuracy. The MSE will be much lower compared to the previous linear regression example.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Generate the same non-linear (quadratic) data as the first example
np.random.seed(42)  # same seed, so the MSE values are directly comparable
X = np.linspace(-5, 5, 100).reshape(-1, 1)
y = 2 * X**2 + 3 * X + 1 + np.random.randn(100, 1) * 10  # add some noise
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Transform features to polynomial features
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
# Fit a linear regression model on the polynomial features
model = LinearRegression()
model.fit(X_train_poly, y_train)
y_train_pred = model.predict(X_train_poly)
y_test_pred = model.predict(X_test_poly)
# Evaluate the model
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
print(f'Training Mean Squared Error: {train_mse}')
print(f'Testing Mean Squared Error: {test_mse}')
# Plot the data and the model's predictions
plt.figure(figsize=(8, 6))
plt.scatter(X, y, label='Data')
X_plot = np.linspace(-5, 5, 100).reshape(-1, 1)
X_plot_poly = poly.transform(X_plot)
y_plot_pred = model.predict(X_plot_poly)
plt.plot(X_plot, y_plot_pred, color='red', label='Polynomial Regression Model')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Polynomial Regression on Non-linear Data')
plt.legend()
plt.show()
Best Practices
Start with a simple baseline model and evaluate it on both the training and validation sets. If both errors are high, increase model complexity or add informative features incrementally, re-checking the errors at each step. Plot learning curves to distinguish underfitting (both curves plateau at high error) from overfitting (a large gap between training and validation error), and use cross-validation to make these comparisons reliable.
Interview Tip
When discussing underfitting in an interview, be prepared to explain the concept in simple terms, provide examples of its causes and consequences, and describe methods for mitigating it. Highlight your understanding of the trade-off between model complexity and generalization ability. Be ready to discuss specific algorithms and techniques like polynomial regression or adding interaction terms as ways to address underfitting.
When to Use Them
Recognizing when a model is underfitting can guide model selection and adjustment. An underfit model isn't necessarily useless, but it does mean the model is not capturing all the information available in the data. When the goal is to predict as accurately as possible from the available data, addressing underfitting improves both accuracy and reliability.
Alternatives
Alternatives to address underfitting depend on the specific situation. Besides increasing model complexity and adding features, consider using ensemble methods (e.g., Random Forest, Gradient Boosting) that can combine multiple weak learners to create a stronger model. Also, explore different data preprocessing techniques (e.g., scaling, normalization) that might improve the model's ability to learn.
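As a minimal sketch of the ensemble alternative (reusing the same synthetic quadratic data as the earlier examples), a random forest can capture the non-linear pattern from the raw feature alone, without manual polynomial feature engineering:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Same synthetic quadratic data as the earlier examples
np.random.seed(42)
X = np.linspace(-5, 5, 100).reshape(-1, 1)
y = (2 * X**2 + 3 * X + 1).ravel() + np.random.randn(100) * 10
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# A random forest learns the curvature without explicit polynomial features
forest = RandomForestRegressor(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)
print('Test MSE:', mean_squared_error(y_test, forest.predict(X_test)))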
Pros
While underfitting is undesirable, very simple models are fast to train, easy to interpret, and require minimal computational resources. They also serve as a useful baseline against which more complex models can be measured.
Cons
The main con is poor accuracy and generalization. An underfit model fails to capture the underlying relationships and patterns in the data, so it is likely to perform poorly on new, unseen data as well as on the training data.
FAQ
What is the difference between underfitting and overfitting?
Underfitting occurs when a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both the training and test sets. Overfitting occurs when a model is too complex and learns the noise in the training data, resulting in excellent performance on the training set but poor performance on the test set.
How can I tell if my model is underfitting?
You can identify underfitting by observing low accuracy on both the training and validation datasets, and by analyzing learning curves that plateau at a high loss value.
Is underfitting always a bad thing?
Generally, yes. While simple models can be computationally efficient, the goal of most machine learning tasks is to build a model that predicts accurately, and underfitting indicates that the model is not effectively learning from the data. In situations where interpretability is paramount and high accuracy is not required, a simpler, underfit model might be acceptable.