Principal Component Analysis (PCA) for Dimensionality Reduction
This snippet demonstrates how to use Principal Component Analysis (PCA) from scikit-learn to reduce the dimensionality of a dataset while preserving the most important information. We'll use the Iris dataset as an example.
Import Libraries
This section imports the necessary libraries:
- numpy: For numerical operations, particularly array manipulation.
- matplotlib.pyplot: For creating visualizations.
- sklearn.decomposition.PCA: The PCA algorithm.
- sklearn.datasets.load_iris: A function to load the Iris dataset.
- sklearn.preprocessing.StandardScaler: Used for scaling the data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
Load and Prepare the Iris Dataset
We load the Iris dataset and scale the features:
- iris = load_iris(): Loads the Iris dataset, which contains measurements of sepal length, sepal width, petal length, and petal width for three species of iris flowers.
- X = iris.data: Assigns the feature data to the variable X.
- y = iris.target: Assigns the target labels (species) to the variable y.
- scaler = StandardScaler(): Creates a StandardScaler object.
- X_scaled = scaler.fit_transform(X): Scales the data. Scaling is important because PCA is sensitive to the scale of the features; StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation.
iris = load_iris()
X = iris.data
y = iris.target
# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
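As a quick, purely illustrative sanity check, you can verify that each scaled feature now has mean approximately 0 and standard deviation approximately 1:
# Illustrative check: StandardScaler leaves each feature with
# mean ~0 and standard deviation ~1.
print(X_scaled.mean(axis=0).round(6))  # approximately [0. 0. 0. 0.]
print(X_scaled.std(axis=0).round(6))   # approximately [1. 1. 1. 1.]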
Apply PCA
Here, we apply PCA to reduce the dimensionality to 2 components:
- pca = PCA(n_components=2): Creates a PCA object with 2 components, so PCA will reduce the data from 4 dimensions to 2.
- X_pca = pca.fit_transform(X_scaled): Fits the PCA model to the scaled data X_scaled and transforms it into the new lower-dimensional space.
- explained_variance = pca.explained_variance_ratio_: Stores the explained variance ratio for each principal component, i.e., how much of the total variance each of the 2 components explains.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
explained_variance = pca.explained_variance_ratio_
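For orientation, a small illustrative check of what these values look like on the scaled Iris data (the exact figures depend on your scikit-learn version, but they land close to these):
# Illustrative: on standardized Iris data the two components typically
# explain roughly 73% and 23% of the variance, respectively.
print(explained_variance)        # approximately [0.7296 0.2285]
print(explained_variance.sum())  # roughly 0.958 of the total variance kept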
Visualize the Reduced Data
This section visualizes the reduced data:
- plt.figure(figsize=(8, 6)): Creates a figure for the plot.
- plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis'): Creates a scatter plot of the reduced data. X_pca[:, 0] and X_pca[:, 1] are the first and second principal components, and c=y colors the points according to their species.
- print(f"Explained variance ratio: {explained_variance}"): Prints the explained variance ratio for each principal component.
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Iris Dataset (2 components)')
plt.colorbar(label='Species')
plt.show()
print(f"Explained variance ratio: {explained_variance}")
Complete Code
This section provides the complete code for easy copy-pasting and execution.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
# Load and prepare the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
explained_variance = pca.explained_variance_ratio_
# Visualize the reduced data
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Iris Dataset (2 components)')
plt.colorbar(label='Species')
plt.show()
print(f"Explained variance ratio: {explained_variance}")
Concepts Behind the Snippet
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a dataset into a new coordinate system where the principal components (new variables) are orthogonal to each other and ordered by the amount of variance they explain. The first principal component explains the most variance in the data, the second principal component explains the second most variance, and so on. By selecting a subset of the principal components, we can reduce the dimensionality of the data while retaining most of the important information. It's useful for visualizing high-dimensional data, reducing noise, and improving the performance of machine learning algorithms.
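To make this concrete, here is a minimal sketch of the idea behind PCA, computed by hand with numpy: diagonalize the covariance matrix of the centered data and project onto the top eigenvectors. This is for illustration only; scikit-learn's PCA uses an SVD internally, and component signs may differ.
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
X_centered = X - X.mean(axis=0)             # center each feature
cov = np.cov(X_centered, rowvar=False)      # 4x4 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]           # reorder by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
X_projected = X_centered @ eigvecs[:, :2]   # project onto the top 2 components
print(eigvals / eigvals.sum())              # explained variance ratio per component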
Real-Life Use Case
PCA is widely used to compress high-dimensional data before modeling: for example, reducing thousands of correlated gene expression measurements to a handful of components, or condensing pixel intensities in image data (the classic "eigenfaces" approach to face recognition is essentially PCA applied to face images).
Best Practices
Always scale your data before applying PCA, using StandardScaler or MinMaxScaler from scikit-learn. Coupling scaling and PCA in a single pipeline (sketched below) also helps you avoid accidentally transforming training and test data differently.
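One way to follow this practice is to bundle both steps with scikit-learn's make_pipeline, so they are fitted together and applied consistently. A minimal sketch, not the only approach:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))  # scale, then reduce
X_pca = pipe.fit_transform(X)  # fits both steps, then transforms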
Interview Tip
Be prepared to discuss the assumptions of PCA (e.g., linearity, large variance implies important structure), its limitations (sensitivity to outliers, difficulty with non-linear relationships), and alternative dimensionality reduction techniques like t-SNE or UMAP. Also, be ready to explain how to choose the optimal number of components.
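For the "optimal number of components" question, the usual cumulative explained variance plot can be sketched as follows (fitting PCA with all components on the scaled Iris data; the 95% threshold line is just an example):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
pca_full = PCA().fit(X_scaled)                   # keep all 4 components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

plt.plot(range(1, len(cumulative) + 1), cumulative, marker='o')
plt.axhline(0.95, color='gray', linestyle='--')  # e.g., a 95% threshold
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()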
When to Use Them
Use PCA when:
- You want to visualize high-dimensional data in 2 or 3 dimensions.
- You want to reduce noise or remove redundancy among correlated features.
- You want to speed up or improve downstream machine learning algorithms by reducing the number of features.
Memory Footprint
The memory footprint of PCA depends on the size of the dataset (number of samples and features). For very large datasets, consider using incremental PCA (IncrementalPCA) to process the data in batches.
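A minimal sketch of IncrementalPCA on the scaled Iris data; the batch_size of 50 is arbitrary here, and the Iris dataset is small enough that this is purely demonstrative:
from sklearn.datasets import load_iris
from sklearn.decomposition import IncrementalPCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
ipca = IncrementalPCA(n_components=2, batch_size=50)  # fits in fixed-size batches
X_ipca = ipca.fit_transform(X_scaled)                 # or call partial_fit per batch
print(ipca.explained_variance_ratio_)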
Alternatives
- t-SNE and UMAP for non-linear dimensionality reduction (see the Interview Tip above).
- LDA (Linear Discriminant Analysis) when class labels are available (see the FAQ below).
- IncrementalPCA for datasets too large to process in memory.
Pros
- Reduces dimensionality while retaining most of the variance in the data.
- Useful for visualization, noise reduction, and speeding up downstream models.
- Fast and deterministic, with an easily inspected explained variance ratio.
Cons
- Sensitive to the scale of the features and to outliers.
- Assumes linear relationships and that large variance implies important structure.
- The principal components are linear combinations of the original features, which can make them harder to interpret.
FAQ
- Why is scaling important before applying PCA?
PCA is sensitive to the scale of the features. Features with larger scales will have a greater influence on the principal components, even if they are not necessarily more important. Scaling the data ensures that all features are treated equally.
- How do I choose the optimal number of components?
Use the explained variance ratio. Plot the cumulative explained variance ratio as a function of the number of components, and choose the number of components that explains a sufficiently high percentage of the total variance (e.g., 90% or 95%). You can also use techniques like cross-validation to evaluate the performance of a machine learning model with different numbers of components. A sketch of a convenient shortcut follows the FAQ.
- What is the difference between PCA and LDA?
PCA is an unsupervised dimensionality reduction technique that aims to find the principal components that explain the most variance in the data. LDA is a supervised dimensionality reduction technique designed specifically for classification tasks: it finds the linear discriminants that maximize the separation between classes.
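The shortcut mentioned in the second answer: scikit-learn's PCA also accepts a float between 0 and 1 for n_components, in which case it keeps just enough components to explain that fraction of the variance. A minimal sketch on the scaled Iris data:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
pca_95 = PCA(n_components=0.95)             # keep enough components for >= 95% variance
X_reduced = pca_95.fit_transform(X_scaled)
print(pca_95.n_components_)                 # number of components actually selected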