Pipeline with FeatureUnion and Grid Search
This example demonstrates a more advanced pipeline setup: it combines multiple feature extraction techniques using FeatureUnion and then optimizes the model's hyperparameters using GridSearchCV.
Introduction to FeatureUnion in Pipelines
FeatureUnion allows you to combine the results of multiple feature extraction techniques in parallel. This is useful when you want to leverage different types of features or apply different transformations to the same features.
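As a minimal sketch (the data and transformer choices here are illustrative, not taken from the snippet below), FeatureUnion simply runs each transformer on the input and concatenates their output columns:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler
X = np.random.rand(6, 4)  # 6 samples, 4 features
union = FeatureUnion([
    ('pca', PCA(n_components=2)),   # contributes 2 columns
    ('scaled', StandardScaler())    # contributes 4 columns
])
X_combined = union.fit_transform(X)
print(X_combined.shape)  # (6, 6): outputs are concatenated column-wise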
Code Example: Pipeline with FeatureUnion and GridSearchCV
This code generates synthetic regression data. It then defines three feature extraction pipelines: one for PCA, one for polynomial features, and one for feature selection using SelectKBest. `FeatureUnion` combines the outputs of these pipelines. Finally, a `LinearRegression` model is trained on the combined features. `GridSearchCV` is used to find the best hyperparameters for the PCA n_components, the PolynomialFeatures degree, and the SelectKBest k parameter. The best parameters and score are then printed, and the model is evaluated on the test set.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=10, noise=0.1, random_state=42)
X = pd.DataFrame(X)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define feature extraction pipelines
pipeline_pca = Pipeline([('pca', PCA())])
pipeline_poly = Pipeline([('poly', PolynomialFeatures(degree=2)), ('scaler', StandardScaler())])
pipeline_select = Pipeline([('select', SelectKBest(score_func=f_regression))])
# Combine feature extraction pipelines using FeatureUnion
features = FeatureUnion([
('pca', pipeline_pca),
('poly', pipeline_poly),
('select', pipeline_select)
])
# Create the full pipeline
pipeline = Pipeline([
('features', features),
('linear_regression', LinearRegression())
])
# Define hyperparameter grid
param_grid = {
'features__pca__pca__n_components': [2, 5],
'features__select__select__k': [2, 5],
'features__poly__poly__degree': [2, 3]
}
# Perform grid search with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
# Print best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)
# Evaluate the best model on the test set
score = grid_search.score(X_test, y_test)
print("Test score:", score)
Concepts Behind the Snippet
This snippet uses:
- FeatureUnion: Combines different feature engineering steps.
- PCA: Principal Component Analysis for dimensionality reduction.
- PolynomialFeatures: Generates polynomial combinations of features.
- SelectKBest: Selects the best features based on statistical tests.
- GridSearchCV: Exhaustive search over specified parameter values for an estimator.
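As a quick follow-up, here is a minimal sketch (assuming the `grid_search` object from the snippet above has already been fit) of how the tuned pipeline can be inspected through `best_estimator_`:
# Inspect the best fitted pipeline found by GridSearchCV
best_pipeline = grid_search.best_estimator_
# The FeatureUnion keeps its fitted transformers in transformer_list;
# entry 0 is the ('pca', pipeline_pca) pair defined above.
fitted_pca = best_pipeline.named_steps['features'].transformer_list[0][1].named_steps['pca']
print("Chosen number of PCA components:", fitted_pca.n_components_)
print("Explained variance ratio:", fitted_pca.explained_variance_ratio_)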
Real-Life Use Case
Consider a scenario where you are predicting house prices. You might want to use PCA to reduce the dimensionality of highly correlated features like square footage and number of rooms, generate polynomial features to capture non-linear relationships, and select the most relevant features based on statistical tests. A pipeline with FeatureUnion allows you to combine these techniques and optimize the hyperparameters of each step using GridSearchCV.
Best Practices
Keep every preprocessing step inside the pipeline so that GridSearchCV fits each transformer on the training folds only, preventing data leakage during cross-validation. Scale features before applying PCA so that components are not dominated by the features with the largest variances, and start with a small parameter grid, since the number of fitted models grows multiplicatively with each added parameter value.
Interview Tip
Be prepared to explain the benefits of using FeatureUnion and GridSearchCV in a pipeline. Emphasize the ability to combine different feature engineering techniques and optimize the entire pipeline for best performance.
When to Use Pipelines with FeatureUnion and GridSearchCV
Use pipelines with FeatureUnion when you want to combine multiple feature extraction techniques and GridSearchCV when you want to automatically optimize the hyperparameters of your pipeline.
Memory Footprint Considerations
Using PolynomialFeatures can lead to a significant increase in the number of features and, consequently, memory usage. PCA can reduce the memory footprint by reducing dimensionality. Be mindful of the memory implications when combining these techniques, especially with large datasets.
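To make the feature blow-up concrete, here is a small check (a sketch using the same 10-feature setup as the snippet above) of how many columns PolynomialFeatures produces:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
X_demo = np.random.rand(5, 10)  # 5 samples, 10 features, as in the snippet above
for degree in (2, 3):
    n_out = PolynomialFeatures(degree=degree).fit(X_demo).n_output_features_
    print(f"degree={degree}: 10 features -> {n_out} columns")
# degree=2 yields 66 columns and degree=3 yields 286 (bias term included),
# so memory grows quickly with both the degree and the original feature count.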
Alternatives to FeatureUnion
While FeatureUnion is a powerful tool, alternative approaches include manually combining the features after applying individual transformations. However, this approach is less organized and more error-prone. In some cases, domain knowledge might allow you to hand-craft specific features, avoiding the need for automated feature extraction techniques.
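For contrast, a minimal sketch of that manual alternative (assuming X_train, y_train, X_test, y_test and the imports from the snippet above are in scope) might look like this:
import numpy as np
# Fit each transformer separately, then stack the outputs column-wise
pca = PCA(n_components=2)
selector = SelectKBest(score_func=f_regression, k=5)
X_train_combined = np.hstack([
    pca.fit_transform(X_train),
    selector.fit_transform(X_train, y_train),
])
model = LinearRegression().fit(X_train_combined, y_train)
# At prediction time, the *same fitted* transformers must be applied and
# stacked in the same order -- bookkeeping that FeatureUnion handles for you:
X_test_combined = np.hstack([pca.transform(X_test), selector.transform(X_test)])
print("Manual test R^2:", model.score(X_test_combined, y_test))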
Pros of FeatureUnion
- Combines heterogeneous feature extraction steps into a single estimator, so the whole workflow can be cross-validated and tuned together with GridSearchCV.
- Keeps the transformation logic declarative and reusable; transforming new data applies every fitted step consistently and in the same order.
Cons of FeatureUnion
- Every transformer is applied to the full input, which can be computationally and memory intensive, especially with expansive steps like PolynomialFeatures.
- The combined feature space can be harder to interpret, since columns from different transformers are simply concatenated.
FAQ
- What is the purpose of `FeatureUnion`?
FeatureUnion combines the results of multiple feature extraction techniques into a single feature space.
- How does `GridSearchCV` work with a pipeline?
GridSearchCV systematically searches through a grid of hyperparameters, training and evaluating the pipeline for each combination of parameters using cross-validation.
- How do I access the parameters of individual steps within a pipeline in `GridSearchCV`?
You can access the parameters of individual steps using double underscore notation, e.g., 'features__pca__pca__n_components' refers to the n_components parameter of the PCA step within the pca pipeline within the features FeatureUnion.
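If you are unsure which names are valid, you can list them directly; this sketch assumes the `pipeline` object from the snippet above:
# get_params(deep=True) exposes every nested parameter name the grid can reference
for name in pipeline.get_params(deep=True):
    if name.endswith(('n_components', 'degree', '__k')):
        print(name)
# Prints names such as features__pca__pca__n_components,
# features__poly__poly__degree and features__select__select__k.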