Pipeline with Feature Union and Grid Search

This example builds a more advanced pipeline: it combines several feature extraction techniques with FeatureUnion and then tunes their hyperparameters with GridSearchCV.

Introduction to FeatureUnion in Pipelines

FeatureUnion applies several feature extraction techniques to the same input in parallel and concatenates their outputs column-wise into a single feature matrix. This is useful when you want to leverage different types of features or apply different transformations to the same features.
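As a minimal sketch of that concatenation, the snippet below combines two transformers on a small random dataset and checks the width of the result. The data, the variable names, and the chosen `n_components` and `k` values are illustrative assumptions, not part of the main example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import FeatureUnion

# Illustrative data: 20 samples, 4 features (assumed for this sketch only)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=20)

# Each transformer sees the full input; the outputs are stacked side by side
union = FeatureUnion([
    ("pca", PCA(n_components=2)),                           # contributes 2 columns
    ("select", SelectKBest(score_func=f_regression, k=1)),  # contributes 1 column
])
X_combined = union.fit_transform(X, y)

print(X_combined.shape)  # (20, 3): 2 PCA columns + 1 selected column
```

The output width is the sum of the widths of the individual transformer outputs, which is worth keeping in mind when one of the branches expands the feature space.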

Code Example: Pipeline with FeatureUnion and GridSearchCV

This code generates synthetic regression data. It then defines three feature extraction pipelines: one for PCA, one for polynomial features, and one for feature selection using SelectKBest. `FeatureUnion` combines the outputs of these pipelines. Finally, a `LinearRegression` model is trained on the combined features. `GridSearchCV` is used to find the best hyperparameters for the PCA n_components, the PolynomialFeatures degree, and the SelectKBest k parameter. The best parameters and score are then printed, and the model is evaluated on the test set.

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=10, noise=0.1, random_state=42)
X = pd.DataFrame(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define feature extraction pipelines
pipeline_pca = Pipeline([('pca', PCA())])
pipeline_poly = Pipeline([('poly', PolynomialFeatures(degree=2)), ('scaler', StandardScaler())])
pipeline_select = Pipeline([('select', SelectKBest(score_func=f_regression))])

# Combine feature extraction pipelines using FeatureUnion
features = FeatureUnion([
    ('pca', pipeline_pca),
    ('poly', pipeline_poly),
    ('select', pipeline_select)
])

# Create the full pipeline
pipeline = Pipeline([
    ('features', features),
    ('linear_regression', LinearRegression())
])

# Define hyperparameter grid
param_grid = {
    'features__pca__pca__n_components': [2, 5],
    'features__select__select__k': [2, 5],
    'features__poly__poly__degree': [2, 3]
}

# Perform grid search with cross-validation
grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

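# Print best parameters and score
# Note: scoring is 'neg_mean_squared_error', so scores are negative MSE (closer to 0 is better)
print("Best parameters:", grid_search.best_params_)
print("Best CV score (negative MSE):", grid_search.best_score_)

# Evaluate the best model on the test set (same metric: negative MSE)
score = grid_search.score(X_test, y_test)
print("Test score (negative MSE):", score)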

Concepts Behind the Snippet

This snippet uses:

  • FeatureUnion: Combines different feature engineering steps.
  • PCA: Principal Component Analysis for dimensionality reduction.
  • PolynomialFeatures: Generates polynomial combinations of features.
  • SelectKBest: Selects the best features based on statistical tests.
  • GridSearchCV: Exhaustive search over specified parameter values for an estimator.

Real-Life Use Case

Consider a scenario where you are predicting house prices. You might want to use PCA to reduce the dimensionality of highly correlated features like square footage and number of rooms, generate polynomial features to capture non-linear relationships, and select the most relevant features based on statistical tests. A pipeline with FeatureUnion allows you to combine these techniques and optimize the hyperparameters of each step using GridSearchCV.

Best Practices

  • Carefully select feature extraction techniques: Choose techniques appropriate for your data.
  • Use FeatureUnion strategically: Combine techniques that complement each other.
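  • Define the hyperparameter grid deliberately: Cover a useful range of values, but remember that the number of candidates is the product of the list lengths, so the search cost grows quickly.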
  • Use cross-validation: Ensure that your model generalizes well to unseen data.
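To make the cost of a grid concrete, the sketch below counts the candidates in the grid from the main example using `ParameterGrid`. The `cv_folds` variable is introduced here for illustration and mirrors the `cv=3` setting used above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the main example
param_grid = {
    "features__pca__pca__n_components": [2, 5],
    "features__select__select__k": [2, 5],
    "features__poly__poly__degree": [2, 3],
}

# Number of parameter combinations: 2 * 2 * 2
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per fold; with the default refit=True,
# GridSearchCV adds one final fit on the whole training set
cv_folds = 3
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 8 24
```

Adding a fourth parameter with three values would triple both numbers, which is why it usually pays to start with a coarse grid and refine around the best region.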

Interview Tip

Be prepared to explain the benefits of using FeatureUnion and GridSearchCV in a pipeline. Emphasize the ability to combine different feature engineering techniques and optimize the entire pipeline for best performance.

When to Use Pipelines with FeatureUnion and GridSearchCV

Use pipelines with FeatureUnion when you want to combine multiple feature extraction techniques and GridSearchCV when you want to automatically optimize the hyperparameters of your pipeline.

Memory Footprint Considerations

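Using PolynomialFeatures can lead to a significant increase in the number of features and, consequently, memory usage: with 10 input features, degree 2 produces 66 columns and degree 3 produces 286. PCA reduces dimensionality within its own branch, but inside a FeatureUnion its output is concatenated with the other branches, so it does not shrink the combined matrix. Be mindful of the memory implications when combining these techniques, especially with large datasets.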
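The growth in feature count can be checked directly. The sketch below fits `PolynomialFeatures` on a placeholder array with 10 columns (the array contents are irrelevant here and are an assumption of this sketch) and reads the resulting number of output columns.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Placeholder input with 10 features; only the column count matters
X = np.zeros((5, 10))

counts = {}
for degree in (1, 2, 3):
    poly = PolynomialFeatures(degree=degree).fit(X)
    # Includes the bias column, since include_bias=True by default
    counts[degree] = poly.n_output_features_

print(counts)  # {1: 11, 2: 66, 3: 286}
```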

Alternatives to FeatureUnion

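While FeatureUnion is a powerful tool, an alternative is to combine the features manually after applying the individual transformations. However, this approach is less organized and more error-prone, because each transformer must be fitted on the training data and reapplied to new data by hand. When different transformations should apply to different columns, scikit-learn's ColumnTransformer is another option. In some cases, domain knowledge might allow you to hand-craft specific features, avoiding the need for automated feature extraction techniques.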
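For comparison, the manual combination can be sketched as follows. The random data and variable names are assumptions for illustration; the point is that each transformer has to be fitted on the training data and then applied to the test data separately before stacking.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import PolynomialFeatures

# Illustrative train/test data (assumed for this sketch only)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 4))
X_test = rng.normal(size=(10, 4))

# Manual approach: fit each transformer on the training data, then stack
pca = PCA(n_components=2).fit(X_train)
poly = PolynomialFeatures(degree=2).fit(X_train)
manual_test = np.hstack([pca.transform(X_test), poly.transform(X_test)])

# FeatureUnion approach: one object handles fitting and stacking
union = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("poly", PolynomialFeatures(degree=2)),
]).fit(X_train)
union_test = union.transform(X_test)

# 2 PCA columns + 15 polynomial columns
print(manual_test.shape, union_test.shape)
same = np.allclose(manual_test, union_test)
```

Both routes should yield the same matrix, but only the FeatureUnion version can be dropped into a Pipeline and tuned with GridSearchCV without extra bookkeeping.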

Pros of FeatureUnion

  • Flexibility: Allows combining diverse feature engineering techniques.
  • Modularity: Enhances code organization and maintainability.

Cons of FeatureUnion

  • Complexity: Can increase pipeline complexity.
  • Hyperparameter Tuning: Requires careful hyperparameter tuning of individual components.

FAQ

  • What is the purpose of `FeatureUnion`?

    FeatureUnion combines the results of multiple feature extraction techniques into a single feature space.
  • How does `GridSearchCV` work with a pipeline?

    GridSearchCV systematically searches through a grid of hyperparameters, training and evaluating the pipeline for each combination of parameters using cross-validation.
  • How do I access the parameters of individual steps within a pipeline in `GridSearchCV`?

    You can access the parameters of individual steps using a double underscore notation, e.g., 'features__pca__pca__n_components' refers to the n_components parameter of the PCA step within the pca pipeline within the features FeatureUnion.
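If you are unsure of a parameter name, you can list the names the pipeline accepts with `get_params()`. The sketch below rebuilds a reduced version of the pipeline with only the PCA branch (a simplification assumed here for brevity) and filters the names belonging to the PCA step.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Reduced version of the main pipeline: only the PCA branch is kept
features = FeatureUnion([("pca", Pipeline([("pca", PCA())]))])
pipeline = Pipeline([
    ("features", features),
    ("linear_regression", LinearRegression()),
])

# Every key returned here is a valid name for param_grid or set_params
names = sorted(pipeline.get_params().keys())
pca_names = [n for n in names if n.startswith("features__pca__pca__")]
print(pca_names)

# The same notation works for setting a value directly
pipeline.set_params(features__pca__pca__n_components=2)
value = pipeline.get_params()["features__pca__pca__n_components"]
print(value)  # 2
```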