
Text Classification with TensorFlow and Keras

This code snippet demonstrates how to perform text classification using TensorFlow and Keras. It covers loading a dataset, preprocessing text data, building a simple neural network model, training the model, and evaluating its performance.

Importing Necessary Libraries

This section imports the required libraries. TensorFlow and Keras are used for building and training the neural network. scikit-learn's `train_test_split` helps divide the dataset into training and testing sets. NumPy is used for numerical operations, and pandas for data manipulation.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

Loading and Preparing the Dataset

This section loads the dataset (assuming a CSV file named 'spam.csv'), keeps only the label and text columns ('v1' and 'v2'), renames them, maps the categorical labels to numerical values (spam=1, ham=0), and then splits the data into training and testing sets using `train_test_split`. The `test_size` parameter specifies that 20% of the data will be used for testing. `random_state` ensures reproducibility.

# Load the dataset
data = pd.read_csv('spam.csv', encoding='latin-1')
data = data[['v1', 'v2']]
data.rename(columns={'v1': 'label', 'v2': 'text'}, inplace=True)

# Convert labels to numerical values (spam: 1, ham: 0)
data['label'] = data['label'].map({'spam': 1, 'ham': 0})

# Split data into training and testing sets
text = data['text'].values
labels = data['label'].values
text_train, text_test, labels_train, labels_test = train_test_split(text, labels, test_size=0.2, random_state=42)

Text Vectorization

This section vectorizes the text using Keras' `Tokenizer`, which maps each word to an integer token. `num_words` limits the vocabulary to the 10,000 most frequent words, and `oov_token` handles out-of-vocabulary words. `texts_to_sequences` converts each text into a sequence of integers. Since the sequences have different lengths, `pad_sequences` pads or truncates them to a fixed length (`max_len`). `padding='post'` adds padding at the end of the sequences, and `truncating='post'` truncates sequences from the end if they exceed `max_len`.

# Text vectorization using Tokenizer
max_words = 10000 # Maximum number of words to keep
tokenizer = keras.preprocessing.text.Tokenizer(num_words=max_words, oov_token='<OOV>')
tokenizer.fit_on_texts(text_train)

# Convert text to sequences of integers
train_sequences = tokenizer.texts_to_sequences(text_train)
test_sequences = tokenizer.texts_to_sequences(text_test)

# Pad sequences to have the same length
max_len = 200 # Maximum sequence length
train_padded = keras.preprocessing.sequence.pad_sequences(train_sequences, maxlen=max_len, padding='post', truncating='post')
test_padded = keras.preprocessing.sequence.pad_sequences(test_sequences, maxlen=max_len, padding='post', truncating='post')

Building the Model

Here, a sequential model is built using Keras. The `Embedding` layer converts integer tokens into dense vectors of a fixed size (16 in this case). `GlobalAveragePooling1D` reduces the dimensionality of the embedding output by averaging across the sequence length. The `Dense` layers are fully connected: the first has 24 neurons with a ReLU activation function, and the output layer has 1 neuron with a sigmoid activation function, which is suitable for binary classification. The model is compiled with the Adam optimizer, binary cross-entropy loss (appropriate for binary classification), and accuracy as the evaluation metric.

# Build the model
model = keras.Sequential([
    layers.Embedding(max_words, 16, input_length=max_len),
    layers.GlobalAveragePooling1D(),
    layers.Dense(24, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

Training the Model

The model is trained using the `fit` method. `epochs` specifies the number of complete passes over the training data. `batch_size` determines the number of samples processed in each batch. `validation_split` sets aside 20% of the training data for validation during training.

# Train the model
epochs = 10
batch_size = 32

history = model.fit(train_padded, labels_train,
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_split=0.2)

Evaluating the Model

The model's performance is evaluated on the test set using the `evaluate` method. It calculates the loss and accuracy on the test data and prints the results.

# Evaluate the model
loss, accuracy = model.evaluate(test_padded, labels_test)
print(f'Loss: {loss:.4f}')
print(f'Accuracy: {accuracy:.4f}')

Real-Life Use Case

This type of text classification model can be used for spam detection, sentiment analysis, topic categorization, and more. For example, it could be used to filter unwanted emails in a mailbox or to analyze customer reviews to understand their sentiment towards a product or service.
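
For instance, the trained model can score new messages directly. The following is a minimal inference sketch; it assumes the `tokenizer`, `model`, and `max_len` defined above, and the example messages are invented for illustration.

# Classify new messages with the trained tokenizer and model
# (the example messages below are invented for illustration)
new_messages = [
    "Congratulations! You have won a free prize, call now!",
    "Are we still meeting for lunch tomorrow?"
]
new_sequences = tokenizer.texts_to_sequences(new_messages)
new_padded = keras.preprocessing.sequence.pad_sequences(new_sequences, maxlen=max_len, padding='post', truncating='post')

# Each prediction is a spam probability; 0.5 is used as the decision threshold
predictions = model.predict(new_padded)
for message, prob in zip(new_messages, predictions):
    label = 'spam' if prob[0] >= 0.5 else 'ham'
    print(f'{label} ({prob[0]:.2f}): {message}')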

Best Practices

  • Data Preprocessing: Proper data cleaning and preprocessing are crucial for model performance.
  • Hyperparameter Tuning: Experiment with different hyperparameters such as the number of layers, neurons, learning rate, and batch size to optimize model performance.
  • Regularization: Use regularization techniques (e.g., dropout, L1/L2 regularization) to prevent overfitting.
  • Monitoring: Monitor the training process for signs of overfitting or underfitting; a sketch combining dropout with early stopping follows this list.
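
As one way to act on the regularization and monitoring points above, the following sketch adds a `Dropout` layer and an `EarlyStopping` callback to the model defined earlier. The dropout rate, patience, and epoch count are illustrative, not tuned.

# Variant of the model above with dropout, trained with early stopping
regularized_model = keras.Sequential([
    layers.Embedding(max_words, 16, input_length=max_len),
    layers.GlobalAveragePooling1D(),
    layers.Dropout(0.3),  # randomly zero 30% of activations during training
    layers.Dense(24, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
regularized_model.compile(optimizer='adam',
                          loss='binary_crossentropy',
                          metrics=['accuracy'])

# Stop when validation loss stops improving and keep the best weights
early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss',
                                               patience=3,
                                               restore_best_weights=True)
regularized_model.fit(train_padded, labels_train,
                      epochs=30,
                      batch_size=batch_size,
                      validation_split=0.2,
                      callbacks=[early_stopping])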

Interview Tip

Be prepared to discuss the different layers used in the model, the activation functions, the loss function, the optimizer, and the evaluation metrics. Also, understand the importance of data preprocessing and hyperparameter tuning.

When to Use This Snippet

Use this snippet when you need to classify short texts into a small set of categories. It is a good starting point for basic text classification tasks and can be extended for more complex scenarios.

Memory Footprint

The memory footprint depends on the size of the vocabulary, the sequence length, and the model architecture. Larger vocabularies and longer sequences require more memory, and the number of layers and neurons in the model also affects memory usage.
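
For the model above, the `Embedding` layer alone holds max_words × 16 = 160,000 weights, which dominates the parameter count. A quick way to inspect this, assuming the model built earlier and roughly 4 bytes per float32 parameter:

# Print layer-by-layer parameter counts for the model built above
model.summary()

# Rough estimate of the weight memory (4 bytes per float32 parameter)
total_params = model.count_params()
print(f'Approximate weight memory: {total_params * 4 / 1024 ** 2:.2f} MB')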

Alternatives

  • Other Deep Learning Models: Recurrent Neural Networks (RNNs) like LSTMs and GRUs, and Transformers like BERT can be used for more complex text classification tasks.
  • Traditional Machine Learning Models: Naive Bayes, Support Vector Machines (SVMs), and Logistic Regression can also be used for text classification; a simple baseline sketch follows this list.
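
As an illustration of the traditional route, here is a minimal TF-IDF plus logistic regression baseline. It reuses the raw text splits created earlier; the vectorizer settings and `max_iter` value are illustrative, not tuned.

# Baseline: TF-IDF features + logistic regression (scikit-learn)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer(max_features=max_words)
X_train = vectorizer.fit_transform(text_train)  # fit only on training texts
X_test = vectorizer.transform(text_test)

baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, labels_train)
print(f'Baseline accuracy: {baseline.score(X_test, labels_test):.4f}')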

Pros

  • Relatively Simple: Easy to understand and implement.
  • Good Starting Point: Provides a good baseline for text classification tasks.
  • Scalable: Can be scaled to handle larger datasets.

Cons

  • Limited Complexity: May not perform well on complex text classification tasks.
  • Requires Preprocessing: Requires careful data preprocessing.
  • Overfitting: Prone to overfitting if not properly regularized.

FAQ

  • What is the purpose of the Embedding layer?

    The Embedding layer converts integer tokens into dense vectors of fixed size. It learns a vector representation for each word in the vocabulary.
  • Why do we pad the sequences?

    We pad the sequences because the inputs in a batch must share the same shape, so the model can process them as a single tensor. Padding adds zeros to the shorter sequences (and truncation shortens the longer ones) so that all sequences have length `max_len`.
  • What is the difference between `padding='pre'` and `padding='post'`?

    `padding='pre'` adds padding at the beginning of the sequences, while `padding='post'` adds padding at the end. `padding='pre'` is generally recommended for recurrent neural networks, because it keeps the real tokens closest to the final time step; `padding='post'` works well for models like the one above that pool over the whole sequence. A short demonstration follows this FAQ.
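
A short demonstration of the two padding modes on an invented toy sequence:

# Compare pre- vs post-padding (the token values are made up)
toy = [[5, 8, 2]]
print(keras.preprocessing.sequence.pad_sequences(toy, maxlen=6, padding='pre'))   # [[0 0 0 5 8 2]]
print(keras.preprocessing.sequence.pad_sequences(toy, maxlen=6, padding='post'))  # [[5 8 2 0 0 0]]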