Python > Modules and Packages > Standard Library > Concurrency and Parallelism (`threading`, `multiprocessing` modules)

Basic Multiprocessing Example

This snippet showcases the multiprocessing module for parallel execution. It spawns two processes that each run the square_numbers function, printing the squares of the numbers 1 through 5. Unlike threading, multiprocessing sidesteps the Global Interpreter Lock (GIL), enabling true parallel execution on multi-core systems.

Code

The multiprocessing module creates separate processes, each with its own Python interpreter and memory space. The square_numbers function computes and prints the square of each number from 1 to 5. We create two Process objects, start them, and call process.join() so the main program waits for both children to complete. The if __name__ == "__main__": guard matters here: on platforms that use the spawn start method (such as Windows and macOS), child processes re-import the main module, and without the guard they would try to spawn processes of their own. This approach leverages multiple CPU cores for true parallel execution, which is beneficial for CPU-bound tasks.

import multiprocessing
import time

def square_numbers(process_id):
    for i in range(1, 6):
        time.sleep(0.1)  # Simulate some work
        print(f"Process {process_id}: {i} squared = {i*i}")

if __name__ == "__main__":
    # Create two processes
    process1 = multiprocessing.Process(target=square_numbers, args=(1,))
    process2 = multiprocessing.Process(target=square_numbers, args=(2,))

    # Start the processes
    process1.start()
    process2.start()

    # Wait for the processes to finish
    process1.join()
    process2.join()

    print("All processes finished.")

Concepts Behind the Snippet

  • Processes: Processes are independent units of execution that have their own memory space. This means that processes do not share memory directly, which helps to avoid many of the concurrency issues that can arise with threads.
  • Parallelism: Parallelism refers to the ability of a program to execute multiple tasks simultaneously on multiple CPU cores. The multiprocessing module enables true parallelism in Python by creating separate processes for each task.
  • Inter-Process Communication (IPC): Since processes do not share memory directly, they need IPC mechanisms to communicate and share data. Common IPC mechanisms include queues, pipes, and shared memory; a small Queue-based sketch follows this list.
  • multiprocessing.Process: This class is used to create new processes. You specify the target function and any arguments to pass to that function.
  • process.start(): This method starts the process's execution.
  • process.join(): This method blocks the calling process (in this case, the main process) until the process whose join() method is called completes its execution.
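
As a minimal illustration of the Queue-based IPC mentioned above, the sketch below has a single worker process compute squares and send them back to the parent through a multiprocessing.Queue. The produce_squares function and the hard-coded list of numbers are illustrative choices, not part of the snippet above.

import multiprocessing

def produce_squares(numbers, queue):
    # Worker process: compute each square and send it back through the queue
    for n in numbers:
        queue.put((n, n * n))

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    worker = multiprocessing.Process(target=produce_squares, args=([1, 2, 3, 4, 5], queue))
    worker.start()

    # Collect the five results sent by the worker
    for _ in range(5):
        number, square = queue.get()
        print(f"{number} squared = {square}")

    worker.join()

queue.get() blocks until an item is available, so the loop naturally waits for the worker to produce each result.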

Real-Life Use Case

Consider a scenario where you need to perform a computationally intensive task, such as image processing or scientific simulations. You can divide the task into smaller subtasks and assign each subtask to a separate process. This allows you to utilize all available CPU cores and significantly reduce the overall execution time. Machine Learning model training often uses multiprocessing to speed up data preprocessing.
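
As a rough sketch of this chunk-and-distribute pattern, the example below splits a CPU-bound computation (summing squares over a large range, a stand-in for real image-processing or simulation work) across four worker processes using multiprocessing.Pool. The chunking scheme and the worker count of four are arbitrary illustrative choices.

import multiprocessing

def sum_of_squares(chunk):
    # CPU-bound subtask: square every number in the chunk and sum the results
    return sum(n * n for n in chunk)

if __name__ == "__main__":
    numbers = list(range(1, 1_000_001))
    # Split the workload into four roughly equal chunks (round-robin slicing)
    chunks = [numbers[i::4] for i in range(4)]

    # The pool hands one chunk to each worker and collects the partial sums
    with multiprocessing.Pool(processes=4) as pool:
        partial_sums = pool.map(sum_of_squares, chunks)

    print("Total:", sum(partial_sums))

pool.map blocks until every chunk has been processed and returns the results in the same order as the inputs, so combining the partial sums at the end is straightforward.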

Best Practices

  • Avoid Shared State: While multiprocessing avoids direct memory sharing, be careful with external shared resources such as files or databases. Coordinate access to these resources, for example with a multiprocessing.Lock, to prevent corruption (see the sketch after this list).
  • Use Queues for Communication: For sharing data between processes, use the multiprocessing.Queue class. It's a thread-safe and process-safe queue that allows you to pass data between processes.
  • Be Mindful of Overhead: Creating and managing processes has a higher overhead than creating and managing threads. Don't use multiprocessing for very short or simple tasks, as the overhead might outweigh the benefits of parallelism.
  • Use a Process Pool: For running a large number of tasks, use the multiprocessing.Pool class. It provides a convenient way to distribute tasks among a fixed pool of worker processes, as shown in the Real-Life Use Case sketch above.
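
To illustrate the first practice, here is a minimal sketch of coordinating access to a shared file with a multiprocessing.Lock. The append_line helper and the results.txt file name are hypothetical; the point is simply that each worker acquires the lock before writing.

import multiprocessing

def append_line(lock, path, text):
    # Acquire the lock before touching the shared file so writes do not interleave
    with lock:
        with open(path, "a") as f:
            f.write(text + "\n")

if __name__ == "__main__":
    lock = multiprocessing.Lock()
    path = "results.txt"  # hypothetical shared output file
    workers = [
        multiprocessing.Process(target=append_line, args=(lock, path, f"line from worker {i}"))
        for i in range(4)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

Note that synchronization primitives such as Lock must be passed to the child process as constructor arguments (or inherited); they cannot be sent through a Queue.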

Interview Tip

Understand the difference between threading and multiprocessing, and when to use each. Explain how multiprocessing overcomes the GIL limitation. Discuss common IPC mechanisms and the challenges of managing communication between processes.

When to Use Multiprocessing

Multiprocessing is ideal for CPU-bound tasks that can be divided into independent subtasks. Examples include numerical computations, image processing, video encoding, and parallel data analysis. If your code is primarily waiting for I/O, then threading or asyncio might be more appropriate.

Memory Footprint

Processes have a larger memory footprint compared to threads because each process has its own memory space. This can be a concern if you are creating a large number of processes or dealing with large datasets.

Alternatives

  • threading: Suitable for I/O-bound tasks, where threads spend most of their time blocked on I/O and the GIL is released while they wait, so it is less of a bottleneck (see the sketch after this list).
  • asyncio: For asynchronous programming and event-driven concurrency, especially useful for handling many concurrent I/O operations efficiently.
  • Dask: For parallel computing with larger-than-memory datasets.
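
For contrast with the multiprocessing snippet above, here is a rough threading sketch for I/O-bound work. The fetch function and example URLs are placeholders that just sleep to simulate a blocking network call; in real code the blocking call would be an actual request or file read.

import threading
import time

def fetch(url):
    # Placeholder for a blocking network call; the GIL is released while sleeping
    time.sleep(0.5)
    print(f"Fetched {url}")

if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    threads = [threading.Thread(target=fetch, args=(url,)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("All downloads finished.")

All three simulated downloads overlap, so the script finishes in roughly 0.5 seconds instead of 1.5, even though only one thread executes Python bytecode at a time.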

Pros

  • True Parallelism: Bypasses the GIL, enabling true parallel execution on multi-core systems.
  • Fault Isolation: If one process crashes, it does not affect other processes.
  • Improved Performance: Can significantly improve performance for CPU-bound tasks.

Cons

  • Higher Overhead: Creating and managing processes has a higher overhead than threads.
  • Memory Consumption: Processes have a larger memory footprint.
  • Complex Communication: Requires explicit IPC mechanisms to communicate between processes.

FAQ

  • How does multiprocessing overcome the GIL limitation?

    Each process created by multiprocessing has its own Python interpreter and its own memory space. This means that the GIL in one process does not affect other processes. Each process can execute Python bytecode in parallel on different CPU cores.
  • How can I share data between processes?

    You can share data between processes using IPC mechanisms such as queues (multiprocessing.Queue), pipes (multiprocessing.Pipe), or shared memory (multiprocessing.Value, multiprocessing.Array). Queues are the most common and convenient way to pass data between processes; a shared-memory sketch using multiprocessing.Value appears at the end of this FAQ.
  • What is a multiprocessing.Pool, and when should I use it?

    A multiprocessing.Pool manages a fixed set of worker processes and distributes tasks among them. Use a Pool when you have many independent tasks to perform and do not want to create and manage individual Process objects yourself. The Pool class handles worker creation, task distribution, and result collection for you.
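
As referenced above, here is a minimal shared-memory sketch using multiprocessing.Value. The increment function and the choice of four workers performing 1,000 increments each are illustrative assumptions.

import multiprocessing

def increment(counter):
    # Each worker adds to the shared counter; the built-in lock prevents lost updates
    for _ in range(1000):
        with counter.get_lock():
            counter.value += 1

if __name__ == "__main__":
    counter = multiprocessing.Value("i", 0)  # shared 32-bit signed integer, initialized to 0
    workers = [multiprocessing.Process(target=increment, args=(counter,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("Final count:", counter.value)  # 4 workers x 1000 increments = 4000

Without the get_lock() call, the read-modify-write on counter.value could interleave between processes and some increments would be lost.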