Basic Process Creation with Multiprocessing
This snippet demonstrates how to create and start a new process using the `multiprocessing` module. It shows the fundamental steps involved in parallel execution in Python.
Code Example
This code creates three separate processes, each of which executes the `worker` function. The `multiprocessing.Process` class is used to create new processes. The `target` argument specifies the function to be executed in the new process, and the `args` argument provides the arguments to that function. The `start()` method launches the process, and the `join()` method waits for the process to complete. The `if __name__ == '__main__':` guard is crucial: under the spawn start method (the default on Windows and macOS), child processes re-import the main module, and without the guard the process-creation code would run again in every child, leading to runaway process creation.
import multiprocessing
import time

def worker(num):
    print(f'Worker {num}: Starting')
    time.sleep(2)  # Simulate some work
    print(f'Worker {num}: Finishing')

if __name__ == '__main__':
    processes = []
    for i in range(3):
        p = multiprocessing.Process(target=worker, args=(i,))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()  # Wait for all processes to finish
    print('All workers finished.')
Concepts Behind the Snippet
The core idea is to achieve parallelism by distributing tasks across multiple processes. Each process has its own memory space, meaning variables and data are not shared by default. This avoids common concurrency issues like race conditions that are prevalent in multithreaded environments. The `multiprocessing` module creates entirely new Python interpreters, each with its own Global Interpreter Lock (GIL), allowing true parallel execution on multi-core systems. The GIL limits parallelism in multithreaded Python programs because only one thread can hold control of the interpreter at any given time; separate processes circumvent this limitation.
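To make the isolation concrete, here is a minimal sketch (the `counter` global and `increment` function are illustrative, not part of the snippet above) showing that a child process modifies its own copy of a global while the parent's copy is untouched:

import multiprocessing

counter = 0  # Each process gets its own copy of this global

def increment():
    global counter
    counter += 100
    print(f'Child sees counter = {counter}')   # Prints 100 in the child

if __name__ == '__main__':
    p = multiprocessing.Process(target=increment)
    p.start()
    p.join()
    print(f'Parent sees counter = {counter}')  # Still 0 in the parent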
Real-Life Use Case
Imagine you're building a web scraper that needs to fetch data from hundreds of websites. Scraping each website sequentially would be very slow. By using `multiprocessing`, you can scrape multiple websites concurrently, significantly reducing the overall scraping time. Another example is image or video processing, where large files can be split into smaller chunks and processed in parallel by different processes.
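As a sketch of the scraping scenario (the `fetch` function and URL list are hypothetical stand-ins for real scraping code), a `multiprocessing.Pool` can fan the work out across worker processes:

import multiprocessing

def fetch(url):
    # Placeholder for real scraping logic (HTTP request + parsing)
    return f'fetched {url}'

if __name__ == '__main__':
    urls = [f'https://example.com/page/{i}' for i in range(10)]
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(fetch, urls)  # Distributes URLs across 4 workers
    print(results)

For purely I/O-bound scraping, the Alternatives section below notes that threads or asyncio are often a lighter-weight fit; processes shine when parsing or other CPU work dominates.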
Best Practices
- Always use the `if __name__ == '__main__':` guard when using `multiprocessing`, especially on Windows. This prevents child processes from re-running the process-creation code when they re-import the main script.
- Consider using a process pool (`multiprocessing.Pool`) for managing a group of worker processes efficiently.
- Be mindful of inter-process communication (IPC). Sharing data between processes requires explicit mechanisms like pipes, queues, or shared memory; a queue-based sketch follows this list.
- Keep the workload of each process relatively independent to minimize communication overhead.
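A minimal sketch of queue-based IPC (the `producer`/`consumer` names are illustrative): a `multiprocessing.Queue` passes messages safely between processes, with a sentinel value telling the consumer to stop.

import multiprocessing

def producer(queue):
    for i in range(5):
        queue.put(i)        # Send work items to the consumer
    queue.put(None)         # Sentinel: tells the consumer to stop

def consumer(queue):
    while True:
        item = queue.get()  # Blocks until an item is available
        if item is None:
            break
        print(f'Consumed {item}')

if __name__ == '__main__':
    q = multiprocessing.Queue()
    p1 = multiprocessing.Process(target=producer, args=(q,))
    p2 = multiprocessing.Process(target=consumer, args=(q,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()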
Interview Tip
Be prepared to explain the difference between threads and processes in Python. Focus on the GIL limitation in multithreading and how `multiprocessing` overcomes it by creating separate Python interpreters. Also, discuss the challenges of inter-process communication and the mechanisms available for sharing data between processes.
When to Use Processes
Use processes when you need true parallel execution on multi-core systems and are dealing with CPU-bound tasks. Processes are also suitable when you need isolation between different parts of your application, as each process has its own memory space. Choose processes when data sharing is minimal, as data needs to be serialized and deserialized when sent between processes, adding overhead.
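For CPU-bound work, sizing the pool to the machine's core count is a common starting point. A sketch (the `cpu_heavy` function is illustrative):

import multiprocessing

def cpu_heavy(n):
    # Illustrative CPU-bound task: sum of squares
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    inputs = [10_000_000] * 8
    workers = multiprocessing.cpu_count()  # One worker per core is a common default
    with multiprocessing.Pool(processes=workers) as pool:
        totals = pool.map(cpu_heavy, inputs)
    print(totals)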
Memory Footprint
Processes generally have a higher memory footprint than threads because each process runs its own Python interpreter and holds its own copy of the program's data (under the fork start method, memory pages are initially shared copy-on-write, but they diverge as the child writes to them). This can be a significant consideration if you need to create a large number of processes.
Alternatives
Alternatives to `multiprocessing` include multithreading (using the `threading` module), asynchronous programming (using `asyncio`), and distributed computing frameworks like Dask or Spark. Multithreading is suitable for I/O-bound tasks where the GIL is not a major bottleneck. Asynchronous programming is a good choice for concurrent I/O on a single-threaded event loop. Dask and Spark are designed for large-scale data processing and distributed computing across multiple machines.
Pros
- True parallel execution on multi-core systems for CPU-bound tasks, since each process has its own interpreter and GIL.
- Memory isolation between processes avoids race conditions on shared state.
Cons
- Higher memory footprint than threads, since each process carries its own interpreter and data.
- Sharing data requires explicit IPC, and serializing data between processes adds overhead.
FAQ
- What is the Global Interpreter Lock (GIL)?
The GIL is a mechanism in CPython that allows only one thread to hold control of the Python interpreter at any given time. This prevents multiple threads from executing Python bytecode in parallel, effectively limiting true parallelism in multithreaded Python programs. Processes circumvent this limitation because each process has its own Python interpreter and GIL.
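A quick way to see the GIL's effect (a sketch; timings vary by machine, but on a multi-core system the process version typically finishes in a fraction of the threaded time for CPU-bound work):

import multiprocessing
import threading
import time

def burn(n):
    # CPU-bound loop: holds the GIL for its entire duration
    while n > 0:
        n -= 1

if __name__ == '__main__':
    N = 10_000_000

    start = time.perf_counter()
    threads = [threading.Thread(target=burn, args=(N,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f'Threads:   {time.perf_counter() - start:.2f}s')  # Serialized by the GIL

    start = time.perf_counter()
    procs = [multiprocessing.Process(target=burn, args=(N,)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(f'Processes: {time.perf_counter() - start:.2f}s')  # Runs in parallel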
- How do I share data between processes?
You can share data between processes using mechanisms like `multiprocessing.Queue`, `multiprocessing.Pipe`, and `multiprocessing.sharedctypes`. Queues provide a process-safe way to pass messages between processes. Pipes allow bidirectional communication between two processes. Shared memory (`sharedctypes`) allows processes to access and modify the same memory region. Choose the appropriate mechanism based on your data-sharing needs and the level of synchronization required.
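A minimal shared-memory sketch: `multiprocessing.Value` (built on `sharedctypes`) gives several processes access to one counter, with its associated lock guarding concurrent updates.

import multiprocessing

def add_many(shared, n):
    for _ in range(n):
        with shared.get_lock():  # Guard the read-modify-write against races
            shared.value += 1

if __name__ == '__main__':
    total = multiprocessing.Value('i', 0)  # 'i' = shared C int, initially 0
    procs = [multiprocessing.Process(target=add_many, args=(total, 10_000))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(total.value)  # 40000: all increments land in the shared int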