What are processes (`multiprocessing`)?
Understanding Processes with multiprocessing in Python
This tutorial explores the concept of processes in Python, specifically how to leverage the `multiprocessing` module to achieve true parallelism. We will delve into the creation, management, and communication between processes, providing practical code examples and explanations.
Introduction to Processes and multiprocessing
What are Processes?
In the context of computing, a process is an instance of a program in execution. It has its own memory space, resources, and system threads. Unlike threads within a single process, processes run independently and do not share memory by default. This isolation prevents issues like race conditions and deadlocks that can occur when multiple threads access shared resources concurrently.
Why use multiprocessing?
The `multiprocessing` module in Python allows you to spawn processes much like threads. However, the key difference is that processes sidestep the Global Interpreter Lock (GIL) limitation of CPython, enabling true parallel execution on multi-core processors. This is crucial for CPU-bound tasks, where significant performance gains can be achieved by distributing the workload across multiple cores.
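To make this concrete, here is a minimal timing sketch; the worker function, problem size, and number of repeats are illustrative choices, not part of any standard API. It runs the same CPU-bound function first sequentially and then split across separate processes:

```python
import multiprocessing
import time

def count_squares(n):
    """CPU-bound work: sum of squares up to n."""
    return sum(i * i for i in range(n))

def run_sequential(n, repeats):
    for _ in range(repeats):
        count_squares(n)

def run_with_processes(n, repeats):
    procs = [multiprocessing.Process(target=count_squares, args=(n,))
             for _ in range(repeats)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == '__main__':
    n, repeats = 2_000_000, 4  # illustrative sizes

    start = time.perf_counter()
    run_sequential(n, repeats)
    print(f'Sequential: {time.perf_counter() - start:.2f}s')

    start = time.perf_counter()
    run_with_processes(n, repeats)
    print(f'Processes:  {time.perf_counter() - start:.2f}s')
```

On a machine with at least four idle cores, the process-based run would typically finish in a fraction of the sequential time, minus the cost of starting the processes.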
Creating and Starting Processes
Code Explanation
- `import multiprocessing`: We start by importing the necessary module.
- `worker` function: The `worker` function represents the task that each process will execute. In this example, it performs a computationally intensive task (calculating the sum of squares) to simulate CPU-bound work.
- `Process` objects: Inside the `if __name__ == '__main__':` block (which is essential for multiprocessing on some operating systems like Windows), we create a list of `Process` objects. Each `Process` is initialized with:
  - `target`: The function to be executed (`worker`).
  - `args`: A tuple of arguments to pass to the `worker` function.
- `start()`: We call the `start()` method on each `Process`. This creates a new process and begins executing the `worker` function within it.
- `join()`: The `join()` method blocks the main process until the corresponding process has completed its execution. This ensures that the main program waits for all worker processes to finish before exiting.
```python
import multiprocessing

def worker(num):
    """Worker function to be executed in a separate process."""
    print(f'Process {num}: Starting')
    # Simulate some CPU-bound work
    result = sum(i * i for i in range(1000000))
    print(f'Process {num}: Finished, Result: {result}')

if __name__ == '__main__':
    processes = []
    for i in range(3):
        p = multiprocessing.Process(target=worker, args=(i,))
        processes.append(p)
        p.start()

    for p in processes:
        p.join()  # Wait for all processes to complete

    print('All processes completed.')
```
Communication between Processes (Pipes)
Using Pipes for Inter-Process Communication
Pipes provide a simple mechanism for communication between two processes. The `multiprocessing.Pipe()` function creates a pair of connected `Connection` objects representing the two ends of the pipe; by default the pipe is two-way, and `Pipe(duplex=False)` gives you a one-way pipe. One process can send data through one end (`conn.send()`), and another process can receive data from the other end (`conn.recv()`).
Code Explanation
- `parent_conn, child_conn = multiprocessing.Pipe()` creates two connection objects. Data written to `child_conn` can be read from `parent_conn`, and vice versa.
- The `sender` function sends a list of messages through its connection, and the `receiver` function receives and prints messages until it encounters an `EOFError` (indicating the connection has been closed).
```python
import multiprocessing

def sender(conn, messages):
    """Sends messages through the connection."""
    for message in messages:
        conn.send(message)
        print(f'Sent: {message}')
    conn.close()

def receiver(conn):
    """Receives messages from the connection."""
    while True:
        try:
            message = conn.recv()
            print(f'Received: {message}')
        except EOFError:
            break
    print('Receiver finished.')

if __name__ == '__main__':
    parent_conn, child_conn = multiprocessing.Pipe()
    messages_to_send = ['Hello', 'from', 'process', 'land!']

    p1 = multiprocessing.Process(target=sender, args=(child_conn, messages_to_send))
    p2 = multiprocessing.Process(target=receiver, args=(parent_conn,))

    p1.start()
    p2.start()

    # The main process must close its copy of the sending end;
    # otherwise the receiver can never see EOFError.
    child_conn.close()

    p1.join()
    p2.join()
    print('Done!')
```
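The example above uses the default two-way pipe. If you want to enforce a single direction, `multiprocessing.Pipe(duplex=False)` returns a receive-only end and a send-only end. A minimal sketch, assuming the `sender` function from the previous snippet is defined in the same module:

```python
import multiprocessing

if __name__ == '__main__':
    # With duplex=False the first connection can only receive
    # and the second can only send.
    recv_conn, send_conn = multiprocessing.Pipe(duplex=False)

    p = multiprocessing.Process(target=sender, args=(send_conn, ['one', 'way', 'only']))
    p.start()
    send_conn.close()  # the parent keeps only the receiving end

    while True:
        try:
            print(f'Received: {recv_conn.recv()}')
        except EOFError:  # raised once every copy of the sending end is closed
            break
    p.join()
```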
Communication between Processes (Queues)
Using Queues for Inter-Process Communication
`multiprocessing.Queue` provides a thread-safe, process-safe FIFO queue. It is often preferred over pipes for more complex communication patterns: queues automatically handle locking, making it easier to exchange data safely between processes.
Code Explanation
- `q = multiprocessing.Queue()` creates a new queue object.
- The `producer` function puts items into the queue, and the `consumer` function retrieves and processes items from the queue. The producer adds `None` to the queue to signal the consumer to stop.
```python
import multiprocessing

def producer(queue, items):
    """Adds items to the queue."""
    for item in items:
        queue.put(item)
        print(f'Produced: {item}')
    queue.put(None)  # Signal the consumer to stop

def consumer(queue):
    """Consumes items from the queue."""
    while True:
        item = queue.get()
        if item is None:
            break
        print(f'Consumed: {item}')
    print('Consumer finished.')

if __name__ == '__main__':
    q = multiprocessing.Queue()
    items_to_produce = ['Apple', 'Banana', 'Cherry']

    p1 = multiprocessing.Process(target=producer, args=(q, items_to_produce))
    p2 = multiprocessing.Process(target=consumer, args=(q,))

    p1.start()
    p2.start()

    p1.join()
    p2.join()
    print('Done!')
```
Concepts Behind the Snippet
The snippets illustrate core concepts of process-based concurrency:
- Process creation and lifecycle management with `multiprocessing.Process`, `start()`, and `join()`.
- Inter-process communication (IPC) via pipes and queues, since processes do not share memory by default.
Real-Life Use Case Section
Consider a scenario where you need to process a large image dataset. Instead of processing images sequentially in a single process, you can distribute the processing across multiple processes using `multiprocessing`. Each process can handle a subset of the images, significantly reducing the overall processing time.
Another use case is web scraping. You can use multiple processes to scrape different websites concurrently, gathering data much faster than a single-threaded scraper.
Best Practices
- Always guard process creation with `if __name__ == '__main__':`. This is crucial, especially on Windows, to prevent recursive process creation.
- Use `multiprocessing.Pool` to efficiently reuse processes and reduce overhead (see the sketch below).
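As a minimal sketch of the `multiprocessing.Pool` pattern mentioned above (the `square` function and the input range are illustrative):

```python
import multiprocessing

def square(n):
    """A small CPU-bound task for each pool worker."""
    return n * n

if __name__ == '__main__':
    # A pool of 4 worker processes is created once and reused
    # for every task submitted to it.
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(square, range(10))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The pool starts its worker processes once and reuses them for every chunk of the input, which is what saves the per-task creation overhead.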
Interview Tip
Be prepared to discuss the differences between threads and processes, the limitations of the GIL, and the advantages of using `multiprocessing` for CPU-bound tasks. Also, understand the different IPC mechanisms available and their trade-offs (e.g., pipes vs. queues).
When to Use Them
Use processes when:
- The work is CPU-bound and you need true parallel execution across multiple cores, bypassing the GIL.
- You want the isolation of separate memory spaces, so that tasks do not interfere with each other's state.
- The cost of starting processes and passing data between them is small relative to the computation itself.
Memory Footprint
Each process has its own memory space, resulting in a higher memory footprint compared to threads within a single process. This is because each process needs to load its own copy of libraries and data. Consider this when designing your application, especially when dealing with very large datasets or a high number of processes.
Alternatives
Alternatives to `multiprocessing` include:
- `threading`: Suitable for I/O-bound tasks where the GIL is not a major bottleneck (a small sketch follows this list).
- `asyncio`: A single-threaded concurrency model based on coroutines, ideal for I/O-bound and high-concurrency scenarios.
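For contrast, here is a minimal sketch of the `threading` alternative for an I/O-bound workload; the `download` function and the URLs are placeholders that simply sleep instead of performing real network I/O:

```python
import threading
import time

def download(url):
    """Stand-in for an I/O-bound task (e.g., a network request)."""
    time.sleep(1)  # the GIL is released while waiting
    print(f'Finished {url}')

if __name__ == '__main__':
    urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c']
    threads = [threading.Thread(target=download, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```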
Pros
- True parallelism for CPU-bound work, since each process runs its own interpreter and is not limited by the GIL.
- Isolation: processes do not share memory by default, avoiding race conditions on shared state.
Cons
- Higher memory footprint, because each process has its own memory space.
- Data must be exchanged explicitly through IPC (pipes, queues, shared memory), which adds overhead and complexity.
- Creating and destroying processes is more expensive than creating threads.
FAQ
- What is the Global Interpreter Lock (GIL)?
  The GIL is a mutex in CPython that allows only one thread to hold control of the Python interpreter at any given time. This means that even on multi-core processors, only one thread can execute Python bytecode at a time. This limits the ability of threads to achieve true parallelism for CPU-bound tasks.
- When should I use `multiprocessing` instead of `threading`?
  Use `multiprocessing` for CPU-bound tasks where you want to achieve true parallel execution and bypass the GIL limitation. Use `threading` for I/O-bound tasks where threads spend most of their time waiting for external operations to complete.
- How do I share data between processes?
  You can share data between processes using various IPC mechanisms provided by the `multiprocessing` module, such as pipes, queues, shared memory, and managers. A short shared-memory sketch follows this FAQ.
- What are process pools?
  A process pool is a collection of worker processes that are created at the start of a program and are reused to execute multiple tasks. Using process pools can improve performance by reducing the overhead of creating and destroying processes for each task.
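As a follow-up to the data-sharing question above, here is a minimal sketch of shared memory using `multiprocessing.Value`; the counter and the increment counts are illustrative:

```python
import multiprocessing

def increment(counter, times):
    """Increment a shared integer, taking its lock for each update."""
    for _ in range(times):
        with counter.get_lock():
            counter.value += 1

if __name__ == '__main__':
    # 'i' declares a shared C int, initialized to 0.
    counter = multiprocessing.Value('i', 0)

    procs = [multiprocessing.Process(target=increment, args=(counter, 1000))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    print(counter.value)  # 4000
```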