Java > Java 8 Features > Streams API > Parallel Streams

Parallel Stream Example: Filtering and Collecting Data

This snippet demonstrates filtering and collecting data using a parallel stream. It creates a list of strings, filters the strings that start with a specific prefix, and collects the filtered strings into a new list using a parallel stream for improved performance.

Code Example

This code creates a list of strings. It then filters the strings that start with 'prefix_' and collects them into a new list. This is done using both sequential and parallel streams, and the execution time of each is printed to the console. The `parallelStream()` method converts the collection into a stream that can be processed in parallel, allowing multiple threads to filter and collect the data concurrently.

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelStreamFilterCollect {

    public static void main(String[] args) {
        // Create a list of strings
        List<String> strings = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            strings.add("prefix_" + i);
            strings.add("other_" + i);
        }

        // Sequential Stream
        long startTimeSequential = System.nanoTime();
        List<String> filteredStringsSequential = strings.stream()
                .filter(s -> s.startsWith("prefix_"))
                .collect(Collectors.toList());
        long endTimeSequential = System.nanoTime();
        long durationSequential = (endTimeSequential - startTimeSequential) / 1_000_000;

        System.out.println("Sequential Filtered String Count: " + filteredStringsSequential.size());
        System.out.println("Sequential Time: " + durationSequential + " ms");

        // Parallel Stream
        long startTimeParallel = System.nanoTime();
        List<String> filteredStringsParallel = strings.parallelStream()
                .filter(s -> s.startsWith("prefix_"))
                .collect(Collectors.toList());
        long endTimeParallel = System.nanoTime();
        long durationParallel = (endTimeParallel - startTimeParallel) / 1_000_000;

        System.out.println("Parallel Filtered String Count: " + filteredStringsParallel.size());
        System.out.println("Parallel Time: " + durationParallel + " ms");
    }
}

Concepts Behind the Snippet

Parallel Streams: As in the previous example, leverages multiple cores for parallel processing of the stream.

Filtering: The `filter()` operation selectively includes elements based on a predicate (a boolean-valued function).

Collecting: The `collect()` operation accumulates the results of the stream processing into a new collection (in this case, a `List`). The `Collectors.toList()` method provides a convenient way to collect the elements into a list.

Real-Life Use Case

This pattern is useful when you need to extract specific elements from a large dataset based on certain criteria, such as:

  • Log Analysis: Filtering log entries based on severity level or timestamp.
  • Database Queries: Processing query results to extract relevant data.
  • Web Scraping: Filtering and extracting specific information from web pages.

Best Practices

  • Consider Data Size: Parallel streams are most beneficial when processing large datasets. For small datasets, the overhead of parallelization might outweigh the performance gains.
  • Avoid Complex Filters: Complex filtering logic can reduce the efficiency of parallel streams. Try to simplify filters as much as possible.
  • Use Immutable Data: Avoid modifying the underlying data source while processing the stream. This can lead to unexpected results and race conditions.

Interview Tip

Be prepared to discuss the performance implications of using parallel streams for filtering and collecting data. Explain how the size of the dataset and the complexity of the filtering logic can impact the efficiency of parallelization.

When to use them

Use Parallel Streams for filtering and collecting when:

  • The collection to be filtered and collected is large.
  • The filter operation is CPU intensive.
  • You want to leverage multiple cores to improve performance.
  • You are using immutable data sources.

Memory Footprint

The memory footprint of this operation will primarily depend on the size of the original list and the size of the resulting filtered list. Using a parallel stream can also temporarily increase memory usage due to the creation of sub-streams and intermediate data structures.

Alternatives

Alternatives include:

  • Sequential Streams: For small datasets or when parallelization is not necessary.
  • Traditional Loops: For more fine-grained control over the filtering and collecting process.
  • Guava's Collections Utilities: Provides utilities for filtering and transforming collections.

Pros

  • Potential for significant performance improvement for large datasets.
  • Concise and readable code using the Stream API.
  • Automatic handling of thread management and task distribution.

Cons

  • Overhead of parallelization can reduce performance for small datasets.
  • Potential for race conditions if shared mutable state is used.
  • Can be more complex to debug than sequential code.

FAQ

  • Why does `collect(Collectors.toList())` not need to be synchronized in this parallel example?

    `Collectors.toList()` in the context of `parallelStream()` uses an internal concurrent data structure to accumulate the results from multiple threads. While the underlying data structure *is* being modified concurrently, the `Collectors.toList()` implementation handles the synchronization internally, so you don't need to add explicit synchronization code.
  • Can I use a different collector with parallel streams?

    Yes, you can use a variety of collectors with parallel streams, such as `Collectors.toSet()`, `Collectors.toMap()`, and `Collectors.groupingBy()`. Just ensure that the collector is thread-safe and can handle concurrent modifications.
  • What if my filtering operation is very computationally expensive?

    In this case, parallel streams can provide significant performance benefits. However, you should still measure the performance of both sequential and parallel streams to ensure that parallelization is actually improving execution time. Also, consider if the operation can be optimized separately.