Finding the Characteristics of a Stream

 

Characteristics of a Stream

The Stream API relies on a special object, an instance of the Spliterator interface. The name is a contraction of split and iterator. A spliterator plays a role in the Stream API similar to the one an iterator plays in the Collection API. Moreover, because the Stream API supports parallel processing, a spliterator object also controls how a stream splits its elements among the CPU cores that handle parallelization.

Covering this spliterator object in detail is beyond the scope of this tutorial. What you need to know is that this spliterator object carries the characteristics of a stream. These characteristics are not something you will often use, but knowing what they are will help you write better and more efficient pipelines in certain cases.

The characteristics of a stream are the following.

Characteristic    Comment
ORDERED           The order in which the elements of the stream are processed matters.
DISTINCT          There are no duplicates among the elements processed by that stream.
NONNULL           There are no null elements in that stream.
SORTED            The elements of that stream are sorted.
SIZED             The number of elements this stream processes is known.
SUBSIZED          Splitting this stream produces two SIZED streams.

There are two characteristics, IMMUTABLE and CONCURRENT, which are not covered in this tutorial.

Each of these characteristics is either set or unset on a stream when it is created.

Remember that a stream can be created in two ways.

  1. You can create a stream from a source of data. Several patterns for doing so were covered earlier.
  2. Every time you call an intermediate operation on an existing stream, you create a new stream.

The characteristics of a given stream depend on how it was created. If your stream is created from a source, then its characteristics depend on that source. If it is created by calling an intermediate operation on another stream, then they depend on the characteristics of that other stream and on the operation you are calling.

To check whether a stream has a given characteristic, you need to test a given bit in a word held by the spliterator of that stream. Let us write a predicate that can check for the ORDERED characteristic of a stream.
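The listing for this predicate is not reproduced here, so what follows is only a minimal sketch of what it could look like. The class name OrderedCheck and the constant name IS_ORDERED are hypothetical; the calls to spliterator(), characteristics(), and the Spliterator.ORDERED constant are part of the JDK.

```java
import java.util.Collection;
import java.util.List;
import java.util.Spliterator;
import java.util.function.Predicate;
import java.util.stream.Stream;

// Hypothetical holder class for this sketch.
class OrderedCheck {

    // True when the ORDERED bit is set in the characteristics word
    // of the spliterator backing the stream.
    static final Predicate<Stream<?>> IS_ORDERED =
            stream -> (stream.spliterator().characteristics() & Spliterator.ORDERED) != 0;

    public static void main(String[] args) {
        Collection<String> strings = List.of("one", "two", "three", "four", "five", "six");

        // spliterator() is a terminal operation: the stream passed to the
        // predicate cannot be used again afterwards.
        boolean ordered = IS_ORDERED.test(strings.stream());
        System.out.println("ordered = " + ordered);
    }
}
```

Because strings is declared as a Collection, swapping List.of(...) for Set.of(...) compiles without any other change.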

Running the previous code prints the following.

ordered = true

If you replace List by Set in the previous code, you can see that the ORDERED characteristic is not present anymore.

Let us present each characteristic in more details.

 

Ordered Streams

ORDERED streams are created with ordered sources of data. The first example that may come to mind is any instance of the List interface. There are others: Files.lines(path) and Pattern.splitAsStream(string) also produce ORDERED streams.

Keeping track of the order of the elements of a stream may be costly in some cases, for instance when you are using parallel streams. If you do not need this characteristic, then you can clear it by calling the unordered() intermediate method on an existing stream. This returns a new stream without this characteristic.
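As an illustration, the following sketch compares a stream built on a list with the stream returned by unordered(). The class name UnorderedExample and the helper isOrdered() are hypothetical names chosen for this example, and the behavior described is the one of the reference JDK implementation.

```java
import java.util.List;
import java.util.Spliterator;
import java.util.stream.Stream;

// Hypothetical class for this sketch.
class UnorderedExample {

    // Checks the ORDERED bit on the spliterator of the given stream.
    static boolean isOrdered(Stream<?> stream) {
        return stream.spliterator().hasCharacteristics(Spliterator.ORDERED);
    }

    public static void main(String[] args) {
        List<String> strings = List.of("one", "two", "three");

        // A new stream is opened for each check, because
        // spliterator() consumes the stream it is called on.
        System.out.println("Is strings ordered? " + isOrdered(strings.stream()));
        System.out.println("Is unordered ordered? " + isOrdered(strings.stream().unordered()));
    }
}
```

The first check should report true and the second false, since unordered() clears the ORDERED characteristic of the stream it returns.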

The example in the previous section shows you how to check for the ORDERED characteristic of a stream.

 

Sorted Streams

A SORTED stream is a stream whose elements are sorted. Such a stream can be created from a sorted source, such as a TreeSet instance, or by a call to the sorted() method. The stream implementation may use this information to avoid sorting an already sorted stream again. This optimization cannot always be applied, because a SORTED stream may be sorted again with a comparator different from the one used the first time.

There are some intermediate operations that clear the SORTED characteristic. In the following code, you can see that both sortedStrings and filteredSortedStrings are SORTED streams, whereas strings.stream() and lengths are not.
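The listing is not reproduced here. The sketch below is one possible version of it, in which the class name SortedCheck and the helper isSorted() are hypothetical. Each check rebuilds its pipeline from the source, because calling spliterator() on a stream consumes it.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Spliterator;
import java.util.stream.Stream;

// Hypothetical class for this sketch.
class SortedCheck {

    // Checks the SORTED bit on the spliterator of the given stream.
    static boolean isSorted(Stream<?> stream) {
        return stream.spliterator().hasCharacteristics(Spliterator.SORTED);
    }

    public static void main(String[] args) {
        Collection<String> strings = Arrays.asList("one", "two", "three", "four", "five", "six");

        // Source stream: a list is not a sorted source.
        System.out.println("Is strings sorted? "
                + isSorted(strings.stream()));

        // sorted() with the natural order sets the characteristic.
        System.out.println("Is sortedStrings sorted? "
                + isSorted(strings.stream().sorted()));

        // Filtering removes elements but keeps the remaining ones sorted.
        System.out.println("Is filteredSortedStrings sorted? "
                + isSorted(strings.stream().sorted().filter(s -> s.length() < 4)));

        // Mapping produces new elements, so the characteristic is cleared.
        System.out.println("Is lengths sorted? "
                + isSorted(strings.stream().sorted().filter(s -> s.length() < 4).map(String::length)));
    }
}
```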

Running the previous code prints the following.

Is strings sorted? false
Is sortedStrings sorted? true
Is filteredSortedStrings sorted? true
Is lengths sorted? false

Mapping or flatmapping a SORTED stream removes this characteristic from the resulting stream.

 

Distinct Streams

A DISTINCT stream is a stream with no duplicates among the elements it is processing. Such a characteristic is acquired when building a stream from a HashSet, for instance, or from a call to the distinct() intermediate method.

The DISTINCT characteristic is kept when filtering a stream but is lost when mapping or flatmapping a stream.

Let us examine the following example.
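The listing is not reproduced here, so the following is only a sketch of such an example. The class name DistinctCheck, the helper isDistinct(), and the content of the list are assumptions made for this illustration.

```java
import java.util.Collection;
import java.util.List;
import java.util.Spliterator;
import java.util.stream.Stream;

// Hypothetical class for this sketch.
class DistinctCheck {

    // Checks the DISTINCT bit on the spliterator of the given stream.
    static boolean isDistinct(Stream<?> stream) {
        return stream.spliterator().hasCharacteristics(Spliterator.DISTINCT);
    }

    public static void main(String[] args) {
        // Assumed sample data: a list may contain duplicates.
        Collection<String> strings = List.of("one", "two", "two", "three", "four", "five");

        // Each check opens a new pipeline, since spliterator() is terminal.
        System.out.println("Is strings distinct? "
                + isDistinct(strings.stream()));
        System.out.println("Is distinctStrings distinct? "
                + isDistinct(strings.stream().distinct()));
        System.out.println("Is filteredStrings distinct? "
                + isDistinct(strings.stream().distinct().filter(s -> s.length() < 5)));
        System.out.println("Is lengths distinct? "
                + isDistinct(strings.stream().distinct().filter(s -> s.length() < 5).map(String::length)));
    }
}
```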

Running the previous code prints the following.

Is strings distinct? false
Is distinctStrings distinct? true
Is filteredStrings distinct? true
Is lengths distinct? false

  • strings.stream() is not DISTINCT as it is built from an instance of List.
  • strings.stream().distinct() is DISTINCT as this stream is created by a call to the distinct() intermediate method.
  • filteredStrings is still DISTINCT: removing elements from a stream cannot create duplicates.
  • lengths has been mapped, so the DISTINCT characteristic is lost.

 

Non-Null Streams

A NONNULL stream is a stream that does not contain null values. There are structures from the Collection Framework that do not accept null values, including ArrayDeque and the concurrent structures like ArrayBlockingQueue, ConcurrentSkipListSet, and the concurrent set returned by a call to ConcurrentHashMap.newKeySet(). Streams created with Files.lines(path) and Pattern.splitAsStream(line) are also NONNULL streams. By contrast, a collection that accepts null values, such as the one returned by a call to HashMap.values(), does not produce a NONNULL stream.

The following code checks for some of these streams.
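The listing is not reproduced here. A minimal sketch of such a check could look like the following, where the class name NonNullCheck and the helper isNonNull() are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.HashMap;
import java.util.Spliterator;
import java.util.stream.Stream;

// Hypothetical class for this sketch.
class NonNullCheck {

    // Checks the NONNULL bit on the spliterator of the given stream.
    static boolean isNonNull(Stream<?> stream) {
        return stream.spliterator().hasCharacteristics(Spliterator.NONNULL);
    }

    public static void main(String[] args) {
        // A HashMap accepts null values, so its values() view can contain null.
        Collection<String> values = new HashMap<Integer, String>().values();
        // An ArrayDeque rejects null elements.
        Collection<String> deque = new ArrayDeque<>();

        System.out.println("Values from hash map is non null? " + isNonNull(values.stream()));
        System.out.println("ArrayDeque is non null? " + isNonNull(deque.stream()));
    }
}
```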

Running the previous code produces the following result.

Values from hash map is non null? false
ArrayDeque is non null? true

As with the previous characteristics, some intermediate operations can produce a stream with different characteristics.

 

Sized and Subsized Streams

Sized Streams

This last characteristic is very important when you want to use parallel streams. Parallel streams are covered in more detail later in this tutorial.

A SIZED stream is a stream that knows how many elements it will process. A stream created from any instance of Collection is such a stream because the Collection interface has a size() method, so getting this number is easy.

On the other hand, there are cases where you know that your stream will process a finite number of elements, but you cannot know this number unless you process the stream itself.

This is the case for streams created with the Files.lines(path) pattern. You can get the size of the text file in bytes, but this information does not tell you how many lines this text file has. You need to analyze the file to get this information.

This is also the case for the Pattern.splitAsStream(line) pattern. Knowing the number of characters there are in the string you are analyzing does not give any hint about how many elements this pattern will produce.

Subsized Streams

The SUBSIZED characteristic has to do with the way a stream is split when computed as a parallel stream. In a nutshell, the parallelization mechanism splits a stream in two parts and distributes the computation among the available CPU cores. This splitting is implemented by the instance of the Spliterator the stream uses. This implementation depends on the source of data you are using.

Suppose that you need to open a stream on an ArrayList. All the data of this list is held in the internal array of your ArrayList instance. Maybe you remember that the internal array of an ArrayList object is a compact array because when you remove an element from this array, all the following elements are moved one cell to the left so that no hole is left.

This makes splitting an ArrayList straightforward. To split an instance of ArrayList, you can just split this internal array in two parts, with the same number of elements in both parts, give or take one. This makes a stream created on an instance of ArrayList SUBSIZED: you can tell in advance how many elements will be held in each part after splitting.

Suppose now that you need to open a stream on an instance of HashSet. A HashSet stores its elements in an array, but this array is used differently than the one used by ArrayList. In fact, more than one element can be stored in a given cell of this array. There is no problem in splitting this array, but you cannot tell in advance how many elements will be held in each part without counting them. Even if you split this array by the middle, you can never be sure that you will have the same number of elements in both halves. This is the reason why a stream created on an instance of HashSet is SIZED but not SUBSIZED.

Transforming a stream may change the SIZED and SUBSIZED characteristics of the returned stream.

  • Mapping and sorting a stream preserves the SIZED and SUBSIZED characteristics.
  • Flatmapping, filtering, and calling distinct() erases these characteristics.
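The two rules above can be observed with a sketch like the following one. The class name SizedAfterTransform and the helper describe() are hypothetical, and the reported values are those of the reference JDK implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.stream.Stream;

// Hypothetical class for this sketch.
class SizedAfterTransform {

    // Reports the SIZED and SUBSIZED bits of the spliterator of the given stream.
    static String describe(Stream<?> stream) {
        Spliterator<?> spliterator = stream.spliterator();
        return "sized=" + spliterator.hasCharacteristics(Spliterator.SIZED)
                + ", subsized=" + spliterator.hasCharacteristics(Spliterator.SUBSIZED);
    }

    public static void main(String[] args) {
        List<Integer> numbers = new ArrayList<>(List.of(3, 1, 2, 2));

        // Operations that keep one output element per input element.
        System.out.println("map:      " + describe(numbers.stream().map(n -> n * 2)));
        System.out.println("sorted:   " + describe(numbers.stream().sorted()));

        // Operations that may change the number of elements.
        System.out.println("filter:   " + describe(numbers.stream().filter(n -> n > 1)));
        System.out.println("distinct: " + describe(numbers.stream().distinct()));
        System.out.println("flatMap:  " + describe(numbers.stream().flatMap(n -> Stream.of(n, n))));
    }
}
```

The first two lines should report both characteristics as true, and the last three should report both as false.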

It is always better to have SIZED and SUBSIZED streams for parallel computations.

Examples of Sized and Subsized Streams

You can easily check that a stream created on an ArrayList is both SIZED and SUBSIZED with the following code.
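The listing is not reproduced here. One way to write this check is sketched below, with a hypothetical class name ArrayListSizedCheck and hypothetical helpers isSized() and isSubsized().

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Spliterator;
import java.util.stream.Stream;

// Hypothetical class for this sketch.
class ArrayListSizedCheck {

    static boolean isSized(Stream<?> stream) {
        return stream.spliterator().hasCharacteristics(Spliterator.SIZED);
    }

    static boolean isSubsized(Stream<?> stream) {
        return stream.spliterator().hasCharacteristics(Spliterator.SUBSIZED);
    }

    public static void main(String[] args) {
        Collection<String> strings = new ArrayList<>();

        // One stream per check: spliterator() is a terminal operation.
        System.out.println("Array list is sized? " + isSized(strings.stream()));
        System.out.println("Array list is subsized? " + isSubsized(strings.stream()));
    }
}
```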

Running the previous code prints the following.

Array list is sized? true
Array list is subsized? true

You can run the same code on an instance of HashSet.
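Here is a sketch of the same check on a HashSet. As before, the class name HashSetSizedCheck and the helper hasCharacteristic() are hypothetical.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Spliterator;
import java.util.stream.Stream;

// Hypothetical class for this sketch.
class HashSetSizedCheck {

    // Checks an arbitrary characteristic bit on the spliterator of the given stream.
    static boolean hasCharacteristic(Stream<?> stream, int characteristic) {
        return stream.spliterator().hasCharacteristics(characteristic);
    }

    public static void main(String[] args) {
        Collection<String> strings = new HashSet<>();

        System.out.println("Hash set is sized? "
                + hasCharacteristic(strings.stream(), Spliterator.SIZED));
        System.out.println("Hash set is subsized? "
                + hasCharacteristic(strings.stream(), Spliterator.SUBSIZED));
    }
}
```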

Running the previous code prints the following.

Hash set is sized? true
Hash set is subsized? false

Let us see one last example: the stream produced by the splitting of a string of characters using a pattern.
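The listing is not reproduced here. The sketch below shows one possible version, in which the class name SplitSizedCheck, the helper hasCharacteristic(), the pattern, and the sample string are all assumptions made for this illustration.

```java
import java.util.Spliterator;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical class for this sketch.
class SplitSizedCheck {

    // Checks an arbitrary characteristic bit on the spliterator of the given stream.
    static boolean hasCharacteristic(Stream<?> stream, int characteristic) {
        return stream.spliterator().hasCharacteristics(characteristic);
    }

    public static void main(String[] args) {
        // Assumed sample data: split a sentence on single spaces.
        Pattern pattern = Pattern.compile(" ");
        String line = "one two three four";

        // The number of elements is only known once the string has been analyzed.
        System.out.println("Pattern split as stream is sized? "
                + hasCharacteristic(pattern.splitAsStream(line), Spliterator.SIZED));
        System.out.println("Pattern split as stream is subsized? "
                + hasCharacteristic(pattern.splitAsStream(line), Spliterator.SUBSIZED));
    }
}
```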

Running the previous code prints the following.

Pattern split as stream is sized? false
Pattern split as stream is subsized? false

Last update: September 14, 2021

