Finding the Characteristics of a Stream
Characteristics of a Stream
The Stream API relies on a special object, an instance of the
Spliterator interface. The name of this interface comes from the fact that the role of a spliterator in the Stream API looks like the role of an iterator has in the Collection API. Moreover, because the Stream API supports parallel processing, a spliterator object also controls how a stream splits its elements among the different CPUs that handle parallelization. The name is the contraction of split and iterator.
Covering this spliterator object in details is beyond the scope of this tutorial. What you need to know is that this spliterator object carries the characteristics of a stream. These characteristics are not something you will often use, but knowing what they are will help you to write better and more efficient pipelines in certain cases.
The characteristics of a stream are the following.
|ORDERED||The order in which the elements of the stream are processed matters.|
|DISTINCT||There are no doubles in the elements processed by that stream.|
|NONNULL||There are no null elements in that stream.|
|SORTED||The elements of that stream are sorted.|
|SIZED||The number of elements this stream processes is known.|
|SUBSIZED||Splitting this stream produces two SIZED streams.|
Every stream has all these characteristics set or unset when it is created.
Remember that a stream can be created in two ways.
- You can create a stream from a source of data, and we covered several different patterns.
- Every time you call an intermediate operation on an existing stream, you create a new stream.
The characteristics of a given stream depend on the source it has been created on, or the characteristics of the stream with which it was created, and the operation that created it. If your stream is created with a source, then its characteristics depend on that source, and if you created it with another stream, then they will depend on this other stream and the type of operation you are using.
Let us present each characteristic in more details.
ORDERED streams are created with ordered sources of data. The fist example that may come to mind is any instance of the
List interface. There are others:
Pattern.splitAsStream(string) also produce ORDERED streams.
Keeping track of the order of the elements of a stream may lead to overheads for parallel streams. If you do not need this characteristic, then you can delete it by calling the
unordered() intermediate method on an existing stream. This will return a new stream without this characteristic. Why would you want to do that? Keeping a stream ORDERED may be costly in some cases, for instance when you are using parallel streams.
A SORTED stream is a stream that has been sorted. This stream can be created from a sorted source, such as a
TreeSet instance, or by a call to the
sorted() method. Knowing that a stream has already been sorted may be used by the stream implementation to avoid sorting again an already sorted stream. This optimization may not be used all the time because a SORTED stream may be sorted again with a different comparator than the one used the first time.
Collection<String> stringCollection = List.of("one", "two", "two", "three", "four", "five"); Stream<String> strings = stringCollection.stream().sorted(); Stream<String> filteredStrings = strings.filtered(s -> s.length() < 5); Stream<Integer> lengths = filteredStrings.map(String::length);
Mapping or flatmapping a SORTED stream removes this characteristic from the resulting stream.
A DISTINCT stream is a stream with no duplicates among the elements it is processing. Such a characteristic is acquired when building a stream from a
HashSet for instance, or from a call to the
distinct() intermediate method call.
The DISTINCT characteristic is kept when filtering a stream but is lost when mapping or flatmapping a stream.
Let us examine the following example.
Collection<String> stringCollection = List.of("one", "two", "two", "three", "four", "five"); Stream<String> strings = stringCollection.stream().distinct(); Stream<String> filteredStrings = strings.filtered(s -> s.length() < 5); Stream<Integer> lengths = filteredStrings.map(String::length);
stringCollection.stream()is not DISTINCT as it is build from an instance of
stringsis DISTINCT as this stream is created by a call to the
filteredStringsis still DISTINCT: removing elements from a stream cannot create duplicates.
lengthhas been mapped, so the DISTINCT characteristic is lost.
A NONNULL stream is a stream that does not contain null values. There are structures from the Collection Framework that do not accept null values, including
ArrayDeque and the concurrent structures like
ConcurrentSkipListSet, and the concurrent set returned by a call to
ConcurrentHashMap.newKeySet(). Streams created with
Pattern.splitAsStream(line) are also NONNULL streams.
As for the previous characteristics, some intermediate operations can produce a stream with different characteristics.
- Filtering or sorting a NONNULL stream returns a NONNULL stream.
distinct()on a NONNULL stream also returns a NONNULL stream.
- Mapping or flatmapping a NONNULL stream returns a stream without this characteristic.
Sized and Subsized Streams
This last characteristic is very important when you want to use parallel streams. Parallel streams are covered in more detail later in this tutorial.
A SIZED stream is a stream that knows how many elements it will process. A stream created from any instance of
Collection is such a stream because the
Collection interface has a
size() method, so getting this number is easy.
On the other hand, there are cases where you know that your stream will process a finite number of elements, but you cannot know this number unless you process the stream itself.
This is the case for streams created with the
Files.lines(path) pattern. You can get the size of the text file in bytes, but this information does not tell you how many lines this text file has. You need to analyze the file to get this information.
This is also the case for the
Pattern.splitAsStream(line) pattern. Knowing the number of characters there are in the string you are analyzing does not give any hint about how many elements this pattern will produce.
The SUBSIZED characteristic has to do with the way a stream is split when computed as a parallel stream. In a nutshell, the parallelization mechanism splits a stream in two parts and distribute the computation among the different available cores on which the CPU is executing. This splitting is implemented by the instance of the
Spliterator the stream uses. This implementation depends on the source of data you are using.
Suppose that you need to open a stream on an
ArrayList. All the data of this list is held in the internal array of your
ArrayList instance. Maybe you remember that the internal array on an
ArrayList object is a compact array because when you remove an element from this array, all the following elements are moved one cell to the left so that no hole is left.
This makes the splitting an
ArrayList straightforward. To split an instance of
ArrayList, you can just split this internal array in two parts, with the same amount of elements in both parts. This makes a stream created on an instance of
ArrayList SUBSIZED: you can tell in advance how many elements will be held in each part after splitting.
Suppose now that you need to open a stream on an instance of
HashSet stores its elements in an array, but this array is used differently than the one used by
ArrayList. In fact, more than one element can be stored in a given cell of this array. There is no problem in splitting this array, but you cannot tell in advance how many elements will be held in each part without counting them. Even if you split this array by the middle, you can never be sure that you will have the same number of elements in both halves. This is the reason why a stream created on an instance of
HashSet is SIZED but not SUBSIZED.
- Mapping and sorting a stream preserves the SIZED and SUBSIZED characteristics.
- Flatmapping, filtering, and calling
distinct()erases these characteristics.
Last update: September 14, 2021