Have you ever traveled by public transport in a busy city? Trains and buses usually have a certain number of seats that everyone is trying to squeeze into. 😀 Rarely do two people squeeze into the same seat. You can only ever have one person in a seat at a time. And seats are only large enough for a single sitter at a time.
As you've seen, concurrent applications can go wrong in unexpected ways when your threads share mutable data structures. In other words, you don't want two bottoms on one seat. 😉 To avoid this, you can try using final variables or create new variables instead of updating shared ones. However, there are times when your problem requires a data structure that is designed to be both mutable and shared.
Maps are a typical example of this. Let's check them out!
Working With Mutable Maps
A map is like a massive index of values, which are super fast to look up. Rather than having to iterate through every item in a map, you can find things instantly by providing a key that they are optimized to lookup. It's like knowing where your friend always sits and walking straight to her.
Why do we use maps? Can't we use an array and search through it?
The answer is time complexity. Time complexity is a metric that you use to understand the worst-case number of steps involved in running an algorithm. For instance, what is involved with looking up a value from an array? It's is similar to how the famous Gang of Four (GoF) used document design patterns in their engineering (you really should read their book!). Worst case, to find something in an array, you'd have to visit every value and ask, "Is this the one I'm looking for?" So for an array of any size which we'll call n, in the worst case, you'd have to perform n operations. We call this O(n), or order n.
Maps are incredibly efficient as they typically involve implementations, which makes them appear to be O(1). That is, for any value in the map, you can find it in one operation. You provide a key and have a value returned without having to search the whole map.
Imagine you're building a system where multiple threads need to share values or update a sizeable HashMap implementation - such as a mapping from planet names to temperatures. You might use a planet name, such as the Kepler Id as a key, and want to make sure you can avoid duplicates counting across threads. Remember the one bottom per seat rule; there can be only one. 😛
If you have hundreds of keys, it can quickly become slow and wasteful to copy your whole map into a new one, just because you've updated a single field. Over the years, Java has attempted to handle this by providing various map implementations that can work safely in a multithreaded application:
Thread Safe Map | Description |
This has been in the JDK for generations. Today, it implements the Map API and synchronizes all access its keys, values, and operations. As a result, it suffers from overlocking. This means that even different threads updating or reading different keys will be blocked on each other and can slow down your application. It's like the driver announcing that everyone should stand completely still while you sit down - or check to find an available seat. Its Javadoc states:
| |
This method, provided by Java's collections framework, takes any ordinary map object and make it thread-safe. While this seems great, it simply wraps the object in another one which synchronizes all access in a similar fashion to the HashTable implementation. This too suffers from overlocking and can slow your application down by only allowing one thread at a time to collaborate with it. | |
The ConcurrentHashMap provides a fully thread-safe map API without blocking all other threads wanting to use it. It is the current recommended solution when you require a thread-safe map. We'll explore this further and consider the consequence of using one. |
Can't I just use locks and synchronization mechanisms to make the operations I care about thread-safe?
Yes, you could manually synchronize using locks, wait/notify, or some other mechanism. One of the secrets to successful development is code-reuse. If you can get the performance and behavior you're after without writing a line of code, that's time you could better spend adding value to your client through other software.
ConcurrentHashMaps
are designed to give you predictable behavior and performance. They're built by engineers who are accountable for the whole JDKs performance. Unless you need to support some complex requirements, then it's better to stick to the prescribed and tested.
Working With ConcurrentHashMap
When two people reach for the same seat, they usually coordinate about just that seat. Other occupants don't care what's going on in a seat on which they aren't seated. ConcurrentHashMaps still use a locking scheme, but it's more comparable to two passengers trying to figure out who gets to sit on the one seat they have both approached.
In the case of our ConcurrentHashMap, the lock only applies to threads updating it and is only applied across a small group of keys called a bucket. This group includes the specific key being updated or added. It's called a striped lock.
A ConcurrentHashMap guarantee the following behavior:
A striped lock ensures that specific keys have synchronized updates across threads without locking the rest of the map.
Threads reading from the map are guaranteed to continue to see the map exactly as it was when they started reading from it. This means that data in that map won't change from under their feet.
For instance, imagine that my key is updated from my value to your value. If one thread started to read from map.get("my key")
before it was changed, it is guaranteed to get back a value of my value even though it might have changed.
Therefore, multiple threads are able to read and update a ConcurrentHashMap at the same time with efficiency and a guaranteed behavior. It allows you to understand how your program will behave and ask questions such as whether you need a more aggressive synchronization.
How do I use a ConcurrentHashMap?
ConcurrentHashMap is just another map implementation; therefore, you can use it in exactly the same way you'd use a map. Without doing anything extra, it is suddenly safe to use across your threads. Use new
to create a new one, and then use put()
and get()
methods to access it.
We're going to create a ConcurrentHashMapand, make two threads update, and read from it.
Step 1: Start up JShell:
jshell
Step 2: Create a ConcurrentHashMap with ConcurrentHashMap map = new ConcurrentHashMap()
:
jshell> ConcurrentHashMap map = new ConcurrentHashMap()
map ==> {}
Step 3: Create a thread that counts to a large number and updates a key with the value "last_number":
Thread mutator1 = new Thread( () -> IntStream.range(0, 1000000).forEach( (n) -> map.put("last number", n) ) )
Step 4: Now create an accessor thread to print some values from the map:
Thread accessor = new Thread( () -> IntStream.range(0, 2000).forEach( (n) -> m.get("last number", n) ) )
This loops 2000 times and shows the value of the "last number" key.
Now query m.get("last_number")
.
Let's do it together with a few more threads!
Comparing Fully Synchronized Maps vs ConcurrentHashMaps
For you to understand the full power of ConcurrentHashMap, let's implement both and compare the results. Imagine that the files with planet temperatures have some duplicate temperatures caused by dodgy space equipment. You might want to only sample changes to a planet's temperature. Let's update the planet analyzer to solve the same problem with either Collections.synchronized()
or a ConcurrentHashMap()
.
Let's use this map to only sample temperatures for a particular planet if it has changed and see how you can do this using a new stream filter when processing the file.
As you saw, we changed the code to remember the last updated planet temperature in a thread-safe map. By starting the application with different arguments, we were able to benchmark and compare ConcurrentHashMaps against a synchronized HashMap. The ConcurrentHashMap was nearly 10 operations per second faster.
Let's break down the main steps involved in using our maps:
Step 1: Declare a map in the analyzer using the map interface as opposed to any particular implementation. That is, don't declare it as ConcurrentHashMap
or any other Map
implementation.
private Map<String, Double> planetsSeen;
Step 2: Create a constructor that sets the planetsSeen
map to either a ConcurrentHashMap
or a HashMap
synchronized using Collections.sychronized()
.
public ThreadSafeMapFilteringFileAnalyzer(DedupingScheme dedupingScheme) {
if (DedupingScheme.CONCURRENT_HASHMAP.equals(dedupingScheme)){
planetsSeen = new ConcurrentHashMap<>();
} else if (DedupingScheme.SYNCHRONIZED_HASHMAP.equals(dedupingScheme)) {
planetsSeen = Collections.synchronizedMap(new HashMap<>());
}
}
Line 2 to 3: Assign a
ConcurrentHashMap
toplanetsSeen
ifCONCURRENT_HASHMAP
is passed to the constructor.Line 4 to 5: Assign a
HashMap
, synchronized usingCollections.sychronized()
toplanetsSee
whenSYCHRONIZED_HASHMAP
is passed to the constructor.
Step 3: Add a new filter to the sequential stream which processes each file, row by row.
DoubleStream streamOfDoubles = Files.lines(path).
// Separate the columns using a comma
map(line -> line.split(",")).
// Remove rows with less than three columns
filter(row -> row.length >= 3).
// Remove planet/temperature pairs we've not sampled
filter(this::planetHasNotBeenSampled).
...
As you can see this points to the planetHasNotBeenSampled
method in the current class.
Step 4: Then implement this method and use the map:
private boolean planetHasNotBeenSampled(String[] row) {
// Get planet details for this row
String planet = row[ KeplerCsvFields.KEPOI_NAME_COLUMN ];
Double temperature =
Double.parseDouble(row[ KeplerCsvFields.EQUILIBRIUM_TEMPERATURE_COLUMN ]);
// Check our map for this planet
Double lastValue = planetsSeen.get(planet);
// Return early and skip a value if it hasn't changed
if (null != lastValue && lastValue == temperature) {
return false;
}
// Sample this planet and temperature + update our map
planetsSeen.put(planet, temperature);
return true;
}
Lines 2 to 5: Extract the values from the string array passed to the method from the stream.
Line 7: Fetch the last object associated with this key from the
Map
.Map
implementations return anull
where the key does not exist. We'll use this in a bit.Line 10: Test if the value returned by the map was a null. This should mean that it's the first time processing a row with a value for this particular planet. If you'd seen it before, then check whether the temperature for that planet has changed.
Line 11: If you have seen it previously, and it hasn't changed since the last time, only then only you filter it from the stream. Do this by returning
false
from the filter.Line 15: Code reaching this point stores the temperature for this row into the
Map
against the planet mentioned in this row.
Other than this, the code is pretty much the same as our CompletableFuture
version.
Running the Benchmarks!
1) Check out the p2-c3-concurrent-hash-maps branch:
git checkout p2-c3-concurrent-hash-maps2) Run the benchmarks using the
runBenchmarks
Gradle task:./gradlew runBenchmarks
How did they perform? 😎 ConcurrentHashMaps are another piece of the arsenal for your tool belt as a thread-safe concurrent developer.
Let's Recap!
Java has traditionally provided Hashtables and
Collections.synchronizedMaps()
as a means of creating a thread-safe map. This is achieved by synchronizing against any access to the map.Fully synchronizing a map can provide an unnecessary performance penalty when only specific keys are modified.
ConcurrentHashMaps provide thread safety through striped locking. This means that writing is synchronized, but only across a specific set of keys.
ConcurrentHashMaps being read from represent the state of the entire map as of its last successful mutation. If you start reading from a map, you are guaranteed to have a consistent representation that does not change.
In the next chapter, you'll learn how to modify arrays on multiple threads!