Shenandoah Performance

I wanted to attend FOSDEM two weeks ago, but couldn’t because I was sick, in bed with fever. I was supposed to give a presentation about Shenandoah. Unfortunately, my backup Andrew Dinn also became sick that weekend, so that presentation simply didn’t happen. I want to summarize some of the interesting news that I wanted to show there, about Shenandoah’s performance.

When I talked about Shenandoah at FOSDEM 2015, I didn’t really announce any performance numbers, because we would have been embarrassed by them :-) We spent the better part of last year optimizing the heck out of it, especially the barriers in the C2 compiler, and here we are, with some great results.


Ok. This doesn’t really exist. The last SPECjvm release was SPECjvm2008. Unfortunately, SPEC doesn’t seem to care about SPECjvm anymore, which means the last Java version that runs it without any modifications is Java 7. We made some small fixes that allow it to run with Java 9 too. This invalidates compliance of the results, but they are still tremendously useful for comparison. So here it comes:


This was run on a 32-core box with 160 GB of RAM, giving the JVM 140 GB of heap. Exact same JVM and settings for G1 and Shenandoah, no special tuning parameters.

In terms of numbers, we get:

Throughput:  Shenandoah: 374 ops/m vs. G1: 393 ops/m (on average 95%, min 80%, max 140%)
Pauses:      Shenandoah: avg 41 ms, max 202 ms  vs.  G1: avg 240 ms, max 1126 ms

This means that throughput of Java apps running with Shenandoah is on average 95% that of G1; depending on the actual application, it ranges from around 80% to around 140%. Pause times on such large heaps, however, are significantly better with Shenandoah!


SPECjbb2015 measures throughput of a simulated shop system under response time constraints, or service level agreements (SLAs). It measures ‘max-jOPS’, which is the maximum throughput of the system without an SLA, and ‘critical-jOPS’, which is the throughput of the system under a restrictive SLA. Here are the numbers, G1 vs. Shenandoah, same machine and JVM settings as above:

This basically confirms the results that we got from SPECjvm2008: general throughput in the area of 95% of G1, but much better pause times, reflected in a much higher critical-jOPS score.

Other exciting news is that Shenandoah is now stable enough that we want to encourage everybody who’s interested to try it out. The nice folks at Adopt-OpenJDK have set up a nightly build from where you can grab binaries (Shenandoah JDK8 or Shenandoah JDK9). Enjoy! (And please report back if you encounter any problems!)

Shenandoah OpenJDK project

It’s been a long while since I posted anything. In the meantime we’ve made lots and lots of progress with Shenandoah. The most important news of the week is that Shenandoah has now been accepted as an official OpenJDK project. We’ve got a website, a mailing list, Mercurial repositories, and a wiki page. The code hasn’t been moved yet; I am in the process of doing that. It will first land in our JDK9 forest (as we’re doing our main development there) and will be backported to JDK8 in a while.

Just to give you a summary of what has happened in the 1.5 years since I last posted (bad me): we’ve implemented everything we wanted (runtime, interpreter, C1 and C2 barriers, weak-reference support, JNI critical regions support, System.gc() support, and lots of other smallish things); it’s fairly stable (still expect bugs here and there, but it should run quite a lot of your code); performance looks good (on average ~90% of what G1 does, some benchmarks as bad as ~70% relative to G1, some beating it at ~150%); and pause times on largish heaps are significantly better than with G1 (but still not quite where we want them to be).

If you’re interested, subscribe to the new list, watch out for the code to land (or grab it from IcedTea in the meantime), and give it a try! Any feedback is welcome, as always! :-)

The Shenandoah Code

Last weekend I talked about the Shenandoah GC at FOSDEM 2014. I announced the availability of the Shenandoah source code and project pages. Until Shenandoah becomes a proper OpenJDK project, it will be hosted on IcedTea servers. We currently have:

We also filed a JEP for Shenandoah, here you can find the announcement and discussion on hotspot-gc-dev.

If you are interested in the slides that I presented at FOSDEM, you can find them here.

Implementation-wise we’re making good progress. Concurrent evacuation is working well and support for the C1 compiler is almost done.

Shenandoah GC: Brooks pointers

I haven’t written about Shenandoah in a while. We first needed to clear up some issues around it, which is done now. The project is not dead, quite the contrary: we’re working feverishly. Right now, we are about to get concurrent evacuation to work :-)

Last time I wrote about concurrent marking. Before I carry on, I want to introduce a new concept: Brooks pointers. The idea is that each object on the heap has one additional reference field. This field either points to the object itself, or, as soon as the object gets copied to a new location, to that new location. This will enable us to evacuate objects concurrently with mutator threads (how exactly this is done is a topic for a separate post).

One problem, of course, is that as soon as we have two copies of an object (one in from-space, one in to-space), we need to be careful to maintain consistency. This means that any changes to objects must happen in the to-space copy. This is achieved by putting barriers in various places (everything that writes to objects, or runtime code that uses objects) that resolve the object before writing into it or using it. If we don’t do that, we might end up with some threads using the old copy and some threads using the new copy, which is, obviously, a problem. The barrier simply reads the Brooks pointer field and returns the forwarded object to any code that uses it. In terms of machine code instructions, this means one additional read instruction for each write operation into the heap (and some read operations by the runtime). In fact, we currently need two instructions, the reason for which I’ll explain later.
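The barrier logic can be sketched in plain Java. This is a toy single-threaded model with made-up names (`Obj`, `resolve`, `evacuate`) just to illustrate the mechanism; the real barriers are emitted as machine code by the JIT compilers, and installing the forwarding pointer involves atomic operations.

```java
// Toy model of Brooks forwarding pointers. Names and structure are
// illustrative only, not Shenandoah's actual implementation.
final class ObjModel {
    static final class Obj {
        Obj forwardee = this; // Brooks pointer: self, or the to-space copy
        int field;
    }

    // Read barrier: always resolve to the current copy first.
    static Obj resolve(Obj o) {
        return o.forwardee; // one extra read per access
    }

    // Write barrier: writes must land in the to-space copy.
    static void writeField(Obj o, int value) {
        resolve(o).field = value;
    }

    // Evacuation: copy the object, then redirect the old copy's pointer.
    static Obj evacuate(Obj o) {
        Obj copy = new Obj();
        copy.field = o.field;
        o.forwardee = copy; // from now on, all resolves see the new copy
        return copy;
    }
}
```

Once `forwardee` is flipped, a mutator still holding the from-space reference writes through to the new copy, which is exactly the consistency property described above.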

Eventually, when evacuation is done, we need to somehow update all references to objects to point to the to-space locations. We do that by piggy-backing on the concurrent marking phase. When we traverse the heap for marking, we see all live objects and references, and whenever we visit an object, we update all its object references to point to the new locations of the referents.
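In the same spirit, the reference-updating step that piggy-backs on marking can be sketched like this (again with hypothetical names; in reality this happens inside the marking closure):

```java
// Sketch: when the marker visits an object, rewrite each of its reference
// fields to the referent's forwarded (to-space) location.
final class UpdateRefs {
    static final class Obj {
        Obj forwardee = this; // Brooks pointer: self, or the to-space copy
        Obj[] refs = new Obj[0];
    }

    // Called on every object the marking traversal visits.
    static void updateReferences(Obj visited) {
        for (int i = 0; i < visited.refs.length; i++) {
            Obj r = visited.refs[i];
            if (r != null) {
                visited.refs[i] = r.forwardee; // point at the current location
            }
        }
    }
}
```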

There are two tradeoffs with using Brooks pointers: we need more heap space (ideally, one word per object), and we need more instructions to read and write objects.

Next time, I’ll start explaining how concurrent evacuation works.

Because there have been many requests: yes, Shenandoah will be open source. Our plan is to propose a JEP as soon as we can, and make it an OpenJDK project if possible.

Shenandoah GC: Concurrent parallel marking

Yesterday, I gave a quick overview of the Shenandoah GC. (And Christine also started blogging about Shenandoah today, giving even more background information.) I mentioned that Shenandoah collects garbage in two main phases: marking and evacuation. All of that needs to happen concurrently with the mutator threads.

Let’s look at the marking phase first. We kick it off by briefly stopping the world in order to scan the GC roots (Java stack references and such). Then we resume the world and start traversing the heap, going from the objects we just marked, concurrently with the mutator.

The problem when marking objects on the heap while running concurrently with the application/mutator threads is, of course, that the structure of the heap changes while we’re traversing it. There are a few solutions to this problem, and we did not invent anything new here. We decided to go with the so-called snapshot-at-the-beginning (SATB) approach. The idea is that we want to mark as live all the objects that are live at the beginning of the marking phase (hence the name). We probably mark a few more objects as live (those that become live during marking), but we don’t care about those (they don’t hurt). The reason we want to do this is the following: suppose we have an object A that points to B. During marking, A becomes unreachable. But a new object C comes to life and receives a reference to B. Now, it is possible that the concurrent marking misses B, because it can’t reach it through A anymore and doesn’t reach it through C yet. We’d basically treat B as garbage even though it isn’t.

How does SATB work? During marking, we employ a write barrier, that is, an additional hook that gets called whenever mutator threads write reference fields in objects (not to be confused with memory fences, which are also sometimes called barriers…). This write barrier logs the original contents of modified reference fields into a linked list. After the concurrent marking phase is done, we stop the world and process this SATB list by marking all objects in it and traversing their references where necessary. When this is done, we resume the world, and we can be sure that all objects that were live at the beginning of the marking phase are marked as live. And since we’re also counting liveness during marking, we then know how much live data each region contains.
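A toy version of that barrier might look like this. All names here are made up for illustration, and the “linked list” is simplified to a single shared deque (the real thing uses per-thread buffers and runs the write barrier only while marking is active):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Sketch of an SATB write barrier: before a reference field is overwritten,
// the *old* value is logged so objects reachable at the start of marking
// still get marked. Illustrative names only.
final class Satb {
    static final class Node {
        Node ref; // one reference field, for simplicity
    }

    static final Deque<Node> satbQueue = new ArrayDeque<>();
    static boolean markingActive = false;

    // Barrier invoked on every reference-field store while marking runs.
    static void writeRef(Node holder, Node newValue) {
        if (markingActive && holder.ref != null) {
            satbQueue.push(holder.ref); // log the previous referent
        }
        holder.ref = newValue;
    }

    // Final (stop-the-world) step: drain the log, marking and traversing.
    static Set<Node> drainAndMark(Set<Node> marked) {
        while (!satbQueue.isEmpty()) {
            Node n = satbQueue.pop();
            if (marked.add(n) && n.ref != null) {
                satbQueue.push(n.ref); // follow references where necessary
            }
        }
        return marked;
    }
}
```

The key point is visible in `writeRef`: even if the mutator severs the last visible path to an object mid-marking (the A/B/C scenario above), the old referent survives in the SATB log.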

I also said that we’d do the marking work using parallel GC threads. How does that work? Instead of having one thread traversing the graph of live objects in the heap, we use many threads. Each thread maintains a work-stealing queue. Each item in the queue is a heap object. Initially, the queues get filled with objects referenced by the GC roots. Each GC thread then pops an object from its queue, marks it, increases the containing region’s liveness counter, then checks all reference fields of that object and pushes each referenced object onto the thread’s queue. This happens in a loop. Eventually a thread will run out of objects. When that happens, the GC work thread attempts to steal from another thread’s queue. When all threads are done, i.e. no thread can steal from any other thread anymore, we’re done.
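The loop above can be sketched as a small Java program. This is a deliberately simplified model with hypothetical names: real work-stealing queues are lock-free and termination detection is more clever, whereas here a shared pending counter and `ConcurrentLinkedDeque`s stand in for both.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Toy parallel marking with work stealing. Illustrative only.
final class ParallelMark {
    static final class HeapObj {
        final AtomicBoolean marked = new AtomicBoolean(false);
        final AtomicLong regionLive; // containing region's liveness counter
        final List<HeapObj> refs = new ArrayList<>();
        HeapObj(AtomicLong regionLive) { this.regionLive = regionLive; }
    }

    static void mark(List<HeapObj> roots, int nThreads) {
        List<ConcurrentLinkedDeque<HeapObj>> queues = new ArrayList<>();
        for (int i = 0; i < nThreads; i++) queues.add(new ConcurrentLinkedDeque<>());
        AtomicLong pending = new AtomicLong(0); // queued-but-unfinished items
        for (int i = 0; i < roots.size(); i++) {
            pending.incrementAndGet();
            queues.get(i % nThreads).push(roots.get(i));
        }
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int me = t;
            workers[t] = new Thread(() -> {
                while (pending.get() > 0) {
                    HeapObj o = queues.get(me).poll();
                    for (int j = 0; o == null && j < nThreads; j++)
                        if (j != me) o = queues.get(j).pollLast(); // steal
                    if (o == null) { Thread.onSpinWait(); continue; }
                    if (o.marked.compareAndSet(false, true)) {
                        o.regionLive.incrementAndGet(); // count live data
                        for (HeapObj r : o.refs) {
                            pending.incrementAndGet();
                            queues.get(me).push(r);
                        }
                    }
                    pending.decrementAndGet();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }
}
```

The `compareAndSet` on the mark bit is what keeps two threads from double-counting an object that is reachable through several paths.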

After the marking phase is done, we know for each region how much live data (and therefore garbage) it contains, and all reachable objects are marked as such. All of the above is already implemented in Shenandoah and works well. The next step is the actual evacuation of regions that have accumulated enough garbage. This is a topic for a follow-up post. Or more likely, several follow-up posts.


Shenandoah GC: An overview

Let me give you a quick overview how the Shenandoah GC works.

First of all, let’s look at the heap in Shenandoah. Shenandoah’s heap is basically (like all heaps) a big chunk of memory, and in Shenandoah it is divided into many regions. Consider a heap of 1 GB and a region size of 1 MB; then you would get 1024 regions. This is different from the classic mark-and-sweep collectors, which have no regions at all, and from the classic copying collectors, which usually divide the heap into two, or only a few, regions. The reason for doing this is that we want to selectively evacuate those regions with the most garbage with higher priority (you know this idea from Hotspot’s G1 collector). Additionally, it makes multithreading a little bit easier (you can have multiple threads working on separate regions without stepping on each other’s toes… well, in theory ;-) ).
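The “most garbage first” idea can be sketched in a few lines. The `Region` class and `collectionSet` helper are made-up names for illustration; the real selection heuristics in Shenandoah and G1 weigh more factors than a single threshold.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the region model and garbage-first selection. Illustrative only.
final class Regions {
    static final class Region {
        final int id;
        final long capacity;
        final long liveBytes; // from the marking phase's liveness counters
        Region(int id, long capacity, long liveBytes) {
            this.id = id; this.capacity = capacity; this.liveBytes = liveBytes;
        }
        long garbage() { return capacity - liveBytes; }
    }

    // Pick evacuation candidates in order of decreasing garbage.
    static List<Region> collectionSet(List<Region> all, long minGarbage) {
        List<Region> cset = new ArrayList<>();
        for (Region r : all)
            if (r.garbage() >= minGarbage) cset.add(r);
        cset.sort(Comparator.comparingLong(Region::garbage).reversed());
        return cset;
    }
}
```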

Now how does it work, you ask? Shenandoah is a mark-copy collector. That already implies the two important phases of Shenandoah’s collection cycle. In the first phase, we mark all the live objects in the heap. Considering the ‘copy’ in its classification, this may seem unintuitive, because normally copying collectors don’t need a marking phase; that’s their point (they collect garbage by copying live objects from one region to another, and leave the garbage behind). The main reason for having a marking phase is that we need to calculate the amount of live data in each region. This is most naturally done by running a marking pass over the heap. Starting from the GC roots (stack maps, internal data structures, etc.) we follow the object graph, mark each object that we encounter, and update the containing region’s live counter accordingly.

The second important phase is the evacuation phase. Here we first select the regions that we want to evacuate (from-regions) and corresponding regions that the objects are copied to (to-regions). This is what we need the liveness information for. We then simply scan the from-regions and copy all objects that are marked to appropriate to-regions (i.e. ones with enough empty space). (Note that this is different from classic copying collectors, which traverse the heap like we do in the marking phase. Linear scanning of the heap is much more efficient because of more predictable memory access, and as an additional bonus it preserves the allocation order, which is often a good thing.) After evacuation, the from-regions contain only garbage objects and we can free them up for future allocations.
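The linear scan can be modeled like so. `Obj` and `ToRegion` are hypothetical stand-ins (real objects live at heap addresses, and copying means memcpy plus installing a forwarding pointer, not adding to a list):

```java
import java.util.ArrayList;
import java.util.List;

// Toy linear-scan evacuation: walk a from-region in allocation order and
// copy every marked object into a to-region with room. Illustrative only.
final class Evacuation {
    static final class Obj {
        final int size;
        final boolean marked;
        Obj(int size, boolean marked) { this.size = size; this.marked = marked; }
    }

    static final class ToRegion {
        final int capacity;
        int used;
        final List<Obj> contents = new ArrayList<>();
        ToRegion(int capacity) { this.capacity = capacity; }
    }

    // Linear scan: predictable memory access, preserves allocation order.
    static void evacuate(List<Obj> fromRegion, List<ToRegion> toRegions) {
        for (Obj o : fromRegion) {
            if (!o.marked) continue; // garbage stays behind
            for (ToRegion t : toRegions) {
                if (t.used + o.size <= t.capacity) { // enough empty space?
                    t.contents.add(o);
                    t.used += o.size;
                    break;
                }
            }
        }
    }
}
```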

So far it all sounds very simple and straightforward, and in fact it is not even new. What we do, however (or plan to do), is to make all of the marking and evacuation concurrent with mutator threads (i.e. mark and evacuate regions while application threads are running and messing with the heap), and parallel (i.e. do marking and evacuation using potentially many GC threads). And this is where the complication begins. I will go into the details of the issues and how Shenandoah addresses them in the next posts of this little series. Stay tuned!

Shenandoah: A pauseless GC for OpenJDK

It’s been quite a while since I posted anything. The reason is that I was busy with a very interesting new project in the last weeks: Shenandoah, a new GC for OpenJDK. Christine, a colleague of mine on Red Hat’s Java team, set out to build a next-generation garbage collector for OpenJDK. I jumped into this project a while ago, and I’m very excited about it.

I guess I will not go into technical details in this post, but instead outline a few of the problems of existing garbage collectors, and Shenandoah’s goals and some high-level ideas for how to achieve them.

  • Pause times: this problem has existed since the invention of GCs, and hasn’t been fully solved since. The good old stop-the-world garbage collection is probably still on every Java programmer’s mind. Nowadays we have more modern garbage collectors, and most of them drastically reduce pause times, but nevertheless, pause times still exist, be it for evacuation of regions, for book-keeping of references, etc. Shenandoah aims to reduce pause times even more. Realistically speaking, there will still be pause times, but we aim to minimize them as much as possible. Our goal is to reduce them to less than 10 ms per GC cycle.
  • Scalability. The larger the heap gets, and the more processors/threads are working, the more garbage is produced. This means more work for the garbage collector to find and free all this garbage. In other words, the garbage collector needs to be able to keep up with ever-growing memory demands and processor capability. We are shooting for TB-sized heaps.

The idea now is that, in order to reduce pause times, the GC needs to do as much work as possible concurrently with mutator threads (mutators are all threads that modify the heap, i.e. the actual application running). In order to be scalable, the GC needs to be able to utilize that ever-growing processing power by doing as much work as possible using multiple threads in parallel. (Notice that the terminology is a bit confusing here. In the Java GC world, ‘concurrent’ often means ‘concurrently with mutators’ while ‘parallel’ means ‘using parallel worker threads to do GC work’.)

While that sounds conceptually simple, I can tell you it’s not. Boy, are there a lot of hairy issues to resolve here. I think I will explain the ideas one blog post at a time here. Stay tuned!

