Tuesday, 5 January 2021

Performance Improvements for TechEmpower Round 20

OfficeFloor is turning its attention to improving performance now that its underlying concepts have been defined.

OfficeFloor is primarily Inversion of Coupling Control to avoid the OO Matrix.  This goes beyond dependency injection to include function/continuation injection and thread injection.  In our writings, we've come to understand the fundamental model.

However, while some frameworks claim functionality is more important than performance, we at OfficeFloor believe that's just incomplete solutions leaking cumbersome workarounds.  Take, for example, aspect-oriented programming in Spring, where the implementation is layers of reflective calls that hamper performance.  In OfficeFloor, everything is executed in discrete functions, or more correctly first-class procedures.  This results in the application being a list of functions executed one after another.  An aspect is just a function inserted into the list.  Therefore, an aspect in OfficeFloor incurs no additional overhead.  But yes, we admit, there is overhead in managing the execution of the list of functions.
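To make this concrete, here is a minimal sketch of an application as a list of functions, with an "aspect" inserted as just another function.  The class and names are illustrative, not OfficeFloor's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "application as a list of functions" model (illustrative names).
public class ProcedureList {

    // Executes the procedure list and returns a trace of the invocation order.
    static String execute() {
        List<Runnable> procedures = new ArrayList<>();
        StringBuilder trace = new StringBuilder();

        procedures.add(() -> trace.append("handleRequest;"));
        procedures.add(() -> trace.append("sendResponse;"));

        // An "aspect" (e.g. logging) is just another function inserted into
        // the list, costing one extra invocation rather than layers of
        // reflective proxies.
        procedures.add(1, () -> trace.append("logAspect;"));

        // The application is then simply the list executed in order.
        for (Runnable procedure : procedures) {
            procedure.run();
        }
        return trace.toString();
    }

    public static void main(String[] args) {
        System.out.println(execute()); // handleRequest;logAspect;sendResponse;
    }
}
```

The only cost of the aspect here is the single extra call in the loop, which is the point of the model above.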

So now that the inversion of coupling control solution is understood, the focus has been on optimisations.  And what better way to look at improving performance than competing against other frameworks?  Hence, we very much appreciate the work of the TechEmpower Benchmarks.

However, truth be told, OfficeFloor's focus has been both on improving optimisations and on increasing third-party integrations to inject into OfficeFloor applications.  So from round 19 to round 20 of the TechEmpower Benchmarks, OfficeFloor has added the ability for asynchronous programming.  This has opened the door to using technologies such as Reactor and in particular R2DBC for asynchronous database interaction.

Asynchronous frameworks seem to dominate the top of the TechEmpower Benchmarks.  Now with Project Loom claiming negligible threading overheads, this may change in future years.  However, we doubt a model of "negligible threading contention" will outperform the no threading contention of the single-threaded asynchronous frameworks.  But this is just conjecture without real numbers to back it up.  So let's focus on the real numbers of round 20.

The most interesting inclusions by OfficeFloor for round 20 are:

  • officefloor-raw using R2DBC to provide implementations for the database tests
  • officefloor-async providing an R2DBC solution
  • officefloor-raw using thread affinity to increase throughput
  • ByteBuffer pooling improvements
  • using parallel GC rather than Java's default garbage collection

OfficeFloor Raw using R2DBC

The officefloor-raw entry is different from the other OfficeFloor entries, as its focus is purely on the HTTP server implementation.  It does not include the OfficeFloor inversion of coupling control framework.  It just provides a custom test implementation using OfficeFloor's HTTP server component.  Hence it is classed as a platform rather than a framework in the tests.  This, however, allows seeing the performance overheads of OfficeFloor's inversion of coupling control framework.

Given we are looking to see the overheads, we provided officefloor-raw implementations using R2DBC for the database tests.  As officefloor-raw does not incur the inversion of coupling control overheads, it is by nature going to be a lot faster than the rest of the OfficeFloor entries.  However, the closer we can get the other entries to officefloor-raw, the more we have reduced OfficeFloor's inversion of coupling control overheads.

Looking at the Java competitors above OfficeFloor, we will also consider supporting Vert.x's PgClient for round 21.

OfficeFloor Async

As mentioned, the major change from round 19 to round 20 for OfficeFloor is the support of asynchronous programming.  The officefloor-async entry uses this functionality with R2DBC.
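A hedged sketch of this asynchronous style follows, using a CompletableFuture as a stand-in for the non-blocking database call (the real entry uses Reactor/R2DBC; `findWorld`, the executor, and the response wiring here are illustrative):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch of asynchronous servicing (not the actual R2DBC code).
public class AsyncSketch {

    // Stand-in for the database I/O; daemon thread so the demo JVM can exit.
    private static final ExecutorService dbExecutor =
            Executors.newSingleThreadExecutor(runnable -> {
                Thread thread = new Thread(runnable, "db-io");
                thread.setDaemon(true);
                return thread;
            });

    // Stand-in for a non-blocking R2DBC query: returns immediately and
    // completes later on the database executor, so the servicing thread is
    // never held waiting on the database.
    static CompletableFuture<String> findWorld(int id) {
        return CompletableFuture.supplyAsync(() -> "world-" + id, dbExecutor);
    }

    public static void main(String[] args) {
        String response = findWorld(42)
                .thenApply(world -> "response:" + world) // continuation runs once data arrives
                .join(); // join only for this demo; a server writes the response in the callback
        System.out.println(response); // response:world-42
    }
}
```

The key property for the benchmarks is that the servicing thread registers a continuation and moves on, rather than blocking on the database.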

Now OfficeFloor is a multi-threaded framework.  The asynchronous functionality, therefore, has thread synchronising overheads that single-threaded asynchronous frameworks do not suffer.  (Well, the developer suffers instead through a more complex programming model.)  However, the performance of single-threaded asynchronous frameworks is optimal due to little contention.

Hence, the officefloor-async entry shines in the update test.  The update test is heavily database dependent and requires reducing contention between connections.  Pipelining database queries over a lower number of connections decreases the number of parallel updates.  This subsequently reduces contention in the database and improves overall throughput.

The officefloor-async entry outperforms the other OfficeFloor framework entries in the update test because the database contention is significantly more important to throughput than the threading contention.

Thread Affinity

OfficeFloor supports thread affinity in its inversion of coupling control framework to pin servicing of requests to a single CPU.  This ensures optimal use of CPU caches, as moving threads between CPUs causes overheads in migrating data to the other CPU's caches.  By using thread affinity, the operating system's thread scheduler is informed to only schedule the thread on a particular CPU.
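Plain Java cannot set CPU affinity itself (the actual pinning requires operating system support, typically via a native library), but the per-core partitioning idea can be sketched with one single-threaded executor per core.  All names below are illustrative, not OfficeFloor's implementation:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative per-core partitioning: each connection always lands on the
// same listener thread, so its data stays hot in that core's caches.  Actual
// CPU pinning of each thread needs a native affinity call, not shown here.
public class PerCoreListeners {

    // Fixed routing: the same connection is always serviced by the same listener.
    static String assignListener(int connectionId, int cores) {
        return "listener-" + (connectionId % cores);
    }

    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();

        // One single-threaded executor per core, named so the routing is visible.
        ExecutorService[] listeners = new ExecutorService[cores];
        for (int cpu = 0; cpu < cores; cpu++) {
            String name = "listener-" + cpu;
            listeners[cpu] = Executors.newSingleThreadExecutor(
                    runnable -> new Thread(runnable, name));
        }

        // Route a connection to its fixed listener, avoiding the
        // "one big congested server" contention described above.
        int connectionId = 7;
        listeners[connectionId % cores].execute(() -> System.out.println(
                Thread.currentThread().getName() + " services connection " + connectionId));

        for (ExecutorService listener : listeners) {
            listener.shutdown();
            listener.awaitTermination(1, TimeUnit.SECONDS);
        }
    }
}
```

With each listener additionally pinned to its CPU, this is effectively the "32 separate servers" arrangement used on Citrine.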

For round 20, we have used thread affinity for the officefloor-raw entry to pin each socket listener thread to a particular CPU.  This effectively allows running 32 separate servers on Citrine, rather than one big congested server grabbing at the work.  As contention is always overhead, using thread affinity to run isolated without contention (ok, bus aside) has provided increased throughput.

ByteBuffer Pooling

The ByteBuffers used to read and write from the socket have seen improvements in their pooling.  Creating and destroying direct ByteBuffers is expensive, hence it is much better to pool them for improved performance.  Furthermore, as direct ByteBuffers are memory separate from the heap, they allow OfficeFloor to effectively manage memory (rather than relying on the "dark magic" of garbage collection).

OfficeFloor has two levels of pooling the ByteBuffers:
  • single core pool
  • thread local pool

The single core pool is the shared pool of ByteBuffers all threads can draw from.  Previously, access to this pool was via synchronized methods.  This has been enhanced to use a ConcurrentLinkedDeque to reduce thread synchronising overheads.  This has provided improved throughput in accessing the core pool.
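A minimal sketch of such a lock-free pool follows, assuming a fixed buffer size and capacity (illustrative only, not OfficeFloor's actual implementation):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedDeque;

// Illustrative lock-free core pool: acquire/release use non-blocking deque
// operations instead of synchronized methods.
public class CorePool {

    private final ConcurrentLinkedDeque<ByteBuffer> pool = new ConcurrentLinkedDeque<>();
    private final int capacity;

    public CorePool(int capacity) {
        this.capacity = capacity;
    }

    // poll() is non-blocking; fall back to allocating when the pool is empty.
    public ByteBuffer acquire() {
        ByteBuffer buffer = pool.poll();
        return (buffer != null) ? buffer : ByteBuffer.allocateDirect(4096);
    }

    // Return the buffer for reuse, dropping it if the pool is full.
    public void release(ByteBuffer buffer) {
        buffer.clear();
        if (pool.size() < capacity) { // size() is O(n); a real pool would track a counter
            pool.offer(buffer);
        }
    }

    public static void main(String[] args) {
        CorePool pool = new CorePool(1024);
        ByteBuffer buffer = pool.acquire();
        pool.release(buffer);
        System.out.println(pool.acquire() == buffer); // true: the direct buffer was reused
    }
}
```

The non-blocking poll/offer pair is what removes the synchronized bottleneck described above.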

The thread local pools are an individual pool per thread.  As the pool is specific to the thread, there is no thread contention in obtaining/returning ByteBuffers from/to the pool.  This further improves performance of threads heavily involved in reading/writing to the sockets.  However, it can cause out of memory problems if many threads are involved in reading/writing from sockets (which can be the case particularly for writing responses by request servicing threads).

Improvements have also been made in the thread local pooling to keep memory down and allow better distribution of ByteBuffers.  This is achieved by only allowing the socket listener threads to pool ByteBuffers on their threads.  As the majority of ByteBuffer interaction is undertaken by a constant set of socket listener threads, they can pool a significantly higher number of ByteBuffers for improved performance.  It does mean servicing threads need to use the core pool for the response ByteBuffers.  However, as servicing threads spend time doing other work, there is less chance of contention in accessing the core pool.  Therefore, with the improved core pool access, less contention for the core pool and the socket listener threads now able to have higher pool sizes, this overall provides greater throughput.
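The two-level arrangement can be sketched as follows, assuming listener threads are explicitly marked (class and method names are illustrative, not OfficeFloor's):

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.concurrent.ConcurrentLinkedDeque;

// Illustrative two-level pooling: listener threads keep a thread-local pool,
// while servicing threads always use the shared core pool.
public class TwoLevelPool {

    private final ConcurrentLinkedDeque<ByteBuffer> corePool = new ConcurrentLinkedDeque<>();

    // Per-thread pool, only populated for threads flagged as socket listeners.
    private final ThreadLocal<ArrayDeque<ByteBuffer>> localPool =
            ThreadLocal.withInitial(ArrayDeque::new);

    private final ThreadLocal<Boolean> isListenerThread =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    public void markListenerThread() {
        isListenerThread.set(Boolean.TRUE);
    }

    public ByteBuffer acquire() {
        // Listener threads check their local pool first: no contention at all.
        if (isListenerThread.get()) {
            ByteBuffer local = localPool.get().poll();
            if (local != null) {
                return local;
            }
        }
        ByteBuffer buffer = corePool.poll();
        return (buffer != null) ? buffer : ByteBuffer.allocateDirect(4096);
    }

    public void release(ByteBuffer buffer) {
        buffer.clear();
        if (isListenerThread.get()) {
            localPool.get().offer(buffer); // pooled on the listener thread
        } else {
            corePool.offer(buffer); // servicing threads return to the core pool
        }
    }

    public static void main(String[] args) {
        TwoLevelPool pool = new TwoLevelPool();
        pool.markListenerThread();
        ByteBuffer buffer = pool.acquire();
        pool.release(buffer);
        System.out.println(pool.acquire() == buffer); // true: reused from the thread-local pool
    }
}
```

Restricting the local pools to the fixed set of listener threads is what bounds memory while still giving the hottest threads contention-free access.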

Parallel GC

While Java is optimising garbage collection for large heaps, the focus of OfficeFloor is on reducing garbage.  Creating lots of objects incurs more GC, which is overhead taking away from performance.  Hence, in reducing object creation there is little need for large heaps.  Therefore, while OfficeFloor has moved from Java 11 (round 19) to Java 15 (round 20), we have fallen back to Parallel GC for improved throughput due to the small heap size.
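For reference, Parallel GC is selected with a standard HotSpot flag; the heap sizes and jar name below are illustrative, not the benchmark configuration:

```shell
# -XX:+UseParallelGC selects the throughput (parallel) collector;
# heap sizes and server.jar are illustrative values only
java -XX:+UseParallelGC -Xms2g -Xmx2g -jar server.jar
```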

Note we did not notice any significant difference from Java 15 disabling biased locking by default.  However, this was not extensively tested, as we had to do this locally (we did not have enough continuous runs before round 20 was published to properly check this - so we went with Java 15's claim that the impact should be negligible).

The learning from this is that for Java 15 you should consider using the Parallel GC if your heap sizes don't grow large.

Future Work For Round 21

For round 21 we will be looking at flattening the call stack within OfficeFloor.  Work is already underway on this.  This will reduce the number of methods called to execute each function in the list, ideally bringing the OfficeFloor entries closer to officefloor-raw.