60%+ Performance Improvements with Continuous Profiling and Library Matching – Part 1/2 on Combining Dynamic and Static Analysis for Performance Optimisation

This is the first post in a two part series on combining static and dynamic analyses for performance optimisation. I’ve split them up as otherwise it’ll be horrifically long, and the second post will be online later this week. This post lays out some high level context, discusses why we should combine analyses, and has a loose framework for thinking about analysis combination. I’ll then explain a straightforward combination of continuous profiling with pattern matching on library names and versions which has led to 60%+ performance gains per service in real customer infrastructure.

In the follow-up post, I’ll discuss experimental work I did last year on using CodeQL to search for performance optimisation opportunities in C/C++ codebases. CodeQL is typically used to find security vulnerabilities, but I’ll show how it can also be used to find C++ code patterns that may lead to the compiler producing machine code that is 10x, or more, slower than it could be. Primarily, these patterns relate to code constructs that prevent the compiler from auto-vectorising loops and where, with some minor type changes or reordering of statements, we can achieve significant performance improvements. 

Why Combine Static and Dynamic Analyses?

A few years ago I co-founded optimyze.cloud, and we built Prodfiler (now the Elastic Universal Profiler) – a fleet-wide, continuous, whole-system profiler. That is, a profiler that is sufficiently efficient that it can be always on, everywhere, profiling the entire system on every machine in your fleet. Enabling Prodfiler/EUP gives insight into exactly which lines of code are costing you CPU resources across your entire fleet, regardless of whether it is in userland, the kernel, native code, or some higher level runtime runtime. That’s fantastic, but, when presented with the top 10, or 100, most expensive functions in your fleet, the question arises: is it “fine” that these functions are that expensive, or not? And if not, why are they so expensive and what can be done to remedy matters? 

If you’re lucky, the expensive functions are in code that you wrote and understand, and the answers to these questions may be easily arrived at. If you’re unlucky, it’s in some third party library you’re unfamiliar with, written in a language you don’t know, and on top of a runtime you’ve never heard of. A common question from our users is whether, instead of just a list of expensive functions, we could also suggest root causes or, even better, fixes. It is from here we come to the need for some sort of complementary analysis that can “reason” about why a particular chunk of code may be expensive, and if possible suggest a way to make it less so. 

A lesson I’ve taken from building bug finding and exploit generation tools is that the sweet spot for tooling usually brings together: 

  1. Large scale, fast, dynamic analysis, involving the execution of software/system and observing what it does under a large and diverse set of inputs.
  2. Automated static analysis, that may augment, or be augmented by the dynamic analysis.
  3. A well designed UI/UX that enables exploratory workflows even in the presence of large amounts of data. The UI should bring together all available context, while supporting filtering, slicing and searching, with little to no perceptible lag. 

With Prodfiler we have 1 and 3, but 2 is relatively unexplored in the performance world, and certainly in the context of combining it with continuous profiling. 

My belief is that, just like security, the sweet spot for performance analysis is in the intersection of points 1-3, with dynamic and static analysis being mutually beneficial, and an intuitive and performant UI being game changing. In this post I am going to focus exclusively on the static analysis part of the equation. If you’d like to learn more about how to build a high performance continuous profiling engine then we have an initial blog post on this topic over on the Elastic blog, and will have more in the future. 

Modes of Combining Dynamic and Static Analysis

In general, in program analysis R&D, there are no hard and fast rules on what does and does not work well, and it’s an area where’s there’s plenty of ongoing work in both academia and industry. It’s also not the case that one might stick to combining one static analysis with one dynamic analysis. You may even decide to skip one form of analysis entirely, and chain a bunch of exclusively dynamic analyses, or vice versa. As an example, for the purposes of bug finding, an over-approximate but fast abstract interpretation based analysis could be applied to every function in the code base to find potential bugs. The potentially buggy functions could then be passed to a symbolic execution engine to validate the bugs in the intraprocedural context. Finally, a fuzzer could be used to try and find paths to those buggy functions which exercise the conditions required to trigger the bug. For a concrete example, check out Sys, which combines static analysis and symbolic execution to leverage the strengths of each to find bugs in web browsers [code] [paper & talk].

When it comes to performance optimisation and analysis, there are three modes of operation that I can think of where multiple analyses may be mutually beneficial:

  1. Context-for-ranking in which a dynamic analysis provides a ground truth on the most expensive parts of the system, ideally when it is executing in production on real workloads. This context can then be used to rank the results of a static analysis. Ranking is important for two reasons. First, developers have finite time. An approximate analysis with false positives will require a developer to validate each result. If we want a developer to actually use the tool then we need some way to ensure that their efforts are applied to the most impactful findings before they run out of time or energy. Second, whether a finding from a static analysis tool intended to find performance issues is even valid or not may hinge on whether the code in question is on a hot path. The results of the static analysis are essentially incomplete unless combined with the context indicating what code is hot on production workloads. An example of context-for-ranking might be continuous CPU profiling in production being used to rank the output of an analysis which lists the loops in a program that the compiler failed to auto-vectorise. If a loop is unvectorised in a function that only executes once on startup then it probably doesn’t matter, but continuous profiling gives you a way to filter out this sort of noise and focus on the loops that are significant. 
  2. Ranking-for-analysis in which a lightweight dynamic analysis provides information on the most expensive parts of the system so that a heavyweight analysis can be focused there (or alternatively, so that more of a lightweight analysis can be done on those areas). Some analyses are CPU/RAM intensive to run. An example of this would be symbolic execution. Were we to use such an analysis, or something similarly heavy, in the context of performance optimisation then we would want to use some other data source to indicate which functions are most expensive, and focus our efforts there. 
  3. Context-for-operation in which one analysis provides data that is necessary for another analysis to operate at all. Examples here would be things like profile guided optimisation/AutoFDO or a post-link time optimizer like Facebook’s BOLT. For these, profiling information for them to operate at all. I’ll give another example of an analysis of this kind in the next section that is incredibly straightforward, yet effective.

Low Hanging, Delicious, Fruit 

The second post in this series, discussing using CodeQL to find optimisation opportunities, is in the category of “intellectually interesting research warranting further investigation”, but is still quite early stage. In the interests of giving some practical advice to go along with the ivory tower thinking, here’s a simple, obvious, and very effective analysis combination that has repeatedly given our customers wins across all sorts of systems.

The most low hanging fruit in combining static and dynamic analysis for performance optimisation is to just pattern match on the libraries and functions that are most significant in the performance profile of an application, and for which we know there is a functionally equivalent, optimised, version available. We have real world data with customers having huge performance wins following this approach. In Prodfiler this can be done by looking at the TopN view, which ranks the top N most expensive functions across your entire infrastructure. 

Some concrete examples we’ve come across:

  • If significant time is spent in functions related to memory allocation (e.g. malloc, free, realloc etc), and the application is using the standard system allocator, then it is likely that it will get a performance improvement by switching to a more optimized allocator, such as mimalloc or jemalloc. An example of this is described in this blog post, in which I found an application spending 10% of its time dealing with memory allocation. Replacing the system allocator with mimalloc cut this time in half, and increased overall performance of the application by ~10%. An internal team at Elastic encountered a similar pattern and, alongside the reduced CPU overhead for their application, reported a 20% drop in RAM usage by swapping out their allocator for jemalloc.
  • If significant time is spent in zlib then switching to zlib-ng can bring improvements. A customer of ours had a 30% improvement in application performance using this change. 
  • If a significant amount of time is spent processing JSON, or in serialisation and deserialization, then there is a huge amount of variance in the performance of 3rd party libraries that do this. Often the version provided by the runtime, or that is most popular, can perform dramatically worse than the state of the art. A customer of ours had a 60% reduction in one of their heaviest workloads by replacing the JSON encoder they were using with orjson.

This process is straightforward, and has an excellent effort-to-reward ratio – look at the results, google for “optimised library X”, ensure it meets your functional and non-functional requirements, replace, redeploy, measure, move on. If you’re starting out with combining analyses, then start there. This can be easily automated and integrated with your notification framework of choice. There are actually a couple of approaches to implementing this “grep for libraries” solution. One is to write a script that compares function and/or library names in your profiler’s Top N output against a list of known replaceable libraries. This would be an example of a context-for-operation combination of analyses, with profiling providing the context for the regex/grep script to run. Alternatively, you could have a tool which scans all of your base container images for libraries for which you know an faster alternative exists. Sending this information as alerts to developers is just going to annoy them though, as who knows if those libraries are even loaded, let alone have code on a hot path. However, combined with a continuous profiler you can filter out such noise, rank the results by their impact across your entire fleet and then dispatch alerts only in cases where replacing the library could have significant gains (an example of context-for-ranking).   


I hope the above was useful in laying out a framework for thinking about ways to combine analyses. If you’re a SRE/SWE and integrate the continuous profiling/grep-to-win approach in your environment, I’d love to hear how it pans out. There’s lots of scope for R&D in this space, so if you’ve got any interesting and useful examples of analyses then drop me a DM on Twitter. I’ll follow up later this week with the second part in this series, which will discuss using CodeQL to find loop vectorisation opportunities.