How do I use JDK Mission Control to find Java performance problems?

Collect a JFR recording with -XX:StartFlightRecording=filename=recording.jfr,settings=profile, then open it in JMC. Start with the overview tab: if heap is climbing and GC is constant, check the Allocation view. If CPU is high, check Method Profiling and the flame graph. If CPU is low but response times are slow, check the Contention tab. Fix the loudest problem first, re-profile, and repeat.

How do I read a Java flame graph?

Each box in a flame graph is a method. Width is proportional to the percentage of CPU sample time spent in that method and everything it called. Wider boxes are hotter. The hottest leaf methods are often JDK internals; trace up the stack to find your application code that is driving the cost. A single wide box dominating the graph indicates a clear optimization target.

Why should I fix one performance problem at a time and re-profile?

Fixing the largest hotspot changes the shape of the profile. Methods that were previously buried in the noise become visible. For example, a 70% CPU hotspot from an O(n²) loop can mask a String.format() overhead that only becomes apparent after the first fix. Iterative profiling reveals second-order bottlenecks that batch fixing would hide.

How do I detect thread contention in Java with JFR?

Thread contention appears in the Contention tab in JDK Mission Control, not in the flame graph. Contention is load-dependent: a synchronized method may show no contention at 100 threads but become a bottleneck at 2,500 threads. Profile under realistic production concurrency to detect it. JFR records monitor contention events including wait times and stack traces.

What is the overhead of Java Flight Recorder in production?

JFR overhead is typically under 1% with the default settings and around 2% with the more detailed settings=profile configuration. It has been included in OpenJDK since Java 11 and was backported to OpenJDK 8u262. It can be safely run in production. Use settings=profile for comprehensive event capture including CPU sampling, allocation tracking, GC events, and thread contention.

One Method Was Using 71% of CPU. Here's the Flame Graph.

Part 2 of 3 in the Java Performance Optimization series. ← Part 1 · Part 3 coming soon.

In Part 1 I listed eight Java performance anti-patterns and explained what each one costs at the JVM level. I also opened with some numbers I didn’t fully explain: 1,198ms down to 239ms, about 1GB of heap down to 139MB, 19 GC pauses down to 4.

This post is where those numbers come from.

We’re going to open the Java Flight Recorder (JFR) recording from my demo app in JDK Mission Control (JMC) and actually look at what’s hot. The thing I want to show isn’t just before and after. It’s what happens in between. Fix the biggest problem and something that was invisible before suddenly shows up. The shape of the profile changes. You end up doing this in rounds, not in one pass.

The App

The demo app is an order analytics pipeline. It generates synthetic orders, validates them, computes revenue by currency, scores each order for fraud risk, detects hourly volume trends, and accumulates the results into a summary report. Nothing too crazy. I wrote it intentionally with several of the anti-patterns from Part 1 sprinkled in, to reflect what might happen in a real production system. These types of anti-patterns can naturally accumulate in a real codebase over time.

The load test processed 100,000 orders in batches of 1,000 across 100 virtual threads and collected the JFR recording.

Opening the Recording

If you haven’t used JMC before, here’s how to get started. You can generate a .jfr file by launching your application with the JDK’s built-in profile settings:

java -XX:StartFlightRecording=filename=recording.jfr,settings=profile -jar myapp.jar

Or attach to a running process:

jcmd <pid> JFR.start duration=120s filename=recording.jfr settings=profile

Then open it in JMC. It’s free and open source. The settings=profile part matters because it enables a broader set of events including allocation tracking, GC details, and thread contention, which you’ll want.

The first screen you land on is the Java Application overview. CPU usage, heap usage, GC activity, thread count, all plotted over the recording window. Before I changed anything, it looked like this:

[Figure: JMC overview tab showing high CPU, heap climbing toward 1 GiB, and GC pauses visible as red vertical lines]

The heap pattern is the first thing to notice. Memory climbing steadily, GC running constantly trying to keep up. You’re paying twice: once for creating all those objects and once for collecting them.

The Flame Graph

The Method Profiling tab is where the flame graph lives. JFR collected stack trace samples throughout the recording. Each box in the graph is a method. Width is proportional to how much of total CPU sample time was spent in that method and everything it called. Wider means hotter.

[Figure: Method Profiling view showing ReduceOps$3ReducingSink.accept at 158 samples (70.5%), flame graph with TrendDetector.detect dominating the width]

One box is taking up over 70% of the samples. That’s the app spending nearly three quarters of its CPU time in one place.

Notice this maps to java.util.stream.ReduceOps$3ReducingSink.accept(). That’s JDK stream internals. The flame graph shows you what’s hot at the leaf level, which is often a JDK method. To find the problem, you trace up the stack.

Walk up the stack and you reach TrendDetector.detect(), which is my code:

for (Order order : orders) {
    int hour = order.timestamp().atZone(ZoneId.systemDefault()).getHour();

    long countForHour = orders.stream()
        .filter(o -> o.timestamp().atZone(ZoneId.systemDefault()).getHour() == hour)
        .collect(Collectors.counting());

    ordersByHour.put(hour, countForHour);
}

For each order, it streams the entire list to count how many orders share that hour. With 1,000 orders per batch, that’s 1,000 iterations times 1,000 stream elements (a million operations per batch for what should be a single pass). This is the O(n²) stream-inside-loop pattern from Part 1. It was the single largest CPU hotspot, accounting for roughly 70% of all CPU samples in the recording.

While digging into the recording, I also spotted RevenueAnalyzer.analyze() doing string concatenation in a loop, and OrderValidator.validate() calling Pattern.compile() on every invocation, which means costly regex compilation for every single order. Compiled Pattern objects are immutable and thread-safe, so they should be compiled once and reused. It’s the same principle as the recreating-reusable-objects pattern from Part 1.

Round 1: The Obvious Fixes

Three problems, all visible in the first recording. I fixed them:

1. TrendDetector: O(n²) stream-in-loop → single-pass accumulation

for (Order order : orders) {
    int hour = order.timestamp().atZone(ZoneId.systemDefault()).getHour();
    ordersByHour.merge(hour, 1L, Long::sum);
}

One pass. O(n). Each order increments its hour’s count directly.
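To check that the single-pass version really produces the same counts as the stream-in-loop version, here is a minimal self-contained sketch. It works on bare Instant timestamps instead of the demo app's Order type, and the class name, method names, sample data, and fixed UTC zone are all assumptions made for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class HourlyCounts {

    // Hypothetical helper: hour of day (0-23) in UTC for a timestamp.
    static int hourOf(Instant timestamp) {
        return timestamp.atZone(ZoneOffset.UTC).getHour();
    }

    // O(n^2): re-streams the whole list once per element, like the original loop.
    static Map<Integer, Long> countQuadratic(List<Instant> timestamps) {
        Map<Integer, Long> byHour = new HashMap<>();
        for (Instant t : timestamps) {
            int hour = hourOf(t);
            long count = timestamps.stream()
                    .filter(other -> hourOf(other) == hour)
                    .collect(Collectors.counting());
            byHour.put(hour, count);
        }
        return byHour;
    }

    // O(n): one pass, each timestamp increments its hour's counter.
    static Map<Integer, Long> countSinglePass(List<Instant> timestamps) {
        Map<Integer, Long> byHour = new HashMap<>();
        for (Instant t : timestamps) {
            byHour.merge(hourOf(t), 1L, Long::sum);
        }
        return byHour;
    }

    // Hypothetical sample data: n timestamps spaced 7 minutes apart.
    static List<Instant> sample(int n) {
        List<Instant> out = new ArrayList<>();
        Instant start = Instant.parse("2024-01-01T00:00:00Z");
        for (int i = 0; i < n; i++) {
            out.add(start.plusSeconds(i * 420L));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Instant> data = sample(1_000);
        // Both approaches should agree; only the amount of work differs.
        System.out.println(countQuadratic(data).equals(countSinglePass(data)));
    }
}
```

For a 1,000-element batch the quadratic version evaluates the filter a million times, while the single pass touches each element once.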

2. RevenueAnalyzer: string concatenation → StringBuilder

StringBuilder formattedSummaryBuilder = new StringBuilder();
for (Order order : orders) {
    double revenue = (double) order.quantity() * order.unitPrice();
    formattedSummaryBuilder
        .append("Order ").append(order.orderId())
        .append(" | ").append(order.currency())
        .append(" | Revenue: ").append(String.format("%.2f", revenue))
        .append("\n");
}

Single mutable buffer. No intermediate string copies.

3. OrderValidator: per-call regex compilation → static final patterns

private static final Pattern ORDER_ID_PATTERN = Pattern.compile("ORD-\\d{8}");
private static final Pattern SKU_PATTERN = Pattern.compile("[A-Z]{3}-\\d{4}");

public ValidationResult validate(Order order) {
    // uses ORDER_ID_PATTERN and SKU_PATTERN directly
}

Compiled once, when the class is initialized, and reused on every call.
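The body of validate() is elided above, so here is a hedged sketch of how precompiled patterns are typically used. The class name OrderIdChecks and the isValid method are hypothetical stand-ins, not the demo app's code. The point is that the expensive Pattern is shared, while the cheap per-call Matcher (which is not thread-safe) is created fresh each time.

```java
import java.util.regex.Pattern;

class OrderIdChecks {
    // Compiled once during class initialization and safe to share across threads.
    private static final Pattern ORDER_ID_PATTERN = Pattern.compile("ORD-\\d{8}");
    private static final Pattern SKU_PATTERN = Pattern.compile("[A-Z]{3}-\\d{4}");

    // Hypothetical stand-in for validate(): true only if both fields match in full.
    static boolean isValid(String orderId, String productSku) {
        // matcher() creates a lightweight, per-call Matcher from the shared Pattern.
        return ORDER_ID_PATTERN.matcher(orderId).matches()
                && SKU_PATTERN.matcher(productSku).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("ORD-00000042", "ABC-1234")); // well-formed IDs
        System.out.println(isValid("ORD-42", "ABC-1234"));       // order ID too short
    }
}
```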

I re-ran the load test and collected a new recording:

[Figure: Method Profiling after Round 1 showing no dominant method, top method at 6.9%, CPU spread evenly across generateBatch, validate, analyze, score, detect]

The 70% box is gone. Elapsed time dropped from around 1,000ms to around 400ms. The profile looks completely different.

But look at what’s showing up now. With the TrendDetector hotspot out of the way, the profiler sees things that were previously buried in the noise.

Round 2: What Was Hidden Before

OrderGenerator.generateOrder() is now one of the more visible methods in the profile. It calls String.format() three times per order to build IDs:

String orderId = String.format("ORD-%08d", rng.nextInt(0, 100_000_000));
String customerId = String.format("CUST-%06d", rng.nextInt(0, 1_000_000));
String productSku = SKU_PREFIXES[rng.nextInt(SKU_PREFIXES.length)]
        + "-" + String.format("%04d", rng.nextInt(0, 10_000));

With 100,000 orders, that’s 300,000 String.format() calls. Each one parses the format string, runs the full java.util.Formatter machinery, and allocates intermediate objects. This is the String.format() in hot paths problem from Part 1, different use case (zero-padding IDs instead of building summaries) but same overhead. It was always there. The TrendDetector cost was just louder and was drowning it out.

Fixed it by replacing String.format() with a simple zero-padding helper:

String orderId = "ORD-" + padLeft(rng.nextInt(0, 100_000_000), 8);
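The padLeft helper itself isn't shown in the post. A minimal sketch of what such a helper could look like follows; the signature and the IdPadding class name are assumptions, and it only handles non-negative values, which is all the ID generation needs.

```java
class IdPadding {
    // Hypothetical helper: left-pads a non-negative int with zeros to at least `width` digits.
    static String padLeft(int value, int width) {
        String digits = Integer.toString(value);
        StringBuilder sb = new StringBuilder(Math.max(width, digits.length()));
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    public static void main(String[] args) {
        // Same result as String.format("ORD-%08d", 42), without the Formatter machinery.
        System.out.println("ORD-" + padLeft(42, 8)); // prints ORD-00000042
    }
}
```

Unlike String.format(), nothing here parses a format string at runtime; the only allocations are the digit string, the builder, and the result.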

Along with a handful of other smaller fixes I found while in the code (autoboxing in FraudScorer, string concatenation inside a synchronized method in AnalyticsAccumulator, suboptimal collection choices in ReportGenerator), the elapsed time dropped from around 400ms to around 230ms.

Re-ran and collected a new recording.

The profile is flat. No single dominant hotspot. CPU time is spread across the actual business logic, which is what you want to see. The app is spending time doing work instead of fighting itself.

The Problem the Default Recording Didn’t Show

The flame graph showed me the CPU hotspots. The allocation view showed me the heap pressure. But there’s a third category of performance problem that neither of those surfaces: thread contention.

The earlier recordings ran 100 virtual threads. At that concurrency level, the synchronized method in AnalyticsAccumulator didn’t cause enough lock contention to cross JFR’s reporting threshold. The lock waits were too short to be recorded.

I got interesting results when I re-ran with higher concurrency, 2,500 virtual threads processing 500,000 orders in batches of 200:

java -XX:StartFlightRecording=filename=contention.jfr,settings=profile \
  -cp target/classes com.demo.orderanalytics.App \
  --orders 500000 --batch-size 200

Now the Contention tab (the Lock Instances view in JMC) lights up.

[Figure: Lock Instances view showing AnalyticsAccumulator as the only contended monitor, 842 contention events, 16.3s total blocked time, stack trace showing accumulate() at 100%]

842 monitor contention events. Every single one on AnalyticsAccumulator.accumulate(). Virtual threads queuing up waiting to enter the synchronized method. This is the too-broad synchronization pattern from Part 1, showing up live in the profiler. Median wait time: 15ms. Total aggregate wait time across all threads: over 16 seconds.

This is the thing about contention: it’s load-dependent. At 100 virtual threads it was invisible. At 2,500 it became the bottleneck. If you only profile at low concurrency, you’ll miss it entirely.
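The monitor events JMC displays can also be read straight from a .jfr file with the JDK's jdk.jfr.consumer API. The sketch below is an illustration, not part of the demo app: it sums blocked time per monitor class from jdk.JavaMonitorEnter events. So that it runs on its own, it first records a tiny, deliberately contended workload on two platform threads; the HotLock class, the helper names, and the zero threshold are assumptions for the demo.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CountDownLatch;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedClass;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

class ContentionSummary {

    // Hypothetical lock type so the monitor class is easy to spot in the output.
    static final class HotLock { }

    // Sums blocked time (ms) per monitor class across jdk.JavaMonitorEnter events.
    static Map<String, Long> blockedMillisByMonitor(Path jfrFile) {
        Map<String, Long> totals = new TreeMap<>();
        try {
            for (RecordedEvent event : RecordingFile.readAllEvents(jfrFile)) {
                if (!"jdk.JavaMonitorEnter".equals(event.getEventType().getName())) {
                    continue;
                }
                RecordedClass monitor = event.getClass("monitorClass");
                String name = (monitor == null) ? "<unknown>" : monitor.getName();
                totals.merge(name, event.getDuration().toMillis(), Long::sum);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return totals;
    }

    // Records a small contended workload so the sketch needs no external file.
    static Path recordSampleContention() {
        Object lock = new HotLock();
        CountDownLatch lockHeld = new CountDownLatch(1);
        try (Recording recording = new Recording()) {
            // Zero threshold keeps even short waits; the built-in settings use a higher one.
            recording.enable("jdk.JavaMonitorEnter").withThreshold(Duration.ZERO);
            recording.start();

            Thread holder = new Thread(() -> {
                synchronized (lock) {
                    lockHeld.countDown();
                    pause(200); // hold the monitor while the waiter queues up
                }
            });
            Thread waiter = new Thread(() -> {
                awaitQuietly(lockHeld);
                synchronized (lock) { // blocks until the holder releases the monitor
                    pause(1);
                }
            });
            holder.start();
            waiter.start();
            holder.join();
            waiter.join();

            recording.stop();
            Path file = Files.createTempFile("contention-demo", ".jfr");
            recording.dump(file);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Pass a real recording (for example contention.jfr) or fall back to the demo.
        Path file = args.length > 0 ? Path.of(args[0]) : recordSampleContention();
        blockedMillisByMonitor(file).forEach((monitor, ms) ->
                System.out.println(monitor + " blocked for " + ms + " ms in total"));
    }
}
```

Pointing main at a recording taken under realistic concurrency gives a quick text summary of which monitors threads queued on, which is handy for comparing runs at different thread counts.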

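There are two ways to attack this. You can replace synchronized with ReentrantLock. That doesn’t shorten the critical section by itself, but it gives you finer-grained control (tryLock, timed and interruptible acquisition), and on JDK versions before 24 it lets a blocked virtual thread unmount from its carrier thread instead of pinning it the way synchronized does: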

private final ReentrantLock lock = new ReentrantLock();

public void accumulate(BatchAnalytics batchResult) {
    lock.lock();
    try {
        // ... merge results
    } finally {
        lock.unlock();
    }
}

Or you can reduce the work inside the critical section. In this case, replacing string concatenation with StringBuilder means the lock is held for less time on each call, reducing the window where other threads are blocked. In practice, I’d do both.
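As a sketch of doing both, the example below uses a ReentrantLock and also moves the string building out of the critical section, so only the shared-state update is guarded. That goes one step further than swapping concatenation for StringBuilder inside the lock. The SummaryAccumulator class, its accumulate(String, int) signature, and the runDemo harness are hypothetical; the demo app's accumulate() takes a BatchAnalytics.

```java
import java.util.concurrent.locks.ReentrantLock;

class SummaryAccumulator {
    private final ReentrantLock lock = new ReentrantLock();
    private final StringBuilder summary = new StringBuilder();
    private long totalOrders;

    // Hypothetical signature for illustration only.
    void accumulate(String batchId, int orderCount) {
        // Build the line before taking the lock: no other thread has to wait for this.
        String line = new StringBuilder()
                .append("Batch ").append(batchId)
                .append(" | orders: ").append(orderCount)
                .append('\n')
                .toString();

        lock.lock();
        try {
            // Only the shared-state updates stay inside the critical section.
            summary.append(line);
            totalOrders += orderCount;
        } finally {
            lock.unlock();
        }
    }

    long totalOrders() {
        lock.lock();
        try {
            return totalOrders;
        } finally {
            lock.unlock();
        }
    }

    // Runs `threads` platform threads, each reporting `batchesPerThread` batches of 10 orders.
    static SummaryAccumulator runDemo(int threads, int batchesPerThread) {
        SummaryAccumulator accumulator = new SummaryAccumulator();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int b = 0; b < batchesPerThread; b++) {
                    accumulator.accumulate(id + "-" + b, 10);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return accumulator;
    }

    public static void main(String[] args) {
        System.out.println(runDemo(8, 1_000).totalOrders()); // prints 80000
    }
}
```

The shorter the guarded region, the less time each waiting thread spends queued, whichever locking primitive is used.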

Results

When I ran this for my DevNexus talk, the numbers from a specific run were:

| Metric | Before | After |
| --- | --- | --- |
| Elapsed time | 1,198ms | 239ms |
| Throughput | 85K orders/sec | 419K orders/sec |
| Peak heap | 1,021MB | 139MB |
| GC pauses | 19 (34ms total) | 4 (6ms total) |

Re-running the same workload for this post, I see similar ratios. Baseline around 1,000ms, optimized around 230ms. The exact numbers shift between runs, but the improvement is consistently 4-5x on elapsed time, and the run above shows roughly 7x on peak heap (1,021MB down to 139MB).
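The ratios for the run in the table can be worked out directly. This small sketch does the arithmetic; the SpeedupMath class is hypothetical, and the throughput lines assume the 100,000-order load test described earlier.

```java
class SpeedupMath {
    // How many times smaller `after` is than `before`.
    static double ratio(double before, double after) {
        return before / after;
    }

    // Throughput implied by processing `orders` in `elapsedMillis`.
    static double ordersPerSecond(long orders, double elapsedMillis) {
        return orders / (elapsedMillis / 1000.0);
    }

    public static void main(String[] args) {
        System.out.printf("elapsed:   %.1fx%n", ratio(1198, 239));
        System.out.printf("peak heap: %.1fx%n", ratio(1021, 139));
        System.out.printf("GC pauses: %.1fx%n", ratio(19, 4));
        // Assumption: the run processed 100,000 orders in total.
        System.out.printf("before: %.0f orders/sec%n", ordersPerSecond(100_000, 1198));
        System.out.printf("after:  %.0f orders/sec%n", ordersPerSecond(100_000, 239));
    }
}
```

That works out to about 5.0x on elapsed time and about 7.3x on peak heap, and the throughput implied by the elapsed times lands close to the table's 85K and 419K orders/sec.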

The seven fixes:

| Fix | Location | Anti-Pattern (Part 1) |
| --- | --- | --- |
| O(n²) stream-in-loop → single-pass | `TrendDetector.detect()` | Streaming full list per element |
| String concat in loop → StringBuilder | `RevenueAnalyzer.analyze()` | O(n²) string copying |
| Regex recompilation → static Pattern | `OrderValidator.validate()` | Recreating reusable objects |
| String.format() → direct string building | `OrderGenerator.generateOrder()` | Formatter overhead in hot path |
| Autoboxing → primitive arrays | `FraudScorer.score()` | Wrapper object allocation |
| String concat → StringBuilder | `AnalyticsAccumulator.accumulate()` | Allocation inside critical section |
| LinkedList/Hashtable → ArrayList/HashMap | `ReportGenerator.generate()` | Suboptimal collection choices |

Why This Matters at Scale

As I wrote in Part 1, these improvements compound across a fleet. On a single instance, going from 1,000ms to 230ms is a nice win. Across a fleet of instances all handling the same workload, it changes the economics.

It’s hard to convince your boss you should spend two weeks shaving 50ms off a request. Imagine the conversation when you frame it as cost savings. “Let me spend two weeks and we’ll save five figures on our annual infrastructure costs.”

If a 5x throughput improvement means you can handle the same load on fewer instances, or downsize your instance types, run the math on your own fleet. The savings compound fast.

What the Profile Is Actually Telling You

The most useful thing I can pass on: don’t start with the flame graph.

The flame graph shows you hot compute methods. But not every performance problem shows up there. If your bottleneck is excessive allocation, you’ll see it in heap growth and GC frequency before you see it in CPU samples. If your bottleneck is synchronization, you’ll see it in the Contention tab, not the flame graph at all. CPU can look completely normal while threads are stalled waiting on a lock. And as I showed above, contention might not even show up until you profile under realistic production concurrency.

So start with the overview tab. If heap is climbing and GC is running constantly, go to the allocation view first. If CPU is high but heap is flat, go to the flame graph. If CPU is low but response times are slow, go to the Contention tab. Something is blocking.

The other thing worth saying: fix one thing at a time and re-profile. It’s tempting to batch all your changes and just run the benchmarks at the end. The problem is you won’t know what each change actually did. And more importantly, you’ll miss the second-order effects. The String.format() problem in Round 2 only became visible after the Round 1 fixes cleared the noise. The Round 1 fixes, dominated by TrendDetector, dropped elapsed time from around 1,000ms to around 400ms. That revealed the String.format() overhead in OrderGenerator, and fixing it along with the smaller Round 2 changes dropped elapsed time further to around 230ms. If I’d fixed everything at once and just looked at the final numbers, I would have gotten the same result but I wouldn’t have understood why.

One more thing. Look at the fixes in this post. Replacing a stream-in-loop with Map.merge(). Using StringBuilder instead of string concatenation. Hoisting Pattern.compile() to a static field. Most of these made the code both faster and cleaner. If you find yourself in a position where optimizing is making the code harder to read or maintain, that’s worth pausing on. Focus on the changes that will have impact when running at scale, and keep the code something your team can live with.

Doing This on Your Own App

You need two things: a JFR recording from a load test or a production traffic window, and JMC to open it.

# Start recording at launch with comprehensive event capture
java -XX:StartFlightRecording=filename=myapp.jfr,settings=profile -jar myapp.jar

# Or attach to a running process
jcmd <pid> JFR.start duration=120s filename=myapp.jfr settings=profile

JFR has been included in OpenJDK since Java 11 (and was backported to OpenJDK 8u262). The default settings target under 1% overhead, and the more detailed settings=profile configuration typically costs around 2%. You can run it in production. Use settings=profile to get the full picture: CPU sampling, allocation tracking, GC events, and thread contention.

Once you have the file:

1. Open it in JMC
2. Start with the overview tab to understand the shape of the problem
3. Heap climbing and GC pressure → go to the Allocation view
4. High CPU → go to Method Profiling and the flame graph
5. Slow response times with normal CPU → go to the Contention tab under Threads
6. Fix the loudest problem first, re-profile, then repeat

And if contention is a concern, profile under realistic concurrency. A lock that’s invisible at 10 threads can become the bottleneck at 1,000.

That last point is the whole workflow. It’s not a one-shot process. Each round of fixes changes what’s visible in the next recording.

The full source code for the demo app is on GitHub. The main branch has the unoptimized code with all the anti-patterns intact. The performance-optimized branch has the fixes. You can diff the two to see exactly what changed.

In Part 3 we’ll look at how to automate this loop entirely. The JFR data has enough signal to drive the fixes programmatically, and there are tools that can read it, rank the anti-patterns by hot path weight, write the code changes, run your tests, and hand you a diff. We’ll walk through exactly what that looks like.

Something you’re seeing in your own JFR recording that doesn’t make sense? I’m on LinkedIn.
