Profiling
- Instrumenting the program to measure performance of a specific operation or part of the program
- Identifying the bottlenecks in the code
Profile type
memory / alloc_in_new_tlab_bytes
Grafana Pyroscope
- Profiles the code continuously
- Adds minimal overhead
- Can store years of perf data down to 10-second granularity
- Uses a unique, inverted flame graph for increased readability
Profiling event type
CPU (Events.CPU, Units.SAMPLES, AggregationType.SUM)
Meaning: Samples taken while code was executing on the CPU (i.e., CPU-time sampling).
Unit: SAMPLES — each recorded sample is one observation of CPU activity.
Aggregation: SUM — total number of CPU samples in the aggregation window (useful to rank hotspots by sample count).
ALLOC (Events.ALLOC, Units.OBJECTS, AggregationType.SUM)
Meaning: Allocation events — occurrences where the program allocated an object (or memory chunk).
Unit: OBJECTS — counts of allocated objects (not bytes).
Aggregation: SUM — total number of allocations over the window (useful to find allocation-heavy code paths).
LOCK (Events.LOCK, Units.SAMPLES, AggregationType.SUM)
Meaning: Lock-related samples — times when execution was observed waiting on or interacting with locks (contention points).
Unit: SAMPLES — each sample indicates a lock-related event.
Aggregation: SUM — total lock samples (helps identify contention hotspots).
WALL (Events.WALL, Units.SAMPLES, AggregationType.SUM)
Meaning: Wall-clock time samples — sampling based on real elapsed time (includes sleeping, I/O waits, blocked periods).
Unit: SAMPLES — counts of wall-clock samples.
Aggregation: SUM — total wall-clock samples (shows where time is actually spent from a user-visible perspective).
CTIMER (Events.CTIMER, Units.SAMPLES, AggregationType.SUM)
Meaning: CPU timer samples, i.e. per-thread CPU-time sampling, typically driven by a POSIX timer (timer_create) on a per-thread CPU clock such as CLOCK_THREAD_CPUTIME_ID.
Unit: SAMPLES — counts of ctimer samples.
Aggregation: SUM — total ctimer samples (useful for attributing CPU time to threads/tasks).
ITIMER (Events.ITIMER, Units.SAMPLES, AggregationType.SUM)
Meaning: Interval timer samples (e.g., setitimer with ITIMER_PROF), i.e. samples triggered by the interval timer signal (SIGPROF).
Unit: SAMPLES — counts of itimer-triggered samples.
Aggregation: SUM — total itimer samples (used when profiling via interval timer interrupts).
Notes (practical):
SAMPLES vs OBJECTS: SAMPLES are sampling observations; OBJECTS are discrete allocation counts.
SUM aggregation is additive; to get rates, divide by time.
Choose event type based on what you need to measure: CPU for CPU hot paths, WALL for end-to-end latency/time including waits, ALLOC for allocation behaviour, LOCK for contention, CTIMER/ITIMER for specific timer-driven sampling.
Pyroscope - HTTP API reference
- Pyroscope Server HTTP API
- DeepWiki - grafana/pyroscope
Pyroscope - UI
```shell
http://$PYROSCOPE_HOST:4040/
```
SDK Instrumentation
- Requires code changes by using the SDK
- Examples
Auto-instrumentation using Grafana Alloy
- No code changes required
- Requires a collector to send profiles
- Examples
Java
JVM - Troubleshooting
Common JVM options for troubleshooting
```shell
-Xms2g
-Xmx2g
-XX:+UnlockExperimentalVMOptions
-XX:+UseG1GC
-Xlog:gc*=info,gc+heap=debug,gc+ref*=debug,gc+ergo*=trace,gc+age*=trace:file=${project.build.directory}/gc-%t.log:utctime,pid,level,tags:filecount=2,filesize=100m
-XX:StartFlightRecording=settings=default,filename=${project.build.directory}/${project.artifactId}.jfr,dumponexit=true,maxsize=100M
-XX:+UnlockDiagnosticVMOptions
-XX:+LogVMOutput
-XX:LogFile=${project.build.directory}/jvm.log
-XX:ErrorFile=${project.build.directory}/hs_err_%p.log
-XX:+DisableExplicitGC
-XX:+UseCompressedOops
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=${project.build.directory}/heapDump.log
```
Profile application startup
Profile specific code spot
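One possible way to profile a specific code spot, sketched here under assumptions, is a custom JDK Flight Recorder event wrapped around the region of interest. It only produces data while a recording is active (for example via the `-XX:StartFlightRecording` option listed above). The class `CodeSpotProfiling`, the event name `example.CodeSpot`, and the `sumOfSquares` workload are hypothetical names chosen for illustration.

```java
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

public class CodeSpotProfiling {

    // Hypothetical custom JFR event; name and fields are illustrative only.
    @Name("example.CodeSpot")
    @Label("Code Spot")
    static class CodeSpotEvent extends Event {
        @Label("Spot Name")
        String spot;

        @Label("Result")
        long result;
    }

    // Stand-in for the code you actually want to measure.
    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 1; i <= n; i++) {
            total += (long) i * i;
        }
        return total;
    }

    // Wraps the workload in an event so its duration shows up in the recording.
    static long profiledSumOfSquares(int n) {
        CodeSpotEvent event = new CodeSpotEvent();
        event.spot = "sumOfSquares";
        event.begin();
        try {
            long value = sumOfSquares(n);
            event.result = value;
            return value;
        } finally {
            // commit() ends the event if end() was not called explicitly;
            // nothing is written when no recording is running.
            event.commit();
        }
    }

    public static void main(String[] args) {
        System.out.println(profiledSumOfSquares(10)); // prints 385
    }
}
```

The resulting `.jfr` file can then be opened in a JFR viewer and filtered by the event name to see how long each invocation took.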
Identify the thread with the highest CPU consumption
```shell
top -H -p $PID_of_the_java_process
```
Display thread dump to stdout
```shell
# Either form sends SIGQUIT; the JVM prints a thread dump and keeps running
kill -3 $PID
kill -SIGQUIT $PID
```
Tooling
Java - IntelliJ
Java - Swiss Java Knife (SJK)
Java - Async Profiler
Async Profiler - CPU Profiling
CPU Sampling Engines
Async-profiler has 3 options for CPU profiling: -e cpu, -e itimer and -e ctimer.
cpu mode measures CPU time spent by the running threads.
1 profiling sample means that 1 CPU core was actively running for N nanoseconds (profiling interval).
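Because each sample stands for one profiling interval of CPU time, a summed sample count can be turned into an estimated CPU time, and dividing by the window length gives a rate. The sketch below illustrates that arithmetic; the class `CpuSampleMath`, the 10 ms interval, the 10 s window, and the sample count are assumptions for illustration, not values taken from async-profiler or Pyroscope.

```java
import java.time.Duration;

public class CpuSampleMath {

    // Estimated CPU time = number of samples x profiling interval.
    static Duration estimatedCpuTime(long samples, Duration interval) {
        return interval.multipliedBy(samples);
    }

    // Average number of busy cores = estimated CPU time / wall-clock window.
    static double averageBusyCores(long samples, Duration interval, Duration window) {
        return (double) estimatedCpuTime(samples, interval).toNanos() / window.toNanos();
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofMillis(10); // assumed profiling interval
        Duration window = Duration.ofSeconds(10);  // assumed aggregation window
        long samples = 1_500;                      // hypothetical SUM of CPU samples

        // 1500 samples x 10 ms = 15000 ms of CPU time
        System.out.println(estimatedCpuTime(samples, interval).toMillis() + " ms of CPU time");
        // 15 s of CPU time over a 10 s window = 1.5 cores busy on average
        System.out.println(averageBusyCores(samples, interval, window) + " cores busy on average");
    }
}
```

The same calculation applies to the itimer and ctimer modes as long as the interval used for the conversion matches the one the profiler was actually configured with.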
Linux - perf
- BellSoft Blog - How to use perf to monitor Java performance
- Generating perf maps with OpenJDK 17
- JavaOne 2016: Java Performance Analysis on Linux with Flame Graphs
- Profiling JVM Applications in Production
Linux - sysprof
TypeScript
Node.js
Docker
- Key points
  - Docker adds very little CPU and memory overhead to the application.
  - The biggest performance hit is in disk I/O.
  - If you require very low latency, you can switch to Docker’s host network feature, cutting out NAT.
- Resources
- Network
  - An IBM study found that Docker’s NAT doubled latency from roughly 35 µs to 70 µs for a 100-byte request from the client and a 200-byte response from the application.
  - If you require very low latency, you can switch to Docker’s host network feature, which lets your container share the host’s network and removes the need for NAT.
  - Unless you require very low latency, the default bridge networking option should be fine. Test it to confirm you are getting the throughput you need.
  - Avoid Docker forwarded ports in production environments. Use either Unix sockets or the host network mode instead, as both introduce virtually no overhead.
  - When dealing with multiple processes, whether many applications or a single scaled-out one, ports can be easier to manage than a collection of socket files. If you can afford a small drop in throughput, use IP sockets.
  - If you have to extract every drop of performance available, use Unix domain sockets where possible.
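For the Unix domain socket option, the following is a minimal sketch of what the application side could look like in Java, assuming Java 16 or later and a Unix-like host. It starts a one-shot echo server on a socket file and sends it one message. The class `UnixSocketEcho`, the helper names, and the temporary socket path are hypothetical; in a container setup the socket file would typically live on a volume shared with the client.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class UnixSocketEcho {

    // Reads until the peer closes its sending side or the buffer is full.
    static ByteBuffer readAll(SocketChannel channel) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(1024);
        while (buffer.hasRemaining() && channel.read(buffer) != -1) {
            // keep reading
        }
        buffer.flip();
        return buffer;
    }

    // Binds a one-shot echo server to a socket file, sends one message, returns the reply.
    static String echoOnce(String message) {
        try {
            Path dir = Files.createTempDirectory("uds-demo"); // illustrative location
            Path socketFile = dir.resolve("echo.sock");
            UnixDomainSocketAddress address = UnixDomainSocketAddress.of(socketFile);
            try (ServerSocketChannel server = ServerSocketChannel.open(StandardProtocolFamily.UNIX)) {
                server.bind(address);
                Thread echoThread = new Thread(() -> {
                    try (SocketChannel peer = server.accept()) {
                        ByteBuffer received = readAll(peer);
                        while (received.hasRemaining()) {
                            peer.write(received);
                        }
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
                echoThread.start();

                try (SocketChannel client = SocketChannel.open(address)) {
                    client.write(ByteBuffer.wrap(message.getBytes(StandardCharsets.UTF_8)));
                    client.shutdownOutput(); // tell the server the request is complete
                    String reply = StandardCharsets.UTF_8.decode(readAll(client)).toString();
                    echoThread.join();
                    return reply;
                }
            } finally {
                Files.deleteIfExists(socketFile);
                Files.deleteIfExists(dir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(echoOnce("ping")); // prints ping
    }
}
```

This only demonstrates the socket API; whether it actually beats a forwarded port in a given deployment still needs to be measured.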