Observability - Profiling

Profiling

Instrumenting the program to measure performance of a specific operation or part of the program
Identify the bottlenecks in the code

Profile type

memory / alloc_in_new_tlab_bytes

Grafana Pyroscope

Continuously profiling the code
Adds very little overhead
Can store years of profiling data at a granularity as fine as 10 seconds
Uses a unique, inverted flame graph for increased readability

Profiling event type

CPU (Events.CPU, Units.SAMPLES, AggregationType.SUM)

Meaning: Samples taken while code was executing on the CPU (i.e., CPU-time sampling).
Unit: SAMPLES — each recorded sample is one observation of CPU activity.
Aggregation: SUM — total number of CPU samples in the aggregation window (useful to rank hotspots by sample count).

ALLOC (Events.ALLOC, Units.OBJECTS, AggregationType.SUM)

Meaning: Allocation events — occurrences where the program allocated an object (or memory chunk).
Unit: OBJECTS — counts of allocated objects (not bytes).
Aggregation: SUM — total number of allocations over the window (useful to find allocation-heavy code paths).

LOCK (Events.LOCK, Units.SAMPLES, AggregationType.SUM)

Meaning: Lock-related samples — times when execution was observed waiting on or interacting with locks (contention points).
Unit: SAMPLES — each sample indicates a lock-related event.
Aggregation: SUM — total lock samples (helps identify contention hotspots).

WALL (Events.WALL, Units.SAMPLES, AggregationType.SUM)

Meaning: Wall-clock time samples — sampling based on real elapsed time (includes sleeping, I/O waits, blocked periods).
Unit: SAMPLES — counts of wall-clock samples.
Aggregation: SUM — total wall-clock samples (shows where time is actually spent from a user-visible perspective).

CTIMER (Events.CTIMER, Units.SAMPLES, AggregationType.SUM)

Meaning: CPU-time samples driven by per-thread timers. Each thread gets its own timer (created with timer_create) that ticks on that thread's CPU-time clock, a clockid_t comparable to CLOCK_THREAD_CPUTIME_ID.
Unit: SAMPLES — counts of ctimer samples.
Aggregation: SUM — total ctimer samples (useful for attributing CPU time to threads/tasks).

ITIMER (Events.ITIMER, Units.SAMPLES, AggregationType.SUM)

Meaning: Interval timer samples, triggered by the signal of a process-wide interval timer (for example setitimer with ITIMER_PROF, which delivers SIGPROF as the process consumes CPU time).
Unit: SAMPLES — counts of itimer-triggered samples.
Aggregation: SUM — total itimer samples (used when profiling via interval timer interrupts).

Notes (practical):

SAMPLES vs OBJECTS: SAMPLES are sampling observations; OBJECTS are discrete allocation counts.
SUM aggregation is additive; to get rates, divide by time.
Choose the event type based on what you need to measure: CPU for CPU hot paths, WALL for end-to-end latency including waits, ALLOC for allocation behaviour, LOCK for contention, and CTIMER/ITIMER as timer-driven alternatives for CPU sampling (for example when perf_events is not available).
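The note that SUM-aggregated values turn into rates once divided by time can be sketched in Java as below. The class and method names (`SampleRate`, `perSecond`) and the figures in `main` are made up for illustration; none of this is part of the Pyroscope API.

```java
import java.time.Duration;

// Hypothetical helper: turns a SUM-aggregated count into a per-second rate.
final class SampleRate {
    private SampleRate() {}

    /** Events (samples or objects) per second over the aggregation window. */
    static double perSecond(long total, Duration window) {
        if (window.isZero() || window.isNegative()) {
            throw new IllegalArgumentException("window must be positive");
        }
        double seconds = window.toNanos() / 1_000_000_000.0;
        return total / seconds;
    }

    public static void main(String[] args) {
        // Illustrative numbers: 1,500 ALLOC objects summed over a 10-second window.
        System.out.println(perSecond(1_500, Duration.ofSeconds(10)));
    }
}
```

Because the aggregation is a plain sum, two adjacent windows can be added before dividing, which gives the rate over the combined window.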

Pyroscope - HTTP API reference

Pyroscope Server HTTP API
DeepWiki - grafana/pyroscope

Pyroscope - UI

```shell
http://$PYROSCOPE_HOST:4040/
```

JVM - Troubleshooting

Common JVM options for troubleshooting

```shell
-Xms2g
-Xmx2g
-XX:+UnlockExperimentalVMOptions
-XX:+UseG1GC
-Xlog:gc*=info,gc+heap=debug,gc+ref*=debug,gc+ergo*=trace,gc+age*=trace:file=${project.build.directory}/gc-%t.log:utctime,pid,level,tags:filecount=2,filesize=100m
-XX:StartFlightRecording=settings=default,filename=${project.build.directory}/${project.artifactId}.jfr,dumponexit=true,maxsize=100M
-XX:+UnlockDiagnosticVMOptions
-XX:+LogVMOutput
-XX:LogFile=${project.build.directory}/jvm.log
-XX:ErrorFile=${project.build.directory}/hs_err_%p.log
-XX:+DisableExplicitGC
-XX:+UseCompressedOops
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=${project.build.directory}/heapDump.hprof
```
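To confirm that options like these actually reached the running JVM, the standard management beans can be queried from inside the process. This is a minimal sketch; the class name `JvmFlagCheck` is hypothetical, and `HotSpotDiagnosticMXBean` is only available on HotSpot-based JVMs.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper: inspects the options the current JVM was started with.
final class JvmFlagCheck {
    private JvmFlagCheck() {}

    /** Raw JVM arguments as passed on the command line (e.g. -Xmx2g). */
    static List<String> inputArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    /** Effective value of a HotSpot flag, e.g. "HeapDumpOnOutOfMemoryError". */
    static String vmOption(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println("Arguments: " + inputArguments());
        System.out.println("HeapDumpOnOutOfMemoryError = "
                + vmOption("HeapDumpOnOutOfMemoryError"));
        System.out.println("UseG1GC = " + vmOption("UseG1GC"));
    }
}
```

`getVMOption` throws an `IllegalArgumentException` for a flag name the JVM does not know, which also makes it a quick way to catch misspelled options.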

Profile application startup

VisualVM - Startup Profiler

Profile specific code spot

BellSoft Blog - Hunting down code hotspots with JDK Flight Recorder
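One way to mark a specific code spot for JDK Flight Recorder is a custom event built on the `jdk.jfr` API. The sketch below is not taken from the linked article; the event name `demo.SuspectedHotSpot`, its label and category, and the `work` method are all hypothetical. The event is only written while a recording is running, for example one started with the `-XX:StartFlightRecording` option listed earlier.

```java
import jdk.jfr.Category;
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

// Hypothetical demo: wraps a suspected hot spot in a custom JFR event.
final class HotSpotEventDemo {
    private HotSpotEventDemo() {}

    @Name("demo.SuspectedHotSpot")
    @Label("Suspected Hot Spot")
    @Category("Demo")
    static class SuspectedHotSpot extends Event {
        @Label("Iterations")
        long iterations;
    }

    /** Stand-in for the code under investigation. */
    static long work(long iterations) {
        SuspectedHotSpot event = new SuspectedHotSpot();
        event.begin();                 // start timing the region
        long sum = 0;
        for (long i = 0; i < iterations; i++) {
            sum += i % 7;
        }
        event.iterations = iterations; // payload shown next to the duration
        event.commit();                // ends the event and writes it if enabled
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(work(1_000_000));
    }
}
```

In a recording, the events appear under the chosen category with their duration and the `iterations` field, so slow invocations can be sorted and compared.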

Identify the thread with the highest CPU consumption

```shell
top -H -p $PID_of_the_java_process
```
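`top -H` lists thread IDs in decimal, while HotSpot thread dumps identify the native thread as `nid=...`, which many JDK versions print in hexadecimal. A small sketch of matching the two follows; it assumes the hexadecimal form (if your dump prints `nid` in decimal, no conversion is needed). The class name and the abbreviated dump lines in `main` are invented for illustration.

```java
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper: maps a decimal thread ID from top -H to a thread dump entry.
final class ThreadIdLookup {
    private ThreadIdLookup() {}

    /** Decimal OS thread ID to the hexadecimal form used after "nid=". */
    static String toNid(long tid) {
        return "0x" + Long.toHexString(tid);
    }

    /** First dump line whose nid matches the given thread ID exactly. */
    static Optional<String> findThreadLine(String dump, long tid) {
        Pattern nid = Pattern.compile("\\bnid=" + Pattern.quote(toNid(tid)) + "\\b");
        return dump.lines().filter(line -> nid.matcher(line).find()).findFirst();
    }

    public static void main(String[] args) {
        // Abbreviated, made-up dump lines.
        String dump = "\"main\" #1 prio=5 tid=0x00007f0c nid=0x3039 runnable\n"
                + "\"worker\" #2 prio=5 tid=0x00007f0d nid=0x303a waiting on condition\n";
        System.out.println(findThreadLine(dump, 12346).orElse("not found"));
    }
}
```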

Print a thread dump to the JVM's standard output

```shell
kill -3 $PID   # SIGQUIT is signal 3; equivalent to: kill -SIGQUIT $PID
```

Tooling

Java - IntelliJ

Profiling in IntelliJ

Java - Swiss Java Knife (SJK)

GitHub - aragozin/jvm-tools

Java - Async Profiler

Download
Documentation
Use cases
- JVM Agent
- Java API
- IntelliJ IDEA

Async Profiler - CPU Profiling

CPU Sampling Engines

Async-profiler has 3 options for CPU profiling: -e cpu, -e itimer and -e ctimer.

cpu mode measures CPU time spent by the running threads.

1 profiling sample means that 1 CPU core was actively running for N nanoseconds (profiling interval).
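Since each sample stands for one profiling interval of CPU time, sample counts convert directly into estimated CPU time and an average number of busy cores. A minimal sketch of that arithmetic follows; the class name and the figures in `main` (3,000 samples, a 10 ms interval, a 60-second window) are illustrative assumptions.

```java
import java.time.Duration;

// Hypothetical helper: converts CPU sample counts into time and core usage.
final class CpuSampleMath {
    private CpuSampleMath() {}

    /** Estimated CPU time: one sample stands for one profiling interval. */
    static Duration cpuTime(long samples, Duration interval) {
        return interval.multipliedBy(samples);
    }

    /** Average number of cores kept busy during the observation window. */
    static double averageBusyCores(long samples, Duration interval, Duration window) {
        return (double) cpuTime(samples, interval).toNanos() / window.toNanos();
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofMillis(10);
        Duration window = Duration.ofSeconds(60);
        System.out.println(cpuTime(3_000, interval));
        System.out.println(averageBusyCores(3_000, interval, window));
    }
}
```

The same conversion applies to a single frame in a flame graph: its sample count times the interval estimates the CPU time spent in that frame and its callees.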

Linux - perf

Linux - sysprof

TypeScript

Resources
- GitHub - TypeScript - Performance Tracing

Node.js

Resources
- GitHub - clinicjs/node-clinic

Docker

Key points
- Docker adds very little overhead in terms of CPU and memory to the application.
- The biggest performance hit is in disk I/O performance.
- If you require very low latency you can switch to using Docker’s host network feature, cutting out NAT.

Resources
- An Updated Performance Comparison of Virtual Machines and Linux Containers
- Benchmarking IP and Unix domain sockets (for real)

Network
- In the IBM study listed under Resources (An Updated Performance Comparison of Virtual Machines and Linux Containers), the researchers found that Docker’s NAT doubled latency from roughly 35 µs to 70 µs for a 100-byte request from the client and a 200-byte response from the application.
- If you require very low latency you can switch to using Docker’s host network feature, which allows your container to share the same network as the host, cutting out the need for NAT.
- Unless you require very low latency, you should be fine sticking with the default bridge networking option. Just be sure to test it out and see if you’re getting the throughput you need.
- Avoid Docker forwarded ports in production environments. Use either Unix sockets or the host network mode instead, as both introduce virtually no overhead.
- When dealing with multiple processes (many applications, or many instances of a single one), ports can be easier to manage than a collection of socket files. If you can afford a small drop in throughput, go for IP sockets.
- If you have to extract every drop of performance available, use Unix domain sockets where possible.
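For JVM applications, Unix domain sockets are available through NIO channels from Java 16 onward. The sketch below performs a single echo round trip over a socket file in a temporary directory. It is an illustration only: the class name is hypothetical, both ends run in one thread (fine for small messages only), and in a Docker setup the socket file would instead live on a volume shared between the containers.

```java
import java.io.EOFException;
import java.io.IOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo: one echo round trip over a Unix domain socket (Java 16+).
final class UnixSocketEcho {
    private UnixSocketEcho() {}

    static String roundTrip(String message) throws IOException {
        Path dir = Files.createTempDirectory("uds-demo");
        Path socketFile = dir.resolve("echo.sock");
        UnixDomainSocketAddress address = UnixDomainSocketAddress.of(socketFile);
        byte[] payload = message.getBytes(StandardCharsets.UTF_8);
        try (ServerSocketChannel server = ServerSocketChannel.open(StandardProtocolFamily.UNIX)) {
            server.bind(address); // creates the socket file and starts listening
            try (SocketChannel client = SocketChannel.open(StandardProtocolFamily.UNIX)) {
                client.connect(address); // queued in the backlog until accept()
                try (SocketChannel accepted = server.accept()) {
                    writeAll(client, ByteBuffer.wrap(payload));
                    ByteBuffer received = readExactly(accepted, payload.length);
                    writeAll(accepted, received); // echo back unchanged
                    ByteBuffer echoed = readExactly(client, payload.length);
                    return StandardCharsets.UTF_8.decode(echoed).toString();
                }
            }
        } finally {
            // The socket file is not removed automatically when the channel closes.
            Files.deleteIfExists(socketFile);
            Files.deleteIfExists(dir);
        }
    }

    private static void writeAll(SocketChannel channel, ByteBuffer buffer) throws IOException {
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
    }

    private static ByteBuffer readExactly(SocketChannel channel, int length) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        while (buffer.hasRemaining()) {
            if (channel.read(buffer) < 0) {
                throw new EOFException("peer closed the connection early");
            }
        }
        buffer.flip(); // switch from writing into the buffer to reading from it
        return buffer;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("ping"));
    }
}
```

Swapping the address for an `InetSocketAddress` (and opening the channels without the `UNIX` protocol family) gives the IP socket variant, which makes it easy to benchmark both options in your own environment.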
