Profiling performance will take you places, and at times the improvement process can spin off into quite interesting journeys, often leading to unexpected results. Performance tuning is a significant part of qryn development, and we are extremely proud of being able to do it using qryn itself!
Today we'll share a small sample from our development experience of using PGO to optimize the performance of the qryn OpenTelemetry collector. Throughout the article, we'll examine the effectiveness of PGO and analyze the results of our profiling exercise using qryn to explore the practical implications of this optimization technique.
Application Benchmark Setup
In order to perform our optimization and demonstrate the results, we need an application to optimize. The application should be under a constant load for a significant amount of time (the longer, the better; let's say at least 20 minutes).
We will use the Pyroscope client to collect the profile data and some readers to access the collected profiles.
We will run an OpenTelemetry collector accepting application profiles through the Pyroscope-compatible API and saving them directly into the qryn database.
The running OpenTelemetry collector features a Pyroscope extension that sends its own profiles to the accepting endpoint described earlier.
Thus, in this specific example, the collector will profile itself 🔥
Apart from the OpenTelemetry collector, we will leverage a qryn instance reading from the same database and a Grafana instance providing the user interface to read the saved profiles. Let's start our profiling exercise!
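For reference, pushing profiles from a Go application to a Pyroscope-compatible endpoint typically looks like the minimal sketch below. It assumes the grafana/pyroscope-go client; the application name and server address are placeholders, and the collector's Pyroscope extension does essentially the same thing internally:

package main

import (
	"log"

	"github.com/grafana/pyroscope-go"
)

func main() {
	// Hypothetical endpoint: the collector's Pyroscope-compatible receiver;
	// in our setup the profiles end up in the qryn database.
	_, err := pyroscope.Start(pyroscope.Config{
		ApplicationName: "otel-collector-selftest", // placeholder name
		ServerAddress:   "http://localhost:4040",   // placeholder address
		ProfileTypes: []pyroscope.ProfileType{
			pyroscope.ProfileCPU,
			pyroscope.ProfileInuseObjects,
			pyroscope.ProfileInuseSpace,
		},
	})
	if err != nil {
		log.Fatalf("failed to start pyroscope profiler: %v", err)
	}

	// ... run the actual application workload here ...
	select {}
}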
PGO optimization of the Collector
This example optimization process will be quite simple:
Wait for 10-20 minutes to collect a sufficient amount of pprof data.
Merge the collected pprof observations into one result pprof.
Add a -pgo pgo.pprof flag to the go build directive to build the optimized application.
Let's begin by analyzing the outcome of our first step:
The results are quite clear and easy to understand:
About 50% of the time was taken by the Go garbage collector (GC)
runtime.mcall is likely the HTTP server
go.opente... is the qryn exporter writing to the qryn database
github.com/... is the Pyroscope client itself
Next, let's try to combine the intermediate pprofs using the profilecli tool:
$ profilecli query merge \
--from='2024-08-27T15:37:45+03:00' \
--to='2024-08-27T16:00:30+03:00' \
--url=http://localhost:3100 | head -n 10
level=info msg="query aggregated profile from profile store" url=http://localhost:3100 from=2024-08-27T15:37:45+03:00 to=2024-08-27T16:00:30+03:00 query={} type=process_cpu:cpu:nanoseconds:cpu:nanoseconds
PeriodType: cpu nanoseconds
Period: 10000000
Time: 2024-08-27 15:37:55.600057479 +0300 EEST
Duration: 22m4
Samples:
samples/count cpu/nanoseconds
4 40000000: 1 2 3 4 5 6
2 20000000: 7 8 9 10 11 12 13
2 20000000: 14 15 16 17 18 5 6
2 20000000: 19 20 21 22 23 24 25 26 27
$ profilecli query merge \
--from='2024-08-27T15:37:45+03:00' \
--to='2024-08-27T16:00:30+03:00' \
--url=http://localhost:3100 \
--output=pprof=./pgo.pprof
level=info msg="query aggregated profile from profile store" url=http://localhost:3100 from=2024-08-27T15:37:45+03:00 to=2024-08-27T16:00:30+03:00 query={} type=process_cpu:cpu:nanoseconds:cpu:nanoseconds
$ ls -la | grep pgo.pprof
-rw-r--r-- 1 user user 30324 сер 29 20:42 pgo.pprof
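As a side note, the same merge can also be done programmatically. Below is a minimal, hedged sketch using the github.com/google/pprof/profile package, assuming the individual pprof files have already been downloaded to disk (the file names are placeholders):

package main

import (
	"log"
	"os"

	"github.com/google/pprof/profile"
)

func main() {
	// Hypothetical input files: individual CPU profiles collected during the run.
	inputs := []string{"cpu-1.pprof", "cpu-2.pprof", "cpu-3.pprof"}

	var profiles []*profile.Profile
	for _, name := range inputs {
		f, err := os.Open(name)
		if err != nil {
			log.Fatalf("open %s: %v", name, err)
		}
		p, err := profile.Parse(f) // accepts both gzipped and raw protobuf profiles
		f.Close()
		if err != nil {
			log.Fatalf("parse %s: %v", name, err)
		}
		profiles = append(profiles, p)
	}

	// Merge sums the samples of all profiles into a single one.
	merged, err := profile.Merge(profiles)
	if err != nil {
		log.Fatalf("merge: %v", err)
	}

	out, err := os.Create("pgo.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	if err := merged.Write(out); err != nil { // writes a gzip-compressed pprof
		log.Fatal(err)
	}
}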
Since we mostly use a dockerized OpenTelemetry collector to run our test, we can simply add the -pgo flag to the Dockerfile and build it:
$ cat cmd/otel-collector/Dockerfile
# Builder stage
FROM golang:1.22.1-alpine as builder
RUN apk --update add ca-certificates
# ...
RUN cd cmd/otel-collector && \
go build -pgo=/src/pgo.pprof -tags timetzdata -o /out/otel-collector
# Final stage
FROM scratch
# ...
ENTRYPOINT ["/otel-collector"]
CMD ["--config", "/etc/otel/config.yaml"]
$ docker build \
-t otel-collector:latest \
-f cmd/otel-collector/Dockerfile .
[+] Building 61.5s (13/13) FINISHED docker:default
=> [internal] load build definition from Dockerfile 0.1s
=> => transferring dockerfile: 579B 0.0s
=> WARN: FromAsCasing: 'as' and 'FROM' keywords' casing do not match (line 2) 0.1s
=> [internal] load metadata for docker.io/library/golang:1.22.1-alpine 1.6s
...
=> => naming to docker.io/library/otel-collector:latest 0.0s
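Before re-running the test, it can be worth double-checking that the profile actually made it into the binary: Go (1.21+) records the -pgo setting in the embedded build info, the same data that go version -m prints. A minimal sketch of checking it from code (the "-pgo" settings key is our assumption about the toolchain's metadata):

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// ReadBuildInfo returns the build metadata embedded into the running binary.
	info, ok := debug.ReadBuildInfo()
	if !ok {
		fmt.Println("no build info embedded in this binary")
		return
	}
	for _, s := range info.Settings {
		// Binaries built with a profile carry a "-pgo" setting pointing at the
		// pprof file that was used (assumption: Go 1.21+ toolchain).
		if s.Key == "-pgo" {
			fmt.Println("built with PGO profile:", s.Value)
			return
		}
	}
	fmt.Println("built without PGO")
}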
Let's repeat our test, wait for another 20 minutes, and check the results...
Results of the Optimization
According to most blog articles, PGO typically yields a 2-7% improvement.
First, we want to debunk the idea that multiple iterations could improve the application's performance several times over (spoiler alert: they do not).
To find out, we performed two iterations of the process and compared the results:
Both charts represent a 20-minute span of observation, with the application under constant load. The comparison between the Base and the First Round of optimization shows promising results: an improvement of (1-(4.66/4.94))*100=5.67%.
However, as suspected, the Second Round yielded less optimistic results:
🛑 The comparison of 4.94s vs 4.97s shows a (1-(4.94/4.97))*100=0.6% downgrade
We can conclude that the second-round performance is similar to (if not slightly worse than) the baseline measurement. This clearly demonstrates that there's a lot of value in PGO, but also that there's no such thing as infinite optimization, and you can't beat the speed of light :)
The most viable strategy for PGO usage:
Provide a benchmark of the unoptimized Go application.
Observe performance using the Pyroscope client and qryn.
Save the resulting pprof for PGO needs periodically.
Other observations during the comparison
Contrary to my expectations based on articles about PGO, which suggested it would "trim the noodles" on the flamechart by in-lining some functions into their callers, it sometimes does quite the opposite:
(left is baseline, right is optimized)
Next, let's take things a step further and examine the diff view, supported in qryn since version 3.2.29.
Hey, did we mention how we're using ONLY qryn and NOT Pyroscope to run this test? 😏 We're so proud of our polyglot API supporting Continuous Profiling!
Pyroscope diff view
Well, obviously, the most improved parts are GC (Garbage Collection) and the runtime.mcall HTTP server. PGO definitely "trimmed some noodles" there (dark green color in the mid GC part). The rest of the chart needs further investigation.
The Pyroscope diff view is not very intuitive to read. The main downside I see is that it uses a percentage comparison to determine color. Let's consider an example:
Suppose John, Mike, and I had 2 apples each. We had 6 apples overall, and each of us had 33% of all apples. Then one of my apples was "optimized" out. Now we have 5 apples. John and Mike have 40% each, and I have just 20%.
Have my apples been optimized? Yes, and my part will be green on the chart.
Have John's and Mike's apples been downgraded? No, they still have 2 apples each. But their parts will be red because their share of the total increased from 33% to 40%.
With this in mind, we should read the diff view as follows:
green is definitely green
bright red is definitely bright red
faded red may be grey or green
grey may be green
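To make the apples example concrete, here is a toy calculation (with the same made-up numbers) showing why a percent-based diff can color an unchanged sample red:

package main

import "fmt"

func main() {
	// The apples example: absolute counts before and after "optimization".
	before := map[string]float64{"John": 2, "Mike": 2, "me": 2}
	after := map[string]float64{"John": 2, "Mike": 2, "me": 1}
	totalBefore, totalAfter := 6.0, 5.0

	for _, name := range []string{"John", "Mike", "me"} {
		absDelta := after[name] - before[name]
		shareDelta := after[name]/totalAfter*100 - before[name]/totalBefore*100
		// A percent-based diff colors by shareDelta, so John and Mike look
		// "red" (their share grows) even though their counts are unchanged.
		fmt.Printf("%-4s abs change: %+.0f, share change: %+.1f%%\n",
			name, absDelta, shareDelta)
	}
}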
During the comparison process, at some point I came up with an average determiner for a profile: <value of a sample in ns> / <time between the first and last merged observations>. This would have units of ns/sec and would be a more descriptive representation of the difference between profiles. However, it's quite complicated to calculate at this point.
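For illustration, the idea boils down to a tiny helper: divide a sample's accumulated nanoseconds by the length of the observation window, which makes profiles collected over different time spans directly comparable. A minimal sketch, reusing the 20-minute window and the 4.94s/4.66s totals from above as stand-in numbers:

package main

import (
	"fmt"
	"time"
)

// nsPerSecond normalizes a sample value (in nanoseconds) by the length of the
// observation window, i.e. the time between the first and last merged profiles.
func nsPerSecond(sampleNs int64, window time.Duration) float64 {
	return float64(sampleNs) / window.Seconds()
}

func main() {
	// Stand-in numbers: 4.94s of CPU time in the baseline window,
	// 4.66s in the optimized window of equal length.
	window := 20 * time.Minute
	fmt.Printf("baseline:  %.2f ns/sec\n", nsPerSecond(4_940_000_000, window))
	fmt.Printf("optimized: %.2f ns/sec\n", nsPerSecond(4_660_000_000, window))
}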
Another approach would be to select the same amount of time (as I did in the explorer before the diff view chapter) and compare the absolute values:
As you can see here, the absolute difference of the sample is 0ns. But the sample is red due to the 8% difference. It's also worth mentioning that this approach is quite challenging because selection with a mouse without a time picker is never precise.
Conclusion
Profile-Guided Optimization (PGO) offers a relatively straightforward method to achieve measurable improvements in application efficiency. Our experiments with the OpenTelemetry collector have shown several key insights:
Performance Gains: The initial round of PGO optimization resulted in a 5.67% improvement in performance, consistent with the typical 2-7% range reported in the literature.
Diminishing Returns: The second round of optimization showed no additional improvement. This suggests that a single, well-executed PGO pass may be sufficient for most applications.
Visualization Challenges: While the Pyroscope diff view in Grafana is valuable for analyzing optimizations, it requires some practice to read properly. This emphasizes the need for multiple analysis approaches, including absolute value comparisons over fixed time intervals.
Integration with CI/CD: To fully leverage PGO, it should be integrated into the continuous integration and deployment pipeline, with periodic re-optimization (e.g., every few weeks) to adapt to changing usage patterns.
Future of Profiling Tools: The limitations observed in current visualization tools, such as the Grafana "Explore Profiles" plugin, point to opportunities for improvement in profiling and analysis tools, which could further enhance the effectiveness of techniques like PGO.
In conclusion, PGO is a "low hanging fruit" for gaining a few percentage points of performance and represents a valuable tool in the Go developer's toolkit. As profiling tools and visualization techniques continue to evolve, the process of applying and analyzing PGO optimizations is likely to become even more accessible and insightful for development teams.
And the best part - all you need is qryn 😏 the polyglot observability stack
Ready for Polyglot Observability?
Logs, Metrics, Traces and Profiles. Your data, your way.
You can run qryn self-hosted or managed by qryn.cloud