Profiling ClickHouse with LLVM's XRay

Learn how to profile ClickHouse using LLVM's XRay instrumentation profiler, visualize traces, and analyze performance.

Types of profilers

LLVM ships with XRay, a tool that instruments code so that we can do instrumentation profiling. Unlike sampling (statistical) profiling, it is precise and does not miss any calls, at the cost of having to instrument the code and using more resources.

In short, an instrumentation profiler adds new code to track every call to the instrumented functions. A statistical profiler runs the code without any changes and periodically takes snapshots of the application's state, so only the functions that happen to be running when a snapshot is taken are recorded. perf is a well-known statistical profiler.
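
To make the idea of "introducing new code" concrete, here is a minimal hand-written sketch of what instrumentation amounts to: a hook at function entry and another at function exit. XRay does this at the compiler level rather than in the source, so this is only an illustration; ScopedProbe, callLog, slowStep and fastStep are hypothetical names, not part of XRay or ClickHouse.

```cpp
#include <chrono>
#include <string>
#include <thread>
#include <vector>

// One recorded call: the function name and its wall-clock duration.
struct CallRecord
{
    std::string function;
    long long duration_us;
};

// Hypothetical in-memory log. A real profiler would use per-thread buffers.
inline std::vector<CallRecord> & callLog()
{
    static std::vector<CallRecord> log;
    return log;
}

// RAII guard: the constructor acts as the entry hook, the destructor as the exit hook.
class ScopedProbe
{
public:
    explicit ScopedProbe(const char * name)
        : function(name), start(std::chrono::steady_clock::now())
    {
    }

    ~ScopedProbe()
    {
        auto elapsed = std::chrono::steady_clock::now() - start;
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
        callLog().push_back(CallRecord{function, static_cast<long long>(us)});
    }

private:
    const char * function;
    std::chrono::steady_clock::time_point start;
};

void slowStep()
{
    ScopedProbe probe("slowStep"); // the extra code an instrumentation profiler adds
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
}

void fastStep()
{
    ScopedProbe probe("fastStep");
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
}
```

Every call to slowStep or fastStep appends exactly one record to callLog(), which is why no call is lost, and also why the approach costs more than sampling.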

Profiling ClickHouse using XRay's integration

As of ClickHouse 25.12, XRay is integrated so that new instrumentation points can be added to functions on demand. Any official release already includes this feature, and it does not affect overall performance when it is not enabled. The idea is to enable the minimum number of instrumentation points needed to get valuable information.

We can add a new profile instrumentation point using the SYSTEM INSTRUMENT ADD PROFILE statement. The functions available for instrumentation can be looked up in the system.symbols system table. Say we want to profile the sleepForNanoseconds function, which is convenient because we know how long it should take to run.

SYSTEM INSTRUMENT ADD `sleepForNanoseconds` PROFILE

Then, we leave it running for the period we want to profile and stop it by removing the instrumentation points.

SYSTEM INSTRUMENT REMOVE ALL

We convert the data collected in system.trace_log to Chrome format to visualize it in Perfetto. Notice the query_id, cpu_id and stacktrace for every entry.

Profiling a native application using XRay

The following section is left as a reference to know how XRay works under the hood and how it can be used out of the box to profile a native application.

Instrument the code

Imagine the following source code:

#include <chrono>
#include <cstdio>
#include <thread>

void one()
{
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
}

void two()
{
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
}

int main()
{
    printf("Start\n");

    for (int i = 0; i < 10; ++i)
    {
        one();
        two();
    }

    printf("Finish\n");
}

In order to instrument with XRay, we need to add some flags like so:

clang++ -o test test.cpp -fxray-instrument -fxray-instruction-threshold=1
  • -fxray-instrument is needed to instrument the code.
  • -fxray-instruction-threshold=1 is used so that it instruments all functions, even if they're very small as in our example. By default, it instruments functions with at least 200 instructions.
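
As an alternative to lowering the threshold for the whole binary, Clang also offers function attributes that force or forbid instrumentation of individual functions. The following is a minimal sketch, assuming the file is still compiled with -fxray-instrument; the function names are hypothetical, and other compilers will typically ignore the attributes with a warning.

```cpp
// Instrumented even though it is far below the default instruction threshold.
[[clang::xray_always_instrument]] int tinyButInteresting(int x)
{
    return x * 2;
}

// Never instrumented, even when -fxray-instruction-threshold=1 is passed.
[[clang::xray_never_instrument]] int tinyAndNoisy(int x)
{
    return x + 1;
}
```

This keeps the trace focused on the functions of interest instead of recording every small helper.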

We can ensure the code has been instrumented correctly by checking there's a new section in the binary:

objdump -h -j xray_instr_map test

test:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
 17 xray_instr_map 000005c0  000000000002f91c  000000000002f91c  0002f91c  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA

Run the process with proper env var values to collect the trace

By default, nothing is collected unless we explicitly ask for it. In other words, unless we're profiling, the overhead is negligible. We can set different values in XRAY_OPTIONS to configure when the profiler starts collecting and how it does so.

XRAY_OPTIONS="patch_premain=true xray_mode=xray-basic verbosity=1" ./test
==74394==XRay: Log file in 'xray-log.test.14imlN'
Start
Finish
==74394==Cleaned up log for TID: 74394

Convert the trace

XRay's traces can be converted to several formats. The trace_event format is very useful because it's easy to parse and there are already a number of tools that support it, so we'll use that one:

llvm-xray convert --symbolize --instr_map=./test --output-format=trace_event xray-log.test.14imlN | gzip > test-trace.txt.gz
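
To give a feel for why trace_event is easy to work with, here is a rough sketch of the general shape of Chrome's trace event format: a JSON object with a traceEvents array in which each call is a begin ("B") and end ("E") event with a timestamp in microseconds. The helper names are hypothetical, and the exact set of fields that llvm-xray writes may differ from this simplified version.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Formats a single event. phase is 'B' (begin) or 'E' (end); ts_us is in microseconds.
// Assumes name contains no characters that need JSON escaping.
std::string traceEvent(const std::string & name, char phase, long long ts_us, int pid, int tid)
{
    std::ostringstream out;
    out << "{\"name\":\"" << name << "\",\"ph\":\"" << phase << "\",\"ts\":" << ts_us
        << ",\"pid\":" << pid << ",\"tid\":" << tid << "}";
    return out.str();
}

// Wraps already formatted events into the top-level document.
std::string traceDocument(const std::vector<std::string> & events)
{
    std::ostringstream out;
    out << "{\"traceEvents\":[";
    for (std::size_t i = 0; i < events.size(); ++i)
    {
        if (i != 0)
            out << ",";
        out << events[i];
    }
    out << "]}";
    return out.str();
}
```

For example, a 10 ms call to one() on thread 1 would be traceEvent("one()", 'B', 0, 1, 1) followed by traceEvent("one()", 'E', 10000, 1, 1).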

Visualize the trace

We can use web-based UIs like speedscope.app or Perfetto.

While Perfetto makes it easier to visualize multiple threads and query the data, speedscope is better at generating a flamegraph and a sandwich view of your data.

speedscope offers three views of the trace: Time Order, Left Heavy, and Sandwich.

Check out the docs
