# Quickwords 26 Linux Perf Under The Hood

Plenty of articles on the Internet explain how to use perf, yet few discuss its source code, mechanisms, and architecture under the hood. For anyone who is not satisfied with a black box, this article tries to fill that gap.

## Hardware Background Knowledge

You probably know that perf works by retrieving CPU hardware PMU (Performance Monitoring Unit) counter values at some regular sampling frequency. The hardware values are accessed through MSR (Model-Specific Register) operations. Note that MSR is a general designation for various kinds of registers, and the PMU registers are just a small portion of them. To be more specific, we are mainly talking about the Performance Event Select Registers and the Performance Monitoring Counters (PMC), which together make up the PMU. Software programs a Performance Event Select Register, and the value of the selected performance event is then streamed out via the corresponding PMC.

The Performance Event Select Register has quite a lot of control flags that let you specify which performance event to monitor and how to monitor it. This article does not try to cover this topic; refer to the Intel Software Developer's Manual for details.
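To give a feel for those control flags, here is a minimal sketch that packs an event code and unit mask into the layout of an architectural IA32_PERFEVTSELx register. The bit positions follow the Intel SDM; the macro names and the `evtsel_encode()` helper are made up for this illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Field layout of IA32_PERFEVTSELx (architectural performance monitoring).
 * Macro names are hypothetical, chosen for this sketch only. */
#define EVTSEL_EVENT(x)  ((uint64_t)((x) & 0xFF))        /* bits 0-7: event select */
#define EVTSEL_UMASK(x)  ((uint64_t)((x) & 0xFF) << 8)   /* bits 8-15: unit mask */
#define EVTSEL_USR       (1ULL << 16)  /* count while in user mode */
#define EVTSEL_OS        (1ULL << 17)  /* count while in kernel mode */
#define EVTSEL_EDGE      (1ULL << 18)  /* edge detect */
#define EVTSEL_INT       (1ULL << 20)  /* interrupt on counter overflow */
#define EVTSEL_EN        (1ULL << 22)  /* enable the counter */
#define EVTSEL_INV       (1ULL << 23)  /* invert the counter mask comparison */
#define EVTSEL_CMASK(x)  ((uint64_t)((x) & 0xFF) << 24)  /* bits 24-31: counter mask */

/* Build the register value for a plain counting setup (no interrupt). */
static uint64_t evtsel_encode(unsigned event, unsigned umask, int user, int kernel)
{
    assert(event <= 0xFF && umask <= 0xFF);

    uint64_t v = EVTSEL_EVENT(event) | EVTSEL_UMASK(umask) | EVTSEL_EN;
    if (user)
        v |= EVTSEL_USR;
    if (kernel)
        v |= EVTSEL_OS;
    return v;
}
```

For example, `evtsel_encode(0xC0, 0x00, 1, 1)` describes the architectural "instructions retired" event counted in both user and kernel mode. Writing the value into the actual MSR requires ring 0 (or the msr driver) and is left out here.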

The PMC value can be proactively counted (read on demand) as well as reactively sampled (recorded when the counter overflows). In this article we narrow our scope to the proactive manner only, as we would like to concentrate on the “big picture”.

## How perf Works

In order to understand how perf works we need to answer the following questions:

• How does perf obtain PMC value via file operation

These are basically the core functionalities of perf, and every one of its fancy features is built on them.

Compared to something like rdtsc, MSRs are relatively difficult to interact with. Even though the instructions have similar names (rdtsc, rdpmc), you need a whole software framework to build up the APIs.

As for the underlying implementation, look at struct pmu (/include/linux/perf_event.h), which consists of a bunch of operation function pointers. The implementations these pointers refer to are the real workers for PMU tasks.
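The function-pointer pattern can be sketched in user space as follows. This is a toy model, not the kernel definition: `struct toy_pmu`, `struct toy_event`, and the `fake_hw` field standing in for a hardware counter are all invented for illustration, and only a few of the operations are mirrored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_event;

/* A table of operations, in the spirit of struct pmu. */
struct toy_pmu {
    const char *name;
    int  (*event_init)(struct toy_event *event);
    void (*start)(struct toy_event *event, int flags);
    void (*stop)(struct toy_event *event, int flags);
    void (*read)(struct toy_event *event);
};

struct toy_event {
    const struct toy_pmu *pmu; /* which backend serves this event */
    uint64_t count;            /* last value published to the reader */
    uint64_t fake_hw;          /* stand-in for a hardware counter */
    int running;
};

static int toy_event_init(struct toy_event *e) { e->count = 0; e->running = 0; return 0; }
static void toy_start(struct toy_event *e, int flags) { (void)flags; e->running = 1; }
static void toy_stop(struct toy_event *e, int flags) { (void)flags; e->running = 0; }
static void toy_read(struct toy_event *e) { if (e->running) e->count = e->fake_hw; }

/* The backend fills in the pointers once; generic code only sees the table. */
static const struct toy_pmu toy_pmu = {
    .name       = "toy",
    .event_init = toy_event_init,
    .start      = toy_start,
    .stop       = toy_stop,
    .read       = toy_read,
};

/* Generic layer: dispatch through the table without knowing the backend. */
static uint64_t toy_event_read(struct toy_event *e)
{
    assert(e != NULL && e->pmu != NULL && e->pmu->read != NULL);
    e->pmu->read(e);
    return e->count;
}
```

The point of the indirection is that the generic layer never names a concrete backend; supporting another architecture means providing another filled-in table.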

In the case of x86, the implementations are located in /arch/x86/events/intel/core.c. The addresses of the worker functions are assigned to the corresponding function pointers in struct x86_pmu (/arch/x86/events/core.c).

Take x86_pmu_read, which reads the PMU for a particular kind of event, as an example. This function eventually invokes x86_perf_event_update() (/arch/x86/events/core.c), which obtains the PMC value via rdpmcl, a wrapper around the underlying rdpmc assembly instruction, and records it in a compare-and-swap manner.
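The compare-and-swap update can be modelled in user space with C11 atomics. The sketch below is a simplified imitation of the idea, not the kernel code: `struct hw_state`, `event_update()`, and `fake_rdpmc()` are hypothetical names, the counter is read through a callback instead of rdpmc, and a 48-bit counter width is assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CNTVAL_BITS 48 /* assumed hardware counter width */

struct hw_state {
    _Atomic uint64_t prev_count; /* raw value seen at the previous update */
    _Atomic uint64_t count;      /* accumulated 64-bit event count */
};

typedef uint64_t (*read_raw_fn)(void *ctx);

/* Stand-in for rdpmc: returns whatever the caller stored in *ctx. */
static uint64_t fake_rdpmc(void *ctx) { return *(const uint64_t *)ctx; }

static uint64_t event_update(struct hw_state *hw, read_raw_fn read_raw, void *ctx)
{
    const int shift = 64 - CNTVAL_BITS;
    uint64_t prev_raw, new_raw, delta;

    assert(hw != NULL && read_raw != NULL);

    /* Retry if someone else (e.g. an interrupt handler in the kernel case)
     * changed prev_count between our load and our exchange. */
    do {
        prev_raw = atomic_load(&hw->prev_count);
        new_raw = read_raw(ctx);
    } while (!atomic_compare_exchange_strong(&hw->prev_count, &prev_raw, new_raw));

    /* Shift up and back down so the subtraction wraps at the counter width. */
    delta = (new_raw << shift) - (prev_raw << shift);
    delta >>= shift;

    atomic_fetch_add(&hw->count, delta);
    return new_raw;
}
```

Only the delta since the previous read is accumulated, so the software count keeps growing correctly even when the narrower hardware counter wraps around.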

### How Does perf Obtain PMC Value via File Operation

A complete perf_event consists of the PMC value, which is the critical part, plus a handful of linked-list, status, and statistics members. It is defined as struct perf_event in /include/linux/perf_event.h.

A perf_event is exposed to user-space applications (stat, top, record, etc.) as a file descriptor. These applications interact with the perf_event, and thus with the PMU, by invoking normal file operations on the corresponding file descriptor.

To the application it looks like a plain read on a file descriptor, but the read has actually been routed to the handlers in static const struct file_operations perf_fops in /kernel/events/core.c. The read function really invoked is perf_read, which eventually calls pmu->read to stream out the PMC value.

The perf_event and its file descriptor are initialized by sys_perf_event_open in /kernel/events/core.c. On success, the new file descriptor is returned.
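From user space, the whole path (open, enable, plain read) can be exercised with a few lines. The following is a minimal sketch modelled on the pattern shown in the perf_event_open(2) man page; `count_event()` and `busy_work()` are helper names invented here, and whether the call succeeds depends on the machine (perf_event_paranoid setting, containers, virtualized PMUs).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc provides no wrapper, so go through syscall(2). */
static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                           int group_fd, unsigned long flags)
{
    return (int)syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Some work to measure (hypothetical helper). */
static void busy_work(void)
{
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        x += i;
}

/* Count one event for the calling thread while work() runs.
 * Returns 0 and stores the count in *out, or -1 with errno set. */
static int count_event(uint32_t type, uint64_t config, void (*work)(void), uint64_t *out)
{
    struct perf_event_attr attr;
    assert(work != NULL && out != NULL);

    memset(&attr, 0, sizeof(attr));
    attr.type = type;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;       /* start stopped, enable explicitly below */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0 /* this thread */, -1 /* any CPU */, -1, 0);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    /* With the default read_format, read() yields a single u64 count. */
    ssize_t n = read(fd, out, sizeof(*out));
    int saved = errno;
    close(fd);
    errno = saved;
    return n == (ssize_t)sizeof(*out) ? 0 : -1;
}
```

Calling `count_event(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS, busy_work, &n)` requests retired instructions from the hardware PMU; a software event such as `PERF_COUNT_SW_TASK_CLOCK` is a reasonable fallback where no hardware counters are available.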

In a real application, builtin-stat.c for instance, __run_perf_stat successively calls process_interval(), read_counters(), read_counter(), perf_evsel__read_counter(), perf_evsel__read_one(), perf_evsel__read(), and readn().
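The last link of that chain is essentially a loop around read(). Below is a rough sketch of what such a helper does, assuming the usual "retry on EINTR, continue after short reads" behaviour; the name `readn_sketch()` is chosen to make clear this is an illustration rather than the code in the perf tree.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Read exactly n bytes from fd unless EOF or an error occurs.
 * Returns n on success, 0 on EOF, or a negative value on error. */
static ssize_t readn_sketch(int fd, void *buf, size_t n)
{
    char *p = buf;
    size_t left = n;

    assert(buf != NULL);

    while (left > 0) {
        ssize_t ret = read(fd, p, left);

        if (ret < 0 && errno == EINTR)
            continue;   /* interrupted by a signal: try again */
        if (ret <= 0)
            return ret; /* EOF or a real error */

        left -= (size_t)ret;
        p += ret;
    }
    return (ssize_t)n;
}
```

For a counter fd the buffer is typically just a few 64-bit words (the value, plus optional fields selected by read_format), so the loop usually finishes in a single iteration.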

From the application's perspective, multiple file descriptors/counters can be polled in the same way as with a normal poll() operation. The only difference is that the fd's customized poll() function is invoked.
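On the application side this is ordinary poll() code. The sketch below polls a small set of descriptors and reads one 64-bit value from each descriptor reported readable; `poll_and_read()` and `MAX_FDS` are hypothetical names, and nothing in it is perf-specific. Note that for a perf fd, readiness is generally tied to the mmap'ed ring buffer used in sampling mode, which this article does not cover.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

#define MAX_FDS 16 /* arbitrary limit for this sketch */

/* Poll nfds descriptors; for each readable one, read a u64 into values[i]
 * and set got[i] to 1. Returns the number of values read, or -1 on error. */
static int poll_and_read(const int *fds, size_t nfds, int timeout_ms,
                         uint64_t *values, int *got)
{
    struct pollfd pfds[MAX_FDS];
    int nread = 0;

    assert(nfds <= MAX_FDS);

    for (size_t i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
        got[i] = 0;
    }

    if (poll(pfds, (nfds_t)nfds, timeout_ms) < 0)
        return -1;

    for (size_t i = 0; i < nfds; i++) {
        if (!(pfds[i].revents & POLLIN))
            continue; /* not readable (idle, hung up, or in error) */
        if (read(fds[i], &values[i], sizeof(values[i])) == (ssize_t)sizeof(values[i])) {
            got[i] = 1;
            nread++;
        }
    }
    return nread;
}
```

The kernel picks the right poll implementation from each descriptor's file_operations table, which is why the caller does not need to know what kind of file sits behind each fd.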