Using the TRACE_EVENT() macro (Part 1)

标签：__ TRACE tracepoint next Part Using prev EVENT

如果你对tracepoint后面的机制感兴趣，可以阅读关于TRACE_EVENT的三篇文章。这一系列文章对tracepoint后面的机制进行了详细讲解,以及其使用方法都有涉及。该系列文章出现在lwn.net上。注意，不要被英文吓住了，我保证，你认真读完这系列文章后，tracepoint后面的秘密不在是秘密。

March 24, 2010

本文主要说明了使用TRACE_EVENT(...)的使用细节，至于TRACE_EVENT(...)是如何工作的,本文暂未涉及,在part3中会详细描述。

Throughout the history of Linux, people have been wanting to add static tracepoints — functions that record data at a specific site in the kernel for later retrieval(稍后检索) — to the kernel.

Those efforts weren't very successful because of the fear(担心) that tracepoints would sacrifice(牺牲) performance.

Unlike the Ftrace function tracer, a tracepoint can record more than just the function being entered. A tracepoint can record local variables of the function.

Over time, various strategies for adding tracepoints have been tried, with varying success, and the TRACE_EVENT() macro is the latest way to add kernel tracepoints.

1、History

Mathieu Desnoyers worked on adding a very low overhead(开销) tracer hook called trace markers(跟踪标记).

Even though the trace markers solved the performance issue by using cleverly crafted(巧妙设计的) macros,

the information that the trace marker would record was embedded at the location in the core kernel as a printf format.

This upset(使心烦意乱) several core kernel developers as it made the core kernel code look like debug code was left scattered(分散的) throughout.

In trying to appease(安抚) the kernel developers, Mathieu came up with tracepoints.

The tracepoint included a function call in the kernel code that, when enabled, would call a callback function passing the parameters of the tracepoint to that function

as if the callback function was called with those parameters. This was much better than the trace markers since it allowed the passing of type casted pointers that the callback functions could dereference, as opposed to the marker interface, which required the callback function to parse a string.

With the tracepoint, the callback function could efficiently take whatever it needed from the structures.

Although this was an improvement over trace markers, it was still too tedious(繁琐) for developers to create a callback for every tracepoint they wanted to add,

so that a tracer would output its data. The kernel needed a more automated(自动的) way to connect a tracer to the tracepoints.

That would require automating the creation of the callback and also format its data, much like what the trace marker did, but it should be done in the callback, and not at the tracepoint site in the kernel code.

To solve this issue of automating the tracepoints(为了解决自动化跟踪点的问题), the TRACE_EVENT() macro was born.

Inspired by Tom Zanussi's zedtrace, this macro was specifically made to allow a developer to add tracepoints to their subsystem and have Ftrace automatically be able to trace them.

The developer need not understand how Ftrace works, they only need to create their tracepoint using the TRACE_EVENT() macro.

In addition, they need to follow some guidelines(指导方针) in how to create a header file and they would gain full access to the Ftrace tracer.

Another objective of the design of the TRACE_EVENT() macro was to not couple(耦合) it to Ftrace or any other tracer.

It is agnostic to the tracers that use it, which is apparent now that TRACE_EVENT() is also used by perf, LTTng and SystemTap.

2、The anatomy(解析) of the TRACE_EVENT() macro

Automating tracepoints had various requirements that must be fulfilled(被满足):

It must create a tracepoint that can be placed in the kernel code.
It must create a callback function that can be hooked to this tracepoint.
The callback function must be able to record the data passed to it into the tracer ring buffer in the fastest way possible.
It must create a function that can parse the data recorded to the ring buffer and translate it to a human readable format that the tracer can display to a user.

To accomplish that, the TRACE_EVENT() macro is broken into six components, which correspond to the parameters of the macro:

TRACE_EVENT(name, proto, args, struct, assign, print)

name - the name of the tracepoint to be created.
prototype(原型) - the prototype for the tracepoint callbacks(跟踪点回调函数原型)
args - the arguments that match the prototype(与原型匹配的参数).
struct - the structure that a tracer could use (but is not required to) to store the data passed into the tracepoint.
assign - the C-like way to assign the data to the structure.
print - the way to output the structure in human readable ASCII format.

A good example of a tracepoint definition, for sched_switch, can be found here. That definition will be used below to describe each of the parts of TRACE_EVENT() macro.

All parameters except the first one are encapsulated with another macro (TP_PROTO, TP_ARGS, TP_STRUCT__entry, TP_fast_assign and TP_printk).

These macros give more control in processing and also allow commas(逗号) to be used within the TRACE_EVENT() macro.

Name

The first parameter is the name.

TRACE_EVENT(sched_switch,

This is the name used to call this tracepoint. The actual tracepoint that is used has trace_ prefixed(trace_前缀) to the name (ie. trace_sched_switch).

Prototype(原型)

The next parameter is the prototype.

TP_PROTO(struct rq *rq, struct task_struct *prev, struct task_struct *next),

The prototype is written as if you were to declare the tracepoint directly:

trace_sched_switch(struct rq *rq, struct task_struct *prev,struct task_struct *next);

It is used as the prototype for both the tracepoint added to the kernel code and for the callback function.

Remember, a tracepoint calls the callback functions as if the callback functions were being called at the location of the tracepoint.

Arguments

The third parameter is the arguments used by the prototype.

TP_ARGS(rq, prev, next),

It may seem strange that this is needed, but it is not only required by the TRACE_EVENT() macro, it is also required by the tracepoint infrastructure underneath.

The tracepoint code, when activated, will call the callback functions (more than one callback may be assigned to a given tracepoint).

The macro that creates the tracepoint must have access to both the prototype and the arguments.

Below is an illustration of what a tracepoint macro would need to accomplish this:

#define TRACE_POINT(name, proto, args) \

void trace_##name(proto) \

{ \

if (trace_##name##_active) \

callback(args); \

}

Structure

The fourth parameter is a bit more complex.

TP_STRUCT__entry(

__array( char, prev_comm, TASK_COMM_LEN )

__field( pid_t, prev_pid )

__field( int, prev_prio )

__field( long, prev_state )

__array( char, next_comm, TASK_COMM_LEN )

__field( pid_t, next_pid )

__field( int, next_prio )

This parameter describes the structure layout of the data that will be stored in the tracer's ring buffer(该参数描述将存储在跟踪器环形缓冲区中的数据的结构布局).

Each element of the structure is defined by another macro.

These macros are used to automate the creation of a structure and are not function-like. Notice that the macros are not separated by any delimiter(分隔符) (no comma nor semicolon).

The macros used by the sched_switch tracepoint are:

__field(type, name) - this defines a normal structure element, like int var; where type is int and name is var.
__array(type, name, len) - this defines an array item, equivalent to int name[len]; where the type is int ,the name of the array is array and the number of items in the array is len.

There are other element macros that will be described in a later article. The definition from the sched_switch tracepoint would produce a structure that looks like:

struct {

char prev_comm[TASK_COMM_LEN];

pid_t prev_pid;

int prev_prio;

long prev_state;

char next_comm[TASK_COMM_LEN];

pid_t next_pid;

int next_prio;

};

Note that the spacing used in the TP_STRUCT__entry definition breaks(违反) the rules outlined by checkpatch.pl.

That is done because these macros are not function-like but, instead, are used to define a structure.

The spacing follows the rules of structure spacing and not of function spacing, so that the names line up in the structure declaration.

Needless to say, checkpatch.pl fails horribly when processing changes to TRACE_EVENT() definitions.

Assignment

The fifth parameter defines the way the data from the parameters is saved to the ring buffer(参数中的数据保存到环形缓冲区的方式).

TP_fast_assign(

memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);

__entry->prev_pid = prev->pid;

__entry->prev_prio = prev->prio;

__entry->prev_state = prev->state;

memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);

__entry->next_pid = next->pid;

__entry->next_prio = next->prio;

The code within the TP_fast_assign() is normal C code.

A special variable __entry represents the pointer to a structure type defined by TP_STRUCT__entry and points directly into the ring buffer.

The TP_fast_assign is used to fill all fields created in TP_STRUCT__entry.

The variable names of the parameters defined by TP_PROTO and TP_ARGS can then be used to assign the appropriate data into the __entry structure.

The last parameter defines how a printk() can be used to print out the fields from the TP_STRUCT__entry structure.

TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s ==> " \

"next_comm=%s next_pid=%d next_prio=%d",

__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,

__entry->prev_state ?

__print_flags(__entry->prev_state, "|",

{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },

{ 16, "Z" }, { 32, "X" }, { 64, "x" },

{ 128, "W" }) : "R",

__entry->next_comm, __entry->next_pid, __entry->next_prio)

Once again the variable __entry is used to reference the pointer to the structure that contains the data.

The format string is just like any other printf format.

The __print_flags() is part of a set of helper functions that come with TRACE_EVENT(), and will be covered in another article.

Do not create new tracepoint-specific helpers, because that will confuse user-space tools that know about the TRACE_EVENT() helper macros

but will not know how to handle ones created for individual tracepoints.

Format file

The sched_switch TRACE_EVENT() macro produces the following format file in /sys/kernel/debug/tracing/events/sched/sched_switch/format:

name: sched_switch

ID: 33

format:

field:unsigned short common_type; offset:0; size:2;

field:unsigned char common_flags; offset:2; size:1;

field:unsigned char common_preempt_count; offset:3; size:1;

field:int common_pid; offset:4; size:4;

field:int common_lock_depth; offset:8; size:4;

field:char prev_comm[TASK_COMM_LEN]; offset:12; size:16;

field:pid_t prev_pid; offset:28; size:4;

field:int prev_prio; offset:32; size:4;

field:long prev_state; offset:40; size:8;

field:char next_comm[TASK_COMM_LEN]; offset:48; size:16;

field:pid_t next_pid; offset:64; size:4;

field:int next_prio; offset:68; size:4;

print fmt: "task %s:%d [%d] (%s) ==> %s:%d [%d]", REC->prev_comm, REC->prev_pid,

REC->prev_prio, REC->prev_state ? __print_flags(REC->prev_state, "|", { 1, "S"} ,

{ 2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" }, { 64, "x" }, { 128,

"W" }) : "R", REC->next_comm, REC->next_pid, REC->next_prio

Note: Newer kernels may also display a signed entry for each field.

Notice that __entry is replaced with REC in the format file.

The first set of fields (common_*) are not from the TRACE_EVENT() macro, but are added to all events by Ftrace, which created this format file, other tracers could add different fields.

The format file provides user-space tools the information needed to parse the binary output containing sched_switch entries.

3、The header file

The TRACE_EVENT() macro cannot just be placed anywhere in the expectation that it will work with Ftrace or any other tracer.

The header file that contains the TRACE_EVENT() macro must follow a certain format.

These header files typically are located in the include/trace/events directory but do not need to be(但并不需要这样做). If they are not located in this directory, then other configurations are necessary.

The first line in the TRACE_EVENT() header is not the normal #ifdef _TRACE_SCHED_H, but instead has:

#undef TRACE_SYSTEM

#define TRACE_SYSTEM sched

#if !defined(_TRACE_SCHED_H) || defined(TRACE_HEADER_MULTI_READ)

#define _TRACE_SCHED_H

This example is for scheduler trace events, other event headers would use something other than sched and _TRACE_SCHED_H.

The TRACE_HEADER_MULTI_READ test allows this file to be included more than once; this is important for the processing of the TRACE_EVENT() macro.

The TRACE_SYSTEM must also be defined for the file and must be outside the guard of the #if.

The TRACE_SYSTEM defines what group the TRACE_EVENT() macros in the file belong to.

This is also the directory name that the events will be grouped under in the debugfs tracing/events directory.

This grouping is important for Ftrace as it allows the user to enable or disable events by group.

The file then includes any headers required by the contents of the TRACE_EVENT() macro. (e.g. #include <linux/sched.h>). The tracepoint.h file is required.

#include <linux/tracepoint.h>

All the trace events can now be defined with TRACE_EVENT() macros.

Please include comments that describe the tracepoint above the TRACE_EVENT() macros. Look at include/trace/events/sched.h as an example. The file ends with:

#endif /* _TRACE_SCHED_H */

/* This part must be outside protection */

#include <trace/define_trace.h>

The define_trace.h is where all the magic(魔力) lies in(在于) creating the tracepoints. The explanation of how this file works will be left to another article.

For now, it is sufficient to know that this file must be included at the bottom of the trace header file outside the protection of the #endif.

4、Using the tracepoint

Defining the tracepoint is meaningless(无意义的) if it is not used anywhere.

To use the tracepoint, the trace header must be included, but one C file (and only one) must also define CREATE_TRACE_POINTS before including the trace.

This will cause the define_trace.h to create the necessary functions needed to produce the tracing events. In kernel/sched.c the following is defined:

#define CREATE_TRACE_POINTS

#include <trace/events/sched.h>

If another file needs to use tracepoints that were defined in the trace file, then it only needs to include the trace file, and does not need to define CREATE_TRACE_POINTS.

Defining it more than once for the same header file will cause linker errors when building. For example, in kernel/fork.c only the header file is included:

#include <trace/events/sched.h>

Finally, the tracepoint is used in the code just as it was defined in the TRACE_EVENT() macro:

static inline void

context_switch(struct rq *rq, struct task_struct *prev,

struct task_struct *next)

{

struct mm_struct *mm, *oldmm;

prepare_task_switch(rq, prev, next);

trace_sched_switch(rq, prev, next);

mm = next->mm;

oldmm = prev->active_mm;

Coming soon

This article explained all that is needed to create a basic tracepoint within the core kernel code.

Part 2 will describe how to consolidate(巩固) tracepoints to keep the tracing footprint small, along with information about the TP_STRUCT__entry macros and TP_printk helper functions (like __print_flags).

Part 3 will look at defining tracepoints outside of the include/trace/events directory (for modules and architecture-specific tracepoints) as well as a look at how the TRACE_EVENT() macro does its magic.

Both articles will have a few practical examples of how to use tracepoints. Stay tuned ...

标签：__,TRACE,tracepoint,next,Part,Using,prev,EVENT
From： https://blog.csdn.net/qq_30896803/article/details/136709212

Using the TRACE_EVENT() macro (Part 1)

相关文章

赞助商

阅读排行