Linux Kernel Debugging Techniques: Detecting R-State Deadlocks in Process Context



I. Analysis of the lockup detector mechanism

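The lockup detector is implemented in kernel/watchdog.c; this article uses the Linux 4.1.15 source as its reference. The design rests on the different priorities of process context, interrupts, and NMIs, which are ordered "process context < interrupt < NMI". Process context has the lowest priority, so an interrupt can be used to monitor whether a task is still making progress. An NMI has the highest priority: because it cannot be masked, its handler still runs when a deadlock occurs in interrupt context, so it can be used to detect deadlocks inside interrupt handlers. Unfortunately, the vast majority of current arm32 chips do not support NMIs, and that includes the bcm2835 in my Raspberry Pi. As the file name suggests, the code effectively implements a software watchdog. The overall software flow is shown in the following diagram:

[Figure: overall software flow of the lockup detector]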

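The code creates one task and one high-resolution timer per CPU. The task feeds the watchdog; the timer wakes the feeding task and checks whether a task has locked up, raising an alarm when it finds one. The rest of this section walks through the source code in detail.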
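Before turning to the kernel source, the feed-and-check idea can be modeled in user space. The sketch below is not kernel code: `wd_model`, `wd_feed()` and `wd_check()` are hypothetical names, time is a plain counter in seconds, and the 20-second threshold used in the usage note is simply the default timeout quoted in this article.

```c
#include <stdint.h>

/* Hypothetical user-space model of one CPU's soft watchdog state. */
struct wd_model {
    uint64_t touch_ts;  /* last time the feeding task ran, in seconds */
    uint64_t thresh;    /* lockup threshold, in seconds */
};

/* Called by the feeding task whenever it gets to run: "feed the dog". */
static void wd_feed(struct wd_model *wd, uint64_t now)
{
    wd->touch_ts = now;
}

/*
 * Called from the periodic timer. Returns how long the CPU has gone
 * without a feed if the threshold has been exceeded, or 0 otherwise.
 */
static uint64_t wd_check(const struct wd_model *wd, uint64_t now)
{
    if (now > wd->touch_ts + wd->thresh)
        return now - wd->touch_ts;
    return 0;
}
```

With `thresh = 20` and a last feed at t = 100, a check at t = 120 still passes, while a check at t = 121 reports a 21-second stall. The point of the design is that the check runs in a higher-priority context (the timer interrupt) than the task doing the feeding, so a task that hogs the CPU cannot also silence the check.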


void __init lockup_detector_init(void)
{
    set_sample_period();

    if (watchdog_enabled)
        watchdog_enable_all_cpus();
}

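The entry point is lockup_detector_init(), which is reached during boot along the path start_kernel() --> rest_init() --> kernel_init() (started as a kernel thread) --> kernel_init_freeable() --> lockup_detector_init(). It first computes the expiry interval of the high-resolution timer (that is, the feeding period), which is 1/5 of the detection timeout and defaults to 4s (20s/5). It then checks the enable flag to decide whether to start the detector. When hard lockup detection is not enabled, this flag defaults to SOFT_WATCHDOG_ENABLED, meaning that soft lockup detection is on. The kernel also provides the following __setup hooks, so that the value and the switches can be set from the kernel command line: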


static int __init softlockup_panic_setup(char *str)
{
    softlockup_panic = simple_strtoul(str, NULL, 0);

    return 1;
}
__setup("softlockup_panic=", softlockup_panic_setup);

static int __init nowatchdog_setup(char *str)
{
    watchdog_enabled = 0;

    return 1;
}
__setup("nowatchdog", nowatchdog_setup);

static int __init nosoftlockup_setup(char *str)
{
    watchdog_enabled &= ~SOFT_WATCHDOG_ENABLED;

    return 1;
}
__setup("nosoftlockup", nosoftlockup_setup);
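The sample-period arithmetic and the effect of the two switches can be checked with a small user-space model. This is a sketch under stated assumptions rather than the kernel's implementation: `sample_period_ns()` and `apply_boot_arg()` are hypothetical helpers, `HARD_DETECTOR_ENABLED` is a stand-in name for the hard lockup detector's bit, and the bit positions are chosen only for illustration.

```c
#include <stdint.h>
#include <string.h>

#define NSEC_PER_SEC 1000000000ULL

/* Illustrative bit values; HARD_DETECTOR_ENABLED is a stand-in name. */
#define HARD_DETECTOR_ENABLED (1u << 0)
#define SOFT_WATCHDOG_ENABLED (1u << 1)

/* Feeding period: one fifth of the detection timeout, in nanoseconds. */
static uint64_t sample_period_ns(uint64_t timeout_sec)
{
    return timeout_sec * (NSEC_PER_SEC / 5);
}

/*
 * Model of the two switches above: "nowatchdog" clears every bit,
 * "nosoftlockup" clears only the soft detector's bit, and any other
 * argument leaves the mask unchanged.
 */
static unsigned int apply_boot_arg(unsigned int enabled, const char *arg)
{
    if (strcmp(arg, "nowatchdog") == 0)
        return 0;
    if (strcmp(arg, "nosoftlockup") == 0)
        return enabled & ~SOFT_WATCHDOG_ENABLED;
    return enabled;
}
```

For the default 20-second timeout, `sample_period_ns(20)` yields 4,000,000,000 ns, matching the 4s figure above. Keeping the switches in a bitmask lets nosoftlockup turn off the soft detector while leaving any other detector's bit untouched, whereas nowatchdog turns everything off at once.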

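Assuming soft lockup detection is enabled, watchdog_enable_all_cpus() is called next. It tries to create a watchdog-feeding task for each CPU (the task's main function does not start running immediately):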


static int watchdog_enable_all_cpus(void)
{
    int err = 0;

    if (!watchdog_running) {
        err = smpboot_register_percpu_thread(&watchdog_threads);
        if (err)
            pr_err("Failed to create watchdog threads, disabled\n");
        else
            watchdog_running = 1;
    } else {
        /*
         * Enable/disable the lockup detectors or
         * change the sample period 'on the fly'.
         */
        update_watchdog_all_cpus();
    }

    return err;
}

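The function first checks whether the tasks have already been started. If not, it calls smpboot_register_percpu_thread() to create them; otherwise it calls update_watchdog_all_cpus() to update the timer expiry. Taking the first branch first, here is the definition of the watchdog_threads structure: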


static struct smp_hotplug_thread watchdog_threads = {
    .store              = &softlockup_watchdog,
    .thread_should_run  = watchdog_should_run,
    .thread_fn          = watchdog,
    .thread_comm        = "watchdog/%u",
    .setup              = watchdog_enable,
    .cleanup            = watchdog_cleanup,
    .park               = watchdog_disable,
    .unpark             = watchdog_enable,
};

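The structure registers a number of callbacks. In brief:
(1) softlockup_watchdog is a global per-CPU pointer that stores the task descriptor (task_struct) of the created task;
(2) watchdog_should_run() is the run-check function: it decides whether the task needs to call the function that thread_fn points to;
(3) watchdog() is the task's main function and performs the actual feeding;
(4) the setup callback, watchdog_enable, is called when the task first starts; it creates the high-resolution timer that wakes the feeding task and checks for the lockup timeout;
(5) the cleanup callback tears the task down and stops the timer;
(6) the park and unpark callbacks suspend and resume the task;
(7) thread_comm is the task name: watchdog/0 on cpu0, watchdog/1 on cpu1, and so on.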
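To make the callback roles concrete, here is a user-space toy model of how a per-CPU thread loop might dispatch them. It is an illustration under my own assumptions, not the code in kernel/smpboot.c: `hotplug_thread_model`, `thread_step()`, the status enum and the `demo_*` callbacks are all hypothetical, and the cleanup path is left out to keep the sketch short.

```c
#include <stddef.h>

/* Hypothetical, simplified mirror of the callbacks described above. */
struct hotplug_thread_model {
    int  (*thread_should_run)(unsigned int cpu);
    void (*thread_fn)(unsigned int cpu);
    void (*setup)(unsigned int cpu);
    void (*park)(unsigned int cpu);
    void (*unpark)(unsigned int cpu);
};

enum thread_status { THREAD_NONE, THREAD_ACTIVE, THREAD_PARKED };

/* One iteration of the modeled thread loop; returns the new status. */
static enum thread_status thread_step(const struct hotplug_thread_model *ht,
                                      unsigned int cpu,
                                      enum thread_status st,
                                      int park_requested)
{
    if (park_requested) {
        if (st == THREAD_ACTIVE && ht->park) {
            ht->park(cpu);              /* suspend: e.g. stop the timer */
            return THREAD_PARKED;
        }
        return st;
    }

    switch (st) {
    case THREAD_NONE:                   /* first run: one-time setup */
        if (ht->setup)
            ht->setup(cpu);
        return THREAD_ACTIVE;
    case THREAD_PARKED:                 /* resuming after a park */
        if (ht->unpark)
            ht->unpark(cpu);
        return THREAD_ACTIVE;
    case THREAD_ACTIVE:                 /* normal operation */
        if (ht->thread_should_run(cpu))
            ht->thread_fn(cpu);
        return THREAD_ACTIVE;
    }
    return st;
}

/* Demo callbacks that append one letter each to a trace buffer. */
static char trace[32];
static size_t trace_len;

static void log_ev(char c)
{
    if (trace_len + 1 < sizeof(trace)) {
        trace[trace_len++] = c;
        trace[trace_len] = '\0';
    }
}

static int  demo_should_run(unsigned int cpu) { (void)cpu; return 1; }
static void demo_fn(unsigned int cpu)     { (void)cpu; log_ev('F'); }
static void demo_setup(unsigned int cpu)  { (void)cpu; log_ev('S'); }
static void demo_park(unsigned int cpu)   { (void)cpu; log_ev('P'); }
static void demo_unpark(unsigned int cpu) { (void)cpu; log_ev('U'); }
```

Stepping a fresh thread through "run, run, park request, run, run" with the demo callbacks leaves the trace S, F, P, U, F: setup fires once on the first pass, thread_fn runs only while the thread is active, and park/unpark bracket the suspended period.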

Next, a brief look at how smpboot_register_percpu_thread() creates a task for each CPU, and where the callbacks above are invoked (kernel/smpboot.c):


int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
{
    unsigned int cpu;
    int ret = 0;

    get_online_cpus();
    mutex_lock(&smpboot_threads_lock);
    for_each_online_cpu(cpu) {
        ret = __smpboot_create_thread(plug_thread, cpu);
        if (ret) {
            smpboot_destroy_threads(plug_thread);
            goto out;
        }
        smpboot_unpark_thread(plug_thread, cpu);
    }
    list_add(&plug_thread->list, &hotplug_threads);