Fixing kernel: NMI watchdog: BUG: soft lockup and server provisioning.

Today I have to take off the developer hat for a moment and put on the sysadmin one to deal with the issue below, which came up while provisioning a testing machine I was given that has not received any love in a long time. Since this server will be under heavy load running database upgrades, we need to fix some of its problems.

Symptoms

The message below shows up every once in a while on any open terminal (I’m running Oracle Linux 7):

Message from syslogd@gdlaa008 at May 15 21:04:45 …
kernel:NMI watchdog: BUG: soft lockup - CPU#19 stuck for 22s! [kmemleak:425]
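
To see how often this happens and which task gets stuck, the messages can be counted per task. This is only a minimal sketch: the function name summarize_soft_lockups is my own, and it assumes the log lines look like the one above.

```shell
# Summarize soft lockup messages read from stdin.
# Prints one line per stuck task in the form "<count> <task>:<pid>".
# summarize_soft_lockups is a hypothetical helper name, not a system tool.
summarize_soft_lockups() {
    grep 'BUG: soft lockup' \
        | sed -n 's/.*stuck for [0-9]*s! \[\([^]]*\)\].*/\1/p' \
        | sort | uniq -c | sed 's/^ *//'
}
```

On the server it would be fed from the log, for example: summarize_soft_lockups < /var/log/messages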

Cause

Long story short, the system is running slowly, because of either a slow NAS or a slow device, more info here

Triage

Running abrt-cli list gives the following output:

id 53ee4245537c0c5dacf83bb605f8aba103806457
reason: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kmemleak:425]
time: Sun 26 Apr 2020 11:10:14 AM PDT
cmdline: BOOT_IMAGE=/vmlinuz-4.1.12-124.20.3.el7uek.x86_64.debug root=UUID=741b797f-3790-4d82-ab94-e71a189a4f69 ro crashkernel=auto biosdevname=0 net.ifnames=0 rhgb quiet console=tty0 numa=off transparent_hugepage=never LANG=en_US.UTF-8
package: kernel
uid: 0 (root)
count: 300
Directory: /var/spool/abrt/oops-2020-04-26-18:10:14-12138-0
Reported: cannot be reported
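
The cmdline field tells us which kernel image the machine was booted with, which is handy to have before updating. A small sketch to pull it out of a kernel command line; boot_image is a made-up helper name, and it assumes the BOOT_IMAGE= parameter is present as in the output above.

```shell
# Print the kernel image named by BOOT_IMAGE= in a kernel command line.
# Reads the command line from stdin (for example from /proc/cmdline).
# boot_image is a hypothetical helper name.
boot_image() {
    tr ' ' '\n' | sed -n 's/^BOOT_IMAGE=//p' | head -n 1
}
```

For the running system: boot_image < /proc/cmdline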

A quick tail on /var/log/messages shows:

[root@gdlaa008:~]# tail -100f /var/log/messages
May 15 20:54:09 gdlaa008 kernel: 00000000000927c0 00000000000927c0 0000000000000000 ffffffff8127c5f0
May 15 20:54:09 gdlaa008 kernel: Call Trace:
May 15 20:54:09 gdlaa008 kernel: [] kmemleak_scan+0x586/0x7a0
May 15 20:54:09 gdlaa008 kernel: [] ? kmemleak_scan+0x4eb/0x7a0
May 15 20:54:09 gdlaa008 kernel: [] ? trace_hardirqs_on+0xd/0x10
May 15 20:54:09 gdlaa008 kernel: [] ? kmemleak_write+0x470/0x470
May 15 20:54:09 gdlaa008 kernel: [] kmemleak_scan_thread+0x63/0xd0
May 15 20:54:09 gdlaa008 kernel: [] kthread+0x106/0x120
May 15 20:54:09 gdlaa008 kernel: [] ? local_clock+0x25/0x30
May 15 20:54:09 gdlaa008 kernel: [] ? kthread_create_on_node+0x250/0x250
May 15 20:54:09 gdlaa008 kernel: [] ret_from_fork+0x58/0x90
May 15 20:54:09 gdlaa008 kernel: [] ? kthread_create_on_node+0x250/0x250
May 15 20:54:09 gdlaa008 kernel: Code: 55 08 48 8d 7f 18 53 48 89 f3 be 01 00 00 00 e8 7c e3 8c ff 4c 89 e7 e8 d4 26 8d ff f6 c7 02 74 1f e8 2a bc 8c ff 48 89 df 57 9d <0f> 1f 44 00 00 5b 41 5c 65 ff 0d 66 48 7d 7e 5d c3 0f 1f 40 00

The best we can do is two things: update the machine, and mitigate the error messages by increasing watchdog_thresh, according to this article

[root@gdlaa008:~]# cat /proc/sys/kernel/watchdog_thresh
10
[root@gdlaa008:~]# echo 30 > /proc/sys/kernel/watchdog_thresh
[root@gdlaa008:~]# cat /proc/sys/kernel/watchdog_thresh
30

This won’t fix the underlying issue, but at least it will make the annoying message disappear.
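
Two side notes on the value itself. According to the kernel sysctl documentation, the soft lockup threshold is 2 * watchdog_thresh, so the default of 10 reports at 20 seconds (which fits the "stuck for 22s" messages) and 30 moves that to 60 seconds. Also, a value written to /proc is lost on reboot. Below is a rough sketch of both points; the function names and the file name 90-watchdog.conf are my own picks, and /etc/sysctl.d is assumed as the drop-in directory.

```shell
# Soft lockup threshold in seconds for a given watchdog_thresh value.
# The kernel documentation states it as 2 * watchdog_thresh.
softlockup_threshold() {
    echo $(( 2 * $1 ))
}

# Write a sysctl drop-in that sets kernel.watchdog_thresh.
# $1 = value in seconds, $2 = target directory (assumed: /etc/sysctl.d).
# The file name 90-watchdog.conf is an arbitrary, hypothetical choice.
write_watchdog_dropin() {
    printf 'kernel.watchdog_thresh = %s\n' "$1" > "$2/90-watchdog.conf"
}
```

As root, something like write_watchdog_dropin 30 /etc/sysctl.d followed by sysctl --system should apply it and keep it across reboots.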

Misc tasks

Since most of the workloads we run are in virtual machines, we need to be sure the machine is aware of that 🙂

sudo yum install tuned
tuned-adm profile virtual-host
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 1
...
$ free -g
              total        used        free      shared  buff/cache   available
Mem:            251           1           1           0         248         213
Swap:            17           0          17
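
To double check that the tuned-adm profile command above really took effect, the output of tuned-adm active can be tested. A minimal sketch, assuming that output has the form "Current active profile: <name>"; profile_is_active is a made-up helper name.

```shell
# Succeed if the `tuned-adm active` output on stdin names the expected profile.
# $1 = expected profile name, e.g. virtual-host.
# profile_is_active is a hypothetical helper; the output format is an assumption.
profile_is_active() {
    grep -q "^Current active profile: $1\$"
}
```

Usage would be along the lines of: tuned-adm active | profile_is_active virtual-host && echo "profile OK"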

References
https://forums.centos.org/viewtopic.php?t=60087