Fixing kernel: NMI watchdog: BUG: soft lockup and server provisioning.
Today I have to take off the developer hat for a moment and put on the sysadmin one to deal with the issue below, which came up while provisioning a test machine that has not received any love in a long time. Since this server will be under heavy load running database upgrades, we need to fix some of its problems.
Symptoms
The message below appears every once in a while on any open terminal (I’m running Oracle Linux 7):
Message from syslogd@gdlaa008 at May 15 21:04:45 …
kernel:NMI watchdog: BUG: soft lockup - CPU#19 stuck for 22s! [kmemleak:425]
Cause
Long story short, the system is running slow, either because of a slow NAS or some other slow device; more info here.
Triage
Running abrt-cli list gives the following output:
id 53ee4245537c0c5dacf83bb605f8aba103806457
reason: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kmemleak:425]
time: Sun 26 Apr 2020 11:10:14 AM PDT
cmdline: BOOT_IMAGE=/vmlinuz-4.1.12-124.20.3.el7uek.x86_64.debug root=UUID=741b797f-3790-4d82-ab94-e71a189a4f69 ro crashkernel=auto biosdevname=0 net.ifnames=0 rhgb quiet console=tty0 numa=off transparent_hugepage=never LANG=en_US.UTF-8
package: kernel
uid: 0 (root)
count: 300
Directory: /var/spool/abrt/oops-2020-04-26-18:10:14-12138-0
Reported: cannot be reported
A quick tail on /var/log/messages shows:
[root@gdlaa008:~]# tail -100f /var/log/messages
May 15 20:54:09 gdlaa008 kernel: 00000000000927c0 00000000000927c0 0000000000000000 ffffffff8127c5f0
May 15 20:54:09 gdlaa008 kernel: Call Trace:
May 15 20:54:09 gdlaa008 kernel: [] kmemleak_scan+0x586/0x7a0
May 15 20:54:09 gdlaa008 kernel: [] ? kmemleak_scan+0x4eb/0x7a0
May 15 20:54:09 gdlaa008 kernel: [] ? trace_hardirqs_on+0xd/0x10
May 15 20:54:09 gdlaa008 kernel: [] ? kmemleak_write+0x470/0x470
May 15 20:54:09 gdlaa008 kernel: [] kmemleak_scan_thread+0x63/0xd0
May 15 20:54:09 gdlaa008 kernel: [] kthread+0x106/0x120
May 15 20:54:09 gdlaa008 kernel: [] ? local_clock+0x25/0x30
May 15 20:54:09 gdlaa008 kernel: [] ? kthread_create_on_node+0x250/0x250
May 15 20:54:09 gdlaa008 kernel: [] ret_from_fork+0x58/0x90
May 15 20:54:09 gdlaa008 kernel: [] ? kthread_create_on_node+0x250/0x250
May 15 20:54:09 gdlaa008 kernel: Code: 55 08 48 8d 7f 18 53 48 89 f3 be 01 00 00 00 e8 7c e3 8c ff 4c 89 e7 e8 d4 26 8d ff f6 c7 02 74 1f e8 2a bc 8c ff 48 89 df 57 9d <0f> 1f 44 00 00 5b 41 5c 65 ff 0d 66 48 7d 7e 5d c3 0f 1f 40 00
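To get a feel for how often this happens and on which CPUs, the log can be summarized instead of read by eye. This is only a rough sketch: the function name count_soft_lockups is made up for illustration, and it assumes the messages have the same "soft lockup - CPU#N" shape as the ones shown above.

```shell
# Count soft lockup reports per CPU in a syslog-style file.
# $1 = path to the log file, e.g. /var/log/messages
count_soft_lockups() {
    grep -o 'soft lockup - CPU#[0-9]*' "$1" \
        | sed 's/.*CPU#//' \
        | sort -n \
        | uniq -c \
        | awk '{ print "CPU" $2 ": " $1 }'
}
```

Run as root with count_soft_lockups /var/log/messages; it prints one line per CPU with the number of reports found for it.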
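The best we can do is two things: update the machine, and mitigate the error messages by increasing watchdog_thresh, according to this article. The soft lockup threshold is twice watchdog_thresh, so with the default of 10 the kernel complains after about 20 seconds, which matches the "stuck for 22s" in the message.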
[root@gdlaa008:~]# cat /proc/sys/kernel/watchdog_thresh
10
[root@gdlaa008:~]# echo 30 > /proc/sys/kernel/watchdog_thresh
[root@gdlaa008:~]# cat /proc/sys/kernel/watchdog_thresh
30
This won’t fix the underlying issue, but at least it makes the annoying message disappear.
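A value written to /proc is lost on the next reboot. One way to keep it is a sysctl drop-in file. The helper below is a minimal sketch, not part of the original procedure: the function name write_watchdog_conf and the file name 99-watchdog.conf are arbitrary choices, and the 0 to 60 range check reflects the limits the kernel applies to this setting as far as I know.

```shell
# Write a sysctl drop-in that sets kernel.watchdog_thresh at boot.
# $1 = threshold in seconds (assumed valid range 0-60; 0 disables the watchdog)
# $2 = target directory (defaults to /etc/sysctl.d)
# The file name 99-watchdog.conf is an arbitrary choice for this sketch.
write_watchdog_conf() {
    thresh=$1
    dir=${2:-/etc/sysctl.d}
    case $thresh in
        ''|*[!0-9]*)
            echo "threshold must be a whole number of seconds" >&2
            return 1
            ;;
    esac
    if [ "$thresh" -gt 60 ]; then
        echo "threshold must be between 0 and 60" >&2
        return 1
    fi
    printf 'kernel.watchdog_thresh = %s\n' "$thresh" > "$dir/99-watchdog.conf"
}
```

As root, write_watchdog_conf 30 followed by sysctl --system writes the file and reloads all sysctl configuration, so the setting applies now and after a reboot.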
Misc tasks
Since most of the workloads we run are in virtual machines, we need to make sure the host is tuned for that 🙂
sudo yum install tuned
tuned-adm profile virtual-host
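The profile only takes effect while the tuned service is running, so it is worth starting and enabling the service (systemctl start tuned, systemctl enable tuned) and then confirming which profile is active. The small helper below is a sketch for scripting that check; the name active_profile is hypothetical, and it assumes tuned-adm active prints a line starting with "Current active profile:".

```shell
# Print the active tuned profile name.
# Reads the output of `tuned-adm active` on stdin and strips the label,
# assuming the line format "Current active profile: <name>".
active_profile() {
    sed -n 's/^Current active profile: //p'
}
```

Usage: tuned-adm active | active_profile should print virtual-host after the commands above; an empty result suggests no profile is set or the service is not running.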
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 1
...
$ free -g
              total        used        free      shared  buff/cache   available
Mem:            251           1           1           0         248         213
Swap:            17           0          17