App performance monitoring tools in Linux (for humans)

CPU

In most operating systems, CPU time is split into two parts: user time and system time. User time is the time the CPU spends running application code, whereas system time is the time the CPU spends running kernel code on behalf of the application.

When an application does I/O (reading from or writing to devices), the OS executes kernel code to give the application access to those resources.

The more I/O the application does, the more system time it is likely to use, and that translates into lower performance. So, if the code must execute fast, it must reduce I/O to cut down the system time. This is usually done by isolating/restructuring the software so it can use as much CPU as possible in the least amount of time, and, when I/O is needed, by moving those operations to different components (separate threads/processes, for example).
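A quick way to observe this split for a single command is the shell's built-in time. A minimal sketch (the byte counts are arbitrary): copying blocks between pseudo-devices is almost pure kernel work, so the "sys" figure reported by time should dominate the "user" figure.

# I/O-bound work: nearly all the CPU time should be reported as sys
time dd if=/dev/zero of=/dev/null bs=1M count=2048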

vmstat

donhk@vm:~/$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  17164 440948  51852 1136180    0    0    40   118  213  332 11  1 87  0  0
 1  0  17164 440940  51852 1136180    0    0     0     0  808  848 25  0 75  0  0
 1  0  17164 440940  51860 1136180    0    0     0    48 1621 2894 26  1 73  0  0

Here the stats are sampled every second; since 1 second = 1000 milliseconds, each percentage point corresponds to 10 ms of the interval.
us is the percentage of time the CPU is running app code

11% (110 ms) -> 25% (250 ms) -> 26% (260 ms)

sy is the percentage of time the system was running kernel code

1% (10 ms) -> 0% (0 ms) -> 1% (10 ms)

id is the percentage of time the system was idle (the app was either blocked waiting for an I/O operation, waiting on a synchronization primitive, or had nothing to do)

87%(870 ms) -> 75%(750 ms) -> 73%(730 ms)

This last number is often where the application can be made to run faster (tune the I/O resource to make it faster, change the locking mechanism, etc.)
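vmstat reports system-wide numbers; to attribute the user/system split to a particular process, pidstat (shipped in the same sysstat package as sar) can be used. A sketch, where 1234 is a placeholder PID:

# per-process %usr / %system, sampled every second
pidstat 1
# restrict the sampling to a single process
pidstat -p 1234 1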

sar

sar 1 3
Linux 4.14.35-1902.11.3.el7uek.x86_64 (devbox) 	06/27/2020 	_x86_64_	(4 CPU)
09:44:06 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:44:07 PM     all     32.63      0.00      1.84     53.95      0.00     11.58
09:44:08 PM     all     37.97      0.00      1.60     56.15      0.00      4.28
09:44:09 PM     all     45.33      0.00      2.55     37.11      0.00     15.01
Average:        all     38.48      0.00      1.99     49.32      0.00     10.21

sar collects and displays system activity statistics; the command above takes 3 samples of 1 second each.

  • %user – 32.63% of the second was spent running user code
  • %system – 1.84% of the second was spent running kernel code
  • %iowait – 53.95% of the second the CPU was idle while an I/O operation was in progress
  • %idle – 11.58% of the second the CPU was idle with no outstanding I/O
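sar is not limited to CPU statistics; a few other switches worth knowing, using the same sampling syntax (all from the sysstat package):

sar -r 1 3        # memory utilization
sar -d 1 3        # block device (disk) activity
sar -n DEV 1 3    # per-interface network statistics
sar -q 1 3        # run queue length and load averages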

Disk

The disk tends to be the bottleneck of the system; a slow or damaged disk usually leads to low throughput or poor I/O, which makes everything go slower.

Some background about the Disk and the OS

Most operating systems try to get the most from the hardware they run on. One popular technique is to create a virtual address space that looks contiguous and lets programmers allocate "sequential" memory (this is where the classic array data structure, with its contiguous memory, comes from). At the hardware level, the OS takes the physical RAM cells (either local (UMA) or remote (NUMA)), merges them, and presents this abstraction of memory.

Most applications don’t really use all the RAM they allocate. In order to “free” these resources, the OS uses swapping (called paging in Windows), writing parts of RAM that are not “active” out to disk.

When an application does a lot of disk I/O, whether writing or reading (or the OS sends its memory to swap), the I/O performance degradation becomes more evident.
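Before blaming the disk itself, it is worth checking whether the machine is swapping at all; a quick sketch:

free -h           # the Swap: row shows total/used/free swap space
swapon --show     # lists the active swap devices/files
vmstat 1 5        # non-zero si/so columns mean the system is actively swapping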

Knowing when the storage is the problem

Determining this is not trivial; a deep analysis of the code itself has to be done to rule out bad design or misuse of buffers. If we assume the application code works just fine, then there are certain indications we can look for that hint that the storage is the problem.

Each kind of storage has a throughput threshold, for example:

  • A mechanical 5400 RPM disk delivers around 100 MB/s read/write speed
  • An NVMe drive delivers around 3000 MB/s read/write speed
  • NAS storage varies depending on the NIC and traffic configuration

If the application is doing more I/O than the storage can handle, it will perform badly. A way to detect this is through the system I/O stats.
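A rough way to measure what the storage can actually deliver is a sequential write with the page cache bypassed; a sketch, assuming /tmp/testfile lives on the disk under suspicion (GNU dd prints the achieved MB/s when it finishes; remove the file afterwards):

# sequential 1 GB write; oflag=direct bypasses the page cache
dd if=/dev/zero of=/tmp/testfile bs=1M count=1024 oflag=direct
rm /tmp/testfile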

iostat

Example running on a NAS device

iostat -xm 5
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          34.27    0.00    1.45   18.12    0.00   46.16
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               9.60    31.00   30.40  204.80     2.45     5.63    70.37     2.07    8.81   34.17    5.05   2.59  60.92
dm-0              0.00     0.00   39.80  233.00     2.45     5.63    60.67     2.68    9.81   37.31    5.12   2.23  60.92
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

The above command samples extended statistics (-x), reported in megabytes (-m), every 5 seconds for the disks the OS is using.

Device: sda

  • w_await – the average time to service each write I/O is 5.05 milliseconds
  • %util – disk usage percentage is 60.92%
  • %system – kernel time used is 1.45%
  • w/s – writes per second, in this case 204.80 (quite high for only 5.63 MB written, i.e. many small writes)
  • wMB/s – MB written per second, 5.63 MB
  • %iowait – percentage of time spent waiting for disk, 18.12%

If %util were close to 100% and %iowait above 50%, that would tell us the storage cannot keep up with the app; at that point you either hire new devs or throw money at the problem.
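iostat shows that a device is saturated but not who is saturating it; pidstat can attribute the disk I/O to a process (iotop is an interactive alternative):

# per-process disk reads/writes in kB/s, sampled every 5 seconds
pidstat -d 5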

vmstat

vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  2      0 191348  99216 10253392    0    0   731  1872 1661 1361 27  2 37 33  0
 2  2      0  95704  99232 10348884    0    0  6166 28134 9202 4911 32  1 41 25  0
 1  2      0  78092  99252 10341488    0    0   875 17731 9903 4861 38  1 53  9  0

si (swap in) and so (swap out) alert us when the system is swapping; in this example, both are 0.
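The same information is available from sar as pages swapped per second, which is convenient for comparing against its other reports:

sar -W 1 3        # pswpin/s and pswpout/s: pages swapped in/out per second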

Network Usage

If the software you are monitoring does network I/O, one monitoring tool you can use is nicstat. When a network is involved, it tends to be a critical area to monitor: in distributed systems theory, network-related operations are always considered vectors of problems, because a network resource is never guaranteed to work. It can be the source of multiple problems such as the following (a small defensive sketch follows the list):

  • The request took more time than expected
  • The request never finished (hang)
  • There was an intermittent network error
  • There was a fatal network error
  • The network works fine but the endpoint is dead
  • The response is corrupted
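As an illustration of the first two failure modes, most clients expose explicit timeout and retry knobs; a minimal sketch with curl (the URL is a placeholder):

# fail fast instead of hanging forever, and retry transient errors
curl --fail --connect-timeout 2 --max-time 5 --retry 3 https://example.com/health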

If the application needs to wait for confirmation, things get worse, but this post is not about that. On the monitoring side, nicstat makes it possible to gather NIC statistics.

nicstat 5
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
21:32:31   enp0s8    0.02    0.00    0.08    0.00   236.1    0.00  0.00   0.00
21:32:31  enp0s10    0.02    0.00    0.08    0.00   236.1    0.00  0.00   0.00
21:32:31   enp0s9    0.02    0.00    0.08    0.00   236.1    0.00  0.00   0.00
21:32:31       lo    0.01    0.01    0.12    0.12   44.60   44.60  0.00   0.00
21:32:31   enp0s3   12.51    1.27   16.83    8.44   761.0   153.9  0.01   0.00

enp0s3 is a 100 Mbps interface; of that capacity, 0.01% is used, with 1.27 KB/s written and 12.51 KB/s read.
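If nicstat is not available (it is not part of the sysstat package), sar can report similar per-interface counters:

sar -n DEV 1 3    # rxkB/s and txkB/s per interface (recent sysstat versions also print %ifutil)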

Final Notes

In order to properly profile an application, it is necessary to understand the underlying limitations and characteristics of the hardware it runs on and, if possible, to run performance tests on systems with similar characteristics; otherwise, the numbers are meaningless.

Some of these indicators may vary or be affected when the app runs inside a container. In that case, it is necessary to profile it within the container environment and take into account any limitation imposed by the container orchestrator, such as caps on CPU usage, bandwidth, etc.

Links
  • https://docs.oracle.com/cd/E19455-01/805-7229/spmonitor-40118/index.html
  • Java Performance, 2nd Edition by Scott Oaks