App Performance monitoring tools in Linux (for humans)
CPU
In most operating systems, CPU time is split into two parts: user-time and system-time. User-time is the time the CPU spends running application code, whereas system-time is the time the CPU spends running kernel code on behalf of the application.
When an application does I/O (reading from or writing to devices), the OS executes kernel code to give the application access to those resources. The more I/O the application does, the more system-time it is likely to use, which translates into lower performance. So, if the code must execute fast, it must reduce its I/O so that less time is spent in system-time; this is usually done by isolating/restructuring the software so it can use as much CPU as possible in as little time as possible and, when I/O is needed, moving those operations to different components (separate threads or processes, for example).
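A quick way to see this split for a single command is the shell's time built-in (the /tmp/io-test file below is just a throwaway example):

# Run an I/O-heavy command and look at how its time is split
time dd if=/dev/zero of=/tmp/io-test bs=1M count=512
rm /tmp/io-test

# Illustrative output (exact numbers depend on the machine):
# real  0m1.8s   <- wall-clock time
# user  0m0.0s   <- time spent running application code
# sys   0m0.9s   <- time spent running kernel code (dominated by the I/O)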
vmstat
donhk@vm:~/$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  17164 440948  51852 1136180    0    0    40   118  213  332 11  1 87  0  0
 1  0  17164 440940  51852 1136180    0    0     0     0  808  848 25  0 75  0  0
 1  0  17164 440940  51860 1136180    0    0     0    48 1621 2894 26  1 73  0  0
Here the stats are taken every second (1 second = 1000 milliseconds).
us – percentage of time the CPU was running app code: 11% (110 ms) -> 25% (250 ms) -> 26% (260 ms)
sy – percentage of time the system was running kernel code: 1% (10 ms) -> 0% (0 ms) -> 1% (10 ms)
id – percentage of time the system was idle (the app was either blocked waiting for an I/O operation, waiting on a synchronization primitive, or had nothing to do): 87% (870 ms) -> 75% (750 ms) -> 73% (730 ms)
This last one is something that could be improved to make the application run faster (tune the I/O resource to make it faster, change the locking mechanism, etc.).
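If only the CPU columns are of interest, a small awk filter over vmstat keeps longer runs readable. A rough sketch; the column positions match the layout shown above:

# Sample once per second for 60 seconds and keep only us, sy and id
# (columns 13, 14 and 15 in this vmstat layout)
vmstat 1 60 | awk 'NR > 2 { printf "user=%s%% sys=%s%% idle=%s%%\n", $13, $14, $15 }'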
sar
sar 1 3
Linux 4.14.35-1902.11.3.el7uek.x86_64 (devbox)  06/27/2020  _x86_64_  (4 CPU)

09:44:06 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:44:07 PM     all     32.63      0.00      1.84     53.95      0.00     11.58
09:44:08 PM     all     37.97      0.00      1.60     56.15      0.00      4.28
09:44:09 PM     all     45.33      0.00      2.55     37.11      0.00     15.01
Average:        all     38.48      0.00      1.99     49.32      0.00     10.21
Collects and displays system activity statistics; here, 3 samples of 1 second each.
%user – 32.63% of the second was spent running user code
%system – 1.84% of the second was spent running kernel code
%iowait – for 53.95% of the second there was no runnable task, but there was an I/O operation in progress
%idle – the CPU was waiting, with no I/O operation in progress
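sar can also record samples to a binary file and replay them later, which is handy when the slowdown happens in a window you cannot watch live (the file path below is just an example):

sar -u 1 60 -o /tmp/cpu.sar   # collect 60 one-second CPU samples into a file
sar -u -f /tmp/cpu.sar        # replay the collected samples later
sar -P ALL 1 3                # per-CPU breakdown, useful to spot a single saturated core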
Disk
The disk tends to be the bottleneck of the system; a slow or damaged disk usually leads to low throughput or poor I/O, which makes everything slower.
Some background about the Disk and the OS
Most operating systems try to get the most from the hardware they run on. One popular technique is to create a virtual block of memory which "looks contiguous" and lets programmers allocate "sequential memory" (this is where the famous array data structure, whose memory is "sequential", comes from). At the hardware level, the OS takes the physical RAM cells, either local (UMA) or remote (NUMA), merges them, and builds this memory abstraction.
Most applications don't really use all the RAM they allocate. In order to "free" these resources, the OS uses swapping (called paging on Windows), writing to disk the parts of RAM that are not "active".
When an application does a lot of disk I/O, either writing or reading (or when the OS sends its memory to swap), the I/O performance degradation becomes more evident.
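A quick sanity check for whether the OS is actually swapping (assuming a reasonably recent util-linux for the --show option):

free -m           # the "Swap:" row shows total/used/free swap in MB
swapon --show     # lists the active swap devices and how much of each is used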
Knowing when the storage is the problem
Determining this is not trivial; a deep analysis of the code itself has to be done to determine whether there is a bad design or bad usage of buffers. If we assume the app code works just fine, then there are certain indicators we can look for that hint the storage is the problem.
Each storage device has an I/O throughput threshold, for example:
- A mechanical 5400 RPM disk has around 100 MB/s read/write speed
- An NVMe drive has around 3000 MB/s read/write speed
- NAS storage may vary depending on the NIC and traffic configuration
If the application is doing more I/O than the storage can handle, it will perform badly; one way to know this is through the system's I/O stats.
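As a rough check of what the storage can actually deliver, a sequential write with dd that bypasses the page cache gives a ballpark number to compare against the thresholds above (the path and size are just examples; fio is the more thorough option):

dd if=/dev/zero of=/tmp/dd-test bs=1M count=1024 oflag=direct
# Illustrative output on a mechanical disk:
# 1073741824 bytes (1.1 GB) copied, 10.5 s, 102 MB/s
rm /tmp/dd-test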
iostat
Example running on a NAS device
iostat -xm 5
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          34.27    0.00    1.45   18.12    0.00   46.16

Device: rrqm/s wrqm/s   r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda       9.60  31.00 30.40 204.80   2.45   5.63    70.37     2.07   8.81   34.17    5.05   2.59  60.92
dm-0      0.00   0.00 39.80 233.00   2.45   5.63    60.67     2.68   9.81   37.31    5.12   2.23  60.92
dm-1      0.00   0.00  0.00   0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
The above command samples extended stats (in MB) every 5 seconds for the disks the OS is currently using.
For Device: sda
w_await – the time to service each write I/O is 5.05 milliseconds
%util – the disk usage percentage is 60.92%
%system – the kernel time used is 1.45%
w/s – writes per second, in this case 204.80 (quite high for 5.63 MB written)
wMB/s – MB written per second, 5.63 MB
%iowait – percentage of time spent waiting for disk, 18.12%
If %util were around 100% and %iowait were above 50%, that would tell us that the storage cannot keep up with the app; you either hire new devs or throw money at the problem.
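A minimal watch-loop sketch for that condition: print a line whenever a device's utilization crosses a threshold (the device name sda and the 90% cutoff are assumptions, adjust them to your setup):

# %util is the last column of iostat -x output
iostat -xm 5 | awk '$1 == "sda" && $NF+0 > 90 { print "disk saturated: %util=" $NF }'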
vmstat
vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff    cache   si   so    bi    bo   in   cs us sy id wa st
 2  2      0 191348  99216 10253392    0    0   731  1872 1661 1361 27  2 37 33  0
 2  2      0  95704  99232 10348884    0    0  6166 28134 9202 4911 32  1 41 25  0
 1  2      0  78092  99252 10341488    0    0   875 17731 9903 4861 38  1 53  9  0
si (swap in) and so (swap out) alert us if the system is swapping; in this example, both are 0.
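A small sketch that flags swapping as it happens, reusing the column positions from the output above (si and so are columns 7 and 8 in this layout):

vmstat 5 | awk 'NR > 2 && ($7 > 0 || $8 > 0) { print "swapping: si=" $7 " so=" $8 }'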
Network Usage
If the software you are monitoring does network I/O, one monitoring tool you can use is nicstat. When a network is involved, it tends to be a critical area to monitor; in distributed systems theory, network-related operations are always considered vectors of problems because a network resource is never guaranteed to work. It can be the source of multiple problems, such as:
- The request took more time than expected
- The request never finished (hang)
- There was an intermittent network error
- There was a fatal network error
- The network works fine but the endpoint is dead
- The response is corrupted
If the application needs to wait for confirmation, things get worse, but this post is not about that. On the monitoring side, nicstat makes it possible to gather NIC statistics.
nicstat 5
    Time      Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs %Util    Sat
21:32:31   enp0s8    0.02    0.00    0.08    0.00   236.1    0.00  0.00   0.00
21:32:31  enp0s10    0.02    0.00    0.08    0.00   236.1    0.00  0.00   0.00
21:32:31   enp0s9    0.02    0.00    0.08    0.00   236.1    0.00  0.00   0.00
21:32:31       lo    0.01    0.01    0.12    0.12   44.60   44.60  0.00   0.00
21:32:31   enp0s3   12.51    1.27   16.83    8.44   761.0   153.9  0.01   0.00
enp0s3 is a 100 Mbps interface; of that capacity, 0.01% is used, with 1.27 KB/s written and 12.51 KB/s read.
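If nicstat is not installed, sar and ip can report similar per-interface counters (the interface name enp0s3 is taken from the example above):

sar -n DEV 1 3          # rxkB/s, txkB/s and utilization per interface
ip -s link show enp0s3  # packet, error and drop counters for one interface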
Final Notes
In order to properly profile an application, it is necessary to understand the underlying limitations and characteristics of the HW where it runs and, if possible, perform performance tests on systems with similar characteristics; otherwise, the numbers are meaningless.
Some of these indicators may vary or be affected when the app runs within a container; in that case, it is necessary to profile it within the container environment and take into account any limitation imposed by the container orchestrator, such as a limit on CPU usage, bandwidth, etc.
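For example, a container's effective CPU quota usually comes from cgroups; a quick way to check it from inside the container (paths differ between cgroup v1 and v2, so both are shown, and may vary by distribution):

cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us /sys/fs/cgroup/cpu/cpu.cfs_period_us   # cgroup v1 (-1 means no limit)
cat /sys/fs/cgroup/cpu.max                                                     # cgroup v2 ("max 100000" means no limit)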
Links
- https://docs.oracle.com/cd/E19455-01/805-7229/spmonitor-40118/index.html#:~:text=(The%20sar%20command%20without%20any,with%20no%20work%20to%20do.
- Java Performance, 2nd Edition by Scott Oaks