Gain insight into the operational status of your Linux servers and ensure you are efficiently monitoring them.
Introduction to Processor Utilization Monitoring and Troubleshooting on Linux
The Linux operating system has earned a reputation for being lightweight, secure, and performant. It is the standard operating system for modern, microservices-based infrastructure.
Despite Linux's lightweight, optimized design, applications can still introduce anomalies, and performance problems can, and will, creep into your business services. Remaining vigilant for these issues, and tracking them down, can be tricky. Your strategy needs to include monitoring knowledge, as well as an understanding of the basic command-line tools that will help you quickly diagnose issues.
The very first step in troubleshooting performance issues is knowing they exist; better still is knowing before they pose a problem. Even a small environment benefits from system monitoring, and large ones cannot go without it. Not all monitoring is created equal, though. Your monitoring solution should be aware of system status as well as potential performance issues before they arise. Opsview Monitor accomplishes this by maintaining a baseline of system performance and offering highly configurable notification mechanisms to alert you if something appears to be going awry. Other monitoring solutions may offer similar capabilities; whichever you use, make sure it's configured to proactively alert you before minor performance issues become major ones. If you need help setting up Opsview to ensure maximum performance across all your systems, feel free to reach out. We're here to help.
Just knowing that a performance issue exists, or might exist soon, isn't enough. When confronted with a performance alert, it helps to know where to start. Begin by checking your monitoring software to understand the history of the performance metrics and why you're receiving an alert. That context alone may identify the issue. If not, continue reading for an introduction to monitoring performance on Linux systems using standard Linux tools and commands.
Determining the Problem
The first step in troubleshooting is to identify the issue. Users or internal staff may report that a web server is "sluggish", but the web server process itself may not be to blame. Obtaining a list of currently running processes and their CPU and memory usage is your first step. There are two main ways of doing this: manually, per server, via a command-line utility like top, or by using monitoring software like Opsview. For now, let's focus on the command line to examine the low-level aspects of potential processor performance issues.
Monitoring Utilization at the Command Line
To determine overall CPU load, you'll first need to determine how many processor cores are installed on the machine. If you aren't sure of this value, run the nproc command. Once you have it, run the uptime command. You'll see something similar to this:
NOTE: If the nproc command isn't available on your system, substitute this command:
cat /proc/cpuinfo | grep "processor" | wc -l
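For illustration, here is roughly what these commands produce. The values shown in the comments are invented to match the four-core example discussed next; your own output will differ:

```shell
# Count the installed processor cores
nproc

# Show uptime and the 1-, 5-, and 15-minute load averages.
# Illustrative output for the example system discussed in this article:
#  14:32:07 up 79 days,  3:11,  2 users,  load average: 0.00, 0.03, 0.91
uptime
```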
In this example, we have 4 processor cores and a load average of 0.00, 0.03, 0.91. The first number is the load average over the last minute, the second value is the load average over the past 5 minutes, and the third covers the past 15 minutes.
A load average of 0.00 means no load, while a value of 1.00 means (in general) that the system is at capacity. Technically, the values refer to how many processes are waiting on the CPU and other resources like I/O (disk, network, etc.); if the value is above one, you have contention for resources, which will slow everything down.
Unfortunately, Linux doesn’t scale this load average for multiple CPU systems. If you only have one CPU core, which is unlikely these days, a load average of 1 is ideal. But since most servers have multiple cores, we’ll need to do a bit of math to turn this into a usable CPU usage percentage like what you’d find on the Task Manager in Windows.
If a system has 4 CPU cores, like our example above, we would want to shoot for a long-term (15-minute average) load average of 4 or less. So, we would take the 0.91 value and divide it by 4, giving us roughly 0.23, which translates to about 23% CPU usage. This calculation is not an absolute measure of CPU contention, but it gives a reasonable estimate.
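As a sketch, this calculation can be scripted by dividing the 15-minute load average, which is the third field of /proc/loadavg, by the core count reported by nproc:

```shell
# Estimate overall CPU usage: 15-minute load average / core count, as a percentage
awk -v cores="$(nproc)" '{ printf "%.0f%%\n", ($3 / cores) * 100 }' /proc/loadavg
```

On the example system above this would print 23%; on an idle machine it prints something close to 0%.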
Armed with knowledge of your total CPU usage, you can now examine the usage of individual running processes. From a console, run the top command. You'll see a list of running processes much like this:
In this screenshot, you'll see the "apache2" process (the Apache web server) listed multiple times. It might be tempting to assume it is the culprit behind the higher load because it appears so often, but in this instance that is not the case: multi-process applications like Apache are simply listed once per process.
The %CPU column shows the percentage of CPU the process is currently using. You'll note that the apache2 processes currently aren't using any CPU. The TIME+ column is a cumulative value showing the amount of CPU time (in hours, minutes, and seconds) a process has used. In this case, apache2 has used 3 hours of CPU time; but since the system has been running for 79 days (displayed in the first line) and other processes have historically used far more time, it is likely not consuming an excessive amount of CPU, either currently or in recent history.
By default, top updates its view every 3 seconds. This is generally a reasonable value, but you may want to watch for short-lived load spikes when chasing a more intermittent problem. In that case, press the "s" key, type "1", and hit Enter, lowering the update interval to once per second. For extremely detailed cases, you can enter a decimal value to update more than once per second.
By pressing SHIFT and the < and > keys, you can change which column top sorts by. This is very helpful when you want to sort by cumulative time. You can press SHIFT+M to immediately sort by RAM usage, which is helpful in finding memory leaks. Once you're done with top, press "q" to quit.
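When you need the same view non-interactively, say in a script or a cron-driven log, top's batch mode or a sorted ps listing works well. A minimal sketch:

```shell
# One top snapshot in batch mode (-b, -n 1), suitable for piping or logging;
# head trims the output to the summary lines
top -b -n 1 | head -n 5

# Alternatively, ps can sort by CPU directly: the five busiest processes,
# with memory usage and cumulative CPU time, sorted descending
ps -eo pid,comm,%cpu,%mem,time --sort=-%cpu | head -n 6
```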
The ps auxww command presents a similar list, but instead of updating repeatedly, it prints a one-time snapshot. This lets you pipe the output through other tools to search for specific processes. For example, to see just the Apache processes, run:
ps auxww | grep -i "apache"
(Note: the -i switch makes the search case insensitive.)
In the resulting list, you’ll see the Apache processes listed.
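One wrinkle: the grep command itself usually appears in the results, because its own command line contains the search string. A common trick, sketched below, is to wrap one character of the pattern in brackets; the `|| echo` guard is only there so the command succeeds even when no Apache processes are running:

```shell
# "[a]pache" matches the string "apache" but not the literal "[a]pache"
# in grep's own command line, so grep excludes itself from the results
ps auxww | grep -i "[a]pache" || echo "no apache processes found"
```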
The commands above provide a considerable amount of metric data to help troubleshoot your performance issues. With this knowledge, you'll be able to determine whether you need to optimize Apache or your application, or increase your hardware resources.
Automate Linux Monitoring with Opsview
The manual processes described above dive into low-level metrics on a per-server basis. Across multiple systems, this approach may not scale, which is why you need persistent monitoring software like Opsview Monitor. By running Opsview Monitor agentlessly, with agents, or via SNMP, you can automate the process of identifying performance issues. Additionally, tasks and notifications can be automated, letting you easily process and assimilate metrics for pain-free troubleshooting. Not only does this free up technical resources for more development and innovation, it ensures IT failures are found and fixed before the business is impacted.
If you're ready to try automating your Linux monitoring with Opsview, download a free trial today. If you have questions about best practices for monitoring system performance, or systems in general, please contact us.