System management

Document history

Date      Version  Author                                       Description
--------  -------  -------------------------------------------  -----------------------
11/12/00  0.01.0   D. Doubrava <linux_monitor(at)volny(dot)cz>  Newly created document.


Typographic convention

Font       Example         Description
---------  --------------  ----------------------------
Blue text  This can be...  Ideas or undefined processes

Theory

The key objective of system management is to keep the system running in accordance with the needs of its applications and users. If the system or a service is unavailable, this can have negative impacts.

To avoid unavailability, you must be able to predict problems or failures. You have to be proactive, not reactive.

Reactive behavior

Actions are taken after the problem has been discovered. For example, if a disk is full, nobody can add any data to it. In this case the system administrator looks for files that can be removed. The trigger for the administrator's action is a "user call" from a user who ran into the problem.

Proactive behavior

The administrator has "tools" that notify him when a certain condition occurs. In our example, a message is sent to the administrator when free disk space drops below a set threshold. The administrator can then free some additional disk space for data, and the users never see a problem. In some cases an automatic action can be triggered, such as deleting old backup files.
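The disk space example can be sketched in a few lines of Python. This is only a minimal illustration: the function names, the 10 % threshold and the use of a `notify` callback are assumptions made for this sketch, not part of any particular monitoring tool. An automatic action (such as removing old backups) could be passed in place of `notify` in the same way.

```python
import shutil


def free_fraction(path="/"):
    """Return the free space on the filesystem holding `path` as a fraction 0.0-1.0."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a size of zero; treat them as "not full".
        return 1.0
    return usage.free / usage.total


def check_disk(path="/", threshold=0.10, notify=print):
    """Call `notify` and return True when free space is below `threshold`.

    `threshold` and `notify` are illustrative assumptions: 0.10 means
    "warn when less than 10 % is free", and `notify` is any callable
    that accepts a message string (print, a mailer, a pager hook, ...).
    """
    fraction = free_fraction(path)
    if fraction < threshold:
        notify(f"WARNING: only {fraction:.1%} free on {path}")
        return True
    return False
```

Run periodically (for example from cron), a call such as `check_disk("/home", threshold=0.10)` would warn the administrator before users hit a full disk.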

Notes

We can use a set of different tools to monitor different symptoms of problems. I have seen one administrator of about 30 Unix boxes use a single screen with 30 xload windows, one for each node. When the load rose unexpectedly, he had to log on to the specific host and try to localize the problem. This is the road to hell.

If we need monitoring and management tools, we generally need a single management environment that collects data and events.
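As a rough sketch of what "one management environment" means, the following Python fragment keeps events from many nodes in one place instead of one window per node. The `Event` and `Collector` classes, their fields and the node names are hypothetical and exist only for this illustration; a real environment would also need transport, storage and notification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """One reported condition; all field names are assumptions for this sketch."""
    node: str
    metric: str
    message: str
    severity: str = "warning"


@dataclass
class Collector:
    """Central place where every node reports its events."""
    events: List[Event] = field(default_factory=list)

    def report(self, event):
        """Store an event reported by a node."""
        self.events.append(event)

    def by_node(self, node):
        """Return all events reported by one node."""
        return [e for e in self.events if e.node == node]

    def nodes_with_problems(self):
        """Return the sorted names of nodes that reported at least one event."""
        return sorted({e.node for e in self.events})
```

With such a collector the administrator asks one question, `collector.nodes_with_problems()`, rather than watching 30 separate displays.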

Management scope

We can monitor systems at different levels.

The level of monitoring depends on the functions provided by a specific node and on the required system availability.

If we have a production database server, we want to monitor system load (CPU usage, disk utilization, memory utilization). If we have a file server, we need to monitor disk space.
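The idea that the server's function decides which metrics are watched can be sketched as follows. The metric names, the `WATCHED` mapping and the role names are assumptions chosen for this example. Memory utilization is left out because the Python standard library has no portable call for it.

```python
import os
import shutil

# Hypothetical mapping from server function to the metrics worth watching.
WATCHED = {
    "database": ["load_1min", "disk_used_fraction"],
    "file": ["disk_used_fraction"],
}


def collect_metrics(path="/"):
    """Take a small snapshot of load and disk metrics on this node."""
    metrics = {}
    # os.getloadavg() exists only on Unix-like systems.
    if hasattr(os, "getloadavg"):
        try:
            metrics["load_1min"] = os.getloadavg()[0]
        except OSError:
            pass  # the load average could not be obtained
    usage = shutil.disk_usage(path)
    metrics["disk_used_fraction"] = usage.used / usage.total if usage.total else 0.0
    return metrics


def metrics_for(role, path="/"):
    """Return only the metrics that matter for the given server function."""
    snapshot = collect_metrics(path)
    return {name: snapshot[name] for name in WATCHED[role] if name in snapshot}
```

For example, `metrics_for("file", "/srv")` returns only the disk figure, while `metrics_for("database")` also includes the one-minute load average where the platform provides it.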

If we have a WWW server, the required availability is 24x7. If we have a company accounting system, it has to be available during business hours. And if I have a home computer, it only needs to be available when I have time to play games.
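Availability requirements like these can be written down as a small rule, so that a monitoring tool knows when an outage actually matters. The sketch below assumes that "business hours" means Monday to Friday, 08:00 to 17:00; these hours and the profile names are illustrative assumptions only.

```python
from datetime import datetime, time

# Assumed business hours; adjust to the actual organisation.
BUSINESS_START = time(8, 0)
BUSINESS_END = time(17, 0)


def must_be_available(profile, when):
    """Return True if a system with this availability profile must be up at `when`."""
    if profile == "24x7":
        return True
    if profile == "business_hours":
        is_weekday = when.weekday() < 5  # Monday is 0, Friday is 4
        return is_weekday and BUSINESS_START <= when.time() < BUSINESS_END
    raise ValueError(f"unknown availability profile: {profile}")
```

A monitoring tool could use `must_be_available("business_hours", datetime.now())` to decide whether a failed check should page the administrator immediately or wait until morning.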

The following table shows what can be monitored on different servers, and which level is needed. The table is only an example and does not relate to any existing server or application. In the real world, different metrics may be used.


Server's function  Mission critical server         Office server                   Test server
-----------------  ------------------------------  ------------------------------  ------------------------------
Availability       24x7                            Business hours                  Business hours
                   Minimized downtime              Downtime: tens of minutes       Downtime: no special request

Database server    Is database running?            Is database running?            Is database running?
                   System load                     Free space in database
                   Free space in database

Web server         Network connectivity            Network connectivity            Is Web server listening?
                   Is Web server listening?        Is Web server listening?
                   System load
                   Security incidents

File server        File server process (e.g. smb)  File server process (e.g. smb)  File server process (e.g. smb)
                   Disk space                      Disk space
                   Memory (cache) utilization

Print server       Print server process            Print server process            Print server process
                   Print queue length              Print queue usage
                   Print queue usage
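One of the checks in the table, "Is Web server listening?", can be approximated by trying to open a TCP connection to the server's port. The sketch below is one possible way to do it. The function name and the 3-second timeout are assumptions. A successful connection only shows that something accepts connections on that port, not that the Web server answers requests correctly.

```python
import socket


def is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises OSError on refusal, timeout or lookup failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For a Web server on the standard HTTP port, the call would be `is_listening("www.example.com", 80)`, where the host name is a placeholder.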



Generally: the higher our requirements for system availability, the more metrics we have to monitor in order to predict problems better.

Note: If we need a highly available system, we also have to use other techniques, such as RAID disk arrays.