This tutorial describes the method I used to send hardware notifications from DELL servers to Nagios.
I rely on OpenManage, the software DELL ships with its servers. OpenManage is configured to run a command whenever a hardware problem occurs, and this command sends a passive check result to Nagios.
Another way to inform Nagios of hardware events is to have OpenManage send SNMP traps to Nagios.
Nagios configuration
First of all, you need a working Nagios installation, version 1.2 to 2.4.
I usually configure Nagios with templates, so here are my generic service template, a passive service template that inherits from it, and a DELL-specific template built on the passive one:
# Generic service definition template
define service{
name generic-service
register 0
check_period 24x7
max_check_attempts 3
normal_check_interval 15
retry_check_interval 5
active_checks_enabled 1
passive_checks_enabled 0
parallelize_check 1
obsess_over_service 0
check_freshness 0
event_handler_enabled 0
flap_detection_enabled 0
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
notification_interval 60
notification_period 24x7
notification_options w,u,c,r
notifications_enabled 1
}
# Generic passive service definition template
define service{
name passive-service
use generic-service
register 0
active_checks_enabled 0
passive_checks_enabled 1
max_check_attempts 1
check_freshness 0
check_command check_dummy!1
}
define service{
# DELL OpenManage (+ nsca)
use passive-service
name dell-probe
register 0
service_description Hardware
contact_groups nt-admins
}
This last template can then be applied to your DELL server, here named MY_SERVER:
define service{
host_name MY_SERVER
use dell-probe
}
That is all for the Nagios side.
DELL server install
Operating system
I have only tested OpenManage on Windows, but OpenManage for Linux should work the same way. If you try it, please let me know.
NSCA
NSCA stands for "Nagios Service Check Acceptor", a Nagios add-on that can be downloaded from the official Nagios web site. This tutorial is based on version 2.4.
NSCA has a server component, which runs on the Nagios box, and a client. The NSCA client for Windows can be found at Nagios Exchange.
The Windows client is called send_nsca.exe. Client and server installations are documented in the NSCA package.
OpenManage
OpenManage has to be installed on the remote server: insert the CD-ROM bundled with the DELL server and follow the instructions. A reboot may be necessary.
I have tested a few versions of OpenManage, from 1.5.0 to 1.7.0.
Configuration
NSCA
On the Nagios box, I run the NSCA server under the xinetd daemon, with all options left at their defaults.
On the Windows box, I created a directory named C:\program files\send_nsca to hold the NSCA client files. It should contain:
- libmcrypt.dll
- send_nsca.cfg
- send_nsca.exe
- send_nsca.cmd
The only file that does not come with the standard distribution is send_nsca.cmd. It is a script I wrote, which OpenManage will call upon hardware failure. Here it is:
@echo off
rem From OpenManage to Nagios with love
rem (c)Xavier Dusart 11/2004 - http://xavier.dusart.free.fr/
echo %COMPUTERNAME%;Hardware;2;%1 | "c:\program files\send_nsca\send_nsca" -H nagios_server -d ; -c "c:\program files\send_nsca\send_nsca.cfg"
Pay attention to the following parts of the script, which you may need to adapt:
- %COMPUTERNAME%: as it happens, I named my hosts in Nagios after their NetBIOS names, in CAPITAL LETTERS. If you did not, replace the Windows variable %COMPUTERNAME% with the name of the Windows server as known by Nagios (host names are case-sensitive in Nagios).
- %1 is the variable holding the parameter given to the OpenManage command (see OpenManage configuration below). It is a line of text describing the event.
- 2: I always send a "critical" (2) state to Nagios, because OpenManage does not notify when things return to normal, so the script is only called upon failure.
- nagios_server: replace it with the name or IP address of your Nagios box.
- c:\program files\send_nsca: replace it with the directory where you installed the NSCA client, if it is not this one.
The NSCA client configuration file, send_nsca.cfg, contains a single line: encryption_method=1. It must match the decryption method configured on the NSCA server.
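To make the layout of the line piped into send_nsca explicit, here is a minimal Python sketch that assembles it the same way the batch script does: host name, service description, state and text, joined by the delimiter passed with -d. The function name build_nsca_line and its sanity checks are my own illustration and are not part of NSCA.

```python
# Sketch only: mirrors the line that send_nsca.cmd echoes into send_nsca.
# The function name and its checks are illustrative, not part of NSCA.

VALID_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def build_nsca_line(host, service, state, output, delim=";"):
    """Return one service check result line, fields joined by delim."""
    if state not in VALID_STATES:
        raise ValueError("state must be 0, 1, 2 or 3")
    # A delimiter inside the host or service name would shift the fields.
    for field in (host, service):
        if delim in field:
            raise ValueError("delimiter found in %r" % field)
    if "\n" in output:
        raise ValueError("plugin output must fit on one line")
    return delim.join([host, service, str(state), output]) + "\n"


# Same values as the batch script, with a sample event description.
print(build_nsca_line("MY_SERVER", "Hardware", 2, "Power supply critical"), end="")
```

The service description ("Hardware") has to match the service_description of the dell-probe template, otherwise Nagios cannot attach the result to the service.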
OpenManage
Point your browser to https://MY_SERVER:1311 and go to Alarm management. For each event you want to be notified of, instruct OpenManage to run c:\progra~1\send_nsca\send_nsca.cmd "alarm description". The alarm description may be, for example, "Power supply critical".
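Rather than waiting for a real hardware failure, it can help to push a test result through send_nsca by hand and check that it shows up in Nagios. The Python sketch below does what send_nsca.cmd does, using only the -H, -d and -c options already seen in the script. The function names, and the assumption that send_nsca can be found on the PATH (otherwise pass its full path as binary), are mine.

```python
import subprocess


def send_nsca_argv(nagios_server, config_path, delim=";", binary="send_nsca"):
    """Build the argument list with the same -H, -d and -c options as send_nsca.cmd."""
    return [binary, "-H", nagios_server, "-d", delim, "-c", config_path]


def send_test_result(host, description, nagios_server, config_path,
                     binary="send_nsca"):
    """Pipe one critical "Hardware" result into send_nsca; return its exit code."""
    # Same field order as the batch script: host;service;state;text
    line = ";".join([host, "Hardware", "2", description]) + "\n"
    completed = subprocess.run(
        send_nsca_argv(nagios_server, config_path, binary=binary),
        input=line,           # fed to send_nsca on standard input
        text=True,
        capture_output=True,
        check=False,          # let the caller inspect the exit code
    )
    return completed.returncode
```

Calling send_test_result("MY_SERVER", "Test event", "nagios_server", r"c:\program files\send_nsca\send_nsca.cfg") should turn the "Hardware" service critical in Nagios if the whole chain works.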
Process
- OpenManage sends a critical notification to Nagios for the "Hardware" service of the local host.
- The admin (that means you) reacts to the alarm and, hopefully, solves the problem.
- The admin manually submits a passive check result to Nagios, with an OK state, to reset the alarm.
- The first time you restart Nagios with a new "Hardware" service, submit the same passive check result to initialize the service state. Otherwise it will stay in the "pending" state.
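One possible way to submit the OK result by hand is to write a PROCESS_SERVICE_CHECK_RESULT external command to the Nagios command file. The Python sketch below is only an illustration of that approach: the command file path is an assumption (check the command_file setting in your nagios.cfg), the function names and the "Reset by admin" text are mine, and external commands must be enabled in Nagios for it to work.

```python
import time

# Assumed location; check the command_file setting in nagios.cfg.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def format_passive_result(host, service, state, output, timestamp=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if timestamp is None:
        timestamp = int(time.time())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, state, output)


def reset_hardware_alarm(host, command_file=COMMAND_FILE):
    """Send an OK (0) result for the "Hardware" service of the given host."""
    line = format_passive_result(host, "Hardware", 0, "Reset by admin")
    # The command file is a named pipe read by the Nagios daemon.
    with open(command_file, "w") as handle:
        handle.write(line)
```

Running reset_hardware_alarm("MY_SERVER") on the Nagios box, as a user allowed to write to the command file, should bring the service back to OK. The same call can be used to initialize a new service that is still pending.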
Options
Add a link to OpenManage in Nagios
To add a link to OpenManage in Nagios, you can use the Nagios configuration file serviceextinfo.cfg:
define serviceextinfo{
host_name MY_SERVER
service_description Hardware
notes_url https://$HOSTNAME$:1311/
icon_image dell.gif
}
dell.gif is the DELL logo I use as the icon.
OpenManage easy configuration
Configuring alarms through the OpenManage web interface can be tedious, but the settings are stored in flat files that you can edit directly. Depending on your OpenManage version, you will have one of these:
C:\Program Files\Dell\OpenManage\omsa\ini\dcprv32.Ini, where you can append the following lines:
.../...
[HWC Configuration]
lraRObj.settings.00B5=256
lraRObj.epfName.00B5=c:\progra~1\send_nsca\send_nsca.cmd "Panne detectee par un capteur de ventilateur"
lraRObj.settings.00BD=256
lraRObj.epfName.00BD=c:\progra~1\send_nsca\send_nsca.cmd "Echec anticipe de la memoire"
lraRObj.settings.00BE=256
lraRObj.epfName.00BE=c:\progra~1\send_nsca\send_nsca.cmd "Echec de la memoire"
lraRObj.settings.00B1=256
lraRObj.epfName.00B1=c:\progra~1\send_nsca\send_nsca.cmd "Bloc d'alimentation critique"
lraRObj.settings.00BB=256
lraRObj.epfName.00BB=c:\progra~1\send_nsca\send_nsca.cmd "Degradation de la redondance"
lraRObj.settings.00BC=256
lraRObj.epfName.00BC=c:\progra~1\send_nsca\send_nsca.cmd "Perte de la redondance"
lraRObj.settings.00B2=256
lraRObj.epfName.00B2=c:\progra~1\send_nsca\send_nsca.cmd "Avertissement des capteurs de temperature"
lraRObj.settings.00B3=256
lraRObj.epfName.00B3=c:\progra~1\send_nsca\send_nsca.cmd "Panne detectee par un capteur de temperature"
lraRObj.settings.00B6=256
lraRObj.epfName.00B6=c:\progra~1\send_nsca\send_nsca.cmd "Avertissement des capteurs de tension"
lraRObj.settings.00B7=256
lraRObj.epfName.00B7=c:\progra~1\send_nsca\send_nsca.cmd "Panne detectee par un capteur de tension"
lraRObj.settings.00B4=256
lraRObj.epfName.00B4=c:\progra~1\send_nsca\send_nsca.cmd "Avertissement des capteurs de ventilateur"
lraRObj.settings.00B8=256
lraRObj.epfName.00B8=c:\progra~1\send_nsca\send_nsca.cmd "Avertissement des capteurs de courant"
lraRObj.settings.00BA=256
lraRObj.epfName.00BA=c:\progra~1\send_nsca\send_nsca.cmd "Detection d'une intrusion dans le chassis"
lraRObj.settings.00B9=256
lraRObj.epfName.00B9=c:\progra~1\send_nsca\send_nsca.cmd "Panne detectee par les capteurs de courant"
Or C:\Program Files\Dell\OpenManage\omsa\ini\dclrdy32.ini, which must contain (full file):
;--------------------------------------------------------------------
;
; Dell Inc. PROPRIETARY INFORMATION
; This software is supplied under the terms of a license agreement or
; nondisclosure agreement with Dell Inc. and may not
; be copied or disclosed except in accordance with the terms of that
; agreement.
;
; Copyright (c) 1995-2004 Dell Inc.
; All Rights Reserved.
;
; Module Name:
;
; DCLRDY32.INI
;
; Abstract/Purpose:
;
; Local Response Agent ("Dynamic" Data) INI file
;
;--------------------------------------------------------------------
; XDU 09/2004: copy to C:\Program Files\Dell\OpenManage\omsa\ini
; check the "LRA Resp Configuration Section" section
; restart the "Systems management data manager" service
[LRA Resp Configuration Section]
lrarespid.0x00=175 ; Watchdog ASR event
lrarespid.0x01=177 ; Power supply critical
lrarespid.0x02=178 ; Temperature non-critical
lrarespid.0x03=179 ; Temperature critical
lrarespid.0x04=180 ; Fan non-critical
lrarespid.0x05=181 ; Fan critical
lrarespid.0x06=182 ; Voltage non-critical
lrarespid.0x07=183 ; Voltage critical
lrarespid.0x08=184 ; Current non-critical
lrarespid.0x09=185 ; Current critical
lrarespid.0x0a=186 ; Intrusion detected
lrarespid.0x0b=187 ; Redundancy degraded
lrarespid.0x0c=188 ; Redundancy lost
lrarespid.0x0d=189 ; Memory ECC error non-critical
lrarespid.0x0e=190 ; Memory ECC error critical
lrarespid.0x0f=304 ; Hardware log (ESM log) near full
lrarespid.0x10=305 ; Hardware log (ESM log) full
lrarespid.0x11=306 ; Processor warning
lrarespid.0x12=307 ; Processor failure
lrarespid.0x13=308 ; Power supply non-critical
[175]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "ASR de surveillance"
[177]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Bloc d'alimentation critique"
[178]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Avertissement des capteurs de temperature"
[179]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Panne detectee par un capteur de temperature"
[180]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Avertissement des capteurs de ventilateur"
[181]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Panne detectee par un capteur de ventilateur"
[182]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Avertissement des capteurs de tension"
[183]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Panne detectee par un capteur de tension"
[184]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Avertissement des capteurs de courant"
[185]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Panne detectee par les capteurs de courant"
[186]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Detection d'une intrusion dans le chassis"
[187]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Degradation de la redondance"
[188]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Perte de la redondance"
[189]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Echec anticipe de la memoire"
[190]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Echec de la memoire"
[304]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Journal du materiel : avertissement"
[305]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Journal du materiel : erreur"
[306]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Processeur : avertissement"
[307]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Processeur : panne"
[308]
settings=256
epfName=c:\progra~1\send_nsca\send_nsca.cmd "Bloc d'alimentation : avertissement"
[LRA Prot Configuration Section]
lraprotid.0x00=1045 ; Thermal Protect
[1045]
activateTimeout=60
reCheckTimeout=6
condition=0
canBeForced=true
[HWC Configuration]
migrationCompleted=TRUE
Sorry for the event descriptions being in French. You will have to translate them into your own language. Be sure to check that your OpenManage version uses the same kind of file. Once your changes are done, restart the OpenManage service, or reboot your server.
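If you need the descriptions in another language, the per-event sections of dclrdy32.ini can be generated instead of edited by hand. The Python sketch below is a minimal illustration: the event IDs and English descriptions are taken from the comments in the file above, settings=256 is copied as-is from my file, and the names EVENTS and render_sections are mine. Only three events are listed to keep it short.

```python
# Script that OpenManage runs for each event, as used in this tutorial.
SCRIPT = r"c:\progra~1\send_nsca\send_nsca.cmd"

# Event ID -> description, taken from the comments of dclrdy32.ini.
EVENTS = {
    177: "Power supply critical",
    179: "Temperature critical",
    181: "Fan critical",
}


def render_sections(events, script=SCRIPT):
    """Return the [id] / settings / epfName sections for the given events."""
    lines = []
    for event_id in sorted(events):
        description = events[event_id]
        # The description is wrapped in double quotes on the command line.
        if '"' in description:
            raise ValueError("description must not contain double quotes")
        lines.append("[%d]" % event_id)
        lines.append("settings=256")  # value copied from the original file
        lines.append('epfName=%s "%s"' % (script, description))
    return "\n".join(lines) + "\n"


print(render_sections(EVENTS), end="")
```

The output can be pasted in place of the matching sections of the file. The [LRA Resp Configuration Section] list of IDs still has to be kept as it is.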
To Do
I'm sure you can now adapt this tutorial to Compaq Insight Manager, for example. If you do so, please let me know how.
Thanks
- To Maël Le Saout, who pointed out the potential problem with the Windows variable %COMPUTERNAME%.