The iSeries just sits there most of the time, plugging away without being challenged. Plus, the system never seems to go down. Why not use it to monitor those unreliable servers sitting around it? I created some CL programs and one RPG program to loop through and check the e-mail route and each of our servers, routers, etc. The messages are then sent to our on-call digital phone (that is the reason for the cryptic messages) or to any e-mail address we choose. I have included an e-mail checker and a device checker. We are able to use the e-mail checker because we have a second e-mail server that is otherwise unused. The messages are sent out through this server, not the one we are checking.
The NetMonitor program is intended to check the e-mail service and a list of specified IP addresses. The whole purpose is to detect outages quickly so that the systems can be kept at the highest possible availability.
This is the main program; it acts as a platform that calls the other programs to perform their specific functions. MonitorM runs two different cycles, one for e-mail and one for devices. Data areas determine the length of each cycle. The e-mail cycle switches between two loop values depending on whether the current time falls within the off-hours. The off-hours are also defined through data areas. This allows MIS to set a period when the checking cycle can be a little more relaxed.
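The choice of e-mail loop value can be sketched as follows. This is only an illustration in Python, not the CL code itself. It assumes the loop values are expressed in minutes and that the off-hour window may wrap past midnight; neither detail is stated above. The function and parameter names are hypothetical, and the comments show which data area each parameter would correspond to.

```python
from datetime import time


def in_off_hours(now: time, start: time, end: time) -> bool:
    """Return True when `now` falls inside the off-hour window [start, end)."""
    if start <= end:
        return start <= now < end
    # Assumption: a window such as 22:00 to 06:00 wraps past midnight.
    return now >= start or now < end


def email_loop_minutes(now: time, off_start: time, off_end: time,
                       prime_minutes: int, off_minutes: int) -> int:
    """Pick the e-mail loop value for the current time.

    off_start     ~ NETEMAILSS (off-hour start time)
    off_end       ~ NETEMAILSE (off-hour end time)
    prime_minutes ~ NETMONPRIM (prime-time loop value)
    off_minutes   ~ NETMONOFFH (off-hour loop value)
    """
    if in_off_hours(now, off_start, off_end):
        return off_minutes
    return prime_minutes
```

For example, with an off-hour window of 22:00 to 06:00, a prime-time value of 5 and an off-hour value of 30, a check at 23:00 would use 30 and a check at noon would use 5.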
This program will send an e-mail to the Your Relay account, then wait. If no response is received, a message is sent to the on-call phone indicating that the e-mail service is down. The system will then go into aggressive mode: e-mails are sent and responses are checked every minute instead of at the current e-mail loop value. When the e-mail service is restored, a message is sent to the on-call phone indicating the restoration.
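The down, aggressive and restored sequence described here behaves like a small state machine. The following Python sketch shows one way to express it; it is not the CL program. The names `EmailMonitorState` and `email_step` and the alert texts are hypothetical, and the loop value is assumed to be in minutes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class EmailMonitorState:
    down: bool = False  # True while the e-mail route is considered down


def email_step(state: EmailMonitorState, reply_received: bool,
               normal_minutes: int) -> Tuple[Optional[str], int]:
    """Process one e-mail check.

    Returns (alert, wait_minutes): `alert` is the text for the on-call
    phone, or None when nothing should be sent; `wait_minutes` is the
    delay before the next check.
    """
    alert = None
    if not reply_received and not state.down:
        state.down = True
        alert = "EMAIL DOWN"        # hypothetical message text
    elif reply_received and state.down:
        state.down = False
        alert = "EMAIL RESTORED"    # hypothetical message text
    # Aggressive mode: check every minute while the route is down.
    wait_minutes = 1 if state.down else normal_minutes
    return alert, wait_minutes
```

Only the transitions produce a message, so the on-call phone is not paged again on every one-minute check while the outage lasts.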
This program is called from MonitorM to check for the return e-mail from the Your Relay loop. All return messages are sent to QSYSOPR at Your400. This program also deletes the distributions for QSYSOPR, getting it ready for the next check. A variable is sent back to MonitorM to let the main program know whether a return message was received.
This program is also called out of MonitorM. It reads through file DeviceD in Your Library. A ping is performed on each IP address. If the ping is unsuccessful, the loop counter for that address is incremented. Once the defined threshold is met, a message is sent to the on-call phone and to the distribution list Your List on the AS/400, giving the IP address and name of the device that is not responding. The date and time of the outage are also recorded in the message, and a down flag is set.
The message to the distribution list on the AS/400 is determined by a flag set on each device.
When a device has a down flag, the system goes into aggressive mode. Instead of pinging each defined cycle, the system will start pinging every minute until the device is up. This will happen no matter what the threshold value for the device is set to.
If the connection is subsequently restored, a message is sent to the on-call phone reporting the recovery. The date and time of the recovery are also sent in the message.
However, if the device goes down and recovers before the threshold is reached, no message is sent.
The loop cycle value for devices is also stored in a data area.
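The threshold, down flag and recovery rules for devices can be sketched as below. This is an illustrative Python model, not the CL and RPG code. The class and function names, the `misses` counter name and the message layout are hypothetical. It also assumes that the recorded outage time is the time of the first failed ping in a streak and that the device loop value is in minutes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Device:
    address: str                            # DEVADR
    name: str                               # DEVNAM
    threshold: int                          # DEVTHR
    misses: int = 0                         # failed-ping counter (hypothetical name)
    down: bool = False                      # DEVDFL
    down_since: Optional[datetime] = None   # DEVDAT / DEVTIM


def record_ping(dev: Device, ping_ok: bool, now: datetime) -> Optional[str]:
    """Update one device after a ping and return an alert text, or None."""
    if ping_ok:
        was_down = dev.down
        dev.misses = 0
        dev.down = False
        if was_down:
            return f"{dev.name} {dev.address} UP {now:%Y-%m-%d %H:%M}"
        return None  # recovered before the threshold: stay silent
    dev.misses += 1
    if dev.misses == 1:
        dev.down_since = now  # assumption: time of the first failed ping
    if not dev.down and dev.misses >= dev.threshold:
        dev.down = True
        return f"{dev.name} {dev.address} DOWN {dev.down_since:%Y-%m-%d %H:%M}"
    return None


def device_loop_minutes(devices: List[Device], normal_minutes: int) -> int:
    """Aggressive mode: ping every minute while any device is flagged down."""
    return 1 if any(d.down for d in devices) else normal_minutes
```

A device with a threshold of 2 that fails two pings in a row produces one DOWN message; further failures stay silent until a successful ping produces the UP message.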
This is the RPG program called out of MonitorD to increment the counter or reset it, depending on the situation. It also updates the file with the date and time of the initial outage and sets the down flag.
3. Files to Update
This is the Device Definitions file. Any device with an IP address can be defined in this file to allow MonitorD to check it.
Fields
DEVADR - IP address of the device to check.
DEVNAM - Name of the device to check.
DEVFLG - Flag to allow the program to check or skip this device (Y - Check, N - Skip).
DEVTHR - Threshold. Set this to the number of cycles to wait before alerting MIS (each cycle is determined by the value in the data area).
DEVLPF - Loop Flag. Leave as is.
DEVDAT - The last date the line was not up. Leave as is for new devices.
DEVTIM - The last time the line was not up. Leave as is for new devices.
DEVDFL - The Down Flag, used to send the program into aggressive mode for this device.
DEVMSG - The flag used to determine whether a break message is sent to the Your List distribution list.
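As a rough illustration of how the DEVFLG and DEVMSG flags steer the checker, the Python sketch below filters a list of device records and decides where an alert would go. It is not taken from the CL source. The records are plain dictionaries with made-up sample values, the function names are hypothetical, and DEVMSG is assumed to use the same Y/N values as DEVFLG, which the field description does not state.

```python
from typing import Dict, List


def devices_to_check(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the devices whose DEVFLG says they should be checked."""
    return [r for r in records if r["DEVFLG"] == "Y"]


def alert_targets(record: Dict[str, str]) -> List[str]:
    """The on-call phone is always alerted; the list only when DEVMSG is set."""
    targets = ["on-call phone"]
    if record["DEVMSG"] == "Y":  # assumption: Y/N values, as for DEVFLG
        targets.append("distribution list")
    return targets


# Made-up sample records for illustration only.
sample = [
    {"DEVADR": "192.0.2.10", "DEVNAM": "ROUTER1", "DEVFLG": "Y", "DEVMSG": "Y"},
    {"DEVADR": "192.0.2.11", "DEVNAM": "SERVER1", "DEVFLG": "N", "DEVMSG": "N"},
    {"DEVADR": "192.0.2.12", "DEVNAM": "SERVER2", "DEVFLG": "Y", "DEVMSG": "N"},
]
```

With this sample, SERVER1 is skipped entirely, an outage of ROUTER1 goes to both the phone and the distribution list, and an outage of SERVER2 goes to the phone only.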
4. Data Areas
NETEMAILSS - Net Monitor off-hour Start Time. This is set to the start time of the off-hours for e-mail checking.
NETEMAILSE - Net Monitor off-hour End Time. This is set to the ending time of the off-hours for e-mail checking.
NETMONOFFH - Net Monitor Loop Value. This is used to set the e-mail loop cycle value during off-hours.
NETMONPRIM - Net Monitor Loop Value. This is used to set the e-mail loop cycle value during prime time.
NETMONDEVL - Net Monitor Loop Value. This is used to set the device loop cycle value.