Monday, February 25, 2019

How to use Nagios to monitor services running in an AWS Auto Scaling group (ASG)

What is expected from the reader

The reader is expected to have basic knowledge of the technologies and concepts below.
  • Nagios Core
  • AWS Auto Scaling Group (ASG)
  • PHP REST API
  • DevOps

Problem statement:

If we deploy an application on AWS EC2 instances, we can monitor EC2 health states using CloudWatch, but CloudWatch does not report the health or status of the application services running on those instances. Tools like Dynatrace and Datadog provide both application and hardware monitoring, but they are superfluous when we already have CloudWatch. To monitor application services, we can use the open source Nagios solution, which sends an instant alert if an application service goes down. The only problem with Nagios is that you need to keep instance hostnames or IP addresses up to date on the Nagios server. This is straightforward for a standalone instance, but it becomes challenging when instances are dynamically added or removed by an AWS ASG.

Solution to this problem:

The approach described here is to build a mechanism where each newly added instance automatically registers itself with the Nagios server. Each instance sends an XML document with its details to the Nagios server as soon as it starts. This is done through a PHP REST API exposed by the Nagios server. The PHP code on the Nagios server processes this XML, updates the host file with the new instance details, and then restarts the Nagios service automatically, so monitoring of the new instance starts immediately. In addition, a scheduled job on the Nagios server cleans up instances that have been scaled down.
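To make the "update the host file" step concrete, here is a minimal sketch of the kind of host entry the server side could generate for a new instance. It is written in Python for illustration only (the actual service is PHP), and the function name and the "generic-host" template are assumptions, not taken from the downloadable script.

```python
def render_host_definition(host_name: str, address: str, hostgroup: str,
                           template: str = "generic-host") -> str:
    """Return a Nagios host object definition for one instance.

    The template name is an assumption; use whichever host template
    your own Nagios configuration defines.
    """
    return (
        "define host {\n"
        f"    use         {template}\n"
        f"    host_name   {host_name}\n"
        f"    address     {address}\n"
        f"    hostgroups  {hostgroup}\n"
        "}\n"
    )


if __name__ == "__main__":
    # Hypothetical instance details, as they might arrive in the XML.
    print(render_host_definition("i-0abc123", "10.0.0.5", "webapp"))
```

An entry like this would be appended to the host file before Nagios is restarted. The hostgroup value is where the instance's application name ends up.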

Required Scripts:

There are different sets of scripts for this solution. 

1. PHP REST API: 

Keep this script on your Nagios server at /var/www/rest.com/public_html. The Nagios server hosts this RESTful web service, written in PHP, to serve requests from new instances. Because it works purely over HTTP calls, the Nagios server only needs to accept inbound connections on its web port (port 80 for HTTP, or 443 if you serve the API over HTTPS, as the URLs in this section assume). You will need to install PHP on your Nagios server. You can download the script from the location below.

Download script from here. 

Complete the steps below after you copy the code to the Nagios server.
  1. Change all variables inside the Config.php file to match your environment.
  2. Create the database as described in section 3.
  3. Add the Apache configuration file as described in section 4.
If everything is configured properly and the code is working, you should be able to access the REST API endpoints below. You can try the showall API using Postman; if the setup is correct, you should get all rows from the database as output.
  1. https://<nagios_ip>/register/me - Registers a new instance; pass XML with the instance details.
  2. https://<nagios_ip>/register/showall - Displays all instances registered on the Nagios server.
  3. https://<nagios_ip>/cleanup - Cleans old and deleted instances out of the Nagios host file.
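As an illustration of what a call to the register/me endpoint could look like from a client, here is a small Python sketch. The XML element names (instance, hostname, ip, application) are hypothetical; the real payload format is defined by the downloadable scripts, so adjust the names to match them.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_registration_xml(hostname: str, ip: str, application: str) -> bytes:
    """Build an XML payload describing one instance (element names are assumed)."""
    root = ET.Element("instance")
    ET.SubElement(root, "hostname").text = hostname
    ET.SubElement(root, "ip").text = ip
    ET.SubElement(root, "application").text = application
    return ET.tostring(root, encoding="utf-8")


def register(nagios_host: str, payload: bytes, timeout: float = 10.0) -> int:
    """POST the payload to the register/me endpoint and return the HTTP status."""
    request = urllib.request.Request(
        f"https://{nagios_host}/register/me",
        data=payload,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A client would call `register("<nagios_ip>", build_registration_xml(...))` once at boot and again on each scheduled run.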

2. Auto registration scripts:

You can download this script from below location.

Download script from here.

The DevOps engineer needs to agree to and perform the steps below.
  1. Install the NRPE/NSCP Nagios agents before creating the AMI, and set allowed_hosts to the actual Nagios server IPs/hostnames.
  2. Copy this script and the XML file to C:\nagios on Windows instances and to /usr/local/nagios on Linux instances, and give the required users write permission on those folders.
  3. Update the value of the <ns></ns> XML tag in the register.xml file with the actual Nagios server IPs/hostnames. Use comma-separated values if you have master/slave Nagios servers for high availability. The auto registration script uses these details to register the instance.
  4. Bake this script and the XML file into the AMI so that every new instance already has them.
  5. Create a cron job or Windows scheduled task to run auto_register.sh or auto_register.ps1 automatically when the instance starts and every hour after that. Section 2.1 has the scripts required to create the schedulers.
  6. Each EC2 instance should have an AWS tag with the key "Application" and an appropriate value. Note that the value of this tag must be equal to a hostgroup_name on the Nagios server.
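The comma-separated <ns> value from step 3 can be read as a list of servers to register with. The sketch below shows one way to do that in Python. Only the <ns> tag comes from the description above; the surrounding <register> root element in the example is an assumption about the layout of register.xml.

```python
import xml.etree.ElementTree as ET


def nagios_servers(register_xml: str) -> list[str]:
    """Return the Nagios server IPs/hostnames listed in the <ns> tag."""
    root = ET.fromstring(register_xml)
    # iter() also matches the root element, so <ns> can sit at any depth.
    node = next(root.iter("ns"), None)
    if node is None or not node.text:
        return []
    return [entry.strip() for entry in node.text.split(",") if entry.strip()]


if __name__ == "__main__":
    sample = "<register><ns>10.0.0.10, nagios2.example.com</ns></register>"
    for server in nagios_servers(sample):
        print(server)
```

With a master/slave pair, the registration script would loop over this list and register with each server in turn.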

2.1 Create scheduler script:

For Windows instances: the DevOps engineer can use the PowerShell script below to create the scheduler on Windows instances. Make sure to keep all files in C:\nagios\

For Linux instances: the DevOps engineer can use the shell script below to create the scheduler on Linux instances. Make sure to keep all files in /usr/local/nagios/
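For the Linux side, the schedule "at instance start and every hour" maps onto two crontab lines. The Python sketch below only generates those lines for illustration and is not the scheduler script itself. It assumes a cron implementation that supports the @reboot keyword (cronie and Vixie cron do) and that the script sits in /usr/local/nagios/.

```python
def crontab_entries(script: str = "/usr/local/nagios/auto_register.sh") -> list[str]:
    """Return crontab lines that run the script at boot and at the top of every hour."""
    return [
        f"@reboot {script}",     # run once when the instance starts
        f"0 * * * * {script}",   # run at minute 0 of every hour
    ]


if __name__ == "__main__":
    print("\n".join(crontab_entries()))
```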

3. Create database script:

The PHP REST API on your Nagios server needs a database to keep track of currently active instances. You can use either RDS or a local database. You can use the script below to create the database.
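To give a feel for what such a registry might hold, here is a sketch using Python's built-in SQLite module as a stand-in for whichever database you choose. The table and column names are hypothetical and are not the schema from the author's script.

```python
import sqlite3

# Hypothetical schema: one row per currently registered instance.
SCHEMA = """
CREATE TABLE IF NOT EXISTS instances (
    hostname      TEXT PRIMARY KEY,
    ip            TEXT NOT NULL,
    application   TEXT NOT NULL,
    registered_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def open_registry(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the registry database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def upsert_instance(conn: sqlite3.Connection, hostname: str, ip: str, application: str) -> None:
    """Insert an instance, or refresh its row if it registers again."""
    conn.execute(
        "INSERT OR REPLACE INTO instances (hostname, ip, application) VALUES (?, ?, ?)",
        (hostname, ip, application),
    )
    conn.commit()
```

Because instances re-register every hour, replacing the row on each call keeps one row per host and refreshes its timestamp.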

4. Apache configuration

Apache routes REST API requests to your PHP code based on the mappings in its configuration file. Create the file /etc/httpd/conf.d/rest.com.conf and add the content described below.

5. Cleanup Job

Because instances scaled down by the ASG have to be removed automatically, we need a scheduled job to clean up down hosts from the Nagios host file. You can create the cron job below on your Nagios server. This job automatically finds hosts that have been down for a specific period of time (configured in the Config.php script) and removes them.
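The selection rule behind the cleanup ("down for longer than a configured period") can be sketched as follows. This is an illustrative Python version, not the PHP code behind the cleanup endpoint. The input mapping of host name to the time it went down is an assumed data shape.

```python
from datetime import datetime, timedelta


def hosts_to_remove(down_since: dict[str, datetime],
                    now: datetime,
                    max_down: timedelta) -> list[str]:
    """Return hosts that have been down for at least max_down, sorted by name."""
    return sorted(
        host for host, since in down_since.items()
        if now - since >= max_down
    )


if __name__ == "__main__":
    now = datetime(2019, 2, 25, 12, 0)
    down = {
        "i-0abc123": now - timedelta(hours=3),     # long gone, likely scaled down
        "i-0def456": now - timedelta(minutes=10),  # may just be rebooting
    }
    print(hosts_to_remove(down, now, timedelta(hours=1)))
```

A threshold that is too short risks removing hosts that are only restarting, so it is worth setting the period comfortably above your normal reboot time.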

That's it! You are all set with your automated Nagios monitoring solution for instances under an ASG. Please let me know in the comment section if you face any issues and I will be happy to assist you.