Infrastructure Monitoring

Proactive monitoring of our services is key to maintaining our high level of availability. By monitoring every layer of our infrastructure and web application, we can respond quickly to any service degradation, often taking action automatically before any noticeable impact has occurred. Such actions include failing over to different hardware or removing failed nodes from a cluster.
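As an illustration only, the removal of failed nodes described above might be sketched as follows. The `Cluster` class, the node names and the threshold of three consecutive failures are all hypothetical and are not taken from our production tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical cluster model that counts consecutive failed health checks."""

    nodes: set[str]
    max_failures: int = 3  # assumed threshold, not a documented value
    failures: dict[str, int] = field(default_factory=dict)

    def record_check(self, node: str, healthy: bool) -> bool:
        """Record one health-check result; return True if the node was removed."""
        if node not in self.nodes:
            return False
        if healthy:
            # A successful check resets the consecutive-failure counter.
            self.failures[node] = 0
            return False
        self.failures[node] = self.failures.get(node, 0) + 1
        if self.failures[node] >= self.max_failures:
            # Take the node out of service so traffic goes to healthy nodes only.
            self.nodes.discard(node)
            return True
        return False
```

Requiring several consecutive failures before removal is one way to avoid reacting to a single transient error; the right number depends on the check interval.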

We also monitor for issues affecting capacity, identifying when resources need to be scaled out through the addition of more application servers, or when extra storage needs to be provisioned.
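A minimal sketch of this kind of capacity check is shown below. The function names, the 60% CPU target and the 80% storage threshold are assumptions made for the example, not our actual settings.

```python
import math


def servers_needed(current_servers: int, avg_cpu_percent: float,
                   target_cpu_percent: float = 60.0) -> int:
    """Estimate how many servers bring average CPU down to the target.

    Only ever scales out: the result is never below the current count.
    """
    if current_servers < 1 or target_cpu_percent <= 0:
        raise ValueError("need at least one server and a positive target")
    total_load = current_servers * avg_cpu_percent
    return max(current_servers, math.ceil(total_load / target_cpu_percent))


def storage_needs_provisioning(used_gb: float, total_gb: float,
                               threshold: float = 0.8) -> bool:
    """Return True once usage reaches the (assumed) provisioning threshold."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    return used_gb / total_gb >= threshold
```

For example, four servers averaging 90% CPU against a 60% target would suggest scaling out to six.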

Network monitoring

We monitor connectivity to our infrastructure from a remote datacentre operated by a third-party provider. In the unlikely event of a complete loss of network connectivity, our technical teams are immediately alerted via email and/or SMS and can investigate accordingly.

Our service uptime is calculated from this remote location, so it accurately represents the experience of a remote user accessing our services.
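To make the calculation concrete, the sketch below derives an uptime percentage from a series of remote probe results. It assumes each probe is recorded as a simple success or failure at a regular interval; the function name and data shape are hypothetical.

```python
def uptime_percent(probe_results: list[bool]) -> float:
    """Return the percentage of remote probes that succeeded."""
    if not probe_results:
        raise ValueError("at least one probe result is required")
    # True counts as 1 and False as 0, so sum() gives the number of successes.
    return 100.0 * sum(probe_results) / len(probe_results)
```

Under this assumption, one failed probe out of 1,000 corresponds to 99.9% uptime.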

Security monitoring

We utilise a range of industry-leading monitoring tools to constantly analyse traffic to and from our infrastructure for signs of potential security incidents. All audit data generated by the ESB platform is continuously monitored for signs of unusual activity, and any findings are acted upon accordingly.

High availability and disaster recovery monitoring

An important part of our monitoring is ensuring the health of our high availability and disaster recovery measures. For example, we monitor the health of the nodes that make up a high availability cluster service such as SQL Server, and we monitor the production and validity of backups to ensure our disaster recovery procedures can be implemented if necessary.
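One way a backup check of this kind could work is sketched below. It assumes a backup is treated as valid when it completed within the last 24 hours and its SHA-256 checksum matches the value recorded at creation time; both criteria and the function name are illustrative assumptions rather than a description of our actual procedure.

```python
import hashlib
from datetime import datetime, timedelta


def backup_is_valid(data: bytes, expected_sha256: str,
                    completed_at: datetime, now: datetime,
                    max_age: timedelta = timedelta(hours=24)) -> bool:
    """Check that a backup is both recent enough and uncorrupted."""
    # Production: the backup must have completed within the allowed window.
    fresh = timedelta(0) <= now - completed_at <= max_age
    # Validity: the content must match the checksum recorded at creation time.
    intact = hashlib.sha256(data).hexdigest() == expected_sha256
    return fresh and intact
```

A checksum match shows only that the file is unchanged; a periodic test restore is a stronger validity check.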

Tooling and technology

Our infrastructure monitoring system is based on Nagios, a powerful and reliable open-source monitoring platform.

We run a mixture of Linux and Windows workloads, each tightly integrated with Nagios through a combination of passive and active checks. We also integrate with CloudWatch so we can track performance metrics for the underlying resources provided by AWS.
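The sketch below illustrates the two check styles. Active check plugins report status to Nagios through their exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), while passive results are submitted as `PROCESS_SERVICE_CHECK_RESULT` external commands. The function names, host name, service name and threshold values are hypothetical examples.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(value: float, warn: float, crit: float) -> int:
    """Map a metric value to a Nagios status (higher values are worse)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def passive_result_line(host: str, service: str, code: int,
                        output: str, timestamp: int) -> str:
    """Format a passive service check result as a Nagios external command."""
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{code};{output}")
```

An active check would exit with the code returned by `evaluate`, whereas a passive check would write the formatted line to the Nagios external command file, whose location depends on the installation.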

Currently, over 500 metrics are monitored, each with its own set of thresholds and associated actions.