AWS monitor heartbeat
Objective: Continuously use gs-netcat to test all GSRN servers (within and outside of AWS). On failure notify the admin. This is a fully functional test. For non-functional tests and for general system metrics (like CPU and Memory usage) use NetData instead.
The functional test heartbeat.sh is embedded within a docker image.
- Use AWS Elastic Container Registry (ECR) to store the docker image
- Use AWS Fargate to run the docker image
- Use AWS CloudWatch to send a notification if GSRN goes bad
We use AWS region us-east-2.
Select IAM -> Policies
- Select `Create Policy`.
- Under Service select `Elastic Container Registry`.
- Select `All Elastic Container Registry actions (ecr:*)`.
- Under `Resources` select `Specific` and `Add ARN`, and for `Repository name` select `Any`.
- Click `Next: Tags` and `Next: Review`.
- Under Name specify `ECR_FullAccess` (or any other name you like).
- Click `Create policy`.
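
The console steps above should correspond roughly to the policy document sketched below. This is an approximation, not the file the console generates: the file name `ecr-full-access.json` is hypothetical, and the `Resource` ARN is an assumption (the console normally embeds your own account ID). The `aws iam create-policy` call is only echoed, so nothing is created until you remove the `echo`.

```shell
# Hypothetical file name. The statement approximates
# "all ECR actions on any repository".
cat > ecr-full-access.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ecr:*",
      "Resource": "arn:aws:ecr:*:*:repository/*"
    }
  ]
}
EOF

# Dry run: prints the command. Remove "echo" to create the policy for real.
echo aws iam create-policy \
  --policy-name ECR_FullAccess \
  --policy-document file://ecr-full-access.json
```

Note that `ecr:GetAuthorizationToken` cannot be scoped to a repository ARN. In this setup it is covered by the `AmazonEC2ContainerRegistryPowerUser` managed policy attached to the user in the next step.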
Create a new user under IAM -> Users
- Select `Add users` and name the new user `fargate_user`.
- Select `Programmatic access` and `AWS Management Console Access`.
- De-select `User must create a new password at next sign-in`.
- Click `Next: Permissions`.
- Select `Attach existing policies directly` and select `AmazonECS_FullAccess`, `AmazonEC2ContainerRegistryPowerUser` and `ECR_FullAccess`.
- Click `Next: Tags`, then `Next: Review` and then `Create user`.
- Note down the `Account ID`, `Access key ID`, `Secret access key` and `Password`.
Select Elastic Container Registry.
- Select `Create repository`, fill in the name of the repository as `gsrn_heartbeat` and leave everything else at its default.
- Note down the `URI` (e.g. [account ID].dkr.ecr.us-east-2.amazonaws.com/gsrn_heartbeat).
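
For reference, the repository URI is composed of the account ID, the region and the repository name, and the repository itself can also be created from the command line. The sketch below assumes a placeholder account ID (`123456789012`) and only echoes the `aws ecr create-repository` call.

```shell
ACCOUNT_ID="123456789012"   # placeholder, replace with your own Account ID
REGION="us-east-2"
REPO="gsrn_heartbeat"

# Repository URI in the same form as shown in the ECR console.
ECR_URI="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO}"
echo "$ECR_URI"

# Dry run: prints the command. Remove "echo" to create the repository for real.
echo aws ecr create-repository --repository-name "$REPO" --region "$REGION"
```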
Sign in to AWS using the credentials of fargate_user:
aws configure
Retrieve the ECR login password and pipe it to docker so that it signs in to the ECR registry:
aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin [account ID].dkr.ecr.us-east-2.amazonaws.com
Build the Docker image:
docker build -t gsrn_hb .
Tag the Docker image with the URI:
docker tag gsrn_hb [account ID].dkr.ecr.us-east-2.amazonaws.com/gsrn_heartbeat
Push the tagged image:
docker push [account ID].dkr.ecr.us-east-2.amazonaws.com/gsrn_heartbeat
With your AWS user (not fargate_user):
- Go to `Elastic Container Service (ECS)`.
- Select `Create Cluster` and `Networking only`.
- Name the cluster `fargate-gsrn-heartbeat` and leave the rest as is.
- Select `Create new Task Definition` and `Fargate`.
- Enter the name for the task (`GSRN heartbeat`).
- Select `0.5` for Task Memory and `0.25` vCPU for Task CPU.
Under Container definition select Add Container.
- We use `GSRN-Heartbeat` as Container name.
- Enter the URI of the Docker image: `[account ID].dkr.ecr.us-east-2.amazonaws.com/gsrn_heartbeat`
- Select `Linux` for Operating system family.
- Under Command enter `gs1.thc.org gs2.thc.org gs3.thc.org gs4.thc.org gs5.thc.org`.
- Select `Add` and then `Create`.
Select the GSRNHeartbeat:1 task and under Action select Run Task.
- Set Launch Type to `FARGATE`.
- Select any Cluster VPC and any Subnets.
- Select `Run Task`.
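
The same task can presumably be started from the command line with `aws ecs run-task`. The sketch below is a dry run under a few assumptions that are not in the steps above: the subnet ID is a placeholder, and `assignPublicIp=ENABLED` is assumed so that the task can reach the GSRN servers without a NAT gateway.

```shell
CLUSTER="fargate-gsrn-heartbeat"
TASK_DEF="GSRNHeartbeat:1"             # task definition and revision as listed in the console
SUBNET_ID="subnet-0123456789abcdef0"   # placeholder, use one of your own subnets

# awsvpc network settings. The public IP is an assumption (see lead-in).
NETWORK="awsvpcConfiguration={subnets=[${SUBNET_ID}],assignPublicIp=ENABLED}"

# Dry run: prints the command. Remove "echo" to start the task for real.
echo aws ecs run-task \
  --region us-east-2 \
  --cluster "$CLUSTER" \
  --launch-type FARGATE \
  --task-definition "$TASK_DEF" \
  --network-configuration "$NETWORK"
```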
The steps are:
- Create a Topic under AWS Simple Notification Service (SNS) and add an email address to it.
- Create a filter that matches a string from the log file and records every match to a metric.
- Create an alarm if a change in the metric is detected.
Go to Simple Notification Service -> Create Topic
- Use `Standard`.
- Set the Name to `GSRN-Heartbeat`.
- Click `Create topic` at the bottom right.
Create a subscription
- Set Protocol to `Email`.
- Under Endpoint enter the email address. I use [email protected].
- Select `Create Subscription` at the bottom right.
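
A command-line equivalent of the topic and subscription steps might look like the sketch below. The account ID and the email address `admin@example.com` are placeholders, and both commands are only echoed.

```shell
REGION="us-east-2"
ACCOUNT_ID="123456789012"    # placeholder, replace with your own Account ID
TOPIC_NAME="GSRN-Heartbeat"
EMAIL="admin@example.com"    # hypothetical address, replace with your own

# Standard topic ARNs have the form arn:aws:sns:<region>:<account>:<name>.
TOPIC_ARN="arn:aws:sns:${REGION}:${ACCOUNT_ID}:${TOPIC_NAME}"

# Dry run: prints the commands. Remove "echo" to run them for real.
echo aws sns create-topic --region "$REGION" --name "$TOPIC_NAME"
echo aws sns subscribe --region "$REGION" --topic-arn "$TOPIC_ARN" \
  --protocol email --notification-endpoint "$EMAIL"
```

An email subscription stays pending until the recipient follows the confirmation link that SNS sends to the address.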
It is not directly possible to match a pattern in a log file and raise an alarm (e.g. send an email). Instead a pattern is matched and a metric is created. Then an alarm can be triggered when the metric changes.
The heartbeat docker instance creates a log entry with "OK_COUNT=" every 60 seconds (unless GSRN is failing). Create a metric entry every time this pattern is encountered (1 every 60 seconds).
Go to Logs -> Log Group
- Select the log group `/ecs/GSRN-heartbeat`.
- Click `Action` -> `Create Metric Filter`.
- Under Filter Pattern enter `OK_COUNT=` and click `Next`.
- Under Filter Name write `OK-COUNT-FILTER`.
Under Metric Details set
- Metric namespace to `GSRN Heartbeat`
- Metric name to `OK-COUNT`
- Metric value to `1`
- Default value to `0`
- Click on `Next` and then `Create metric filter` at the bottom right.
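
The metric filter above can likely be reproduced with `aws logs put-metric-filter`. This is a dry-run sketch that reuses the names from the console steps. Passing the metric transformation as JSON is a choice made here to avoid quoting problems with the space in the namespace.

```shell
LOG_GROUP="/ecs/GSRN-heartbeat"
PATTERN='OK_COUNT='   # same pattern as entered in the console

# One transformation: emit 1 per matching log line, 0 when nothing matches.
TRANSFORM='[{"metricName":"OK-COUNT","metricNamespace":"GSRN Heartbeat","metricValue":"1","defaultValue":0}]'

# Dry run: prints the command. Remove "echo" to create the filter for real.
echo aws logs put-metric-filter \
  --region us-east-2 \
  --log-group-name "$LOG_GROUP" \
  --filter-name OK-COUNT-FILTER \
  --filter-pattern "$PATTERN" \
  --metric-transformations "$TRANSFORM"
```

If CloudWatch rejects the unquoted pattern, wrapping the term in double quotes (`'"OK_COUNT="'`) should match the same literal string.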
Go to Alarm -> All Alarms -> Create alarm -> Select metric
- Under Custom namespace click on `GSRN Heartbeat`.
- Click on `Metrics with no dimensions`.
- Select `OK-COUNT`.
- Click on `Select metric` at the bottom right.
- Under Conditions select `Lower` and under than... write `1`.
- Under Additional configuration -> Missing data treatment select `Treat missing data as bad (breaching threshold)`.
- Select `Next` at the bottom right.
- Under Send a notification to... select `GSRN-Heartbeat1`.
- Click `Next` at the bottom right.
Under Add name and description:
- Set Alarm Name to `FAILED GSRN HEARTBEAT`.
- Set Alarm description to `GSRN Server may be down. Check Cloudwatch log.`.
- Click `Next` at the bottom right and then `Create alarm`.
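
As a rough command-line counterpart to the alarm steps, the sketch below echoes an `aws cloudwatch put-metric-alarm` call. Several values are assumptions because the steps above do not state them: the statistic (`Sum`), the period (300 seconds), a single evaluation period, and a topic ARN built from a placeholder account ID and the topic name `GSRN-Heartbeat`.

```shell
TOPIC_ARN="arn:aws:sns:us-east-2:123456789012:GSRN-Heartbeat"  # placeholder account ID
STATISTIC="Sum"                  # assumption, not stated in the console steps
PERIOD=300                       # assumption: seconds per evaluation period
COMPARISON="LessThanThreshold"   # corresponds to "Lower" than 1

# Dry run: prints the command. Remove "echo" to create the alarm for real.
echo aws cloudwatch put-metric-alarm \
  --region us-east-2 \
  --alarm-name "FAILED GSRN HEARTBEAT" \
  --alarm-description "GSRN Server may be down. Check Cloudwatch log." \
  --namespace "GSRN Heartbeat" \
  --metric-name OK-COUNT \
  --statistic "$STATISTIC" \
  --period "$PERIOD" \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator "$COMPARISON" \
  --treat-missing-data breaching \
  --alarm-actions "$TOPIC_ARN"
```

With these values the alarm would fire when fewer than one `OK_COUNT=` line is recorded within five minutes, or when no data arrives at all.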
TODO:
- How to allow task to have NAT access to Internet without assigning a public IP and without creating my own NAT GW?
- How to include the log entry that triggered the metrics alarm?
Helpful links: