ELK + EA — Silencing ElastAlert Alerts

Many shops are realizing the benefit of the ELK stack (Elastic Stack) and the flexibility it brings to an infrastructure through centralized logging and reporting, which have always been critical when troubleshooting difficult or distributed problems. Having many input options to choose from (via Elastic Beats) and a lot of flexibility in data manipulation (via Logstash) has only increased the usefulness of this stack. As a consultant, I’m finding this stack deployed more and more often with clients, and it’s enjoyable to work with.

I’ve had the opportunity to implement ElastAlert to provide monitoring and alerting services against an established Elastic Stack deployment. ElastAlert is a Yelp project written in Python and is “a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch”.

With ElastAlert, much of what has traditionally been monitored via Nagios, or a similar tool, can now be done against Elasticsearch. ElastAlert also provides many notification options, templates, and formats. There is also a fairly straightforward enhancement process: when additional processing or manipulation is desired, local modifications can be made against the framework without diverging from the main code base.

Coming from a very strong background in Nagios and related tools, I see one failing in ElastAlert (there is an existing enhancement request for it): there is no way to silence, suppress, or acknowledge alerts. A rule is either alerting or not, subject to the realert setting. This is a huge inconvenience when alerts are not resolved immediately, because ElastAlert will keep notifying the alerting endpoint about problems that are already being acted upon and require no new action. This is (IMO) a bad way to do business, as the noise may result in missed alerts and poor service to customers.

ElastAlert will send out an alert on every run that produces a match unless the elastalert_status index (or equivalent metadata index) holds an entry of the silence type for the rule in question, i.e. “_type:silence”, with an “until” field that has not yet expired. This is how ElastAlert honors the realert period when notifications have already been sent and the rule runs more frequently than the realert time. We can add an appropriate entry to this index ourselves to silence an alert, which provides the same functionality as acknowledging in Nagios or New Relic, or similar behavior in other alerting systems.
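The decision described above boils down to a timestamp comparison. Here is a minimal sketch of it, not ElastAlert's own code: silence_active is a hypothetical helper name, and I'm assuming both timestamps are UTC and written in the same fixed-width format, so that a plain string comparison orders them correctly.

```shell
#!/bin/sh
# Sketch only: an alert stays quiet while "now" is earlier than the "until"
# value stored in the rule's silence entry.
# $1 = until timestamp, $2 = current timestamp (both assumed UTC, same layout)
silence_active() {
  until_ts=$1
  now_ts=$2
  # expr prints 1 when the string comparison holds; LC_ALL=C forces byte order.
  [ "$(LC_ALL=C expr "$now_ts" \< "$until_ts")" = "1" ]
}

if silence_active "2017-08-07T20:43:24.000000Z" "2017-08-07T17:00:00.000000Z"; then
  echo "silenced"
else
  echo "alerting"
fi
```

Once the current time passes the until value, the helper returns non-zero and the rule would alert again.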

To provide an example, start with a rule that has the following configuration:

es_host: 172.16.0.4
es_port: 80
aws_region: us-west-2
name: "nginx-web-app error in logs"
index: sales-promos-*
type: any
filter:
  - query_string:
      query: "type:log AND source:'/var/log/nginx/sales-promo.log' AND message: *error*"
query_delay:
  minutes: 1
query_key: host
realert:
  minutes: 15
alert:
 - "sns"
sns_topic_arn: arn:aws:sns:us-west-2:************:duty-pager-alerts

The rule name that we would use is “nginx-web-app error in logs”. The realert time is 15 minutes, so whenever errors appear in the logs we’ll receive an alert every 15 minutes for as long as the error condition continues. To suppress this alert for four hours, we’d issue the following curl command (or similar):

$ export ES_ENDPOINT=172.16.0.4
$ export ES_INDEX=elastalert_status

$ curl -X POST "https://${ES_ENDPOINT}/${ES_INDEX}/silence/" -H 'Content-Type: application/json' -d '{
  "rule_name": "nginx-web-app error in logs.ip-172-16-0-10",
  "@timestamp": "2017-08-07T16:43:24.000000Z",
  "exponent": 0,
  "until": "2017-08-07T20:43:24.000000Z"
}'
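Typing the two timestamps by hand is error-prone, so here is a rough sketch of how they could be generated instead. The function names silence_payload and silence_for_hours are my own (hypothetical), the wrapper assumes GNU date (the -d '+N hours' form is not available in BSD date), and it relies on the ES_ENDPOINT and ES_INDEX variables exported above. Rule names containing double quotes would need escaping, which this sketch does not do.

```shell
#!/bin/sh
# Hypothetical helper: print a silence document for the status index.
# $1 = rule name (with ".<query_key value>" appended when needed)
# $2 = @timestamp, $3 = until; both UTC, e.g. 2017-08-07T16:43:24.000000Z
silence_payload() {
  printf '{"rule_name": "%s", "@timestamp": "%s", "exponent": 0, "until": "%s"}\n' "$1" "$2" "$3"
}

# Hypothetical wrapper: silence a rule for N hours starting now.
# Assumes GNU date and the ES_ENDPOINT / ES_INDEX variables.
silence_for_hours() {
  rule=$1
  hours=$2
  now_ts=$(date -u +%Y-%m-%dT%H:%M:%S.000000Z)
  until_ts=$(date -u -d "+${hours} hours" +%Y-%m-%dT%H:%M:%S.000000Z)
  curl -X POST "https://${ES_ENDPOINT}/${ES_INDEX}/silence/" \
    -H 'Content-Type: application/json' \
    -d "$(silence_payload "$rule" "$now_ts" "$until_ts")"
}

# Print the document only; nothing is sent to Elasticsearch here.
silence_payload "nginx-web-app error in logs.ip-172-16-0-10" \
  "2017-08-07T16:43:24.000000Z" "2017-08-07T20:43:24.000000Z"
```

With this in place, something like silence_for_hours "nginx-web-app error in logs.ip-172-16-0-10" 4 would post an entry equivalent to the one above, with the timestamps filled in for the current time.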

Note also that when a rule uses a query_key, an individual value of that key can be silenced without silencing the alert in general. This is incredibly helpful, as one problem should not disable the entire monitoring system. The example above silences the alert only for the host with a hostname of ip-172-16-0-10, by appending the hostname to the rule name after a period. Be aware that if a silence entry includes a query_key value when the rule itself does not define a query_key, ElastAlert will fail to run.
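To make the naming convention concrete, here is a small sketch that builds the rule_name value used in a silence entry. The helper name silence_rule_name is hypothetical; the only thing taken from the example above is the "rule name, period, query_key value" layout.

```shell
#!/bin/sh
# Hypothetical helper: build the rule_name value for a silence entry.
# $1 = rule name from the rule file
# $2 = optional query_key value (here, a hostname)
silence_rule_name() {
  if [ -n "${2:-}" ]; then
    # Rule defines a query_key: target a single value of that key.
    printf '%s.%s\n' "$1" "$2"
  else
    # Rule has no query_key: use the bare rule name.
    printf '%s\n' "$1"
  fi
}

silence_rule_name "nginx-web-app error in logs" "ip-172-16-0-10"
silence_rule_name "nginx-web-app error in logs"
```

Only pass the second argument for rules that actually define a query_key, for the reason given above.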

To delete an entry, for example when a bad entry has caused an error, first locate the index ID (the _id) of that entry and then issue a curl DELETE, i.e.:

$ curl -X DELETE https://${ES_ENDPOINT}/${ES_INDEX}/silence/${index_id_of_bad_entry}

It may take a few minutes for Elasticsearch to re-index the data after a delete, so the error may not go away immediately.
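One way to locate the ID of an entry is to search the silence type for the rule name. The following is a sketch under a few assumptions: the function names are hypothetical, the term query assumes rule_name is indexed as an exact-match (non-analyzed) field, and the wrapper relies on the ES_ENDPOINT and ES_INDEX variables from earlier. If rule_name is an analyzed text field in your index, a different query will be needed.

```shell
#!/bin/sh
# Hypothetical helper: print a search body matching silence entries for one
# rule_name. Assumes rule_name is an exact-match (non-analyzed) field.
silence_search_body() {
  printf '{"query": {"term": {"rule_name": "%s"}}}\n' "$1"
}

# Hypothetical wrapper: run the search against the status index. Each hit in
# the response carries an "_id" field; that value goes into the DELETE URL.
find_silence_entries() {
  curl -s -X POST "https://${ES_ENDPOINT}/${ES_INDEX}/silence/_search?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(silence_search_body "$1")"
}

# Print the query only; nothing is sent to Elasticsearch here.
silence_search_body "nginx-web-app error in logs.ip-172-16-0-10"
```

Calling find_silence_entries with the same rule name returns the matching documents, and the _id of the bad one is what replaces ${index_id_of_bad_entry} in the DELETE command.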


Comments

One response to “ELK + EA — Silencing ElastAlert Alerts”

  1. Ajay B.

    Fully agree with your observations. Ability to ack, delete or silence alerts is required when in a data center type of setup a central monitoring station needs to act/ initiate actions on alerts.
