Include status in stats about delivery times

Previously these metrics weren't very useful because they could be skewed by long timings for failed notifications, which can take up to 72 hours to deliver.

I'm intentionally not trying to have a dual running period (with the old and new names) because:

- We don't use the current stats for anything (checking Grafana).

- The current stats get turned into a "bucket" metric in Prometheus [1][2], which isn't very useful because it can only tell us the mean time to deliver, but we're actually interested in percentiles.

Switching to a new naming is an opportunity to fix the raw data and the way it's aggregated, using the same kind of "summary" metric that we now use for stats about our Celery tasks [3].

[1]: c330a8ac8a/paas/statsd/statsd-mapping.yml (L82)
[2]: https://prometheus.io/docs/practices/histograms/#quantiles
[3]: https://github.com/alphagov/notifications-aws/pull/890
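
To illustrate the skew described in the message, here is a minimal sketch using made-up numbers (nine quick deliveries and one failure reported after 72 hours). The sample values and variable names are assumptions for illustration only, not data from the service.

```python
import statistics

# Hypothetical elapsed times in seconds; not real measurements.
delivered = [2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 8.0, 10.0]
failed = [72 * 60 * 60.0]  # one callback arriving after 72 hours

combined = delivered + failed

# A single metric for all statuses: the mean is dominated by the failure.
mean_combined = statistics.mean(combined)

# The median (50th percentile) of the same data is barely affected.
median_combined = statistics.median(combined)

# Splitting by status keeps the "delivered" timings meaningful on their own.
mean_delivered = statistics.mean(delivered)

print(f"mean, all statuses:   {mean_combined:.1f}s")
print(f"median, all statuses: {median_combined:.2f}s")
print(f"mean, delivered only: {mean_delivered:.2f}s")
```

With these numbers a single slow failure pulls the combined mean to several hours, while the median and the per-status mean stay at a few seconds. That is the motivation for both the per-status names and the interest in percentiles.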
2025-12-22 00:11:16 -05:00 · 2021-10-20 17:22:59 +01:00
parent c84daf0b7b
commit f974108934
4 changed files with 8 additions and 4 deletions
--- a/app/celery/process_sms_client_response_tasks.py
+++ b/app/celery/process_sms_client_response_tasks.py
@@ -75,7 +75,7 @@ def _process_for_status(notification_status, client_name, provider_reference, de

    if notification.sent_at:
        statsd_client.timing_with_dates(
-            'callback.{}.elapsed-time'.format(client_name.lower()),
+            f'callback.{client_name.lower()}.{notification_status}.elapsed-time',
            datetime.utcnow(),
            notification.sent_at
        )
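
As a rough sketch of what the new stat names look like, the snippet below builds the name the same way as the changed line and records a timing with a stand-in client. `RecordingStatsd` and `elapsed_time_metric` are hypothetical helpers, and the assumption that `timing_with_dates` reports `start - end` in seconds is not confirmed by this diff. The provider name and statuses are example values.

```python
from datetime import datetime, timedelta


def elapsed_time_metric(client_name, notification_status):
    # Same f-string shape as the changed line in the diff.
    return f'callback.{client_name.lower()}.{notification_status}.elapsed-time'


class RecordingStatsd:
    """Hypothetical stand-in for the real statsd client; it only records calls."""

    def __init__(self):
        self.timings = []

    def timing_with_dates(self, stat, start, end):
        # Assumption: the elapsed time is (start - end), matching the argument
        # order used in the diff (current time first, sent_at second).
        self.timings.append((stat, (start - end).total_seconds()))


statsd_client = RecordingStatsd()
now = datetime(2021, 10, 20, 12, 0, 0)

# Example values only: one quick delivery and one failure after 72 hours.
statsd_client.timing_with_dates(
    elapsed_time_metric('MMG', 'delivered'), now, now - timedelta(seconds=4)
)
statsd_client.timing_with_dates(
    elapsed_time_metric('MMG', 'permanent-failure'), now, now - timedelta(hours=72)
)

for stat, seconds in statsd_client.timings:
    print(stat, seconds)
```

Because the status is part of the name, slow failures land in their own series and no longer distort the timings for delivered notifications.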