If a polygon is smaller than the largest polygon in our dataset of
simplified polygons then we’re only throwing away useful detail by
simplifying it.
We should still simplify larger polygons as a fallback, to avoid sending
anything to the CBC that we’re not sure it will like.
The thresholds here are low: we can raise them as we test and experiment
more.
Here’s some data about the Flood Warning Service polygons
Percentile | 80% | 90% | 95% | 98% | 99% | 99.9%
-----------|-----|-------|--------|---------|---------|---------
Point count| 226 | 401.9 | 640.45 | 1015.38 | 1389.07 | 3008.609
Percentile | 80% | 90% | 95% | 98% | 99% | 99.9%
--------------|-----|-------|--------|---------|---------|---------
Polygon count |2----|3------|5-------|8--------|10-------|40.469
This new version of utils implements the transformation of our polygons
to a Cartesian plane. In other words, it converts them from being
defined in spherical degrees to metres.
For the API this means our simplification will be slightly more
accurate.
As stated in the comment, this would have been helpful during an
incident to give further reassurance that a task had at least
started running - at the time the only evidence for this was the
Cronitor dashboard itself, which we don't often look at.
I've removed other, equivalent "starting" logs, but kept those
that provide additional information in the log message.
Note that the new base class doesn't include a bespoke feature we
had here: 'log_on_worker_shutdown'. We've agreed it's reasonable
to remove it for now as it was introduced many years ago and its
use case is unclear - we can always add it back if needed.
We have made it so that gov.uk/alerts shows a ‘1 planned test’ banner
for the whole of the day when there has been an operator test on that
day.
We need to remove the banner when the day is over.
The most straightforward way to do this is to republish the site at the
start of every day. The gov.uk/alerts code[1] will work out if there are
or aren’t any planned tests to show that day.
1. 5a274af6d0/app/models/alerts.py (L38-L44)
There were two problems with the existing message.
1. There was no space between the new status and the time taken
which made reading and searching harder
2. They key bits of information (before and after status) were
separated by the time taken (which will always be unique) meaning
you couldn't do an easy search for a message that is say in delivered
being attempted to be set to temporary-failure.
Previously we were catching one type of exception if something went
wrong adding a notification to the queue for high volume services.
In reality there are two types of exception so this adds a second
handler to cover both.
For context, this is code we changed experimentally as part of the
upgrade to Celery 5 [1]. At the time we didn't check how the new
exception compared to the old one. It turns out they behaved the
same and we were always vulnerable to the scenario now covered by
the second exception, where the behaviour has changed in Celery 5 -
testing with a large task invocation gives...
Before (Celery 3, large-ish task):
'process_job.apply_async(["a" * 200000])'...
boto.exception.SQSError: SQSError: 400 Bad Request
<?xml version="1.0"?><ErrorResponse xmlns="http://queue.amazonaws.com/doc/2012-11-05/"><Error><Type>Sender</Type><Code>InvalidParameterValue</Code><Message>One or more parameters are invalid. Reason: Message must be shorter than 262144 bytes.</Message><Detail/></Error><RequestId>96162552-cd96-5a14-b3a5-7f503300a662</RequestId></ErrorResponse>
Before (Celery 3, very large task):
<hangs forever>
After (Celery 5, large-ish task):
botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the SendMessage operation: One or more parameters are invalid. Reason: Message must be shorter than 262144 bytes.
After (Celery 5, very large task):
botocore.parsers.ResponseParserError: Unable to parse response (syntax error: line 1, column 0), invalid XML received. Further retries may succeed:
b'HTTP content length exceeded 1662976 bytes.'
[1]: 29c92a9e54
Previously these logs wouldn't have a Request ID attached since the
Celery hooks run after the __call__ method where we enable request
tracing for normal application logs. For the failure log especially
it will be useful to have this feature.
Previously we passed along this piece of state via the kwargs for
a task, but this runs the risk of the task accidentally receiving
the extra kwarg unless we've covered all the code paths that could
invoke it directly e.g. retries don't invoke __call__.
This switches to using Celery "headers" to pass the extra state. It
turns out that a Celery has two "header" concepts, which leads to
some confusion and even a bug with the framework [1]:
- In older (pre v4.4) versions of Celery, the "headers" specified
by apply_async() would become _the_ headers in the message that
gets passed around workers, etc. These would be available later on
via "self.request.headers".
- Since Celery protocol v2, the meaning of "headers" in the message
changed to become (basically) _all_ metadata about the task [2],
with the "headers" option in apply_async() being merged [3] into
the big dict of metadata.
This makes using headers a bit confusing unfortunately, since the
data structure we put in is subtly different to what comes out in
the request context. Nonetheless, it still works. I've added some
comments to try and clarify it.
Note that one of the original tests is no longer necessary, since we
don't need to worry about argument passing styles with headers.
[1]: https://github.com/celery/celery/issues/4875
[2]: 663e4d3a0b (diff-07a65448b2db3252a9711766beec23372715cd7597c3e309bf53859eabc0107fR343)
[3]: 681a922220/celery/app/amqp.py (L495)
We already had the `replay-create-pdf-for-templated-letter` command.
This adds a new command,
`recreate-pdf-for-precompiled-or-uploaded-letter` which does the same
thing but for non-templated letters.
This adds a task which is designed to be used if we want to recreate the
PDF for a precompiled letter (either one that has been created using the
API or one that has been uploaded through the website).
The task takes the `notification_id` of the letter and passes template
preview the details it needs in order to sanitise the original file and
then replace the version in the letters-pdf bucket with the freshly
sanitised version.
Previously this would repeat the task even the current iteration of
the loop had processed a non-full batch. This could cause the task
to error incorrectly if one or two notifications breach the timeout
threshold in between iterations.
From experimenting in production we found a "!=" caused the engine
to use a sequential scan, whereas explicitly listing all the types
ensured an index scan was used.
We also found that querying for many (over 100K) items leads to
the task stalling - no logs, but no evidence of it running either -
so we also add a limit to the query.
Since the query now only returns a subset of notifications, we need
to ensure the subsequent "update" query operates on the same batch.
Also, as a temporary measure, we have a loop in the task code to
ensure it operates on the total set of notifications to "time out",
which we assume is less than 500K for the time being.
We use this config option when running workers that process non-memory-safe tasks to restart the worker after n tasks.
Celery 5 requires this to be passed as an int or None.
Signed-off-by: Richard Baker <richard.baker@digital.cabinet-office.gov.uk>
This was added in Celery 4 [1]. and appears to be incompatible with
our approach of injecting "request_id" into task arguments (example
exception below). Although our other apps are on Celery 5 our logs
don't show any similar issues, probably because all their tasks are
invoked without request IDs. In the longterm we should decide if we
want to enable argument checking and fix the tracing approach, or
stop tracing request IDs in Celery tasks.
[1]: https://docs.celeryproject.org/en/stable/userguide/tasks.html#argument-checking
2021-11-01T11:37:36 delivery delivery ERROR None "RETRY: Email notification f69a9305-686f-42eb-a2ee-61bc2ba1f5f3 failed" [in /Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py:68]
Traceback (most recent call last):
File "/Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py", line 53, in deliver_email
raise TypeError("test retry")
TypeError: test retry
[2021-11-01 11:37:36,385: ERROR/ForkPoolWorker-1] RETRY: Email notification f69a9305-686f-42eb-a2ee-61bc2ba1f5f3 failed
Traceback (most recent call last):
File "/Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py", line 53, in deliver_email
raise TypeError("test retry")
TypeError: test retry
[2021-11-01 11:37:36,394: WARNING/ForkPoolWorker-1] Task deliver_email[449cd221-173c-4e18-83ac-229e88c029a5] reject requeue=False: deliver_email() got an unexpected keyword argument 'request_id'
Traceback (most recent call last):
File "/Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py", line 53, in deliver_email
raise TypeError("test retry")
TypeError: test retry
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/task.py", line 731, in retry
S.apply_async()
File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/canvas.py", line 219, in apply_async
return _apply(args, kwargs, **options)
File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/task.py", line 537, in apply_async
check_arguments(*(args or ()), **(kwargs or {}))
TypeError: deliver_email() got an unexpected keyword argument 'request_id'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/trace.py", line 450, in trace_task
R = retval = fun(*args, **kwargs)
File "/Users/benthorner/Documents/Projects/api/app/celery/celery.py", line 74, in __call__
return super().__call__(*args, **kwargs)
File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/trace.py", line 731, in __protected_call__
return self.run(*args, **kwargs)
File "/Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py", line 71, in deliver_email
self.retry(queue=QueueNames.RETRY)
File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/task.py", line 733, in retry
raise Reject(exc, requeue=False)
celery.exceptions.Reject: (TypeError("deliver_email() got an unexpected keyword argument 'request_id'",), False)
previously, we were confusing things by appending to CELERY_QUEUES in
both dev and test configs - these are executed at import time, so the
list contained all queues twice, regardless of what config you're
actually using.
Fortunately, the -Q command that we supply the workers with overrides
this config option, so other environments weren't affected. Given that,
we can tidy up this code by just declaring it in the base config every
time
Previously we sent them emails about this manually. We also tried
a Zendesk macro/trigger approach, but using a CC means:
- We can control the behaviour ourselves (Zendesk triggers can only
be edited by admins outside our team).
- We keep the DVLA notification approach consistent and in one place,
so notifications always go to the same people.
- Any further (public) updates to the ticket will also trigger a
notification to DVLA (previous trigger only notified on creation).
This reverts commit f2f2509c9b.
Raw request stats were added to investigate a hunch about a
performance issue we were seeing [1], but turned out not to
be relevant. We don't use them anymore so we can tidy up.
[1]: https://github.com/alphagov/notifications-api/pull/2858
When a precompiled letter is sent to us, we set the `to` field as
'Provided as PDF' in
1c1023a877/app/v2/notifications/post_notifications.py (L100-L104)
This then also sets `normalised_to` as `providedaspdf`.
However, when template preview sanitises the letter, pulls out the
address and gives it to the API, we were only setting `to` to be
the new address and had forgotten to also amend `normalised_to` to
be the normalised version. This meant that for all these letters
we accidentally left `normalised_to` as `providedaspdf`. The impact
of this was that we can not then search for these letters in the
admin user interface as they rely on the `normalised_to` field
containing the recipient address.
This commit fixes that bug by also setting the `normalised_to`
field
We don’t store everything that comes in the CAP XML when someone creates
a broadcast via the API.
One thing we do store is `<identifier>` (in a column called `reference`)
which is a unique (to the external system) identifier for the broadcast.
We show this in the front end instead of the template name, because
broadcasts created from the API don’t use templates.
However this ID isn’t very friendly – the Environment Agency just supply
a UUID.
The Environment Agency also populate the `<event>` field with some human
readable text, for example:
> 013 Issue Severe Flood Warning EA
(013 is an area code which will be meaningful to the Flood Warning
Service team)
We should show this in the UI instead of the reference. The first step
towards this is storing it in the database and returning it in the REST
endpoints.
Later we can have the admin app prefer `cap_event` over `reference`,
where `cap_event` is present.
We can’t backfill this data because we don’t keep a copy of the original
XML.
Seems like `<event>` is a mandatory property of `<info>`, so we don’t
need to worry about the field being missing (`<info>` is optional in
CAP but we require it because it contains stuff like the areas which
we need in order to send out the broadcast`).
***
https://www.pivotaltracker.com/story/show/176927060
Previously these metrics weren't very useful because they could be
skewed by long timings for failed notifications, which can take up
to 72 hours to deliver. I'm intentionally not trying to have a dual
running period (with the old and new names) because:
- We don't use the current stats for anything (checking Grafana).
- The current stats get turned into a "bucket" metric in Prometheus
[1][2], which isn't very useful because it can only tell us the mean
time to deliver, but we're actually interested in percentiles.
Switching to a new naming is an opportunity to fix the raw data and
the way it's aggregated, using the same kind of "summary" metric that
we now use for stats about our Celery tasks [3].
[1]: c330a8ac8a/paas/statsd/statsd-mapping.yml (L82)
[2]: https://prometheus.io/docs/practices/histograms/#quantiles
[3]: https://github.com/alphagov/notifications-aws/pull/890
The `auto-expire-broadcast-messages` task checks for expired broadcasts
at five minute intervals. This change now calls the
`publish-govuk-alerts` task in govuk-alerts if there are expired
broadcasts so that the site is updated.
Co-authored-by: Katie Smith <katie.smith@digital.cabinet-office.gov.uk>
When we send or cancel a broadcast message, we now trigger a task
in govuk-alerts repo that polls our API for alerts and
publishes a fresh list of alerts.
Co-authored-by: Pea Tyczynska <pea.tyczynska@digital.cabinet-office.gov.uk>
Previously this was causing the wrapper function to become a
command before it started mirroring the original (functools.wraps),
which meant any previous option decorators were "lost".*
We didn't notice the problem in the original PR [1] because the new
command under test has its option decorators *after* the command
decorator, in contrast with all other (now broken) commands.
The original wrapper applied the functools decorator first [2],
so this change just reinstates that ordering.
*This is a hand-wavey explanation as I haven't looked into how
functools.wraps interacts with option decorators.
[1]: 922fd2f333#
[2]: 922fd2f333 (diff-c4e75c8613e916687a97191a7a79110dfb47e96ef7df96f7ba25dd94ba64943dL101)
include a link to a runbook entry.
also the list of acknowledgement files can be very long, so make that
the last thing, and use new lines to space out the message.
It’s confusing that changing `MAX_VERIFY_CODE_COUNT` also limits the
number of failed login attempts that a user of text messages 2FA can
make.
This makes the parameters independent, and adds a test to make sure any
future changes which affect the limit of failed login attempts are
covered.
I was doing some analysis and saw that in the last 24 hours the most
codes that anyone had was in a 15 minute window was 3.
So I think we can safely reduce this to 5 to get a bit more security
with enough headroom to not have any negative impact to the user.
People with dyslexia and dyscalculia find it difficult to transpose
codes which have consecutive, repeated digits[1].
This commits enhances the algorithm for generating codes to not repeat
the previous digit in a code.
This reduces the key space for our codes from 100,000 possibilities to
65,610 possibilities.
1. https://twitter.com/annaecook/status/1442567679710150662
This updates the tickets that are created when the
`check_if_letters_still_pending_virus_check` scheduled task detects
letters in the `pending-virus-check` state.