Commit Graph

8709 Commits

Author SHA1 Message Date
David McDonald
18776e4160 Merge pull request #3377 from alphagov/zero-case-performance-page
Fix division by zero error on performance page
2021-11-22 13:44:32 +00:00
David McDonald
106187ba04 Fix division by zero error on performance page
For preview and staging environments, we often send no messages
in a single day. This is currently causing a `DivisionByZero` error
that is rendering the page with no results. This makes it impossible
to look at preview/staging and see if the performance page is
working correctly or not.

(psycopg2.errors.DivisionByZero) division by zero

[SQL: SELECT CAST(ft_processing_time.bst_date AS TEXT) AS date, ft_processing_time.messages_total AS ft_processing_time_messages_total, ft_processing_time.messages_within_10_secs AS ft_processing_time_messages_within_10_secs, (ft_processing_time.messages_within_10_secs / CAST(ft_processing_time.messages_total AS FLOAT)) * %(param_1)s AS percentage
FROM ft_processing_time
WHERE ft_processing_time.bst_date >= %(bst_date_1)s AND ft_processing_time.bst_date <= %(bst_date_2)s ORDER BY ft_processing_time.bst_date]
[parameters: {'param_1': 100, 'bst_date_1': datetime.date(2021, 11, 12), 'bst_date_2': datetime.date(2021, 11, 19)}]
(Background on this error at: http://sqlalche.me/e/14/9h9h)

I've fixed this by falling back to 100.0% for days we send
no messages. Maybe some argument that it should be N/A rather than
100% but I think it doesn't really matter as this is only
going to affect preview and staging as we will never have a day
sending no messages in production.
2021-11-22 11:11:52 +00:00
Chris Hill-Scott
2b6a550cdc Merge pull request #3372 from alphagov/update-utils-coordinate-transformation-2
Update utils to bring in coordinate transformation (attempt 2)
2021-11-18 16:10:45 +00:00
Chris Hill-Scott
c0742fe83d Pass polygons through if they’re small already
If a polygon is smaller than the largest polygon in our dataset of
simplified polygons then we’re only throwing away useful detail by
simplifying it.

We should still simplify larger polygons as a fallback, to avoid sending
anything to the CBC that we’re not sure it will like.

The thresholds here are low: we can raise them as we test and experiment
more.

Here’s some data about the Flood Warning Service polygons

Percentile | 80% | 90%   | 95%    | 98%     | 99%     | 99.9%
-----------|-----|-------|--------|---------|---------|---------
Point count| 226 | 401.9 | 640.45 | 1015.38 | 1389.07 | 3008.609

Percentile    | 80% | 90%   | 95%    | 98%     | 99%     | 99.9%
--------------|-----|-------|--------|---------|---------|---------
Polygon count |2----|3------|5-------|8--------|10-------|40.469
2021-11-18 15:48:45 +00:00
Chris Hill-Scott
4feb3fdc10 Bump utils
This new version of utils implements the transformation of our polygons
to a Cartesian plane. In other words, it converts them from being
defined in spherical degrees to metres.

For the API this means our simplification will be slightly more
accurate.
2021-11-18 15:43:39 +00:00
Rebecca Law
443f197fee Merge pull request #3376 from alphagov/update-query-for_insert_notification_history_delete_notifications
Small update to query to reduce load on the task.
2021-11-18 11:28:44 +00:00
Rebecca Law
30a5852685 Update the query to only return the count from the table since that is
all we care about.

https://www.pivotaltracker.com/story/show/180262357
2021-11-17 14:46:52 +00:00
Ben Thorner
bffca39223 Merge pull request #3373 from alphagov/centralise-celery-180213914
Use central NotifyCelery base class in utils
2021-11-17 12:09:28 +00:00
Ben Thorner
e6b91f67d6 Merge pull request #3374 from alphagov/log-periodic-180330449
Log activity on all periodic Celery tasks
2021-11-17 12:09:19 +00:00
Ben Thorner
666ac1ab4f Log activity on all periodic Celery tasks
As stated in the comment, this would have been helpful during an
incident to give further reassurance that a task had at least
started running - at the time the only evidence for this was the
Cronitor dashboard itself, which we don't often look at.

I've removed other, equivalent "starting" logs, but kept those
that provide additional information in the log message.
2021-11-17 09:48:03 +00:00
Ben Thorner
528223ed61 Use central NotifyCelery base class in utils
Note that the new base class doesn't include a bespoke feature we
had here: 'log_on_worker_shutdown'. We've agreed it's reasonable
to remove it for now as it was introduced many years ago and its
use case is unclear - we can always add it back if needed.
2021-11-16 13:58:12 +00:00
David McDonald
782aef351c Merge pull request #3369 from alphagov/remove-o-fair
Remove -Ofair option from celery worker
2021-11-16 11:49:52 +00:00
Ben Thorner
4e7b5e0104 Merge pull request #3371 from alphagov/reduce-concurrency-180116935
Reduce concurrency to match number of CPUs
2021-11-16 10:22:21 +00:00
Ben Thorner
fd2d411085 Merge pull request #3370 from alphagov/mention-pycurl
Link to guidance about installing pycurl
2021-11-16 10:18:33 +00:00
Ben Thorner
82e4c3dad2 Reduce concurrency to match number of CPUs
This got missed in [1].

[1]: 9e9091e980
2021-11-15 16:45:05 +00:00
Chris Hill-Scott
4b44e3e223 Merge pull request #3358 from alphagov/remove-yesterdays-planned-tests-on-govuk-alerts
Republish gov.uk/alerts every night to clear down planned tests
2021-11-15 15:34:21 +00:00
Ben Thorner
0fbca71545 Link to guidance about installing pycurl
This seems to be an issue for several people when we install new
versions of the package. Older versions of the package seem to
be equally affected, so the new need for this is likely related
to us using a newer OS / XCode version.
2021-11-15 15:21:09 +00:00
Chris Hill-Scott
0236318189 Republish gov.uk/alerts every night to clear down planned tests
We have made it so that gov.uk/alerts shows a ‘1 planned test’ banner
for the whole of the day when there has been an operator test on that
day.

We need to remove the banner when the day is over.

The most straightforward way to do this is to republish the site at the
start of every day. The gov.uk/alerts code[1] will work out if there are
or aren’t any planned tests to show that day.

1. 5a274af6d0/app/models/alerts.py (L38-L44)
2021-11-15 14:23:32 +00:00
Chris Hill-Scott
2f3c6112ba Merge pull request #3361 from alphagov/celery-5.2.0
Bump Celery to latest version
2021-11-15 14:23:05 +00:00
David McDonald
c646176594 Remove -Ofair option from celery worker
In version 4.0 of celery, -Ofair became the default
scheduling strategy:
https://docs.celeryproject.org/en/latest/history/whatsnew-4.0.html?highlight=fair#ofair-is-now-the-default-scheduling-strategy

This appears to still be the case:
5d68d781de/celery/concurrency/asynpool.py (L80)

Note, it took me a while to be certain of this as the documentation
for the celery CLI suggests a choice of `default` or `fair` which
isn't so useful as both of these are `fair`:
https://docs.celeryproject.org/en/latest/reference/cli.html#cmdoption-celery-worker-O
2021-11-15 11:52:57 +00:00
Chris Hill-Scott
0aa7cf1aaf Tell Pyup to ignore outdated Eventlet version
We already do this in the admin app:
https://github.com/alphagov/notifications-admin/pull/3876/files

Upgrading Eventlet is blocked until this change in Gunicorn is released:
https://github.com/benoitc/gunicorn/pull/2581/files
2021-11-15 11:14:34 +00:00
Chris Hill-Scott
6c0bda0388 Bump Celery to latest version
This brings in the version 5.2.1 of Kombu, which fixes a security
vulnerability:
> Celery 5.2.0 includes 'kombu' v5.2.1, which includes dependencies
> updates that resolve security issues.
— https://pyup.io/repos/github/alphagov/notifications-api/commits/?page=1#b654c27699a5164cbbe50e042d5d34141f560255

This is the commit from Kombu:
f3b04558fa

I believe the dependency of Kombu which has issues is urllib3, which
has two open advisories for versions less than 1.26.5:
- https://github.com/urllib3/urllib3/security/advisories/GHSA-q2q7-5pp4-w6pg
- https://github.com/urllib3/urllib3/security/advisories/GHSA-5phf-pp7p-vc2r
2021-11-15 11:12:33 +00:00
David McDonald
608ef12573 Merge pull request #3367 from alphagov/better-log-message
Improve log message searchability for duplicate receipts
2021-11-15 09:38:30 +00:00
David McDonald
c98996a461 Improve log message searchability for duplicate receipts
There were two problems with the existing message.

1. There was no space between the new status and the time taken
   which made reading and searching harder
2. They key bits of information (before and after status) were
   separated by the time taken (which will always be unique) meaning
   you couldn't do an easy search for a message that is say in delivered
   being attempted to be set to temporary-failure.
2021-11-12 14:06:38 +00:00
Ben Thorner
48e1482d90 Merge pull request #3366 from alphagov/celery-extend-request-id-180213914
Extend request tracing to cover Celery logs
2021-11-12 11:10:38 +00:00
Ben Thorner
d66c68d6d6 Merge pull request #3364 from alphagov/celery-headers-request-id-180213914
Move Celery task Request ID injection into headers
2021-11-12 11:10:29 +00:00
Ben Thorner
4a577eca62 Merge pull request #3359 from alphagov/improve-clarify-botocore-exception-180017131
Improve and clarify large task error handling
2021-11-12 11:10:17 +00:00
Ben Thorner
1872854a4e Improve and clarify large task error handling
Previously we were catching one type of exception if something went
wrong adding a notification to the queue for high volume services.
In reality there are two types of exception so this adds a second
handler to cover both.

For context, this is code we changed experimentally as part of the
upgrade to Celery 5 [1]. At the time we didn't check how the new
exception compared to the old one. It turns out they behaved the
same and we were always vulnerable to the scenario now covered by
the second exception, where the behaviour has changed in Celery 5 -
testing with a large task invocation gives...

Before (Celery 3, large-ish task):

    'process_job.apply_async(["a" * 200000])'...

    boto.exception.SQSError: SQSError: 400 Bad Request
    <?xml version="1.0"?><ErrorResponse xmlns="http://queue.amazonaws.com/doc/2012-11-05/"><Error><Type>Sender</Type><Code>InvalidParameterValue</Code><Message>One or more parameters are invalid. Reason: Message must be shorter than 262144 bytes.</Message><Detail/></Error><RequestId>96162552-cd96-5a14-b3a5-7f503300a662</RequestId></ErrorResponse>

Before (Celery 3, very large task):

    <hangs forever>

After (Celery 5, large-ish task):

    botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the SendMessage operation: One or more parameters are invalid. Reason: Message must be shorter than 262144 bytes.

After (Celery 5, very large task):

    botocore.parsers.ResponseParserError: Unable to parse response (syntax error: line 1, column 0), invalid XML received. Further retries may succeed:
    b'HTTP content length exceeded 1662976 bytes.'

[1]: 29c92a9e54
2021-11-11 17:37:50 +00:00
Leo Hemsted
850cdc16a0 Merge pull request #3363 from alphagov/py39
Python 3.9
2021-11-11 15:04:37 +00:00
Leo Hemsted
ccdba8a0d8 make pyup point at new files
'(requirements-dev hasn'\''t existed for ages anyway)'
2021-11-11 13:54:21 +00:00
Leo Hemsted
036bc92245 switch from freeze reqs script to pip-tools
instead of alexey's home-grown script, pip-tools offers a quicker, more
efficient and better supported way to freeze requirements.

see prior art here:

https://github.com/alphagov/notifications-admin/pull/3753
https://github.com/alphagov/notifications-ftp/pull/333
2021-11-11 13:54:21 +00:00
Leo Hemsted
6b5d7ca639 switch to python 3.9 2021-11-11 13:54:14 +00:00
Ben Thorner
ac06529128 Enable request tracing on Celery success/fail logs
Previously these logs wouldn't have a Request ID attached since the
Celery hooks run after the __call__ method where we enable request
tracing for normal application logs. For the failure log especially
it will be useful to have this feature.
2021-11-10 18:04:20 +00:00
Ben Thorner
369a9f7521 Refactor queue_name and request_id into properties
This reduces the complexity of the original functions, which will
go up a bit in the next commit.
2021-11-10 18:04:19 +00:00
Ben Thorner
89a8dd1a03 Move Celery task Request ID injection into headers
Previously we passed along this piece of state via the kwargs for
a task, but this runs the risk of the task accidentally receiving
the extra kwarg unless we've covered all the code paths that could
invoke it directly e.g. retries don't invoke __call__.

This switches to using Celery "headers" to pass the extra state. It
turns out that a Celery has two "header" concepts, which leads to
some confusion and even a bug with the framework [1]:

- In older (pre v4.4) versions of Celery, the "headers" specified
by apply_async() would become _the_ headers in the message that
gets passed around workers, etc. These would be available later on
via "self.request.headers".

- Since Celery protocol v2, the meaning of "headers" in the message
changed to become (basically) _all_ metadata about the task [2],
with the "headers" option in apply_async() being merged [3] into
the big dict of metadata.

This makes using headers a bit confusing unfortunately, since the
data structure we put in is subtly different to what comes out in
the request context. Nonetheless, it still works. I've added some
comments to try and clarify it.

Note that one of the original tests is no longer necessary, since we
don't need to worry about argument passing styles with headers.

[1]: https://github.com/celery/celery/issues/4875
[2]: 663e4d3a0b (diff-07a65448b2db3252a9711766beec23372715cd7597c3e309bf53859eabc0107fR343)
[3]: 681a922220/celery/app/amqp.py (L495)
2021-11-10 18:03:40 +00:00
Katie Smith
770d323274 Merge pull request #3341 from alphagov/rebuild-letters
Add task to recreate the PDF file for a non-templated letter
2021-11-10 11:18:16 +00:00
Katie Smith
3cffba6d09 Add command to run recreate_pdf_for_precompiled_or_uploaded_letter
We already had the `replay-create-pdf-for-templated-letter` command.
This adds a new command,
`recreate-pdf-for-precompiled-or-uploaded-letter` which does the same
thing but for non-templated letters.
2021-11-10 09:51:31 +00:00
Katie Smith
3d4796c924 Add task to resanitise and replace a PDF for precompiled letter
This adds a task which is designed to be used if we want to recreate the
PDF for a precompiled letter (either one that has been created using the
API or one that has been uploaded through the website).

The task takes the `notification_id` of the letter and passes template
preview the details it needs in order to sanitise the original file and
then replace the version in the letters-pdf bucket with the freshly
sanitised version.
2021-11-10 09:51:31 +00:00
Katie Smith
ec9c3cac5f Rename replay_create_pdf_letters command
This changes the name to make it clearer that this command is for
templated letters only, and not for PDF letters.
2021-11-10 09:51:31 +00:00
Ben Thorner
ff78ea3232 Merge pull request #3360 from alphagov/test-limit-timeout-notifications
Optimise query to get notifications to "time out"
2021-11-09 15:54:04 +00:00
Ben Thorner
cdb43fbaf6 Only loop timeout task if there's more work
Previously this would repeat the task even the current iteration of
the loop had processed a non-full batch. This could cause the task
to error incorrectly if one or two notifications breach the timeout
threshold in between iterations.
2021-11-09 15:41:14 +00:00
Ben Thorner
77c8c0a501 Optimise query to get notifications to "time out"
From experimenting in production we found a "!=" caused the engine
to use a sequential scan, whereas explicitly listing all the types
ensured an index scan was used.

We also found that querying for many (over 100K) items leads to
the task stalling - no logs, but no evidence of it running either -
so we also add a limit to the query.

Since the query now only returns a subset of notifications, we need
to ensure the subsequent "update" query operates on the same batch.
Also, as a temporary measure, we have a loop in the task code to
ensure it operates on the total set of notifications to "time out",
which we assume is less than 500K for the time being.
2021-11-09 13:50:32 +00:00
David McDonald
98b6c1d67d Merge pull request #3354 from alphagov/bump-utils-to-fix-no-break-space
Bump utils to 48.0.0
2021-11-05 16:37:27 +00:00
David McDonald
e4f523e3a0 Bump utils to 48.0.0
Brings in fixes to support for non breaking spaces

See https://github.com/alphagov/notifications-utils/pull/908
2021-11-05 15:09:09 +00:00
Ben Thorner
e7fbd018d1 Merge pull request #3355 from alphagov/celery-5-180017131
Upgrade to Celery 5
2021-11-05 13:04:14 +00:00
Richard Baker
e10f45b3a7 Cast Celery worker_max_tasks_per_child to int or None
We use this config option when running workers that process non-memory-safe tasks to restart the worker after n tasks.

Celery 5 requires this to be passed as an int or None.

Signed-off-by: Richard Baker <richard.baker@digital.cabinet-office.gov.uk>
2021-11-05 11:09:09 +00:00
sakisv
9e9091e980 Reduce concurrency for other workers too for consistency
Any worker that had `--concurrency` > 4 is now set to 4 for consistency
with the how volume workers.

See previous commit (Reduce concurrency on high volume workers) for
details
2021-11-04 16:31:22 +02:00
sakisv
92086e2090 Reduce concurrency on high volume workers
We noticed that having high concurrency led to significant memory usage.

The hypothesis is that because of long polling, there are many
connections being held open which seems to impact the memory usage.

Initially the high concurrency was put in place as a way to get around
the lack of long polling: We were spawning multiple processes and each
one was doing many requests to SQS to check for and receive new tasks.

Now with long polling enabled and reduced concurrency, the workers are
much more efficient at their job (the tasks are being picked up so fast
that the queues are practically empty) and much lighter on resource
requirements. (This last bit will allow us to reduce the memory
requirement for heavy workers like the sender and reduce our costs)

The concurrency number was chosen semi-arbitrarily: Usually this is set
to the number of CPUs available to the system. Because we're running on
PaaS and that number is both abstracted and may be claimed for by other
processes, we went for a conservative one to also reduce the competion
for CPU among the processes of the same worker instance.
2021-11-04 11:38:05 +02:00
Ben Thorner
3ecbdbb260 Temporarily disable task argument checking
This was added in Celery 4 [1]. and appears to be incompatible with
our approach of injecting "request_id" into task arguments (example
exception below). Although our other apps are on Celery 5 our logs
don't show any similar issues, probably because all their tasks are
invoked without request IDs. In the longterm we should decide if we
want to enable argument checking and fix the tracing approach, or
stop tracing request IDs in Celery tasks.

[1]: https://docs.celeryproject.org/en/stable/userguide/tasks.html#argument-checking

    2021-11-01T11:37:36 delivery delivery ERROR None "RETRY: Email notification f69a9305-686f-42eb-a2ee-61bc2ba1f5f3 failed" [in /Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py:68]
    Traceback (most recent call last):
      File "/Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py", line 53, in deliver_email
        raise TypeError("test retry")
    TypeError: test retry
    [2021-11-01 11:37:36,385: ERROR/ForkPoolWorker-1] RETRY: Email notification f69a9305-686f-42eb-a2ee-61bc2ba1f5f3 failed
    Traceback (most recent call last):
      File "/Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py", line 53, in deliver_email
        raise TypeError("test retry")
    TypeError: test retry
    [2021-11-01 11:37:36,394: WARNING/ForkPoolWorker-1] Task deliver_email[449cd221-173c-4e18-83ac-229e88c029a5] reject requeue=False: deliver_email() got an unexpected keyword argument 'request_id'
    Traceback (most recent call last):
      File "/Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py", line 53, in deliver_email
        raise TypeError("test retry")
    TypeError: test retry

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/task.py", line 731, in retry
        S.apply_async()
      File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/canvas.py", line 219, in apply_async
        return _apply(args, kwargs, **options)
      File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/task.py", line 537, in apply_async
        check_arguments(*(args or ()), **(kwargs or {}))
    TypeError: deliver_email() got an unexpected keyword argument 'request_id'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/trace.py", line 450, in trace_task
        R = retval = fun(*args, **kwargs)
      File "/Users/benthorner/Documents/Projects/api/app/celery/celery.py", line 74, in __call__
        return super().__call__(*args, **kwargs)
      File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/trace.py", line 731, in __protected_call__
        return self.run(*args, **kwargs)
      File "/Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py", line 71, in deliver_email
        self.retry(queue=QueueNames.RETRY)
      File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/task.py", line 733, in retry
        raise Reject(exc, requeue=False)
    celery.exceptions.Reject: (TypeError("deliver_email() got an unexpected keyword argument 'request_id'",), False)
2021-11-01 11:39:57 +00:00
Ben Thorner
29c92a9e54 Try removing boto package again 2021-11-01 09:54:10 +00:00