8564 Commits

Author SHA1 Message Date
Richard Baker
e10f45b3a7 Cast Celery worker_max_tasks_per_child to int or None
We use this config option when running workers that process non-memory-safe tasks to restart the worker after n tasks.

Celery 5 requires this to be passed as an int or None.

Signed-off-by: Richard Baker <richard.baker@digital.cabinet-office.gov.uk>
2021-11-05 11:09:09 +00:00
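The cast described in e10f45b3a7 might look roughly like the sketch below. Only the Celery setting name `worker_max_tasks_per_child` comes from the commit message; the helper name `parse_max_tasks_per_child` and the environment variable `CELERY_MAX_TASKS_PER_CHILD` are hypothetical and used purely for illustration.

```python
import os
from typing import Optional


def parse_max_tasks_per_child(raw: Optional[str]) -> Optional[int]:
    """Return an int for a non-empty value, or None when the option is unset."""
    if raw is None or raw.strip() == "":
        return None
    return int(raw)


# Hypothetical environment variable name, for illustration only.
worker_max_tasks_per_child = parse_max_tasks_per_child(
    os.environ.get("CELERY_MAX_TASKS_PER_CHILD")
)
```

Treating an empty string the same as an unset variable is a design choice in this sketch, not something the commit states.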
sakisv
9e9091e980 Reduce concurrency for other workers too for consistency
Any worker that had `--concurrency` > 4 is now set to 4 for consistency
with the high volume workers.

See previous commit (Reduce concurrency on high volume workers) for
details
2021-11-04 16:31:22 +02:00
sakisv
92086e2090 Reduce concurrency on high volume workers
We noticed that having high concurrency led to significant memory usage.

The hypothesis is that because of long polling, there are many
connections being held open which seems to impact the memory usage.

Initially the high concurrency was put in place as a way to get around
the lack of long polling: We were spawning multiple processes and each
one was doing many requests to SQS to check for and receive new tasks.

Now with long polling enabled and reduced concurrency, the workers are
much more efficient at their job (the tasks are being picked up so fast
that the queues are practically empty) and much lighter on resource
requirements. (This last bit will allow us to reduce the memory
requirement for heavy workers like the sender and reduce our costs)

The concurrency number was chosen semi-arbitrarily: Usually this is set
to the number of CPUs available to the system. Because we're running on
PaaS and that number is both abstracted and may be claimed by other
processes, we went for a conservative one to also reduce the competition
for CPU among the processes of the same worker instance.
2021-11-04 11:38:05 +02:00
Ben Thorner
3ecbdbb260 Temporarily disable task argument checking
This was added in Celery 4 [1] and appears to be incompatible with
our approach of injecting "request_id" into task arguments (example
exception below). Although our other apps are on Celery 5 our logs
don't show any similar issues, probably because all their tasks are
invoked without request IDs. In the long term we should decide if we
want to enable argument checking and fix the tracing approach, or
stop tracing request IDs in Celery tasks.

[1]: https://docs.celeryproject.org/en/stable/userguide/tasks.html#argument-checking

    2021-11-01T11:37:36 delivery delivery ERROR None "RETRY: Email notification f69a9305-686f-42eb-a2ee-61bc2ba1f5f3 failed" [in /Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py:68]
    Traceback (most recent call last):
      File "/Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py", line 53, in deliver_email
        raise TypeError("test retry")
    TypeError: test retry
    [2021-11-01 11:37:36,385: ERROR/ForkPoolWorker-1] RETRY: Email notification f69a9305-686f-42eb-a2ee-61bc2ba1f5f3 failed
    Traceback (most recent call last):
      File "/Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py", line 53, in deliver_email
        raise TypeError("test retry")
    TypeError: test retry
    [2021-11-01 11:37:36,394: WARNING/ForkPoolWorker-1] Task deliver_email[449cd221-173c-4e18-83ac-229e88c029a5] reject requeue=False: deliver_email() got an unexpected keyword argument 'request_id'
    Traceback (most recent call last):
      File "/Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py", line 53, in deliver_email
        raise TypeError("test retry")
    TypeError: test retry

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/task.py", line 731, in retry
        S.apply_async()
      File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/canvas.py", line 219, in apply_async
        return _apply(args, kwargs, **options)
      File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/task.py", line 537, in apply_async
        check_arguments(*(args or ()), **(kwargs or {}))
    TypeError: deliver_email() got an unexpected keyword argument 'request_id'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/trace.py", line 450, in trace_task
        R = retval = fun(*args, **kwargs)
      File "/Users/benthorner/Documents/Projects/api/app/celery/celery.py", line 74, in __call__
        return super().__call__(*args, **kwargs)
      File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/trace.py", line 731, in __protected_call__
        return self.run(*args, **kwargs)
      File "/Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py", line 71, in deliver_email
        self.retry(queue=QueueNames.RETRY)
      File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/task.py", line 733, in retry
        raise Reject(exc, requeue=False)
    celery.exceptions.Reject: (TypeError("deliver_email() got an unexpected keyword argument 'request_id'",), False)
2021-11-01 11:39:57 +00:00
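The failure in 3ecbdbb260 can be reproduced in miniature without Celery. The sketch below is a stand-in: `check_arguments` here is built on `inspect.signature` rather than Celery's own implementation, and the `deliver_email` signature is assumed. It only shows why injecting a `request_id` keyword into a task that does not declare it trips a signature check. Celery's task documentation [1] describes a `typing` task attribute for turning the check off; how this commit disables it is not shown in the message.

```python
import inspect


def deliver_email(notification_id):
    """Stand-in for a task that does not declare request_id."""
    return notification_id


def check_arguments(func, *args, **kwargs):
    """Raise TypeError if the call would not match func's signature."""
    inspect.signature(func).bind(*args, **kwargs)


def try_call(**kwargs):
    """Return "ok" if the arguments pass the check, else the error text."""
    try:
        check_arguments(deliver_email, **kwargs)
    except TypeError as exc:
        return str(exc)
    return "ok"


# Passes: only declared arguments are supplied.
plain = try_call(notification_id="abc")
# Fails: the injected keyword is not part of the signature.
traced = try_call(notification_id="abc", request_id="123")
```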
Ben Thorner
29c92a9e54 Try removing boto package again
2021-11-01 09:54:10 +00:00
Ben Thorner
efe4c6f06e Fix notify-api crashing in PaaS
This is purely by elimination: I couldn't see anything in the logs
to indicate the cause of the crashes, just that the processes were
exiting. The crash seemed to happen immediately after the AWS logs
part of the wrapper script, which was a small indicator it might be
something AWS-related. Since this package is no longer included by
other dependencies, we need to include it explicitly.
2021-11-01 09:54:09 +00:00
Ben Thorner
89e390a3fc Run "make freeze-requirements"
Most of these are due to dependency changes in celery / kombu:

-boto==2.49.0
9b2a172078

+cached-property==1.5.2
560518287a

+click-didyoumean==0.3.0
+click-plugins==1.1.1
+click-repl==0.2.0
f462a437e3/requirements/default.txt

+pycurl==7.43.0.5
59d88326b8/requirements/extras/sqs.txt

+vine==5.0.0
f6c3b3313f

I'm not sure about the following, but neither are critical so
I don't think it's worth tracking down where they came from.

+prompt-toolkit==3.0.21
+wcwidth==0.2.5
2021-11-01 09:54:08 +00:00
Ben Thorner
d0550533a7 Remove redundant polling_interval setting
This appeared without explanation in [1], but it's the same as the
default value [2] so we don't need to specify it - doing so gives
the impression we made a decision, but that's not clear here.

[1]: https://github.com/alphagov/notifications-api/pull/2142/files#diff-84f1a9419471e289c6b6e2b0209b329e20df6cef81d1f7f0a193ddc2fc6ad69dR153
[2]: https://docs.celeryproject.org/en/stable/getting-started/backends-and-brokers/sqs.html#polling-interval
2021-11-01 09:54:07 +00:00
Ben Thorner
60799399ab Remove anyjson package
This is no longer required by Celery [1] and now causes an error
when deploying with the new versions of other packages:

        use_2to3 is invalid

[1]: https://docs.celeryproject.org/en/stable/history/whatsnew-4.0.html#requirements
2021-11-01 09:54:06 +00:00
Ben Thorner
44b3b42aba Rewrite config to fix deprecation warnings
The new format was introduced in Celery 4 [1] and the old one is due for
removal in Celery 6 [2], hence the warnings e.g.

    [2021-10-26 14:31:57,588: WARNING/MainProcess] /Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/utils.py:206: CDeprecationWarning:
        The 'CELERY_TIMEZONE' setting is deprecated and scheduled for removal in
        version 6.0.0. Use the timezone instead

      alternative=f'Use the {_TO_NEW_KEY[setting]} instead')

This rewrites the config to match our other apps [3][4]. Some of the
settings have been removed entirely:

- "CELERY_ENABLE_UTC = True" - this has been enabled by default since
  Celery 3 [5].

- "CELERY_ACCEPT_CONTENT = ['json']", "CELERY_TASK_SERIALIZER = 'json'"
  - these are the default settings since Celery 4 [6][7].

Finally, this removes a redundant (and broken) bit of development config
- NOTIFICATION_QUEUE_PREFIX - that should be set in environment.sh [8].

[1]: https://docs.celeryproject.org/en/stable/history/whatsnew-4.0.html#lowercase-setting-names
[2]: https://docs.celeryproject.org/en/stable/history/whatsnew-5.0.html#step-2-update-your-configuration-with-the-new-setting-names
[3]: 252ad01d39/app/config.py (L27)
[4]: 03df0d9252/app/__init__.py (L33)
[5]: https://docs.celeryproject.org/en/stable/userguide/configuration.html#std-setting-enable_utc
[6]: https://docs.celeryproject.org/en/stable/userguide/configuration.html#std-setting-task_serializer
[7]: https://docs.celeryproject.org/en/stable/userguide/configuration.html#std-setting-accept_content
[8]: 2edbdec4ee/README.md (environmentsh)
2021-11-01 09:54:05 +00:00
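As a rough illustration of the rename in 44b3b42aba, the sketch below maps the old uppercase names mentioned in the message to their lowercase equivalents and drops values that match the defaults the message cites. The `migrate` helper and the `Europe/London` value are hypothetical; the commit rewrote the config by hand rather than with a helper like this.

```python
# Old uppercase names mapped to the lowercase names introduced in Celery 4.
OLD_TO_NEW = {
    "CELERY_TIMEZONE": "timezone",
    "CELERY_ENABLE_UTC": "enable_utc",
    "CELERY_ACCEPT_CONTENT": "accept_content",
    "CELERY_TASK_SERIALIZER": "task_serializer",
}

# Defaults as described in the commit message; settings equal to these
# can be removed entirely.
DEFAULTS = {
    "enable_utc": True,
    "accept_content": ["json"],
    "task_serializer": "json",
}

_MISSING = object()


def migrate(old_config):
    """Rename old-style keys and drop settings that only restate a default."""
    new_config = {}
    for key, value in old_config.items():
        new_key = OLD_TO_NEW.get(key, key)
        if DEFAULTS.get(new_key, _MISSING) == value:
            continue
        new_config[new_key] = value
    return new_config


old = {
    "CELERY_TIMEZONE": "Europe/London",  # assumed value, for illustration
    "CELERY_ENABLE_UTC": True,
    "CELERY_ACCEPT_CONTENT": ["json"],
    "CELERY_TASK_SERIALIZER": "json",
}
new = migrate(old)
```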
Leo Hemsted
19394ab9dd construct celery queues once in the base config
previously, we were confusing things by appending to CELERY_QUEUES in
both dev and test configs - these are executed at import time, so the
list contained all queues twice, regardless of what config you're
actually using.

Fortunately, the -Q command that we supply the workers with overrides
this config option, so other environments weren't affected. Given that,
we can tidy up this code by just declaring it in the base config every
time
2021-11-01 09:54:04 +00:00
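The import-time duplication described in 19394ab9dd can be shown with a small sketch. The class names and queue names below are hypothetical; the point is only that class bodies run on import, so two subclasses appending to one shared list both take effect regardless of which config is later selected.

```python
QUEUE_NAMES = ["example-queue-a", "example-queue-b"]  # hypothetical names


class BrokenConfig:
    QUEUES = []  # a single list object shared with every subclass


class BrokenDevelopment(BrokenConfig):
    # This loop runs as soon as the module is imported.
    for name in QUEUE_NAMES:
        BrokenConfig.QUEUES.append(name)


class BrokenUnitTests(BrokenConfig):
    # So does this one, which is why every queue ends up listed twice.
    for name in QUEUE_NAMES:
        BrokenConfig.QUEUES.append(name)


class Config:
    # Declared once in the base config; subclasses simply inherit it.
    QUEUES = list(QUEUE_NAMES)


class Development(Config):
    pass


class UnitTests(Config):
    pass
```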
Ben Thorner
c2fe1b04bb Fix test checking for nested exception
Previously this type of exception was raised at the top level and
the task did not retry [1]. Since Celery 4+ the behaviour changed
so that a Retry exception will be raised unless we explicitly say
we want to raise the original one [2].

It's unclear if we actually want to retry this task for any type
of exception, but it's out-of-scope for this PR to decide on this,
so here we just reraise the exception to make it compatible with
the new version of Celery and the existing test.

[1]: https://github.com/alphagov/notifications-api/pull/2832/files#diff-926badba91648d56a973e16bd92da3345b23bc60dc89360119b1df08de52723fL77
[2]: 32b52ca875 (diff-db604dd7cb51e386710260ff2eba378aac19ba11eec97904bbf097b68caeada6L625)
2021-11-01 09:54:03 +00:00
Ben Thorner
3e49de5330 Upgrade to Celery 5.1.2
There are several other changes we need to make in order to install
the new version. For more context, see:

- 208e90e40f
- e3d1993a58
- 7e93611fce

In the next commits we'll look at tidying up the config and other
dependencies so the change is deployable.
2021-11-01 09:54:00 +00:00
Ben Thorner
4125ed3f10 Merge pull request #3352 from alphagov/remove-raw-request-stats-180016688
Revert "add raw request timings to provider send functions"
2021-10-29 15:04:25 +01:00
Ben Thorner
fd7373ed73 Merge pull request #3353 from alphagov/letter-sending-cc-dvla-180096891
CC DVLA in tickets about outstanding letters
2021-10-29 15:04:16 +01:00
Ben Thorner
d1586a8f81 CC DVLA in tickets about outstanding letters
Previously we sent them emails about this manually. We also tried
a Zendesk macro/trigger approach, but using a CC means:

- We can control the behaviour ourselves (Zendesk triggers can only
be edited by admins outside our team).

- We keep the DVLA notification approach consistent and in one place,
so notifications always go to the same people.

- Any further (public) updates to the ticket will also trigger a
notification to DVLA (previous trigger only notified on creation).
2021-10-29 11:46:29 +01:00
Ben Thorner
64327c10ae Bump utils to 47.1.0
This includes the new email_ccs feature needed for the next commit,
but also an upgrade to bleach [1].

[1]: https://github.com/alphagov/notifications-utils/pull/909
2021-10-29 11:46:28 +01:00
Ben Thorner
3eeba0266b Revert "add raw request timings to provider send functions"
This reverts commit f2f2509c9b.
Raw request stats were added to investigate a hunch about a
performance issue we were seeing [1], but turned out not to
be relevant. We don't use them anymore so we can tidy up.

[1]: https://github.com/alphagov/notifications-api/pull/2858
2021-10-28 11:12:18 +01:00
Ben Thorner
32873ef70f Merge pull request #3349 from alphagov/freeze-requirements-180017131
Run "make freeze-requirements"
2021-10-28 10:46:01 +01:00
David McDonald
ffc7aec61c Merge pull request #3350 from alphagov/normalised_to_update
Bug fix: update normalised_to, not just `to` after letter sanitise
2021-10-27 12:12:06 +01:00
David McDonald
5a51ab6131 Bug fix: update normalised_to, not just `to` after letter sanitise
When a precompiled letter is sent to us, we set the `to` field as
'Provided as PDF' in
1c1023a877/app/v2/notifications/post_notifications.py (L100-L104)

This then also sets `normalised_to` as `providedaspdf`.

However, when template preview sanitises the letter, pulls out the
address and gives it to the API, we were only setting `to` to be
the new address and had forgotten to also amend `normalised_to` to
be the normalised version. This meant that for all these letters
we accidentally left `normalised_to` as `providedaspdf`. The impact
of this was that we could not then search for these letters in the
admin user interface, as the search relies on the `normalised_to`
field containing the recipient address.

This commit fixes that bug by also setting the `normalised_to`
field.
2021-10-27 11:56:25 +01:00
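A minimal sketch of the fix in 5a51ab6131 follows. The `normalise` function is a hypothetical stand-in chosen only because it maps 'Provided as PDF' to `providedaspdf` as the message describes; the real normalisation, the notification model and the example address are all assumptions.

```python
def normalise(recipient: str) -> str:
    """Hypothetical normalisation: remove whitespace and lowercase."""
    return "".join(recipient.split()).lower()


def apply_sanitised_address(notification: dict, address: str) -> dict:
    """Set both fields together so search does not see a stale value."""
    updated = dict(notification)
    updated["to"] = address
    updated["normalised_to"] = normalise(address)
    return updated


# State of a precompiled letter before template preview sanitises it.
before = {"to": "Provided as PDF", "normalised_to": normalise("Provided as PDF")}
# Made-up address, for illustration only.
after = apply_sanitised_address(before, "A. User\n1 Example Street")
```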
Ben Thorner
32e8f9cbc6 Run "make freeze-requirements"
This is so we can clear the diff prior to upgrading to Celery 5,
which has a number of transitive package changes associated with
it. It makes sense for this to be a separate change in case it
causes issues of its own. However, the only major difference in
this commit is pyparsing [1].

[1]: https://github.com/pyparsing/pyparsing/blob/master/docs/whats_new_in_3_0_0.rst
2021-10-27 11:00:48 +01:00
Chris Hill-Scott
2edbdec4ee Merge pull request #3344 from alphagov/broadcast-event-field
Store the `event` field from CAP XML broadcasts
2021-10-26 12:00:17 +01:00
Chris Hill-Scott
54bcf618da Store the event field from CAP XML broadcasts
We don’t store everything that comes in the CAP XML when someone creates
a broadcast via the API.

One thing we do store is `<identifier>` (in a column called `reference`)
which is a unique (to the external system) identifier for the broadcast.
We show this in the front end instead of the template name, because
broadcasts created from the API don’t use templates.

However this ID isn’t very friendly – the Environment Agency just supply
a UUID.

The Environment Agency also populate the `<event>` field with some human
readable text, for example:
> 013 Issue Severe Flood Warning EA

(013 is an area code which will be meaningful to the Flood Warning
Service team)

We should show this in the UI instead of the reference. The first step
towards this is storing it in the database and returning it in the REST
endpoints.

Later we can have the admin app prefer `cap_event` over `reference`,
where `cap_event` is present.

We can’t backfill this data because we don’t keep a copy of the original
XML.

Seems like `<event>` is a mandatory property of `<info>`, so we don’t
need to worry about the field being missing (`<info>` is optional in
CAP but we require it because it contains stuff like the areas which
we need in order to send out the broadcast).

***

https://www.pivotaltracker.com/story/show/176927060
2021-10-26 11:12:27 +01:00
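Pulling the two fields named in 54bcf618da out of a CAP document could be sketched as below. This assumes the CAP 1.2 namespace and uses a placeholder identifier; the function name and the returned dictionary shape are hypothetical, and only the `reference` and `cap_event` names and the sample event text come from the message.

```python
import xml.etree.ElementTree as ET

# Assumes CAP 1.2; other CAP versions use a different namespace.
CAP_NS = {"cap": "urn:oasis:names:tc:emergency:cap:1.2"}

# Trimmed sample document with a placeholder identifier.
SAMPLE = """<alert xmlns="urn:oasis:names:tc:emergency:cap:1.2">
  <identifier>00000000-0000-0000-0000-000000000000</identifier>
  <info>
    <event>013 Issue Severe Flood Warning EA</event>
  </info>
</alert>"""


def extract_reference_and_event(cap_xml: str) -> dict:
    """Return the <identifier> and the first <info>/<event> text."""
    root = ET.fromstring(cap_xml)
    return {
        "reference": root.findtext("cap:identifier", namespaces=CAP_NS),
        "cap_event": root.findtext("cap:info/cap:event", namespaces=CAP_NS),
    }


fields = extract_reference_and_event(SAMPLE)
```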
Ben Thorner
d703251b13 Merge pull request #3348 from alphagov/better-callback-stats-180016688
Include status in stats about delivery times
2021-10-22 11:59:24 +01:00
Ben Thorner
f974108934 Include status in stats about delivery times
Previously these metrics weren't very useful because they could be
skewed by long timings for failed notifications, which can take up
to 72 hours to deliver. I'm intentionally not trying to have a dual
running period (with the old and new names) because:

- We don't use the current stats for anything (checking Grafana).

- The current stats get turned into a "bucket" metric in Prometheus
[1][2], which isn't very useful because it can only tell us the mean
time to deliver, but we're actually interested in percentiles.

Switching to a new naming is an opportunity to fix the raw data and
the way it's aggregated, using the same kind of "summary" metric that
we now use for stats about our Celery tasks [3].

[1]: c330a8ac8a/paas/statsd/statsd-mapping.yml (L82)
[2]: https://prometheus.io/docs/practices/histograms/#quantiles
[3]: https://github.com/alphagov/notifications-aws/pull/890
2021-10-20 17:22:59 +01:00
Leo Hemsted
0b8c6ef263 Merge pull request #3339 from alphagov/letter-runbook-link
tweak zendesk message for no ack files alert
2021-10-20 15:23:33 +01:00
Pea Tyczynska
cf1662e8b1 Merge pull request #3343 from alphagov/publish-alerts
Call publish-govuk-alerts task when alert is sent, cancelled or expires
2021-10-20 14:48:10 +01:00
Chris Hill-Scott
a1345a74d4 Merge pull request #3345 from alphagov/werkzeug-2.0.2
Bump Werkzeug to version 2.0.2
2021-10-18 16:41:11 +01:00
Chris Hill-Scott
ecd2b0c4a3 Bump Werkzeug to version 2.0.2
This is the newest version.

Pyup is complaining about vulnerabilities in version 1.0.1, specifically
> Werkzeug version 2.0.2 improves the security of the debugger cookies.
> "SameSite" attribute is set to "Strict" instead of "None", and the
> secure flag is added when on HTTPS.

Previously we were using whatever version of Werkzeug that Flask
specified. This pins it to get rid of the vulnerability without having
to upgrade everything at once.

We’ve done this for the admin app already:
https://github.com/alphagov/notifications-admin/pull/4042/files

I suspect the memory usage issues we saw with version 2.0.0 have been
fixed in 2.0.2, per this line in the changelog:
> Fix memory usage for locals when using Python 3.6 or pre 0.4.17 greenlet versions.
> https://github.com/pallets/werkzeug/pull/2212

https://werkzeug.palletsprojects.com/en/2.0.x/changes/
2021-10-18 15:00:39 +01:00
Pea Tyczynska
1b6f9505da Call publish-govuk-alerts task when alert expires
The `auto-expire-broadcast-messages` task checks for expired broadcasts
at five minute intervals. This change now calls the
`publish-govuk-alerts` task in govuk-alerts if there are expired
broadcasts so that the site is updated.

Co-authored-by: Katie Smith <katie.smith@digital.cabinet-office.gov.uk>
2021-10-18 08:41:25 +01:00
Katie Smith
04bfd6bfdb Trigger task to publish alerts when sending or cancelling alert
When we send or cancel a broadcast message, we now trigger a task
in the govuk-alerts repo that polls our API for alerts and
publishes a fresh list of alerts.

Co-authored-by: Pea Tyczynska <pea.tyczynska@digital.cabinet-office.gov.uk>
2021-10-18 08:41:24 +01:00
Ben Thorner
c84daf0b7b Merge pull request #3340 from alphagov/fix-decorator-ordering
Fix incorrect ordering in command wrapper
2021-10-08 15:01:01 +01:00
Ben Thorner
7d631960eb Fix incorrect ordering in command wrapper
Previously this was causing the wrapper function to become a
command before it started mirroring the original (functools.wraps),
which meant any previous option decorators were "lost".*

We didn't notice the problem in the original PR [1] because the new
command under test has its option decorators *after* the command
decorator, in contrast with all other (now broken) commands.

The original wrapper applied the functools decorator first [2],
so this change just reinstates that ordering.

*This is a hand-wavey explanation as I haven't looked into how
functools.wraps interacts with option decorators.

[1]: 922fd2f333#
[2]: 922fd2f333 (diff-c4e75c8613e916687a97191a7a79110dfb47e96ef7df96f7ba25dd94ba64943dL101)
2021-10-08 14:21:59 +01:00
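The ordering problem in 7d631960eb can be modelled with toy decorators. This is not the real command library: `option`, `command` and `Command` below are simplified stand-ins that assume options are stored as an attribute on the function and read once, when the command object is created. Under that assumption, the sketch shows why applying `functools.wraps` after the command decorator loses earlier options.

```python
import functools


class Command:
    """Toy command object: captures the callback's options at creation time."""

    def __init__(self, callback):
        self.callback = callback
        self.params = list(getattr(callback, "__params__", []))


def option(name):
    """Toy option decorator: records the option name on the function."""
    def decorator(func):
        func.__params__ = getattr(func, "__params__", []) + [name]
        return func
    return decorator


def command(func):
    return Command(func)


def broken_wrapper(func):
    @functools.wraps(func)  # applied second: the options arrive too late
    @command                # applied first: wrapper has no options yet
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def fixed_wrapper(func):
    @command                # applied second: sees the copied options
    @functools.wraps(func)  # applied first: copies func's attributes over
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@option("--example-option")  # hypothetical option name
def example():
    return "done"


broken = broken_wrapper(example)
fixed = fixed_wrapper(example)
```

In this model `broken.params` is empty while `fixed.params` contains the option, which mirrors the "lost" decorators described in the message.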
Leo Hemsted
b8c4e19072 tweak zendesk message for no ack files alert
include a link to a runbook entry.

also the list of acknowledgement files can be very long, so make that
the last thing, and use new lines to space out the message.
2021-10-08 13:45:02 +01:00
Chris Hill-Scott
b8aea23f6a Merge pull request #3189 from alphagov/reduce-concurrent-verify-codes
Reduce max concurrent 2 factor codes
2021-10-07 11:08:31 +01:00
Chris Hill-Scott
544bfbf569 Add separate config item for failed login count
It’s confusing that changing `MAX_VERIFY_CODE_COUNT` also limits the
number of failed login attempts that a user of text messages 2FA can
make.

This makes the parameters independent, and adds a test to make sure any
future changes which affect the limit of failed login attempts are
covered.
2021-10-04 10:45:07 +01:00
Chris Hill-Scott
786893d920 Reduce max concurrent 2 factor codes
I was doing some analysis and saw that in the last 24 hours the most
codes that anyone had in a 15 minute window was 3.

So I think we can safely reduce this to 5 to get a bit more security
with enough headroom to not have any negative impact to the user.
2021-10-04 10:45:06 +01:00
Chris Hill-Scott
b3597a2e54 Merge pull request #3336 from alphagov/non-consecutive-security-code
Don’t repeat digits in security codes
2021-09-30 10:34:59 +01:00
Chris Hill-Scott
19ad11e383 Don’t repeat digits in security codes
People with dyslexia and dyscalculia find it difficult to transpose
codes which have consecutive, repeated digits[1].

This commit enhances the algorithm for generating codes to not repeat
the previous digit in a code.

This reduces the key space for our codes from 100,000 possibilities to
65,610 possibilities.

1. https://twitter.com/annaecook/status/1442567679710150662
2021-09-30 10:24:17 +01:00
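The key space figure in 19ad11e383 follows from having 10 choices for the first digit and 9 for each of the remaining four: 10 × 9^4 = 65,610. The sketch below assumes 5-digit codes (implied by the 100,000 figure) and is one possible way to generate such codes, not necessarily the algorithm used in the commit; all function names are hypothetical.

```python
import secrets
from itertools import product

DIGITS = "0123456789"


def generate_code(length: int = 5) -> str:
    """Build a code in which no digit repeats the one immediately before it."""
    code = secrets.choice(DIGITS)
    while len(code) < length:
        allowed = [d for d in DIGITS if d != code[-1]]
        code += secrets.choice(allowed)
    return code


def key_space(length: int = 5) -> int:
    """10 choices for the first digit, 9 for every later digit."""
    return 10 * 9 ** (length - 1)


def has_no_consecutive_repeats(code) -> bool:
    return all(a != b for a, b in zip(code, code[1:]))


def count_by_enumeration(length: int = 5) -> int:
    """Brute-force check of key_space by listing every possible code."""
    return sum(
        1 for digits in product(DIGITS, repeat=length)
        if has_no_consecutive_repeats(digits)
    )
```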
Katie Smith
75901a02a6 Merge pull request #3334 from alphagov/new-zendesk-form
Use the new Zendesk form for all tickets
2021-09-29 14:09:11 +01:00
Katie Smith
58597653df Update how "sending to TV numbers" Zendesk tickets are created
2021-09-29 11:26:20 +01:00
Katie Smith
0c0c7f4478 Update how "letters still created status" Zendesk tickets are created
2021-09-29 11:23:28 +01:00
Katie Smith
2f66e38fb9 Update how "missing ackfile for letters" Zendesk tickets are created
2021-09-29 11:10:50 +01:00
Katie Smith
64c0a3fb9d Update how 'letters still sending' Zendesk tickets are created
These now use the new Zendesk form.
2021-09-29 11:07:37 +01:00
Katie Smith
b114dadcae Update how pending virus check Zendesk tickets are created
This updates the tickets that are created when the
`check_if_letters_still_pending_virus_check` scheduled task detects
letters in the `pending-virus-check` state.
2021-09-29 11:03:48 +01:00
Katie Smith
9ff0ca0363 Update how live broadcast Zendesk tickets are created
These now use the Notify Form in Zendesk
2021-09-29 10:59:07 +01:00
Katie Smith
329a418cc4 Bump utils to 46.1.0
This is to bring in the Zendesk changes which allow us to create tickets
using the Notify Form in Zendesk.
2021-09-24 08:16:06 +01:00
Ben Thorner
0d45168cd8 Merge pull request #3332 from alphagov/api-docs-179208442
Add new guidance on writing public APIs
2021-09-20 10:36:50 +01:00
Ben Thorner
d1826af3c7 Remove out-of-date docs on new workers
In response to: https://github.com/alphagov/notifications-api/pull/3332#issuecomment-921897641

These were so out-of-date as to be misleading. We can always refer
back to them in the version history if needed.
2021-09-17 16:53:42 +01:00