We use this config option when running workers that process
non-memory-safe tasks, so that each worker process is restarted after
n tasks.
Celery 5 requires the value to be passed as an int or None.
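As a rough sketch (assuming the option in question is Celery's
worker_max_tasks_per_child; the environment variable name is made up
for illustration):

    import os

    from celery import Celery

    app = Celery("notifications")

    # Celery 5 validates this as an int (or None to disable recycling),
    # so cast anything read from the environment rather than passing a
    # string through.
    raw = os.environ.get("CELERYD_MAX_TASKS_PER_CHILD")
    app.conf.worker_max_tasks_per_child = int(raw) if raw else None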
Signed-off-by: Richard Baker <richard.baker@digital.cabinet-office.gov.uk>
Any worker that had `--concurrency` > 4 is now set to 4 for consistency
with the high volume workers.
See previous commit (Reduce concurrency on high volume workers) for
details.
We noticed that having high concurrency led to significant memory usage.
The hypothesis is that because of long polling, there are many
connections being held open, which seems to impact memory usage.
Initially the high concurrency was put in place as a way to get around
the lack of long polling: We were spawning multiple processes and each
one was doing many requests to SQS to check for and receive new tasks.
Now with long polling enabled and reduced concurrency, the workers are
much more efficient at their job (the tasks are being picked up so fast
that the queues are practically empty) and much lighter on resource
requirements. (This last bit will allow us to reduce the memory
requirement for heavy workers like the sender and reduce our costs)
The concurrency number was chosen semi-arbitrarily: usually this is
set to the number of CPUs available to the system. Because we're
running on PaaS, where that number is both abstracted away and shared
with other processes, we went for a conservative figure, which also
reduces the competition for CPU among the processes of the same
worker instance.
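For illustration only, a sketch of what the relevant settings might
look like for an SQS-backed worker (the broker URL, wait time and
concurrency value here are illustrative, not the real config):

    from celery import Celery

    app = Celery("delivery-worker", broker="sqs://")

    # Long polling: each receive call waits up to 20 seconds for a
    # message instead of issuing a stream of empty short polls to SQS.
    app.conf.broker_transport_options = {"wait_time_seconds": 20}

    # Conservative concurrency: fewer pool processes per instance means
    # fewer connections held open and less competition for the shared
    # PaaS CPU.
    app.conf.worker_concurrency = 4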
This was added in Celery 4 [1] and appears to be incompatible with
our approach of injecting "request_id" into task arguments (example
exception below). Although our other apps are already on Celery 5,
our logs don't show any similar issues for them, probably because all
their tasks are invoked without request IDs. In the long term we
should decide if we want to enable argument checking and fix the
tracing approach, or stop tracing request IDs in Celery tasks.
[1]: https://docs.celeryproject.org/en/stable/userguide/tasks.html#argument-checking
2021-11-01T11:37:36 delivery delivery ERROR None "RETRY: Email notification f69a9305-686f-42eb-a2ee-61bc2ba1f5f3 failed" [in /Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py:68]
Traceback (most recent call last):
File "/Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py", line 53, in deliver_email
raise TypeError("test retry")
TypeError: test retry
[2021-11-01 11:37:36,385: ERROR/ForkPoolWorker-1] RETRY: Email notification f69a9305-686f-42eb-a2ee-61bc2ba1f5f3 failed
Traceback (most recent call last):
File "/Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py", line 53, in deliver_email
raise TypeError("test retry")
TypeError: test retry
[2021-11-01 11:37:36,394: WARNING/ForkPoolWorker-1] Task deliver_email[449cd221-173c-4e18-83ac-229e88c029a5] reject requeue=False: deliver_email() got an unexpected keyword argument 'request_id'
Traceback (most recent call last):
File "/Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py", line 53, in deliver_email
raise TypeError("test retry")
TypeError: test retry
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/task.py", line 731, in retry
S.apply_async()
File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/canvas.py", line 219, in apply_async
return _apply(args, kwargs, **options)
File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/task.py", line 537, in apply_async
check_arguments(*(args or ()), **(kwargs or {}))
TypeError: deliver_email() got an unexpected keyword argument 'request_id'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/trace.py", line 450, in trace_task
R = retval = fun(*args, **kwargs)
File "/Users/benthorner/Documents/Projects/api/app/celery/celery.py", line 74, in __call__
return super().__call__(*args, **kwargs)
File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/trace.py", line 731, in __protected_call__
return self.run(*args, **kwargs)
File "/Users/benthorner/Documents/Projects/api/app/celery/provider_tasks.py", line 71, in deliver_email
self.retry(queue=QueueNames.RETRY)
File "/Users/benthorner/.pyenv/versions/notifications-api/lib/python3.6/site-packages/celery/app/task.py", line 733, in retry
raise Reject(exc, requeue=False)
celery.exceptions.Reject: (TypeError("deliver_email() got an unexpected keyword argument 'request_id'",), False)
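For reference, a hedged sketch of how Celery's argument checking can
be switched off, per task or app-wide (not necessarily how this
commit handles it, and the task signature below is illustrative;
whether we want to rely on this at all is the open question above):

    from celery import Celery

    # App-wide: disable argument checking for every task.
    app = Celery("notifications", strict_typing=False)

    # Or per task: keep checking elsewhere, but tolerate extra kwargs
    # such as the injected "request_id" on this one task.
    @app.task(typing=False)
    def deliver_email(notification_id, **kwargs):
        ...  # illustrative signature only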
This is purely by elimination: I couldn't see anything in the logs
to indicate the cause of the crashes, just that the processes were
exiting. The crash seemed to happen immediately after the AWS logs
part of the wrapper script, which was a small indicator it might be
something AWS-related. Since this package is no longer included by
other dependencies, we need to include it explicitly.
Most of these are due to dependency changes in celery / kombu:
-boto==2.49.0
9b2a172078
+cached-property==1.5.2
560518287a
+click-didyoumean==0.3.0
+click-plugins==1.1.1
+click-repl==0.2.0
f462a437e3/requirements/default.txt
+pycurl==7.43.0.5
59d88326b8/requirements/extras/sqs.txt
+vine==5.0.0
f6c3b3313f
I'm not sure about the following, but neither is critical, so I
don't think it's worth tracking down where they came from.
+prompt-toolkit==3.0.21
+wcwidth==0.2.5
Previously, we were confusing things by appending to CELERY_QUEUES
in both the dev and test configs. These appends run at import time,
so the list contained all the queues twice, regardless of which
config you're actually using.
Fortunately, the -Q option that we supply to the workers overrides
this config option, so other environments weren't affected. Given
that, we can tidy up this code by just declaring the full list in the
base config.
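A stripped-down sketch of the kind of thing that was going wrong
(class and queue names are invented for illustration):

    class Config:
        CELERY_QUEUES = ["periodic-tasks", "database-tasks"]

    class Development(Config):
        # Runs at import time and mutates the list object inherited
        # from Config.
        Config.CELERY_QUEUES.extend(["periodic-tasks", "database-tasks"])

    class Test(Config):
        # Also runs at import time, against the very same list, so the
        # queues end up duplicated whichever config is actually used.
        Config.CELERY_QUEUES.extend(["periodic-tasks", "database-tasks"])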
There are several other changes we need to make in order to install
the new version. For more context, see:
- 208e90e40f
- e3d1993a58
- 7e93611fce
In the next commits we'll look at tidying up the config and other
dependencies so the change is deployable.
Previously we sent them emails about this manually. We also tried
a Zendesk macro/trigger approach, but using a CC means:
- We can control the behaviour ourselves (Zendesk triggers can only
be edited by admins outside our team).
- We keep the DVLA notification approach consistent and in one place,
so notifications always go to the same people.
- Any further (public) updates to the ticket will also trigger a
notification to DVLA (previous trigger only notified on creation).
This reverts commit f2f2509c9b.
Raw request stats were added to investigate a hunch about a
performance issue we were seeing [1], but turned out not to
be relevant. We don't use them anymore so we can tidy up.
[1]: https://github.com/alphagov/notifications-api/pull/2858
When a precompiled letter is sent to us, we set the `to` field as
'Provided as PDF' in
1c1023a877/app/v2/notifications/post_notifications.py (L100-L104)
This then also sets `normalised_to` as `providedaspdf`.
However, when template preview sanitises the letter, pulls out the
address and gives it to the API, we were only setting `to` to be
the new address and had forgotten to also amend `normalised_to` to
be the normalised version. This meant that for all these letters
we accidentally left `normalised_to` as `providedaspdf`. The impact
of this was that we could not then search for these letters in the
admin user interface, as the search relies on the `normalised_to`
field containing the recipient address.
This commit fixes that bug by also setting the `normalised_to`
field.
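A hedged sketch of the shape of the fix (the function and helper
names here are made up; the real code lives elsewhere in the API):

    import string

    def normalise_recipient(address):
        # Hypothetical helper: lowercase and strip whitespace and
        # punctuation, mirroring how 'Provided as PDF' became
        # 'providedaspdf'.
        unwanted = string.whitespace + string.punctuation
        return "".join(ch for ch in address.lower() if ch not in unwanted)

    def update_letter_recipient(notification, extracted_address):
        # Set both fields together, instead of leaving normalised_to as
        # the placeholder written when the PDF was first received.
        notification.to = extracted_address
        notification.normalised_to = normalise_recipient(extracted_address)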
This is so we can clear the diff prior to upgrading to Celery 5,
which has a number of transitive package changes associated with
it. It makes sense for this to be a separate change in case it
causes issues of its own. However, the only major difference in
this commit is pyparsing [1].
[1]: https://github.com/pyparsing/pyparsing/blob/master/docs/whats_new_in_3_0_0.rst
We don’t store everything that comes in the CAP XML when someone creates
a broadcast via the API.
One thing we do store is `<identifier>` (in a column called `reference`)
which is a unique (to the external system) identifier for the broadcast.
We show this in the front end instead of the template name, because
broadcasts created from the API don’t use templates.
However this ID isn’t very friendly – the Environment Agency just supply
a UUID.
The Environment Agency also populate the `<event>` field with some human
readable text, for example:
> 013 Issue Severe Flood Warning EA
(013 is an area code which will be meaningful to the Flood Warning
Service team)
We should show this in the UI instead of the reference. The first step
towards this is storing it in the database and returning it in the REST
endpoints.
Later we can have the admin app prefer `cap_event` over `reference`,
where `cap_event` is present.
We can’t backfill this data because we don’t keep a copy of the original
XML.
Seems like `<event>` is a mandatory property of `<info>`, so we don’t
need to worry about the field being missing (`<info>` is optional in
CAP, but we require it because it contains things like the areas,
which we need in order to send out the broadcast).
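A hedged sketch of pulling `<event>` out of the CAP XML (using
ElementTree and the CAP 1.2 namespace; the real parsing code may
differ):

    from xml.etree import ElementTree

    CAP_NAMESPACES = {"cap": "urn:oasis:names:tc:emergency:cap:1.2"}

    def extract_cap_event(cap_xml):
        root = ElementTree.fromstring(cap_xml)
        # <info> is optional in CAP but we already require it, and
        # <event> is a mandatory child of <info>, so this should always
        # find a value.
        event = root.find("cap:info/cap:event", CAP_NAMESPACES)
        return event.text if event is not None else None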
***
https://www.pivotaltracker.com/story/show/176927060
Previously these metrics weren't very useful because they could be
skewed by long timings for failed notifications, which can take up
to 72 hours to deliver. I'm intentionally not trying to have a dual
running period (with the old and new names) because:
- We don't use the current stats for anything (checking Grafana).
- The current stats get turned into a "bucket" metric in Prometheus
[1][2], which isn't very useful because it can only tell us the mean
time to deliver, but we're actually interested in percentiles.
Switching to a new naming is an opportunity to fix the raw data and
the way it's aggregated, using the same kind of "summary" metric that
we now use for stats about our Celery tasks [3].
[1]: c330a8ac8a/paas/statsd/statsd-mapping.yml (L82)
[2]: https://prometheus.io/docs/practices/histograms/#quantiles
[3]: https://github.com/alphagov/notifications-aws/pull/890
This is the newest version.
Pyup is complaining about vulnerabilities in version 1.0.1, specifically
> Werkzeug version 2.0.2 improves the security of the debugger cookies.
> "SameSite" attribute is set to "Strict" instead of "None", and the
> secure flag is added when on HTTPS.
Previously we were using whatever version of Werkzeug Flask
specified; this pins it, to get rid of the vulnerability without
having to upgrade everything at once.
We’ve done this for the admin app already:
https://github.com/alphagov/notifications-admin/pull/4042/files
I suspect the memory usage issues we saw with version 2.0.0 have been
fixed in 2.0.2, per this line in the changelog:
> Fix memory usage for locals when using Python 3.6 or pre 0.4.17 greenlet versions.
> https://github.com/pallets/werkzeug/pull/2212
— https://werkzeug.palletsprojects.com/en/2.0.x/changes/
The `auto-expire-broadcast-messages` task checks for expired
broadcasts at five-minute intervals. With this change, it now also
calls the `publish-govuk-alerts` task in govuk-alerts when there are
expired broadcasts, so that the site is updated.
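A hedged sketch of the trigger (the queue name and broker URL are
guesses for illustration; the task name comes from the govuk-alerts
repo):

    from celery import Celery

    notify_celery = Celery("notifications", broker="sqs://")

    def trigger_govuk_alerts_publish():
        # Fire-and-forget a task we don't own: send it by name to the
        # queue that the govuk-alerts workers consume from.
        notify_celery.send_task("publish-govuk-alerts", queue="govuk-alerts")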
Co-authored-by: Katie Smith <katie.smith@digital.cabinet-office.gov.uk>
When we send or cancel a broadcast message, we now trigger a task in
the govuk-alerts repo that polls our API for alerts and publishes a
fresh list of alerts.
Co-authored-by: Pea Tyczynska <pea.tyczynska@digital.cabinet-office.gov.uk>
Previously this was causing the wrapper function to become a
command before it started mirroring the original (functools.wraps),
which meant any previous option decorators were "lost".*
We didn't notice the problem in the original PR [1] because the new
command under test has its option decorators *after* the command
decorator, in contrast with all other (now broken) commands.
The original wrapper applied the functools decorator first [2],
so this change just reinstates that ordering.
*This is a hand-wavey explanation as I haven't looked into how
functools.wraps interacts with option decorators.
[1]: 922fd2f333#
[2]: 922fd2f333 (diff-c4e75c8613e916687a97191a7a79110dfb47e96ef7df96f7ba25dd94ba64943dL101)
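A hedged sketch of the ordering point (heavily simplified; the real
wrapper does more than this):

    import functools

    import click

    def notify_command(func):
        # Mirror the original function first, so that any @click.option
        # decorators already applied to it are carried over to the
        # wrapper...
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # (per-command setup/teardown would go here)
            return func(*args, **kwargs)

        # ...and only then turn the wrapper into a click command, which
        # is when click reads the accumulated option decorators.
        return click.command()(wrapper)

With this ordering, commands whose @click.option decorators sit below
@notify_command (and are therefore applied first) keep their options.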
Include a link to a runbook entry.
Also, the list of acknowledgement files can be very long, so make
that the last thing, and use new lines to space out the message.
It’s confusing that changing `MAX_VERIFY_CODE_COUNT` also limits the
number of failed login attempts that a user of text message 2FA can
make.
This makes the parameters independent, and adds a test to make sure any
future changes which affect the limit of failed login attempts are
covered.
I was doing some analysis and saw that, in the last 24 hours, the
most codes anyone had in a 15 minute window was 3.
So I think we can safely reduce this to 5 to get a bit more security,
with enough headroom to avoid any negative impact on users.
People with dyslexia and dyscalculia find it difficult to transpose
codes which have consecutive, repeated digits[1].
This commit enhances the algorithm for generating codes so that it
never repeats the previous digit in a code.
This reduces the key space for our codes from 100,000 possibilities
to 65,610 (10 options for the first digit, then 9 for each of the
remaining four digits: 10 × 9^4 = 65,610).
1. https://twitter.com/annaecook/status/1442567679710150662
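A hedged sketch of one way to generate such codes (assuming 5-digit
numeric codes and the standard library's secrets module; not
necessarily the actual implementation):

    import secrets

    DIGITS = "0123456789"

    def generate_verify_code(length=5):
        # First digit: any of 10. Each later digit: any of the 9 digits
        # that differ from its predecessor, giving 10 * 9 ** (length - 1)
        # possible codes (65,610 when length is 5).
        code = [secrets.choice(DIGITS)]
        while len(code) < length:
            code.append(secrets.choice([d for d in DIGITS if d != code[-1]]))
        return "".join(code)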
This updates the tickets that are created when the
`check_if_letters_still_pending_virus_check` scheduled task detects
letters in the `pending-virus-check` state.