Commit Graph

269 Commits

Author SHA1 Message Date
Rebecca Law
8bba04e1ce Update the MMG/Firetext ratio to 70/30 as requested. 2020-07-15 14:07:19 +01:00
Pea M. Tyczynska
fbdfa6416f Merge pull request #2921 from alphagov/remove-statsd-http-api-decorators
Remove statsd http api decorators and turn statsd back on for celery apps
2020-07-14 10:16:44 +01:00
Leo Hemsted
eca37d0853 add send_broadcast_message task
task takes a broadcast_message_id, and makes a POST to the cbc-proxy
for now, hardcode the url to the notify stub. the stub requires the
template in the same form the admin/api get it, so use the marshmallow
schema to json dump it.
Note - this also required us to tweak the BroadcastMessage.serialize
function so that it converts uuids into ids - flask's jsonify function
does that for free but requests.post doesn't, sadly.

if the request fails (either 4xx or 5xx) just raise an exception and let
it bubble up for now - in the future we'll add retry logic
2020-07-09 18:23:30 +01:00
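A minimal sketch of the shape of such a task, assuming hypothetical module paths, a placeholder stub URL, and an illustrative schema name (none of these are confirmed by the log):

```python
import requests

from app import notify_celery  # hypothetical celery instance
from app.models import BroadcastMessage
from app.schemas import broadcast_message_schema  # hypothetical marshmallow schema


@notify_celery.task(name="send-broadcast-message")
def send_broadcast_message(broadcast_message_id):
    broadcast_message = BroadcastMessage.query.get(broadcast_message_id)

    # the marshmallow dump gives us plain strings for the uuid columns -
    # flask's jsonify would do that conversion for free, requests.post won't
    payload = broadcast_message_schema.dump(broadcast_message)

    response = requests.post(
        "https://notify-cbc-proxy-stub.example/broadcasts",  # hardcoded stub url for now
        json=payload,
    )
    # no retry logic yet: let any 4xx/5xx bubble up as an exception
    response.raise_for_status()
```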
Pea Tyczynska
fff004da2b Turn statsd back on, but not for api main app, only for delivery
workers.
2020-07-09 10:10:24 +01:00
Pea Tyczynska
e02508d6f7 Turn off the sms stub and email stub in staging 2020-07-02 17:23:50 +01:00
Chris Hill-Scott
3ffdb3093b Revert "Revert "Merge pull request #2887 from alphagov/cache-the-serialised-things""
This reverts commit 7e85e37e1d.
2020-06-26 14:10:12 +01:00
Chris Hill-Scott
7e85e37e1d Revert "Merge pull request #2887 from alphagov/cache-the-serialised-things"
This reverts commit b8c2c6b291, reversing
changes made to 351aca2c5a.
2020-06-26 13:42:44 +01:00
Chris Hill-Scott
b8c2c6b291 Merge pull request #2887 from alphagov/cache-the-serialised-things
Serialise and cache services and API keys
2020-06-26 09:18:45 +01:00
Leo Hemsted
bb05dcf221 disable statsd 2020-06-24 16:35:44 +01:00
Chris Hill-Scott
6a9818b5fd Cache services and API keys in memory
Same as we’ve done for templates.

For high volume services this should mean avoiding calls to external
services, either the database or Redis.

TTL is set to 2 seconds, so that’s the maximum time it will take for
revoking an API key or renaming a service to propagate.

Some of the tests created services with the same service ID. This
caused intermittent failures because the cache relies on unique service
IDs (like we have in the real world) to key itself.
2020-06-24 08:46:13 +01:00
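A rough sketch of the kind of in-memory TTL cache being described, with hypothetical names (the real implementation is shared with the template caching code):

```python
import time
from functools import wraps


def memory_cache(ttl_seconds=2):
    """Cache results in process memory, keyed on the call arguments."""
    cache = {}

    def decorator(func):
        @wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            entry = cache.get(args)
            if entry and now - entry[0] < ttl_seconds:
                return entry[1]
            value = func(*args)
            cache[args] = (now, value)
            return value
        return wrapper
    return decorator


@memory_cache(ttl_seconds=2)  # 2s is the worst case for a revoked key or rename to propagate
def get_service_and_api_keys(service_id):
    ...  # fall back to Redis or the database
```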
David McDonald
48abc8ffb5 Turn on email stub for load testing 2020-06-23 11:09:24 +01:00
Leo Hemsted
728e5eee24 re-enable statsd (and cache statsd DNS lookup)
see https://github.com/alphagov/notifications-utils/pull/752 for
implementation details. Cache DNS for 15 seconds. Note: this cache is
per eventlet, so concurrent requests will each resolve DNS separately,
but the cache does persist between sequential requests.

re-enable statsd for all environments.
2020-06-18 11:20:00 +01:00
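Roughly what the DNS caching amounts to (a sketch only; the real code lives in the notifications-utils statsd client linked above):

```python
import socket
import time


class CachedDnsLookup:
    """Resolve the statsd hostname at most once every 15 seconds (per process/eventlet)."""

    def __init__(self, hostname, ttl_seconds=15):
        self.hostname = hostname
        self.ttl_seconds = ttl_seconds
        self._ip = None
        self._resolved_at = 0.0

    def get_ip(self):
        now = time.monotonic()
        if self._ip is None or now - self._resolved_at >= self.ttl_seconds:
            self._ip = socket.gethostbyname(self.hostname)
            self._resolved_at = now
        return self._ip
```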
David McDonald
92c3170fe3 Turn off statsd for the API in all environments
We have seen bad latency across all apps in the last week. After running
a canary with statsd turned off we have seen these issues disappear on
that canary instance. We therefore will turn off statsd for all
instances of the API. We will still need to investigate tomorrow what
exactly changed or is causing the issue with statsd as we hadn't made
any changes to it ourselves.

Note, this keeps statsd on for all the other apps but this will be a
first step. We will also need to check performance of the other apps
after releasing this.
2020-06-17 17:31:36 +01:00
Pea Tyczynska
bb3672100f Turn off email stub on staging
To check if it was our congestion point for the soak test.
(The email stub is a bit slower to respond than Amazon SES, and
the difference could lead to us having too many connections to the db open
as we wait for a response from the stub)
2020-06-09 11:21:48 +01:00
Pea Tyczynska
2aa6470817 Point API for staging at email and sms stubs for the soak tests.
This is done to avoid sending real email and sms and incurring
unnecessary charges while we run the soak tests.
2020-06-08 17:14:16 +01:00
Pea Tyczynska
8b7a0b88cb Ensure that aws ses stub client is not run in production 2020-06-04 15:43:47 +01:00
Pea Tyczynska
6422a88c8c Stub SES email client to avoid hitting SES during load testing
If we set an environment variable, we can stub out calls to SES and send
them to our own stub app. If the environment variable is not set, things
work as normal.

To be used alongside
https://github.com/alphagov/notifications-email-provider-stub
2020-06-03 11:11:43 +01:00
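A sketch of the environment-variable switch being described; the client class names, the variable name, and the stub endpoint are assumptions, not the real wiring:

```python
import os

import requests


class StubSesClient:
    """Posts 'emails' to our own stub app instead of calling SES (sketch)."""

    def __init__(self, base_url):
        self.base_url = base_url

    def send_email(self, **email):
        requests.post(self.base_url, json=email)
        return "stubbed-ses-message-id"


def build_ses_client(real_client_factory):
    stub_url = os.environ.get("SES_STUB_URL")  # hypothetical variable name
    if stub_url:
        return StubSesClient(base_url=stub_url)
    # environment variable not set: things work as normal
    return real_client_factory()
```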
Pete Herlihy
8a67e14e2b Moving SMS supplier resting point to 50/50 2020-06-02 12:27:14 +01:00
Pea Tyczynska
3a00c19390 Polish and test the small task that updates billable units for letter 2020-05-11 13:32:09 +01:00
Pea Tyczynska
24a89c1c19 Modify tasks for getting letter pdf and updating billable units
So that they talk to the new template preview task for pdf creation
2020-05-11 13:30:59 +01:00
David McDonald
a237162106 Reduce concurrency and prefetch count of reporting celery app
We have seen the reporting app run out of memory multiple times when
dealing with overnight tasks. The app runs 11 worker threads and we
reduce this to 2 worker threads to put less pressure on a single
instance.

The number 2 was chosen as most of the tasks processed by the reporting
app only take a few minutes and only one or two usually take more than
an hour. This would mean with 2 processes across our current 2
instances, a long running task should hopefully only wait behind a few
short running tasks before being picked up and therefore we shouldn't
see a large increase in the overall time taken to run all our overnight
reporting tasks.

On top of reducing the concurrency for the reporting app, we also set
CELERYD_PREFETCH_MULTIPLIER=1. We do this as suggested by the celery
docs because this app deals with long running tasks.
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit

The change in prefetch multiplier should again optimise the overall time
it takes to process our tasks by ensuring that tasks are given to
instances that have (or will soon have) spare workers to deal with them,
rather than committing to putting all the tasks on certain workers in
advance.

Note, another optimisation suggested by the docs is to start setting
`ACKS_LATE` on the long running tasks.
This setting would effectively change us from prefetching 1 task per
worker to prefetching 0 tasks per worker and further optimise how we
distribute our tasks across instances. However, we decided not to try
this setting as we weren't sure whether it would conflict with our
visibility_timeout. We decided not to spend the time investigating but
it may be worth revisiting in the future, as long as tasks are
idempotent.

Overall, this commit takes us from potentially having all 18 of our
reporting tasks get fetched onto a single instance to now having a
process that will ensure tasks are distributed more fairly across
instances based on when they have available workers to process the
tasks.
2020-04-28 10:47:46 +01:00
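In Celery 3.x config terms, the change amounts to roughly the following (values from the commit message; the setting names are standard Celery ones and not necessarily how the app actually wires them up):

```python
# Reporting worker only - other apps keep their existing settings
CELERYD_CONCURRENCY = 2          # down from 11 workers per instance
CELERYD_PREFETCH_MULTIPLIER = 1  # long running tasks: reserve only one task per worker
```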
Leo Hemsted
ad419f7592 restart reporting worker after each task
The reporting worker tasks fetch large amounts of data from the db, do
some processing then store back in the database. As the reporting worker
only processes the create nightly billing/stats table tasks, which
aren't high performance or high volume, we're fine with the performance
hit from restarting the worker between every task (which based on
limited local testing takes about a second or so).

This causes some real funky shit with the app_context (used for
accessing current_app.logger). To access flask's global state we use the
standard way of importing `from flask import current_app`. However,
processes after the first one don't have the current_app available on
shut down (they're fine during the actual task running), and are unable
to call `with current_app.app_context()` to create it. They _are_ able
to call `with app.app_context()` to create it, where `app` is the
initial app that we pass in to `NotifyCelery.init_app`.

NotifyCelery.init_app is only called once, in the master process - I
think the application state is then stored and passed to the celery
workers. But then it looks like the teardown might clear it, but it
never gets set up again for the new workers? Unsure.

To fix this, store a copy of the initial flask app on the NotifyCelery
object and then use that from within the shutdown signal logging
function.

Nothing's ever easy ¯\_(ツ)_/¯
2020-04-24 12:28:25 +01:00
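A sketch of the two pieces described above, with approximate names (`notify_celery`, the signal handler, and the config constant are illustrative):

```python
from celery import Celery
from celery.signals import worker_process_shutdown
from flask import current_app

CELERYD_MAX_TASKS_PER_CHILD = 1  # restart each reporting worker process after every task


class NotifyCelery(Celery):
    def init_app(self, app):
        # keep a reference to the original flask app: current_app isn't
        # reliably available in shutdown signals after the first restart
        self._flask_app = app
        self.conf.update(app.config)


notify_celery = NotifyCelery()


@worker_process_shutdown.connect
def log_worker_shutdown(sender=None, **kwargs):
    # use the stored app rather than relying on an app context existing
    with notify_celery._flask_app.app_context():
        current_app.logger.info("worker process shutting down")
```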
Katie Smith
92c979fcdb Restore provider resting points
We temporarily updated the provider resting points in
https://github.com/alphagov/notifications-api/pull/2804.
This puts the provider resting points back to their original value now
that both providers seem to be functioning well.
2020-04-20 10:45:19 +01:00
David McDonald
9b1b9ea75b Merge pull request #2798 from alphagov/delete-notifications-older-than-retention
Delete notifications older than retention
2020-04-15 12:14:07 +01:00
David McDonald
8a27b3b9f2 Move provider resting balance as temporary measure
We are seeing issues with one of our providers. Move all traffic to the
other. We have done this with the provider load balancer; however, that
change would be reverted back 10 percent every hour towards these resting
points, which we want to stop happening until we are confident our
provider has fixed their issues.

In the long run, we should add functionality to pause our load balancer
behaviour.
2020-04-14 18:07:25 +01:00
David McDonald
f512fc63a1 Run delete-notifications-older-than-retention at 3am
Moving it from 4:15am to 3:00am. This will mean we do more deleting before
the 'work day' starts and improve our DB performance.

I've looked at the last 14 days of logs for the
`create-nightly-notification-status` subtasks and the
`create-nightly-billing` subtasks. The latest they appear to finish is
2.30AM. There were some outliers but I believe these were people running
the tasks in the middle of the day as a manual process.

Obviously, this still means there is a risk that those tasks conflict
with `delete-notifications-older-than-retention`, even more so now that
we are moving it to 3am.
2020-04-07 17:04:31 +01:00
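In Celery beat terms the change is roughly the following (the schedule entry name comes from the commit; the surrounding config structure is an assumption):

```python
from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    "delete-notifications-older-than-retention": {
        "task": "delete-notifications-older-than-retention",
        "schedule": crontab(hour=3, minute=0),  # previously 4:15am
    },
}
```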
David McDonald
085e3a8435 Make deleting of notifications sequential
Based on this https://github.com/alphagov/notifications-api/pull/2788
where some concerns were raised. This should be a quicker fix to get
the deletions to run sequentially for all the notification types. Note,
email is first as it is the most important and makes up the largest
numbers (we wouldn't want it to start with SMS, fail halfway for a reason
that only affects SMS, and for that to affect the email deletion).

We hope that by running sequentially we will reduce conflicts writing to
the same index and this will speed up the total time it takes to finish
deleting all notification types older than their retention time. There
is a risk that whilst quicker per job, as they now run sequentially
rather than potentially overlapping, they will take longer overall. We
will need to monitor to see.
2020-04-07 17:03:17 +01:00
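A sketch of the sequential ordering described above, with hypothetical task and helper names:

```python
@notify_celery.task(name="delete-notifications-older-than-retention")
def delete_notifications_older_than_retention():
    # email first: it makes up the largest volume, and a failure that only
    # affects sms shouldn't stop the email deletion from happening
    for notification_type in ("email", "sms", "letter"):
        delete_notifications_for_type_older_than_retention(notification_type)
```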
Leo Hemsted
9673619519 Update provider splits
also fix tests so they're independent of future config changes
2020-04-06 15:16:00 +01:00
Chris Hill-Scott
4aed012d92 Merge pull request #2759 from alphagov/delete-contact-list
Add an endpoint to delete a contact list
2020-03-27 14:50:51 +00:00
Rebecca Law
dc44cb29d1 To make the deployment and testing a little easier move the high volume service ids to the credential repo.
This way we can add the ids only when we are ready and all the infrastructure for the new service has been applied.
2020-03-27 08:02:51 +00:00
Chris Hill-Scott
4a6143aeb1 Remove the list from S3 once we don’t need it
Once a contact list is gone from the database there’s no way to
reference it again. Any jobs have made their own copy.

So we can clean it up, meaning we’re not storing personal data longer
than we need to.
2020-03-26 17:42:38 +00:00
Rebecca Law
d0a2e0f3ce Move high volume service id to config 2020-03-25 15:25:45 +00:00
Rebecca Law
db4b4d929d - If the task runs twice and the notification already exists ignore the primary key constraint.
- Remove prints
- Add some more tests
- Only allow the new method to run for emails
2020-03-25 12:39:15 +00:00
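One way to "ignore the primary key constraint" when the task runs twice, sketched with Flask-SQLAlchemy's session (the `db` import and function name are stand-ins):

```python
from sqlalchemy.exc import IntegrityError

from app import db  # hypothetical Flask-SQLAlchemy object


def save_notification_if_new(notification):
    try:
        db.session.add(notification)
        db.session.commit()
    except IntegrityError:
        # the task ran twice and the row already exists - nothing to do
        db.session.rollback()
```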
Rebecca Law
a13bcc6697 Reduce the pressure on the db for API post email requests.
Instead of saving the email notification to the db, add it to a queue to save later.
This is an attempt to alleviate pressure on the db from the api requests.
This initial PR is to trial it and see if we see an improvement in the api performance and a reduction in queue pool errors. If we are happy with this we could remove the hard coding of the service id.

In a nutshell:
 - If POST /v2/notification/email is from our high volume service (hard coded for now), create the notification and send it to a queue that will persist it to the db.
 - create a save_api_email task to persist the notification
 - return the notification
 - New worker app to process the save_api_email tasks.
2020-03-25 07:59:05 +00:00
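A sketch of that flow; names such as `HIGH_VOLUME_SERVICE_ID`, `build_notification` and `persist_notification` are stand-ins for whatever the real code uses:

```python
@notify_celery.task(name="save-api-email")
def save_api_email(notification_data):
    # the new worker app picks this up and writes the row to the db
    persist_notification(**notification_data)


def post_email_notification_v2(request_json, service, api_key):
    notification = build_notification(request_json, service, api_key)
    if str(service.id) == HIGH_VOLUME_SERVICE_ID:  # hard coded for now
        # don't touch the db during the request - queue the write instead
        save_api_email.apply_async((notification,), queue="save-api-email-tasks")
    else:
        persist_notification(**notification)
    return notification
```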
Katie Smith
3a07d1e13d Create new sms-callbacks queue
The `delivery-worker-receipts` app will listen to this new queue, which will
be used for processing the responses from Firetext and MMG.
2020-03-19 13:41:14 +00:00
David McDonald
f56795655e Remove unused STATSD_PREFIX variable
We moved from sending statsd metrics to hosted graphite to sending them
to a statsd instance running on the paas. Therefore we no longer need to
send statsd metrics with a particular prefix, as the statsd app is only
receiving metrics from our apps (not from other users, as would have
been the case with HostedGraphite).

This should change no behaviour as the only place the environment
variable was being used was in the gunicorn config and it was an empty
string which is the default behaviour anyway as per:
https://docs.gunicorn.org/en/stable/settings.html#statsd-prefix
2020-03-05 10:41:26 +00:00
David McDonald
e6767590d4 Change function and task name to be more accurate
Will require us to change a cronitor set up
2020-02-21 15:01:19 +00:00
David McDonald
2dc5550159 Change variable name to make it more descriptive
Also remove unnecessary if statement
Also add manifest change to make sure relevant environment variables
make it into the app
2020-02-20 15:48:15 +00:00
David McDonald
7246306447 Support multiple secrets for ADMIN_CLIENT_SECRETS
This will allow us to accept two different ones and therefore allow us
to rotate the secret that the admin client is sending to the API

Due to how the notifications-python-client throws exceptions, we run
into exactly the same issue of not being able to distinguish whether a
`TokenDecodeError` is thrown because the token was encrypted with a
different secret key or because there was a different error when
decoding. I've copied the TODO from `requires_auth` as this is exactly
the same issue.

I've also added a test case for functionality that was missing for an
out of date admin token (old IAT).
2020-02-20 13:47:39 +00:00
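A sketch of checking a token against each configured secret, using PyJWT directly purely for illustration (the real code goes through the notifications-python-client's token handling):

```python
import jwt  # PyJWT, for illustration only


def decode_admin_token(token, admin_client_secrets):
    """Try each configured secret so the admin client secret can be rotated."""
    for secret in admin_client_secrets:
        try:
            return jwt.decode(token, secret, algorithms=["HS256"])
        except jwt.InvalidTokenError:
            # could be the other secret, or a genuinely bad token -
            # we can't tell the difference here (same TODO as requires_auth)
            continue
    raise ValueError("token could not be decoded with any configured secret")
```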
David McDonald
52d3df49d4 Make ADMIN_CLIENT_SECRET a list of a single secret
And support this change across our code. Note, this is a halfway step
where it is now a list rather than a string but still only supports a
single secret, i.e. one item in the list.
2020-02-20 13:43:10 +00:00
David McDonald
a14d5f0225 Remove task that no longer runs
We no longer put files in these s3 buckets (and have in fact deleted
the buckets), therefore this task is redundant and can be removed.
2020-02-06 10:57:43 +00:00
Rebecca Law
f4c0f70ba9 Send the alert for letters-still-sending an hour earlier.
These alerts are sent to our postal provider, and they usually arrive as they are getting ready to go home for the day or the weekend,
which means they get missed/overlooked. They have agreed to get the alert an hour earlier; perhaps that will improve the response time.
2020-01-13 10:42:30 +00:00
Katie Smith
8f144be29c Add config for new template preview task
Added the queue and task names for the new template preview task to the
config. Also added the new bucket name that template preview will use
for the sanitised letters to the config for all environments.
2019-12-16 11:30:56 +00:00
Leo Hemsted
31d1abd6d1 add task to move sms providers back towards shared load
we generally aim to share the load between the two providers equally
(more or less). When one provider has struggled, we deprioritise them;
this commit adds a function that gradually restores balance. It checks
every five minutes: if it's been more than an hour since the providers
were last changed then it adjusts them towards a 50/50 split. Except
it's not quite 50/50 due to #reasons (we want to slightly favour MMG),
it's actually 60/40. That's defined in a new dict in config.py.
2019-12-13 10:02:39 +00:00
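The adjustment logic sketched as a pure function; the 10-percent step comes from an earlier commit in this log, and the resting-point dict mirrors the one described for config.py:

```python
SMS_PROVIDER_RESTING_POINTS = {"mmg": 60, "firetext": 40}  # not quite 50/50, favours MMG


def adjusted_priorities(current, resting=SMS_PROVIDER_RESTING_POINTS, step=10):
    """Move each provider's share one step back towards its resting point."""
    new = {}
    for provider, target in resting.items():
        value = current[provider]
        if value < target:
            value = min(value + step, target)
        elif value > target:
            value = max(value - step, target)
        new[provider] = value
    return new


# e.g. after a provider outage: {"mmg": 100, "firetext": 0} -> {"mmg": 90, "firetext": 10}
```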
Pea M. Tyczynska
2019070536 Merge pull request #2667 from alphagov/warn-team-about-high-failure-rates
Warn team about high failure rates
2019-12-09 11:28:25 +00:00
Pea Tyczynska
d72ab4f4a6 Send zendesk ticket when services found with high failure rates 2019-12-06 16:57:04 +00:00
Leo Hemsted
4701e5d9af don't define MMG_URL and FIRETEXT_URL in manifest
these URLs never change, and it led to surprising issues where an
updated default MMG_URL wasn't actually respected on PaaS. These urls
aren't private and don't need to be stored in credentials.

By not defining them in the manifest, we expect them to use the default
unless `cf set-env` has been specifically used to modify them in an app.
2019-12-04 15:26:49 +00:00
Leo Hemsted
6b9afa358f update utils to bring in full welsh diacritics range
note: this includes updating the MMG api url to their v2a api. Their
previous API doesn't include support for capital o with grave accent
(Ò)
2019-11-28 15:12:52 +00:00
Rebecca Law
9def176e7a Fix typo in config 2019-11-06 10:56:37 +00:00
Rebecca Law
74546a265e Added cron for check_for_missing_rows_in_completed_jobs
Run the task every 10 minutes. Does that seem reasonable? Maybe that is too often.
2019-11-06 10:49:46 +00:00