at the moment only EE is enabled (this is set in app.config, but also,
only EE have a function defined for them so even if another provider was
enabled without changing the dict in cbc_proxy.py we won't trigger
anything). this commit just adds wrapper tasks that check what providers
are enabled, and invokes the send function for each provider.
The send function doesn't currently distinguish between providers for
now - as we only have EE set up. in the future we'll want to separate
the cbc_proxy_client into separate clients for separate providers.
Different providers have different lambda functions, and have different
requirements. For example, we know that the two different CBC software
solutions handle references to previous messages differently.
This is causing the disk of the CBCs to fill up quickly, and their
logrotate seems a bit flakey
Reducing the rate will ensure the disks fill up less often
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
We are phasing out our cbc-proxy stub which displayed CAP XML messages
We are in the process of testing with real CBCs, so maintaining our own
stub is not useful
This commit
* removes the HTTP POST requests to the CBC proxy
* writes up the update/cancel methods of the cbc_client (not impl)
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
We are going to invoke a lambda to send a message to the CBC
We need a CBC Proxy Client to do this
The Client will be able to send/update/cancel broadcasts in the CBC
Unless we have configured the app with AWS credentials for the
CBCProxyClient, we just want to use a client that does nothing: the noop
client
The AWS access keys are separate for the CBC Proxy vs other Notify AWS
things because the CBC Proxy lives in another AWS account
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Pea <pea.tyczynska@digital.cabinet-office.gov.uk>
Co-authored-by: Katie <katie.smith@digital.cabinet-office.gov.uk>
This keeps things consistent with the live environment and also how we
do it for the admin app where it is entirely up to environment variables
whether redis is enabled or not. This changes nothing in terms of
functionality as currently in our environment variables redis is enabled
for the API in staging.
task takes a brodcast_message_id, and makes a post to the cbc-proxy
for now, hardcode the url to the notify stub. the stub requires template
as the admin/api get it, so use the marshmallow schema to json dump it.
Note - this also required us to tweak the BroadcastMessage.serialize
function so that it converts uuids in to ids - flask's jsonify function
does that for free but requests.post doesn't sadly.
if the request fails (either 4xx or 5xx) just raise an exception and let
it bubble up for now - in the future we'll add retry logic
Same as we’ve done for templates.
For high volume services this should mean avoiding calls to external
services, either the database or Redis.
TTL is set to 2 seconds, so that’s the maximum time it will take for
revoking an API key or renaming a service to propagate.
Some of the tests created services with the same service ID. This
caused intermittent failures because the cache relies on unique service
IDs (like we have in the real world) to key itself.
see https://github.com/alphagov/notifications-utils/pull/752 for
implementation details. Cache DNS for 15 seconds. Note: This cache is
per eventlet, so concurrent requests will handle their requests
separately, but the cache does persist between sequential requests.
re-enable statsd for all environments.
We have seen bad latency across all apps in the last week. After running
a canary with statsd turned off we have seen these issues dissapear on
that canary instance. We therefore will turn off statsd for all
instances of the API. We will still need to investigate tomorrow what
exactly changed or is causing the issue with statsd as we hadn't made
any changes to it ourself.
Note, this keeps statsd on for all the other apps but this will be a
first step. We will also need to check performance of the other apps
after releasing this.
To check if it was our congestion point for the soak test.
(Email stub is a bit slower to respond than Amazon SES, and
the difference could lead to us having to many connections to db open
as we wait for response from the stub)
If we set an environment variable, we can stub out calls to SES and send
them to our own stub app. If the environment variable is not set, things
work as normal.
To be used alongside
https://github.com/alphagov/notifications-email-provider-stub
We have seen the reporting app run out of memory multiple times when
dealing with overnight tasks. The app runs 11 worker threads and we
reduce this to 2 worker threads to put less pressure on a single
instance.
The number 2 was chosen as most of the tasks processed by the reporting
app only take a few minutes and only one or two usually take more than
an hour. This would mean with 2 processes across our current 2
instances, a long running task should hopefully only wait behind a few
short running tasks before being picked up and therefore we shouldn't
see large increase in overall time taken to run all our overnight
reporting tasks.
On top of reducing the concurrency for the reporting app, we also set
CELERYD_PREFETCH_MULTIPLIER=1. We do this as suggested by the celery
docs because this app deals with long running tasks.
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit
The chance in prefetch multiplier should again optimise the overall time
it takes to process our tasks by ensuring that tasks are given to
instances that have (or will soon have) spare workers to deal with them,
rather than committing to putting all the tasks on certain workers in
advance.
Note, another suggestion for improving suggested by the docs for
optimising is to start setting `ACKS_LATE` on the long running tasks.
This setting would effectively change us from prefetching 1 task per
worker to prefetching 0 tasks per worker and further optimise how we
distribute our tasks across instances. However, we decided not to try
this setting as we weren't sure whether it would conflict with our
visibility_timeout. We decided not to spend the time investigating but
it may be worth revisiting in the future, as long as tasks are
idempotent.
Overall, this commit takes us from potentially having all 18 of our
reporting tasks get fetched onto a single instance to now having a
process that will ensure tasks are distributed more fairly across
instances based on when they have available workers to process the
tasks.
The reporting worker tasks fetch large amounts of data from the db, do
some processing then store back in the database. As the reporting worker
only processes the create nightly billing/stats table tasks, which
aren't high performance or high volume, we're fine with the performance
hit from restarting the worker between every task (which based on
limited local testing takes about a second or so).
This causes some real funky shit with the app_context (used for
accessing current_app.logger). To access flask's global state we use the
standard way of importing `from flask import current_app`. However,
processes after the first one don't have the current_app available on
shut down (they're fine during the actual task running), and are unable
to call `with current_app.app_context()` to create it. They _are_ able
to call `with app.app_context()` to create it, where `app` is the
initial app that we pass in to `NotifyCelery.init_app`.
NotifyCelery.init_app is only called once, in the master process - I
think the application state is then stored and passed to the celery
workers. But then it looks like the teardown might clear it, but it
never gets set up again for the new workers? Unsure.
To fix this, store a copy of the initial flask app on the NotifyCelery
object and then use that from within the shutdown signal logging
function.
Nothing's ever easy ¯\_(ツ)_/¯
We temporarily updated the provider resting points in
https://github.com/alphagov/notifications-api/pull/2804.
This puts the provider resting points back to their original value now
that both providers seem to be functioning well.
We are seeing issues with one of our providers. Move all traffic to the
other. We have done this with the provider load balancer, however this
will be reverted back 10 percent every hour to these resting points,
which we want to stop happening until we are confident our provider has
fixed their issues.
In the long run, we should add functionality to pause our load balancer
behaviour.
Moving it from 4:15am to 3:00. This will mean we do more deleting before
the 'work day' starts and improve our DB performance.
I've looked at the last 14 day of logs for the
`create-nightly-notification-status` subtasks and the
`create-nightly-billing` subtasks. The latest they appear to finish is
2.30AM. There were some outliers but I believe these were people running
the tasks in the middle of the day as a manual process.
Obviously, this still means there is a risk that those tasks conflict
with `delete-notifications-older-than-retention`, even more so now that
we move this to 3am.
Based on this https://github.com/alphagov/notifications-api/pull/2788
where some concerns were raised. This should be a quicker fix to get our
the deletions to run sequentially for all the notification types. Note,
email is first as most important as makes up the larger numbers (we
wouldn't want it to start with SMS, fail half way for a reason that only
affects SMS and for that to affect the email deletion).
We hope that by running sequentially we will reduce conflicts writing to
the same index and this will speed up the total time it takes to finish
deleting all notification types older than their retention time. There
is a risk that whilst quicker per job, as they now run sequentially
rather than potentially overlapping, they will take longer overall. We
will need to monitor to see.
Once a contact list is gone from the database there’s no way to
reference it again. Any jobs have made their own copy.
So we can clean it up, meaning we’re not storing personal data longer
than we need to.