We know there is at least one system which wants to integrate with
Notify to send out emergency alerts, rather than creating them manually.
This commit adds an endpoint to the public API to let them do that.
To start with we'll only let the system create alerts, rather than
create and approve them in a single call, meaning they still have to be
approved manually. This reduces the risk of an attacker being able to
broadcast an alert via the API, should the other system be compromised.
We’ve worked with the owners of the other system to define which fields
we should care about initially.
This is a pretty big and convoluted refactor, unfortunately.
Previously:
There was one global `cbc_proxy_client` object in the app. This class
had the information about how to invoke the bt-ee lambda, and handled
all calls to lambda, including calls to the canary (which is a separate
lambda).
The future:
There's one global `cbc_proxy_client`. This knows about the different
provider functions and lambdas, and you'll need to ask this client for a
proxy for your chosen provider. Call `cbc_proxy_client.get_proxy('ee')`
and it'll return you a proxy that knows what EE's lambda function is,
how to transform any content in a way that is exclusive to EE, and in
future how to parse any response from EE.
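A minimal sketch of that factory shape, assuming Python; the `CBCProxyEE` class name, its attributes and the lambda function name are illustrative, not the actual implementation:

```python
class CBCProxyEE:
    lambda_name = 'bt-ee-proxy'  # hypothetical lambda function name

    def __init__(self, lambda_client):
        self._lambda_client = lambda_client

    def transform_content(self, content):
        # EE-specific content transformations would live here
        return content


class CBCProxyClient:
    _proxies = {'ee': CBCProxyEE}

    def __init__(self, lambda_client=None):
        self._lambda_client = lambda_client

    def get_proxy(self, provider):
        # hand back a proxy that knows the provider's lambda and quirks
        return self._proxies[provider](self._lambda_client)


cbc_proxy_client = CBCProxyClient()
ee_proxy = cbc_proxy_client.get_proxy('ee')
```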
The present:
I also cleaned up some duplicate tests.
I'm really not sure about the names of some of these variables - in
particular `cbc_proxy_client` isn't a client - it's more of a Java-style
factory, where you call a function on it to get the client of your
choice.
I think it's causing havoc with my attempts to mock stuff in the
`app.clients` directory because it's also accessible at that path. The
name's super vague and doesn't explain what it is anyway.
We are going to invoke a lambda to send a message to the CBC
We need a CBC Proxy Client to do this
The Client will be able to send/update/cancel broadcasts in the CBC
Unless we have configured the app with AWS credentials for the
CBCProxyClient, we just want to use a client that does nothing: the noop
client
The AWS access keys are separate for the CBC Proxy vs other Notify AWS
things because the CBC Proxy lives in another AWS account
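A sketch of how that selection might look, reusing the `CBCProxyClient` shape from the earlier sketch; the config key names and region are assumptions:

```python
import boto3


class CBCProxyNoop:
    """Used when no CBC AWS credentials are configured: does nothing."""

    def send_broadcast(self, *args, **kwargs):
        pass

    def update_broadcast(self, *args, **kwargs):
        pass

    def cancel_broadcast(self, *args, **kwargs):
        pass


def init_cbc_proxy(config):
    # the CBC Proxy lives in a separate AWS account, so it gets its own keys
    access_key = config.get('CBC_PROXY_AWS_ACCESS_KEY_ID')
    secret_key = config.get('CBC_PROXY_AWS_SECRET_ACCESS_KEY')
    if not (access_key and secret_key):
        return CBCProxyNoop()
    lambda_client = boto3.client(
        'lambda',
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        region_name='eu-west-2',
    )
    return CBCProxyClient(lambda_client)
```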
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Pea <pea.tyczynska@digital.cabinet-office.gov.uk>
Co-authored-by: Katie <katie.smith@digital.cabinet-office.gov.uk>
We don't want an exception thrown while recording metrics to affect a user action. A KeyError was thrown today which meant a user saw a 500; the action being performed was downloading a document from the document-download-frontend app. By catching the error we prevent the user from seeing a 500 when recording the connection metric fails.
Also catch the exception in the checkout event.
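A minimal sketch of the pattern, assuming a statsd-style client with an `incr` method; the function and metric names are illustrative:

```python
import logging

logger = logging.getLogger(__name__)


def record_connection_metric(statsd_client, metric_name='document-download.connection'):
    try:
        statsd_client.incr(metric_name)
    except Exception:
        # a failure to record a metric must never break the user's action
        logger.exception('Failed to record metric %s', metric_name)
```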
new blueprint `/service/<id>/broadcast-message` with the following
endpoints:
* GET / - get all broadcast messages for a service
* GET /<id> - get a single broadcast message
* POST / - create a new broadcast message
* POST /<id> - update an existing broadcast message's data
* POST /<id>/status - move a broadcast message to a new status
I've kept the regular data update (eg personalisation, start and end
times) separate from the status update, just to keep separation of
concerns a bit more rigid, especially around who can update.
I've included schemas for the three POSTs, they're pretty
straightforward.
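A rough sketch of the blueprint's shape, assuming Flask; view bodies, schema validation and permission checks are omitted:

```python
from flask import Blueprint

broadcast_message_blueprint = Blueprint(
    'broadcast_message',
    __name__,
    url_prefix='/service/<uuid:service_id>/broadcast-message',
)


@broadcast_message_blueprint.route('', methods=['GET'])
def get_broadcast_messages_for_service(service_id):
    ...


@broadcast_message_blueprint.route('/<uuid:broadcast_message_id>', methods=['GET'])
def get_broadcast_message(service_id, broadcast_message_id):
    ...


@broadcast_message_blueprint.route('', methods=['POST'])
def create_broadcast_message(service_id):
    ...


@broadcast_message_blueprint.route('/<uuid:broadcast_message_id>', methods=['POST'])
def update_broadcast_message(service_id, broadcast_message_id):
    ...


@broadcast_message_blueprint.route('/<uuid:broadcast_message_id>/status', methods=['POST'])
def update_broadcast_message_status(service_id, broadcast_message_id):
    ...
```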
This commit turns off StatsD metrics for the following
- the `dao_create_notification` function
- the `user-agent` metric
- the response times and response codes per flask endpoint
This has been done so that the creation of text messages or emails does
not call out to StatsD during the request process. These are the three
metrics currently emitted while processing one of those requests, and so
they have been removed from the API.
The reason for removing the calls out to StatsD when processing a
request to send a notification is that we have seen two incidents
recently affected by DNS resolution for StatsD (either by a slow down in
resolution time or a failure to resolve). These POST requests are our
most critical code path and we don't want them to be affected by any
potential unforeseen trouble with StatsD DNS resolution.
We are not going to miss the removal of these metrics.
- the `dao_create_notification` metric is rarely/never looked at
- the `user-agent` metric is rarely/never looked at and can be got from
our logs if we want it
- the response times and response codes per flask endpoint are already
exposed using the gds metrics python library
I did not remove the statsd metrics from any other parts of the API
because
- As the POST notification endpoints are the main source of web traffic,
we should have already removed most calls to StatsD, which should
greatly reduce the chance of there being any further issues with
DNS resolution
- Some of the other metrics still provide value so no point deleting
them if we don't need to
- The metrics on celery tasks will not affect any HTTP requests from
users as they are async and also we do not currently have the
infrastructure set up to replace them with a prometheus HTTP endpoint that
can be scraped so this would require more work
Years ago we started to implement a way to schedule a notification. We hit a problem but we never came up with a good solution and the feature never made it back to the top of the priority list.
This PR removes the code for scheduled_for. There will be another PR to drop the scheduled_notifications table and remove the schedule_notifications service permission
Unfortunately, I don't think we can remove the `scheduled_for` attribute from the notification.serialized method because our clients might fail if something is missing. For now I have left it in but defaulted the value to None.
We have one global metrics variable `metrics = GDSMetrics()`, and we
then call `metrics.init_app` from within the flask application setup.
The v2/test_errors.py app_for_test fixture calls create_app, which would
call metrics.init_app multiple times for the same metrics instance. This
causes errors, so change the fixture to session scope so it only calls
create_app once per test run.
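A sketch of the fixture change, assuming pytest and the `create_app` factory referenced above; the fixture body is illustrative:

```python
import pytest

from app import create_app


@pytest.fixture(scope='session')
def app_for_test():
    # session scope means create_app (and so metrics.init_app) runs only
    # once per test run, instead of once per test
    app = create_app()
    yield app
```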
Grab the worker app name and task name rather than the web host and
endpoint. Also add a fallback for when we're not in a web request or a
celery task. I think that'll probably happen when we use alembic, or if
we do things from within flask shell.
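One way that fallback could look, assuming flask and celery; the function name and the 'none' fallback value are assumptions:

```python
from celery import current_task
from flask import has_request_context, request


def get_metric_labels():
    if has_request_context():
        return request.host, request.endpoint
    if current_task:
        # inside a celery task: use the worker app name and the task name
        return current_task.app.main, current_task.name
    # not in a web request or a celery task, eg alembic or flask shell
    return 'none', 'none'
```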
Add the following prometheus metrics to keep track of general app
instance health.
# sqs_apply_async_duration
How long the actual SQS call (a standard web request to AWS) takes.
A histogram with default bucket sizes, split up per task that was
created.
# concurrent_web_request_count
How many web requests this app is currently serving. This is split up
per process, so we'd expect multiple values per app instance.
# db_connection_total_connected
How many connections this app (process) has open to the database.
They might be idle.
# db_connection_total_checked_out
How many connections this app (process) has open that are
currently in use by a web worker.
# db_connection_open_duration_seconds
A histogram per endpoint of how long the db connection was taken from
the pool for. It won't have any data if a connection was never opened.
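Rough definitions of the metrics above, assuming the prometheus_client library; the label names and help strings are assumptions:

```python
from prometheus_client import Gauge, Histogram

SQS_APPLY_ASYNC_DURATION = Histogram(
    'sqs_apply_async_duration',
    'Time taken by the SQS call when queueing a task',
    ['task_name'],
)

CONCURRENT_WEB_REQUEST_COUNT = Gauge(
    'concurrent_web_request_count',
    'Number of web requests this process is currently serving',
)

DB_CONNECTION_TOTAL_CONNECTED = Gauge(
    'db_connection_total_connected',
    'Open database connections for this process, including idle ones',
)

DB_CONNECTION_TOTAL_CHECKED_OUT = Gauge(
    'db_connection_total_checked_out',
    'Open database connections currently in use by a web worker',
)

DB_CONNECTION_OPEN_DURATION_SECONDS = Histogram(
    'db_connection_open_duration_seconds',
    'How long a connection was checked out from the pool',
    ['endpoint'],
)
```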
If we set an environment variable, we can stub out calls to SES and send
them to our own stub app. If the environment variable is not set, things
work as normal.
To be used alongside
https://github.com/alphagov/notifications-email-provider-stub
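A sketch of the switch, assuming boto3; the environment variable name `SES_STUB_URL` and the region are hypothetical:

```python
import os

import boto3


def get_ses_client():
    stub_url = os.environ.get('SES_STUB_URL')
    if stub_url:
        # send SES calls to the provider stub instead of AWS
        return boto3.client('ses', region_name='eu-west-1', endpoint_url=stub_url)
    return boto3.client('ses', region_name='eu-west-1')
```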
Now that the encryption module has been moved from this app to utils, we
can remove it from here (along with its tests) and import it from utils
instead. This also renames the `encryption.py` file to `hashing.py`,
since it no longer contains the encryption class.
It is likely this endpoint will need additional data for the UI to display; for the first iteration this will enable the /uploads page to show both letters and jobs. Only letters uploaded by the UI are included in the resultset.
Add file name to resultset.
We don't use it since we wrote our own provider stubs for performance
tests.
This removes it from the API - it's still in the DB and will be
retrieved by queries, but is set to disabled on prod.
This table is no longer used or referenced in the code.
We can remove this mapping table now that the organisation to service relationship is handled by the foreign key in Services.
Handle werkzeug http errors separately. Also, improve the generic
`Exception` handler to be more flexible when exceptions don't look
like http errors.
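A sketch of the split, assuming Flask; handling werkzeug's HTTPException separately means the generic handler no longer assumes a `code` or `description` attribute exists:

```python
from flask import jsonify
from werkzeug.exceptions import HTTPException


def register_errors(blueprint):
    @blueprint.errorhandler(HTTPException)
    def handle_http_error(error):
        return jsonify(result='error', message=error.description), error.code

    @blueprint.errorhandler(Exception)
    def handle_exception(error):
        # bare exceptions don't look like http errors, so don't assume
        # they carry a status code
        status = getattr(error, 'code', None) or 500
        message = getattr(error, 'description', 'Internal server error')
        return jsonify(result='error', message=message), status
```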
Note: because they're 405s the route isn't matched, so the v2 error
handlers aren't triggered properly. This might be an issue because the
internal endpoints expect a different return format. We might have to
think about how to handle this.
A recent issue with a long-running query (#2288) highlighted the
fact that even though the original HTTP connection might be closed
(for example after gorouter timeout of 15 minutes, which returns a
504 response to the client), the request worker will not be stopped.
This means that the worker is spending time and potentially DB
resources generating a response that will never be delivered.
Gunicorn's timeout setting only applies to sync workers and there
doesn't seem to be an option to interrupt individual requests in
gevent/eventlet deployments.
Since the most likely (and potentially most dangerous) scenario for
this is a long-running DB query, we can set a statement timeout on
our DB connections. This will raise a sqlalchemy.exc.OperationalError
(wrapping psycopg2.extensions.QueryCanceledError), interrupting the
request after the given timeout has been reached.
This is a Postgres client setting, so the database itself will abort
the transaction when it reaches the set timeout.
Since this will also apply to our celery tasks (including potentially
long-running nightly tasks) we set a timeout of 20 minutes to begin
with.
This can potentially be split in the future to set a different value
for each app, so that we could limit API requests even more.
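In Flask-SQLAlchemy terms the timeout could be configured roughly like this; the exact config mechanism is an assumption (1,200,000 ms = 20 minutes):

```python
SQLALCHEMY_ENGINE_OPTIONS = {
    'connect_args': {
        # postgres client setting: the server aborts any statement that
        # runs longer than 20 minutes
        'options': '-c statement_timeout=1200000',
    },
}
```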
By default, SQLAlchemy will start a transaction with an
existing connection without checking that the connection
is still valid.
Enabling "pre-ping" makes the ORM send a `SELECT 1` when
acquiring a connection, which should help avoid some errors
caused by connections breaking during a DB failover.
The added statement has a constant overhead for all transactions,
so we should only keep it enabled when we're planning to switch
or upgrade the database server.
https://docs.sqlalchemy.org/en/latest/core/pooling.html#disconnect-handling-pessimistic
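In plain SQLAlchemy terms, pre-ping is a single engine flag; the database URL below is a placeholder:

```python
from sqlalchemy import create_engine

engine = create_engine(
    'postgresql://localhost/notification_api',
    pool_pre_ping=True,  # issue `SELECT 1` before handing out a pooled connection
)
```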
* create template folder
* rename template folder
* get list of template folders for service (not nested/presented in any
particular way)
* delete template folder
Also removed `lazy=dynamic` from the `template_folder.templates`
relationship. lazy=dynamic returns a query object (which you can then
filter further). We just want to return the entire fetched list, at
least for now.
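Roughly the relationship change, assuming Flask-SQLAlchemy; model and column names other than `templates` are illustrative:

```python
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()


class Template(db.Model):
    __tablename__ = 'templates'
    id = db.Column(db.Integer, primary_key=True)
    folder_id = db.Column(db.Integer, db.ForeignKey('template_folder.id'))


class TemplateFolder(db.Model):
    __tablename__ = 'template_folder'
    id = db.Column(db.Integer, primary_key=True)
    # before: db.relationship('Template', lazy='dynamic') returned a query
    # object for further filtering; the default lazy loading returns the
    # whole fetched list, which is all we need for now
    templates = db.relationship('Template')
```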
Created a platform-stats blueprint and moved the new platform stats
endpoint to the new blueprint (it was previously in the service
blueprint). Since the original platform stats route and the new platform
stats route are now in different blueprints, their view functions can
have the same name without any issues.
> On Python 3.3 or newer, monotonic will be an alias of time.monotonic
> from the standard library. On older versions, it will fall back to an
> equivalent implementation.
– https://pypi.org/project/monotonic/
Allows uploading documents to the Document Download API.
The client is configured with an API host and auth token. There's
no need for a flag to disable the client in the test environments
at the moment since the upload is only triggered by a specific
payload which would only be sent with an explicit goal of using
document download.
The main driver behind this is to allow us to enable http healthchecks
on the `/_status` endpoint. The healthcheck requests happen directly on
the instances, without going through the proxy that would set the header
properly.
In any case, endpoints like `/_status` should be generally accessible by
anything without requiring any form of authorization.
notable things that have been kept until migration is complete:
* passing in `organisation` to update_service will update email branding
* both `/email-branding` and `/organisation` hit the same code
* service endpoints still return organisation as well as email branding
Logging at info level means we no longer print out the celery task
timings, which were useful both to find out whether a task has been
called and to see how long it took. Added an extra timing message for
celery tasks, since these are less frequent than the API calls and
provide more useful information.
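One way such a timing message could be emitted, assuming celery's task signals; the signal-based approach and log format are assumptions:

```python
import logging
import time

from celery.signals import task_postrun, task_prerun

logger = logging.getLogger(__name__)
_task_start_times = {}


@task_prerun.connect
def record_start(task_id=None, task=None, **kwargs):
    _task_start_times[task_id] = time.monotonic()


@task_postrun.connect
def log_task_timing(task_id=None, task=None, **kwargs):
    started = _task_start_times.pop(task_id, None)
    if started is not None:
        logger.info('celery task %s took %.3fs', task.name, time.monotonic() - started)
```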