Previously we used a combination of "provider.name" and "get_name()"
which was confusing. Using a non-property function also gave me the
impression that the name was more dynamic than it actually is.
At the moment, when we are processing and sending an SMS we open
a DB connection at the start of the celery task and then close it
at the end of the celery task. Nice and simple.
However, during that celery task we make an HTTP call out to our
SMS providers. If our SMS providers have problems or response times
start to slow then it means we have an open DB connection sat waiting
for our SMS providers to respond which could take seconds. If our
SMS providers grind to a halt, this would cause all of the
celery tasks to hold on to their connections and we would run out
of DB connections and Notify would fall over.
We think we can solve this by closing the DB session which releases
the DB connection back to the pool.
Note, we've seen this happen in staging during load testing if our
SMS provider stub has fallen over. We've never seen it in production
and it may be less unlikely to happen as we are balancing traffic
across two providers and they generally have very good uptime.
One downside to be aware of is there could be a slight increase in
time spent to send an SMS as we will now spend a bit of extra time
closing the DB session and then reopening it again after the HTTP
request is done.
Note, there is no reason this approach couldn't be copied for our
email provider too if it appears successful.
As of 041d8b48a2
it’s not valid to call `random.choices` without giving at least one of
the options a positive weighting.
This makes sense, because giving a zero weighting is effectively saying
‘theres’s only one choice, but don’t choose it’.
In our codebase this is applicable where there’s only one international
provider, which we want to use even when it’s been de-prioritised for
domestic SMS.
This doesn’t cause a problem now, but will if we upgrade to Python
versions greater than 3.9.0.
Just looks a bit tidier and less repetitive.
I’ve only done this for the serialised service because:
- we’re only checking this in places where we’re already using the
serialised service
- if we want to check this elsewhere there’s a good chance that new code
should be using the serialised service, since it’ll itself be doing
some kind of performance optimisation
We want to start using Firetext for sending international SMS. They
require us to use a different API key for international SMS because it
requires a new code path to switch the sender ID to something that the
country will accept.
This PR does not include switching the sender of international SMS to
Firetext but sets us up to do so.
Add caching by using the SeriralisedTemplate and SerialisedService objects
Removed extra call to the database to fetch the notification after the commit by saving the created_at and key_type to a local variable. After the update to the notification to mark it as sending the db.session is committed. Any reference to the the Notification data model after that will require a query to fetch the object again because it is considered "dirty" or out of date.
Added name, sms_prefix and email branding to SerialisedService.
Refactor the get_html_options to work with the SerialisedService object.
Removed the need to validate and format the to field by using `normalised_to`, since when persisting the notification the `normalised_to` field has already had this done.
Removed the validate and format for reply_to_text for email reply_to, this has been done when the email address has been added via the frontend, no need to validate this address every time a services sends an email.
By adding SerialisedTemplate we can avoid a database call for the template. This is useful when sending many many emails/sms for the same template/version.
Remove 2 extra select queries after the update and commit. Once a transaction is committed SQLAlchemy will query for the db model if referenced after a commit.
For every email or text message we send we have to work out which
provider to send it with. Every time we do this we go and load the list
of providers from the database.
For emails, the result will always be the same.
For text messages the result is randomly chosen to balance the load
between the providers.
For international text messages the result is always the same (we only
have one international text message provider).
This commit adds an in-memory cache with a 2 second TTL so that we’re
not fetching the providers from the database every time, which should
speed things up a bit.
This does mean that, for text messages, the random choice will ‘stick’
for two seconds on each instance, before being re-chosen. I think this
is OK because it will even out to the same distribution over time.
I really don’t like having to clear the cache in the tests, so would
welcome suggestions on a better way of doing this…
`service.id` is a uuid so will not be matched to anything in
`current_app.config.get('HIGH_VOLUME_SERVICE')` because that is a list
of strings.
This is why we are never falling into the first if statement and having
any metrics for high volume services on our dashboards at the moment.
Note, I had taken the existing line from the `post_notification`
endpoint, but that is using a serialised service which already has the
UUID converted to a string.
Changes the high volume and not high volume metrics to both only include
non test notifications. This is because when looking at the grafana
metrics, it was impossible to tell what affect the high volume/non high
volume effect was having vs the test/live notification effect.
This leaves us with no break down of high volume/not high volume sending
times for test notifications but I don't think we really need that.
We currently measure the sending time for all. This commit then breaks
it down into
- test keys and non test keys
- high volume services and non high volume services
Breaking it down into test keys and non test keys is important because
we don't care as much about sending test notifications within 10
seconds, only non test keys so we don't want our graphs to reflect poor
performance if it's just test keys affecting this
Breaking it down into high volume and non high volume will allow us to
easily debug issues with slow sending if they are high volume or non
high volume issues
i think it's causing havoc with my attempts to mock stuff in the
`app.clients` directory because it's also accessible at that path. the
name's super vague and doesn't explain what it is anyway
This prevents a race condition when we get delivery receipt before
updating notification to sending, and so the sending status would
supersede the delivered status, and the notification would time out
as temporary-failure after three days.
it's not acceptable for a constantly failing provider to take 50 minutes
to drain (5x reducing priority by 10). But similarly, we need _some_
delay, or a handful of concurrent failures will completely turn off a
provider, rendering the whole excercise kinda pointless. Setting the
delay before it tries to reduce priority again to one minute is nice
because it means that if one request times out and returns 502, then any
other requests that are in flight at that time will time out before the
one minute is up and not switch, but any requests made after the switch
that take sixty seconds to time out will affect it.
retrive the sms providers from the DB, and decrease the chosen
provider's priority by 10, while increasing the other by 10.
add a check in to ensure we never decrease below 0 or increase above 100
- this is per provider, we don't check that the two add up to 100 or
anything. If the values are outside of this range (eg: set via the UI)
then they'll probably* fix themselves at some point - we've added tests
to document these cases.
Use with_for_update to ensure that the method can only run once at a
time - other invocations of the function will be held on that line until
the currently running one ends and commits the transaction. This doesn't
affect anyone doing things from the UI.
Utils 33.0.0 adds alt text to email branding - the HTMLEmailTemplate now
initializes slightly differently as a result (with both `branding_name`
and `branding_text`).
- Remove some redundant code for research mode.
- The international parameter in update_notification_to_sending is not needed.
- Update unit tests and removed duplicates
If we try to send an SMS to the provider and the provider throws an exception
(because they return a 503 status code) the notification should retry. But if
we get the callback from the provider before the notification has been retried, the
notification will have no billable units or provider set.
To avoid this, we now set billable_units and provider even if there has been
an exception from our provider.
‘GOV.UK’ doesn’t make sense as a type of brand. It only made sense as
a type of branding that a service had.
Since we’ve:
- deprecated the service branding column
- made sure it’s not used as a value in the email branding table
we can remove this value from the table of possible brand types.
https://www.pivotaltracker.com/story/show/159986276
We are now setting the type of branding on the branding itself, not on
the service.
This commit switches over from looking in the old place (on the service)
to looking in the new place (on the branding).
The JobStatistics table is going to be deleted. There are currently
3 tasks which use the JobStatistics model via the Statistics DAO, so we
need to make sure that these tasks aren't being used before they are
deleted in a separate PR.
This commit deletes:
* The `create_initial_notification_statistic_tasks` function which gets
used to call the `record_initial_job_statistics` task.
* The `create_outcome_notification_statistic_tasks` function which gets
used to call the `record_outcome_job_statistics` task.
* And the scheduling of the `timeout-job-statistics` scheduled task.
notable things that have been kept until migration is complete:
* passing in `organisation` to update_service will update email branding
* both `/email-branding` and `/organisation` hit the same code
* service endpoints still return organisation as well as email branding
This PR is a proposal to reduce the average messages we see for a single notification from about 7 messages to 2.
Messaging would change to something like this:
February 2nd 2018, 15:39:05.885 Full delivery response from Firetext for notification: 8eda51d5-cd82-4569-bfc9-d5570cdf2126
{'status': ['0'], 'reference': ['8eda51d5-cd82-4569-bfc9-d5570cdf2126'], 'time': ['2018-02-02 15:39:01'], 'code': ['000']}
February 2nd 2018, 15:39:05.885 Firetext callback return status of 0 for reference: 8eda51d5-cd82-4569-bfc9-d5570cdf2126
February 2nd 2018, 15:38:57.727 SMS 8eda51d5-cd82-4569-bfc9-d5570cdf2126 sent to provider firetext at 2018-02-02 15:38:56.716814
February 2nd 2018, 15:38:56.727 Starting sending SMS 8eda51d5-cd82-4569-bfc9-d5570cdf2126 to provider at 2018-02-02 15:38:56.408181
February 2nd 2018, 15:38:56.727 Firetext request for 8eda51d5-cd82-4569-bfc9-d5570cdf2126 finished in 0.30376038211397827
February 2nd 2018, 15:38:49.449 sms 8eda51d5-cd82-4569-bfc9-d5570cdf2126 created at 2018-02-02 15:38:48.439113
February 2nd 2018, 15:38:49.449 sms 8eda51d5-cd82-4569-bfc9-d5570cdf2126 sent to the priority-tasks queue for delivery
To somthing like this:
February 2nd 2018, 15:39:05.885 Firetext callback return status of 0 for reference: 8eda51d5-cd82-4569-bfc9-d5570cdf2126
February 2nd 2018, 15:38:49.449 sms 8eda51d5-cd82-4569-bfc9-d5570cdf2126 created at 2018-02-02 15:38:48.439113
* unused variables
* variables in loops overshadowing imports
* excepts with no defined exc type (tried to avoid `except Exception` too)
* history mapper is still too complex
* default variables should never be mutable