awscli requires a new version of botocore.
moto requires an old version of boto3, which in turn requires an
old version of botocore.
We had to pin boto to an older version because of the moto issues.
This commit pins awscli to the version currently deployed on prod, so
that it plays nicely with that older version of boto/botocore.
We want to bring the start dates for first class letter rates forward by
a month so that we don't see billing errors when sending first class letters now.
(The feature will still go live at the planned time - this is to let us test things
beforehand.)
We had an issue where the notification postage constraint command ran
into a deadlock after trying to acquire two exclusive access locks on
large tables that are frequently read and modified.
To avoid this happening, we've had to split the upgrade script into
three - one script to apply the not-valid constraint to notifications
table, one for notification_history, and a third to validate the two
constraints.
Note: The first two scripts acquire exclusive access locks, but the
third only needs a row by row lock.
Since this involves changing the existing alembic upgrades, if you've
already upgraded your db you'll need to run the following three commands
to revert your database to a previous good state.
```
alter table notifications drop constraint chk_notifications_postage_null;
alter table notification_history drop constraint chk_notification_history_postage_null;
update alembic_version set version_num = '0229_new_letter_rates';
```
There are two fun quirks of Postgres/SQL that we need to work around:
* any `x = y` where x or y is NULL returns NULL, rather than false.
* check constraints accept NULL or true values as good.
So the check `postage in ('first', 'second')` returns `null` rather
than `false` when postage is itself null. Surprisingly, this passes the
check constraint. To get around this, we have to add an explicit not
null check as well.
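
The two quirks above can be reproduced in a few lines. This is a minimal sketch that uses an in-memory SQLite database as a stand-in for Postgres (an assumption made so the example runs anywhere; both databases treat a check expression that evaluates to NULL as passing). The table names `naive` and `strict` are hypothetical.

```python
import sqlite3

# SQLite stands in for Postgres here (an assumption for portability); both
# treat a CHECK expression that evaluates to NULL as passing.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE naive (postage TEXT CHECK (postage IN ('first', 'second')))"
)
conn.execute(
    "CREATE TABLE strict (postage TEXT CHECK ("
    "postage IS NOT NULL AND postage IN ('first', 'second')))"
)


def try_insert(table, value):
    """Return True if the row is accepted, False if the check rejects it."""
    try:
        conn.execute(f"INSERT INTO {table} (postage) VALUES (?)", (value,))
        return True
    except sqlite3.IntegrityError:
        return False


# NULL IN (...) evaluates to NULL rather than false.
null_in_list = conn.execute("SELECT NULL IN ('first', 'second')").fetchone()[0]

naive_accepts_null = try_insert("naive", None)    # the surprising case
strict_accepts_null = try_insert("strict", None)  # rejected by IS NOT NULL
```

The `naive` table accepts a NULL postage even though NULL is not in the list, while the explicit `IS NOT NULL` in `strict` turns the NULL result into false and rejects the row.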
A NOT VALID constraint is only checked against new or updated rows, not
existing rows. We can call VALIDATE CONSTRAINT against this new
constraint to check the old rows (which we know are good, having run the
command from 74961781). Adding a normal constraint acquires an ACCESS
EXCLUSIVE lock, but VALIDATE CONSTRAINT only needs a SHARE UPDATE
EXCLUSIVE lock.
See 9d4b8961 and 0a50993f for more information on marking constraints
as NOT VALID.
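
As a rough illustration of the add-then-validate pattern, the sketch below builds the SQL for three separate migration scripts. The helper functions and the check expression are hypothetical simplifications, not the contents of the real migrations; only the table and constraint names come from the commands quoted earlier.

```python
def add_not_valid_sql(table, name, check):
    """Add a constraint that is enforced for new writes only."""
    return f"ALTER TABLE {table} ADD CONSTRAINT {name} CHECK ({check}) NOT VALID"


def validate_sql(table, name):
    """Check existing rows against a constraint previously added as NOT VALID."""
    return f"ALTER TABLE {table} VALIDATE CONSTRAINT {name}"


# Simplified check expression, assumed for illustration only.
CHECK = "postage IS NOT NULL AND postage IN ('first', 'second')"

CONSTRAINTS = [
    ("notifications", "chk_notifications_postage_null"),
    ("notification_history", "chk_notification_history_postage_null"),
]

# One script per table for the NOT VALID step, then a third that validates both.
scripts = [[add_not_valid_sql(table, name, CHECK)] for table, name in CONSTRAINTS]
scripts.append([validate_sql(table, name) for table, name in CONSTRAINTS])

for number, statements in enumerate(scripts, start=1):
    for statement in statements:
        print(f"script {number}: {statement};")
```

Keeping each lock-heavy statement in its own script means each one runs in its own transaction, so no single migration holds locks on both tables at once.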
Sets the postage of all letters in notification history (and
notifications) to "second", so that there are no letters with null
postage for billing etc. Commits in ten day chunks (up to ~30k letters).
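
The chunking can be sketched as follows. This only shows how a date range might be split into ten-day windows, each of which would be updated and committed separately; the function name and the dates are hypothetical, and the real command may differ.

```python
from datetime import date, timedelta


def ten_day_chunks(start, end, days=10):
    """Yield half-open [chunk_start, chunk_end) windows covering [start, end)."""
    chunk_start = start
    while chunk_start < end:
        chunk_end = min(chunk_start + timedelta(days=days), end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end


# Hypothetical range, purely for illustration.
chunks = list(ten_day_chunks(date(2018, 8, 1), date(2018, 8, 25)))

for chunk_start, chunk_end in chunks:
    # In the real command, an UPDATE limited to this window would run here,
    # followed by a commit, so each transaction stays small.
    print(f"update letters created from {chunk_start} up to {chunk_end}")
```

Committing per window keeps each transaction short, which limits how long rows in the busy tables stay locked.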
This was introduced in #1811 as a way to avoid sending traffic to newly
created apps where gunicorn had not started yet, as can happen during
a scaling event. These days we depend mostly on scheduled scaling and we
rarely need to scale above the scheduled values.
Yesterday we had an event where (during a traffic spike) the healthcheck
failed causing the instance to be killed and sending a 5XX response code
to all the connections that this instance was handling at the time.
However, this instance was not unhealthy and was serving traffic. The
problem stems from a combination of using async workers, having to limit
the number of database connections and a thread holding onto a db
connection for the entire duration of the request.
Specifically, we end up having requests queued up in gunicorn waiting
for other requests to finish and release the db connection. Some pages
such as the dashboard generate queries that can take >5s.
If a healthcheck request is sent during a traffic spike and the instance in
question was "unfortunate" enough to be handed a few of these long
running queries, the healthcheck request will be queued up behind these
slow requests and will fail to receive a response within 1s [docs].
Ideally we should be able to configure the healthcheck timeout to a
value of our choosing, since we can end up in this situation again in
the future.
docs: https://docs.cloudfoundry.org/devguide/deploy-apps/healthchecks.html#types
To start with, this will be an attribute on the service: when a notification is created, it will look at Service.letter_class to decide which class to use for the letter.
This PR adds Service.letter_class as a nullable column.
It also updates the create_service and update_service methods to default the value to second.
Subsequent PRs will add the check constraint to ensure we only get first or second in the letter_class column, and make that column non-nullable.
This can't be done all at once because it would cause an error if someone inserted or updated a service during the deploy.