The task was raising a JobIncompleteError, yet it's not an error the task is performing it's task correctly and calling the appropriate task to restart the job.
Also used apply_sync to create the task instead of send_task.
This will help us better understand how far through the task has got if
it gets interrupted halfway (as was the case this morning and we
struggled to understand).
solves `AttributeError: 'DummySession' object has no attribute 'query'`
if you don't do this you get really hard to diagnose errors in unrelated
tests, due to strange import order problems or something
task takes a brodcast_message_id, and makes a post to the cbc-proxy
for now, hardcode the url to the notify stub. the stub requires template
as the admin/api get it, so use the marshmallow schema to json dump it.
Note - this also required us to tweak the BroadcastMessage.serialize
function so that it converts uuids in to ids - flask's jsonify function
does that for free but requests.post doesn't sadly.
if the request fails (either 4xx or 5xx) just raise an exception and let
it bubble up for now - in the future we'll add retry logic
We will split them into three categories:
- first class
- second class
- international - this zip file will have letters for both europe
and rest-of-world postage classes
grab the worker app name and task name rather than the web host and
endpoint. also add a fallback for if we're not in a web request or a
celery task. I think that'll probably happen when we use alembic, or if
we do things from within flask shell
add the following prometheus metrics to keep track of general app
instance health.
# sqs_apply_async_duration
how long does the actual SQS call (a standard web request to AWS) take.
a histogram with default bucket sizes, split up per task that was
created.
# concurrent_web_request_count
how many web requests is this app currently serving. this is split up
per process, so we'd expect multiple responses per app instance
# db_connection_total_connected
how many connections does this app (process) have open to the database.
They might be idle.
# db_connection_total_checked_out
how many connections does this app (process) have open that are
currently in use by a web worker
# db_connection_open_duration_seconds
a histogram per endpoint of how long the db connection was taken from
the pool for. won't have any data if a connection was never opened.
us not recognising a code or provider not having sent the detailed
status.
It seems like Firetext is sometimes sending us permanent-failure
without detailed status. It could be due to:
- them really not sending any detailed status
- them sending a status code we don't recognise
- them sending 000 code that means 'no errors', which we ignore
To see which one it is, and to debug such issues quicker in the
future, this PR adds status and detailed status codes to the logs.
Also log detailed delivery status for firetext in the same place in addition
to it being logged from notifications_dao.
Logging detailed delivery statuses will help us see why messages
fail to deliver. In the future we could persist detailed delivery
status in the database.
If a service has permission to send international letters then it should
tell template preview, so that template preview knows what rule to
apply when it’s validating the address of the letter.
Depends on:
- [ ] https://github.com/alphagov/notifications-template-preview/pull/445
We've seen only some of these reporting tasks happen but with no log
messages to indicate what happened and no app crashes. This hopefully
will give us a better picture of a timeline.
Note, I've tried to make our message format very consistent and good for
searching for in kibana so I've changed that across this whole file for
consistency.
The reporting worker tasks fetch large amounts of data from the db, do
some processing then store back in the database. As the reporting worker
only processes the create nightly billing/stats table tasks, which
aren't high performance or high volume, we're fine with the performance
hit from restarting the worker between every task (which based on
limited local testing takes about a second or so).
This causes some real funky shit with the app_context (used for
accessing current_app.logger). To access flask's global state we use the
standard way of importing `from flask import current_app`. However,
processes after the first one don't have the current_app available on
shut down (they're fine during the actual task running), and are unable
to call `with current_app.app_context()` to create it. They _are_ able
to call `with app.app_context()` to create it, where `app` is the
initial app that we pass in to `NotifyCelery.init_app`.
NotifyCelery.init_app is only called once, in the master process - I
think the application state is then stored and passed to the celery
workers. But then it looks like the teardown might clear it, but it
never gets set up again for the new workers? Unsure.
To fix this, store a copy of the initial flask app on the NotifyCelery
object and then use that from within the shutdown signal logging
function.
Nothing's ever easy ¯\_(ツ)_/¯
create-letters-pdf-tasks is for contacting template preview.
letters-tasks is for doing other things around letters (including putting things on template preview queue)
At the moment we’re not consistent:
Precompiled (API and one-off):
`to` has the whole address
`normalised_to` has nothing
Templated (API, CSV and one off):
`to` has the first line of the address
`normalised_to` has nothing
This commit makes us consistently store the whole address in the `to`
field. We think that people might want to search by postcode, not just
first line of the address.
This commit also starts to populate the normalised_to field with the
address lowercased and with all spaces removed, to make it easier to
search on.
We don’t need to reformat the postcode here once template preview takes
care of it when rendering the PDF.
It’s better (and less code) to store what people give us, so we give
them back the same thing.
Based on this https://github.com/alphagov/notifications-api/pull/2788
where some concerns were raised. This should be a quicker fix to get our
the deletions to run sequentially for all the notification types. Note,
email is first as most important as makes up the larger numbers (we
wouldn't want it to start with SMS, fail half way for a reason that only
affects SMS and for that to affect the email deletion).
We hope that by running sequentially we will reduce conflicts writing to
the same index and this will speed up the total time it takes to finish
deleting all notification types older than their retention time. There
is a risk that whilst quicker per job, as they now run sequentially
rather than potentially overlapping, they will take longer overall. We
will need to monitor to see.
Text messages have placeholders in their body.
Emails have them in their subject line too.
Letters have them in their body, subject line and contact block.
We were only looking in the the body and subject when processing a job,
therefore the thing assembling the letter was not looking in all the
CSV columns it needed to, because it hadn’t been told about any
placeholders in the contact block.
Fixing this means always making sure we use the correct `Template`
instance for the type of template we’re dealing with. Which we were
already doing in a different part of the codebase. So it makes sense to
reuse that.
Turns out we fixed the same bug for email subjects over 3 years ago:
3ed97151ee
This has been replaced by a new task, `sanitise-letter`, to this deletes
all the code in the old task and ensures that when antivirus is not
enabled locally we are calling the new task.
Move finding of letter logic into a separate method so it can be unit
tested rather than having to test it by checking the contents of a
zendesk ticket api call.
This will enable us to change the zendesk api ticket call message
without needing to edit lots of tests.
This task didn't have retries before, based on the assumption that if
the task failed it was likely to be due to a Boto error, so retrying
would not help because a file was probably not in the expected bucket.
During an incident with the database, we had some letters that were
stuck in the `pending-virus-check` state because this task failed.
This change adds retries to the task if there was an Exception that was
not a `BotoClientError`.