use yield_per instead of limit

limit means we only return 50k letters; if there are more than that for
a service, the rest are skipped and won't be picked up until the next
day.

If you remove the limit, SQLAlchemy prefetches query results so it can
build up ORM results, for example collapsing joined rows into single
objects with children. SQLAlchemy streams the data into a buffer, but
will normally still prefetch the entire result set so it can ensure the
integrity of the session (so that if you modify one result that is
duplicated further down in the results, both rows are updated in the
session, for example). We don't care about that here, but we do care
about preventing the result set from taking up too much memory. We can
use `yield_per` to yield from SQLAlchemy to the iterator (in this case
the `for letter in letters_awaiting_sending` loop in
letters_pdf_tasks.py): every time we exhaust 10,000 rows, we go back to
the database to get the next 10k. This way, we only ever need 10k rows
in memory at a time.
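
As a minimal, self-contained sketch of that pattern (using a
hypothetical Letter model rather than the real Notification model, and
assuming the SQLAlchemy 1.x Query API described in the 1.3 docs linked
below):

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class Letter(Base):
    # Hypothetical stand-in for the real Notification model
    __tablename__ = "letters"
    id = Column(Integer, primary_key=True)
    status = Column(String)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add_all([Letter(status="created") for _ in range(25000)])
session.commit()

# yield_per(10000) runs the query once but fetches rows from the cursor
# in chunks of 10k as the loop consumes them, so at most ~10k ORM
# objects are buffered in memory at any time.
letters_awaiting_sending = session.query(Letter).yield_per(10000)
for letter in letters_awaiting_sending:
    pass  # e.g. kick off the task that sends this letter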

This has some caveats, mostly around how we handle the data the query
returns. They're a bit hard to parse, but I'm pretty sure the notable
limitations are:

* It's dangerous to modify ORM objects returned by yield_per queries.
* It's dangerous to join in a yield_per query if there may be more than
  one row per item (for example, if you join from notification to
  service, there'll be multiple result rows containing the same
  service, and if those rows are split across different yield chunks,
  we may experience undefined behaviour; see the sketch below).

These two limitations both come down to there being no guarantee of
one unique row per item.
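
To make the second caveat concrete, this is the join shape to avoid,
sketched against a hypothetical two-table schema (not the real
Notification/Service models):

from sqlalchemy import Column, ForeignKey, Integer, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class Service(Base):
    __tablename__ = "services"
    id = Column(Integer, primary_key=True)


class Notification(Base):
    __tablename__ = "notifications"
    id = Column(Integer, primary_key=True)
    service_id = Column(Integer, ForeignKey("services.id"))


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add(Service(id=1))
session.add_all([Notification(service_id=1) for _ in range(3000)])
session.commit()

# Every one of service 1's notifications produces a result row carrying
# the same Service entity. With yield_per(1000) those duplicate rows
# straddle chunk boundaries, which is exactly where the ORM stops making
# guarantees: treat the rows as read-only and never assume one row per
# entity.
rows = (
    session.query(Notification, Service)
    .join(Service, Notification.service_id == Service.id)
    .yield_per(1000)
)
for notification, service in rows:
    pass  # read-only use only; mutating `service` is the dangerous case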

For more reading:
https://docs.sqlalchemy.org/en/13/orm/query.html?highlight=yield_per#sqlalchemy.orm.query.Query.yield_per
https://www.mail-archive.com/sqlalchemy@googlegroups.com/msg12443.html
Leo Hemsted
2020-10-23 20:06:24 +01:00
parent ed182c2a22
commit 3bc3ed88b3
2 changed files with 30 additions and 9 deletions

@@ -717,9 +717,20 @@ def notifications_not_yet_sent(should_be_sending_after_seconds, notification_typ
     return notifications


-def dao_get_letters_to_be_printed(print_run_deadline, postage):
+def dao_get_letters_to_be_printed(print_run_deadline, postage, query_limit=10000):
     """
-    Return all letters created before the print run deadline that have not yet been sent
+    Return all letters created before the print run deadline that have not yet been sent. This yields in batches
+    of 10k to prevent the query taking too long and eating up too much memory. As each 10k batch is yielded, the
+    get_key_and_size_of_letters_to_be_sent_to_print function will go and fetch the s3 data and then start sending
+    off tasks to the notify-ftp app to send them.
+
+    CAUTION! Modify this query with caution. Modifying filters etc. is fine, but if we join onto another table,
+    then there may be undefined behaviour. Essentially we need each ORM object returned for each row to be
+    unique, and we should avoid modifying the state of returned objects.
+
+    For more reading:
+    https://docs.sqlalchemy.org/en/13/orm/query.html?highlight=yield_per#sqlalchemy.orm.query.Query.yield_per
+    https://www.mail-archive.com/sqlalchemy@googlegroups.com/msg12443.html
     """
     notifications = db.session.query(
         Notification.id,
@@ -738,7 +749,7 @@ def dao_get_letters_to_be_printed(print_run_deadline, postage):
     ).order_by(
         Notification.service_id,
         Notification.created_at
-    ).limit(50000)
+    ).yield_per(query_limit)

     return notifications