use yield_per instead of limit

limit means we only return 50k letters; if there are more than that for
a service, the rest are skipped and won't be picked up until the next
day.

If you remove the limit, SQLAlchemy prefetches query results so it can
build up ORM results, for example collapsing joined rows into single
objects with children. SQLAlchemy streams the data into a buffer, but
will normally still prefetch the entire result set so it can ensure the
integrity of the session (so that if you modify one result that is
duplicated further down in the results, both rows are updated in the
session, for example). We don't care about that here, but we do care
about preventing the result set from taking up too much memory. We can
use `yield_per` to yield from SQLAlchemy to the iterator (in this case
the `for letter in letters_awaiting_sending` loop in
letters_pdf_tasks.py): every time we exhaust 10,000 rows, we go back to
the database to get the next 10k. This way, we only ever need 10k rows
in memory at a time.
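
As a minimal, self-contained sketch of that pattern (using a
hypothetical Letter model rather than the real Notification model, and
assuming the SQLAlchemy 1.x Query API described in the 1.3 docs linked
below):

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class Letter(Base):
    # Hypothetical stand-in for the real Notification model
    __tablename__ = "letters"
    id = Column(Integer, primary_key=True)
    status = Column(String)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add_all([Letter(status="created") for _ in range(25000)])
session.commit()

# yield_per(10000) runs the query once but fetches rows from the cursor
# in chunks of 10k as the loop consumes them, so at most ~10k ORM
# objects are buffered in memory at any time.
letters_awaiting_sending = session.query(Letter).yield_per(10000)
for letter in letters_awaiting_sending:
    pass  # e.g. kick off the task that sends this letter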

This has some caveats, mostly around how we handle the data the query
returns. They're a bit hard to parse, but I'm pretty sure the notable
limitations are:

* It's dangerous to modify ORM objects returned by yield_per queries.
* It's dangerous to join in a yield_per query if there may be more than
  one row per item (for example, if you join from notification to
  service, there'll be multiple result rows containing the same
  service, and if those rows are split across different yield chunks,
  we may experience undefined behaviour; see the sketch below).

These two limitations both come down to there being no guarantee of
one unique row per item.
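
To make the second caveat concrete, this is the join shape to avoid,
sketched against a hypothetical two-table schema (not the real
Notification/Service models):

from sqlalchemy import Column, ForeignKey, Integer, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class Service(Base):
    __tablename__ = "services"
    id = Column(Integer, primary_key=True)


class Notification(Base):
    __tablename__ = "notifications"
    id = Column(Integer, primary_key=True)
    service_id = Column(Integer, ForeignKey("services.id"))


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add(Service(id=1))
session.add_all([Notification(service_id=1) for _ in range(3000)])
session.commit()

# Every one of service 1's notifications produces a result row carrying
# the same Service entity. With yield_per(1000) those duplicate rows
# straddle chunk boundaries, which is exactly where the ORM stops making
# guarantees: treat the rows as read-only and never assume one row per
# entity.
rows = (
    session.query(Notification, Service)
    .join(Service, Notification.service_id == Service.id)
    .yield_per(1000)
)
for notification, service in rows:
    pass  # read-only use only; mutating `service` is the dangerous case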

For more reading:
https://docs.sqlalchemy.org/en/13/orm/query.html?highlight=yield_per#sqlalchemy.orm.query.Query.yield_per
https://www.mail-archive.com/sqlalchemy@googlegroups.com/msg12443.html
Leo Hemsted
2020-10-23 20:06:24 +01:00
parent ed182c2a22
commit 3bc3ed88b3
2 changed files with 30 additions and 9 deletions

@@ -717,9 +717,20 @@ def notifications_not_yet_sent(should_be_sending_after_seconds, notification_typ
     return notifications


-def dao_get_letters_to_be_printed(print_run_deadline, postage):
+def dao_get_letters_to_be_printed(print_run_deadline, postage, query_limit=10000):
     """
-    Return all letters created before the print run deadline that have not yet been sent
+    Return all letters created before the print run deadline that have not yet been sent. This yields in batches
+    of 10k to prevent the query taking too long and eating up too much memory. As each 10k batch is yielded, the
+    get_key_and_size_of_letters_to_be_sent_to_print function will go and fetch the s3 data and then start sending
+    off tasks to the notify-ftp app to send them.
+
+    CAUTION! Modify this query with caution. Modifying filters etc. is fine, but if we join onto another table,
+    then there may be undefined behaviour. Essentially we need each ORM object returned for each row to be
+    unique, and we should avoid modifying the state of returned objects.
+
+    For more reading:
+    https://docs.sqlalchemy.org/en/13/orm/query.html?highlight=yield_per#sqlalchemy.orm.query.Query.yield_per
+    https://www.mail-archive.com/sqlalchemy@googlegroups.com/msg12443.html
     """
     notifications = db.session.query(
         Notification.id,
@@ -738,7 +749,7 @@ def dao_get_letters_to_be_printed(print_run_deadline, postage):
     ).order_by(
         Notification.service_id,
         Notification.created_at
-    ).limit(50000)
+    ).yield_per(query_limit)

     return notifications