Improve and clarify large task error handling

Previously we were catching one type of exception if something went
wrong adding a notification to the queue for high volume services.
In reality there are two types of exception so this adds a second
handler to cover both.

For context, this is code we changed experimentally as part of the
upgrade to Celery 5 [1]. At the time we didn't check how the new
exception compared to the old one. It turns out they behaved the
same and we were always vulnerable to the scenario now covered by
the second exception, where the behaviour has changed in Celery 5 -
testing with a large task invocation gives...

Before (Celery 3, large-ish task):

    'process_job.apply_async(["a" * 200000])'...

    boto.exception.SQSError: SQSError: 400 Bad Request
    <?xml version="1.0"?><ErrorResponse xmlns="http://queue.amazonaws.com/doc/2012-11-05/"><Error><Type>Sender</Type><Code>InvalidParameterValue</Code><Message>One or more parameters are invalid. Reason: Message must be shorter than 262144 bytes.</Message><Detail/></Error><RequestId>96162552-cd96-5a14-b3a5-7f503300a662</RequestId></ErrorResponse>

Before (Celery 3, very large task):

    <hangs forever>

After (Celery 5, large-ish task):

    botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the SendMessage operation: One or more parameters are invalid. Reason: Message must be shorter than 262144 bytes.

After (Celery 5, very large task):

    botocore.parsers.ResponseParserError: Unable to parse response (syntax error: line 1, column 0), invalid XML received. Further retries may succeed:
    b'HTTP content length exceeded 1662976 bytes.'

[1]: 29c92a9e54
This commit is contained in:
Ben Thorner
2021-11-08 10:38:53 +00:00
parent 98b6c1d67d
commit 1872854a4e
2 changed files with 14 additions and 5 deletions

View File

@@ -232,9 +232,10 @@ def process_sms_or_email_notification(
reply_to_text=reply_to_text
)
return resp
except botocore.exceptions.ClientError:
# if SQS cannot put the task on the queue, it's probably because the notification body was too long and it
# went over SQS's 256kb message limit. If so, we
except (botocore.exceptions.ClientError, botocore.parsers.ResponseParserError):
# If SQS cannot put the task on the queue, it's probably because the notification body was too long and it
# went over SQS's 256kb message limit. If the body is very large, it may exceed the HTTP max content length;
# the exception we get here isn't handled correctly by botocore - we get a ResponseParserError instead.
current_app.logger.info(
f'Notification {notification_id} failed to save to high volume queue. Using normal flow instead'
)