Gracefully handle API failures when LB is not available
Overview
This document addresses the scenario in which a vendor calls one of our Lambda endpoints but LB is not available to process the xAPI payload. This can happen during monthly security patching, during non-routine maintenance windows, or because of a service disruption.
Regardless of the reason, we would like our Lambda layer to be more resilient and to hold onto those messages until LB is available to process them.
Design concept
At a high level:
If the xAPI call fails, re-queue the SQS message
Backoff strategy: if n failures occur within a short period, back off to reduce load on the server while it’s down, then automatically reset the backoff after a successful request
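The two points above could be sketched roughly as follows. This is a minimal illustration in Python, not the intended implementation: the names BackoffTracker, process_message, post_to_lb, and requeue are hypothetical, as are the threshold ("n"), the window length, and the delay values. The 900-second cap is chosen to match the maximum delay SQS accepts for a delayed message. The sketch also keeps its state in memory, which would not be shared across Lambda execution environments, so a real version would likely need to store the failure count somewhere shared.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BackoffTracker:
    """Counts recent failures and derives a retry delay (hypothetical helper)."""
    threshold: int = 3             # assumed value of "n"
    window_seconds: float = 60.0   # assumed length of the "short period"
    base_delay: float = 5.0        # assumed first backoff delay, in seconds
    max_delay: float = 900.0       # cap; SQS allows at most 900 s of message delay
    clock: Callable[[], float] = time.monotonic
    _failures: list = field(default_factory=list, init=False)
    _level: int = field(default=0, init=False)

    def record_failure(self) -> None:
        now = self.clock()
        # Keep only failures that fall inside the sliding window.
        self._failures = [t for t in self._failures
                          if now - t <= self.window_seconds]
        self._failures.append(now)
        # Once n failures are seen in the window, each further failure
        # raises the backoff level.
        if len(self._failures) >= self.threshold:
            self._level += 1

    def record_success(self) -> None:
        # A successful request resets the backoff entirely.
        self._failures.clear()
        self._level = 0

    def delay(self) -> float:
        """Seconds to wait before the next attempt (0 when not backing off)."""
        if self._level == 0:
            return 0.0
        return min(self.base_delay * 2 ** (self._level - 1), self.max_delay)


def process_message(body, post_to_lb, requeue, tracker) -> bool:
    """Try to deliver one message; on failure, re-queue it with a delay.

    post_to_lb(body) and requeue(body, delay_seconds) are injected callables
    (hypothetical), standing in for the xAPI call and the SQS send.
    """
    try:
        post_to_lb(body)
    except Exception:
        # Any error is treated as "LB unavailable" in this sketch.
        tracker.record_failure()
        requeue(body, tracker.delay())
        return False
    tracker.record_success()
    return True
```

With the defaults above, the first two failures in a window are re-queued with no delay, the third with 5 seconds, the fourth with 10 seconds, and so on up to the cap. Injecting post_to_lb and requeue keeps the retry logic testable without a live queue or a live LB.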