Gracefully handle API failures when LB is not available

Overview

This document addresses the scenario in which a vendor calls one of our Lambda endpoints but LB is not available to process the xAPI payload. This can happen during monthly security patching, during non-routine maintenance windows, or because of a service disruption.

Regardless of the reason, we would like our Lambda layer to be more resilient and to hold onto those messages until LB is available to process them.

Design concept

At a high level:

  • If the xAPI call fails, re-queue the SQS message

  • Backoff strategy: if n failures occur within a short period, back off to reduce load on the server while it is down, then automatically reset the backoff after a successful request
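
The two points above could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: the class name BackoffTracker, the function process_message, the callable send_to_lb, and all default values (3 failures in 60 seconds, 5-second base delay, 900-second cap) are assumptions made for the example. It also assumes the failure count lives in memory, whereas a real Lambda deployment would likely need shared storage, because state is not guaranteed to survive between invocations.

```python
from collections import deque


class BackoffTracker:
    """Tracks recent failures and suggests a retry delay in seconds."""

    def __init__(self, threshold=3, window_seconds=60.0,
                 base_delay=5.0, max_delay=900.0):
        self.threshold = threshold            # the "n" failures from the design
        self.window_seconds = window_seconds  # the "short period"
        self.base_delay = base_delay
        self.max_delay = max_delay
        self._failures = deque()  # timestamps of recent failures
        self._level = 0           # 0 means no backoff is active

    def record_failure(self, now):
        self._failures.append(now)
        # Drop failures that have fallen out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        # Once the threshold is reached, each further failure raises the level.
        if len(self._failures) >= self.threshold:
            self._level += 1

    def record_success(self):
        # A successful request resets the backoff automatically.
        self._failures.clear()
        self._level = 0

    def delay(self):
        if self._level == 0:
            return 0.0
        # Exponential growth, capped at max_delay.
        return min(self.base_delay * 2 ** (self._level - 1), self.max_delay)


def process_message(payload, send_to_lb, tracker, now):
    """Try to deliver one payload.

    Returns (done, delay): done is False when the message should be
    re-queued, and delay is how long to wait before it is retried.
    """
    try:
        send_to_lb(payload)  # hypothetical callable that posts to LB
    except Exception:
        # Any failure is treated as "LB unavailable" in this sketch.
        tracker.record_failure(now)
        return False, tracker.delay()
    tracker.record_success()
    return True, 0.0
```

When process_message returns done as False, the caller would leave the message on the queue instead of deleting it. One possible way to apply the returned delay is to set the message's visibility timeout through the SQS ChangeMessageVisibility action, so the message only becomes visible again after the backoff period. Whether that fits our queue configuration is still to be confirmed.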